Project 01

AI Doc2XML
Dual-Agent System

Containerised · Open Source

An agentic AI system that automates extraction of structured data from pharmaceutical regulatory documents and converts it into XML format. Built using LangChain to orchestrate two specialised agents working in sequence — directly inspired by document processing challenges observed at GSK.

Supports both local Ollama models for privacy-sensitive environments and Azure OpenAI for production-grade accuracy. Containerised with Docker and deployed live on Azure Container Apps via Azure Container Registry.

Manual extraction of structured data from pharmaceutical regulatory certificates is slow, error-prone, and does not scale. Using two specialised agents rather than one general-purpose agent improves accuracy and auditability.

Python · LangChain · Multi-Agent Architecture · Azure OpenAI · Ollama (Local LLM) · Docker · Azure Container Apps · Azure Container Registry · Gradio
Architecture — Agent Flow
01 — Document uploaded via Gradio UI
02 — Extractor Agent: reads document, identifies fields, extracts structured data
03 — Reviewer Agent: validates extraction, flags gaps or anomalies
04 — XML output generated from validated structured data
05 — User downloads XML file
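A minimal sketch of this flow, assuming LangChain's LCEL chaining; ChatOpenAI stands in for the Azure OpenAI / Ollama backends, and the prompts, model name, and field names are illustrative rather than the production configuration:

```python
# Sketch of the extractor -> reviewer hand-off. ChatOpenAI stands in for the
# Azure OpenAI / Ollama backends; prompts and field names are illustrative.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

extractor = (
    ChatPromptTemplate.from_template(
        "Extract product name, batch number and expiry date from this "
        "regulatory certificate as JSON:\n\n{document}"
    )
    | llm
    | StrOutputParser()
)

reviewer = (
    ChatPromptTemplate.from_template(
        "Review this extraction against the source document. Flag missing or "
        "anomalous fields, then return corrected JSON.\n\n"
        "Document:\n{document}\n\nExtraction:\n{extraction}"
    )
    | llm
    | StrOutputParser()
)

def run_pipeline(document: str) -> str:
    """Extract first, then let the reviewer validate before XML generation."""
    draft = extractor.invoke({"document": document})
    return reviewer.invoke({"document": document, "extraction": draft})
```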
Key Technical Decisions
  • Two agents not one: A reviewer agent catches extraction errors before output — more reliable for complex documents.
  • Local + Cloud LLM: Ollama for environments where data cannot leave premises; Azure OpenAI for production accuracy.
  • Docker containerisation: Solved the cross-environment Python dependency conflicts that prevented direct Azure deployment.
  • Azure Container Apps: Serverless scaling — zero cost when idle, scales on demand without managing infrastructure.
What I Learned
  • Dependency management is the biggest practical barrier to Python app deployment — Docker eliminates it
  • Multi-agent systems need clear boundaries — agents that try to do too much become unreliable
  • Environment variable management via Azure secrets is essential for API key security in production
Project 02

GxP Quality Intelligence
NLP Classifier + XAI

GitHub · Notebook

An end-to-end NLP classification system for pharmaceutical deviation reports. Classifies each deviation as critical, major, or minor — a task currently done manually by quality teams. Includes an Explainable AI layer using SHAP to make model decisions auditable for GxP regulatory acceptance.

In a GxP environment, a model that cannot explain its decisions will not pass regulatory review. SHAP values show exactly which words drove the classification — making the model auditable by quality and regulatory teams.

Manual deviation classification is subjective, slow, and inconsistent across reviewers. An automated system with explainability reduces classification time and provides a consistent, auditable decision trail.

Python · Scikit-learn · Logistic Regression · TF-IDF · FAISS · SHAP · Explainable AI · NLP · Pandas · Matplotlib
Pipeline Architecture
01 — Raw deviation text input
02 — Text preprocessing: tokenisation, stopwords, lemmatisation
03 — TF-IDF feature extraction
04 — Logistic Regression classification
05 — FAISS similarity search for related historical deviations
06 — SHAP explainability: word-level decision attribution
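A minimal sketch of the classification and explainability core, shown as a binary critical / non-critical split for brevity (the project classifies three severities); the report texts and labels are invented examples:

```python
# Classification + explainability core. Binary critical / non-critical split
# for brevity; report texts and labels are invented examples.
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = [
    "temperature excursion in cold storage during shipment",
    "sterility failure detected on the filling line",
    "minor typo on secondary packaging label",
    "delayed signature on batch record page",
]
labels = [1, 1, 0, 0]  # 1 = critical, 0 = non-critical (illustrative)

vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(reports).toarray()
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Word-level attribution: which TF-IDF terms pushed the decision, and how far.
explainer = shap.LinearExplainer(clf, X)
shap_values = explainer.shap_values(X)
terms = vectoriser.get_feature_names_out()
```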
Key Technical Decisions
  • TF-IDF over embeddings: With a small dataset, TF-IDF generalises better than large embedding models that would overfit.
  • Logistic Regression over XGBoost: More interpretable and works well with TF-IDF sparse features — explainability was the priority.
  • SHAP for XAI: Shows which specific words drove each classification — essential for GxP audit acceptance.
  • FAISS retrieval: Surfaces similar historical deviations to give reviewers context alongside the classification.
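The FAISS retrieval step might look like the sketch below; random vectors stand in for the real vectorised reports so the snippet runs on its own:

```python
# FAISS lookup for similar historical deviations. Random vectors stand in for
# the real vectorised reports so this runs standalone.
import faiss
import numpy as np

vectors = np.random.rand(100, 256).astype(np.float32)  # 100 historical deviations
faiss.normalize_L2(vectors)  # unit vectors: inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = vectors[:1]                          # the "new" deviation to contextualise
scores, neighbours = index.search(query, 3)  # top-3 most similar historical cases
```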
Why Explainability Matters in GxP
  • GxP regulations require documented evidence for decisions affecting product quality
  • A black-box model cannot be validated under GAMP 5 principles
  • SHAP word-level attribution provides the audit trail regulators require
  • Human-in-the-loop design — model assists, quality expert approves
Project 03

Customer Churn Prediction
End-to-End ML

Live on HuggingFace · Open Source

A complete end-to-end machine learning project predicting customer churn probability. The core challenge — and the most important technical decision — is handling the severe class imbalance in churn datasets where non-churners vastly outnumber churners.

Most naive approaches achieve high accuracy by predicting everything as non-churn. This project evaluates on F1 score and AUC-ROC on the minority class — the correct methodology for imbalanced classification problems.

Python · Scikit-learn · Pandas · Streamlit · FastAPI · HuggingFace Spaces · SMOTE · EDA · Git
ML Pipeline
01 — EDA: distributions, correlations, missing values, class balance
02 — Feature engineering: encoding, scaling, feature selection
03 — Class imbalance handling: SMOTE oversampling
04 — Model training: multiple classifiers compared
05 — Evaluation: F1 and AUC-ROC on the minority class
06 — FastAPI serving + Streamlit UI + HuggingFace deployment
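A minimal sketch of steps 03 to 05 on a synthetic imbalanced dataset; RandomForest is a stand-in since the project compares several classifiers. The key detail is that SMOTE touches only the training split:

```python
# Steps 03-05 on a synthetic imbalanced dataset (~95/5 split). RandomForest is
# a stand-in; the project compares multiple classifiers.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# SMOTE on the training split only - oversampling before the split would leak
# synthetic minority samples into the evaluation.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
proba = model.predict_proba(X_test)[:, 1]

print("F1 (churn class):", f1_score(y_test, model.predict(X_test), pos_label=1))
print("AUC-ROC:", roc_auc_score(y_test, proba))
```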
Key Technical Decisions
  • SMOTE not class weights: Minority class too small for weights alone — SMOTE generates synthetic minority samples for better balance.
  • F1 and AUC-ROC not accuracy: A model predicting all non-churn achieves 95% accuracy but is completely useless.
  • FastAPI backend: Separates model serving from UI — API can be consumed independently of the Streamlit frontend.
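A stripped-down sketch of that serving split; the model path and feature set are placeholders, not the deployed schema:

```python
# Minimal FastAPI serving layer, decoupled from the Streamlit UI.
# Model path and feature set are placeholders.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("churn_model.joblib")  # placeholder artifact

class Customer(BaseModel):
    tenure: float
    monthly_charges: float
    total_charges: float

@app.post("/predict")
def predict(customer: Customer) -> dict:
    row = [[customer.tenure, customer.monthly_charges, customer.total_charges]]
    return {"churn_probability": float(model.predict_proba(row)[0][1])}
```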
Project 04

Jira Sprint Reporter
Python Automation

Deployed at GSK · Open Source

A Python automation tool connecting to the Jira API, extracting sprint data, performing analytics, generating visual reports, and delivering them automatically via email. Built to eliminate manual sprint reporting for a 25+ member cross-functional team at GSK.

Python · Jira REST API · Pandas · Matplotlib · Automation · SMTP Email · Git
How It Works
01 — Connects to the Jira REST API with authentication
02 — Extracts sprint issues, statuses, assignees, story points
03 — Pandas transformation: velocity, completion rate, blockers
04 — Matplotlib visualisations: burndown, status breakdown, velocity trend
05 — HTML report generated and delivered via automated email
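A sketch of the extraction step against Jira's standard search endpoint; the base URL, JQL, and credentials are placeholders, and story points (an instance-specific custom field) are omitted:

```python
# Extraction step against Jira's search endpoint. Base URL, JQL and credentials
# are placeholders; story points live in an instance-specific custom field.
import os
import pandas as pd
import requests

BASE = "https://your-domain.atlassian.net"
auth = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

resp = requests.get(
    f"{BASE}/rest/api/2/search",
    params={"jql": "sprint in openSprints()", "maxResults": 100},
    auth=auth,
    timeout=30,
)
resp.raise_for_status()

df = pd.DataFrame(
    [
        {
            "key": issue["key"],
            "status": issue["fields"]["status"]["name"],
            "assignee": (issue["fields"]["assignee"] or {}).get("displayName"),
        }
        for issue in resp.json()["issues"]
    ]
)
completion_rate = (df["status"] == "Done").mean()
```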
Business Impact
  • Eliminated hours of manual data extraction every sprint
  • Consistent report format across all sprints — no human variation
  • Stakeholders receive reports automatically without chasing anyone
  • Deployed and used live at GSK for 25+ member team
Research Project 05 — In Progress

VATSA
Unified Five-Modality AI Architecture

V-Module Complete · Preprint Published · Open Source

A unified five-modality AI architecture for human-level perception and action — Video, Audio, Text, Sensory, Action. Each modality encoder projects into a shared 512-dimensional latent space for cross-modal fusion. Long-term mission: a safe embodied AI that can operate alongside humans.

The Visual Module (V-Module) is complete. Audio, Text, Sensory, and Action modules are in the roadmap across the DBA research timeline (2025–2028). Architecture published as a preprint on Zenodo, April 2026.

EfficientNet-B0 trained with three-stage transfer learning on CIFAR-10: 79% with the backbone frozen → 94% after fine-tuning → 96.31% after a deep unfreeze (4 layers, 40 epochs). Integrated with YOLOv8 for real-time object detection at 22 FPS on a live stream, and generates 512-dim embeddings at 1,336 embeddings/sec at batch size 16. GPU footprint: 63.7 MB.
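A sketch of how the staged unfreezing might be wired in PyTorch; the stage boundaries, hyperparameters, and the 512-dim projection head are illustrative, not the trained recipe:

```python
# Staged unfreezing on EfficientNet-B0. Stage boundaries, hyperparameters and
# the 512-dim projection head are illustrative, not the trained recipe.
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

model = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 10)  # CIFAR-10

def set_stage(stage: int) -> None:
    """Stage 1: head only. Stage 2: last blocks. Stage 3: deep unfreeze."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():
        p.requires_grad = True
    n_blocks = {1: 0, 2: 2, 3: 4}[stage]  # "4 layers" at the deep-unfreeze stage
    if n_blocks:
        for p in model.features[-n_blocks:].parameters():
            p.requires_grad = True

set_stage(1)  # retrain at each stage: set_stage(2), then set_stage(3)

# 512-dim embeddings via a projection head on the 1280-dim pooled features.
embedder = nn.Sequential(
    model.features, model.avgpool, nn.Flatten(), nn.Linear(1280, 512)
)
```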

SAMOS — Safety-Aware Multi-Output Selector

A proposed novel output routing mechanism for safe parallel multi-modal output generation in physically embodied AI systems. Instead of a softmax that forces one winner, SAMOS uses independent sigmoid activations per output head — allowing Text, Audio, Action, Feeling, and Video outputs to activate simultaneously when appropriate.

Three core components: (1) Learnable per-modality thresholds — not fixed at 0.5, learned from safety-weighted loss; (2) Asymmetric safety-weighted loss function — false activation of the Action head (physical harm risk) is penalised far more heavily than false activation of the Feeling head; (3) Uncertainty-aware gating — when uncertain, the system defaults to the safer option per modality.
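An illustrative reading of those three components in PyTorch; the penalty weights, threshold initialisation, and gating rule are hypothetical, since SAMOS is still a proposal:

```python
# One reading of the three SAMOS components. Penalty weights, threshold
# initialisation and the gating rule are hypothetical - SAMOS is a proposal.
import torch
import torch.nn as nn

HEADS = ["text", "audio", "action", "feeling", "video"]
# Asymmetric cost of *false activation* per head; Action dominates (harm risk).
FALSE_ACT_WEIGHT = torch.tensor([1.0, 1.0, 10.0, 0.5, 1.0])

class SAMOS(nn.Module):
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.gates = nn.Linear(latent_dim, len(HEADS))
        # Learnable per-modality thresholds, initialised at 0.5 but trained.
        self.thresholds = nn.Parameter(torch.full((len(HEADS),), 0.5))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Independent sigmoids, not softmax: heads can fire simultaneously.
        return torch.sigmoid(self.gates(z))

    def activate(self, probs: torch.Tensor) -> torch.Tensor:
        # Hard gate: a probability below the learned threshold stays off,
        # the safer default for physically consequential heads.
        return probs > self.thresholds

def safety_weighted_loss(probs: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BCE where false activations are penalised per modality."""
    bce = nn.functional.binary_cross_entropy(probs, target, reduction="none")
    weights = torch.where(target == 0, FALSE_ACT_WEIGHT, torch.ones_like(FALSE_ACT_WEIGHT))
    return (weights * bce).mean()
```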

The ethical principle — the robot must never harm a human, even accidentally, even when uncertain — is encoded directly into the mathematical architecture, not bolted on as a filter afterward.

PyTorch · EfficientNet-B0 · Transfer Learning · YOLOv8 · Computer Vision · Multimodal AI · Embeddings · Mixed Precision · SAMOS · VLA Models
V-Module Performance
  • 96.31% — CIFAR-10 Accuracy
  • 22 FPS — Real-time YOLOv8 Stream
  • 1,336 — Embeddings/sec @ Batch 16
  • 63.7 MB — GPU Footprint
Full VATSA Pipeline (Target Architecture)
IN — Video · Audio · Text · Sensory inputs
01 — Five modality encoders → 512-dim embeddings each
02 — Cross-modal fusion transformer
03 — Unified situational representation
04 — SAMOS: Safety-Aware Multi-Output Selector
OUT — Text · Audio · Action · Feeling · Video (parallel, async)
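One plausible shape for the fusion stage, sketched under the stated assumptions (five encoders emitting into a shared 512-dim space); none of this reflects a trained implementation:

```python
# One plausible shape for the fusion stage: five per-modality tokens in a
# shared 512-dim space, fused by a small transformer. Purely illustrative.
import torch
import torch.nn as nn

class VATSAFusion(nn.Module):
    def __init__(self, dim: int = 512, n_modalities: int = 5):
        super().__init__()
        self.modality_emb = nn.Parameter(torch.randn(n_modalities, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 5, 512), one embedding per modality encoder
        fused = self.fusion(tokens + self.modality_emb)
        return fused.mean(dim=1)  # unified situational representation -> SAMOS

z = VATSAFusion()(torch.randn(2, 5, 512))
```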
Research Roadmap
  • 2025–2026 (Year 1): Complete Audio and Text encoders. Medical RAG assistant as an applied GenAI project.
  • 2026–2027 (Year 2): Sensory encoder. Cross-modal fusion transformer. SAMOS prototype.
  • 2027–2028 (Year 3): Action module. VLA integration. DBA dissertation on VATSA and SAMOS in safety-critical environments.
Open Research Problems
  • Can SAMOS thresholds adapt dynamically based on environment risk level?
  • Can output heads coordinate naturally through the shared latent space without explicit synchronisation?
  • How should the feeling output head be formally defined — affective state vector, physiological simulation, or social signal generator?
// Coming Soon
Project 06 — In Progress
Medical Assistant RAG

A RAG-based medical assistant built on open-source LLMs (Mistral / Llama via Ollama) and public medical datasets (PubMedQA / MedQuAD). Applying embeddings and transformer knowledge from the PGP programme to build and deploy a live clinical Q&A assistant on HuggingFace Spaces. Framed as a pharmaceutical/clinical document assistant — directly connecting GSK regulated AI experience to hands-on GenAI engineering.
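A minimal RAG sketch under assumed tooling: sentence-transformers for embeddings, FAISS for retrieval, and Mistral served locally through the ollama client; the documents and model names are placeholders:

```python
# RAG sketch under assumed tooling: sentence-transformers embeddings, FAISS
# retrieval, and Mistral via the local ollama client. Documents are invented.
import faiss
import numpy as np
import ollama
from sentence_transformers import SentenceTransformer

docs = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "ACE inhibitors are commonly used to manage hypertension.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vecs = embedder.encode(docs, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

def answer(question: str) -> str:
    q = embedder.encode([question], normalize_embeddings=True).astype(np.float32)
    _, idx = index.search(q, 1)
    context = docs[idx[0][0]]
    reply = ollama.chat(model="mistral", messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQ: {question}",
    }])
    return reply["message"]["content"]
```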

LangChain · RAG · Mistral / Llama · PubMedQA · FAISS · HuggingFace
Project 07 — Planned
LangGraph Multi-Agent Compliance System

Stateful multi-agent system using LangGraph — three specialised agents (document reader, classifier, compliance reporter) with conditional routing and state management. Direct application of agentic AI to pharmaceutical regulatory compliance workflows. Extension of the AI Doc2XML architecture into a more complex stateful pipeline.
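A sketch of how the planned three-agent graph could be expressed in LangGraph; the node logic is stubbed and the routing condition is invented:

```python
# Planned three-agent graph with conditional routing. Node logic is stubbed;
# state keys and the routing condition are placeholders.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ComplianceState(TypedDict):
    document: str
    category: str
    report: str

def read_document(state: ComplianceState) -> dict:
    return {"document": state["document"].strip()}

def classify(state: ComplianceState) -> dict:
    is_gxp = "deviation" in state["document"].lower()
    return {"category": "gxp" if is_gxp else "non_gxp"}

def write_report(state: ComplianceState) -> dict:
    return {"report": f"Compliance report for {state['category']} document."}

graph = StateGraph(ComplianceState)
graph.add_node("reader", read_document)
graph.add_node("classifier", classify)
graph.add_node("reporter", write_report)
graph.set_entry_point("reader")
graph.add_edge("reader", "classifier")
# Conditional routing: only GxP-relevant documents reach the reporter.
graph.add_conditional_edges(
    "classifier",
    lambda s: s["category"],
    {"gxp": "reporter", "non_gxp": END},
)
graph.add_edge("reporter", END)
app = graph.compile()

result = app.invoke(
    {"document": "Deviation: temperature excursion", "category": "", "report": ""}
)
```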

LangGraph · Multi-Agent · Azure OpenAI · State Management