Project 01 · Standard Chartered GBS · Jul–Dec 2025
LMA Clause
Identification Tool
A production NLP pipeline on Dataiku DSS that reads 100+ page Loan Market Association agreements and automatically flags every legally significant clause — sanctions, indemnities, governing law — that compliance auditors must review. Ships with confidence calibration, semantic search, and an auditor-friendly threshold tuning UI.
NLP · Fine-Tuning · Production
The Problem
Auditors at a global bank were manually reading 100+ page loan agreements to find ~30 specific compliance-critical clauses. One missed sanctions clause = regulatory failure. The work was slow, error-prone, and didn't scale.
The Approach
Train a small, fast language model (DistilBERT) to recognise each clause type, then add a second-stage semantic search layer that confirms the match against gold-standard reference clauses. Filter low-confidence predictions before showing to humans.
The Impact
90% reduction in audit time. 98.6% recall on rare clauses (the ones that matter most). 95%+ of false positives filtered before reaching auditors. Legal team can review 10× more agreements in the same window.
90%
Audit time reduced
98.6%
Recall on rare clauses
95%+
False positives filtered
0.93
Weighted F1 score
250
Token window size
System Architecture — How the pipeline thinks
End-to-end clause identification pipeline
A 100+ page PDF enters on the left. By the time it exits on the right, every legally significant clause has been classified, scored for confidence, and matched against a gold-standard reference — ready for auditor review.
INPUT → PREP → ML CORE → VERIFY → OUTPUT
  • 📄 LMA Agreement: 100+ page PDF
  • ✂️ ETL Chunking: 250-token window · 50-token overlap
  • 🧠 DistilBERT Classifier: fine-tuned on labelled clauses. WeightedTrainer with a 20× class boost handles the 90%+ of text that is irrelevant.
  • 🎯 Confidence Filter: minimum-correct-confidence threshold drops uncertain predictions, filtering 95%+ of false positives.
  • 🔍 SBERT Semantic Search: fine-tuned with MultipleNegativesRankingLoss; cosine similarity against the gold-standard library catches synonym phrasings.
  • ✅ Auditor Dashboard: clause · type · confidence · page reference, threshold-tunable per auditor.
Threshold calibration feedback loop: auditors recalibrate per loan type.
Step-by-step workflow
01
Ingest the agreement
A 100+ page LMA PDF lands in the Dataiku dataset. Text is extracted page-by-page, preserving section markers so we can later cite exact clause locations.
Dataiku DSS · PyMuPDF
02
Slice into context windows
The full text is sliced into 250-token windows with 50-token overlap. The overlap ensures clauses that span window boundaries aren't cut in half. An earlier 75-token version produced 1.0 F1 — a red flag for data leakage.
HuggingFace tokenizer · sliding window
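A minimal sketch of the windowing logic, using the public distilbert-base-uncased tokenizer as a stand-in for the project's internal fine-tuned checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def chunk(text: str, window: int = 250, overlap: int = 50) -> list[str]:
    """Slide a 250-token window with 50-token overlap, so a clause that
    straddles one boundary appears whole in the neighbouring chunk."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = window - overlap  # 200 fresh tokens per step
    return [tokenizer.decode(ids[i:i + window])
            for i in range(0, max(len(ids) - overlap, 1), stride)]
```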
03
Classify with DistilBERT
A fine-tuned DistilBERT classifies each window into one of N clause types (or "irrelevant"). DistilBERT was chosen for being 40% smaller than BERT while retaining about 97% of its language-understanding performance — critical for batch-processing 100+ page documents at speed.
DistilBERT · WeightedTrainer (20× boost)
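The WeightedTrainer is a small Trainer subclass; a sketch, assuming class 0 is "irrelevant" (the exact weight vector was tuned internally):

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Cross-entropy with per-class weights: rare clause classes get a
    20x boost so the model can't win by always predicting 'irrelevant'."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([1.0, 20.0, 20.0, ...])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```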
04
Calibrate confidence
For each prediction, we check the model's softmax confidence. The "minimum correct confidence" threshold is calibrated per clause type using validation data. Anything below the threshold is dropped before reaching humans.
Custom thresholding · KDE rejected
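A sketch of the minimum-correct-confidence idea: per clause type, the threshold is the lowest softmax confidence the model assigned to a correct validation prediction. Variable names here are illustrative:

```python
import numpy as np

def calibrate(val_probs, val_labels):
    """Per clause type, take the lowest confidence the model assigned to a
    CORRECT validation prediction; anything below it is dropped."""
    preds = val_probs.argmax(axis=1)
    conf = val_probs.max(axis=1)
    thresholds = {}
    for c in np.unique(preds):
        correct = (preds == c) & (val_labels == c)
        thresholds[int(c)] = float(conf[correct].min()) if correct.any() else 1.0
    return thresholds

def keep(probs, thresholds):
    """True if a single prediction clears its per-class threshold."""
    pred, conf = int(probs.argmax()), float(probs.max())
    return conf >= thresholds.get(pred, 1.0)
```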
05
Verify with SBERT
Surviving predictions are embedded with a fine-tuned SBERT and cosine-matched against a curated library of gold-standard reference clauses. This catches synonym rephrasings the classifier alone might rank low.
SBERT · MultipleNegativesRankingLoss
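A compressed sketch of both halves — fine-tuning with MultipleNegativesRankingLoss (other pairs in the batch act as in-batch negatives) and cosine-matching against the gold library. The base checkpoint and example pair are stand-ins:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned SBERT

# Fine-tune on anchor-positive pairs: a detected clause and its gold reference.
pairs = [("The Borrower shall comply with all Sanctions...", "gold sanctions clause text")]
train = DataLoader([InputExample(texts=[a, p]) for a, p in pairs],
                   shuffle=True, batch_size=16)
model.fit(train_objectives=[(train, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=10)

# Verify a surviving prediction against the gold-standard library.
gold = ["gold sanctions clause text", "gold governing-law clause text"]
gold_emb = model.encode(gold, convert_to_tensor=True)
sims = util.cos_sim(model.encode("candidate clause text", convert_to_tensor=True), gold_emb)
best_score, best_match = float(sims.max()), gold[int(sims.argmax())]
```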
06
Surface to auditors
Output is rendered in a Dataiku web app: each detected clause shown with its type, confidence, page reference, and the matching gold-standard text. A separate UI lets non-technical auditors recalibrate thresholds without touching code.
Dataiku Web App · threshold UI
Technical decisions
  • Why DistilBERT: Encoder-only, optimised for classification. 40% smaller than BERT, 97% of its accuracy — letting us batch-process documents fast.
  • Why 250-token windows: 75-token windows produced a suspicious 1.0 F1, indicating data leakage. 250 tokens preserved context without bleeding.
  • WeightedTrainer: 90%+ of document text is irrelevant. A 20× class weight boost prevents the model from defaulting to "irrelevant".
  • Minimum-correct-confidence: KDE thresholding rejected (no clean incorrect-probability curve). Empirical, data-driven cut-off used instead.
  • SBERT fine-tuning: MultipleNegativesRankingLoss on legal anchor-positive pairs — model learns synonym clauses semantically, not lexically.
Key results
  • Train accuracy: 0.96 · Val accuracy: 0.93 · Weighted F1: 0.93
  • Recall on rare clauses: 98.6% — critical because a missed sanctions clause is a compliance failure
  • False positive reduction: 95%+ filtered — auditors only review high-confidence, relevant predictions
  • Deployment: Dataiku DSS pipeline + threshold-calibration web app for non-technical auditors
  • Business impact: Legal team reviews 10× more agreements in the same time window
Key Learning
The single most important lesson: a perfect F1 score is a red flag, not a success. When 75-token windows produced 1.0 F1, I suspected data leakage — clause text was appearing in both training and validation chunks. Reducing to 250-token windows with careful document-level splits fixed the leakage and produced honest, deployable metrics. Trusting suspicious results would have shipped a broken model into a regulated environment.
⊕ View on GitHub → github.com/Harshaaalll/lma-clause-identifier
Project 02 · Standard Chartered GBS · Aug–Nov 2025
AutoTransFlow:
Multilingual Document AI
A layout-preserving multilingual PDF translation system. Translates financial and legal documents into 200 languages while keeping the exact spatial structure — headings, tables, columns, signatures — intact. Runs entirely on local models with zero external API calls.
Computer Vision · Translation · Document AI
The Problem
Legal documents must be readable across 200+ jurisdictions. Google Translate destroys layout (turns tables into paragraphs) — and bank policy forbids sending sensitive PDFs to external APIs. We needed a translator that runs internally and preserves every visual structure.
The Approach
Treat each PDF page as an image first, text second. A YOLO-based layout model finds every text block's bounding box; then NLLB-200 translates the text inside each box; finally each translation is re-rendered at its original coordinates.
The Impact
200 languages supported · 0 external API calls · 100% layout fidelity. The translated PDF looks identical to the original — same tables, same columns, same signatures — just in the target language.
200
Languages supported
0
External API calls
100%
Layout preserved
0.25
Detection threshold
System Architecture — Vision-first translation
Layout-aware PDF translation pipeline
The trick: solve a translation problem with a computer vision model first. By detecting the geometry of every text block before translating, we know where each translated word must go on the page.
  • 📑 Source PDF: financial/legal document, English origin ("what it says")
  • 🖼️ Render Page: PDF → image, normalised to [0,1] ("what it looks like")
  • 📐 doc-layout-yolo: detects every text block and outputs (x, y, w, h, type) at confidence > 0.25; classes include heading, paragraph, table ("where text lives")
  • 🌍 NLLB-200: 600M params, run locally; forced_bos_token_id selects any of 200 target languages; translates each block ("what it means")
  • 🎯 Re-render: BabelDoc font embedding, translated text drawn at the original coordinates ("reassemble the doc")
The vision-first insight: traditional PDF tools extract text linearly, losing structure. We treat each page as an image, solve the geometry first with computer vision, then translate inside known boxes.
📋 Final PDF: the original document with translated text in an identical layout, appended after each source page.
Step-by-step workflow
01
Duplicate & embed fonts
BabelDoc creates a copy of the source PDF and embeds fonts compatible with the target language (Devanagari, CJK, Arabic, etc.) so translated glyphs render correctly.
BabelDoc · font embedding
02
Render to pixmap
Each page is rasterised to an image and normalised to [0,1] pixel intensity — the input format expected by the layout model.
PyMuPDF · NumPy
03
Detect layout with YOLO
doc-layout-yolo predicts a bounding box and class (heading, paragraph, table, caption) for every text block. Confidence threshold of 0.25 keeps recall high — false detections are cheaper to filter than missing text.
doc-layout-yolo · 0.25 threshold
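A minimal sketch, assuming the ultralytics-style predict interface of the doclayout_yolo package; the checkpoint filename and page image are illustrative:

```python
from doclayout_yolo import YOLOv10

model = YOLOv10("doclayout_yolo_docstructbench.pt")
result = model.predict("page_001.png", conf=0.25)[0]  # low threshold keeps recall high

blocks = [
    {"xyxy": box.xyxy[0].tolist(),        # (x1, y1, x2, y2) on the page image
     "cls": result.names[int(box.cls)],   # heading / paragraph / table / caption
     "conf": float(box.conf)}
    for box in result.boxes
]
```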
04
Translate per block
For each box, the text is extracted and fed to NLLB-200 with forced_bos_token_id set to the target language code (e.g. hin_Deva). The model runs entirely on internal hardware.
NLLB-200 · 600M · local inference
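The NLLB call itself is standard HuggingFace; a runnable sketch with the public facebook/nllb-200-distilled-600M checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def translate(text: str, target: str = "hin_Deva") -> str:
    inputs = tok(text, return_tensors="pt", truncation=True)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids(target),  # steer decoder language
        max_length=512,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```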
05
Re-render at original coords
Translated text is drawn back into each bounding box at its original (x, y, w, h) — preserving the visual structure exactly. Auto font-size adjustment handles language length variance.
PyMuPDF · text-fit logic
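A sketch of the text-fit loop using PyMuPDF's insert_textbox, which returns a negative value when text overflows its rectangle; the real pipeline would pass the language-appropriate font BabelDoc embedded rather than "helv":

```python
import fitz  # PyMuPDF

def draw_in_box(page, rect, text, fontname="helv", max_size=11.0, min_size=5.0):
    """Shrink the font until the translated text fits the original box,
    absorbing length variance between languages."""
    size = max_size
    while size >= min_size:
        leftover = page.insert_textbox(fitz.Rect(rect), text,
                                       fontsize=size, fontname=fontname)
        if leftover >= 0:  # non-negative return: the text fitted
            return size
        size -= 0.5
    return None  # flag the block for manual handling
```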
06
Stitch & deliver
The original page and its translated counterpart are appended in the final PDF, giving auditors a side-by-side reference. A modular backend lets the translation engine be swapped (e.g. Argos Translate).
Final PDF assembly
Why not pdfplumber or PyMuPDF text extraction?
  • Traditional PDF libraries extract text as a linear stream — no understanding of visual structure
  • Cannot differentiate headings from paragraphs, tables from captions
  • Spatial relationships between text blocks are completely lost
  • Output would be a long unformatted text block — legally unusable for financial documents
  • Solution: Treat the page as an image, use computer vision to understand layout first
Why NLLB-200 over Google Translate?
  • Data privacy: Standard Chartered policy prohibits sending documents to external cloud APIs
  • NLLB-200 runs entirely on internal servers — zero data leaves the network
  • forced_bos_token_id: Steers decoder to target language (e.g. eng_Latn, hin_Deva)
  • Covers 200 languages including low-resource languages that Google handles poorly
  • Modular design — translation backend can be swapped (Argos Translate as alternative)
Key Insight
The critical innovation is treating document translation as a computer vision problem first, not a text problem. By using doc-layout-yolo to map the visual structure before extracting text, the system knows exactly where every word lives on the page — not just what it says. This makes it possible to put translated text back in exactly the right position, preserving the legal and structural integrity of the document.
⊕ View on GitHub → github.com/Harshaaalll/autotransflow
Project 03 · DG Liger Consulting · Jun–Jul 2024
RAG Chatbot with
Conversational Memory
Production retrieval-augmented generation chatbot built over proprietary PDF documents. Multi-turn conversation support via LangChain's ConversationalRetrievalChain, with local StableLM Zephyr 3B inference — no external API calls, no data leakage.
RAG · LangChain · LLM
The Problem
A consulting firm had hundreds of proprietary client PDFs. Analysts wasted hours searching through them for specific facts. They couldn't use ChatGPT — client confidentiality forbids cloud APIs.
The Approach
Build a chatbot that retrieves before it generates. Embed all PDFs into a local vector database, retrieve the most relevant chunks for each question, and let a small local LLM produce a grounded answer — with full conversation memory.
The Impact
Analysts query proprietary docs in seconds. Zero data leaves the network. Multi-turn dialogue means follow-up questions like "and what about the fees?" work naturally. Runs on standard hardware via 4-bit quantisation.
500
Chunk size (chars)
3B
StableLM params
Q4_K_M
Quantisation level
384
Embedding dimensions
System Architecture — Two-phase RAG pipeline
Indexing phase (one-time) + Query phase (per question)
A RAG system has two distinct workflows. The top row runs once when documents are ingested. The bottom row runs every time a user asks a question.
PHASE 1 · INDEXING (RUN ONCE PER DOCUMENT)
  • 📁 PDF Documents: proprietary client docs + web pages
  • ✂️ Recursive Splitter: 500 chars · 50 overlap, sentence-boundary aware
  • 🔢 Embeddings: all-MiniLM-L6-v2, 384-dim vectors
  • 🗄️ FAISS Index: in-memory ANN search, persisted to disk, queried at retrieval time
PHASE 2 · QUERY (RUN PER USER QUESTION)
  • 💬 User Question: "What about the fees?" (often ambiguous on its own)
  • 🔄 Question Rewriter: ConversationBufferMemory + chat history → standalone query
  • 🔍 FAISS Retrieve: top-k similar chunks by cosine similarity
  • 🤖 StableLM Zephyr 3B: GGUF Q4_K_M, local, grounded generation
  • 💡 Grounded Answer: source citations included, stored back to memory
Conversation memory loop: every Q&A pair informs the rewriting of the next question.
Step-by-step workflow
01
Load & chunk documents
LangChain document loaders pull in PDFs and web pages. RecursiveCharacterTextSplitter slices them into 500-character chunks with 50-character overlap, respecting sentence boundaries.
LangChain · RecursiveCharacterTextSplitter
02
Embed every chunk
all-MiniLM-L6-v2 turns each chunk into a 384-dim vector. Small, fast, and surprisingly accurate — chosen so embedding hundreds of documents takes minutes not hours on CPU.
sentence-transformers · 384-dim
03
Build FAISS index
Vectors are stored in an in-memory FAISS index for sub-millisecond similarity search. The index is persisted to disk so the system can boot instantly on subsequent runs.
FAISS · IndexFlatL2
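The whole indexing phase fits in a few lines; a sketch with illustrative paths (LangChain import locations vary by release):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFLoader("docs/client_report.pdf").load()            # path illustrative
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50).split_documents(docs)

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")       # 384-dim vectors
index = FAISS.from_documents(chunks, embeddings)
index.save_local("faiss_index")  # persist so later boots skip re-embedding
```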
04
Rewrite follow-ups
When a user asks "and what about the fees?", that question alone has no context. ConversationalRetrievalChain combines the question with chat history and rewrites it into a standalone query the retriever can act on.
ConversationalRetrievalChain · BufferMemory
05
Retrieve top-k chunks
The rewritten question is embedded with the same model, and FAISS returns the top-k most similar chunks from the index. These become the LLM's grounding context.
FAISS similarity_search · top-k
06
Generate grounded answer
StableLM Zephyr 3B (4-bit quantised, GGUF) receives the question + retrieved chunks and produces an answer grounded in those chunks. Source chunks are surfaced as citations.
StableLM Zephyr 3B · llama.cpp
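And the query phase, as a sketch that reuses the index persisted above; the GGUF filename is illustrative:

```python
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.load_local("faiss_index", embeddings,
                         allow_dangerous_deserialization=True)

llm = LlamaCpp(model_path="models/stablelm-zephyr-3b.Q4_K_M.gguf",  # filename illustrative
               n_ctx=4096, temperature=0.1)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=index.as_retriever(search_kwargs={"k": 4}),
    memory=ConversationBufferMemory(memory_key="chat_history",
                                    return_messages=True),
)

# Follow-ups are rewritten into standalone queries before retrieval.
print(chain.invoke({"question": "And what about the fees?"})["answer"])
```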
Why ConversationalRetrievalChain matters
  • Standard RetrievalQA treats each question in isolation
  • Follow-up questions like "What about the fees?" have no context without history
  • ConversationalRetrievalChain rephrases follow-up questions into standalone queries before searching
  • ConversationBufferMemory stores the full conversation history
  • Result: the chatbot maintains coherent multi-turn dialogue across a session
Why StableLM Zephyr 3B?
  • Local inference: No API costs, no latency from external calls, no data privacy risk
  • Q4_K_M quantisation: 4-bit weights reduce model from ~6GB to ~2GB — fits on standard hardware
  • GGUF format: Optimised for CPU inference via llama.cpp
  • Instruction-tuned: Zephyr variant fine-tuned to follow instructions and stay grounded
  • Trade-off acknowledged: smaller than GPT-4, but sufficient for grounded Q&A over retrieved context
Key Insight
The most important design decision was local inference over API calls. By running StableLM Zephyr locally in GGUF format, the system has no ongoing API costs, no latency from network calls, and no risk of sending proprietary client documents to external servers. For a consulting firm handling sensitive client PDFs, this isn't optional — it's the only architecturally correct choice.
⊕ View on GitHub → github.com/Harshaaalll/rag-chatbot
Project 04 · Datathon · November 2025
Medical Bill OCR &
Fraud Detection
An end-to-end pipeline for extracting structured data from scanned hospital bills and automatically detecting fraudulent claims using four independent anomaly-detection algorithms running in parallel.
OCR · Fraud Detection · Computer Vision
The Problem
Insurance companies process millions of scanned hospital bills. Manual review is slow; fraudsters exploit this with inflated totals, duplicated submissions, and impossible service combinations. We needed an automated triage system.
The Approach
Two-stage pipeline. Stage 1: aggressive image cleanup → OCR → LLM extraction turns noisy scans into clean JSON. Stage 2: four independent fraud checks vote on a risk score that routes the claim to ACCEPT / REVIEW / REJECT.
The Impact
90%+ extraction accuracy on degraded scans. Four fraud signals catch issues a single check would miss. Risk-scored output means humans only spend time on the actually suspicious 10% — not all 100%.
90%+
Extraction accuracy
4
Fraud checks
0.8
Rejection threshold
30
Day SHA-256 window
300
DPI target resolution
System Architecture — Extraction + parallel fraud detection
Two-stage pipeline: image → structured data → risk score
The four fraud checks run in parallel, not sequence — each catches a different attack pattern. Their outputs are weighted into a single risk score that drives an automated routing decision.
STAGE 1 · DATA EXTRACTION
  • 🖼️ Scanned Bill: PNG/PDF, often blurry and skewed
  • 🔧 OpenCV Cleanup: deskew · CLAHE · bilateral filter · 4× DPI upscale → 300 DPI
  • 👁️ PaddleOCR: angle classifier first, then text detection + recognition
  • 🤖 Qwen 7B (Ollama): raw text → structured JSON (patient · items · totals)
  • 📋 Structured Bill: JSON with patient, items[], amounts, total, date
STAGE 2 · PARALLEL FRAUD CHECKS (4 INDEPENDENT SIGNALS)
  • ① IQR Anomaly: flags charges > 95th percentile × 1.5, using bill-level context rather than global averages. Catches inflated single line items.
  • ② Reconciliation: Σ(line items) vs declared total, with a >1% mismatch flagged. Catches inflated grand totals.
  • ③ SHA-256 Dedup: hashes bill content against a 30-day cache and rejects exact duplicates. Catches the same bill sent to multiple insurers.
  • ④ Pattern Analysis: same service billed twice, impossible combinations, spiked quantities. Catches structural fraud patterns.
📊 Weighted Risk Score (0–1): each check contributes a sub-score; the weighted sum drives routing: ✓ ACCEPT (risk < 0.4) · ⚠ MANUAL REVIEW (0.4 ≤ risk < 0.8) · ✗ REJECT (risk ≥ 0.8)
Step-by-step workflow
01
Aggressive image cleanup
Real bills are blurry, skewed, and shadowed. We deskew via Hough transform, denoise with bilateral filter (preserves letter edges), apply CLAHE for local contrast, and upscale 4× to hit 300 DPI — the resolution OCR engines expect.
OpenCV · Hough · CLAHE · bilateral
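A sketch of the cleanup chain in OpenCV; parameter values are typical defaults rather than the tuned production ones, and the deskew sign convention may need flipping for a given scanner:

```python
import cv2
import numpy as np

def clean_scan(path, upscale=4):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Deskew: Hough lines give the dominant text angle; rotate to cancel it.
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=img.shape[1] // 3, maxLineGap=20)
    angles = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            a = np.degrees(np.arctan2(y2 - y1, x2 - x1))
            if abs(a) < 45:              # ignore vertical rules and borders
                angles.append(a)
    if angles:
        h, w = img.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), float(np.median(angles)), 1.0)
        img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    img = cv2.bilateralFilter(img, 9, 75, 75)                             # denoise, keep letter edges
    img = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)  # local contrast
    img = cv2.resize(img, None, fx=upscale, fy=upscale,
                     interpolation=cv2.INTER_CUBIC)                       # toward 300 DPI
    return img
```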
02
OCR with PaddleOCR
PaddleOCR runs an angle classifier first (handles rotated docs), then text detection and recognition. Outputs raw text strings with bounding boxes — no structure yet, just words.
PaddleOCR · DB detector · CRNN
03
LLM-based extraction
Qwen 7B (via Ollama, local) receives raw OCR text and returns a strict JSON schema: patient name, line items, amounts, total, date. The LLM handles formatting variance across different hospital templates.
Qwen 7B · Ollama · JSON schema
04
Run all 4 fraud checks in parallel
IQR (per-bill outliers), reconciliation (sum vs declared total), SHA-256 (duplicate submission across 30-day window), and pattern analysis (impossible combos) each run independently and produce a sub-score.
NumPy · hashlib · rule engine
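Three of the four checks reduce to a few lines each. The sub-score scaling shown here is illustrative, since the production scores were calibrated on historical fraud cases:

```python
import hashlib
import numpy as np

def iqr_anomaly_score(amounts):
    """Charges above this bill's own 95th percentile x 1.5 are outliers."""
    cutoff = np.percentile(amounts, 95) * 1.5
    return float(np.mean([a > cutoff for a in amounts]))

def reconciliation_score(amounts, declared_total):
    """A >1% gap between the item sum and the declared total is flagged."""
    gap = abs(sum(amounts) - declared_total) / max(declared_total, 1e-9)
    return 1.0 if gap > 0.01 else 0.0

def duplicate_score(bill_text, seen_hashes):
    """Exact-duplicate check against a 30-day SHA-256 cache."""
    digest = hashlib.sha256(bill_text.encode("utf-8")).hexdigest()
    return 1.0 if digest in seen_hashes else 0.0
```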
05
Aggregate to risk score
The four sub-scores are combined into a single weighted risk score in [0, 1]. Each check has a tunable weight, calibrated against historical fraud cases.
Weighted ensemble
06
Route the claim
< 0.4 → auto-accept, 0.4–0.8 → human review, ≥ 0.8 → auto-reject. Humans only see the 10–15% of bills the system is uncertain about — saving the bulk of review time.
Threshold routing
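Aggregation and routing in one place, with assumed weights (the real weights were calibrated against historical cases):

```python
WEIGHTS = {"iqr": 0.25, "recon": 0.25, "dup": 0.30, "pattern": 0.20}  # assumed

def route(sub_scores: dict) -> tuple[str, float]:
    risk = sum(WEIGHTS[k] * sub_scores.get(k, 0.0) for k in WEIGHTS)
    if risk >= 0.8:
        return "REJECT", risk
    if risk >= 0.4:
        return "MANUAL REVIEW", risk
    return "ACCEPT", risk

route({"iqr": 0.6, "recon": 1.0, "dup": 0.0, "pattern": 0.5})  # ('MANUAL REVIEW', 0.5)
```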
The 4 fraud detection checks
  • IQR Amount Anomaly: Flags charges exceeding 95th percentile × 1.5 on that bill. Adapts to bill-level context, not global averages.
  • Reconciliation Check: Sums all line items and compares to declared total. Any >1% discrepancy is flagged — inflated totals are a classic fraud signal.
  • SHA-256 Duplicate Detection: Hashes bill content, checks against 30-day cache. Prevents the same bill from being resubmitted to multiple insurers.
  • Pattern Analysis: Detects same service billed twice, unusually high quantities, and suspicious service combinations in one visit.
Image preprocessing pipeline
  • Deskewing: Hough Line Transform detects text angle and rotates image to straighten it
  • Bilateral filter: Removes noise while preserving letter edge sharpness (unlike regular blur)
  • CLAHE: Local contrast enhancement — brightens dark corners without overexposing the rest
  • Adaptive thresholding: Region-specific black/white conversion — handles shadows correctly
  • 4× DPI scaling: Upscales toward 300 DPI for OCR accuracy
Key Learning
No single fraud signal is reliable on its own — but four independent signals combined are nearly impossible to fool. A clever fraudster might bypass IQR by spreading inflated amounts across many small line items, but that breaks reconciliation. They might avoid reconciliation issues but get caught by SHA-256 duplication. Defence-in-depth, applied to model design.
⊕ View on GitHub → github.com/Harshaaalll/medical-bill-ocr-fraud
Project 05 · BITS Hyderabad · Aug–Nov 2024
Multilingual ASR
in Low-Resource Languages
Automatic speech recognition for Urdu — a low-resource language. Zero-shot Whisper transcription with a two-stage post-processing pipeline including IndicBERT-based MLM error correction.
Speech AI · NLP · Audio Processing
The Problem
Speech recognition for English is excellent. For low-resource languages like Urdu, it's poor — and there's not enough labelled data to train a new model from scratch. We needed a working Urdu transcriber without millions of labelled hours.
The Approach
Use Whisper zero-shot for the acoustic part, then fix its mistakes using a second model that understands the Urdu language. IndicBERT acts as a linguistic spell-checker over the transcript, replacing phonetically similar words that don't fit the context.
The Impact
14% reduction in Word Error Rate over baseline Whisper alone. Pipeline works on noisy real-world recordings thanks to the audio cleanup stage. Generalisable architecture — swap IndicBERT for any MLM and it works for other languages.
14%
WER reduction
-10dB
Noise reduction
16kHz
Resampled rate
0.4
MLM threshold
System Architecture — Acoustic + linguistic two-pass
Specialist-models architecture: Whisper hears, IndicBERT reads
Instead of one giant model trying to do everything, we let two specialists each do what they're best at — Whisper handles audio, IndicBERT handles language. Their combination beats either alone.
PHASE 1 · AUDIO CONDITIONING
  • 🎵 Raw Audio: variable format and quality
  • 🔊 pydub Pipeline: mono · -20dBFS normalisation · silence trim · 16kHz
  • 📉 Denoise: spectral subtraction, ~ -10dB noise floor
PHASE 2 · ACOUSTIC MODEL (HEAR)
  • 🎤 Whisper (small): zero-shot Urdu with a language-biasing token
  • 📝 Raw Transcript: contains phonetic substitutions, i.e. similar-sounding wrong words
PHASE 3 · LINGUISTIC CORRECTION (READ)
  • 🔤 IndicBERT MLM: pre-trained on Indian languages. For each token, mask the word, show the context to IndicBERT, and score P(word | context): "is this word likely here?"
  • Threshold 0.4: if P < 0.4, replace the token with IndicBERT's suggestion. Iterate over every token in the transcript.
  • ✓ Final Transcription: WER tracked vs reference, ↓ 14% over Whisper alone
Step-by-step workflow
01
Standardise the audio
Convert to mono (Whisper expects mono), normalise to -20dBFS so loudness is consistent across recordings, trim leading/trailing silence, and resample to 16kHz — Whisper's training rate.
pydub · ffmpeg backend
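The conditioning chain in pydub; a runnable sketch (input filename illustrative):

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def standardise(path):
    audio = AudioSegment.from_file(path)
    audio = audio.set_channels(1)                 # Whisper expects mono
    audio = audio.apply_gain(-20.0 - audio.dBFS)  # normalise to -20 dBFS
    start = detect_leading_silence(audio)
    end = detect_leading_silence(audio.reverse())
    audio = audio[start:len(audio) - end]         # trim edge silence
    return audio.set_frame_rate(16000)            # Whisper's training rate

standardise("recording.mp3").export("cleaned_16k.wav", format="wav")
```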
02
Subtract background noise
noisereduce estimates the noise floor from silent regions and subtracts it spectrally from the speech signal — about -10dB reduction without distorting voice.
librosa · noisereduce
03
Transcribe with Whisper
Whisper (small) is run zero-shot with a language token forcing Urdu output. The result is a transcript with the right number of words but plausible-sounding errors — phonetically close, semantically wrong.
openai/whisper-small
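The zero-shot call via the openai-whisper package (filename illustrative):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("cleaned_16k.wav", language="ur")  # bias decoding to Urdu
print(result["text"])
```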
04
Mask each token
For every word in the transcript, we hide it (replace with [MASK]) and ask IndicBERT to score how likely the original word is given the surrounding context.
HuggingFace fill-mask
05
Threshold & replace
If IndicBERT's probability for the original word is below 0.4, that word probably doesn't fit — we replace it with the highest-probability alternative IndicBERT suggests.
argmax over vocab
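A sketch of the mask-score-replace loop, using the public ai4bharat/indic-bert checkpoint as a stand-in. The fill-mask pipeline scores the original word via its targets argument; words that split into multiple subwords are approximated by their first piece:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="ai4bharat/indic-bert")

def correct(tokens, threshold=0.4):
    fixed = []
    for i, word in enumerate(tokens):
        masked = " ".join(tokens[:i] + [fill.tokenizer.mask_token] + tokens[i + 1:])
        score = fill(masked, targets=[word])[0]["score"]  # P(word | context)
        if score < threshold:
            word = fill(masked)[0]["token_str"]           # best-scoring replacement
        fixed.append(word)
    return fixed
```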
06
Measure WER
Word Error Rate is computed against gold-standard transcripts. The two-stage pipeline produces ~14% lower WER than baseline Whisper alone — significant for a low-resource language.
jiwer · WER metric
Audio preprocessing rationale
  • Mono conversion: Whisper expects mono. Stereo creates phase issues that confuse the model.
  • -20dBFS normalisation: Standardises loudness across mics and environments.
  • Silence trimming: Removes leading/trailing silence that adds nothing but inflates WER.
  • 16kHz resampling: Whisper was trained on 16kHz. Mismatched sample rates degrade quality.
  • Spectral subtraction: Estimates noise floor in silence, subtracts from speech. ~ -10dB reduction.
Why IndicBERT as the corrector?
  • Whisper makes phonetically plausible errors — e.g. transcribing a similar-sounding word
  • MLM (Masked Language Model): IndicBERT is pre-trained to predict masked tokens from context
  • For each transcribed word, check: given the surrounding context, is this the most likely word?
  • If confidence for the transcribed word is below 0.4, replace with the highest-probability alternative
  • IndicBERT is specifically trained on Indian languages — understands Urdu context better than generic models
Key Learning
The two-stage design — ASR first, then MLM correction — is more powerful than trying to build a perfect ASR model. Whisper handles the acoustic modelling, IndicBERT handles linguistic correction. This separation of concerns lets each model do what it's best at: Whisper is excellent at converting audio to text, IndicBERT is excellent at deciding whether a word makes sense in context. Combining specialist models often beats one general model trying to do everything.
⊕ View on GitHub → github.com/Harshaaalll/multilingual-asr
Project 06 · Independent · Jun–Jul 2024
Market Analysis
Using LLMs
An automated financial news sentiment pipeline. Scrapes, summarises, and analyses 500+ articles using BART and RoBERTa, producing real-time sentiment scores served via a Flask API with a Streamlit dashboard.
LLM · Sentiment · FinTech
The Problem
Markets move on news, but no human can read 500+ articles a day. Existing sentiment APIs are either too generic (consumer Twitter sentiment) or too expensive at scale. We needed a financial-domain pipeline running locally.
The Approach
Three-stage pipeline. Scrape with BeautifulSoup. Summarise with two-pass BART (handles articles longer than the model's 1024-token limit). Score with a Twitter-trained RoBERTa sentiment model — using a continuous compound score, not a discrete class.
The Impact
500+ articles/day processed. 70% length reduction via summarisation while keeping financial signals. Real-time API + Streamlit dashboard surfaces sentiment trends live. SMOTE applied correctly — only on training data, no leakage.
500+
Articles processed
70%
Text length reduction
67%
Sentiment accuracy
4
BART beam size
55:46
Class split (pre-SMOTE)
System Architecture — Scrape → summarise → score → serve
Two-pass summarisation pipeline with continuous sentiment scoring
The interesting part is the two-pass BART in the middle. Many financial articles exceed BART's 1024-token context window, so we summarise chunks first, then summarise the chunk-summaries.
  • 🕸️ BeautifulSoup: 500+ articles scraped from financial sources
  • ✂️ LangChain Split: article → chunks of ≤ 1024 tokens each
  • 📰 Two-Pass BART: facebook/bart-large-cnn with 4-beam search. Pass 1 summarises each chunk → N chunk-summaries. Pass 2 summarises the summaries → 70% length reduction.
  • 📑 Final Summary: earnings · guidance · risk, ~30% of the original
  • 💭 RoBERTa Sentiment: cardiffnlp/twitter-roberta-base, softmax → compound score (continuous, not discrete)
  • ⚖️ SMOTE Balancing: pre-SMOTE class split 55:46, applied AFTER the train/test split so no synthetic data leaks
  • 🚀 Serving Layer: Flask REST API (/sentiment) + Streamlit live dashboard for real-time market signals
⚠ CRITICAL: SMOTE is applied to the training set ONLY. Applying it before the split would leak synthetic samples into the test set.
Output sample: {"ticker": "AAPL", "compound": +0.62, "headline": "...", "trend_24h": "↑"}
Step-by-step workflow
01
Scrape financial news
BeautifulSoup pulls 500+ articles from financial sources daily — headlines, body text, publish date, ticker mentions.
BeautifulSoup · requests
02
Chunk long articles
Articles exceeding BART's 1024-token limit are split into manageable chunks with LangChain's token-aware text splitter.
LangChain · token-aware splitter
03
Two-pass BART summarisation
Pass 1 summarises each chunk independently. Pass 2 summarises the combined chunk-summaries into a final article summary. 4-beam search produces higher quality than greedy decoding by keeping the 4 highest-scoring candidate sequences at each decoding step.
facebook/bart-large-cnn · 4-beam
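A compact sketch of the two passes (summary lengths are illustrative):

```python
from transformers import pipeline

summarise = pipeline("summarization", model="facebook/bart-large-cnn")

def two_pass(chunks):
    # Pass 1: each <=1024-token chunk gets its own summary.
    firsts = [summarise(c, num_beams=4, max_length=130, min_length=30,
                        truncation=True)[0]["summary_text"] for c in chunks]
    # Pass 2: summarise the concatenated chunk-summaries into one.
    return summarise(" ".join(firsts), num_beams=4, max_length=150,
                     min_length=40, truncation=True)[0]["summary_text"]
```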
04
Score sentiment with RoBERTa
cardiffnlp/twitter-roberta-base-sentiment scores the summary. Instead of argmax (one class), we compute a softmax-weighted compound score across positive/neutral/negative — a continuous signal more useful for tracking trends.
RoBERTa · softmax compound
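The compound score in code; for this checkpoint the label order is negative, neutral, positive:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cardiffnlp/twitter-roberta-base-sentiment"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def compound(text: str) -> float:
    logits = model(**tok(text, return_tensors="pt", truncation=True)).logits
    p_neg, p_neu, p_pos = torch.softmax(logits, dim=-1)[0].tolist()
    return p_pos - p_neg  # continuous score in [-1, 1]
```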
05
Balance training data — correctly
For the supervised classifier, classes were 55:46. SMOTE was applied only after train/test split, on the training portion. Applying it earlier would leak synthetic samples into validation and inflate accuracy.
imblearn · SMOTE (post-split)
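The correct ordering in a few lines, assuming a feature matrix X and labels y already exist:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Resample the training portion only: synthetic points must never
# reach the test set, or accuracy is inflated by leakage.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```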
06
Serve via API + dashboard
A Flask endpoint returns sentiment-per-ticker for downstream consumers. A Streamlit dashboard shows live sentiment trends — green when positive sentiment dominates, red when negative.
Flask · Streamlit · WebSocket
Why two-pass BART summarisation?
  • Financial news articles often exceed BART's 1024-token context limit
  • Pass 1: Split article into chunks, summarise each independently
  • Pass 2: Summarise the combined chunk-summaries into a final summary
  • 4-beam search produces better quality than greedy decoding — it keeps the 4 highest-scoring candidate sequences at each decoding step
  • Result: 70% length reduction while preserving key financial signals (earnings, guidance, risk factors)
Why RoBERTa over BERT for sentiment?
  • cardiffnlp/twitter-roberta-base-sentiment is trained on Twitter data, whose short, headline-like register is closer to financial news than generic review-sentiment models
  • Compound score: Instead of argmax (pick one class), compute softmax-weighted sum across positive/neutral/negative
  • This gives a continuous sentiment score rather than a discrete class — more useful for tracking sentiment trends over time
  • SMOTE only on training data — a common mistake is applying SMOTE before splitting, which leaks synthetic samples into validation. Applied correctly here.
Key Learning
The most important data-science lesson from this project: SMOTE must only be applied to training data, never before the train-test split. Applying SMOTE on the full dataset before splitting creates data leakage — synthetic samples generated from real training examples end up in the test set, inflating accuracy metrics. The correct workflow: split first, then apply SMOTE only on the training portion.
⊕ View on GitHub → github.com/Harshaaalll/market-analysis-llm