Project 01 · Standard Chartered GBS · Jul–Dec 2025
LMA Clause
Identification Tool
Production NLP pipeline on Dataiku DSS that automatically identifies and classifies legal clauses in Loan Market Association (LMA) agreements — reducing manual compliance audit time by 90%.
NLP Fine-Tuning Production
90%
Audit time reduced
98.6%
Recall on rare clauses
95%+
False positives filtered
0.93
Weighted F1 score
250
Token window size
📄
PDF Input
100+ page LMA agreements
✂️
ETL Chunking
250-token windows, 50-token overlap
🧠
DistilBERT
Fine-tuned; WeightedTrainer 20x class boost
🎯
Confidence Filter
Minimum correct confidence method
🔍
SBERT Search
Cosine-similarity match against gold-standard clauses
Audit Output
Classified clauses + confidence
Technical decisions
  • Why DistilBERT: Encoder-only, optimised for classification. 40% smaller than BERT, 97% of its accuracy.
  • Why 250-token windows: 75-token windows caused data leakage (1.0 F1 = red flag). 250 tokens preserved context without bleeding.
  • WeightedTrainer: 90%+ of document text is irrelevant. Class weight boost prevents the model from always predicting "irrelevant".
  • Minimum correct confidence: KDE thresholding was rejected because no reliable probability curve existed for incorrect predictions; a data-driven threshold on correct-prediction confidence was used instead.
  • SBERT fine-tuning: MultipleNegativesRankingLoss on legal anchor-positive pairs — understood synonym clauses semantically.
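The chunking decision above can be sketched in a few lines. This is an illustrative reimplementation, not the production Dataiku code; the function and parameter names are assumptions:

```python
def chunk_tokens(tokens, window=250, overlap=50):
    """Split a token sequence into overlapping windows.

    The 50-token overlap preserves clause context across chunk
    boundaries, while the 250-token window (with splits made at the
    document level) avoids the train/val leakage that smaller windows
    produced when one clause spanned many near-duplicate chunks.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks
```

A 500-token document yields three windows: tokens 0-249, 200-449, and 400-499, so every boundary clause appears whole in at least one chunk.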
Key results
  • Train accuracy: 0.96 | Val accuracy: 0.93 | Weighted F1: 0.93
  • Recall on rare clauses: 98.6% — critical because a missed sanctions clause is a compliance failure
  • False positive reduction: 95%+ filtered — auditors only review high-confidence, relevant predictions
  • Deployment: Dataiku DSS pipeline with separate threshold calibration app for non-technical auditors
  • Business impact: Legal team could review 10x more agreements in the same time
Key Learning
The most important lesson from this project: a perfect F1 score is a red flag, not a success. When 75-token windows produced 1.0 F1, I suspected data leakage — clause text was appearing in both training and validation chunks. Reducing to 250-token windows with careful splits fixed the leakage and produced honest metrics. Trusting suspicious results would have shipped a broken model to production.
⊕ View on GitHub → github.com/Harshaaalll/lma-clause-identifier
Project 02 · Standard Chartered GBS · Aug–Nov 2025
AutoTransFlow:
Multilingual Document AI
Layout-preserving multilingual PDF translation system. Translates financial and legal documents into 200 languages while maintaining exact spatial structure — built entirely on local models with zero external API calls.
Computer Vision Translation Document AI
200
Languages supported
0
External API calls
100%
Layout preserved
0.25
Confidence threshold
📑
PDF → Duplicate
BabelDoc font embedding for target language
🖼️
Pixmap Convert
Page → image [0,1] normalised
📐
doc-layout-yolo
Bounding boxes per text block, confidence > 0.25
🌍
NLLB-200
Facebook 600M, local inference
🎯
Re-render
Text placed at original coordinates
📋
Final PDF
Original + translated appended
Why not pdfplumber or PyMuPDF?
  • Traditional PDF libraries extract text as a linear stream — no understanding of visual structure
  • Cannot differentiate headings from paragraphs, tables from captions
  • Spatial relationship between text blocks is completely lost
  • Output would be a long unformatted text block — legally unusable for financial documents
  • Solution: Treat the page as an image, use computer vision to understand layout first
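The layout-first idea can be illustrated with a toy reading-order sort over detected bounding boxes. This is a simplification: the real pipeline keeps every box's coordinates for re-rendering, and `reading_order` is a hypothetical helper, not part of doc-layout-yolo:

```python
def reading_order(boxes, line_tol=10.0):
    """Order layout-model boxes (x0, y0, x1, y1) for extraction:
    group boxes into rows by their top edge, then sort left to right.
    This row/column structure is exactly the spatial information that
    a linear text stream from pdfplumber-style extraction discards."""
    return sorted(boxes, key=lambda b: (round(b[1] / line_tol), b[0]))
```

Two boxes on the same visual line sort left to right even if the PDF's internal stream stores them in the opposite order.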
Why NLLB-200 over Google Translate?
  • Data privacy requirement: Standard Chartered policy prohibits sending documents to external cloud APIs
  • NLLB-200 runs entirely on internal servers — zero data leaves the network
  • forced_bos_token_id: Steers decoder to target language (e.g. eng_Latn, hin_Deva)
  • Covers 200 languages including low-resource languages that Google handles poorly
  • Modular design — translation backend can be swapped (Argos Translate as alternative)
Key Insight
The critical innovation is treating document translation as a computer vision problem first, not a text problem. By using doc-layout-yolo to map the visual structure before extracting text, the system knows exactly where every word lives on the page — not just what it says. This makes it possible to put translated text back in exactly the right position, preserving the legal and structural integrity of the document.
⊕ View on GitHub → github.com/Harshaaalll/autotransflow
Project 03 · DG Liger Consulting · Jun–Jul 2024
RAG Chatbot with
Conversational Memory
Production retrieval-augmented generation chatbot built over proprietary PDF documents. Multi-turn conversation support with LangChain's ConversationalRetrievalChain and local StableLM inference.
RAG LangChain LLM
500
Chunk size (chars)
3B
StableLM params
Q4_K_M
Quantisation level
384
Embedding dimensions
📁
LangChain Loaders
PDFs + web pages
✂️
Text Splitting
Recursive, 500 chars, 50 overlap
🔢
all-MiniLM-L6-v2
384-dim sentence embeddings
🗄️
FAISS Index
In-memory similarity search
🤖
StableLM Zephyr 3B
GGUF quantised local inference
💬
ConversationalChain
BufferMemory multi-turn
Why ConversationalRetrievalChain matters
  • Standard RetrievalQA treats each question in isolation
  • Follow-up questions like "What about the fees?" have no context without history
  • ConversationalRetrievalChain rephrases follow-up questions into standalone queries before searching
  • ConversationBufferMemory stores the full conversation history
  • Result: the chatbot maintains coherent multi-turn dialogue across a session
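The rephrase-then-retrieve pattern can be sketched without LangChain. This is an illustrative toy, not ConversationalRetrievalChain's actual implementation; `rephrase_llm` stands in for the LLM call:

```python
class ConversationBuffer:
    """Toy version of the conversation-buffer pattern: keep every
    (question, answer) turn and render the history as context so
    follow-up questions can be rephrased into standalone queries."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))

    def history(self):
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


def rephrase_followup(buffer, followup, rephrase_llm):
    """Condense history plus follow-up into one standalone question
    before hitting the vector store, so "What about the fees?" still
    retrieves the right chunks."""
    prompt = f"{buffer.history()}\nFollow-up: {followup}\nStandalone question:"
    return rephrase_llm(prompt)
```

The standalone question, not the raw follow-up, is what gets embedded and searched against FAISS.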
Why StableLM Zephyr 3B?
  • Local inference: No API costs, no latency from external calls, no data privacy risk
  • Q4_K_M quantisation: 4-bit weights reduce model from ~6GB to ~2GB — fits on standard hardware
  • GGUF format: Optimised for CPU inference via llama.cpp
  • Instruction-tuned: Zephyr variant is fine-tuned for following instructions and staying grounded
  • Trade-off acknowledged: smaller than GPT-4, but sufficient for grounded Q&A over retrieved context
Key Insight
The most important design decision was local inference over API calls. By running StableLM Zephyr locally in GGUF format, the system has no ongoing API costs, no latency from network calls, and no risk of sending proprietary client documents to external servers. For a consulting firm handling sensitive client PDFs, this isn't optional — it's the only architecturally correct choice.
⊕ View on GitHub → github.com/Harshaaalll/rag-chatbot
Project 04 · Datathon · November 2025
Medical Bill OCR &
Fraud Detection
End-to-end pipeline for extracting structured data from scanned hospital bills and automatically detecting fraudulent claims using four independent anomaly detection algorithms.
OCR Fraud Detection Computer Vision
90%+
Extraction accuracy
4
Fraud checks
0.8
Rejection threshold
30
Day SHA-256 window
300
DPI target resolution
🖼️
Scanned Bill
PNG / PDF input
🔧
OpenCV Preprocessing
Deskew, CLAHE, bilateral filter, 4x DPI
👁️
PaddleOCR
Angle classifier + text detection
🤖
Qwen 7B
Structured JSON extraction via Ollama
🚨
4 Fraud Checks
IQR + Reconciliation + SHA-256 + Pattern
📊
Risk Score
0–1 → ACCEPT / REVIEW / REJECT
The 4 fraud detection checks
  • IQR Amount Anomaly: Flags charges exceeding 1.5× the 95th percentile of line-item amounts on that bill. Adapts to bill-level context, not global averages.
  • Reconciliation Check: Sums all line items and compares to declared total. Any >1% discrepancy is flagged — inflated totals are a classic fraud signal.
  • SHA-256 Duplicate Detection: Hashes bill content, checks against 30-day cache. Prevents same bill from being resubmitted to multiple insurers.
  • Pattern Analysis: Detects same service billed twice, unusually high quantities, and suspicious service combinations in one visit.
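Two of the four checks are simple enough to sketch directly. These are illustrative versions under assumed names; the 30-day expiry of the hash cache is omitted:

```python
import hashlib

def reconciliation_check(line_items, declared_total, tolerance=0.01):
    """Reconciliation check: line items must sum to the declared
    total. Any discrepancy above 1% is flagged, since inflated
    totals are a classic fraud signal."""
    if declared_total <= 0:
        return True  # a non-positive declared total is itself suspicious
    return abs(sum(line_items) - declared_total) / declared_total > tolerance

def is_duplicate(bill_text, seen_hashes):
    """SHA-256 duplicate detection: hash the bill content and compare
    against previously submitted bills, preventing the same bill from
    being resubmitted to multiple insurers."""
    digest = hashlib.sha256(bill_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```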
Image preprocessing pipeline
  • Deskewing: Hough Line Transform detects text angle and rotates image to straighten it
  • Bilateral filter: Removes noise while preserving letter edge sharpness (unlike regular blur)
  • CLAHE: Local contrast enhancement — brightens dark corners without overexposing the rest
  • Adaptive thresholding: Region-specific black/white conversion — handles shadows correctly
  • 4x DPI scaling: Upscales roughly 4x from 72 DPI toward the 300 DPI target resolution for OCR accuracy
Risk Score Decision Matrix
< 0.4
ACCEPT
0.4–0.6
CAUTION
0.6–0.8
REVIEW
> 0.8
REJECT
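The decision matrix maps directly to code. This is a sketch; the handling of scores landing exactly on a boundary is an assumption:

```python
def decide(risk_score):
    """Map a 0-1 risk score to an action per the decision matrix.
    Boundary handling at exactly 0.4, 0.6 and 0.8 is assumed here."""
    if risk_score > 0.8:
        return "REJECT"
    if risk_score > 0.6:
        return "REVIEW"
    if risk_score >= 0.4:
        return "CAUTION"
    return "ACCEPT"
```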
⊕ View on GitHub → github.com/Harshaaalll/medical-bill-ocr-fraud
Project 05 · BITS Hyderabad · Aug–Nov 2024
Multilingual ASR
in Low-Resource Languages
Automatic speech recognition system for Urdu — a low-resource language. Zero-shot Whisper transcription with a two-stage post-processing pipeline including IndicBERT-based MLM error correction.
Speech AI NLP Audio Processing
14%
WER reduction
-10dB
Noise reduction
16kHz
Resampled rate
0.4
MLM probability threshold
🎵
Raw Audio
Variable format input
🔊
pydub Pipeline
Mono, -20dBFS norm, silence trim, 16kHz
📉
Noise Reduction
librosa + noisereduce spectral subtraction
🎤
Whisper (small)
Zero-shot Urdu, language biasing
🔤
IndicBERT MLM
Corrects phonetic substitutions below 0.4 confidence
📝
Transcription
WER-tracked output
Audio preprocessing rationale
  • Mono conversion: Whisper expects mono audio. Stereo creates phase issues that confuse the model.
  • -20dBFS normalisation: Standardises loudness across recordings from different microphones and environments.
  • Silence trimming: Removes leading/trailing silence that contributes to WER without adding information.
  • 16kHz resampling: Whisper was trained on 16kHz audio. Mismatched sample rates degrade quality.
  • Spectral subtraction: Estimates noise floor during silence periods, subtracts from speech signal. -10dB reduction.
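The silence-trimming step can be sketched on raw sample values. This is a toy version: the project does this inside the pydub pipeline, and the amplitude threshold here is an assumption:

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose amplitude falls below
    the threshold. Silence adds no information to the transcript but
    inflates WER by padding the audio that Whisper must align."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```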
Why IndicBERT as corrector?
  • Whisper makes phonetically plausible errors — e.g. transcribing a similar-sounding word
  • MLM (Masked Language Model): IndicBERT is pre-trained to predict masked tokens from context
  • For each transcribed word, check: given the surrounding context, is this the most likely word?
  • If the model's confidence for the transcribed word is below 0.4, replace with the highest-probability alternative
  • IndicBERT specifically trained on Indian languages — understands Urdu context better than general models
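The correction loop can be sketched with the MLM behind a stub. This is illustrative: `score_word` stands in for IndicBERT's masked-token prediction, and the example words in the test are hypothetical:

```python
def mlm_correct(words, score_word, threshold=0.4):
    """Stage-two correction: for each transcribed word, ask a
    masked-LM scorer for (confidence of the word in its context,
    best alternative) and substitute the alternative whenever the
    confidence falls below the 0.4 threshold."""
    corrected = []
    for i, word in enumerate(words):
        confidence, best_alternative = score_word(words, i)
        corrected.append(word if confidence >= threshold else best_alternative)
    return corrected
```

In the full system, `score_word` masks position `i`, runs IndicBERT, and reads the softmax probability assigned to the transcribed word.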
Key Learning
The two-stage design — ASR first, then MLM correction — is more powerful than trying to build a perfect ASR model. Whisper handles the acoustic modelling, IndicBERT handles linguistic correction. This separation of concerns means each model does what it's best at: Whisper is excellent at converting audio to text, IndicBERT is excellent at understanding whether a word makes sense in context. Combining specialist models often beats one general model trying to do everything.
⊕ View on GitHub → github.com/Harshaaalll/multilingual-asr
Project 06 · Independent · Jun–Jul 2024
Market Analysis
Using LLMs
Automated financial news sentiment pipeline. Scraped, summarised, and analysed 500+ articles using BART and RoBERTa, producing real-time sentiment scores served via Flask API with Streamlit dashboard.
LLM Sentiment FinTech
500+
Articles processed
70%
Text length reduction
67%
Sentiment accuracy
4
BART beam size
55:46
Class split (pre-SMOTE)
🕸️
BeautifulSoup
500+ financial articles scraped
✂️
LangChain Chunking
Unstructured text workflow
📰
BART Two-Pass
facebook/bart-large-cnn, 4-beam, 70% reduction
💭
RoBERTa Sentiment
cardiffnlp Twitter sentiment, softmax weighted
⚖️
SMOTE Balancing
On training set only, 55:46 → balanced
🚀
Flask API
Real-time + Streamlit dashboard
Why two-pass BART summarisation?
  • Financial news articles often exceed BART's 1024-token context limit
  • Pass 1: Split article into chunks, summarise each independently
  • Pass 2: Summarise the combined chunk summaries into a final summary
  • 4-beam search produces better quality than greedy decoding: it keeps the 4 highest-scoring candidate sequences at each step instead of committing to a single token
  • Result: 70% length reduction while preserving key financial signals (earnings, guidance, risk factors)
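The two passes reduce to a small loop. This is a sketch; `summarise` stands in for the beam-search BART call:

```python
def two_pass_summarise(chunks, summarise):
    """Two-pass summarisation for articles exceeding BART's
    1024-token context. Pass 1 summarises each chunk independently;
    pass 2 summarises the concatenation of those partial summaries
    into the final summary."""
    partials = [summarise(chunk) for chunk in chunks]
    return summarise(" ".join(partials))
```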
Why RoBERTa over BERT for sentiment?
  • cardiffnlp/twitter-roberta-base-sentiment is trained on Twitter data; its short, news-reactive register is closer to financial headlines than generic review-trained sentiment models
  • Compound score: Instead of argmax (pick one class), compute softmax weighted sum across positive/neutral/negative probabilities
  • This gives a continuous sentiment score rather than a discrete class — more useful for tracking sentiment trends over time
  • SMOTE only on training data — a common mistake is applying SMOTE before splitting, which leaks synthetic samples into validation. Applied correctly here.
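The compound score is a short function over the model's (negative, neutral, positive) logits. This is a sketch, and that label order is an assumption about the classifier's output:

```python
import math

def compound_score(logits):
    """Softmax-weighted compound sentiment in [-1, 1]. Instead of
    argmax over (negative, neutral, positive), weight the class
    probabilities by (-1, 0, +1), yielding a continuous score that
    is far more useful for tracking sentiment trends over time."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return (exps[2] - exps[0]) / total  # p_pos - p_neg; neutral weighs 0
```

Equal logits give a score of 0; a strongly positive article approaches +1 and a strongly negative one approaches -1.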
Key Learning
The most important data science lesson from this project: SMOTE must only be applied to training data, never before the train-test split. Applying SMOTE on the full dataset before splitting creates data leakage — synthetic samples generated from real training examples end up in the test set, inflating accuracy metrics. The correct workflow: split first, then apply SMOTE only on the training portion.
⊕ View on GitHub → github.com/Harshaaalll/market-analysis-llm