Project 01 · Standard Chartered GBS · Jul–Dec 2025
LMA Clause
Identification Tool
Production NLP pipeline on Dataiku DSS that automatically identifies and classifies legal clauses in Loan Market Association (LMA) agreements — reducing manual compliance audit time by 90%.
NLP Fine-Tuning Production
90%
Audit time reduced
98.6%
Recall on rare clauses
95%+
False positives filtered
0.93
Weighted F1 score
250
Token window size
📄
PDF Input
100+ page LMA agreements
✂️
ETL Chunking
250-token windows, 50-token overlap
🧠
DistilBERT
Fine-tuned; WeightedTrainer 20x class boost
🎯
Confidence Filter
Minimum correct confidence method
🔍
SBERT Search
Cosine-similarity match against gold-standard clauses
Audit Output
Classified clauses + confidence
Technical decisions
  • Why DistilBERT: Encoder-only, optimised for classification. 40% smaller than BERT, 97% of its accuracy.
  • Why 250-token windows: 75-token windows caused data leakage (1.0 F1 = red flag). 250 tokens preserved context without bleeding.
  • WeightedTrainer: 90%+ of document text is irrelevant. Class weight boost prevents the model from always predicting "irrelevant".
  • Minimum correct confidence: KDE thresholding was rejected because no reliable probability curve existed for incorrect predictions; a data-driven threshold on correct-prediction confidence was used instead.
  • SBERT fine-tuning: MultipleNegativesRankingLoss on legal anchor-positive pairs — understood synonym clauses semantically.
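The chunking decision above can be sketched in a few lines. This is an illustrative reimplementation, not the production Dataiku code; the function and parameter names are assumptions:

```python
def chunk_tokens(tokens, window=250, overlap=50):
    """Split a token sequence into overlapping windows.

    The 50-token overlap preserves clause context across chunk
    boundaries, while the 250-token window (with splits made at the
    document level) avoids the train/val leakage that smaller windows
    produced when one clause spanned many near-duplicate chunks.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks
```

A 500-token document yields three windows: tokens 0-249, 200-449, and 400-499, so every boundary clause appears whole in at least one chunk.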
Key results
  • Train accuracy: 0.96 | Val accuracy: 0.93 | Weighted F1: 0.93
  • Recall on rare clauses: 98.6% — critical because a missed sanctions clause is a compliance failure
  • False positive reduction: 95%+ filtered — auditors only review high-confidence, relevant predictions
  • Deployment: Dataiku DSS pipeline with separate threshold calibration app for non-technical auditors
  • Business impact: Legal team could review 10x more agreements in the same time
Key Learning
The most important lesson from this project: a perfect F1 score is a red flag, not a success. When 75-token windows produced 1.0 F1, I suspected data leakage — clause text was appearing in both training and validation chunks. Reducing to 250-token windows with careful splits fixed the leakage and produced honest metrics. Trusting suspicious results would have shipped a broken model to production.
⊕ View on GitHub → github.com/Harshaaalll/lma-clause-identifier
Project 02 · Standard Chartered GBS · Aug–Nov 2025
AutoTransFlow:
Multilingual Document AI
Layout-preserving multilingual PDF translation system. Translates financial and legal documents into 200 languages while maintaining exact spatial structure — built entirely on local models with zero external API calls.
Computer Vision Translation Document AI
200
Languages supported
0
External API calls
100%
Layout preserved
0.25
Confidence threshold
📑
PDF → Duplicate
BabelDoc font embedding for target language
🖼️
Pixmap Convert
Page → image [0,1] normalised
📐
doc-layout-yolo
Bounding boxes per text block, confidence > 0.25
🌍
NLLB-200
Facebook 600M, local inference
🎯
Re-render
Text placed at original coordinates
📋
Final PDF
Original + translated appended
Why not pdfplumber or PyMuPDF?
  • Traditional PDF libraries extract text as a linear stream — no understanding of visual structure
  • Cannot differentiate headings from paragraphs, tables from captions
  • Spatial relationship between text blocks is completely lost
  • Output would be a long unformatted text block — legally unusable for financial documents
  • Solution: Treat the page as an image, use computer vision to understand layout first
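The layout-first idea can be illustrated with a toy reading-order sort over detected bounding boxes. This is a simplification: the real pipeline keeps every box's coordinates for re-rendering, and `reading_order` is a hypothetical helper, not part of doc-layout-yolo:

```python
def reading_order(boxes, line_tol=10.0):
    """Order layout-model boxes (x0, y0, x1, y1) for extraction:
    group boxes into rows by their top edge, then sort left to right.
    This row/column structure is exactly the spatial information that
    a linear text stream from pdfplumber-style extraction discards."""
    return sorted(boxes, key=lambda b: (round(b[1] / line_tol), b[0]))
```

Two boxes on the same visual line sort left to right even if the PDF's internal stream stores them in the opposite order.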
Why NLLB-200 over Google Translate?
  • Data privacy requirement: Standard Chartered policy prohibits sending documents to external cloud APIs
  • NLLB-200 runs entirely on internal servers — zero data leaves the network
  • forced_bos_token_id: Steers decoder to target language (e.g. eng_Latn, hin_Deva)
  • Covers 200 languages including low-resource languages that Google handles poorly
  • Modular design — translation backend can be swapped (Argos Translate as alternative)
Key Insight
The critical innovation is treating document translation as a computer vision problem first, not a text problem. By using doc-layout-yolo to map the visual structure before extracting text, the system knows exactly where every word lives on the page — not just what it says. This makes it possible to put translated text back in exactly the right position, preserving the legal and structural integrity of the document.
⊕ View on GitHub → github.com/Harshaaalll/autotransflow
Project 03 · DG Liger Consulting · Jun–Jul 2024
RAG Chatbot with
Conversational Memory
Production retrieval-augmented generation chatbot built over proprietary PDF documents. Multi-turn conversation support with LangChain's ConversationalRetrievalChain and local StableLM inference.
RAG LangChain LLM
500
Chunk size (chars)
3B
StableLM params
Q4_K_M
Quantisation level
384
Embedding dimensions
📁
LangChain Loaders
PDFs + web pages
✂️
Text Splitting
Recursive, 500 chars, 50 overlap
🔢
all-MiniLM-L6-v2
384-dim sentence embeddings
🗄️
FAISS Index
In-memory similarity search
🤖
StableLM Zephyr 3B
GGUF quantised local inference
💬
ConversationalChain
BufferMemory multi-turn
Why ConversationalRetrievalChain matters
  • Standard RetrievalQA treats each question in isolation
  • Follow-up questions like "What about the fees?" have no context without history
  • ConversationalRetrievalChain rephrases follow-up questions into standalone queries before searching
  • ConversationBufferMemory stores the full conversation history
  • Result: the chatbot maintains coherent multi-turn dialogue across a session
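The rephrase-then-retrieve pattern can be sketched without LangChain. This is an illustrative toy, not ConversationalRetrievalChain's actual implementation; `rephrase_llm` stands in for the LLM call:

```python
class ConversationBuffer:
    """Toy version of the conversation-buffer pattern: keep every
    (question, answer) turn and render the history as context so
    follow-up questions can be rephrased into standalone queries."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))

    def history(self):
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


def rephrase_followup(buffer, followup, rephrase_llm):
    """Condense history plus follow-up into one standalone question
    before hitting the vector store, so "What about the fees?" still
    retrieves the right chunks."""
    prompt = f"{buffer.history()}\nFollow-up: {followup}\nStandalone question:"
    return rephrase_llm(prompt)
```

The standalone question, not the raw follow-up, is what gets embedded and searched against FAISS.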
Why StableLM Zephyr 3B?
  • Local inference: No API costs, no latency from external calls, no data privacy risk
  • Q4_K_M quantisation: 4-bit weights reduce model from ~6GB to ~2GB — fits on standard hardware
  • GGUF format: Optimised for CPU inference via llama.cpp
  • Instruction-tuned: Zephyr variant is fine-tuned for following instructions and staying grounded
  • Trade-off acknowledged: smaller than GPT-4, but sufficient for grounded Q&A over retrieved context
Key Insight
The most important design decision was local inference over API calls. By running StableLM Zephyr locally in GGUF format, the system has no ongoing API costs, no latency from network calls, and no risk of sending proprietary client documents to external servers. For a consulting firm handling sensitive client PDFs, this isn't optional — it's the only architecturally correct choice.
⊕ View on GitHub → github.com/Harshaaalll/rag-chatbot
Project 04 · Datathon · November 2025
Medical Bill OCR &
Fraud Detection
End-to-end pipeline for extracting structured data from scanned hospital bills and automatically detecting fraudulent claims using four independent anomaly detection algorithms.
OCR Fraud Detection Computer Vision
90%+
Extraction accuracy
4
Fraud checks
0.8
Rejection threshold
30
Day SHA-256 window
300
DPI target resolution
🖼️
Scanned Bill
PNG / PDF input
🔧
OpenCV Preprocessing
Deskew, CLAHE, bilateral filter, 4x DPI
👁️
PaddleOCR
Angle classifier + text detection
🤖
Qwen 7B
Structured JSON extraction via Ollama
🚨
4 Fraud Checks
IQR + Reconciliation + SHA-256 + Pattern
📊
Risk Score
0–1 → ACCEPT / REVIEW / REJECT
The 4 fraud detection checks
  • IQR Amount Anomaly: Flags charges exceeding 1.5× the 95th percentile of line-item amounts on that bill. Adapts to bill-level context, not global averages.
  • Reconciliation Check: Sums all line items and compares to declared total. Any >1% discrepancy is flagged — inflated totals are a classic fraud signal.
  • SHA-256 Duplicate Detection: Hashes bill content, checks against 30-day cache. Prevents same bill from being resubmitted to multiple insurers.
  • Pattern Analysis: Detects same service billed twice, unusually high quantities, and suspicious service combinations in one visit.
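Two of the four checks are simple enough to sketch directly. These are illustrative versions under assumed names; the 30-day expiry of the hash cache is omitted:

```python
import hashlib

def reconciliation_check(line_items, declared_total, tolerance=0.01):
    """Reconciliation check: line items must sum to the declared
    total. Any discrepancy above 1% is flagged, since inflated
    totals are a classic fraud signal."""
    if declared_total <= 0:
        return True  # a non-positive declared total is itself suspicious
    return abs(sum(line_items) - declared_total) / declared_total > tolerance

def is_duplicate(bill_text, seen_hashes):
    """SHA-256 duplicate detection: hash the bill content and compare
    against previously submitted bills, preventing the same bill from
    being resubmitted to multiple insurers."""
    digest = hashlib.sha256(bill_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```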
Image preprocessing pipeline
  • Deskewing: Hough Line Transform detects text angle and rotates image to straighten it
  • Bilateral filter: Removes noise while preserving letter edge sharpness (unlike regular blur)
  • CLAHE: Local contrast enhancement — brightens dark corners without overexposing the rest
  • Adaptive thresholding: Region-specific black/white conversion — handles shadows correctly
  • 4x DPI scaling: Upscales roughly 4x from 72 DPI toward the 300 DPI target resolution for OCR accuracy
Risk Score Decision Matrix
< 0.4
ACCEPT
0.4–0.6
CAUTION
0.6–0.8
REVIEW
> 0.8
REJECT
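The decision matrix maps directly to code. This is a sketch; the handling of scores landing exactly on a boundary is an assumption:

```python
def decide(risk_score):
    """Map a 0-1 risk score to an action per the decision matrix.
    Boundary handling at exactly 0.4, 0.6 and 0.8 is assumed here."""
    if risk_score > 0.8:
        return "REJECT"
    if risk_score > 0.6:
        return "REVIEW"
    if risk_score >= 0.4:
        return "CAUTION"
    return "ACCEPT"
```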
⊕ View on GitHub → github.com/Harshaaalll/medical-bill-ocr-fraud
Project 05 · BITS Hyderabad · Aug–Nov 2024
Multilingual ASR
in Low-Resource Languages
Automatic speech recognition system for Urdu — a low-resource language. Zero-shot Whisper transcription with a two-stage post-processing pipeline including IndicBERT-based MLM error correction.
Speech AI NLP Audio Processing
14%
WER reduction
-10dB
Noise reduction
16kHz
Resampled rate
0.4
MLM probability threshold
🎵
Raw Audio
Variable format input
🔊
pydub Pipeline
Mono, -20dBFS norm, silence trim, 16kHz
📉
Noise Reduction
librosa + noisereduce spectral subtraction
🎤
Whisper (small)
Zero-shot Urdu, language biasing
🔤
IndicBERT MLM
Corrects phonetic substitutions below 0.4 confidence
📝
Transcription
WER-tracked output
Audio preprocessing rationale
  • Mono conversion: Whisper expects mono audio. Stereo creates phase issues that confuse the model.
  • -20dBFS normalisation: Standardises loudness across recordings from different microphones and environments.
  • Silence trimming: Removes leading/trailing silence that contributes to WER without adding information.
  • 16kHz resampling: Whisper was trained on 16kHz audio. Mismatched sample rates degrade quality.
  • Spectral subtraction: Estimates noise floor during silence periods, subtracts from speech signal. -10dB reduction.
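The silence-trimming step can be sketched on raw sample values. This is a toy version: the project does this inside the pydub pipeline, and the amplitude threshold here is an assumption:

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose amplitude falls below
    the threshold. Silence adds no information to the transcript but
    inflates WER by padding the audio that Whisper must align."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```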
Why IndicBERT as corrector?
  • Whisper makes phonetically plausible errors — e.g. transcribing a similar-sounding word
  • MLM (Masked Language Model): IndicBERT is pre-trained to predict masked tokens from context
  • For each transcribed word, check: given the surrounding context, is this the most likely word?
  • If the model's confidence for the transcribed word is below 0.4, replace with the highest-probability alternative
  • IndicBERT specifically trained on Indian languages — understands Urdu context better than general models
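The correction loop can be sketched with the MLM behind a stub. This is illustrative: `score_word` stands in for IndicBERT's masked-token prediction, and the example words in the test are hypothetical:

```python
def mlm_correct(words, score_word, threshold=0.4):
    """Stage-two correction: for each transcribed word, ask a
    masked-LM scorer for (confidence of the word in its context,
    best alternative) and substitute the alternative whenever the
    confidence falls below the 0.4 threshold."""
    corrected = []
    for i, word in enumerate(words):
        confidence, best_alternative = score_word(words, i)
        corrected.append(word if confidence >= threshold else best_alternative)
    return corrected
```

In the full system, `score_word` masks position `i`, runs IndicBERT, and reads the softmax probability assigned to the transcribed word.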
Key Learning
The two-stage design — ASR first, then MLM correction — is more powerful than trying to build a perfect ASR model. Whisper handles the acoustic modelling, IndicBERT handles linguistic correction. This separation of concerns means each model does what it's best at: Whisper is excellent at converting audio to text, IndicBERT is excellent at understanding whether a word makes sense in context. Combining specialist models often beats one general model trying to do everything.
⊕ View on GitHub → github.com/Harshaaalll/multilingual-asr
Project 06 · Independent · Jun–Jul 2024
Market Analysis
Using LLMs
Automated financial news sentiment pipeline. Scraped, summarised, and analysed 500+ articles using BART and RoBERTa, producing real-time sentiment scores served via Flask API with Streamlit dashboard.
LLM Sentiment FinTech
500+
Articles processed
70%
Text length reduction
67%
Sentiment accuracy
4
BART beam size
55:46
Class split (pre-SMOTE)
🕸️
BeautifulSoup
500+ financial articles scraped
✂️
LangChain Chunking
Unstructured text workflow
📰
BART Two-Pass
facebook/bart-large-cnn, 4-beam, 70% reduction
💭
RoBERTa Sentiment
cardiffnlp Twitter sentiment, softmax weighted
⚖️
SMOTE Balancing
On training set only, 55:46 → balanced
🚀
Flask API
Real-time + Streamlit dashboard
Why two-pass BART summarisation?
  • Financial news articles often exceed BART's 1024-token context limit
  • Pass 1: Split article into chunks, summarise each independently
  • Pass 2: Summarise the combined chunk summaries into a final summary
  • 4-beam search produces better quality than greedy decoding: it keeps the 4 highest-scoring candidate sequences at each step instead of committing to a single token
  • Result: 70% length reduction while preserving key financial signals (earnings, guidance, risk factors)
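The two passes reduce to a small loop. This is a sketch; `summarise` stands in for the beam-search BART call:

```python
def two_pass_summarise(chunks, summarise):
    """Two-pass summarisation for articles exceeding BART's
    1024-token context. Pass 1 summarises each chunk independently;
    pass 2 summarises the concatenation of those partial summaries
    into the final summary."""
    partials = [summarise(chunk) for chunk in chunks]
    return summarise(" ".join(partials))
```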
Why RoBERTa over BERT for sentiment?
  • cardiffnlp/twitter-roberta-base-sentiment is trained on Twitter data; its short, news-reactive register is closer to financial headlines than generic review-trained sentiment models
  • Compound score: Instead of argmax (pick one class), compute softmax weighted sum across positive/neutral/negative probabilities
  • This gives a continuous sentiment score rather than a discrete class — more useful for tracking sentiment trends over time
  • SMOTE only on training data — a common mistake is applying SMOTE before splitting, which leaks synthetic samples into validation. Applied correctly here.
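The compound score is a short function over the model's (negative, neutral, positive) logits. This is a sketch, and that label order is an assumption about the classifier's output:

```python
import math

def compound_score(logits):
    """Softmax-weighted compound sentiment in [-1, 1]. Instead of
    argmax over (negative, neutral, positive), weight the class
    probabilities by (-1, 0, +1), yielding a continuous score that
    is far more useful for tracking sentiment trends over time."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return (exps[2] - exps[0]) / total  # p_pos - p_neg; neutral weighs 0
```

Equal logits give a score of 0; a strongly positive article approaches +1 and a strongly negative one approaches -1.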
Key Learning
The most important data science lesson from this project: SMOTE must only be applied to training data, never before the train-test split. Applying SMOTE on the full dataset before splitting creates data leakage — synthetic samples generated from real training examples end up in the test set, inflating accuracy metrics. The correct workflow: split first, then apply SMOTE only on the training portion.
⊕ View on GitHub → github.com/Harshaaalll/market-analysis-llm