Project 01 · Standard Chartered GBS · Jul–Dec 2025
LMA Clause
Identification Tool
A production NLP pipeline on Dataiku DSS that reads 100+ page Loan Market Association agreements and automatically flags every legally significant clause — sanctions, indemnities, governing law — that compliance auditors must review. Ships with confidence calibration, semantic search, and an auditor-friendly threshold tuning UI.
NLP · Fine-Tuning · Production
The Problem
Auditors at a global bank were manually reading 100+ page loan agreements to find ~30 specific compliance-critical clauses. One missed sanctions clause = regulatory failure. The work was slow, error-prone, and didn't scale.
The Approach
Train a small, fast language model (DistilBERT) to recognise each clause type, then add a second-stage semantic search layer that confirms the match against gold-standard reference clauses. Filter low-confidence predictions before showing to humans.
The Impact
90% reduction in audit time. 98.6% recall on rare clauses (the ones that matter most). 95%+ of false positives filtered before reaching auditors. Legal team can review 10× more agreements in the same window.
90%
Audit time reduced
98.6%
Recall on rare clauses
95%+
False positives filtered
0.93
Weighted F1 score
250
Token window size
System Architecture — How the pipeline thinks
End-to-end clause identification pipeline
A 100+ page PDF enters on the left. By the time it exits on the right, every legally significant clause has been classified, scored for confidence, and matched against a gold-standard reference — ready for auditor review.
INPUT → PREP → ML CORE → VERIFY → OUTPUT
  • 📄 LMA Agreement: 100+ page PDF
  • ✂️ ETL Chunking: 250-token window · 50-token overlap
  • 🧠 DistilBERT Classifier: fine-tuned on labelled clauses. WeightedTrainer with a 20× class boost handles the 90%+ of text that is irrelevant.
  • 🎯 Confidence Filter: minimum-correct-confidence threshold drops uncertain predictions, filtering 95%+ of false positives.
  • 🔍 SBERT Semantic Search: fine-tuned with MultipleNegativesRankingLoss; cosine similarity against the gold-standard library catches synonym phrasings.
  • ✅ Auditor Dashboard: clause · type · confidence · page reference, threshold-tunable per auditor.
Threshold calibration feedback loop: auditors recalibrate per loan type.
Step-by-step workflow
01
Ingest the agreement
A 100+ page LMA PDF lands in the Dataiku dataset. Text is extracted page-by-page, preserving section markers so we can later cite exact clause locations.
Dataiku DSS · PyMuPDF
02
Slice into context windows
The full text is sliced into 250-token windows with 50-token overlap. The overlap ensures clauses that span window boundaries aren't cut in half. An earlier 75-token version produced 1.0 F1 — a red flag for data leakage.
HuggingFace tokenizer · sliding window
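A minimal sketch of the windowing logic, using the public distilbert-base-uncased tokenizer as a stand-in for the project's internal fine-tuned checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def chunk(text: str, window: int = 250, overlap: int = 50) -> list[str]:
    """Slide a 250-token window with 50-token overlap, so a clause that
    straddles one boundary appears whole in the neighbouring chunk."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = window - overlap  # 200 fresh tokens per step
    return [tokenizer.decode(ids[i:i + window])
            for i in range(0, max(len(ids) - overlap, 1), stride)]
```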
03
Classify with DistilBERT
A fine-tuned DistilBERT classifies each window into one of N clause types (or "irrelevant"). DistilBERT was chosen for being 40% smaller than BERT while retaining about 97% of its language-understanding performance — critical for batch-processing 100+ page documents at speed.
DistilBERT · WeightedTrainer (20× boost)
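The WeightedTrainer is a small Trainer subclass; a sketch, assuming class 0 is "irrelevant" (the exact weight vector was tuned internally):

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Cross-entropy with per-class weights: rare clause classes get a
    20x boost so the model can't win by always predicting 'irrelevant'."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([1.0, 20.0, 20.0, ...])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```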
04
Calibrate confidence
For each prediction, we check the model's softmax confidence. The "minimum correct confidence" threshold is calibrated per clause type using validation data. Anything below the threshold is dropped before reaching humans.
Custom thresholding · KDE rejected
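A sketch of the minimum-correct-confidence idea: per clause type, the threshold is the lowest softmax confidence the model assigned to a correct validation prediction. Variable names here are illustrative:

```python
import numpy as np

def calibrate(val_probs, val_labels):
    """Per clause type, take the lowest confidence the model assigned to a
    CORRECT validation prediction; anything below it is dropped."""
    preds = val_probs.argmax(axis=1)
    conf = val_probs.max(axis=1)
    thresholds = {}
    for c in np.unique(preds):
        correct = (preds == c) & (val_labels == c)
        thresholds[int(c)] = float(conf[correct].min()) if correct.any() else 1.0
    return thresholds

def keep(probs, thresholds):
    """True if a single prediction clears its per-class threshold."""
    pred, conf = int(probs.argmax()), float(probs.max())
    return conf >= thresholds.get(pred, 1.0)
```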
05
Verify with SBERT
Surviving predictions are embedded with a fine-tuned SBERT and cosine-matched against a curated library of gold-standard reference clauses. This catches synonym rephrasings the classifier alone might rank low.
SBERT · MultipleNegativesRankingLoss
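A compressed sketch of both halves — fine-tuning with MultipleNegativesRankingLoss (other pairs in the batch act as in-batch negatives) and cosine-matching against the gold library. The base checkpoint and example pair are stand-ins:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned SBERT

# Fine-tune on anchor-positive pairs: a detected clause and its gold reference.
pairs = [("The Borrower shall comply with all Sanctions...", "gold sanctions clause text")]
train = DataLoader([InputExample(texts=[a, p]) for a, p in pairs],
                   shuffle=True, batch_size=16)
model.fit(train_objectives=[(train, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=10)

# Verify a surviving prediction against the gold-standard library.
gold = ["gold sanctions clause text", "gold governing-law clause text"]
gold_emb = model.encode(gold, convert_to_tensor=True)
sims = util.cos_sim(model.encode("candidate clause text", convert_to_tensor=True), gold_emb)
best_score, best_match = float(sims.max()), gold[int(sims.argmax())]
```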
06
Surface to auditors
Output is rendered in a Dataiku web app: each detected clause shown with its type, confidence, page reference, and the matching gold-standard text. A separate UI lets non-technical auditors recalibrate thresholds without touching code.
Dataiku Web App · threshold UI
Technical decisions
  • Why DistilBERT: Encoder-only, optimised for classification. 40% smaller than BERT, 97% of its accuracy — letting us batch-process documents fast.
  • Why 250-token windows: 75-token windows produced a suspicious 1.0 F1, indicating data leakage. 250 tokens preserved context without bleeding.
  • WeightedTrainer: 90%+ of document text is irrelevant. A 20× class weight boost prevents the model from defaulting to "irrelevant".
  • Minimum-correct-confidence: KDE thresholding rejected (no clean incorrect-probability curve). Empirical, data-driven cut-off used instead.
  • SBERT fine-tuning: MultipleNegativesRankingLoss on legal anchor-positive pairs — model learns synonym clauses semantically, not lexically.
Key results
  • Train accuracy: 0.96 · Val accuracy: 0.93 · Weighted F1: 0.93
  • Recall on rare clauses: 98.6% — critical because a missed sanctions clause is a compliance failure
  • False positive reduction: 95%+ filtered — auditors only review high-confidence, relevant predictions
  • Deployment: Dataiku DSS pipeline + threshold-calibration web app for non-technical auditors
  • Business impact: Legal team reviews 10× more agreements in the same time window
Key Learning
The single most important lesson: a perfect F1 score is a red flag, not a success. When 75-token windows produced 1.0 F1, I suspected data leakage — clause text was appearing in both training and validation chunks. Reducing to 250-token windows with careful document-level splits fixed the leakage and produced honest, deployable metrics. Trusting suspicious results would have shipped a broken model into a regulated environment.
⊕ View on GitHub → github.com/Harshaaalll/lma-clause-identifier
Project 02 · Standard Chartered GBS · Aug–Nov 2025
AutoTransFlow:
Multilingual Document AI
A layout-preserving multilingual PDF translation system. Translates financial and legal documents into 200 languages while keeping the exact spatial structure — headings, tables, columns, signatures — intact. Runs entirely on local models with zero external API calls.
Computer Vision · Translation · Document AI
The Problem
Legal documents must be readable across 200+ jurisdictions. Google Translate destroys layout (turns tables into paragraphs) — and bank policy forbids sending sensitive PDFs to external APIs. We needed a translator that runs internally and preserves every visual structure.
The Approach
Treat each PDF page as an image first, text second. A YOLO-based layout model finds every text block's bounding box; then NLLB-200 translates the text inside each box; finally each translation is re-rendered at its original coordinates.
The Impact
200 languages supported · 0 external API calls · 100% layout fidelity. The translated PDF looks identical to the original — same tables, same columns, same signatures — just in the target language.
200
Languages supported
0
External API calls
100%
Layout preserved
0.25
Detection threshold
System Architecture — Vision-first translation
Layout-aware PDF translation pipeline
The trick: solve a translation problem with a computer vision model first. By detecting the geometry of every text block before translating, we know where each translated word must go on the page.
  • 📑 Source PDF: financial/legal document, English origin ("what it says")
  • 🖼️ Render Page: PDF → image, normalised to [0,1] ("what it looks like")
  • 📐 doc-layout-yolo: detects every text block and outputs (x, y, w, h, type) at confidence > 0.25; classes include heading, paragraph, table ("where text lives")
  • 🌍 NLLB-200: 600M params, run locally; forced_bos_token_id selects any of 200 target languages; translates each block ("what it means")
  • 🎯 Re-render: BabelDoc font embedding, translated text drawn at the original coordinates ("reassemble the doc")
The vision-first insight: traditional PDF tools extract text linearly, losing structure. We treat each page as an image, solve the geometry first with computer vision, then translate inside known boxes.
📋 Final PDF: the original document with translated text in an identical layout, appended after each source page.
Step-by-step workflow
01
Duplicate & embed fonts
BabelDoc creates a copy of the source PDF and embeds fonts compatible with the target language (Devanagari, CJK, Arabic, etc.) so translated glyphs render correctly.
BabelDoc · font embedding
02
Render to pixmap
Each page is rasterised to an image and normalised to [0,1] pixel intensity — the input format expected by the layout model.
PyMuPDF · NumPy
03
Detect layout with YOLO
doc-layout-yolo predicts a bounding box and class (heading, paragraph, table, caption) for every text block. Confidence threshold of 0.25 keeps recall high — false detections are cheaper to filter than missing text.
doc-layout-yolo · 0.25 threshold
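A minimal sketch, assuming the ultralytics-style predict interface of the doclayout_yolo package; the checkpoint filename and page image are illustrative:

```python
from doclayout_yolo import YOLOv10

model = YOLOv10("doclayout_yolo_docstructbench.pt")
result = model.predict("page_001.png", conf=0.25)[0]  # low threshold keeps recall high

blocks = [
    {"xyxy": box.xyxy[0].tolist(),        # (x1, y1, x2, y2) on the page image
     "cls": result.names[int(box.cls)],   # heading / paragraph / table / caption
     "conf": float(box.conf)}
    for box in result.boxes
]
```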
04
Translate per block
For each box, the text is extracted and fed to NLLB-200 with forced_bos_token_id set to the target language code (e.g. hin_Deva). The model runs entirely on internal hardware.
NLLB-200 · 600M · local inference
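The NLLB call itself is standard HuggingFace; a runnable sketch with the public facebook/nllb-200-distilled-600M checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def translate(text: str, target: str = "hin_Deva") -> str:
    inputs = tok(text, return_tensors="pt", truncation=True)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids(target),  # steer decoder language
        max_length=512,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```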
05
Re-render at original coords
Translated text is drawn back into each bounding box at its original (x, y, w, h) — preserving the visual structure exactly. Auto font-size adjustment handles language length variance.
PyMuPDF · text-fit logic
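A sketch of the text-fit loop using PyMuPDF's insert_textbox, which returns a negative value when text overflows its rectangle; the real pipeline would pass the language-appropriate font BabelDoc embedded rather than "helv":

```python
import fitz  # PyMuPDF

def draw_in_box(page, rect, text, fontname="helv", max_size=11.0, min_size=5.0):
    """Shrink the font until the translated text fits the original box,
    absorbing length variance between languages."""
    size = max_size
    while size >= min_size:
        leftover = page.insert_textbox(fitz.Rect(rect), text,
                                       fontsize=size, fontname=fontname)
        if leftover >= 0:  # non-negative return: the text fitted
            return size
        size -= 0.5
    return None  # flag the block for manual handling
```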
06
Stitch & deliver
The original page and its translated counterpart are appended in the final PDF, giving auditors a side-by-side reference. A modular backend lets the translation engine be swapped (e.g. Argos Translate).
Final PDF assembly
Why not pdfplumber or PyMuPDF text extraction?
  • Traditional PDF libraries extract text as a linear stream — no understanding of visual structure
  • Cannot differentiate headings from paragraphs, tables from captions
  • Spatial relationships between text blocks are completely lost
  • Output would be a long unformatted text block — legally unusable for financial documents
  • Solution: Treat the page as an image, use computer vision to understand layout first
Why NLLB-200 over Google Translate?
  • Data privacy: Standard Chartered policy prohibits sending documents to external cloud APIs
  • NLLB-200 runs entirely on internal servers — zero data leaves the network
  • forced_bos_token_id: Steers decoder to target language (e.g. eng_Latn, hin_Deva)
  • Covers 200 languages including low-resource languages that Google handles poorly
  • Modular design — translation backend can be swapped (Argos Translate as alternative)
Key Insight
The critical innovation is treating document translation as a computer vision problem first, not a text problem. By using doc-layout-yolo to map the visual structure before extracting text, the system knows exactly where every word lives on the page — not just what it says. This makes it possible to put translated text back in exactly the right position, preserving the legal and structural integrity of the document.
⊕ View on GitHub → github.com/Harshaaalll/autotransflow
Project 03 · DG Liger Consulting · Jun–Jul 2024
RAG Chatbot with
Conversational Memory
Production retrieval-augmented generation chatbot built over proprietary PDF documents. Multi-turn conversation support via LangChain's ConversationalRetrievalChain, with local StableLM Zephyr 3B inference — no external API calls, no data leakage.
RAG · LangChain · LLM
The Problem
A consulting firm had hundreds of proprietary client PDFs. Analysts wasted hours searching through them for specific facts. They couldn't use ChatGPT — client confidentiality forbids cloud APIs.
The Approach
Build a chatbot that retrieves before it generates. Embed all PDFs into a local vector database, retrieve the most relevant chunks for each question, and let a small local LLM produce a grounded answer — with full conversation memory.
The Impact
Analysts query proprietary docs in seconds. Zero data leaves the network. Multi-turn dialogue means follow-up questions like "and what about the fees?" work naturally. Runs on standard hardware via 4-bit quantisation.
500
Chunk size (chars)
3B
StableLM params
Q4_K_M
Quantisation level
384
Embedding dimensions
System Architecture — Two-phase RAG pipeline
Indexing phase (one-time) + Query phase (per question)
A RAG system has two distinct workflows. The top row runs once when documents are ingested. The bottom row runs every time a user asks a question.
PHASE 1 · INDEXING (RUN ONCE PER DOCUMENT)
  • 📁 PDF Documents: proprietary client docs + web pages
  • ✂️ Recursive Splitter: 500 chars · 50 overlap, sentence-boundary aware
  • 🔢 Embeddings: all-MiniLM-L6-v2, 384-dim vectors
  • 🗄️ FAISS Index: in-memory ANN search, persisted to disk, queried at retrieval time
PHASE 2 · QUERY (RUN PER USER QUESTION)
  • 💬 User Question: "What about the fees?" (often ambiguous on its own)
  • 🔄 Question Rewriter: ConversationBufferMemory + chat history → standalone query
  • 🔍 FAISS Retrieve: top-k similar chunks by cosine similarity
  • 🤖 StableLM Zephyr 3B: GGUF Q4_K_M, local, grounded generation
  • 💡 Grounded Answer: source citations included, stored back to memory
Conversation memory loop: every Q&A pair informs the rewriting of the next question.
Step-by-step workflow
01
Load & chunk documents
LangChain document loaders pull in PDFs and web pages. RecursiveCharacterTextSplitter slices them into 500-character chunks with 50-character overlap, respecting sentence boundaries.
LangChain · RecursiveCharacterTextSplitter
02
Embed every chunk
all-MiniLM-L6-v2 turns each chunk into a 384-dim vector. Small, fast, and surprisingly accurate — chosen so embedding hundreds of documents takes minutes not hours on CPU.
sentence-transformers · 384-dim
03
Build FAISS index
Vectors are stored in an in-memory FAISS index for sub-millisecond similarity search. The index is persisted to disk so the system can boot instantly on subsequent runs.
FAISS · IndexFlatL2
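The whole indexing phase fits in a few lines; a sketch with illustrative paths (LangChain import locations vary by release):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFLoader("docs/client_report.pdf").load()            # path illustrative
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50).split_documents(docs)

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")       # 384-dim vectors
index = FAISS.from_documents(chunks, embeddings)
index.save_local("faiss_index")  # persist so later boots skip re-embedding
```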
04
Rewrite follow-ups
When a user asks "and what about the fees?", that question alone has no context. ConversationalRetrievalChain combines the question with chat history and rewrites it into a standalone query the retriever can act on.
ConversationalRetrievalChain · BufferMemory
05
Retrieve top-k chunks
The rewritten question is embedded with the same model, and FAISS returns the top-k most similar chunks from the index. These become the LLM's grounding context.
FAISS similarity_search · top-k
06
Generate grounded answer
StableLM Zephyr 3B (4-bit quantised, GGUF) receives the question + retrieved chunks and produces an answer grounded in those chunks. Source chunks are surfaced as citations.
StableLM Zephyr 3B · llama.cpp
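And the query phase, as a sketch that reuses the index persisted above; the GGUF filename is illustrative:

```python
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.load_local("faiss_index", embeddings,
                         allow_dangerous_deserialization=True)

llm = LlamaCpp(model_path="models/stablelm-zephyr-3b.Q4_K_M.gguf",  # filename illustrative
               n_ctx=4096, temperature=0.1)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=index.as_retriever(search_kwargs={"k": 4}),
    memory=ConversationBufferMemory(memory_key="chat_history",
                                    return_messages=True),
)

# Follow-ups are rewritten into standalone queries before retrieval.
print(chain.invoke({"question": "And what about the fees?"})["answer"])
```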
Why ConversationalRetrievalChain matters
  • Standard RetrievalQA treats each question in isolation
  • Follow-up questions like "What about the fees?" have no context without history
  • ConversationalRetrievalChain rephrases follow-up questions into standalone queries before searching
  • ConversationBufferMemory stores the full conversation history
  • Result: the chatbot maintains coherent multi-turn dialogue across a session
Why StableLM Zephyr 3B?
  • Local inference: No API costs, no latency from external calls, no data privacy risk
  • Q4_K_M quantisation: 4-bit weights reduce model from ~6GB to ~2GB — fits on standard hardware
  • GGUF format: Optimised for CPU inference via llama.cpp
  • Instruction-tuned: Zephyr variant fine-tuned to follow instructions and stay grounded
  • Trade-off acknowledged: smaller than GPT-4, but sufficient for grounded Q&A over retrieved context
Key Insight
The most important design decision was local inference over API calls. By running StableLM Zephyr locally in GGUF format, the system has no ongoing API costs, no latency from network calls, and no risk of sending proprietary client documents to external servers. For a consulting firm handling sensitive client PDFs, this isn't optional — it's the only architecturally correct choice.
⊕ View on GitHub → github.com/Harshaaalll/rag-chatbot
Project 04 · Datathon · November 2025
Medical Bill OCR &
Fraud Detection
An end-to-end pipeline for extracting structured data from scanned hospital bills and automatically detecting fraudulent claims using four independent anomaly-detection algorithms running in parallel.
OCR · Fraud Detection · Computer Vision
The Problem
Insurance companies process millions of scanned hospital bills. Manual review is slow; fraudsters exploit this with inflated totals, duplicated submissions, and impossible service combinations. We needed an automated triage system.
The Approach
Two-stage pipeline. Stage 1: aggressive image cleanup → OCR → LLM extraction turns noisy scans into clean JSON. Stage 2: four independent fraud checks vote on a risk score that routes the claim to ACCEPT / REVIEW / REJECT.
The Impact
90%+ extraction accuracy on degraded scans. Four fraud signals catch issues a single check would miss. Risk-scored output means humans only spend time on the actually suspicious 10% — not all 100%.
90%+
Extraction accuracy
4
Fraud checks
0.8
Rejection threshold
30
Day SHA-256 window
300
DPI target resolution
System Architecture — Extraction + parallel fraud detection
Two-stage pipeline: image → structured data → risk score
The four fraud checks run in parallel, not sequence — each catches a different attack pattern. Their outputs are weighted into a single risk score that drives an automated routing decision.
STAGE 1 · DATA EXTRACTION
  • 🖼️ Scanned Bill: PNG/PDF, often blurry and skewed
  • 🔧 OpenCV Cleanup: deskew · CLAHE · bilateral filter · 4× DPI upscale → 300 DPI
  • 👁️ PaddleOCR: angle classifier first, then text detection + recognition
  • 🤖 Qwen 7B (Ollama): raw text → structured JSON (patient · items · totals)
  • 📋 Structured Bill: JSON with patient, items[], amounts, total, date
STAGE 2 · PARALLEL FRAUD CHECKS (4 INDEPENDENT SIGNALS)
  • ① IQR Anomaly: flags charges > 95th percentile × 1.5, using bill-level context rather than global averages. Catches inflated single line items.
  • ② Reconciliation: Σ(line items) vs declared total, with a >1% mismatch flagged. Catches inflated grand totals.
  • ③ SHA-256 Dedup: hashes bill content against a 30-day cache and rejects exact duplicates. Catches the same bill sent to multiple insurers.
  • ④ Pattern Analysis: same service billed twice, impossible combinations, spiked quantities. Catches structural fraud patterns.
📊 Weighted Risk Score (0–1): each check contributes a sub-score; the weighted sum drives routing: ✓ ACCEPT (risk < 0.4) · ⚠ MANUAL REVIEW (0.4 ≤ risk < 0.8) · ✗ REJECT (risk ≥ 0.8)
Step-by-step workflow
01
Aggressive image cleanup
Real bills are blurry, skewed, and shadowed. We deskew via Hough transform, denoise with bilateral filter (preserves letter edges), apply CLAHE for local contrast, and upscale 4× to hit 300 DPI — the resolution OCR engines expect.
OpenCV · Hough · CLAHE · bilateral
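A sketch of the cleanup chain in OpenCV; parameter values are typical defaults rather than the tuned production ones, and the deskew sign convention may need flipping for a given scanner:

```python
import cv2
import numpy as np

def clean_scan(path, upscale=4):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Deskew: Hough lines give the dominant text angle; rotate to cancel it.
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=img.shape[1] // 3, maxLineGap=20)
    angles = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            a = np.degrees(np.arctan2(y2 - y1, x2 - x1))
            if abs(a) < 45:              # ignore vertical rules and borders
                angles.append(a)
    if angles:
        h, w = img.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), float(np.median(angles)), 1.0)
        img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    img = cv2.bilateralFilter(img, 9, 75, 75)                             # denoise, keep letter edges
    img = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)  # local contrast
    img = cv2.resize(img, None, fx=upscale, fy=upscale,
                     interpolation=cv2.INTER_CUBIC)                       # toward 300 DPI
    return img
```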
02
OCR with PaddleOCR
PaddleOCR runs an angle classifier first (handles rotated docs), then text detection and recognition. Outputs raw text strings with bounding boxes — no structure yet, just words.
PaddleOCR · DB detector · CRNN
03
LLM-based extraction
Qwen 7B (via Ollama, local) receives raw OCR text and returns a strict JSON schema: patient name, line items, amounts, total, date. The LLM handles formatting variance across different hospital templates.
Qwen 7B · Ollama · JSON schema
04
Run all 4 fraud checks in parallel
IQR (per-bill outliers), reconciliation (sum vs declared total), SHA-256 (duplicate submission across 30-day window), and pattern analysis (impossible combos) each run independently and produce a sub-score.
NumPy · hashlib · rule engine
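Three of the four checks reduce to a few lines each. The sub-score scaling shown here is illustrative, since the production scores were calibrated on historical fraud cases:

```python
import hashlib
import numpy as np

def iqr_anomaly_score(amounts):
    """Charges above this bill's own 95th percentile x 1.5 are outliers."""
    cutoff = np.percentile(amounts, 95) * 1.5
    return float(np.mean([a > cutoff for a in amounts]))

def reconciliation_score(amounts, declared_total):
    """A >1% gap between the item sum and the declared total is flagged."""
    gap = abs(sum(amounts) - declared_total) / max(declared_total, 1e-9)
    return 1.0 if gap > 0.01 else 0.0

def duplicate_score(bill_text, seen_hashes):
    """Exact-duplicate check against a 30-day SHA-256 cache."""
    digest = hashlib.sha256(bill_text.encode("utf-8")).hexdigest()
    return 1.0 if digest in seen_hashes else 0.0
```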
05
Aggregate to risk score
The four sub-scores are combined into a single weighted risk score in [0, 1]. Each check has a tunable weight, calibrated against historical fraud cases.
Weighted ensemble
06
Route the claim
< 0.4 → auto-accept, 0.4–0.8 → human review, ≥ 0.8 → auto-reject. Humans only see the 10–15% of bills the system is uncertain about — saving the bulk of review time.
Threshold routing
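Aggregation and routing in one place, with assumed weights (the real weights were calibrated against historical cases):

```python
WEIGHTS = {"iqr": 0.25, "recon": 0.25, "dup": 0.30, "pattern": 0.20}  # assumed

def route(sub_scores: dict) -> tuple[str, float]:
    risk = sum(WEIGHTS[k] * sub_scores.get(k, 0.0) for k in WEIGHTS)
    if risk >= 0.8:
        return "REJECT", risk
    if risk >= 0.4:
        return "MANUAL REVIEW", risk
    return "ACCEPT", risk

route({"iqr": 0.6, "recon": 1.0, "dup": 0.0, "pattern": 0.5})  # ('MANUAL REVIEW', 0.5)
```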
The 4 fraud detection checks
  • IQR Amount Anomaly: Flags charges exceeding 95th percentile × 1.5 on that bill. Adapts to bill-level context, not global averages.
  • Reconciliation Check: Sums all line items and compares to declared total. Any >1% discrepancy is flagged — inflated totals are a classic fraud signal.
  • SHA-256 Duplicate Detection: Hashes bill content, checks against 30-day cache. Prevents the same bill from being resubmitted to multiple insurers.
  • Pattern Analysis: Detects same service billed twice, unusually high quantities, and suspicious service combinations in one visit.
Image preprocessing pipeline
  • Deskewing: Hough Line Transform detects text angle and rotates image to straighten it
  • Bilateral filter: Removes noise while preserving letter edge sharpness (unlike regular blur)
  • CLAHE: Local contrast enhancement — brightens dark corners without overexposing the rest
  • Adaptive thresholding: Region-specific black/white conversion — handles shadows correctly
  • 4× DPI scaling: Upscales toward 300 DPI for OCR accuracy
Key Learning
No single fraud signal is reliable on its own — but four independent signals combined are nearly impossible to fool. A clever fraudster might bypass IQR by spreading inflated amounts across many small line items, but that breaks reconciliation. They might avoid reconciliation issues but get caught by SHA-256 duplication. Defence-in-depth, applied to model design.
⊕ View on GitHub → github.com/Harshaaalll/medical-bill-ocr-fraud
Project 05 · BITS Hyderabad · Aug–Nov 2024
Multilingual ASR
in Low-Resource Languages
Automatic speech recognition for Urdu — a low-resource language. Zero-shot Whisper transcription with a two-stage post-processing pipeline including IndicBERT-based MLM error correction.
Speech AI · NLP · Audio Processing
The Problem
Speech recognition for English is excellent. For low-resource languages like Urdu, it's poor — and there's not enough labelled data to train a new model from scratch. We needed a working Urdu transcriber without millions of labelled hours.
The Approach
Use Whisper zero-shot for the acoustic part, then fix its mistakes using a second model that understands the Urdu language. IndicBERT acts as a linguistic spell-checker over the transcript, replacing phonetically similar words that don't fit the context.
The Impact
14% reduction in Word Error Rate over baseline Whisper alone. Pipeline works on noisy real-world recordings thanks to the audio cleanup stage. Generalisable architecture — swap IndicBERT for any MLM and it works for other languages.
14%
WER reduction
-10dB
Noise reduction
16kHz
Resampled rate
0.4
MLM threshold
System Architecture — Acoustic + linguistic two-pass
Specialist-models architecture: Whisper hears, IndicBERT reads
Instead of one giant model trying to do everything, we let two specialists each do what they're best at — Whisper handles audio, IndicBERT handles language. Their combination beats either alone.
PHASE 1 · AUDIO CONDITIONING
  • 🎵 Raw Audio: variable format and quality
  • 🔊 pydub Pipeline: mono · -20dBFS normalisation · silence trim · 16kHz
  • 📉 Denoise: spectral subtraction, ~ -10dB noise floor
PHASE 2 · ACOUSTIC MODEL (HEAR)
  • 🎤 Whisper (small): zero-shot Urdu with a language-biasing token
  • 📝 Raw Transcript: contains phonetic substitutions, i.e. similar-sounding wrong words
PHASE 3 · LINGUISTIC CORRECTION (READ)
  • 🔤 IndicBERT MLM: pre-trained on Indian languages. For each token, mask the word, show the context to IndicBERT, and score P(word | context): "is this word likely here?"
  • Threshold 0.4: if P < 0.4, replace the token with IndicBERT's suggestion. Iterate over every token in the transcript.
  • ✓ Final Transcription: WER tracked vs reference, ↓ 14% over Whisper alone
Step-by-step workflow
01
Standardise the audio
Convert to mono (Whisper expects mono), normalise to -20dBFS so loudness is consistent across recordings, trim leading/trailing silence, and resample to 16kHz — Whisper's training rate.
pydub · ffmpeg backend
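The conditioning chain in pydub; a runnable sketch (input filename illustrative):

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def standardise(path):
    audio = AudioSegment.from_file(path)
    audio = audio.set_channels(1)                 # Whisper expects mono
    audio = audio.apply_gain(-20.0 - audio.dBFS)  # normalise to -20 dBFS
    start = detect_leading_silence(audio)
    end = detect_leading_silence(audio.reverse())
    audio = audio[start:len(audio) - end]         # trim edge silence
    return audio.set_frame_rate(16000)            # Whisper's training rate

standardise("recording.mp3").export("cleaned_16k.wav", format="wav")
```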
02
Subtract background noise
noisereduce estimates the noise floor from silent regions and subtracts it spectrally from the speech signal — about -10dB reduction without distorting voice.
librosa · noisereduce
03
Transcribe with Whisper
Whisper (small) is run zero-shot with a language token forcing Urdu output. The result is a transcript with the right number of words but plausible-sounding errors — phonetically close, semantically wrong.
openai/whisper-small
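The zero-shot call via the openai-whisper package (filename illustrative):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("cleaned_16k.wav", language="ur")  # bias decoding to Urdu
print(result["text"])
```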
04
Mask each token
For every word in the transcript, we hide it (replace with [MASK]) and ask IndicBERT to score how likely the original word is given the surrounding context.
HuggingFace fill-mask
05
Threshold & replace
If IndicBERT's probability for the original word is below 0.4, that word probably doesn't fit — we replace it with the highest-probability alternative IndicBERT suggests.
argmax over vocab
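A sketch of the mask-score-replace loop, using the public ai4bharat/indic-bert checkpoint as a stand-in. The fill-mask pipeline scores the original word via its targets argument; words that split into multiple subwords are approximated by their first piece:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="ai4bharat/indic-bert")

def correct(tokens, threshold=0.4):
    fixed = []
    for i, word in enumerate(tokens):
        masked = " ".join(tokens[:i] + [fill.tokenizer.mask_token] + tokens[i + 1:])
        score = fill(masked, targets=[word])[0]["score"]  # P(word | context)
        if score < threshold:
            word = fill(masked)[0]["token_str"]           # best-scoring replacement
        fixed.append(word)
    return fixed
```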
06
Measure WER
Word Error Rate is computed against gold-standard transcripts. The two-stage pipeline produces ~14% lower WER than baseline Whisper alone — significant for a low-resource language.
jiwer · WER metric
Audio preprocessing rationale
  • Mono conversion: Whisper expects mono. Stereo creates phase issues that confuse the model.
  • -20dBFS normalisation: Standardises loudness across mics and environments.
  • Silence trimming: Removes leading/trailing silence that adds nothing but inflates WER.
  • 16kHz resampling: Whisper was trained on 16kHz. Mismatched sample rates degrade quality.
  • Spectral subtraction: Estimates noise floor in silence, subtracts from speech. ~ -10dB reduction.
Why IndicBERT as the corrector?
  • Whisper makes phonetically plausible errors — e.g. transcribing a similar-sounding word
  • MLM (Masked Language Model): IndicBERT is pre-trained to predict masked tokens from context
  • For each transcribed word, check: given the surrounding context, is this the most likely word?
  • If confidence for the transcribed word is below 0.4, replace with the highest-probability alternative
  • IndicBERT is specifically trained on Indian languages — understands Urdu context better than generic models
Key Learning
The two-stage design — ASR first, then MLM correction — is more powerful than trying to build a perfect ASR model. Whisper handles the acoustic modelling, IndicBERT handles linguistic correction. This separation of concerns lets each model do what it's best at: Whisper is excellent at converting audio to text, IndicBERT is excellent at deciding whether a word makes sense in context. Combining specialist models often beats one general model trying to do everything.
⊕ View on GitHub → github.com/Harshaaalll/multilingual-asr
Project 06 · Independent · Jun–Jul 2024
Market Analysis
Using LLMs
An automated financial news sentiment pipeline. Scrapes, summarises, and analyses 500+ articles using BART and RoBERTa, producing real-time sentiment scores served via a Flask API with a Streamlit dashboard.
LLM · Sentiment · FinTech
The Problem
Markets move on news, but no human can read 500+ articles a day. Existing sentiment APIs are either too generic (consumer Twitter sentiment) or too expensive at scale. We needed a financial-domain pipeline running locally.
The Approach
Three-stage pipeline. Scrape with BeautifulSoup. Summarise with two-pass BART (handles articles longer than the model's 1024-token limit). Score with a Twitter-trained RoBERTa sentiment model — using a continuous compound score, not a discrete class.
The Impact
500+ articles/day processed. 70% length reduction via summarisation while keeping financial signals. Real-time API + Streamlit dashboard surfaces sentiment trends live. SMOTE applied correctly — only on training data, no leakage.
500+
Articles processed
70%
Text length reduction
67%
Sentiment accuracy
4
BART beam size
55:46
Class split (pre-SMOTE)
System Architecture — Scrape → summarise → score → serve
Two-pass summarisation pipeline with continuous sentiment scoring
The interesting part is the two-pass BART in the middle. Many financial articles exceed BART's 1024-token context window, so we summarise chunks first, then summarise the chunk-summaries.
  • 🕸️ BeautifulSoup: 500+ articles scraped from financial sources
  • ✂️ LangChain Split: article → chunks of ≤ 1024 tokens each
  • 📰 Two-Pass BART: facebook/bart-large-cnn with 4-beam search. Pass 1 summarises each chunk → N chunk-summaries. Pass 2 summarises the summaries → 70% length reduction.
  • 📑 Final Summary: earnings · guidance · risk, ~30% of the original
  • 💭 RoBERTa Sentiment: cardiffnlp/twitter-roberta-base, softmax → compound score (continuous, not discrete)
  • ⚖️ SMOTE Balancing: pre-SMOTE class split 55:46, applied AFTER the train/test split so no synthetic data leaks
  • 🚀 Serving Layer: Flask REST API (/sentiment) + Streamlit live dashboard for real-time market signals
⚠ CRITICAL: SMOTE is applied to the training set ONLY. Applying it before the split would leak synthetic samples into the test set.
Output sample: {"ticker": "AAPL", "compound": +0.62, "headline": "...", "trend_24h": "↑"}
Step-by-step workflow
01
Scrape financial news
BeautifulSoup pulls 500+ articles from financial sources daily — headlines, body text, publish date, ticker mentions.
BeautifulSoup · requests
02
Chunk long articles
Articles exceeding BART's 1024-token limit are split into manageable chunks with LangChain's token-aware text splitter.
LangChain · token-aware splitter
03
Two-pass BART summarisation
Pass 1 summarises each chunk independently. Pass 2 summarises the combined chunk-summaries into a final article summary. 4-beam search produces higher quality than greedy decoding by keeping the 4 highest-scoring candidate sequences at each decoding step.
facebook/bart-large-cnn · 4-beam
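A compact sketch of the two passes (summary lengths are illustrative):

```python
from transformers import pipeline

summarise = pipeline("summarization", model="facebook/bart-large-cnn")

def two_pass(chunks):
    # Pass 1: each <=1024-token chunk gets its own summary.
    firsts = [summarise(c, num_beams=4, max_length=130, min_length=30,
                        truncation=True)[0]["summary_text"] for c in chunks]
    # Pass 2: summarise the concatenated chunk-summaries into one.
    return summarise(" ".join(firsts), num_beams=4, max_length=150,
                     min_length=40, truncation=True)[0]["summary_text"]
```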
04
Score sentiment with RoBERTa
cardiffnlp/twitter-roberta-base-sentiment scores the summary. Instead of argmax (one class), we compute a softmax-weighted compound score across positive/neutral/negative — a continuous signal more useful for tracking trends.
RoBERTa · softmax compound
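The compound score in code; for this checkpoint the label order is negative, neutral, positive:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cardiffnlp/twitter-roberta-base-sentiment"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def compound(text: str) -> float:
    logits = model(**tok(text, return_tensors="pt", truncation=True)).logits
    p_neg, p_neu, p_pos = torch.softmax(logits, dim=-1)[0].tolist()
    return p_pos - p_neg  # continuous score in [-1, 1]
```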
05
Balance training data — correctly
For the supervised classifier, classes were 55:46. SMOTE was applied only after train/test split, on the training portion. Applying it earlier would leak synthetic samples into validation and inflate accuracy.
imblearn · SMOTE (post-split)
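The correct ordering in a few lines, assuming a feature matrix X and labels y already exist:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Resample the training portion only: synthetic points must never
# reach the test set, or accuracy is inflated by leakage.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```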
06
Serve via API + dashboard
A Flask endpoint returns sentiment-per-ticker for downstream consumers. A Streamlit dashboard shows live sentiment trends — green when positive sentiment dominates, red when negative.
Flask · Streamlit · WebSocket
Why two-pass BART summarisation?
  • Financial news articles often exceed BART's 1024-token context limit
  • Pass 1: Split article into chunks, summarise each independently
  • Pass 2: Summarise the combined chunk-summaries into a final summary
  • 4-beam search produces better quality than greedy decoding — it keeps the 4 highest-scoring candidate sequences at each decoding step
  • Result: 70% length reduction while preserving key financial signals (earnings, guidance, risk factors)
Why RoBERTa over BERT for sentiment?
  • cardiffnlp/twitter-roberta-base-sentiment is trained on Twitter data, whose short, headline-like register is closer to financial news than generic review-sentiment models
  • Compound score: Instead of argmax (pick one class), compute softmax-weighted sum across positive/neutral/negative
  • This gives a continuous sentiment score rather than a discrete class — more useful for tracking sentiment trends over time
  • SMOTE only on training data — a common mistake is applying SMOTE before splitting, which leaks synthetic samples into validation. Applied correctly here.
Key Learning
The most important data-science lesson from this project: SMOTE must only be applied to training data, never before the train-test split. Applying SMOTE on the full dataset before splitting creates data leakage — synthetic samples generated from real training examples end up in the test set, inflating accuracy metrics. The correct workflow: split first, then apply SMOTE only on the training portion.
⊕ View on GitHub → github.com/Harshaaalll/market-analysis-llm