🕸️ BeautifulSoup: 500+ financial articles scraped
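A minimal sketch of the scraping step. The CSS class names (`headline`, `article-body`) and the sample HTML are assumptions for illustration; fetching the 500+ live pages is omitted so the snippet stays self-contained.

```python
from bs4 import BeautifulSoup

# A static snippet stands in for a fetched page; the real project
# scraped 500+ live financial articles (network fetching omitted).
html = """
<html><body>
  <h1 class="headline">Markets rally on rate-cut hopes</h1>
  <div class="article-body"><p>Stocks rose sharply.</p><p>Analysts said more gains may follow.</p></div>
</body></html>
"""

def parse_article(page: str) -> dict:
    """Extract headline and body text from one article page."""
    soup = BeautifulSoup(page, "html.parser")
    title = soup.select_one("h1.headline").get_text(strip=True)
    body = " ".join(p.get_text(strip=True)
                    for p in soup.select("div.article-body p"))
    return {"title": title, "text": body}

article = parse_article(html)
```

The same `parse_article` function can be mapped over every downloaded page to build the raw article corpus.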
✂️ LangChain Chunking: unstructured text workflow
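The project uses LangChain's text splitters; the sketch below hand-rolls the same idea (fixed-size character windows with overlap, as in a character text splitter) in plain Python so it runs without the library. The chunk size and overlap values are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping fixed-size character windows,
    mimicking a LangChain character text splitter."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Each window starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap keeps sentence fragments from being cut off without context, which matters for the downstream summarization pass.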
📰 BART Two-Pass: facebook/bart-large-cnn, 4-beam, 70% reduction
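The two-pass control flow can be sketched as below. Loading `facebook/bart-large-cnn` is impractical here, so a word-truncating stub stands in for the model call (the real pipeline would invoke BART with `num_beams=4`); the `ratio` value is an illustrative assumption, not the project's setting.

```python
def summarize_stub(text: str, ratio: float = 0.3) -> str:
    """Placeholder for a BART call (facebook/bart-large-cnn, 4-beam in the
    real pipeline): keep roughly the first `ratio` of the words."""
    words = text.split()
    keep = max(1, int(len(words) * ratio))
    return " ".join(words[:keep])

def two_pass_summary(chunks: list[str], summarize=summarize_stub) -> str:
    # Pass 1: summarize each chunk independently.
    partials = [summarize(c) for c in chunks]
    # Pass 2: summarize the concatenated partial summaries into one
    # final abstract, compounding the length reduction.
    return summarize(" ".join(partials))
```

Summarizing chunk-by-chunk first keeps every pass under the model's input length limit, which is why the second pass exists at all.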
💭 RoBERTa Sentiment: cardiffnlp Twitter sentiment, softmax weighted
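The softmax weighting can be shown without loading the model. Assuming the three-class logit order `[negative, neutral, positive]` and a `-1/0/+1` weighting into a single score (both are assumptions about this project's setup, not facts from the source):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw model logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sentiment(logits: list[float]) -> float:
    """Collapse [negative, neutral, positive] class probabilities into one
    score in [-1, 1] by weighting them with -1, 0, +1 (assumed scheme)."""
    probs = softmax(logits)
    weights = [-1.0, 0.0, 1.0]
    return sum(p * w for p, w in zip(probs, weights))
```

Weighting the full probability vector, rather than taking the argmax label, preserves how confident the model was, so a hesitant "positive" scores lower than an emphatic one.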
⚖️ SMOTE Balancing: on training set only, 55:46 → balanced
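The project uses imblearn's SMOTE; the hand-rolled 2-D sketch below only illustrates the core mechanic, which is that each synthetic point is interpolated on the segment between a minority sample and a same-class neighbor.

```python
import random

def smote_oversample(minority: list[tuple[float, float]],
                     n_new: int, seed: int = 0) -> list[tuple[float, float]]:
    """Minimal SMOTE-style sketch (the real pipeline uses imblearn):
    each synthetic point lies between a random minority sample and its
    nearest same-class neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest minority-class neighbor of `a`, excluding `a` itself
        b = min((p for p in minority if p is not a),
                key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
        t = rng.random()  # interpolation factor along the segment a→b
        synthetic.append((a[0] + t * (b[0] - a[0]),
                          a[1] + t * (b[1] - a[1])))
    return synthetic

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new = smote_oversample(pts, 5)
```

Because every synthetic point is derived only from minority-class samples, the generated data stays inside the minority class's region of feature space.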
🚀 Flask API: real-time + Streamlit dashboard
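A minimal sketch of the serving layer. The `/sentiment` route name, request shape, and keyword-based scoring stub are all assumptions for illustration; the real endpoint would call the RoBERTa model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/sentiment", methods=["POST"])
def sentiment():
    """Hypothetical scoring endpoint; a keyword stub replaces the model."""
    text = request.get_json(force=True).get("text", "")
    score = 1.0 if "gain" in text.lower() else -1.0  # stand-in for RoBERTa
    return jsonify({"text": text, "score": score})

# Flask's built-in test client exercises the route without running a server.
client = app.test_client()
resp = client.post("/sentiment", json={"text": "Shares gain 5%"})
```

A Streamlit dashboard can then poll this endpoint to render scores in real time.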
Key Learning
The most important data science lesson from this project: SMOTE must be applied only to the training data, never before the train-test split. Applying SMOTE to the full dataset before splitting creates data leakage — synthetic samples generated from real training examples end up in the test set, inflating accuracy metrics. The correct workflow: split first, then apply SMOTE only on the training portion.
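The correct ordering can be demonstrated end-to-end with toy data. Random duplication of minority samples stands in for SMOTE here (the leakage argument is identical: anything synthesized from the training rows never reaches the test rows); the data, split point, and counts are all illustrative.

```python
import random

# toy imbalanced data: label 1 is the minority class
X = [[float(i)] for i in range(20)]
y = [0, 0, 1, 0] * 5

# 1) split FIRST (plain holdout here; a stratified split works the same way)
cut = 15
X_tr, y_tr = X[:cut], y[:cut]
X_te, y_te = X[cut:], y[cut:]

# 2) THEN balance only the training portion. Duplicating minority samples
#    stands in for SMOTE; synthetic rows derived from X_tr never touch X_te.
rng = random.Random(0)
minority = [(x, lbl) for x, lbl in zip(X_tr, y_tr) if lbl == 1]
while y_tr.count(1) < y_tr.count(0):
    x_new, lbl = rng.choice(minority)
    X_tr.append(x_new)
    y_tr.append(lbl)
```

After the loop the training labels are balanced, while the 5 held-out test rows are exactly the original, untouched data — which is what keeps the reported metrics honest.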