# Code-Switching Detection in Bilingual Dialogue
A deep learning project for detecting language-switch boundaries in English-Chinese bilingual conversations using BiLSTM neural networks.
## 🎯 Project Overview
Code-switching occurs when bilingual speakers alternate between languages within a single conversation. This project tackles token-level code-switch boundary detection—predicting whether a language switch occurs between consecutive tokens in bilingual dialogue.
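Concretely, each boundary between adjacent tokens gets a binary label, as in this simplified illustration (tokenization and language tagging are abstracted away):

```python
# Illustrative only: boundary labels for a code-switched utterance.
# Label 1 means the language changes between token i and token i+1.
tokens = ["我们", "需要", "deploy", "这个", "功能"]  # "We need to deploy this feature"
langs = ["zh", "zh", "en", "zh", "zh"]              # per-token language tags

labels = [int(langs[i] != langs[i + 1]) for i in range(len(langs) - 1)]
print(labels)  # [0, 1, 1, 0]
```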
### Key Results

| Model | Accuracy | Precision | Recall | F1-score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.711 | 0.583 | 0.766 | 0.662 | 0.795 |
| Random Forest | 0.763 | 0.649 | 0.778 | 0.708 | 0.831 |
| **BiLSTM (final)** | **0.834** | **0.748** | **0.828** | **0.786** | **0.915** |
## 📊 Dataset

- **Total examples:** 118,965 token-boundary classification instances
- **Train/test split:** 80/20 (stratified)
- **Class distribution:** 63% no-switch, 37% switch (imbalance offset via `pos_weight` in the loss)
- **Vocabulary size:** 3,094 unique tokens
- **Max sequence length:** 50 tokens
### Data Generation

Synthetic dialogues were generated with OpenAI GPT-3.5/GPT-4 using:

- **V5 prompt template** with parameter sweeps
- **4 domains:** Tech/Professional (25%), Casual/Social (30%), Family/Intimate (25%), Narrative (20%)
- **5 persona pairs:** colleagues, friends, siblings, family, romantic partners
- **Agentic validation pipeline:** language-ratio checks, switch-frequency validation, naturalness scoring
- **~80% acceptance rate** after quality filtering

See `data_generation_v5.py` for details; a sketch of the ratio and frequency checks follows.
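The checks might look like the following sketch. The function name, thresholds, and per-token language tags are illustrative, not the actual pipeline in `data_generation_v5.py`, and the naturalness-scoring step is omitted:

```python
# Hypothetical sketch of the agentic validation checks; real thresholds
# and helper names in data_generation_v5.py may differ.
def validate_dialogue(langs: list[str],
                      min_minor_ratio: float = 0.25,
                      switch_range: tuple[float, float] = (0.2, 0.5)) -> bool:
    """Accept a dialogue only if both languages are represented and the
    switch frequency falls within a plausible code-mixing band."""
    n = len(langs)
    if n < 2:
        return False
    en_ratio = langs.count("en") / n
    # Language-ratio check: neither language should dominate completely.
    if min(en_ratio, 1 - en_ratio) < min_minor_ratio:
        return False
    # Switch-frequency check: fraction of boundaries where language changes.
    switches = sum(langs[i] != langs[i + 1] for i in range(n - 1)) / (n - 1)
    return switch_range[0] <= switches <= switch_range[1]
```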
## 🏗️ Architecture

### BiLSTM Classifier

**Why BiLSTM?**

- **Sequential modeling:** captures token order (unlike bag-of-words)
- **Bidirectionality:** reads context both left-to-right and right-to-left
- **Long-range dependencies:** handles relationships with distant context
- **Better than baselines:** +7.1% accuracy over Random Forest
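A minimal PyTorch sketch of such a classifier, sized to the model version noted at the bottom of this README (2 layers, 256-dim hidden) and the 3,094-token vocabulary; the embedding width and the boundary-feature construction are assumptions, not the project's exact design:

```python
import torch
import torch.nn as nn

class BiLSTMSwitchClassifier(nn.Module):
    """Token-boundary switch classifier (sketch). Layer count and hidden size
    follow the README; embedding width and the boundary head are assumed."""

    def __init__(self, vocab_size: int = 3094, embed_dim: int = 128,
                 hidden_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # 2x from bidirectionality

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> boundary logits: (batch, seq_len - 1)
        h, _ = self.lstm(self.embedding(token_ids))
        # Score each boundary from the contextual states of its two tokens.
        boundary = h[:, :-1, :] + h[:, 1:, :]
        return self.head(boundary).squeeze(-1)
```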
### Training

**Hyperparameters:**

- Learning rate: 1e-3 (AdamW optimizer)
- Batch size: 256
- Loss function: `BCEWithLogitsLoss` with `pos_weight=1.72` (to offset class imbalance)
- Gradient clipping: `max_norm=1.0`
- Early stopping: patience = 5 epochs
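Put together, a sketch of this training setup, reusing `BiLSTMSwitchClassifier` from the architecture sketch above; the random tensors stand in for the real dataset:

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: fake token ids (seq len 50) and boundary labels.
X = torch.randint(1, 3094, (1024, 50))
y = (torch.rand(1024, 49) < 0.37).float()
train_loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)
val_loader = DataLoader(TensorDataset(X[:256], y[:256]), batch_size=256)

def validation_auc(model, loader):
    """ROC-AUC over all boundary predictions in the loader."""
    model.eval()
    scores, targets = [], []
    with torch.no_grad():
        for ids, labels in loader:
            scores.append(torch.sigmoid(model(ids)).flatten())
            targets.append(labels.flatten())
    return roc_auc_score(torch.cat(targets), torch.cat(scores))

model = BiLSTMSwitchClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(1.72))

best_auc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(15):
    model.train()
    for ids, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(ids), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    auc = validation_auc(model, val_loader)
    if auc > best_auc:
        best_auc, bad_epochs = auc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```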
**Training dynamics:**

- Peak validation AUC: 0.9202 (epoch 5)
- Final test ROC-AUC: 0.915
- Training time: ~20 minutes (15 epochs with early stopping)
## 📁 Project Structure
## 🚀 Quick Start

### Installation
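The dependency list isn't reproduced in this README; based on the stack named under References, a minimal environment would be (an assumption; check the repository for a pinned requirements file):

```bash
pip install torch scikit-learn openai
```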
### Generate Synthetic Data
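Run the generation script (no flags are documented here; the API-key line assumes the OpenAI client's standard environment variable):

```bash
export OPENAI_API_KEY="sk-..."  # placeholder key
python data_generation_v5.py
```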
This generates ~2,000 synthetic dialogues with parameter sweeps across domains and personas.
### Train Models
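The training entry point isn't named in this README; a typical invocation might look like this (the script name is a placeholder):

```bash
python train_models.py  # hypothetical script name
```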
**Output:**

- Trained model metrics (accuracy, precision, recall, F1, ROC-AUC)
- Confusion matrix heatmap saved to `figures/confusion_matrix_bilstm.png`
- Predictions on the test set saved to `data/processed/`
- 8+ analytical visualizations in `figures/`
## 📈 Key Findings

### Model Performance

- BiLSTM achieves 91.5% ROC-AUC, outperforming both simpler baselines
- Bidirectional context is crucial: the bag-of-words Random Forest misses 5.0% more switches (recall 0.778 vs. 0.828)
- Early stopping at epoch 5 prevented overfitting (validation AUC plateaued at 0.9202)
### Error Analysis

- **False positives (16.1%):** high vocabulary diversity within monolingual utterances
  - Example: "我们今天讨论工程师的职责和团队的目标" ("Today we're discussing the engineers' responsibilities and the team's goals"): all Chinese, no switch
  - The model incorrectly predicts a switch because of the diverse tokens
- **False negatives (17.0%):** subtle, natural technical-term switches
  - Example: "我们需要 deploy 这个功能" ("We need to deploy this feature"): Mandarin plus an English technical term
  - The model misses contextually smooth switches
### Linguistic Insights

- **Language balance:** 63% Chinese, 37% English tokens (a realistic code-switching ratio)
- **Switch frequency:** 37% of token boundaries are switches (heavy code-mixing)
- **Domain variation:** tech contexts default to English; casual contexts are more balanced
- **POS patterns:** nouns and verbs show distinct language preferences
## 🔬 Ethical Analysis

### Detected Bias

1. **Domain confinement:** before the V5 prompt, 68% of generated data fell in the tech/professional domain
   - Risk: the model overfits to formal language and fails on casual speech
   - Mitigation: the V5 prompt enforces the 4-domain stratification listed above
2. **English authority skew:** technical terms default to English
   - This reflects a real-world bias but risks overfitting
   - Addressed via diverse persona generation and casualness checks
### Mitigation Strategy

- **Prompt refinement:** V2→V5 added explicit domain percentages
- **Data filtering:** vocabulary audits, casualness checks, manual validation
- **Rebalancing:** oversampled informal/family dialogues to 40% of the dataset
- **Post-mitigation:** domain diversity improved from 58% to 78%
## 🔮 Future Work

### Short-term

- **Multilingual embeddings:** replace randomly initialized embeddings with mBERT
  - Expected improvement: +5% ROC-AUC
  - Captures cross-lingual semantic relationships
- **Threshold optimization:** tune the decision boundary per domain
  - Rationale: different domains may need different thresholds
- **Real-data validation:** test on authentic bilingual dialogue
  - Currently the model is trained and evaluated on synthetic data only
### Long-term

- **More language pairs:** extend to Hindi-English, Spanish-English, etc.
- **Phonetic code-switching:** handle pronunciation-driven switches
- **Contextual semantics:** why do speakers switch? (pragmatic analysis)
- **Dialect variation:** account for regional and generational differences
## 📚 References

### Papers

- Solorio, T., & Liu, Y. (2008). Learning to Predict Code-Switching Points. *EMNLP*.
- King, B., & Abney, S. (2013). Labeling the Languages in Code-Mixed Training Data without Explicit Language IDs. *HLT-NAACL*.

### Data & Code

- OpenAI GPT API for synthetic data generation
- PyTorch for the BiLSTM implementation
- scikit-learn for baselines and evaluation
## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## 👥 Authors

Keith Yao & Arav Goyal
## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Commit your changes (`git commit -am 'Add feature'`)
4. Push to the branch (`git push origin feature/my-feature`)
5. Open a Pull Request
## 📞 Contact

For questions or feedback:

- Open an issue on GitHub
- Email: yao.kei@northeastern.edu

---

**Last Updated:** December 2025
**Dataset Version:** V5 (2,000 synthetic dialogues, 4-domain stratified)
**Model Version:** BiLSTM (2-layer, 256-dim hidden, ROC-AUC 0.915)

