# Code-Switching Detection in Bilingual Dialogue
A deep learning project for detecting language-switch boundaries in English-Chinese bilingual conversations using BiLSTM neural networks.
## 🎯 Project Overview
Code-switching occurs when bilingual speakers alternate between languages within a single conversation. This project tackles token-level code-switch boundary detection—predicting whether a language switch occurs between consecutive tokens in bilingual dialogue.
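Concretely, each boundary between adjacent tokens gets a binary label, as in this simplified illustration (tokenization and language tagging are abstracted away):

```python
# Illustrative only: boundary labels for a code-switched utterance.
# Label 1 means the language changes between token i and token i+1.
tokens = ["我们", "需要", "deploy", "这个", "功能"]  # "We need to deploy this feature"
langs = ["zh", "zh", "en", "zh", "zh"]              # per-token language tags

labels = [int(langs[i] != langs[i + 1]) for i in range(len(langs) - 1)]
print(labels)  # [0, 1, 1, 0]
```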
### Key Results

| Model | Accuracy | Precision | Recall | F1-score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.711 | 0.583 | 0.766 | 0.662 | 0.795 |
| Random Forest | 0.763 | 0.649 | 0.778 | 0.708 | 0.831 |
| **BiLSTM (final)** | **0.834** | **0.748** | **0.828** | **0.786** | **0.915** |
## 📊 Dataset

- **Total examples:** 118,965 token-boundary classification instances
- **Train/test split:** 80/20 (stratified)
- **Class distribution:** 63% no-switch, 37% switch (imbalance offset via `pos_weight` in the loss)
- **Vocabulary size:** 3,094 unique tokens
- **Max sequence length:** 50 tokens
### Data Generation

Synthetic dialogues were generated with OpenAI GPT-3.5/GPT-4 using:

- **V5 prompt template** with parameter sweeps
- **4 domains:** Tech/Professional (25%), Casual/Social (30%), Family/Intimate (25%), Narrative (20%)
- **5 persona pairs:** colleagues, friends, siblings, family, romantic partners
- **Agentic validation pipeline:** language-ratio checks, switch-frequency validation, naturalness scoring
- **~80% acceptance rate** after quality filtering

See `data_generation_v5.py` for details; a sketch of the ratio and frequency checks follows.
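The checks might look like the following sketch. The function name, thresholds, and per-token language tags are illustrative, not the actual pipeline in `data_generation_v5.py`, and the naturalness-scoring step is omitted:

```python
# Hypothetical sketch of the agentic validation checks; real thresholds
# and helper names in data_generation_v5.py may differ.
def validate_dialogue(langs: list[str],
                      min_minor_ratio: float = 0.25,
                      switch_range: tuple[float, float] = (0.2, 0.5)) -> bool:
    """Accept a dialogue only if both languages are represented and the
    switch frequency falls within a plausible code-mixing band."""
    n = len(langs)
    if n < 2:
        return False
    en_ratio = langs.count("en") / n
    # Language-ratio check: neither language should dominate completely.
    if min(en_ratio, 1 - en_ratio) < min_minor_ratio:
        return False
    # Switch-frequency check: fraction of boundaries where language changes.
    switches = sum(langs[i] != langs[i + 1] for i in range(n - 1)) / (n - 1)
    return switch_range[0] <= switches <= switch_range[1]
```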
## 🏗️ Architecture

### BiLSTM Classifier

**Why BiLSTM?**

- **Sequential modeling:** captures token order (unlike bag-of-words)
- **Bidirectionality:** reads context both left-to-right and right-to-left
- **Long-range dependencies:** handles relationships with distant context
- **Better than baselines:** +7.1% accuracy over Random Forest
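A minimal PyTorch sketch of such a classifier, sized to the model version noted at the bottom of this README (2 layers, 256-dim hidden) and the 3,094-token vocabulary; the embedding width and the boundary-feature construction are assumptions, not the project's exact design:

```python
import torch
import torch.nn as nn

class BiLSTMSwitchClassifier(nn.Module):
    """Token-boundary switch classifier (sketch). Layer count and hidden size
    follow the README; embedding width and the boundary head are assumed."""

    def __init__(self, vocab_size: int = 3094, embed_dim: int = 128,
                 hidden_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # 2x from bidirectionality

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> boundary logits: (batch, seq_len - 1)
        h, _ = self.lstm(self.embedding(token_ids))
        # Score each boundary from the contextual states of its two tokens.
        boundary = h[:, :-1, :] + h[:, 1:, :]
        return self.head(boundary).squeeze(-1)
```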
### Training

**Hyperparameters:**

- Learning rate: 1e-3 (AdamW optimizer)
- Batch size: 256
- Loss function: `BCEWithLogitsLoss` with `pos_weight=1.72` (to offset class imbalance)
- Gradient clipping: `max_norm=1.0`
- Early stopping: patience = 5 epochs
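Put together, a sketch of this training setup, reusing `BiLSTMSwitchClassifier` from the architecture sketch above; the random tensors stand in for the real dataset:

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: fake token ids (seq len 50) and boundary labels.
X = torch.randint(1, 3094, (1024, 50))
y = (torch.rand(1024, 49) < 0.37).float()
train_loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)
val_loader = DataLoader(TensorDataset(X[:256], y[:256]), batch_size=256)

def validation_auc(model, loader):
    """ROC-AUC over all boundary predictions in the loader."""
    model.eval()
    scores, targets = [], []
    with torch.no_grad():
        for ids, labels in loader:
            scores.append(torch.sigmoid(model(ids)).flatten())
            targets.append(labels.flatten())
    return roc_auc_score(torch.cat(targets), torch.cat(scores))

model = BiLSTMSwitchClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(1.72))

best_auc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(15):
    model.train()
    for ids, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(ids), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    auc = validation_auc(model, val_loader)
    if auc > best_auc:
        best_auc, bad_epochs = auc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```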
**Training dynamics:**

- Peak validation AUC: 0.9202 (epoch 5)
- Final test ROC-AUC: 0.915
- Training time: ~20 minutes (15 epochs with early stopping)
## 📁 Project Structure
## 🚀 Quick Start

### Installation
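The dependency list isn't reproduced in this README; based on the stack named under References, a minimal environment would be (an assumption; check the repository for a pinned requirements file):

```bash
pip install torch scikit-learn openai
```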
### Generate Synthetic Data
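Run the generation script (no flags are documented here; the API-key line assumes the OpenAI client's standard environment variable):

```bash
export OPENAI_API_KEY="sk-..."  # placeholder key
python data_generation_v5.py
```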
This generates ~2,000 synthetic dialogues with parameter sweeps across domains and personas.
### Train Models
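The training entry point isn't named in this README; a typical invocation might look like this (the script name is a placeholder):

```bash
python train_models.py  # hypothetical script name
```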
**Output:**

- Trained model metrics (accuracy, precision, recall, F1, ROC-AUC)
- Confusion matrix heatmap saved to `figures/confusion_matrix_bilstm.png`
- Predictions on the test set saved to `data/processed/`
- 8+ analytical visualizations in `figures/`
## 📈 Key Findings

### Model Performance

- BiLSTM achieves 91.5% ROC-AUC, outperforming both simpler baselines
- Bidirectional context is crucial: the bag-of-words Random Forest misses 5.0% more switches (recall 0.778 vs. 0.828)
- Early stopping at epoch 5 prevented overfitting (validation AUC plateaued at 0.9202)
### Error Analysis

- **False positives (16.1%):** high vocabulary diversity within monolingual utterances
  - Example: "我们今天讨论工程师的职责和团队的目标" ("Today we're discussing the engineers' responsibilities and the team's goals"): all Chinese, no switch
  - The model incorrectly predicts a switch because of the diverse tokens
- **False negatives (17.0%):** subtle, natural technical-term switches
  - Example: "我们需要 deploy 这个功能" ("We need to deploy this feature"): Mandarin plus an English technical term
  - The model misses contextually smooth switches
### Linguistic Insights

- **Language balance:** 63% Chinese, 37% English tokens (a realistic code-switching ratio)
- **Switch frequency:** 37% of token boundaries are switches (heavy code-mixing)
- **Domain variation:** tech contexts default to English; casual contexts are more balanced
- **POS patterns:** nouns and verbs show distinct language preferences
## 🔬 Ethical Analysis

### Detected Bias

1. **Domain confinement:** before the V5 prompt, 68% of generated data fell in the tech/professional domain
   - Risk: the model overfits to formal language and fails on casual speech
   - Mitigation: the V5 prompt enforces the 4-domain stratification listed above
2. **English authority skew:** technical terms default to English
   - This reflects a real-world bias but risks overfitting
   - Addressed via diverse persona generation and casualness checks
### Mitigation Strategy

- **Prompt refinement:** V2→V5 added explicit domain percentages
- **Data filtering:** vocabulary audits, casualness checks, manual validation
- **Rebalancing:** oversampled informal/family dialogues to 40% of the dataset
- **Post-mitigation:** domain diversity improved from 58% to 78%
## 🔮 Future Work

### Short-term

- **Multilingual embeddings:** replace randomly initialized embeddings with mBERT
  - Expected improvement: +5% ROC-AUC
  - Captures cross-lingual semantic relationships
- **Threshold optimization:** tune the decision boundary per domain
  - Rationale: different domains may need different thresholds
- **Real-data validation:** test on authentic bilingual dialogue
  - Currently the model is trained and evaluated on synthetic data only
### Long-term

- **More language pairs:** extend to Hindi-English, Spanish-English, etc.
- **Phonetic code-switching:** handle pronunciation-driven switches
- **Contextual semantics:** why do speakers switch? (pragmatic analysis)
- **Dialect variation:** account for regional and generational differences
## 📚 References

### Papers

- Solorio, T., & Liu, Y. (2008). Learning to Predict Code-Switching Points. *EMNLP*.
- King, B., & Abney, S. (2013). Labeling the Languages in Code-Mixed Training Data without Explicit Language IDs. *HLT-NAACL*.

### Data & Code

- OpenAI GPT API for synthetic data generation
- PyTorch for the BiLSTM implementation
- scikit-learn for baselines and evaluation
## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## 👥 Authors

Keith Yao & Arav Goyal
## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Commit your changes (`git commit -am 'Add feature'`)
4. Push to the branch (`git push origin feature/my-feature`)
5. Open a Pull Request
## 📞 Contact

For questions or feedback:

- Open an issue on GitHub
- Email: yao.kei@northeastern.edu

---

**Last Updated:** December 2025
**Dataset Version:** V5 (2,000 synthetic dialogues, 4-domain stratified)
**Model Version:** BiLSTM (2-layer, 256-dim hidden, ROC-AUC 0.915)

