Binary Classification Tutorial#
Learn how to build a binary sentiment classifier for product reviews.
Learning Objectives#
By the end of this tutorial, you will be able to:
Create and train a WordPiece tokenizer
Configure a binary classification model
Train the model with validation data
Make predictions and evaluate performance
Understand the complete workflow from data to predictions
Prerequisites#
Basic Python knowledge
torchTextClassifiers installed
Familiarity with classification concepts
Overview#
In this tutorial, we’ll build a sentiment classifier that predicts whether a product review is positive or negative. We’ll use:
Dataset: Product reviews (30 training, 8 validation, 10 test samples)
Task: Binary classification (positive vs. negative)
Tokenizer: WordPiece
Architecture: Simple text embedder + classification head
Complete Code#
Here’s the complete code we’ll walk through:
import os
import numpy as np
import torch
from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers
from torchTextClassifiers.tokenizers import WordPieceTokenizer
# For Mac M1/M2 users
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
# Step 1: Prepare Data
X_train = np.array([
"I love this product! It's amazing and works perfectly.",
"This is terrible. Worst purchase ever made.",
"Great quality and fast shipping. Highly recommend!",
"Poor quality, broke after one day. Very disappointed.",
"Excellent customer service and great value for money.",
"Overpriced and doesn't work as advertised.",
# ... (30 total samples)
])
y_train = np.array([1, 0, 1, 0, 1, 0, ...]) # 1=positive, 0=negative
X_val = np.array([
"Good product, satisfied with purchase.",
"Not worth the money, poor quality.",
# ... (8 total samples)
])
y_val = np.array([1, 0, ...])
X_test = np.array([
"This is an amazing product with great features!",
"Completely disappointed with this purchase.",
# ... (10 total samples)
])
y_test = np.array([1, 0, ...])
# Step 2: Create and Train Tokenizer
tokenizer = WordPieceTokenizer(vocab_size=5000, output_dim=128)
tokenizer.train(X_train.tolist())
# Step 3: Configure Model
model_config = ModelConfig(
embedding_dim=50,
num_classes=2
)
# Step 4: Create Classifier
classifier = torchTextClassifiers(
tokenizer=tokenizer,
model_config=model_config
)
# Step 5: Train Model
training_config = TrainingConfig(
num_epochs=20,
batch_size=4,
lr=1e-3,
patience_early_stopping=5,
num_workers=0,
)
classifier.train(
X_train, y_train,
X_val, y_val,
training_config=training_config,
verbose=True
)
# Step 6: Make Predictions
result = classifier.predict(X_test)
predictions = result["prediction"].squeeze().numpy()
confidence = result["confidence"].squeeze().numpy()
# Step 7: Evaluate
accuracy = (predictions == y_test).mean()
print(f"Test accuracy: {accuracy:.3f}")
Step-by-Step Walkthrough#
Step 1: Prepare Your Data#
First, organize your data into training, validation, and test sets:
X_train = np.array([
"I love this product! It's amazing and works perfectly.",
"This is terrible. Worst purchase ever made.",
# ... more samples
])
y_train = np.array([1, 0, ...]) # Binary labels
Key Points:
Training set: Used to train the model (30 samples)
Validation set: Used for early stopping and hyperparameter tuning (8 samples)
Test set: Used for final evaluation (10 samples)
Labels: 0 = negative, 1 = positive
Tip
For real projects, use at least hundreds of samples per class. This example uses small numbers for demonstration.
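For larger datasets you would typically load the reviews from a file and split them programmatically rather than hard-coding arrays. Below is a minimal sketch using pandas and scikit-learn; the file name reviews.csv and its text/label columns are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV with a "text" column and a binary "label" column (0/1)
df = pd.read_csv("reviews.csv")

# Hold out 20% for testing, then 20% of the remainder for validation
X_full, X_test, y_full, y_test = train_test_split(
    df["text"].to_numpy(), df["label"].to_numpy(),
    test_size=0.2, stratify=df["label"], random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42,
)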
Step 2: Create and Train Tokenizer#
The tokenizer converts text into numerical tokens:
tokenizer = WordPieceTokenizer(vocab_size=5000, output_dim=128)
tokenizer.train(X_train.tolist())
Parameters:
vocab_size: Maximum vocabulary size (5000 subwords)
output_dim: Output dimension for tokenized sequences (128 tokens max)
What happens during training:
Analyzes the training corpus
Learns common subwords and character combinations
Builds a vocabulary of frequent patterns
Note
The tokenizer only sees the training data, never validation or test data, to avoid data leakage.
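To make the subword idea concrete, here is a toy, self-contained illustration of greedy longest-match segmentation over a made-up vocabulary. It sketches the WordPiece concept only; it is not the library's actual implementation.
# Toy vocabulary a WordPiece-style tokenizer might learn from the corpus
vocab = {"ship", "##ping", "dis", "##appoint", "##ed", "great", "##ly"}

def segment(word):
    """Greedily match the longest known piece, marking continuations with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no known piece found
            return ["[UNK]"]
        start = end
    return pieces

print(segment("shipping"))        # ['ship', '##ping']
print(segment("disappointed"))    # ['dis', '##appoint', '##ed']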
Step 3: Configure the Model#
Define your model architecture:
model_config = ModelConfig(
embedding_dim=50,
num_classes=2
)
Parameters:
embedding_dim: Dimension of learned text embeddings (50)
num_classes: Number of output classes (2 for binary classification)
Architecture:
The model will have:
Embedding layer: Maps tokens to 50-dimensional vectors
Pooling: Averages token embeddings
Classification head: Linear layer outputting 2 logits
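To see how these pieces fit together, here is a minimal plain-PyTorch sketch of an equivalent model: an embedding layer, mean pooling over the sequence, and a linear classification head. It illustrates the idea; the library's internal module may differ.
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embedding_dim=50, num_classes=2, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor produced by the tokenizer
        embedded = self.embedding(token_ids)      # (batch, seq_len, embedding_dim)
        pooled = embedded.mean(dim=1)             # average over the sequence
        return self.head(pooled)                  # (batch, num_classes) logits

logits = TinyTextClassifier()(torch.randint(0, 5000, (4, 128)))
print(logits.shape)  # torch.Size([4, 2])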
Step 4: Create the Classifier#
Instantiate the classifier with the tokenizer and configuration:
classifier = torchTextClassifiers(
tokenizer=tokenizer,
model_config=model_config
)
This creates the complete pipeline: tokenizer → embedder → classifier.
Step 5: Configure and Run Training#
Set up training hyperparameters:
training_config = TrainingConfig(
num_epochs=20, # Maximum training epochs
batch_size=4, # Samples per batch
lr=1e-3, # Learning rate
patience_early_stopping=5, # Stop if no improvement for 5 epochs
num_workers=0, # Data loading workers
)
Key Hyperparameters:
num_epochs: How many times to iterate through the dataset
batch_size: Smaller batches give more frequent but noisier updates; larger batches run faster per epoch with smoother gradients, at the cost of more memory
lr (learning rate): How big the optimization steps are
patience_early_stopping: Prevents overfitting by stopping early
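Conceptually, patience_early_stopping behaves like the loop below: training stops once the validation loss has failed to improve for the given number of consecutive epochs. The validation-loss values are made up for illustration; this is a schematic of the logic, not the library's training loop.
# Schematic early-stopping logic with a made-up validation-loss curve
val_losses = [0.70, 0.55, 0.48, 0.47, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52]
patience = 5
best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")  # epoch 8 with this curve
            break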
Train the model:
classifier.train(
X_train, y_train, # Training data
X_val, y_val, # Validation data
training_config=training_config,
verbose=True # Show training progress
)
Expected Output:
Epoch 0: 100%|██████████| 8/8 [00:00<00:00, 25.32it/s, v_num=0]
Epoch 1: 100%|██████████| 8/8 [00:00<00:00, 28.41it/s, v_num=0]
...
Tip
Watch the validation metrics during training. If validation loss increases while training loss decreases, you may be overfitting.
Step 6: Make Predictions#
Use the trained model to predict on new data:
result = classifier.predict(X_test)
predictions = result["prediction"].squeeze().numpy()
confidence = result["confidence"].squeeze().numpy()
Output:
predictions: Predicted class labels (0 or 1)
confidence: Confidence scores (0-1 range)
Example output:
predictions = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
confidence = [0.95, 0.88, 0.92, 0.76, 0.98, 0.85, 0.91, 0.79, 0.94, 0.87]
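The confidence scores are useful beyond accuracy. For instance, you can flag low-confidence predictions for manual review; the 0.80 threshold below is an arbitrary choice for illustration.
import numpy as np

# Flag predictions the model is not sure about (threshold is a judgment call)
threshold = 0.80
uncertain = np.where(confidence < threshold)[0]
for i in uncertain:
    print(f"Review {i}: predicted {predictions[i]} with confidence {confidence[i]:.2f} -> check manually")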
Step 7: Evaluate Performance#
Calculate accuracy:
accuracy = (predictions == y_test).mean()
print(f"Test accuracy: {accuracy:.3f}")
Show detailed results:
for i, (text, pred, true) in enumerate(zip(X_test, predictions, y_test)):
sentiment = "Positive" if pred == 1 else "Negative"
correct = "✅" if pred == true else "❌"
print(f"{i+1}. {correct} Predicted: {sentiment}")
print(f" Text: {text[:50]}...")
Example output:
1. ✅ Predicted: Positive
Text: This is an amazing product with great features...
2. ✅ Predicted: Negative
Text: Completely disappointed with this purchase...
Test accuracy: 0.900
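Overall accuracy can hide per-class differences, so it is also worth checking accuracy separately for each class. A small numpy-only breakdown using the arrays defined above:
import numpy as np

for label, name in [(0, "Negative"), (1, "Positive")]:
    mask = (y_test == label)
    class_accuracy = (predictions[mask] == label).mean()
    print(f"{name}: {mask.sum()} samples, accuracy {class_accuracy:.3f}")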
Understanding the Results#
What Does Good Performance Look Like?#
Accuracy > 0.80: Good for simple binary classification
Accuracy > 0.90: Excellent performance
High confidence scores: Model is certain about its predictions
When to Worry#
Accuracy < 0.60: Model barely better than random guessing
Validation loss increasing: Possible overfitting
Low confidence scores: Model is uncertain
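A quick sanity check is to compare the model against a majority-class baseline (always predicting the most frequent label in the test set). If the model does not clearly beat this baseline, it has learned very little:
import numpy as np

majority_class = np.bincount(y_test).argmax()
baseline_accuracy = (y_test == majority_class).mean()
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Model accuracy:          {accuracy:.3f}")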
Customization Options#
Using Different Tokenizers#
Try the NGramTokenizer (FastText-style):
from torchTextClassifiers.tokenizers import NGramTokenizer
tokenizer = NGramTokenizer(
vocab_size=5000,
min_n=3, # Minimum n-gram size
max_n=6, # Maximum n-gram size
)
tokenizer.train(X_train.tolist())
Adjusting Model Size#
For better performance with more data:
model_config = ModelConfig(
embedding_dim=128, # Larger embeddings
num_classes=2
)
Training Longer#
training_config = TrainingConfig(
num_epochs=50, # More epochs
batch_size=16, # Larger batches
lr=5e-4, # Lower learning rate
patience_early_stopping=10, # More patience
)
Using GPU#
If you have a GPU:
training_config = TrainingConfig(
...
accelerator="gpu", # Use GPU
)
Common Issues and Solutions#
Issue: Low Accuracy#
Solutions:
Increase embedding_dim (e.g., 128 or 256)
Train for more epochs
Collect more training data
Try different learning rates (1e-4, 5e-4, 1e-3)
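If you are unsure which learning rate works best, a small sweep over the candidates above is often enough. The sketch below reuses the objects and API calls shown earlier in this tutorial, re-creating the classifier for each candidate and comparing validation accuracy.
best_lr, best_val_acc = None, 0.0
for lr in [1e-4, 5e-4, 1e-3]:
    # Fresh model for each candidate learning rate; the trained tokenizer is reused
    candidate = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
    candidate.train(
        X_train, y_train,
        X_val, y_val,
        training_config=TrainingConfig(
            num_epochs=20,
            batch_size=4,
            lr=lr,
            patience_early_stopping=5,
            num_workers=0,
        ),
        verbose=True,
    )
    val_predictions = candidate.predict(X_val)["prediction"].squeeze().numpy()
    val_accuracy = (val_predictions == y_val).mean()
    if val_accuracy > best_val_acc:
        best_lr, best_val_acc = lr, val_accuracy
print(f"Best learning rate: {best_lr} (validation accuracy {best_val_acc:.3f})")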
Issue: Model Overfitting#
Symptoms: High training accuracy, low validation accuracy
Solutions:
Reduce embedding_dim
Add more training data
Reduce patience_early_stopping for earlier stopping
Use data augmentation
Issue: Training Too Slow#
Solutions:
Increase batch_size (if memory allows)
Reduce num_epochs
Use accelerator="gpu"
Increase num_workers (for data loading)
Next Steps#
Now that you’ve built a binary classifier, you can:
Try multiclass classification: See Multiclass Classification Tutorial
Add categorical features: Learn about mixed features
Use explainability: Understand which words drive predictions
Explore architecture: Read Architecture Overview
Complete Working Example#
You can find the complete working example in the repository.
Summary#
In this tutorial, you learned:
✅ How to prepare training, validation, and test data
✅ How to create and train a WordPiece tokenizer
✅ How to configure a binary classification model
✅ How to train the model with early stopping
✅ How to make predictions and evaluate performance
✅ How to customize hyperparameters
You’re now ready to build your own text classifiers!