Architecture Overview#

torchTextClassifiers is a modular, component-based framework for text classification. Rather than a black box, it provides clear, reusable components that you can understand, configure, and compose.

The Pipeline#

At its core, torchTextClassifiers processes data through a simple pipeline:

Data Flow:

Text is tokenized into numerical tokens
Tokens are embedded into dense vectors (with optional attention)
Categorical variables (optional) are embedded separately
All embeddings are combined
Classification head produces final predictions

Component 1: Tokenizer#

Purpose: Convert text strings into numerical tokens that the model can process.

Available Tokenizers#

torchTextClassifiers supports three tokenization strategies:

NGramTokenizer (FastText-style)#

Character n-gram tokenization for robustness to typos and rare words.

from torchTextClassifiers.tokenizers import NGramTokenizer

tokenizer = NGramTokenizer(
    vocab_size=10000,
    min_n=3,  # Minimum n-gram size
    max_n=6,  # Maximum n-gram size
)
tokenizer.train(training_texts)

When to use:

Text with typos or non-standard spellings
Morphologically rich languages
Limited training data

WordPieceTokenizer#

Subword tokenization for balanced vocabulary coverage.

from torchTextClassifiers.tokenizers import WordPieceTokenizer

tokenizer = WordPieceTokenizer(vocab_size=5000)
tokenizer.train(training_texts)

When to use:

Standard text classification
Moderate vocabulary size
Good balance of coverage and granularity

HuggingFaceTokenizer#

Use pre-trained tokenizers from HuggingFace.

from torchTextClassifiers.tokenizers import HuggingFaceTokenizer
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = HuggingFaceTokenizer(tokenizer=hf_tokenizer)

When to use:

Transfer learning from pre-trained models
Need specific language support
Want to leverage existing tokenizers

Tokenizer Output#

All tokenizers produce the same output format:

output = tokenizer(["Hello world!", "Text classification"])
# output.input_ids: Token indices (batch_size, seq_len)
# output.attention_mask: Attention mask (batch_size, seq_len)

Component 2: Text Embedder#

Purpose: Convert tokens into dense, semantic embeddings that capture meaning.

Basic Text Embedding#

from torchTextClassifiers.model.components import TextEmbedder, TextEmbedderConfig

config = TextEmbedderConfig(
    vocab_size=5000,
    embedding_dim=128,
)
embedder = TextEmbedder(config)

# Forward pass
text_features = embedder(token_ids)  # Shape: (batch_size, 128)

How it works:

Looks up embedding for each token
Averages embeddings across the sequence
Produces a fixed-size vector per sample

With Self-Attention (Optional)#

Add transformer-style self-attention for better contextual understanding:

from torchTextClassifiers.model.components import AttentionConfig

attention_config = AttentionConfig(
    n_embd=128,
    n_head=4,      # Number of attention heads
    n_layer=2,     # Number of transformer blocks
    dropout=0.1,
)

config = TextEmbedderConfig(
    vocab_size=5000,
    embedding_dim=128,
    attention_config=attention_config,  # Add attention
)
embedder = TextEmbedder(config)

When to use attention:

Long documents where context matters
Tasks requiring understanding of word relationships
When you have sufficient training data

Configuration:

embedding_dim: Size of embedding vectors (e.g., 64, 128, 256)
n_head: Number of attention heads (typically 4, 8, or 16)
n_layer: Depth of transformer (start with 2-3)

Component 3: Categorical Variable Handler#

Purpose: Process categorical features (like user demographics, product categories) alongside text.

When to Use#

Add categorical features when you have structured data that complements text:

User age, location, or demographics
Product categories or attributes
Document metadata (source, type, etc.)

Setup#

from torchTextClassifiers.model.components import (
    CategoricalVariableNet,
    CategoricalForwardType
)

# Example: 3 categorical variables
# - Variable 1: 10 possible values
# - Variable 2: 5 possible values
# - Variable 3: 20 possible values

cat_handler = CategoricalVariableNet(
    vocabulary_sizes=[10, 5, 20],
    embedding_dims=[8, 4, 16],    # Embedding size for each
    forward_type=CategoricalForwardType.AVERAGE_AND_CONCAT
)

Combination Strategies#

The forward_type controls how categorical embeddings are combined:

AVERAGE_AND_CONCAT#

Average all categorical embeddings, then concatenate with text:

forward_type=CategoricalForwardType.AVERAGE_AND_CONCAT

Output size: text_embedding_dim + sum(categorical_embedding_dims)/n_categoricals

When to use: When categorical variables are equally important

CONCATENATE_ALL#

Concatenate each categorical embedding separately:

forward_type=CategoricalForwardType.CONCATENATE_ALL

Output size: text_embedding_dim + sum(categorical_embedding_dims)

When to use: When each categorical variable has unique importance

SUM_TO_TEXT#

Sum all categorical embeddings, then concatenate:

forward_type=CategoricalForwardType.SUM_TO_TEXT

Output size: text_embedding_dim + categorical_embedding_dim

When to use: To minimize output dimension

Example with Data#

# Text data
texts = ["Sample 1", "Sample 2"]

# Categorical data: (n_samples, n_categorical_variables)
categorical = np.array([
    [5, 2, 14],  # Sample 1: cat1=5, cat2=2, cat3=14
    [3, 1, 8],   # Sample 2: cat1=3, cat2=1, cat3=8
])

# Process
cat_features = cat_handler(categorical)  # Shape: (2, total_emb_dim)

Component 4: Classification Head#

Purpose: Take the combined features and produce class predictions.

Simple Classification#

from torchTextClassifiers.model.components import ClassificationHead

head = ClassificationHead(
    input_dim=152,      # 128 (text) + 24 (categorical)
    num_classes=5,      # Number of output classes
)

logits = head(combined_features)  # Shape: (batch_size, 5)

Custom Classification Head#

For more complex classification, provide your own architecture:

import torch.nn as nn

custom_head = nn.Sequential(
    nn.Linear(152, 64),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 5)
)

head = ClassificationHead(net=custom_head)

Complete Architecture#

Full Model Assembly#

The framework automatically combines all components:

from torchTextClassifiers.model import TextClassificationModel

model = TextClassificationModel(
    text_embedder=text_embedder,
    categorical_variable_net=cat_handler,  # Optional
    classification_head=head,
)

# Forward pass
logits = model(token_ids, categorical_data)

Usage Examples#

Example 1: Text-Only Classification#

Simple sentiment analysis with just text:

from torchTextClassifiers import torchTextClassifiers, ModelConfig, TrainingConfig
from torchTextClassifiers.tokenizers import WordPieceTokenizer

# 1. Create tokenizer
tokenizer = WordPieceTokenizer(vocab_size=5000)
tokenizer.train(texts)

# 2. Configure model
model_config = ModelConfig(
    embedding_dim=128,
    num_classes=2,  # Binary classification
)

# 3. Train
classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
training_config = TrainingConfig(num_epochs=10, batch_size=32, lr=1e-3)
classifier.train(texts, labels, training_config=training_config)

# 4. Predict
predictions = classifier.predict(new_texts)

Example 2: Mixed Features (Text + Categorical)#

Product classification using both description and category:

import numpy as np

# Text + categorical data
texts = ["Product description...", "Another product..."]
categorical = np.array([
    [3, 1],  # Product 1: category=3, brand=1
    [5, 2],  # Product 2: category=5, brand=2
])
labels = [0, 1]

# Configure model with categorical features
model_config = ModelConfig(
    embedding_dim=128,
    num_classes=3,
    categorical_vocabulary_sizes=[10, 5],  # 10 categories, 5 brands
    categorical_embedding_dims=[8, 4],
)

# Train
classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
classifier.train(
    X_text=texts,
    y=labels,
    X_categorical=categorical,
    training_config=training_config
)

Example 3: With Attention#

For longer documents or complex text:

from torchTextClassifiers.model.components import AttentionConfig

# Add attention for better understanding
attention_config = AttentionConfig(
    n_embd=128,
    n_head=8,
    n_layer=3,
    dropout=0.1,
)

model_config = ModelConfig(
    embedding_dim=128,
    num_classes=5,
    attention_config=attention_config,  # Enable attention
)

classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)

Example 4: Custom Components#

For maximum flexibility, compose components manually:

from torch import nn
from torchTextClassifiers.model.components import TextEmbedder, ClassificationHead

# Create custom model
class CustomClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embedder = TextEmbedder(text_config)
        self.custom_layer = nn.Linear(128, 64)
        self.head = ClassificationHead(64, num_classes)

    def forward(self, input_ids):
        text_features = self.text_embedder(input_ids)
        custom_features = self.custom_layer(text_features)
        return self.head(custom_features)

Using the High-Level API#

For most users, the torchTextClassifiers wrapper handles all the complexity:

from torchTextClassifiers import torchTextClassifiers, ModelConfig, TrainingConfig

# Simple 3-step process:
# 1. Create tokenizer and train it
# 2. Configure model architecture
# 3. Train and predict

classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
classifier.train(texts, labels, training_config=training_config)
predictions = classifier.predict(new_texts)

What the wrapper does:

Creates all components automatically
Sets up PyTorch Lightning training
Handles data loading and batching
Provides simple train/predict interface
Manages configurations

When to use the wrapper:

Standard classification tasks
Quick experimentation
Don’t need custom architecture
Want simplicity over control

For Advanced Users#

Direct PyTorch Usage#

All components are standard torch.nn.Module objects:

# All components work with standard PyTorch
isinstance(text_embedder, nn.Module)  # True
isinstance(cat_handler, nn.Module)  # True
isinstance(head, nn.Module)  # True

# Use in any PyTorch code
model = TextClassificationModel(text_embedder, cat_handler, head)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Standard PyTorch training loop
for batch in dataloader:
    optimizer.zero_grad()
    logits = model(batch.input_ids, batch.categorical)
    loss = criterion(logits, batch.labels)
    loss.backward()
    optimizer.step()

PyTorch Lightning Integration#

For automated training with advanced features:

from torchTextClassifiers.model import TextClassificationModule
from pytorch_lightning import Trainer

# Wrap model in Lightning module
lightning_module = TextClassificationModule(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam,
    lr=1e-3,
)

# Use Lightning Trainer
trainer = Trainer(
    max_epochs=20,
    accelerator="gpu",
    devices=4,  # Multi-GPU
    callbacks=[EarlyStopping(), ModelCheckpoint()],
)
trainer.fit(lightning_module, train_dataloader, val_dataloader)

Design Philosophy#

Modularity#

Each component is independent and can be used separately:

# Use just the tokenizer
tokenizer = NGramTokenizer()

# Use just the embedder
embedder = TextEmbedder(config)

# Use just the classifier head
head = ClassificationHead(input_dim, num_classes)

Flexibility#

Mix and match components for your use case:

# Text only
model = TextClassificationModel(text_embedder, None, head)

# Text + categorical
model = TextClassificationModel(text_embedder, cat_handler, head)

# Custom combination
model = MyCustomModel(text_embedder, my_layer, head)

Simplicity#

Sensible defaults for quick starts:

# Minimal configuration
model_config = ModelConfig(embedding_dim=128, num_classes=2)

# Or detailed configuration
model_config = ModelConfig(
    embedding_dim=256,
    num_classes=10,
    categorical_vocabulary_sizes=[50, 20, 100],
    categorical_embedding_dims=[32, 16, 64],
    attention_config=AttentionConfig(n_embd=256, n_head=8, n_layer=4),
)

Extensibility#

Easy to add custom components:

class MyCustomEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # Your custom implementation

    def forward(self, input_ids):
        # Your custom forward pass
        return embeddings

# Use with existing components
model = TextClassificationModel(
    text_embedder=MyCustomEmbedder(),
    classification_head=head,
)

Configuration Guide#

Choosing Embedding Dimension#

Task Complexity	Data Size	Recommended embedding_dim
Simple (binary)	< 1K samples	32-64
Medium (3-5 classes)	1K-10K samples	64-128
Complex (10+ classes)	10K-100K samples	128-256
Very complex	> 100K samples	256-512

Attention Configuration#

Document Length	Recommended Setup
Short (< 50 tokens)	No attention needed
Medium (50-200 tokens)	n_layer=2, n_head=4
Long (200-512 tokens)	n_layer=3-4, n_head=8
Very long (> 512 tokens)	n_layer=4-6, n_head=8-16

Categorical Embedding Size#

Rule of thumb: embedding_dim ≈ min(50, vocabulary_size // 2)

# For categorical variable with 100 unique values:
categorical_embedding_dim = min(50, 100 // 2) = 50

# For categorical variable with 10 unique values:
categorical_embedding_dim = min(50, 10 // 2) = 5

Summary#

torchTextClassifiers provides a component-based pipeline for text classification:

Tokenizer → Converts text to tokens
Text Embedder → Creates semantic embeddings (with optional attention)
Categorical Handler → Processes additional features (optional)
Classification Head → Produces predictions

Key Benefits:

Clear data flow through intuitive components
Mix and match for your specific needs
Start simple, add complexity as needed
Full PyTorch compatibility

Next Steps#

Tutorials: See Tutorials for step-by-step guides
API Reference: Check API Reference for detailed documentation
Examples: Explore complete examples in the repository

Ready to build your classifier? Start with Quick Start!