torchTextClassifiers Wrapper#
The main wrapper class for text classification tasks.
Main Class#
- class torchTextClassifiers.torchTextClassifiers.torchTextClassifiers(tokenizer, model_config, ragged_multilabel=False)[source]#
Bases: object
Generic text classifier framework supporting multiple architectures.
Given a tokenizer and model configuration, this class initializes:
- Text embedding layer (if needed)
- Categorical variable embedding network (if categorical variables are provided)
- Classification head
The resulting model can be trained using PyTorch Lightning and used for predictions.
Methods
- __init__(tokenizer, model_config, ragged_multilabel=False)[source]#
Initialize the torchTextClassifiers instance.
- Parameters:
tokenizer (BaseTokenizer) – A tokenizer instance for text preprocessing
model_config (ModelConfig) – Configuration parameters for the text classification model
Example
>>> from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers
>>> # Assume tokenizer is a trained BaseTokenizer instance
>>> model_config = ModelConfig(
...     embedding_dim=10,
...     categorical_vocabulary_sizes=[30, 25],
...     categorical_embedding_dims=[10, 5],
...     num_classes=10,
... )
>>> ttc = torchTextClassifiers(
...     tokenizer=tokenizer,
...     model_config=model_config,
... )
- train(X_train, y_train, training_config, X_val=None, y_val=None, verbose=False)[source]#
Train the classifier using PyTorch Lightning.
This method handles the complete training process, including:
- Data validation and preprocessing
- Dataset and DataLoader creation
- PyTorch Lightning trainer setup with callbacks
- Model training with early stopping
- Best model loading after training
- Parameters:
X_train (ndarray) – Training input data
y_train (ndarray) – Training labels
training_config (TrainingConfig) – Configuration parameters for training
X_val (ndarray, optional) – Validation input data
y_val (ndarray, optional) – Validation labels
verbose (bool) – Whether to print training progress information
- Return type:
None
Example
>>> training_config = TrainingConfig(
...     lr=1e-3,
...     batch_size=4,
...     num_epochs=1,
... )
>>> ttc.train(
...     X_train=X,
...     y_train=Y,
...     X_val=X,
...     y_val=Y,
...     training_config=training_config,
... )
- predict(X_test, top_k=1, explain=False)[source]#
Predict the most likely classes for the given inputs.
- Parameters:
X_test (np.ndarray) – Input data to predict on, of shape (N, d), where the first column contains text and the remaining columns are categorical variables
top_k (int) – For each input, return the top_k most likely predictions (default: 1)
explain (bool) – Run gradient-based attribution (integrated gradients) to explain the predictions (default: False)
- Returns: A dictionary containing the following fields:
predictions (torch.Tensor, shape (len(text), top_k)): A tensor containing the top_k most likely codes for each query.
confidence (torch.Tensor, shape (len(text), top_k)): A tensor containing the corresponding confidence scores.
- If explain is True, the dictionary also contains:
attributions (torch.Tensor, shape (len(text), top_k, seq_len)): A tensor containing the attributions for each token in the text.
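Example
A minimal sketch, assuming a trained classifier ttc whose model uses two categorical variables; the concrete values below are illustrative only and follow the (N, d) input layout described above.
>>> import numpy as np
>>> X_test = np.array([
...     ["good product", 3, 12],
...     ["bad service", 1, 7],
... ], dtype=object)
>>> result = ttc.predict(X_test, top_k=2, explain=True)
>>> predictions = result["predictions"]    # shape (2, 2): top 2 codes per input
>>> confidence = result["confidence"]      # shape (2, 2): matching confidence scores
>>> attributions = result["attributions"]  # shape (2, 2, seq_len): per-token attributions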
Usage Example#
import numpy as np

from torchTextClassifiers import torchTextClassifiers, ModelConfig, TrainingConfig
from torchTextClassifiers.tokenizers import WordPieceTokenizer

# Toy training data
texts = np.array(["good product", "bad service", "great quality", "poor design"])
labels = np.array([1, 0, 1, 0])

# Create and train tokenizer
tokenizer = WordPieceTokenizer()
tokenizer.train(texts, vocab_size=1000)
# Configure model
model_config = ModelConfig(embedding_dim=64, num_classes=2)
training_config = TrainingConfig(num_epochs=10, batch_size=16, lr=1e-3)
# Create and train classifier
classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
classifier.train(X_train=texts, y_train=labels, training_config=training_config)
# Make predictions: predict() returns a dict with "predictions" and "confidence"
new_texts = np.array(["excellent value", "awful experience"])
result = classifier.predict(new_texts)
predictions = result["predictions"]
confidence = result["confidence"]
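When the model is configured with categorical variables, training and prediction inputs combine text and categorical codes in a single array: first column text, remaining columns categorical, matching the (N, d) layout documented for predict(). The sketch below reuses the tokenizer and training_config from the example above; the vocabulary sizes and values are illustrative assumptions, not prescribed by the library.

# Model with two categorical variables (illustrative vocabulary sizes)
model_config = ModelConfig(
    embedding_dim=64,
    categorical_vocabulary_sizes=[30, 25],
    categorical_embedding_dims=[10, 5],
    num_classes=2,
)

# First column: text; remaining columns: categorical codes
X_train = np.array([
    ["good product", 3, 12],
    ["bad service", 1, 7],
], dtype=object)
y_train = np.array([1, 0])

classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
classifier.train(X_train=X_train, y_train=y_train, training_config=training_config)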
See Also#
Configuration Classes - ModelConfig and TrainingConfig options
Tokenizers - Tokenizer options
Core Models - Underlying PyTorch models