torchTextClassifiers Wrapper#

The main wrapper class for text classification tasks.

Main Class#

class torchTextClassifiers.torchTextClassifiers.torchTextClassifiers(tokenizer, model_config, ragged_multilabel=False, value_encoder=None)[source]#

Bases: object

Generic text classifier framework supporting multiple architectures.

Given a tokenizer and model configuration, this class initializes:

  • Text embedding layer (if needed)

  • Categorical variable embedding network (if categorical variables are provided)

  • Classification head

The resulting model can be trained using PyTorch Lightning and used for predictions.

Methods

train

Train the classifier using PyTorch Lightning.

predict

Predict the most likely classes for input data.

save

Save the complete torchTextClassifiers instance to disk.

load

Load a torchTextClassifiers instance from disk.

__init__(tokenizer, model_config, ragged_multilabel=False, value_encoder=None)[source]#

Initialize the torchTextClassifiers instance.

Parameters:
  • tokenizer (BaseTokenizer) – A tokenizer instance for text preprocessing

  • model_config (ModelConfig) – Configuration parameters for the text classification model

  • ragged_multilabel (bool) – Whether to use ragged multilabel classification

  • value_encoder (ValueEncoder | None) – Optional ValueEncoder for encoding raw string (or mixed) categorical values to integers. Build it beforehand from DictEncoder or sklearn LabelEncoder instances and pass it here. If None, categorical columns in X must already be integer-encoded.

Example

>>> from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers
>>> from torchTextClassifiers.value_encoder import ValueEncoder, DictEncoder
>>> # Build one DictEncoder per categorical feature
>>> encoders = {str(i): DictEncoder({v: j for j, v in enumerate(sorted(set(X_categorical[:, i])))})
...             for i in range(X_categorical.shape[1])}
>>> encoder = ValueEncoder(encoders)
>>> model_config = ModelConfig(
...     embedding_dim=10,
...     categorical_vocabulary_sizes=encoder.vocabulary_sizes,
...     categorical_embedding_dims=[10, 5],
...     num_classes=10,
... )
>>> ttc = torchTextClassifiers(
...     tokenizer=tokenizer,
...     model_config=model_config,
...     value_encoder=encoder,
... )
classmethod from_model(tokenizer, pytorch_model, value_encoder=None, ragged_multilabel=False)[source]#

Initialize torchTextClassifiers from a custom pre-built PyTorch model.

Use this when the standard TextClassificationModel (built automatically from ModelConfig) cannot express your architecture — for example when you need multiple classification heads, shared encoders across tasks, or any other custom topology. The wrapper then provides the usual predict / save / load interface around your model.

Required interface for pytorch_model:

  1. ``forward`` signature — the model must accept exactly these keyword arguments (extra **kwargs are forwarded but ignored by the wrapper):

    def forward(
        self,
        input_ids: torch.Tensor,        # (batch, seq_len)  Long
        attention_mask: torch.Tensor,   # (batch, seq_len)  int
        categorical_vars: torch.Tensor, # (batch, n_cats)   Long  — may be None
        **kwargs,
    ) -> torch.Tensor | list[torch.Tensor]:
        ...
    

    The return value must be raw logits (not softmaxed). For standard single-task classification return a tensor of shape (batch, num_classes). For multi-task classification you may return a list of such tensors, one per task.

  2. ``num_classes`` attribute — must be an int (single task) or a list[int] (multi-task, one entry per task head).

  3. ``categorical_variable_net`` attribute — the CategoricalVariableNet module used by the model, or None if no categorical features are used. The wrapper reads categorical_variable_net.categorical_vocabulary_sizes to set up the data pipeline.

See torchTextClassifiers.contrib for ready-made example architectures (MultiLevelTextClassificationModel, MultiLevelCrossEntropyLoss) that follow this interface.

Parameters:
  • tokenizer (BaseTokenizer) – A tokenizer instance for text preprocessing.

  • pytorch_model (Module) – A pre-built PyTorch model satisfying the interface above.

  • value_encoder (ValueEncoder | None) – Optional ValueEncoder for encoding raw string (or mixed) categorical values to integers. Build it from DictEncoder or sklearn LabelEncoder instances and pass it here. If None, categorical columns in X must already be integer-encoded.

  • ragged_multilabel (bool | None) – Set to True for ragged multi-label targets (variable number of labels per sample).

Returns:

An instance of torchTextClassifiers wrapping the provided model.
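
Example

A minimal single-task sketch of a model satisfying this interface (the class name, layer sizes, and the pre-existing tokenizer are illustrative assumptions; no categorical features are used):

>>> from torch import nn
>>> class TinyTextClassifier(nn.Module):
...     def __init__(self, vocab_size, embedding_dim, num_classes):
...         super().__init__()
...         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
...         self.head = nn.Linear(embedding_dim, num_classes)
...         self.num_classes = num_classes          # required: int (single task)
...         self.categorical_variable_net = None    # required: None, no categorical features
...     def forward(self, input_ids, attention_mask, categorical_vars=None, **kwargs):
...         embedded = self.embedding(input_ids)         # (batch, seq_len, dim)
...         mask = attention_mask.unsqueeze(-1).float()  # (batch, seq_len, 1)
...         pooled = (embedded * mask).sum(1) / mask.sum(1).clamp(min=1.0)  # mean over real tokens
...         return self.head(pooled)                     # raw logits, shape (batch, num_classes)
>>> ttc = torchTextClassifiers.from_model(
...     tokenizer=tokenizer,
...     pytorch_model=TinyTextClassifier(vocab_size=1000, embedding_dim=32, num_classes=2),
... )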

train(X_train, y_train, training_config, X_val=None, y_val=None, verbose=False)[source]#

Train the classifier using PyTorch Lightning.

This method handles the complete training process, including:

  • Data validation and preprocessing

  • Dataset and DataLoader creation

  • PyTorch Lightning trainer setup with callbacks

  • Model training with early stopping

  • Best model loading after training

Note on Checkpoints:

After training, the best model checkpoint is automatically loaded. This checkpoint contains the full training state (model weights, optimizer, and scheduler state). Loading uses weights_only=False as the checkpoint is self-generated and trusted.

Parameters:
  • X_train (ndarray) – Training input data

  • y_train (ndarray) – Training labels

  • training_config (TrainingConfig) – Configuration parameters for training

  • X_val (ndarray | None) – Validation input data

  • y_val (ndarray | None) – Validation labels

  • verbose (bool) – Whether to print training progress information

Return type:

None

Example

>>> training_config = TrainingConfig(
...     lr=1e-3,
...     batch_size=4,
...     num_epochs=1,
... )
>>> ttc.train(
...     X_train=X,
...     y_train=Y,
...     X_val=X,
...     y_val=Y,
...     training_config=training_config,
... )
predict(X_test, raw_categorical_inputs=True, top_k=1, explain_with_label_attention=False, explain_with_captum=False)[source]#

Predict the most likely classes for the given inputs.

Parameters:
  • X_test (np.ndarray) – Input data to predict on, of shape (N, d), where the first column is text and the remaining columns are categorical variables

  • top_k (int) – For each input, return the top_k most likely predictions (default: 1)

  • explain_with_label_attention (bool) – If enabled, use the label-to-token attention matrix to produce an explanation of the prediction (default: False)

  • explain_with_captum (bool) – If enabled, run gradient-based attribution with Captum to explain the prediction (default: False)

Returns: A dictionary containing the following fields:
  • predictions (torch.Tensor, shape (len(text), top_k)): A tensor containing the top_k most likely predicted classes for each input.

  • confidence (torch.Tensor, shape (len(text), top_k)): A tensor containing the corresponding confidence scores.

  • if either explanation option is enabled:
    • attributions (torch.Tensor, shape (len(text), top_k, seq_len)): A tensor containing the attributions for each token in the text.
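
Example

A minimal usage sketch, assuming ttc is a trained classifier and X_test is shaped as described above:

>>> result = ttc.predict(X_test, top_k=3)
>>> result["predictions"]  # shape (len(X_test), 3): most likely classes per input
>>> result["confidence"]   # shape (len(X_test), 3): corresponding confidence scores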

save(path)[source]#

Save the complete torchTextClassifiers instance to disk.

This saves:

  • Model configuration

  • Tokenizer state

  • PyTorch Lightning checkpoint (if trained)

  • All other instance attributes

Parameters:

path (str | Path) – Directory path where the model will be saved

Return type:

None

Example

>>> ttc = torchTextClassifiers(tokenizer, model_config)
>>> ttc.train(X_train, y_train, training_config)
>>> ttc.save("my_model")
classmethod load(path, device='auto')[source]#

Load a torchTextClassifiers instance from disk.

Parameters:
  • path (str | Path) – Directory path where the model was saved

  • device (str) – Device to load the model on (‘auto’, ‘cpu’, ‘cuda’, etc.)

Return type:

torchTextClassifiers

Returns:

Loaded torchTextClassifiers instance

Example

>>> loaded_ttc = torchTextClassifiers.load("my_model")
>>> predictions = loaded_ttc.predict(X_test)

Usage Example#

from torchTextClassifiers import torchTextClassifiers, ModelConfig, TrainingConfig
from torchTextClassifiers.tokenizers import WordPieceTokenizer

# Create tokenizer
tokenizer = WordPieceTokenizer()
tokenizer.train(texts, vocab_size=1000)

# Configure model
model_config = ModelConfig(embedding_dim=64, num_classes=2)
training_config = TrainingConfig(num_epochs=10, batch_size=16, lr=1e-3)

# Create and train classifier
classifier = torchTextClassifiers(tokenizer=tokenizer, model_config=model_config)
classifier.train(X_train=texts, y_train=labels, training_config=training_config)

# Make predictions: predict returns a dict of tensors
result = classifier.predict(new_texts)
predictions = result["predictions"]   # most likely class per input
confidence = result["confidence"]     # corresponding confidence scores

See Also#