Dataset#

PyTorch Dataset classes for data loading.

TextClassificationDataset#

PyTorch Dataset for text classification with optional categorical features.

class torchTextClassifiers.dataset.dataset.TextClassificationDataset(texts, categorical_variables, tokenizer, labels=None, ragged_multilabel=False)[source]#

Bases: Dataset

Features:

  • Support for text data

  • Optional categorical variables

  • Optional labels (for inference)

  • Multilabel support with ragged arrays

  • Integration with tokenizers

__init__(texts, categorical_variables, tokenizer, labels=None, ragged_multilabel=False)[source]#
collate_fn(batch)[source]#
create_dataloader(batch_size, shuffle=False, drop_last=False, num_workers=3, pin_memory=False, persistent_workers=True, **kwargs)[source]#
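Given a constructed dataset instance, the create_dataloader helper builds a DataLoader directly from the dataset using the parameters in the signature above. A minimal usage sketch (assuming it wires in the dataset's own collate_fn, which is what the pairing of these two methods suggests):

# Build a DataLoader via the dataset's own helper
dataloader = dataset.create_dataloader(
    batch_size=32,
    shuffle=True,
    pin_memory=True,
)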

Parameters#

  • texts (Union[List[str], np.ndarray]) – Text samples (list or array of strings)

  • categorical_variables (Optional[np.ndarray]) – Categorical features; pass None if unused

  • tokenizer (BaseTokenizer) – Tokenizer instance

  • labels (Optional[Union[List[int], np.ndarray]]) – Labels (optional for inference)

  • ragged_multilabel (bool) – Whether labels are variable-length (ragged) multilabel lists; defaults to False

Example Usage#

Basic Text Dataset#

from torchTextClassifiers.dataset import TextClassificationDataset
from torchTextClassifiers.tokenizers import WordPieceTokenizer
import numpy as np

# Prepare data
texts = ["Text sample 1", "Text sample 2", "Text sample 3"]
labels = [0, 1, 0]

# Create tokenizer
tokenizer = WordPieceTokenizer()
tokenizer.train(texts, vocab_size=1000)

# Create dataset
dataset = TextClassificationDataset(
    texts=texts,
    categorical_variables=None,  # no categorical features
    tokenizer=tokenizer,
    labels=labels,
)

# Use with DataLoader
from torch.utils.data import DataLoader

# Pass the dataset's own collate_fn so variable-length token sequences batch correctly
dataloader = DataLoader(
    dataset, batch_size=2, shuffle=True, collate_fn=dataset.collate_fn
)

for batch in dataloader:
    input_ids, labels_batch = batch
    # Train model...
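Where "Train model..." appears above, a typical step might look like the following sketch. The model here is a hypothetical stand-in for illustration, not part of torchTextClassifiers; it assumes the vocab_size of 1000 used when training the tokenizer and that the collated input_ids and labels are long tensors:

import torch
import torch.nn as nn

# Hypothetical toy model: mean-pooled embeddings followed by a linear classifier
model = nn.Sequential(nn.EmbeddingBag(1000, 64), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

for input_ids, labels_batch in dataloader:
    logits = model(input_ids)          # (batch_size, num_classes)
    loss = loss_fn(logits, labels_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()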

Mixed Features Dataset#

import numpy as np

# Text data
texts = ["Sample 1", "Sample 2", "Sample 3"]
labels = [0, 1, 2]

# Categorical data (3 samples, 2 categorical variables)
categorical = np.array([
    [5, 2],   # Sample 1: cat1=5, cat2=2
    [3, 1],   # Sample 2: cat1=3, cat2=1
    [7, 0],   # Sample 3: cat1=7, cat2=0
])

# Create dataset
dataset = TextClassificationDataset(
    texts=texts,
    categorical_variables=categorical,
    tokenizer=tokenizer,
    labels=labels,
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=dataset.collate_fn)

# Batch returns: (input_ids, categorical_features, labels)
for batch in dataloader:
    input_ids, cat_features, labels_batch = batch
    # Train model with mixed features...

Inference Dataset#

For inference without labels:

# Create dataset without labels
inference_dataset = TextClassificationDataset(
    texts=test_texts,
    categorical_variables=None,
    tokenizer=tokenizer,
    labels=None,  # no labels for inference
)

inference_loader = DataLoader(
    inference_dataset, batch_size=32, collate_fn=inference_dataset.collate_fn
)

# Batch returns only features (no labels)
for batch in inference_loader:
    input_ids = batch
    # Make predictions...
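Filling in "Make predictions..." above, a minimal sketch, reusing the hypothetical model from the basic example:

import torch

model.eval()
predictions = []
with torch.no_grad():
    for input_ids in inference_loader:
        logits = model(input_ids)
        predictions.append(logits.argmax(dim=-1))  # predicted class per sample
predictions = torch.cat(predictions)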

Multilabel Dataset#

For multilabel classification:

# Multilabel targets (ragged arrays supported)
texts = ["Sample 1", "Sample 2", "Sample 3"]
labels = [
    [0, 1],     # Sample 1 has labels 0 and 1
    [2],        # Sample 2 has only label 2
    [0, 1, 2],  # Sample 3 has all three labels
]

# Create dataset with the ragged_multilabel flag enabled
dataset = TextClassificationDataset(
    texts=texts,
    categorical_variables=None,
    tokenizer=tokenizer,
    labels=labels,
    ragged_multilabel=True,  # labels are variable-length (ragged) lists
)

# With ragged_multilabel=True, the dataset handles ragged label arrays automatically
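To train with a standard multilabel loss such as BCEWithLogitsLoss, ragged label lists typically need to become fixed-size multi-hot vectors. A sketch of that conversion (this helper is illustrative, not part of the library):

import torch

def to_multi_hot(ragged_labels, num_classes):
    # Turn variable-length label lists into (num_samples, num_classes) 0/1 targets
    targets = torch.zeros(len(ragged_labels), num_classes)
    for i, sample_labels in enumerate(ragged_labels):
        targets[i, sample_labels] = 1.0
    return targets

targets = to_multi_hot(labels, num_classes=3)
# tensor([[1., 1., 0.],
#         [0., 0., 1.],
#         [1., 1., 1.]])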

DataLoader Integration#

The dataset integrates seamlessly with PyTorch DataLoader:

from torch.utils.data import DataLoader

# Create dataset
dataset = TextClassificationDataset(
    texts=X_text, categorical_variables=None, tokenizer=tokenizer, labels=y
)

# Create dataloader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,  # for GPU training
    collate_fn=dataset.collate_fn,
)

# Iterate
for batch_idx, batch in enumerate(dataloader):
    # Process batch...
    pass

Batch Format#

The dataset returns different batch formats depending on configuration:

Text only:

input_ids = batch
# Shape: (batch_size, seq_len)

Text + labels:

input_ids, labels = batch
# input_ids shape: (batch_size, seq_len)
# labels shape: (batch_size,)

Text + categorical + labels:

input_ids, categorical_features, labels = batch
# input_ids shape: (batch_size, seq_len)
# categorical_features shape: (batch_size, num_categorical_vars)
# labels shape: (batch_size,)
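If one loop must handle more than one of these layouts, the tuple length can be used to dispatch. A small sketch, assuming (as stated above) that text-only batches are bare tensors:

for batch in dataloader:
    if not isinstance(batch, (tuple, list)):
        input_ids = batch                                 # text only
    elif len(batch) == 2:
        input_ids, labels = batch                         # text + labels
    else:
        input_ids, categorical_features, labels = batch   # text + categorical + labels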

Custom Collation#

For advanced use cases, you can provide a custom collate function:

def custom_collate_fn(batch):
    # batch is a list of individual dataset items (the outputs of __getitem__)
    ...
    return custom_batch

dataloader = DataLoader(
    dataset,
    batch_size=32,
    collate_fn=custom_collate_fn
)
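A concrete sketch, assuming each dataset item is a (token_ids, label) pair with token_ids as a 1-D tensor and label as an integer; the item layout is an assumption, so check the dataset's __getitem__ before relying on it:

import torch
from torch.nn.utils.rnn import pad_sequence

def custom_collate_fn(batch):
    # Assumed item layout: (token_ids tensor, integer label)
    token_ids, labels = zip(*batch)
    # Pad variable-length sequences to the longest in the batch
    input_ids = pad_sequence(token_ids, batch_first=True, padding_value=0)
    return input_ids, torch.tensor(labels)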

Memory Considerations#

For large datasets:

1. Use generators:

def text_generator(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

# Stream and preprocess line by line, materializing only the cleaned texts
texts = list(text_generator("large_corpus.txt"))

2. Increase num_workers:

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8  # Parallel data loading
)

3. Pin memory for GPU:

dataloader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True  # Faster GPU transfer
)

See Also#