Tokenizers#

Text tokenization classes for converting text to numerical tokens.

Base Classes#

BaseTokenizer#

Abstract base class for all tokenizers.

class torchTextClassifiers.tokenizers.base.BaseTokenizer(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#

Bases: ABC

__init__(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

abstract tokenize(text)[source]#

Tokenizes the raw input text into a list of tokens.

Return type:

TokenizerOutput
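
A custom tokenizer can be defined by subclassing BaseTokenizer and implementing tokenize(). The whitespace tokenizer below is a minimal sketch for illustration only: WhitespaceTokenizer, its vocab argument, and the padding logic are not part of the library, and only the documented constructor arguments and TokenizerOutput fields are assumed:

import torch

from torchTextClassifiers.tokenizers.base import BaseTokenizer, TokenizerOutput


class WhitespaceTokenizer(BaseTokenizer):
    """Illustrative whitespace tokenizer (not part of the library)."""

    def __init__(self, vocab, padding_idx=0):
        super().__init__(vocab_size=len(vocab), padding_idx=padding_idx)
        self.vocab = vocab          # token -> id mapping (id 0 assumed reserved for padding)
        self.pad_idx = padding_idx  # stored locally so the sketch does not rely on base-class attributes

    def tokenize(self, text):
        texts = [text] if isinstance(text, str) else text
        id_lists = [[self.vocab.get(tok, self.pad_idx) for tok in t.split()] for t in texts]
        max_len = max(len(ids) for ids in id_lists)
        input_ids = torch.full((len(id_lists), max_len), self.pad_idx, dtype=torch.long)
        attention_mask = torch.zeros((len(id_lists), max_len), dtype=torch.long)
        for i, ids in enumerate(id_lists):
            input_ids[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
            attention_mask[i, :len(ids)] = 1
        return TokenizerOutput(input_ids=input_ids, attention_mask=attention_mask)


vocab = {"hello": 1, "world": 2}
tok = WhitespaceTokenizer(vocab)
out = tok.tokenize(["hello world", "hello"])
print(out.input_ids)        # tensor([[1, 2], [1, 0]])
print(out.attention_mask)   # tensor([[1, 1], [1, 0]])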

TokenizerOutput#

Output dataclass from tokenization.

class torchTextClassifiers.tokenizers.base.TokenizerOutput(input_ids, attention_mask, offset_mapping=None, word_ids=None)[source]#

Bases: object

Attributes

input_ids: torch.Tensor#

Token indices (batch_size, seq_len).

attention_mask: torch.Tensor#

Attention mask tensor (batch_size, seq_len).

offset_mapping: List[List[Tuple[int, int]]] | None#

Byte offsets for each token (optional, for explainability).

word_ids: List[List[int | None]] | None#

Word-level indices for each token (optional).

to_dict()[source]#

Serialize the output fields to a dictionary.

Return type:

Dict[str, Any]

classmethod from_dict(data)[source]#

Build a TokenizerOutput from a dictionary of its fields.

Return type:

TokenizerOutput

__init__(input_ids, attention_mask, offset_mapping=None, word_ids=None)#
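
A minimal usage sketch, assuming from_dict accepts the dictionary produced by to_dict (the tensor values are illustrative):

import torch

from torchTextClassifiers.tokenizers.base import TokenizerOutput

# Two sequences; the second one has a single padded trailing position
output = TokenizerOutput(
    input_ids=torch.tensor([[12, 47, 3, 9], [5, 88, 2, 0]]),
    attention_mask=torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]]),
)

# Round-trip through a plain dictionary
as_dict = output.to_dict()
restored = TokenizerOutput.from_dict(as_dict)
assert torch.equal(restored.input_ids, output.input_ids)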

Concrete Tokenizers#

NGramTokenizer#

FastText-style character n-gram tokenizer.

class torchTextClassifiers.tokenizers.ngram.NGramTokenizer(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#

Bases: BaseTokenizer

Heavily optimized FastText N-gram tokenizer with:

  • Pre-computed subword cache for the entire vocabulary

  • Vectorized batch encoding

  • Cached text normalization

  • Direct tensor operations

  • Optional offset mapping and word ID tracking

Features:

  • Character n-gram generation (customizable min/max n)

  • Subword caching for performance

  • Text cleaning and normalization (FastText style)

  • Hash-based tokenization

  • Support for special tokens, padding, truncation

PAD_TOKEN = '[PAD]'#
UNK_TOKEN = '[UNK]'#
EOS_TOKEN = '</s>'#
__init__(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

train(training_text)[source]#

Build vocabulary from training text.

tokenize(text, return_offsets_mapping=False, return_word_ids=False, **kwargs)[source]#

Optimized tokenization with vectorized operations.

Parameters:
  • text (Union[str, List[str]]) – Single string or list of strings to tokenize

  • padding – Padding strategy (‘longest’ or ‘max_length’)

  • max_length – Maximum sequence length

  • truncation – Whether to truncate sequences exceeding max_length

  • return_offsets_mapping (bool) – If True, return character offsets for each token

  • return_word_ids (bool) – If True, return word indices for each token

Return type:

TokenizerOutput

Returns:

TokenizerOutput with input_ids, attention_mask, and optionally offset_mapping and word_ids
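
A hedged call sketch, assuming a trained NGramTokenizer instance named tokenizer (see the full example below); padding, max_length and truncation are passed through **kwargs as documented:

output = tokenizer.tokenize(
    ["Hello world!", "Text classification"],
    padding="longest",             # pad to the longest sequence in the batch
    return_offsets_mapping=True,   # character offsets per token, for explainability
    return_word_ids=True,          # word index per token
)

print(output.input_ids.shape)      # (batch_size, seq_len)
print(output.attention_mask[0])    # 1 for real tokens, 0 for padding
print(output.offset_mapping[0])    # [(start, end), ...] offsets for the first text
print(output.word_ids[0])          # word index per token (None possible for special tokens)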

decode(token_ids, skip_special_tokens=True)[source]#

Decode token IDs back to text.

Return type:

str

batch_decode(sequences, skip_special_tokens=True)[source]#

Decode multiple sequences.

Return type:

List[str]

save_pretrained(save_directory)[source]#

Save tokenizer configuration and vocabulary.

classmethod from_pretrained(directory)[source]#

Load tokenizer from saved configuration.

Example:

from torchTextClassifiers.tokenizers import NGramTokenizer

training_texts = ["Hello world!", "Text classification is easy", "More training text"]

# Create tokenizer (arguments follow the signature documented above)
tokenizer = NGramTokenizer(
    min_count=1,          # Keep tokens seen at least this many times
    min_n=3,              # Minimum character n-gram size
    max_n=6,              # Maximum character n-gram size
    num_tokens=10000,     # Number of hash-based token ids
    len_word_ngrams=2,    # Length of word n-grams
    output_dim=128,
)

# Train on corpus
tokenizer.train(training_texts)

# Tokenize
output = tokenizer.tokenize(["Hello world!", "Text classification"])
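
Decoding and persistence, continuing the example above; the directory path and the list conversion passed to decode are assumptions, not documented requirements:

# Decode token ids back to text (special tokens stripped by default)
print(tokenizer.decode(output.input_ids[0].tolist()))
print(tokenizer.batch_decode([ids.tolist() for ids in output.input_ids]))

# Save the configuration and vocabulary, then reload
tokenizer.save_pretrained("./ngram_tokenizer")                  # path is illustrative
reloaded = NGramTokenizer.from_pretrained("./ngram_tokenizer")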

WordPieceTokenizer#

WordPiece subword tokenization.

class torchTextClassifiers.tokenizers.WordPiece.WordPieceTokenizer(vocab_size, trained=False, output_dim=None)[source]#

Bases: HuggingFaceTokenizer

Features:

  • Subword tokenization strategy

  • Vocabulary learning from corpus

  • Handles unknown words gracefully

  • Efficient encoding/decoding

__init__(vocab_size, trained=False, output_dim=None)[source]#

Largely inspired by https://huggingface.co/learn/llm-course/chapter6/8

train(training_corpus, save_path=None, filesystem=None, s3_save_path=None)[source]#

Example:

from torchTextClassifiers.tokenizers import WordPieceTokenizer

# Create tokenizer
tokenizer = WordPieceTokenizer(
    vocab_size=5000,
    output_dim=128
)

# Train on corpus
tokenizer.train(training_texts)

# Tokenize
output = tokenizer(["Hello world!", "Text classification"])
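
train() also accepts the persistence arguments listed in its signature; a sketch assuming a local save path (the paths are illustrative, and the S3 variant is shown only as a comment):

# Train and persist the learned vocabulary in one step
tokenizer.train(training_texts, save_path="./wordpiece_tokenizer")

# Remote storage is also supported via filesystem= and s3_save_path=
# tokenizer.train(training_texts, filesystem=fs, s3_save_path="s3://bucket/wordpiece_tokenizer")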

HuggingFaceTokenizer#

Wrapper for HuggingFace tokenizers.

class torchTextClassifiers.tokenizers.base.HuggingFaceTokenizer(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#

Bases: BaseTokenizer

Features:

  • Access to HuggingFace pre-trained tokenizers

  • Compatible with transformer models

  • Support for special tokens

__init__(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

tokenize(text, return_offsets_mapping=False, return_word_ids=False)[source]#

Tokenizes the raw input text into a list of tokens.

Return type:

list

classmethod load_from_pretrained(tokenizer_name, output_dim=None)[source]#

Load a pre-trained HuggingFace tokenizer by name.

classmethod load(load_path)[source]#

Load a saved tokenizer from a local path.

classmethod load_from_s3(s3_path, filesystem)[source]#

Load a saved tokenizer from S3 using the given filesystem.

train(*args, **kwargs)[source]#

Example:

from torchTextClassifiers.tokenizers import HuggingFaceTokenizer

# Load a pre-trained tokenizer by name (the wrapper handles the underlying
# HuggingFace tokenizer internally)
tokenizer = HuggingFaceTokenizer.load_from_pretrained(
    "bert-base-uncased",
    output_dim=128
)

# Tokenize
output = tokenizer.tokenize(["Hello world!", "Text classification"])
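
Loading from a saved configuration instead of a model name; the paths are illustrative and the s3fs filesystem is an assumption:

# From a local path containing a saved tokenizer
tokenizer = HuggingFaceTokenizer.load("./hf_tokenizer")

# From S3, given a filesystem object (e.g. s3fs)
# import s3fs
# fs = s3fs.S3FileSystem()
# tokenizer = HuggingFaceTokenizer.load_from_s3("s3://bucket/hf_tokenizer", filesystem=fs)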

Choosing a Tokenizer#

NGramTokenizer (FastText-style)

Use when:

  • You want character-level features

  • Your text has many misspellings or variations

  • You need fast training

  • You have limited vocabulary

WordPieceTokenizer

Use when:

  • You want subword-level features

  • Your vocabulary is large but manageable

  • You need good coverage with reasonable vocab size

  • You’re doing standard text classification

HuggingFaceTokenizer

Use when:

  • You want to use pre-trained tokenizers

  • You’re working with transformer models

  • You need specific language support

  • You want to fine-tune on top of BERT/RoBERTa/etc.

Tokenizer Comparison#

Feature          NGramTokenizer            WordPieceTokenizer    HuggingFaceTokenizer
---------------  ------------------------  --------------------  --------------------
Granularity      Character n-grams         Subwords              Subwords/Words
Training Speed   Fast                      Medium                Pre-trained
Vocab Size       Configurable              Configurable          Pre-defined
OOV Handling     Excellent (char-level)    Good (subwords)       Good (subwords)
Memory           Efficient                 Medium                Larger

See Also#