Tokenizers#
Text tokenization classes for converting text to numerical tokens.
Base Classes#
BaseTokenizer#
Abstract base class for all tokenizers.
- class torchTextClassifiers.tokenizers.base.BaseTokenizer(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#
Bases: ABC
- __init__(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#
Base class for tokenizers.
- Parameters:
vocab_size (int) – Size of the vocabulary.
output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).
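The snippet below is a minimal sketch of a custom tokenizer built on this base class. It assumes that tokenize() is the abstract method subclasses must implement, that it returns a TokenizerOutput, and that the base __init__ stores vocab_size and padding_idx as attributes; the WhitespaceHashTokenizer name and its whitespace/hashing logic are purely illustrative, not part of the library.
from typing import List, Optional, Union
import torch
from torchTextClassifiers.tokenizers.base import BaseTokenizer, TokenizerOutput

class WhitespaceHashTokenizer(BaseTokenizer):
    """Hypothetical tokenizer: splits on whitespace and hashes each word into a fixed-size vocabulary."""

    def __init__(self, vocab_size: int, padding_idx: int = 0, output_dim: Optional[int] = None):
        super().__init__(vocab_size, padding_idx, output_vectorized=False, output_dim=output_dim)

    def tokenize(self, text: Union[str, List[str]], **kwargs) -> TokenizerOutput:
        texts = [text] if isinstance(text, str) else text
        # Hash each whitespace-separated word into the vocabulary range
        ids = [[hash(w) % self.vocab_size for w in t.split()] for t in texts]
        max_len = max(len(seq) for seq in ids)
        input_ids = torch.full((len(ids), max_len), self.padding_idx, dtype=torch.long)
        attention_mask = torch.zeros((len(ids), max_len), dtype=torch.long)
        for i, seq in enumerate(ids):
            input_ids[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
            attention_mask[i, :len(seq)] = 1
        return TokenizerOutput(input_ids=input_ids, attention_mask=attention_mask)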
TokenizerOutput#
Output dataclass from tokenization.
- class torchTextClassifiers.tokenizers.base.TokenizerOutput(input_ids, attention_mask, offset_mapping=None, word_ids=None)[source]#
Bases: object
Attributes
- input_ids: torch.Tensor#
Token indices (batch_size, seq_len).
- attention_mask: torch.Tensor#
Attention mask tensor (batch_size, seq_len).
- offset_mapping: List[List[Tuple[int, int]]] | None#
Byte offsets for each token (optional, for explainability).
- __init__(input_ids, attention_mask, offset_mapping=None, word_ids=None)#
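As a quick illustration of these fields, the dataclass can be constructed and inspected directly; the tensor values below are made up:
import torch
from torchTextClassifiers.tokenizers.base import TokenizerOutput

output = TokenizerOutput(
    input_ids=torch.tensor([[12, 7, 3, 0]]),       # (batch_size, seq_len)
    attention_mask=torch.tensor([[1, 1, 1, 0]]),   # 0 marks padding positions
)
print(output.input_ids.shape)   # torch.Size([1, 4])
print(output.offset_mapping)    # None unless requested at tokenization time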
Concrete Tokenizers#
NGramTokenizer#
FastText-style character n-gram tokenizer.
- class torchTextClassifiers.tokenizers.ngram.NGramTokenizer(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#
Bases: BaseTokenizer
Heavily optimized FastText N-gram tokenizer with:
- Pre-computed subword cache for the entire vocabulary
- Vectorized batch encoding
- Cached text normalization
- Direct tensor operations
- Optional offset mapping and word ID tracking
Features:
Character n-gram generation (customizable min/max n; illustrated after this list)
Subword caching for performance
Text cleaning and normalization (FastText style)
Hash-based tokenization
Support for special tokens, padding, truncation
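To make the character n-gram generation concrete, here is a small standalone sketch of FastText-style n-grams with < and > word boundary markers. It is illustrative only, not the tokenizer's actual implementation, and the char_ngrams helper is hypothetical:
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list:
    """Illustrative FastText-style character n-grams with boundary markers."""
    token = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams("hello", min_n=3, max_n=4))
# ['<he', 'hel', 'ell', 'llo', 'lo>', '<hel', 'hell', 'ello', 'llo>']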
- PAD_TOKEN = '[PAD]'#
- UNK_TOKEN = '[UNK]'#
- EOS_TOKEN = '</s>'#
- __init__(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#
Base class for tokenizers.
- Parameters:
vocab_size (int) – Size of the vocabulary.
output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).
- tokenize(text, return_offsets_mapping=False, return_word_ids=False, **kwargs)[source]#
Optimized tokenization with vectorized operations.
- Parameters:
text (Union[str, List[str]]) – Single string or list of strings to tokenize
padding – Padding strategy ('longest' or 'max_length')
max_length – Maximum sequence length
truncation – Whether to truncate sequences exceeding max_length
return_offsets_mapping (bool) – If True, return character offsets for each token
return_word_ids (bool) – If True, return word indices for each token
- Return type:
TokenizerOutput
- Returns:
TokenizerOutput with input_ids, attention_mask, and optionally offset_mapping and word_ids
Example:
from torchTextClassifiers.tokenizers import NGramTokenizer

# Create tokenizer (parameter names follow the constructor signature above;
# the values here are illustrative)
tokenizer = NGramTokenizer(
    min_count=1,
    min_n=3,             # Minimum n-gram size
    max_n=6,             # Maximum n-gram size
    num_tokens=10000,
    len_word_ngrams=2,
    output_dim=128,
)

# Train on corpus
tokenizer.train(training_texts)

# Tokenize
output = tokenizer(["Hello world!", "Text classification"])
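The explainability options documented above can be requested on the same tokenizer; a minimal sketch, assuming it has been trained as shown:
# Request character offsets and word indices alongside the token ids
output = tokenizer.tokenize(
    ["Hello world!"],
    return_offsets_mapping=True,
    return_word_ids=True,
)

print(output.input_ids.shape)     # (batch_size, seq_len)
print(output.attention_mask[0])   # 1 for real tokens, 0 for padding
print(output.offset_mapping[0])   # (start, end) offsets for each token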
WordPieceTokenizer#
WordPiece subword tokenization.
- class torchTextClassifiers.tokenizers.WordPiece.WordPieceTokenizer(vocab_size, trained=False, output_dim=None)[source]#
Bases: HuggingFaceTokenizer
Features:
Subword tokenization strategy (illustrated after this list)
Vocabulary learning from corpus
Handles unknown words gracefully
Efficient encoding/decoding
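To make the subword strategy concrete, here is a standalone sketch of greedy longest-match-first WordPiece splitting. It is illustrative only, not the library's implementation; the wordpiece_split helper and the tiny vocabulary are hypothetical:
def wordpiece_split(word, vocab):
    """Illustrative greedy longest-match-first WordPiece splitting."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            # Continuation pieces are conventionally prefixed with '##'
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known subword covers this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"token", "##ization", "##ize", "##s"}
print(wordpiece_split("tokenization", vocab))  # ['token', '##ization']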
- __init__(vocab_size, trained=False, output_dim=None)[source]#
Largely inspired by https://huggingface.co/learn/llm-course/chapter6/8
Example:
from torchTextClassifiers.tokenizers import WordPieceTokenizer
# Create tokenizer
tokenizer = WordPieceTokenizer(
vocab_size=5000,
output_dim=128
)
# Train on corpus
tokenizer.train(training_texts)
# Tokenize
output = tokenizer(["Hello world!", "Text classification"])
HuggingFaceTokenizer#
Wrapper for HuggingFace tokenizers.
- class torchTextClassifiers.tokenizers.base.HuggingFaceTokenizer(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#
Bases: BaseTokenizer
Features:
Access to HuggingFace pre-trained tokenizers
Compatible with transformer models
Support for special tokens
- __init__(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#
Base class for tokenizers.
- Parameters:
vocab_size (int) – Size of the vocabulary.
output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).
Example:
from torchTextClassifiers.tokenizers import HuggingFaceTokenizer
from transformers import AutoTokenizer
# Load pre-trained tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Wrap in our interface
tokenizer = HuggingFaceTokenizer(
tokenizer=hf_tokenizer,
output_dim=128
)
# Tokenize
output = tokenizer(["Hello world!", "Text classification"])
Choosing a Tokenizer#
NGramTokenizer (FastText-style)
Use when:
You want character-level features
Your text has many misspellings or variations
You need fast training
You have a limited vocabulary
WordPieceTokenizer
Use when:
You want subword-level features
Your vocabulary is large but manageable
You need good coverage with reasonable vocab size
You’re doing standard text classification
HuggingFaceTokenizer
Use when:
You want to use pre-trained tokenizers
You’re working with transformer models
You need specific language support
You want to fine-tune on top of BERT/RoBERTa/etc.
Tokenizer Comparison#
| Feature | NGramTokenizer | WordPieceTokenizer | HuggingFaceTokenizer |
|---|---|---|---|
| Granularity | Character n-grams | Subwords | Subwords/Words |
| Training Speed | Fast | Medium | Pre-trained |
| Vocab Size | Configurable | Configurable | Pre-defined |
| OOV Handling | Excellent (char-level) | Good (subwords) | Good (subwords) |
| Memory | Efficient | Medium | Larger |
See Also#
torchTextClassifiers Wrapper - Using tokenizers with the wrapper
Dataset - How tokenizers are used in datasets
Binary Classification Tutorial - Tokenizer tutorial