Tokenizers#

Text tokenization classes for converting text to numerical tokens.

Base Classes#

BaseTokenizer#

Abstract base class for all tokenizers.

class torchTextClassifiers.tokenizers.base.BaseTokenizer(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#

Bases: ABC

__init__(vocab_size, padding_idx, output_vectorized=False, output_dim=None)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

abstract tokenize(text)[source]#

Tokenizes the raw input text into a list of tokens.

Return type:

TokenizerOutput
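
A custom tokenizer can be defined by subclassing BaseTokenizer and implementing tokenize(). The whitespace tokenizer below is a minimal sketch for illustration only: WhitespaceTokenizer, its vocab argument, and the padding logic are not part of the library, and only the documented constructor arguments and TokenizerOutput fields are assumed:

import torch

from torchTextClassifiers.tokenizers.base import BaseTokenizer, TokenizerOutput


class WhitespaceTokenizer(BaseTokenizer):
    """Illustrative whitespace tokenizer (not part of the library)."""

    def __init__(self, vocab, padding_idx=0):
        super().__init__(vocab_size=len(vocab), padding_idx=padding_idx)
        self.vocab = vocab          # token -> id mapping (id 0 assumed reserved for padding)
        self.pad_idx = padding_idx  # stored locally so the sketch does not rely on base-class attributes

    def tokenize(self, text):
        texts = [text] if isinstance(text, str) else text
        id_lists = [[self.vocab.get(tok, self.pad_idx) for tok in t.split()] for t in texts]
        max_len = max(len(ids) for ids in id_lists)
        input_ids = torch.full((len(id_lists), max_len), self.pad_idx, dtype=torch.long)
        attention_mask = torch.zeros((len(id_lists), max_len), dtype=torch.long)
        for i, ids in enumerate(id_lists):
            input_ids[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
            attention_mask[i, :len(ids)] = 1
        return TokenizerOutput(input_ids=input_ids, attention_mask=attention_mask)


vocab = {"hello": 1, "world": 2}
tok = WhitespaceTokenizer(vocab)
out = tok.tokenize(["hello world", "hello"])
print(out.input_ids)        # tensor([[1, 2], [1, 0]])
print(out.attention_mask)   # tensor([[1, 1], [1, 0]])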

TokenizerOutput#

Output dataclass from tokenization.

class torchTextClassifiers.tokenizers.base.TokenizerOutput(input_ids, attention_mask, offset_mapping=None, word_ids=None)[source]#

Bases: object

Attributes

input_ids: torch.Tensor#

Token indices (batch_size, seq_len).

attention_mask: torch.Tensor#

Attention mask tensor (batch_size, seq_len).

offset_mapping: List[List[Tuple[int, int]]] | None#

Byte offsets for each token (optional, for explainability).

word_ids: List[List[int | None]] | None#

Word-level indices for each token (optional).

to_dict()[source]#

Serialize the output fields to a dictionary.

Return type:

Dict[str, Any]

classmethod from_dict(data)[source]#

Build a TokenizerOutput from a dictionary of its fields.

Return type:

TokenizerOutput

__init__(input_ids, attention_mask, offset_mapping=None, word_ids=None)#
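
A minimal usage sketch, assuming from_dict accepts the dictionary produced by to_dict (the tensor values are illustrative):

import torch

from torchTextClassifiers.tokenizers.base import TokenizerOutput

# Two sequences; the second one has a single padded trailing position
output = TokenizerOutput(
    input_ids=torch.tensor([[12, 47, 3, 9], [5, 88, 2, 0]]),
    attention_mask=torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]]),
)

# Round-trip through a plain dictionary
as_dict = output.to_dict()
restored = TokenizerOutput.from_dict(as_dict)
assert torch.equal(restored.input_ids, output.input_ids)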

Concrete Tokenizers#

NGramTokenizer#

FastText-style character n-gram tokenizer.

class torchTextClassifiers.tokenizers.ngram.NGramTokenizer(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#

Bases: BaseTokenizer

Heavily optimized FastText N-gram tokenizer with:

  • Pre-computed subword cache for the entire vocabulary

  • Vectorized batch encoding

  • Cached text normalization

  • Direct tensor operations

  • Optional offset mapping and word ID tracking

Features:

  • Character n-gram generation (customizable min/max n)

  • Subword caching for performance

  • Text cleaning and normalization (FastText style)

  • Hash-based tokenization

  • Support for special tokens, padding, truncation

PAD_TOKEN = '[PAD]'#
UNK_TOKEN = '[UNK]'#
EOS_TOKEN = '</s>'#
__init__(min_count, min_n, max_n, num_tokens, len_word_ngrams, training_text=None, preprocess=True, output_dim=None, **kwargs)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

train(training_text)[source]#

Build vocabulary from training text.

tokenize(text, return_offsets_mapping=False, return_word_ids=False, **kwargs)[source]#

Optimized tokenization with vectorized operations.

Parameters:
  • text (Union[str, List[str]]) – Single string or list of strings to tokenize

  • padding – Padding strategy (‘longest’ or ‘max_length’)

  • max_length – Maximum sequence length

  • truncation – Whether to truncate sequences exceeding max_length

  • return_offsets_mapping (bool) – If True, return character offsets for each token

  • return_word_ids (bool) – If True, return word indices for each token

Return type:

TokenizerOutput

Returns:

TokenizerOutput with input_ids, attention_mask, and optionally offset_mapping and word_ids
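
A hedged call sketch, assuming a trained NGramTokenizer instance named tokenizer (see the full example below); padding, max_length and truncation are passed through **kwargs as documented:

output = tokenizer.tokenize(
    ["Hello world!", "Text classification"],
    padding="longest",             # pad to the longest sequence in the batch
    return_offsets_mapping=True,   # character offsets per token, for explainability
    return_word_ids=True,          # word index per token
)

print(output.input_ids.shape)      # (batch_size, seq_len)
print(output.attention_mask[0])    # 1 for real tokens, 0 for padding
print(output.offset_mapping[0])    # [(start, end), ...] offsets for the first text
print(output.word_ids[0])          # word index per token (None possible for special tokens)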

decode(token_ids, skip_special_tokens=True)[source]#

Decode token IDs back to text.

Return type:

str

batch_decode(sequences, skip_special_tokens=True)[source]#

Decode multiple sequences.

Return type:

List[str]

save_pretrained(save_directory)[source]#

Save tokenizer configuration and vocabulary.

classmethod from_pretrained(directory)[source]#

Load tokenizer from saved configuration.

Example:

from torchTextClassifiers.tokenizers import NGramTokenizer

training_texts = ["Hello world!", "Text classification is easy", "More training text"]

# Create tokenizer (arguments follow the signature documented above)
tokenizer = NGramTokenizer(
    min_count=1,          # Keep tokens seen at least this many times
    min_n=3,              # Minimum character n-gram size
    max_n=6,              # Maximum character n-gram size
    num_tokens=10000,     # Number of hash-based token ids
    len_word_ngrams=2,    # Length of word n-grams
    output_dim=128,
)

# Train on corpus
tokenizer.train(training_texts)

# Tokenize
output = tokenizer.tokenize(["Hello world!", "Text classification"])
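
Decoding and persistence, continuing the example above; the directory path and the list conversion passed to decode are assumptions, not documented requirements:

# Decode token ids back to text (special tokens stripped by default)
print(tokenizer.decode(output.input_ids[0].tolist()))
print(tokenizer.batch_decode([ids.tolist() for ids in output.input_ids]))

# Save the configuration and vocabulary, then reload
tokenizer.save_pretrained("./ngram_tokenizer")                  # path is illustrative
reloaded = NGramTokenizer.from_pretrained("./ngram_tokenizer")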

WordPieceTokenizer#

WordPiece subword tokenization.

class torchTextClassifiers.tokenizers.WordPiece.WordPieceTokenizer(vocab_size, trained=False, output_dim=None)[source]#

Bases: HuggingFaceTokenizer

Features:

  • Subword tokenization strategy

  • Vocabulary learning from corpus

  • Handles unknown words gracefully

  • Efficient encoding/decoding

__init__(vocab_size, trained=False, output_dim=None)[source]#

Largely inspired by https://huggingface.co/learn/llm-course/chapter6/8

train(training_corpus, save_path=None, filesystem=None, s3_save_path=None)[source]#

Example:

from torchTextClassifiers.tokenizers import WordPieceTokenizer

# Create tokenizer
tokenizer = WordPieceTokenizer(
    vocab_size=5000,
    output_dim=128
)

# Train on corpus
tokenizer.train(training_texts)

# Tokenize
output = tokenizer(["Hello world!", "Text classification"])
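
train() also accepts the persistence arguments listed in its signature; a sketch assuming a local save path (the paths are illustrative, and the S3 variant is shown only as a comment):

# Train and persist the learned vocabulary in one step
tokenizer.train(training_texts, save_path="./wordpiece_tokenizer")

# Remote storage is also supported via filesystem= and s3_save_path=
# tokenizer.train(training_texts, filesystem=fs, s3_save_path="s3://bucket/wordpiece_tokenizer")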

HuggingFaceTokenizer#

Wrapper for HuggingFace tokenizers.

class torchTextClassifiers.tokenizers.base.HuggingFaceTokenizer(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#

Bases: BaseTokenizer

Features:

  • Access to HuggingFace pre-trained tokenizers

  • Compatible with transformer models

  • Support for special tokens

__init__(vocab_size, output_dim=None, padding_idx=None, trained=False)[source]#

Base class for tokenizers.

Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • output_vectorized (bool) – Whether the tokenizer outputs vectorized tokens (True, for instance, for a TF-IDF tokenizer).

tokenize(text, return_offsets_mapping=False, return_word_ids=False)[source]#

Tokenizes the raw input text into a list of tokens.

Return type:

list

classmethod load_from_pretrained(tokenizer_name, output_dim=None)[source]#

Load a pre-trained HuggingFace tokenizer by name.

classmethod load(load_path)[source]#

Load a saved tokenizer from a local path.

classmethod load_from_s3(s3_path, filesystem)[source]#

Load a saved tokenizer from S3 using the given filesystem.

train(*args, **kwargs)[source]#

Example:

from torchTextClassifiers.tokenizers import HuggingFaceTokenizer

# Load a pre-trained tokenizer by name (the wrapper handles the underlying
# HuggingFace tokenizer internally)
tokenizer = HuggingFaceTokenizer.load_from_pretrained(
    "bert-base-uncased",
    output_dim=128
)

# Tokenize
output = tokenizer.tokenize(["Hello world!", "Text classification"])
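
Loading from a saved configuration instead of a model name; the paths are illustrative and the s3fs filesystem is an assumption:

# From a local path containing a saved tokenizer
tokenizer = HuggingFaceTokenizer.load("./hf_tokenizer")

# From S3, given a filesystem object (e.g. s3fs)
# import s3fs
# fs = s3fs.S3FileSystem()
# tokenizer = HuggingFaceTokenizer.load_from_s3("s3://bucket/hf_tokenizer", filesystem=fs)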

Choosing a Tokenizer#

NGramTokenizer (FastText-style)

Use when:

  • You want character-level features

  • Your text has many misspellings or variations

  • You need fast training

  • You have limited vocabulary

WordPieceTokenizer

Use when:

  • You want subword-level features

  • Your vocabulary is large but manageable

  • You need good coverage with reasonable vocab size

  • You’re doing standard text classification

HuggingFaceTokenizer

Use when:

  • You want to use pre-trained tokenizers

  • You’re working with transformer models

  • You need specific language support

  • You want to fine-tune on top of BERT/RoBERTa/etc.

Tokenizer Comparison#

Feature          NGramTokenizer            WordPieceTokenizer    HuggingFaceTokenizer
---------------  ------------------------  --------------------  --------------------
Granularity      Character n-grams         Subwords              Subwords/Words
Training Speed   Fast                      Medium                Pre-trained
Vocab Size       Configurable              Configurable          Pre-defined
OOV Handling     Excellent (char-level)    Good (subwords)       Good (subwords)
Memory           Efficient                 Medium                Larger

See Also#