NTokenizers
Lightweight stream tokenizers for syntax highlighting and formatting. A building block for render logic and for handling AI responses.
NTokenizers sits in the middle of a tokenization pipeline — it takes raw source code or markup as a stream and emits a sequence of typed tokens that a downstream renderer can consume:
┌────────┐    stream     ┌──────────────┐   tokens    ┌────────────┐
│ source │ ────────────► │ NTokenizers  │ ──────────► │  Renderer  │ ──► styled output
└────────┘               └──────────────┘             └────────────┘
This separation of concerns means NTokenizers stays format-focused while rendering is delegated to the consumer — whether that is a console UI, a web component, or a custom formatter. The stream-first design ensures low memory usage and real-time compatibility with AI chat outputs, CI logs, or any scenario where data arrives incrementally.
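The split between tokenizing and rendering can be illustrated with a toy example. This is a minimal sketch that uses only the .NET base class library; `ToyTokenizer`, `ToyToken`, `ToyTokenType`, and `ToyRenderer` are hypothetical names invented for illustration and are not part of NTokenizers or a description of how it is implemented.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical token kinds for this toy example only.
public enum ToyTokenType { Word, Number, Other }

public readonly record struct ToyToken(ToyTokenType Type, string Value);

public static class ToyTokenizer
{
    // Reads one character at a time and invokes onToken as soon as a token
    // is complete, so only the current token is held in memory.
    public static void Parse(TextReader reader, Action<ToyToken> onToken)
    {
        var buffer = new StringBuilder();
        ToyTokenType? current = null;

        int c;
        while ((c = reader.Read()) != -1)
        {
            var ch = (char)c;
            var type = char.IsLetter(ch) ? ToyTokenType.Word
                     : char.IsDigit(ch) ? ToyTokenType.Number
                     : ToyTokenType.Other;

            // Emit the pending token when the character class changes.
            // Characters of type Other are always emitted one by one.
            if (current is not null && (type != current || type == ToyTokenType.Other))
            {
                onToken(new ToyToken(current.Value, buffer.ToString()));
                buffer.Clear();
            }

            buffer.Append(ch);
            current = type;
        }

        // Flush whatever is left when the input ends.
        if (current is not null)
            onToken(new ToyToken(current.Value, buffer.ToString()));
    }
}

public static class ToyRenderer
{
    // The consumer side: maps a token to styled text (plain tags here).
    public static string Render(ToyToken token) => token.Type switch
    {
        ToyTokenType.Word => $"<w>{token.Value}</w>",
        ToyTokenType.Number => $"<n>{token.Value}</n>",
        _ => token.Value
    };
}
```

The tokenizer knows nothing about styling and the renderer knows nothing about parsing, so either side can be swapped independently. For example, `ToyTokenizer.Parse(new StringReader("abc 42!"), t => Console.Write(ToyRenderer.Render(t)));` writes each styled fragment as soon as its token is complete.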
Used by
- NTokenizers.Extensions.Spectre.Console: Spectre.Console rendering extensions for NTokenizers, providing style-rich console syntax highlighting.
Supported Formats
NTokenizers provides a collection of stream-capable tokenizers for processing structured text. Each tokenizer breaks input down into meaningful tokens as the data arrives, which makes it suitable for large files or streaming data that should not be loaded into memory all at once.
The library supports the following formats:
- Markup languages: Markdown, HTML
- Data formats: JSON, YAML, TOML, XML
- Programming languages: C#, C, C++, Go, Java, Kotlin, Python, Rust, SQL, Swift, TypeScript, CSS
The MarkdownTokenizer acts as a composite tokenizer, delegating code blocks to the appropriate sub-tokenizer based on the language tag. This allows seamless parsing of documents that mix multiple formats in a single pass.
Quick Start
Initialize any tokenizer and start parsing a stream:
// Use any tokenizer: replace [Language] with the target format
await [Language]Tokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Handle tokens as they arrive
});
Example with the JSON tokenizer:
await JsonTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Escape Spectre.Console markup characters in the raw token text
    var value = Markup.Escape(token.Value);

    // Pick a color per token type; anything else is written unstyled
    var colored = token.TokenType switch
    {
        JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
        JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
        JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
        _ => new Markup(value)
    };
    AnsiConsole.Write(colored);
});
The MarkdownTokenizer delegates code blocks to sub-tokenizers automatically:
await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is ICodeBlockMetadata codeBlock)
    {
        await codeBlock.RegisterInlineTokenHandler(inlineToken =>
        {
            // Handle code block tokens with syntax highlighting
        });
    }
    // Handle regular markdown tokens
});
Overview
NTokenizers is a .NET library written in C# that provides tokenizers for structured text formats such as Markdown, HTML, JSON, YAML, TOML, XML, C#, C, C++, Go, Java, Kotlin, Python, Rust, SQL, Swift, TypeScript and CSS. The Tokenize method is the core functionality: it breaks structured text down into meaningful components (tokens) for processing. Its key feature is stream processing: it handles data as it arrives in real time, which makes it suitable for large files or streaming data that should not be loaded into memory all at once.
Warning
These tokenizers are not validation-based and are primarily intended for prettifying, formatting, or visualizing structured text. They do not perform strict validation of the input format, so they may produce unexpected results when processing malformed or invalid XML, JSON, or HTML. Use them with caution when dealing with untrusted or poorly formatted input.
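When the input is untrusted, one option is to validate it with a dedicated parser before handing it to a tokenizer for display. The following is a minimal sketch for JSON using `System.Text.Json` from the .NET base class library; `JsonPrecheck` and `IsWellFormedJson` are hypothetical names chosen for illustration and are not part of NTokenizers.

```csharp
using System;
using System.Text.Json;

public static class JsonPrecheck
{
    // Returns true when the text is well-formed JSON according to System.Text.Json.
    // This is a separate validation step, not something the tokenizers do themselves.
    public static bool IsWellFormedJson(string text)
    {
        try
        {
            // JsonDocument.Parse throws JsonException on malformed input.
            using var doc = JsonDocument.Parse(text);
            return true;
        }
        catch (JsonException)
        {
            return false;
        }
    }
}
```

Note the trade-off: this check needs the complete text up front, so it gives up the streaming benefit. It fits cases where correctness matters more than incremental output.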
String output
var result = await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle tokens here */ });
In addition to streaming tokens to the callback, ParseAsync returns the original input as a string for convenience.
Code-specific Tokenizers
Individual tokenizers are available for each supported format:
| Language | Page |
|---|---|
| Markdown | Markdown Tokenizer |
| HTML | HTML Tokenizer |
| JSON | JSON Tokenizer |
| YAML | YAML Tokenizer |
| TOML | TOML Tokenizer |
| XML | XML Tokenizer |
| C# | CSharp Tokenizer |
| C | C Tokenizer |
| C++ | C++ Tokenizer |
| Go | Go Tokenizer |
| Java | Java Tokenizer |
| Kotlin | Kotlin Tokenizer |
| Python | Python Tokenizer |
| Rust | Rust Tokenizer |
| SQL | SQL Tokenizer |
| Swift | Swift Tokenizer |
| TypeScript | TypeScript Tokenizer |
| CSS | CSS Tokenizer |
Features
- Stream Processing: Can handle large files or real-time data streams without loading everything into memory
- Real-time Parsing: Processes tokens as they are encountered
- Flexible Input: Supports various input sources including streams, readers, and strings
- Rich Token Information: Provides detailed token type information for precise handling
NTokenizers is especially suited to AI chat streams: it processes model output as it arrives, so streaming responses and chat conversations can be handled without buffering the entire response.
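To try a stream consumer against input that arrives in pieces, without a live model or network connection, a small test stream can help. The sketch below is an assumption-laden illustration built only on the .NET base class library: `ChunkedStream` is a hypothetical helper, not part of NTokenizers, and it returns at most one chunk per `Read` call to mimic incremental arrival.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: a read-only stream that hands out at most one chunk
// per Read call, mimicking a response that arrives piece by piece.
public sealed class ChunkedStream : Stream
{
    private readonly Queue<byte[]> _chunks = new();
    private byte[] _current = Array.Empty<byte>();
    private int _offset;

    public ChunkedStream(IEnumerable<string> chunks)
    {
        foreach (var chunk in chunks)
            _chunks.Enqueue(Encoding.UTF8.GetBytes(chunk));
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Advance to the next non-empty chunk once the current one is used up.
        while (_offset == _current.Length)
        {
            if (_chunks.Count == 0) return 0; // end of stream
            _current = _chunks.Dequeue();
            _offset = 0;
        }

        // Never return data from more than one chunk in a single call.
        int n = Math.Min(count, _current.Length - _offset);
        Buffer.BlockCopy(_current, _offset, buffer, offset, n);
        _offset += n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

An instance such as `new ChunkedStream(new[] { "{\"na", "me\": ", "42}" })` could then stand in for the `stream` argument in the Quick Start examples, which makes it easy to check that a token handler copes with tokens split across chunk boundaries.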