NTokenizers

Lightweight stream tokenizers for syntax highlighting and formatting. A building block for render logic and for handling AI responses.

NTokenizers sits in the middle of a tokenization pipeline — it takes raw source code or markup as a stream and emits a sequence of typed tokens that a downstream renderer can consume:

 ┌────────┐    stream     ┌──────────────┐   tokens    ┌────────────┐
 │ source │ ────────────► │ NTokenizers  │ ──────────► │  Renderer  │ ──► styled output
 └────────┘               └──────────────┘             └────────────┘

This separation of concerns means NTokenizers stays format-focused while rendering is delegated to the consumer — whether that is a console UI, a web component, or a custom formatter. The stream-first design ensures low memory usage and real-time compatibility with AI chat outputs, CI logs, or any scenario where data arrives incrementally.
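
To make the pipeline concrete, here is a minimal sketch of the source → tokens → renderer flow. It is not NTokenizers' implementation: `ToyTokenizer`, `ToyToken`, and `ToyTokenType` are hypothetical names invented for this illustration, and the "format" is nothing more than words and whitespace. The point is only that tokens reach the callback while characters are still being read, so the full input never has to be buffered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Demo: the "renderer" simply records each token and prints it as it arrives.
var source = new StringReader("hello  world");
var seen = new List<ToyToken>();
await ToyTokenizer.ParseAsync(source, token =>
{
    seen.Add(token);
    Console.WriteLine($"{token.Type}: '{token.Value}'");
    return Task.CompletedTask;
});

// Hypothetical token kinds for the toy format: words and whitespace runs.
public enum ToyTokenType { Word, Whitespace }

public readonly record struct ToyToken(ToyTokenType Type, string Value);

public static class ToyTokenizer
{
    // Reads one character at a time and emits a token whenever the character
    // class changes, so only the token in progress is held in memory.
    public static async Task ParseAsync(TextReader reader, Func<ToyToken, Task> onToken)
    {
        var current = new StringBuilder();
        var currentType = ToyTokenType.Word;
        var buffer = new char[1];

        while (await reader.ReadAsync(buffer, 0, 1) > 0)
        {
            var type = char.IsWhiteSpace(buffer[0])
                ? ToyTokenType.Whitespace
                : ToyTokenType.Word;

            // A change of character class closes the token in progress.
            if (current.Length > 0 && type != currentType)
            {
                await onToken(new ToyToken(currentType, current.ToString()));
                current.Clear();
            }

            currentType = type;
            current.Append(buffer[0]);
        }

        // Flush the final token once the input ends.
        if (current.Length > 0)
        {
            await onToken(new ToyToken(currentType, current.ToString()));
        }
    }
}
```

For the input above, the callback receives three tokens in order: the word `hello`, a two-space whitespace run, and the word `world`. Swapping the `Console.WriteLine` call for styled output is all a renderer would need to change.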

Supported Formats

NTokenizers provides a collection of stream-capable tokenizers for processing structured text. Each tokenizer breaks down input into meaningful tokens as data arrives in real-time—ideal for large files or streaming data without loading everything into memory.

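The library supports the following formats: Markdown, HTML, JSON, YAML, TOML, XML, C#, C, C++, Go, Java, Kotlin, Python, Rust, SQL, Swift, TypeScript, and CSS.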

The MarkdownTokenizer acts as a composite tokenizer, delegating code blocks to the appropriate sub-tokenizer based on the language tag. This allows seamless parsing of documents that mix multiple formats in a single pass.

Quick Start

Initialize any tokenizer and start parsing a stream:

// Use any tokenizer — replace [Language] with the target format
await [Language]Tokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Handle tokens as they arrive
});

Example with the JSON tokenizer:

await JsonTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
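    // Markup and AnsiConsole come from Spectre.Console. Escape the raw value
    // so characters such as '[' are not interpreted as markup tags.
    var value = Markup.Escape(token.Value);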
    var colored = token.TokenType switch
    {
        JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
        JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
        JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
        _ => new Markup(value)
    };
    AnsiConsole.Write(colored);
});

The MarkdownTokenizer delegates code blocks to sub-tokenizers automatically:

await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is ICodeBlockMetadata codeBlock)
    {
        await codeBlock.RegisterInlineTokenHandler(inlineToken =>
        {
            // Handle code block tokens with syntax highlighting
        });
    }
    // Handle regular markdown tokens
});
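
Conceptually, delegating a code block by its language tag amounts to a lookup from tag to handler with a plain-text fallback. The sketch below illustrates that idea only; `CodeBlockDispatcher` and its handlers are hypothetical stand-ins, not part of NTokenizers, and each handler merely labels the code instead of tokenizing it.

```csharp
using System;
using System.Collections.Generic;

// Demo: a block tagged "json" reaches the JSON handler, while an
// unrecognized tag falls back to plain text.
var dispatcher = new CodeBlockDispatcher();
Console.WriteLine(dispatcher.Dispatch("json", "{\"a\": 1}"));
Console.WriteLine(dispatcher.Dispatch("unknown-lang", "+++"));

// Hypothetical dispatcher: maps a code fence's language tag to a handler.
public sealed class CodeBlockDispatcher
{
    // Each handler stands in for a real sub-tokenizer. Tags are matched
    // case-insensitively, so "JSON" and "json" select the same handler.
    private readonly Dictionary<string, Func<string, string>> _handlers =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["json"] = code => $"json: {code}",
            ["xml"] = code => $"xml: {code}",
            ["csharp"] = code => $"csharp: {code}",
        };

    // Picks the handler for the tag, or treats the block as plain text.
    public string Dispatch(string languageTag, string code) =>
        _handlers.TryGetValue(languageTag, out var handler)
            ? handler(code)
            : $"text: {code}";
}
```

The fallback branch matters in practice: documents often contain code fences with tags no tokenizer recognizes, and those blocks should still pass through unmodified.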

Overview

NTokenizers is a .NET library written in C# that provides tokenizers for structured text formats: Markdown, JSON, XML, HTML, YAML, TOML, SQL, TypeScript, CSS, C#, C, C++, Go, Java, Kotlin, Rust, Swift, and Python. Tokenization is the core functionality: each tokenizer breaks structured text down into meaningful components (tokens) for further processing. The key feature is stream processing: input is handled as it arrives, which makes the library suitable for large files or streaming data without loading everything into memory at once.

Warning

These tokenizers are not validation-based and are primarily intended for prettifying, formatting, or visualizing structured text. They do not perform strict validation of the input format, so they may produce unexpected results when processing malformed or invalid XML, JSON, or HTML. Use them with caution when dealing with untrusted or poorly formatted input.
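
When correctness of the input matters, one option is to run a validating parser alongside the tokenizer. The sketch below uses `JsonDocument.Parse` from `System.Text.Json` to check JSON strictly. `JsonGuard` is a hypothetical helper name, and how (or whether) to combine such a check with tokenization is left to the consumer.

```csharp
using System;
using System.Text.Json;

// Demo: a well-formed document passes, a malformed one does not.
Console.WriteLine(JsonGuard.IsWellFormed("{\"name\": \"value\"}")); // True
Console.WriteLine(JsonGuard.IsWellFormed("{\"name\": }"));          // False

// Hypothetical helper: strict validation kept separate from tokenization.
public static class JsonGuard
{
    // Returns true only if the text parses as strict JSON.
    public static bool IsWellFormed(string json)
    {
        try
        {
            using var document = JsonDocument.Parse(json);
            return true;
        }
        catch (JsonException)
        {
            // JsonDocument.Parse throws JsonException for invalid JSON.
            return false;
        }
    }
}
```

Note that this check needs the complete text, so it gives up the streaming benefit. It fits best as a gate for untrusted input, not as part of the rendering path.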

String output

var result = await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle tokens here */ });

In addition to streaming tokens to the callback, ParseAsync returns the original input as a string for convenience.

Code specific Tokenizers

Individual tokenizers are available for each supported format:

Language      Page
----------    --------------------
Markdown      Markdown Tokenizer
HTML          HTML Tokenizer
JSON          JSON Tokenizer
YAML          YAML Tokenizer
TOML          TOML Tokenizer
XML           XML Tokenizer
C#            CSharp Tokenizer
C             C Tokenizer
C++           C++ Tokenizer
Go            Go Tokenizer
Java          Java Tokenizer
Kotlin        Kotlin Tokenizer
Python        Python Tokenizer
Rust          Rust Tokenizer
SQL           SQL Tokenizer
Swift         Swift Tokenizer
TypeScript    TypeScript Tokenizer
CSS           CSS Tokenizer

Features

NTokenizers is especially suited to parsing AI chat streams: it tokenizes model output as it arrives, so streaming responses and chat conversations can be rendered without buffering the entire response.