NTokenizers Documentation

Welcome to the documentation for the NTokenizers library. This library provides a collection of stream-capable tokenizers for XML, JSON, Markdown, TypeScript, C# and SQL processing.

Kickoff token processing

// kickoff markdown tokenizer
await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle markdown-tokens here */ });

// kickoff csharp tokenizer
await CSharpTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle csharp-tokens here */ });

// kickoff json tokenizer
await JsonTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle json-tokens here */ });

// kickoff sql tokenizer
await SqlTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle sql-tokens here */ });

// kickoff typescript tokenizer
await TypescriptTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle typescript-tokens here */ });

// kickoff css tokenizer
await CssTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle css-tokens here */ });

// kickoff xml tokenizer
await XmlTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle xml-tokens here */ });

// kickoff yaml tokenizer
await YamlTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle yaml-tokens here */ });

NTokenizers.Extensions.Spectre.Console

Heads up: Want to see your tokenized text with syntax-style highlighting in the console? Check out our companion project NTokenizers.Extensions.Spectre.Console that brings your text streams to life with rich, color-aware output with the help of this library.

Overview

NTokenizers is a .NET library written in C# that provides tokenizers for processing structured text formats like Markdown, JSON, XML, SQL, Typescript and CSharp. The Tokenize method is the core functionality that breaks down structured text into meaningful components (tokens) for processing. Its key feature is stream processing capability - it can handle data as it arrives in real-time, making it ideal for processing large files or streaming data without loading everything into memory at once.

Warning

These tokenizers are not validation-based and are primarily intended for prettifying, formatting, or visualizing structured text. They do not perform strict validation of the input format, so they may produce unexpected results when processing malformed or invalid XML, JSON, or HTML. Use them with caution when dealing with untrusted or poorly formatted input.

Markdown Example

Here’s a simple example showing how to use the MarkdownTokenizer with a stream containing some markdown text and json inline code blocks:

await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is HeadingMetadata headingMetadata)
    {
        await headingMetadata.RegisterInlineTokenHandler( inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = headingMetadata.Level != 1 ?
                new Markup($"[bold GreenYellow]{value}[/]") :
                new Markup($"[bold yellow]** {value} **[/]");
            AnsiConsole.Write(colored);
        });
        Debug.WriteLine("Written Heading inlines");
    }
    else if (token.Metadata is JsonCodeBlockMetadata jsonMetadata)
    {
        Console.WriteLine($"code: {jsonMetadata.Language}");
        await jsonMetadata.RegisterInlineTokenHandler( inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = inlineToken.TokenType switch
            {
                JsonTokenType.StartObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.StartArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
                JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
                JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
                JsonTokenType.True => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.False => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.Null => new Markup($"[grey]{value}[/]"),
                JsonTokenType.Colon => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Comma => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
                _ => new Markup(value)
            };
            AnsiConsole.Write(colored);
        });
        AnsiConsole.WriteLine();
    }
    else
    {
        // Handle regular markdown tokens
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkupTokenType.Text => new Markup($"{value}"),
            MarkupTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkupTokenType.Italic => new Markup($"[italic]{value}[/]"),
            _ => new Markup(value)
        };

        AnsiConsole.Write(colored);
    }
});

This gives the following output:

markdownexample

String output

var result = await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle tokens here */ }

In addition to streaming tokens, the original input is returned for convenience.

Code specific Tokenizers

The Code specific tokenizers are also available see:

language	page
C#	CSharp Tokenizer
Json	Json Tokenizer
Sql	Sql Tokenizer
typescript/javascript	TypeScript Tokenizer
xml	Xml Tokenizer

Features

Stream Processing: Can handle large files or real-time data streams without loading everything into memory
Real-time Parsing: Processes tokens as they are encountered
Flexible Input: Supports various input sources including streams, readers, and strings
Rich Token Information: Provides detailed token type information for precise handling

Especially suitable for parsing AI chat streams, NTokenizers excels at processing real-time tokenized data from AI models, enabling efficient handling of streaming responses and chat conversations without buffering entire responses.