NTokenizers Documentation
Welcome to the documentation for the NTokenizers library. This library provides a collection of stream-capable tokenizers for XML, JSON, Markup, TypeScript, C# and SQL processing.
Kickoff token processing
// kickoff markup tokenizer
await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle markup-tokens here */ });
// kickoff csharp tokenizer
await CSharpTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle csharp-tokens here */ });
// kickoff json tokenizer
await JsonTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle json-tokens here */ });
// kickoff sql tokenizer
await SqlTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle sql-tokens here */ });
// kickoff typescript tokenizer
await TypescriptTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle typescript-tokens here */ });
// kickoff xml tokenizer
await XmlTokenizer.Create().ParseAsync(stream, onToken: token => { /* handle xml-tokens here */ });
NTokenizers.Extensions.Spectre.Console
Heads up: Want to see your tokenized text with syntax-style highlighting in the console? Check out the companion project NTokenizers.Extensions.Spectre.Console, which builds on this library to bring your text streams to life with rich, color-aware console output.
Overview
NTokenizers is a .NET library written in C# that provides tokenizers for processing structured text formats such as Markup, JSON, XML, SQL, TypeScript and C#. Tokenization is the core functionality: it breaks structured text down into meaningful components (tokens) for processing. Its key feature is stream processing capability - it can handle data as it arrives in real time, making it ideal for processing large files or streaming data without loading everything into memory at once.
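As a minimal sketch of this streaming model, the in-memory example below wraps a JSON string in a MemoryStream and counts the tokens as they are emitted. It assumes, as the kickoff examples suggest, that ParseAsync accepts any readable Stream and that each token exposes a Value; the sample JSON itself is made up:

```csharp
using System.Text;

// Hypothetical sketch: tokenize an in-memory JSON string by wrapping it
// in a MemoryStream (any readable Stream should work the same way).
var json = "{ \"name\": \"NTokenizers\", \"streaming\": true }";
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));

var count = 0;
await JsonTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    count++; // inspect token.TokenType / token.Value here as needed
});
Console.WriteLine($"Processed {count} tokens");
```

Because the callback fires per token, memory usage stays flat regardless of how large the input stream is.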
Warning
These tokenizers are not validation-based and are primarily intended for prettifying, formatting, or visualizing structured text. They do not perform strict validation of the input format, so they may produce unexpected results when processing malformed or invalid XML, JSON, or HTML. Use them with caution when dealing with untrusted or poorly formatted input.
Markup Example
Here’s a simple example showing how to use the MarkupTokenizer with a stream containing markup text and inline JSON code blocks:
await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is HeadingMetadata headingMetadata)
    {
        await headingMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = headingMetadata.Level != 1 ?
                new Markup($"[bold GreenYellow]{value}[/]") :
                new Markup($"[bold yellow]** {value} **[/]");
            AnsiConsole.Write(colored);
        });
        Debug.WriteLine("Written Heading inlines");
    }
    else if (token.Metadata is JsonCodeBlockMetadata jsonMetadata)
    {
        Console.WriteLine($"code: {jsonMetadata.Language}");
        await jsonMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = inlineToken.TokenType switch
            {
                JsonTokenType.StartObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.StartArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
                JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
                JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
                JsonTokenType.True => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.False => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.Null => new Markup($"[grey]{value}[/]"),
                JsonTokenType.Colon => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Comma => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
                _ => new Markup(value)
            };
            AnsiConsole.Write(colored);
        });
        AnsiConsole.WriteLine();
    }
    else
    {
        // Handle regular markup tokens
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkupTokenType.Text => new Markup(value),
            MarkupTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkupTokenType.Italic => new Markup($"[italic]{value}[/]"),
            _ => new Markup(value)
        };
        AnsiConsole.Write(colored);
    }
});
This gives the following output:

String output
var result = await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token => { /* handle tokens here */ });
In addition to streaming tokens, the original input is returned for convenience.
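For example, the returned value can be reused after the stream has been fully consumed. This sketch assumes, per the line above, that ParseAsync returns the complete original input; the output file name is made up for illustration:

```csharp
// Tokens are handled live via onToken; the full original input is also
// returned, so it can be persisted or reused once parsing completes.
var result = await MarkupTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    AnsiConsole.Write(Markup.Escape(token.Value)); // render as it arrives
});
File.WriteAllText("transcript.md", result); // hypothetical: keep a full copy
```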
Code-specific Tokenizers
Language-specific tokenizers are also available; see:
| Language | Page |
| --- | --- |
| C# | CSharp Tokenizer |
| JSON | Json Tokenizer |
| SQL | Sql Tokenizer |
| TypeScript/JavaScript | TypeScript Tokenizer |
| XML | Xml Tokenizer |
Features
- Stream Processing: Can handle large files or real-time data streams without loading everything into memory
- Real-time Parsing: Processes tokens as they are encountered
- Flexible Input: Supports various input sources including streams, readers, and strings
- Rich Token Information: Provides detailed token type information for precise handling
Especially suitable for parsing AI chat streams, NTokenizers excels at processing real-time tokenized data from AI models, enabling efficient handling of streaming responses and chat conversations without buffering entire responses.
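A minimal sketch of that scenario: a producer task writes simulated chat deltas into a System.IO.Pipelines pipe while the tokenizer consumes the reader side concurrently. The chunk contents and the use of a Pipe are illustrative assumptions; only ParseAsync and token.Value come from the examples above:

```csharp
using System.IO.Pipelines;
using System.Text;

var pipe = new Pipe();

// Producer: simulated model deltas arriving over time.
var producer = Task.Run(async () =>
{
    foreach (var chunk in new[] { "# Hel", "lo **wor", "ld**\n" })
    {
        await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes(chunk));
    }
    await pipe.Writer.CompleteAsync(); // signal end of stream
});

// Consumer: tokens are emitted as soon as enough text has arrived.
await MarkupTokenizer.Create().ParseAsync(pipe.Reader.AsStream(), onToken: token =>
{
    Console.Write(token.Value);
});
await producer;
```

Because the tokenizer never waits for the full response, rendering can begin with the first chunk rather than after the last one.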