Markup Tokenizer

The Markup tokenizer is designed to parse Markdown/Markup code and break it down into meaningful components (tokens) for processing. It provides stream-capable functionality for handling large Markup files or real-time markup processing.

Overview

The Markup tokenizer is part of the NTokenizers library. It processes Markup source incrementally as the input arrives, which makes it suitable for large files and for streaming scenarios where loading everything into memory at once is impractical.

It is especially well suited to AI chat streams: it tokenizes model output as it arrives, so streaming responses and chat conversations can be handled without buffering the entire response.
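The streaming scenario can be sketched as follows. This is a minimal illustration, not part of the library: the chunk contents, the delay, and the use of System.IO.Pipelines to simulate a chat response are all assumptions made for the example (System.IO.Pipelines is available as a NuGet package if your target framework does not already include it). The only NTokenizers call is the ParseAsync(stream, onToken) form used elsewhere in this document, and the sketch assumes it consumes the stream incrementally, as described above.

```csharp
using System;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;
using NTokenizers.Markup;

// Made-up chunks; the boundaries deliberately fall inside markup symbols.
string[] chunks = { "This is **bo", "ld** and *ita", "lic* text." };

var pipe = new Pipe();

// Producer: writes one chunk at a time, roughly how a model response arrives.
var producer = Task.Run(async () =>
{
    foreach (var chunk in chunks)
    {
        await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes(chunk));
        await Task.Delay(200); // simulated network latency (assumption)
    }
    await pipe.Writer.CompleteAsync(); // signals end of input to the reader
});

// Consumer: the tokenizer reads from the other end of the pipe.
using var stream = pipe.Reader.AsStream();
await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    await Console.Out.WriteLineAsync($"{token.TokenType}: '{token.Value}'");
});

await producer;
```

The producer runs on a separate task so that the tokenizer can read while chunks are still being written, which is the situation a live chat stream creates.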

Warning

The MarkupTokenizer makes heavy use of inline tokenizers for features like code fences, links, tables, emojis, footnotes, and more.

To get the full functionality, you need to handle each token’s Metadata and process any inline tokens it contains. If you skip a metadata type, the characters it covers may be missing from your output, because the inline tokenizers strip or transform markup symbols during parsing.

Public API

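The Markup tokenizer inherits from BaseTokenizer<MarkupToken>. The examples in this document use MarkupTokenizer.Create() to obtain a tokenizer, ParseAsync with a Stream or TextReader plus an onToken callback, Parse with a string to get the tokens back directly, and ParseAsync with a string plus a callback that returns a string to obtain the processed text.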

Inline tokenizers

The MarkupTokenizer produces tokens that carry metadata describing the type of content they represent. The metadata object also delivers the inline tokens for that content, so make sure to register a handler on it:

await listMetadata.RegisterInlineTokenHandler(async inlineToken => { /* Handle inline tokens here */ });

Handling this metadata correctly is essential to render the markup accurately. Below is a breakdown of the different metadata types, separated into code block types and other markup types:

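Code block metadata

The examples in this document use JsonCodeBlockMetadata, whose inline tokens carry a JsonTokenType.

Other markup metadata

The examples in this document use ListItemMetadata (exposes Marker), HeadingMetadata (exposes Level), and InlineMarkupMetadata.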

Example: Handling Inline Tokens
// Main MarkupTokenizer
await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Handle inline tokens for list items, do this for all the metadata types you expect
    if (token.Metadata is ListItemMetadata listMetadata)
    {
        await listMetadata.RegisterInlineTokenHandler(async inlineToken =>
        {
            // Example: simply write the inline token value to standard output
            await Console.Out.WriteAsync(inlineToken.Value);
        });
    }

    // You can handle other token types here...
});

Usage Examples

Basic Usage with Stream

using NTokenizers.Markup;
using NTokenizers.Markup.Metadata;
using Spectre.Console;
using System.Diagnostics;
using System.Text;

string markupCode = """
    # Heading
    
    This is **bold** and this is *italic*.
    
    - List item 1
    - List item 2
    """;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(markupCode));
await MarkupTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is ListItemMetadata listMetadata)
    {
        AnsiConsole.Write(new Markup($"[bold lime]{listMetadata.Marker} [/]"));
        await listMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            AnsiConsole.Write(new Markup($"[bold red]{value}[/]"));
        });
        AnsiConsole.WriteLine();
        Debug.WriteLine("Written listItem inlines");

    }
    else if (token.Metadata is HeadingMetadata headingMetadata)
    {
        await headingMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = headingMetadata.Level != 1 ?
                new Markup($"[bold GreenYellow]{value}[/]") :
                new Markup($"[bold yellow]** {value} **[/]");
            AnsiConsole.Write(colored);
        });
        Debug.WriteLine("Written Heading inlines");
    }
    else
    {
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkupTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkupTokenType.Italic => new Markup($"[italic]{value}[/]"),
            MarkupTokenType.Text => new Markup($"{value}"),
            _ => new Markup(value)
        };
        AnsiConsole.Write(colored);
    }
});

Advanced Usage with Inline Code Blocks

The following example uses the MarkupTokenizer on input that mixes markup text with a fenced JSON code block, read here from a TextReader (reader):

await MarkupTokenizer.Create().ParseAsync(reader, onToken: async token =>
{
    // Handle JSON code fence
    if (token.Metadata is JsonCodeBlockMetadata jsonMetadata)
    {
        await jsonMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = inlineToken.TokenType switch
            {
                JsonTokenType.StartObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.StartArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
                JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
                JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
                JsonTokenType.True => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.False => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.Null => new Markup($"[grey]{value}[/]"),
                JsonTokenType.Colon => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Comma => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
                _ => new Markup(value)
            };
            AnsiConsole.Write(colored);
        });
    }
    else
    {
        // Handle regular markup tokens
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkupTokenType.Text => new Markup($"{value}"),
            MarkupTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkupTokenType.Italic => new Markup($"[italic]{value}[/]"),
            _ => new Markup(value)
        };

        AnsiConsole.Write(colored);
    }

    if (token.Metadata is InlineMarkupMetadata)
    {
        AnsiConsole.WriteLine();
    }
});

Using with TextReader

using NTokenizers.Markup;
using System.IO;

string markupCode = "This is **bold** text.";
using var reader = new StringReader(markupCode);
await MarkupTokenizer.Create().ParseAsync(reader, onToken: token =>
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
});

Parsing String Directly

using NTokenizers.Markup;

string markupCode = "# Title\n\nSome *text* here.";
var tokens = MarkupTokenizer.Create().Parse(markupCode);
foreach (var token in tokens)
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
}

Using the Processed Stream as a String

using NTokenizers.Markup;
using System.Text;

string markupCode = "This is **bold** and *italic*.";
var processedString = await MarkupTokenizer.Create().ParseAsync(markupCode, token =>
{
    return token.TokenType switch
    {
        MarkupTokenType.Bold => $"<b>{token.Value}</b>",
        MarkupTokenType.Italic => $"<i>{token.Value}</i>",
        MarkupTokenType.Heading => $"<h1>{token.Value}</h1>",
        MarkupTokenType.Link => $"<a>{token.Value}</a>",
        _ => token.Value
    };
});
Console.WriteLine(processedString);
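When token values are placed inside HTML tags like this, text that contains characters such as < or & may need encoding first. The sketch below shows one way to do that with the standard WebUtility.HtmlEncode method. Wrap is a hypothetical helper introduced only for this illustration; it is not part of NTokenizers, and the sketch assumes token.Value holds the raw, unencoded text.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of NTokenizers): HTML-encodes the value
// and wraps it in the given tag.
static string Wrap(string tag, string value) =>
    $"<{tag}>{WebUtility.HtmlEncode(value)}</{tag}>";

Console.WriteLine(Wrap("b", "1 < 2 & 3"));
// prints: <b>1 &lt; 2 &amp; 3</b>
```

Inside the callback, a branch such as MarkupTokenType.Bold => Wrap("b", token.Value) would then replace the plain string interpolation.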

Token Types

The Markup tokenizer tags each token with a MarkupTokenType value. The examples in this document use Text, Bold, Italic, Heading, and Link.

More info: MarkupTokenType.cs
