Markdown Tokenizer
The Markdown Tokenizer is designed to parse Markdown code and break it down into meaningful components (tokens) for processing. It provides stream-capable functionality for handling large Markdown files or real-time markdown processing.
Overview
The Markdown Tokenizer is part of the NTokenizers library and provides a stream-capable approach to parsing Markdown code. It can process Markdown source code in real-time, making it suitable for large files or streaming scenarios where loading everything into memory at once is impractical.
It is especially well suited to AI chat streams: the Markdown Tokenizer can tokenize a model's response as it arrives, so streaming responses and chat conversations can be handled without buffering the entire response first.
Warning
The MarkdownTokenizer makes heavy use of inline tokenizers for features like code fences, links, tables, emojis, footnotes, and more.
To get the full functionality, you need to handle each token’s Metadata and process any inline tokens it contains. If you skip a metadata type, some characters may be missing from your output, because the inline tokenizers strip or transform Markdown symbols during parsing.
Public API
The Markdown Tokenizer inherits from BaseTokenizer<MarkdownToken> and provides the following key methods:
- ParseAsync(Stream stream, Action<MarkdownToken> onToken) - Asynchronously parses a stream of Markdown code
- Parse(Stream stream, Action<MarkdownToken> onToken) - Synchronously parses a stream of Markdown code
- Parse(string input) - Parses a string of Markdown code and returns a list of tokens
- ParseAsync(TextReader reader, Action<MarkdownToken> onToken) - Asynchronously parses from a TextReader
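The usage examples further down focus on the asynchronous and string overloads, so here is a minimal sketch of the synchronous stream overload. The sample input and the printed format are illustrative assumptions; only Parse(Stream, Action<MarkdownToken>), TokenType, and Value are taken from this page.

```csharp
using System;
using System.IO;
using System.Text;
using NTokenizers.Markdown;

// Illustrative input; any Markdown text would do.
string markdown = "Some **bold** text.";
int tokenCount = 0;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(markdown));

// Synchronous overload: the callback runs once for every token found.
MarkdownTokenizer.Create().Parse(stream, token =>
{
    tokenCount++;
    Console.WriteLine($"{token.TokenType}: '{token.Value}'");
});

Console.WriteLine($"Tokens seen: {tokenCount}");
```

The callback form keeps memory use flat because tokens are handled as they are produced rather than collected into a list.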
Inline tokenizers
The MarkdownTokenizer produces tokens that carry metadata describing the type of content they represent. The metadata also lets you register an inline token handler; make sure to register one:
await listMetadata.RegisterInlineTokenHandler(async inlineToken => { /* Handle inline tokens here */ });
Handling this metadata correctly is essential to render the markdown accurately. Below is a breakdown of the different metadata types, separated into code block types and other markdown types:
Code block metadata
The Markdown tokenizer delegates code blocks to language-specific sub-tokenizers. Each code block produces metadata that you can use to register an inline token handler:
if (token.Metadata is ICodeBlockMetadata codeBlock)
{
    await codeBlock.RegisterInlineTokenHandler(inlineToken =>
    {
        // inlineToken.TokenType is the language-specific token type
        // inlineToken.Value is the token content
    });
}
Markup languages:
- HtmlCodeBlockMetadata: code blocks tagged html
- CssCodeBlockMetadata: code blocks tagged css
Data formats:
- JsonCodeBlockMetadata: code blocks tagged json
- YamlCodeBlockMetadata: code blocks tagged yaml
- TomlCodeBlockMetadata: code blocks tagged toml
- XmlCodeBlockMetadata: code blocks tagged xml (also used for XAML and SVG code blocks)
Programming languages:
- CSharpCodeBlockMetadata: code blocks tagged csharp
- CCodeBlockMetadata: code blocks tagged c
- CppCodeBlockMetadata: code blocks tagged cpp
- GoCodeBlockMetadata: code blocks tagged go
- JavaCodeBlockMetadata: code blocks tagged java
- KotlinCodeBlockMetadata: code blocks tagged kotlin
- PythonCodeBlockMetadata: code blocks tagged python
- RustCodeBlockMetadata: code blocks tagged rust
- SqlCodeBlockMetadata: code blocks tagged sql
- SwiftCodeBlockMetadata: code blocks tagged swift
- TypeScriptCodeBlockMetadata: code blocks tagged typescript
Fallback:
- GenericCodeBlockMetadata: used for any unrecognized language tag
Other markdown metadata
- HeadingMetadata
- BlockquoteMetadata
- ListItemMetadata
- OrderedListItemMetadata
- TableMetadata
- LinkMetadata
- FootnoteMetadata
- EmojiMetadata
Architecture
Most tokenizers, such as the JSON or XML tokenizer, can be used on their own, depending on the format you want to parse.
The MarkdownTokenizer, however, is a special case. Instead of working on a single format, it acts as a composite tokenizer that uses the other tokenizers as sub-tokenizers. When parsing a stream, the MarkdownTokenizer delegates portions of the input to the appropriate sub-tokenizer, which lets it handle multiple formats in one pass.
The same principle applies to inline tokenizers such as Heading, Blockquote, ListItem, and others. Unlike the sub-tokenizers, however, they cannot be used on their own, and they produce the same token types as the MarkdownTokenizer itself.
Diagram
┌─────────┐
│ stream  │
└─────────┘
     │ ParseAsync()
     ▼
┌─────────────────────┐
│  MarkdownTokenizer  │ ───────────► fire markdown tokens
└─────────────────────┘
     │
     ▼       ┌─────────┐
     ├──────►│ json    │ ───► fire json tokens
     │       └─────────┘
     │
     │       ┌─────────┐
     ├──────►│ Heading │ ───► fire markdown tokens
     │       └─────────┘
     │
     │       ┌─────────┐
     ├──────►│ html    │ ───► fire html tokens
     │       └─────────┘
     │            │
     │            ▼       ┌─────────┐
     │            ├──────►│ css     │ ───► fire css tokens
     │            │       └─────────┘
     │            │
     │            │       ┌─────────┐
     │            └──────►│ script  │ ───► fire typescript tokens
     │                    └─────────┘
     │       ┌─────────┐
     └──────►│ etc..   │ ───► etc
             └─────────┘
Example: Handling Inline Tokens
// Main MarkdownTokenizer
await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Handle inline tokens for list items; do this for all the metadata types you expect
    if (token.Metadata is ListItemMetadata listMetadata)
    {
        await listMetadata.RegisterInlineTokenHandler(async inlineToken =>
        {
            // Example: simply write the inline token value
            await ansiConsole.WriteAsync(inlineToken.Value);
        });
    }
    // You can handle other token types here...
});
Usage Examples
Basic Usage with Stream
using NTokenizers.Markdown;
using NTokenizers.Markdown.Metadata;
using Spectre.Console;
using System.Diagnostics;
using System.Text;
string markdownCode = """
# Heading
This is **bold** and this is *italic*.
- List item 1
- List item 2
""";
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(markdownCode));
await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is ListItemMetadata listMetadata)
    {
        // Write the indentation and the list marker, then the inline content
        AnsiConsole.Write(new Markup($"{token.Value}[bold lime]{listMetadata.Marker} [/]"));
        await listMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            AnsiConsole.Write(new Markup($"[bold red]{value}[/]"));
        });
        AnsiConsole.WriteLine();
        Debug.WriteLine("Written listItem inlines");
    }
    else if (token.Metadata is HeadingMetadata headingMetadata)
    {
        await headingMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            // Level 1 headings get a distinct style
            var colored = headingMetadata.Level != 1 ?
                new Markup($"[bold GreenYellow]{value}[/]") :
                new Markup($"[bold yellow]** {value} **[/]");
            AnsiConsole.Write(colored);
        });
        Debug.WriteLine("Written Heading inlines");
    }
    else
    {
        // Tokens without inline content are written directly
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkdownTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkdownTokenType.Italic => new Markup($"[italic]{value}[/]"),
            MarkdownTokenType.Text => new Markup($"{value}"),
            _ => new Markup(value)
        };
        AnsiConsole.Write(colored);
    }
});
Advanced Usage with Inline Code Blocks
Here’s an example showing how to use the MarkdownTokenizer with input that mixes Markdown text and JSON code blocks. It assumes that reader is a TextReader over the Markdown input:
await MarkdownTokenizer.Create().ParseAsync(reader, onToken: async token =>
{
    // Handle JSON code fence
    if (token.Metadata is JsonCodeBlockMetadata jsonMetadata)
    {
        await jsonMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            var value = Markup.Escape(inlineToken.Value);
            var colored = inlineToken.TokenType switch
            {
                JsonTokenType.StartObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndObject => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.StartArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.EndArray => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.PropertyName => new Markup($"[cyan]{value}[/]"),
                JsonTokenType.StringValue => new Markup($"[green]{value}[/]"),
                JsonTokenType.Number => new Markup($"[magenta]{value}[/]"),
                JsonTokenType.True => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.False => new Markup($"[orange1]{value}[/]"),
                JsonTokenType.Null => new Markup($"[grey]{value}[/]"),
                JsonTokenType.Colon => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Comma => new Markup($"[yellow]{value}[/]"),
                JsonTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
                _ => new Markup(value)
            };
            AnsiConsole.Write(colored);
        });
    }
    else
    {
        // Handle regular markdown tokens
        var value = Markup.Escape(token.Value);
        var colored = token.TokenType switch
        {
            MarkdownTokenType.Text => new Markup($"{value}"),
            MarkdownTokenType.Bold => new Markup($"[bold]{value}[/]"),
            MarkdownTokenType.Italic => new Markup($"[italic]{value}[/]"),
            _ => new Markup(value)
        };
        AnsiConsole.Write(colored);
    }
    if (token.Metadata is InlineMarkdownMetadata)
    {
        AnsiConsole.WriteLine();
    }
});
Using with TextReader
using NTokenizers.Markdown;
using System.IO;
string markdownCode = "This is **bold** text.";
using var reader = new StringReader(markdownCode);
await MarkdownTokenizer.Create().ParseAsync(reader, onToken: token =>
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
});
Parsing String Directly
using NTokenizers.Markdown;
string markdownCode = "# Title\n\nSome *text* here.";
var tokens = MarkdownTokenizer.Create().Parse(markdownCode);
foreach (var token in tokens)
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
}
Using the Processed Stream as a String
using NTokenizers.Markdown;
using System.Text;
string markdownCode = "This is **bold** and *italic*.";
var processedString = await MarkdownTokenizer.Create().ParseAsync(markdownCode, token =>
{
    // Map each token to the string that should appear in the result
    return token.TokenType switch
    {
        MarkdownTokenType.Bold => $"<b>{token.Value}</b>",
        MarkdownTokenType.Italic => $"<i>{token.Value}</i>",
        MarkdownTokenType.Heading => $"<h1>{token.Value}</h1>",
        MarkdownTokenType.Link => $"<a>{token.Value}</a>",
        _ => token.Value
    };
});
Console.WriteLine(processedString);
Token Types
Each token produced by the Markdown Tokenizer has a TokenType of type MarkdownTokenType, which takes one of the following values:
- Text - Represents plain text content
- Bold - Represents bold text (value contains text without ** markers)
- Italic - Represents italic text (value contains text without * markers)
- Heading - Represents a heading (value contains text without # markers, level in Metadata)
- HorizontalRule - Represents a horizontal rule (--- or ***)
- TypographicReplacement - Represents a typographic replacement ((c), (r), (tm), +-)
- Emphasis - Represents a generic emphasis marker
- Blockquote - Represents a blockquote (value contains text without > marker)
- UnorderedListItem - Represents an unordered list item (value contains leading whitespace/indentation; inline content is tokenized separately)
- OrderedListItem - Represents an ordered list item (value contains leading whitespace/indentation; inline content is tokenized separately)
- CodeInline - Represents inline code (value contains code without ` markers)
- CodeBlock - Represents a code block (value contains code without ``` markers, language in Metadata)
- Table - Represents a table (value is empty, structure in Metadata)
- TableRow - Represents a table row (value is empty, position in Metadata)
- TableCell - Represents a table cell (value contains cell content without | delimiters)
- TableAlignments - Represents table column alignments (value is empty, alignments in Metadata)
- Link - Represents a link (value contains link text without [ ] markers, URL in Metadata)
- Image - Represents an image (value contains alt text without ![ ] markers, URL in Metadata)
- Emoji - Represents an emoji (value contains emoji name without : markers)
- Subscript - Represents subscript text (value contains content without ^ markers)
- Superscript - Represents superscript text (value contains content without ~ markers)
- InsertedText - Represents inserted text (value contains content without ++ markers)
- MarkedText - Represents marked text (value contains content without == markers)
- FootnoteReference - Represents a footnote reference (value contains reference ID without [^ ] markers)
- FootnoteDefinition - Represents a footnote definition (value contains definition content)
- DefinitionTerm - Represents a definition term
- DefinitionDescription - Represents a definition description (value contains description without : marker)
- Abbreviation - Represents an abbreviation (value contains definition)
- CustomContainer - Represents a custom container (value contains container type/name without ::: markers)
- HtmlTag - Represents an HTML tag (value contains complete tag including < > markers)
More info: MarkdownTokenType.cs
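Because most token values arrive without their Markdown markers, a consumer that wants the original notation back has to re-attach them. The sketch below illustrates this for a few token types. RestoreMarkers is a hypothetical helper written for this example, not part of NTokenizers, and the sample input is an assumption.

```csharp
using System;
using NTokenizers.Markdown;

// Hypothetical helper (not part of NTokenizers): re-attaches the markers
// that the tokenizer strips from a token's value.
static string RestoreMarkers(MarkdownTokenType type, string value) => type switch
{
    MarkdownTokenType.Bold => $"**{value}**",
    MarkdownTokenType.Italic => $"*{value}*",
    MarkdownTokenType.CodeInline => $"`{value}`",
    MarkdownTokenType.InsertedText => $"++{value}++",
    MarkdownTokenType.MarkedText => $"=={value}==",
    _ => value // leave every other token type untouched
};

// Illustrative input; each token is written back with its markers restored.
foreach (var token in MarkdownTokenizer.Create().Parse("This is **bold** and `code`."))
{
    Console.Write(RestoreMarkers(token.TokenType, token.Value));
}
Console.WriteLine();
```

Token types whose content lives in Metadata (headings, list items, tables, links) would need their inline tokens handled as well, so this helper only covers simple inline spans.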
Supported Code Block Languages
The Markdown Tokenizer supports code blocks for many languages. Code blocks using xml, xaml, or svg language identifiers all use the same XML tokenizer, since XAML and SVG are XML-based formats.
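As a rough sketch of what this means in practice, the snippet below feeds an svg-tagged code block through the tokenizer and handles it through XmlCodeBlockMetadata. The sample SVG, the printed output, and the assumption that XmlCodeBlockMetadata lives in the NTokenizers.Markdown.Metadata namespace are illustrative and not confirmed by this page.

```csharp
using System;
using System.IO;
using System.Text;
using NTokenizers.Markdown;
using NTokenizers.Markdown.Metadata; // assumed namespace for XmlCodeBlockMetadata

// Build the fence from characters so this sample can itself sit in a code block.
string fence = new string('`', 3);
string markdown = $"{fence}svg\n<svg><circle r=\"4\" /></svg>\n{fence}\n";

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(markdown));
await MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    // Per the text above, svg (like xaml) is expected to arrive as XmlCodeBlockMetadata.
    if (token.Metadata is XmlCodeBlockMetadata xmlMetadata)
    {
        await xmlMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            // Print the raw value of each XML token.
            Console.Write(inlineToken.Value);
        });
    }
});
```

One handler for XmlCodeBlockMetadata should therefore be enough to cover xml, xaml, and svg code blocks.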