XML Tokenizer

The XML tokenizer breaks XML into meaningful components (tokens) for processing. It is stream-capable, so it can handle large XML files and real-time XML data analysis.

Overview

The XML tokenizer is part of the NTokenizers library. It can tokenize XML in real time as the input is read, which makes it suitable for large files and streaming scenarios where loading everything into memory at once is impractical.
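As a minimal sketch of the large-file scenario described above, the snippet below reads a file through a stream and counts tokens as they are reported. The `ParseAsync(stream, onToken: ...)` call is the same one used in the usage examples in this document; the file name `large-document.xml` is hypothetical and only there for illustration.

```csharp
using NTokenizers.Xml;
using System;
using System.IO;

// Hypothetical path, for illustration only.
const string path = "large-document.xml";

int tokenCount = 0;

// File.OpenRead returns a FileStream, so the caller never has to
// load the file into a single string before tokenizing it.
using var stream = File.OpenRead(path);
await XmlTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    // Called once per token; here we only count them.
    tokenCount++;
});

Console.WriteLine($"Tokens seen: {tokenCount}");
```

Keeping the callback small (counting, filtering, forwarding) is what keeps memory use low in a setup like this; collecting every token into a list would give up that benefit.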

Public API

The XML tokenizer inherits from BaseSubTokenizer<XmlToken>. The usage examples below call its key methods: Create, ParseAsync (with a Stream, a TextReader, or a string as input), and Parse.

Usage Examples

Basic Usage with Stream

using NTokenizers.Xml;
using Spectre.Console;
using System.Text;

string xmlCode = """
    <user id="4821" active="true">
        <name>Laura Smith</name>
    </user>
    """;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlCode));
await XmlTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    var value = Markup.Escape(token.Value);
    var colored = token.TokenType switch
    {
        XmlTokenType.ElementName => new Markup($"[blue]{value}[/]"),
        XmlTokenType.EndElement => new Markup($"[blue]{value}[/]"),
        XmlTokenType.OpeningAngleBracket => new Markup($"[yellow]{value}[/]"),
        XmlTokenType.ClosingAngleBracket => new Markup($"[yellow]{value}[/]"),
        XmlTokenType.SelfClosingSlash => new Markup($"[yellow]{value}[/]"),
        XmlTokenType.AttributeName => new Markup($"[cyan]{value}[/]"),
        XmlTokenType.AttributeEquals => new Markup($"[yellow]{value}[/]"),
        XmlTokenType.AttributeQuote => new Markup($"[grey]{value}[/]"),
        XmlTokenType.AttributeValue => new Markup($"[green]{value}[/]"),
        XmlTokenType.Text => new Markup($"[white]{value}[/]"),
        XmlTokenType.Comment => new Markup($"[grey]{value}[/]"),
        XmlTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
        _ => new Markup(value)
    };
    AnsiConsole.Write(colored);
});

Using with TextReader

using NTokenizers.Xml;
using System.IO;

string xmlCode = """
    <?xml version="1.0"?>
    <root><item id="1">Text</item></root>
    """;
using var reader = new StringReader(xmlCode);
await XmlTokenizer.Create().ParseAsync(reader, onToken: token =>
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
});

Parsing String Directly

using NTokenizers.Xml;

string xmlCode = "<note><to>User</to></note>";
var tokens = XmlTokenizer.Create().Parse(xmlCode);
foreach (var token in tokens)
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
}
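Building on the string example above, the following sketch collects attribute name/value pairs from the token sequence. It relies only on `Parse`, `TokenType`, and `Value` as used in this document. It assumes that each `AttributeName` token is reported before its matching `AttributeValue` token and that the value token excludes the surrounding quotes (which have their own `AttributeQuote` type); neither assumption is confirmed by the text here.

```csharp
using NTokenizers.Xml;
using System;
using System.Collections.Generic;

string xmlCode = "<user id=\"4821\" active=\"true\"><name>Laura Smith</name></user>";

var attributes = new Dictionary<string, string>();
string? pendingName = null; // last attribute name seen, waiting for its value

foreach (var token in XmlTokenizer.Create().Parse(xmlCode))
{
    switch (token.TokenType)
    {
        case XmlTokenType.AttributeName:
            // Assumption: the name token arrives before its value token.
            pendingName = token.Value;
            break;

        case XmlTokenType.AttributeValue when pendingName is not null:
            attributes[pendingName] = token.Value;
            pendingName = null;
            break;
    }
}

foreach (var (name, value) in attributes)
{
    Console.WriteLine($"{name} = {value}");
}
```

Note that this sketch does not track which element an attribute belongs to, and a repeated attribute name overwrites the earlier entry.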

Using the Processed Stream as a String

using NTokenizers.Xml;
using System.Text;

string xmlCode = "<user id=\"123\"><name>Laura</name></user>";
var processedString = await XmlTokenizer.Create().ParseAsync(xmlCode, token =>
{
    return token.TokenType switch
    {
        XmlTokenType.ElementName => $"[blue]{token.Value}[/]",
        XmlTokenType.AttributeName => $"[cyan]{token.Value}[/]",
        XmlTokenType.AttributeValue => $"[green]{token.Value}[/]",
        XmlTokenType.Text => $"[white]{token.Value}[/]",
        XmlTokenType.Comment => $"[grey]{token.Value}[/]",
        _ => token.Value
    };
});
Console.WriteLine(processedString);

Token Types

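The XML tokenizer produces XmlToken values, each carrying a TokenType of type XmlTokenType. The examples above use the following token types: ElementName, EndElement, OpeningAngleBracket, ClosingAngleBracket, SelfClosingSlash, AttributeName, AttributeEquals, AttributeQuote, AttributeValue, Text, Comment, and Whitespace.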

More info: XmlTokenType.cs
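To see which token types a given input actually produces, a small tally can help. The sketch below counts tokens per `XmlTokenType` using only the `Parse`, `TokenType`, and `Value` members shown in this document; it assumes `TokenType` is of type `XmlTokenType`, as the switch expressions in the usage examples suggest.

```csharp
using NTokenizers.Xml;
using System;
using System.Collections.Generic;

string xmlCode = "<root><item id=\"1\">Text</item><!-- note --></root>";

var counts = new Dictionary<XmlTokenType, int>();

foreach (var token in XmlTokenizer.Create().Parse(xmlCode))
{
    // TryGetValue leaves 'seen' at 0 when the type has not occurred yet.
    counts.TryGetValue(token.TokenType, out var seen);
    counts[token.TokenType] = seen + 1;
}

foreach (var (type, count) in counts)
{
    Console.WriteLine($"{type}: {count}");
}
```

Running this against representative input is a quick way to decide which token types a highlighter or filter needs to handle explicitly and which can fall through to a default case.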
