HTML Tokenizer

The HTML tokenizer breaks HTML source into tokens for further processing. It is stream-capable, so it can handle large HTML files or HTML that arrives incrementally.

Overview

The HTML tokenizer is part of the NTokenizers library. It tokenizes input incrementally as it is read, which makes it suitable for large files or streaming scenarios where loading everything into memory at once is impractical.

The HTML tokenizer also supports embedded CSS and JavaScript by delegating to the CSS and TypeScript tokenizers respectively when encountering <style> and <script> elements.

Public API

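The HTML tokenizer inherits from BaseSubTokenizer<HtmlToken>. The usage examples below rely on the following key methods: HtmlTokenizer.Create(), ParseAsync with a Stream or a TextReader plus an onToken callback, and Parse with a string.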

Usage Examples

Basic Usage with Stream

using NTokenizers.Html;
using Spectre.Console;
using System.IO;
using System.Text;

string htmlCode = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <h1>Hello World</h1>
        <p>This is a paragraph.</p>
    </body>
    </html>
    """;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(htmlCode));
await HtmlTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    var value = Markup.Escape(token.Value);
    var colored = token.TokenType switch
    {
        HtmlTokenType.ElementName => new Markup($"[blue]{value}[/]"),
        HtmlTokenType.OpeningAngleBracket => new Markup($"[yellow]{value}[/]"),
        HtmlTokenType.ClosingAngleBracket => new Markup($"[yellow]{value}[/]"),
        HtmlTokenType.SelfClosingSlash => new Markup($"[yellow]{value}[/]"),
        HtmlTokenType.AttributeName => new Markup($"[cyan]{value}[/]"),
        HtmlTokenType.AttributeEquals => new Markup($"[yellow]{value}[/]"),
        HtmlTokenType.AttributeQuote => new Markup($"[grey]{value}[/]"),
        HtmlTokenType.AttributeValue => new Markup($"[green]{value}[/]"),
        HtmlTokenType.Text => new Markup($"[white]{value}[/]"),
        HtmlTokenType.Comment => new Markup($"[grey]{value}[/]"),
        HtmlTokenType.DocumentTypeDeclaration => new Markup($"[magenta]{value}[/]"),
        HtmlTokenType.Whitespace => new Markup($"[grey]{value}[/]"),
        _ => new Markup(value)
    };
    AnsiConsole.Write(colored);
});

Using with TextReader

using NTokenizers.Html;
using System.IO;

string htmlCode = """<div class="container"><p>Hello</p></div>""";
using var reader = new StringReader(htmlCode);
await HtmlTokenizer.Create().ParseAsync(reader, onToken: token =>
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
});
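
The same pattern can be applied to a file-backed reader, which is where streaming matters most. The following is a minimal sketch, assuming the ParseAsync overload shown above accepts any TextReader (the section heading suggests this, but only StringReader appears in the original example). The file name sample.html and the counter variable are hypothetical and exist only for illustration.

```csharp
using System;
using System.IO;
using NTokenizers.Html;

// Hypothetical input file, written here only so the sketch is self-contained.
string path = Path.Combine(Path.GetTempPath(), "sample.html");
await File.WriteAllTextAsync(path, "<ul><li>One</li><li>Two</li></ul>");

int elementNameTokens = 0;

// File.OpenText returns a StreamReader, which is a TextReader.
// Assumption: ParseAsync accepts any TextReader, not only StringReader.
using (var reader = File.OpenText(path))
{
    await HtmlTokenizer.Create().ParseAsync(reader, onToken: token =>
    {
        // Only a counter is kept, so tokens are not retained in memory.
        if (token.TokenType == HtmlTokenType.ElementName)
        {
            elementNameTokens++;
        }
    });
}

Console.WriteLine($"Element name tokens seen: {elementNameTokens}");
File.Delete(path);
```

Because the callback keeps only a counter, memory use does not depend on how many tokens the file produces.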

Parsing String Directly

using NTokenizers.Html;

string htmlCode = """<a href="https://example.com">Link</a>""";
var tokens = HtmlTokenizer.Create().Parse(htmlCode);
foreach (var token in tokens)
{
    Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
}
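
The token sequence returned by Parse can also be post-processed. The sketch below collects attribute name/value pairs from it. It assumes that each AttributeName token is followed by a single AttributeValue token holding the unquoted value (the separate AttributeQuote type suggests this, but the original text does not confirm it). The dictionary and the pendingName variable are illustrative, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using NTokenizers.Html;

string htmlCode = """<a href="https://example.com" target="_blank">Link</a>""";

// Pair each AttributeName with the next AttributeValue that follows it.
// Assumption: one AttributeValue token per attribute.
var attributes = new Dictionary<string, string>();
string? pendingName = null;

foreach (var token in HtmlTokenizer.Create().Parse(htmlCode))
{
    if (token.TokenType == HtmlTokenType.AttributeName)
    {
        pendingName = token.Value;
    }
    else if (token.TokenType == HtmlTokenType.AttributeValue && pendingName is not null)
    {
        attributes[pendingName] = token.Value;
        pendingName = null;
    }
}

foreach (var (name, value) in attributes)
{
    Console.WriteLine($"{name} = {value}");
}
```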

HTML with Embedded CSS and JavaScript

The HTML tokenizer automatically delegates to specialized tokenizers when it encounters <style> and <script> elements:

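using NTokenizers.Html;
using System.IO;
using System.Text;
// TypeScriptCodeBlockMetadata and CssCodeBlockMetadata also need a using directive
// for the namespace that defines them.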

string htmlCode = """
    <html>
    <head>
        <style>
            body { font-family: Arial, sans-serif; }
            .container { max-width: 600px; }
        </style>
    </head>
    <body>
        <div class="container">Content</div>
        <script>
            console.log('Hello, World!');
        </script>
    </body>
    </html>
    """;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes(htmlCode));
await HtmlTokenizer.Create().ParseAsync(stream, onToken: async token =>
{
    if (token.Metadata is TypeScriptCodeBlockMetadata tsMetadata)
    {
        // <script> content: receive the tokens produced by the TypeScript tokenizer.
        await tsMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            Console.WriteLine($"Token: {inlineToken.TokenType} = '{inlineToken.Value}'");
        });
    }
    else if (token.Metadata is CssCodeBlockMetadata cssMetadata)
    {
        // <style> content: receive the tokens produced by the CSS tokenizer.
        await cssMetadata.RegisterInlineTokenHandler(inlineToken =>
        {
            Console.WriteLine($"Token: {inlineToken.TokenType} = '{inlineToken.Value}'");
        });
    }
    else
    {
        // Regular HTML token.
        Console.WriteLine($"Token: {token.TokenType} = '{token.Value}'");
    }
});

Token Types

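The HTML tokenizer produces tokens whose TokenType is an HtmlTokenType value. The values used in the examples above are ElementName, OpeningAngleBracket, ClosingAngleBracket, SelfClosingSlash, AttributeName, AttributeEquals, AttributeQuote, AttributeValue, Text, Comment, DocumentTypeDeclaration, and Whitespace.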

More info: HtmlTokenType.cs
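
To see which token types a given input actually produces, the tokens can be tallied by type. This is a minimal sketch that uses only the Parse call and the TokenType property shown in the earlier examples. The counts dictionary and the sample markup are illustrative, and no particular output is implied.

```csharp
using System;
using System.Collections.Generic;
using NTokenizers.Html;

string htmlCode = """<p class="note">Hello <!-- hidden --> world</p>""";

// Tally how many tokens of each type the tokenizer emits for this input.
var counts = new Dictionary<HtmlTokenType, int>();
foreach (var token in HtmlTokenizer.Create().Parse(htmlCode))
{
    counts[token.TokenType] = counts.GetValueOrDefault(token.TokenType) + 1;
}

foreach (var (type, count) in counts)
{
    Console.WriteLine($"{type}: {count}");
}
```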

Special Features

CSS and JavaScript Integration

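The HTML tokenizer integrates with the CSS and TypeScript tokenizers: the content of <style> elements is delegated to the CSS tokenizer, and the content of <script> elements is delegated to the TypeScript tokenizer. Embedded CSS and JavaScript are therefore tokenized by their own tokenizers while the streaming architecture is preserved.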
