Java Tokenizer

A fully streaming Java tokenizer that processes source code character by character using a state machine. It is designed for syntax highlighting applications.
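To make the character-by-character state machine idea concrete, here is a minimal, self-contained sketch. It is not the NTokenizers implementation: `MiniJavaLexer`, its `State` enum, and the four-keyword set are hypothetical names chosen for illustration, and the sketch only handles identifiers, keywords, simple numbers, whitespace, and single-character punctuation (no strings, comments, decimals, or multi-character operators).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not the NTokenizers implementation.
public static class MiniJavaLexer
{
    private enum State { Start, Identifier, Number, Whitespace }

    // Deliberately tiny keyword set for the sketch.
    private static readonly HashSet<string> Keywords =
        new HashSet<string> { "public", "class", "static", "void" };

    // Consumes one character at a time and yields a token as soon as it ends,
    // so the input never has to be held in memory as a whole.
    public static IEnumerable<(string Type, string Value)> Tokenize(IEnumerable<char> source)
    {
        var state = State.Start;
        var buffer = new StringBuilder();

        foreach (var c in source)
        {
            var next = Classify(state, c);

            // A state change ends the current token; punctuation is always one character.
            if (buffer.Length > 0 && (next != state || next == State.Start))
            {
                yield return Flush(state, buffer);
            }

            buffer.Append(c);
            state = next;
        }

        // Emit whatever is still buffered when the input ends.
        if (buffer.Length > 0)
        {
            yield return Flush(state, buffer);
        }
    }

    // Picks the state the incoming character belongs to, given the current state.
    private static State Classify(State current, char c)
    {
        if (char.IsWhiteSpace(c)) return State.Whitespace;
        if (char.IsLetter(c) || c == '_' || c == '$')
        {
            // Letters inside a number (0xFF, 10L) stay part of the number.
            return current == State.Number ? State.Number : State.Identifier;
        }
        if (char.IsDigit(c))
        {
            // Digits inside an identifier (item2) stay part of the identifier.
            return current == State.Identifier ? State.Identifier : State.Number;
        }
        return State.Start; // single-character punctuation
    }

    // Turns the buffered text into a (type, value) pair and resets the buffer.
    private static (string Type, string Value) Flush(State state, StringBuilder buffer)
    {
        var text = buffer.ToString();
        buffer.Clear();
        var type = state switch
        {
            State.Whitespace => "Whitespace",
            State.Number => "Number",
            State.Identifier => Keywords.Contains(text) ? "Keyword" : "Identifier",
            _ => text switch
            {
                "{" => "OpenBrace",
                "}" => "CloseBrace",
                "(" => "OpenParenthesis",
                ")" => "CloseParenthesis",
                ";" => "SequenceTerminator",
                _ => "Operator",
            },
        };
        return (type, text);
    }
}
```

Because `Tokenize` accepts any `IEnumerable<char>`, a plain `string` works as input, and each token is produced lazily as the caller enumerates the result. A production tokenizer would need many more states (string and character literals, comments, multi-character operators), which this sketch leaves out on purpose.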

Features

Token Types

The Java tokenizer recognizes the following token types:

| Token Type | Description | Example |
| --- | --- | --- |
| Operator | Any operator | `+`, `-`, `*`, `/`, `==`, `!=`, `&&`, `\|\|`, `->` |
| OpenParenthesis | Open parenthesis | `(` |
| CloseParenthesis | Close parenthesis | `)` |
| OpenBrace | Open brace | `{` |
| CloseBrace | Close brace | `}` |
| OpenBracket | Open bracket | `[` |
| CloseBracket | Close bracket | `]` |
| Comma | Comma | `,` |
| Dot | Dot | `.` |
| SequenceTerminator | Semicolon | `;` |
| Colon | Colon | `:` |
| StringValue | String literal | `"Hello, World!"` |
| CharValue | Character literal | `'a'` |
| Number | Numeric literal | `42`, `3.14`, `0xFF`, `0b1010` |
| Boolean | Boolean literal | `true`, `false` |
| Null | Null literal | `null` |
| Identifier | Variable/method name | `myVariable`, `main` |
| Keyword | Java keyword | `public`, `class`, `static`, `void` |
| Comment | Comment | `// line`, `/* block */` |
| Whitespace | Whitespace | ` ` (space), `\t`, `\n` |

Usage

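using System;
using System.Linq;
using NTokenizers.Java;

// Tokenize a small Java snippet and materialize the tokens into a list.
var tokenizer = JavaTokenizer.Create();
var tokens = tokenizer.Tokenize("public class Main { }").ToList();

// Print each token's type alongside its text.
foreach (var token in tokens)
{
    Console.WriteLine($"{token.TokenType}: {token.Value}");
}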
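
Since the tokenizer targets syntax highlighting, one common next step is mapping token types to colors. The helper below is a hypothetical sketch, not part of NTokenizers: `JavaHighlightPalette` and its color choices are invented for illustration, and it takes the token type as a string on the assumption that the type names match the table above.

```csharp
using System;

// Hypothetical helper, not part of NTokenizers: maps a token type name to a console color.
public static class JavaHighlightPalette
{
    public static ConsoleColor ColorFor(string tokenType) => tokenType switch
    {
        "Keyword" => ConsoleColor.Blue,
        "StringValue" or "CharValue" => ConsoleColor.DarkYellow,
        "Number" or "Boolean" or "Null" => ConsoleColor.Magenta,
        "Comment" => ConsoleColor.DarkGreen,
        "Operator" => ConsoleColor.Gray,
        _ => ConsoleColor.White, // identifiers, punctuation, whitespace
    };
}
```

Inside a loop over the tokens, setting `Console.ForegroundColor = JavaHighlightPalette.ColorFor(token.TokenType.ToString())` before writing `token.Value` would colorize the output, assuming `TokenType.ToString()` yields the names listed in the token table.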

Markdown Integration

The Java tokenizer is integrated with the Markdown tokenizer. Java code blocks in Markdown are automatically tokenized:

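using NTokenizers.Markdown;

// A Markdown document containing a fenced Java code block.
var markdown = "# Java Example\n\n```java\npublic class Main { }\n```";

// The Markdown tokenizer hands the contents of the java block to the Java tokenizer.
var tokens = await MarkdownTokenizer.Create().TokenizeAsync(markdown);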

Java Keywords

The tokenizer recognizes all standard Java keywords:

"