C Tokenizer

A fully streaming C tokenizer that processes source code character by character using a state machine. It is designed for syntax highlighting applications.

Features

Token Types

The C tokenizer recognizes the following token types:

| Token Type | Description | Example |
|---|---|---|
| Operator | Any operator | +, -, *, /, ==, !=, &&, \|\|, -> |
| OpenParenthesis | Open parenthesis | ( |
| CloseParenthesis | Close parenthesis | ) |
| OpenBrace | Open brace | { |
| CloseBrace | Close brace | } |
| OpenBracket | Open bracket | [ |
| CloseBracket | Close bracket | ] |
| Comma | Comma | , |
| Dot | Dot | . |
| Arrow | Arrow | -> |
| SequenceTerminator | Semicolon | ; |
| Colon | Colon | : |
| StringValue | String literal | "Hello, World!" |
| CharValue | Character literal | 'a' |
| Number | Numeric literal | 42, 3.14, 0xFF, 077 |
| Identifier | Variable/function name | myVariable, main |
| Keyword | C keyword | int, struct, void, return |
| Preprocessor | Preprocessor directive | #include, #define |
| Comment | Comment | // line, /* block */ |
| Whitespace | Whitespace | space, `\t`, `\n` |
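
To make the character-by-character state machine idea concrete, here is a minimal, self-contained sketch that emits a handful of the token types listed above. It is not the NTokenizers implementation: the `MiniCTokenizer` class, its `State` enum, the four-word keyword set, and the string token-type names are all assumptions made for illustration. It ignores strings, comments, multi-character operators, and non-decimal numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch only; the real tokenizer covers far more cases.
public static class MiniCTokenizer
{
    private enum State { Start, Identifier, Number, Whitespace }

    // Deliberately tiny keyword set, just enough for the example input.
    private static readonly HashSet<string> Keywords =
        new HashSet<string> { "int", "void", "return", "struct" };

    public static IEnumerable<(string Type, string Value)> Tokenize(IEnumerable<char> source)
    {
        var state = State.Start;
        var buffer = new StringBuilder();

        foreach (char c in source)
        {
            // Stay in the current state while the character extends the token.
            bool extends =
                (state == State.Identifier && (char.IsLetterOrDigit(c) || c == '_')) ||
                (state == State.Number && char.IsDigit(c)) ||
                (state == State.Whitespace && char.IsWhiteSpace(c));
            if (extends) { buffer.Append(c); continue; }

            // Otherwise the buffered token is complete, so emit it.
            if (state != State.Start)
            {
                yield return Flush(state, buffer);
                state = State.Start;
            }

            // Decide which state (if any) the new character starts.
            if (char.IsLetter(c) || c == '_') { state = State.Identifier; buffer.Append(c); }
            else if (char.IsDigit(c)) { state = State.Number; buffer.Append(c); }
            else if (char.IsWhiteSpace(c)) { state = State.Whitespace; buffer.Append(c); }
            else yield return (Punctuation(c), c.ToString());
        }

        // Emit whatever is still buffered at end of input.
        if (state != State.Start) yield return Flush(state, buffer);
    }

    private static (string, string) Flush(State state, StringBuilder buffer)
    {
        string value = buffer.ToString();
        buffer.Clear();
        string type = state switch
        {
            State.Identifier => Keywords.Contains(value) ? "Keyword" : "Identifier",
            State.Number => "Number",
            _ => "Whitespace",
        };
        return (type, value);
    }

    // Single-character tokens; anything unrecognized is treated as an operator.
    private static string Punctuation(char c) => c switch
    {
        '(' => "OpenParenthesis",
        ')' => "CloseParenthesis",
        '{' => "OpenBrace",
        '}' => "CloseBrace",
        ';' => "SequenceTerminator",
        ',' => "Comma",
        _ => "Operator",
    };
}
```

Because `Tokenize` takes an `IEnumerable<char>` and yields tokens lazily, it never needs the whole input in memory. Calling `MiniCTokenizer.Tokenize("int main(void) { return 0; }")` yields 15 tokens, starting with Keyword `int`, a Whitespace token, and Identifier `main`.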

Usage

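using System;
using System.Linq;
using NTokenizers.C;

// Tokenize a small C snippet and materialize the tokens into a list.
var tokenizer = CTokenizer.Create();
var tokens = tokenizer.Parse("int main(void) { return 0; }").ToList();

// Print each token's type and its source text.
foreach (var token in tokens)
{
    Console.WriteLine($"{token.TokenType}: {token.Value}");
}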
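
Since the tokenizer targets syntax highlighting, a consumer will typically map each token type to a style. The following is a rough sketch of that step for console output. The `ConsoleHighlighter` class and its color scheme are hypothetical, and tokens are modeled as plain `(Type, Value)` tuples keyed by the type names from the table above, so the sketch does not depend on the library's actual token class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of NTokenizers.
public static class ConsoleHighlighter
{
    // Assumed color scheme, keyed by token type name.
    private static readonly Dictionary<string, ConsoleColor> Colors =
        new Dictionary<string, ConsoleColor>
        {
            ["Keyword"] = ConsoleColor.Blue,
            ["StringValue"] = ConsoleColor.Yellow,
            ["CharValue"] = ConsoleColor.Yellow,
            ["Number"] = ConsoleColor.Magenta,
            ["Comment"] = ConsoleColor.DarkGreen,
            ["Preprocessor"] = ConsoleColor.DarkCyan,
        };

    // Token types without an entry fall back to a neutral color.
    public static ConsoleColor ColorFor(string tokenType) =>
        Colors.TryGetValue(tokenType, out var color) ? color : ConsoleColor.Gray;

    // Write every token's text in its color, preserving the original layout.
    public static void Write(IEnumerable<(string Type, string Value)> tokens)
    {
        foreach (var (type, value) in tokens)
        {
            Console.ForegroundColor = ColorFor(type);
            Console.Write(value);
        }
        Console.ResetColor();
    }
}
```

With the library's tokens, one option would be to pass `token.TokenType.ToString()` to `ColorFor`, assuming the token type renders as the names shown in the table.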

Markdown Integration

The C tokenizer is integrated with the Markdown tokenizer. C code blocks in Markdown are automatically tokenized:

using NTokenizers.Markdown;

var markdown = "# C Example\n\n```c\nint main(void) { return 0; }\n```";
var tokens = await MarkdownTokenizer.Create().TokenizeAsync(markdown);

C Keywords

The tokenizer recognizes all standard C keywords:

"