Python Tokenizer
The Python tokenizer provides character-by-character streaming tokenization for Python source code.
Supported Token Types
- Keywords: `def`, `class`, `if`, `for`, `while`, `return`, `import`, `from`, `async`, `await`, `try`, `except`, `finally`, `with`, `yield`, `lambda`, `pass`, `break`, `continue`, `raise`, `del`, `global`, `nonlocal`, `assert`, `is`, `in`, `not`, `and`, `or`, `as`, `elif`, `else`, `True`, `False`, `None`
- Operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`, `=`, `==`, `!=`, `<`, `>`, `<=`, `>=`, `:=`, `+=`, `-=`, `*=`, `/=`, `//=`, `%=`, `**=`, `&`, `|`, `^`, `~`, `<<`, `>>`, `&=`, `|=`, `^=`, `<<=`, `>>=`, `@`
- Punctuation: `(`, `)`, `{`, `}`, `[`, `]`, `,`, `.`, `:`, `;`
- Strings: Single quotes `'...'`, double quotes `"..."`, triple quotes `'''...'''` and `"""..."""`, f-strings, raw strings, byte strings
- Numbers: Integers, floats, hex (`0x`), binary (`0b`), octal (`0o`), complex (`3j`), scientific notation, underscores (`1_000_000`)
- Comments: `#` line comments
- Decorators: `@decorator` syntax
- Type hints: `def func(x: int) -> str:`
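As a quick way to try these categories, the following Python snippet touches most of them in a few lines. It is only an illustrative input, not taken from the NTokenizers test suite; the function and variable names (`scale`, `mask`, `big`, and so on) are made up for this sketch.

```python
import functools  # keywords: import


@functools.lru_cache(maxsize=None)  # decorator syntax
def scale(x: int, factor: float = 1.5e3) -> float:  # type hints, scientific notation
    """Triple-quoted docstring."""
    mask = 0xFF & 0b1010 | 0o17  # hex, binary, octal, bitwise operators
    big = 1_000_000 // 3 ** 2  # underscores, floor division, power
    z = 3j  # complex literal
    label = f"{x!r}" + r"\d+" + 'ok'  # f-string, raw string, single quotes
    data = b"bytes"  # byte string
    if (n := x % 7) >= 2 and label is not None:  # walrus, comparison, keywords
        return n * factor
    return float(big + mask)
```

Feeding this text to the tokenizer should produce at least one token from each category above, which makes it a convenient smoke test when checking highlighting output.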
Usage
```csharp
using System;
using NTokenizers.Python;

// Tokenize a complete source string in one call.
var tokenizer = PythonTokenizer.Create();
var tokens = tokenizer.Parse("def greet(name: str) -> str: return f\"Hello, {name}!\"");

// Print each token's type and text.
foreach (var token in tokens)
{
    Console.WriteLine($"{token.TokenType}: {token.Value}");
}
```
Streaming
```csharp
// `stream` is the input source supplied by the caller.
await PythonTokenizer.Create().ParseAsync(stream, onToken: token =>
{
    // Handle tokens as they arrive
});
```
Markdown Integration
Python code blocks are automatically recognized in Markdown:
```python
def greet(name: str) -> str:
    return f"Hello, {name}!"
```
The MarkdownTokenizer will use the Python tokenizer to tokenize the code block content.
Limitations
- F-strings with nested expressions are tokenized on a best-effort basis
- Complex numbers like `1+2j` may be split into multiple tokens
- String prefixes (`r`, `b`, `f`, `u`) are tokenized as separate identifiers
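If a consumer needs prefixed strings as single tokens, the prefix can be folded back in as a post-processing step. The sketch below shows one way to do that in Python. It assumes tokens arrive as `(token_type, value)` pairs and that the relevant type names are `"Identifier"` and `"String"`; both the tuple shape and those names are hypothetical stand-ins, not the actual NTokenizers types.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical token shape: (token_type, value). Not the NTokenizers type.
Token = Tuple[str, str]

# Python string prefixes, lower-cased, including the legal two-letter combinations.
STRING_PREFIXES = {"r", "b", "f", "u", "rb", "br", "fr", "rf"}


def merge_string_prefixes(tokens: Iterable[Token]) -> Iterator[Token]:
    """Fold a prefix identifier into the string token that directly follows it."""
    pending: Optional[Token] = None  # identifier that may turn out to be a prefix
    for token in tokens:
        kind, value = token
        if pending is not None:
            if kind == "String":
                # Prefix immediately followed by a string: emit one combined token.
                yield ("String", pending[1] + value)
                pending = None
                continue
            # Not a prefix after all: release the held identifier unchanged.
            yield pending
            pending = None
        if kind == "Identifier" and value.lower() in STRING_PREFIXES:
            pending = token  # hold it until the next token is known
        else:
            yield token
    if pending is not None:
        yield pending  # trailing identifier with nothing after it
```

Holding back only one token keeps the pass compatible with streaming: each identifier is delayed by at most one token before being emitted or merged. An identifier named `b` or `f` that is followed by anything other than a string passes through untouched.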