6.17.1. Python Tokeniser

This module is a modified version of tokenize.py from the standard library, which was marked
  __version__ = "Ka-Ping Yee, 26 October 1997; patched, GvR 3/30/98"
It tokenises Python source code.

The module provides a class 'python_tokenize' and a function 'tokenize'.

The function tokenize is provided for compatibility with the original tokenize.py. It accepts up to four arguments. The first argument, readline, is required and is a callback function which fetches a line for tokenisation. It should return a line with a trailing newline character, or an empty string to indicate end of input. The second argument, tokeneater, is a callback which is called with each token as an argument. If omitted, it defaults to a pretty-printing routine which writes a formatted display of the token to sys.stdout.
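
The callback protocol described above can be sketched with the tokenize module of a current Python 3 standard library, used here only as a stand-in for this module. The names tokenize_with_callback and collect are hypothetical; the assumption is that the tokeneater receives the five token fields as separate arguments, as the printtoken routine in the listing does.

```python
import io
import tokenize

def tokenize_with_callback(readline, tokeneater):
    """Call tokeneater(type, lexeme, start, end, line) once per token.

    readline must return one line including its trailing newline,
    or an empty string at end of input.
    """
    for tok in tokenize.generate_tokens(readline):
        tokeneater(*tok)  # TokenInfo unpacks into the five fields

collected = []

def collect(type, lexeme, start, end, line):
    # A tokeneater that stores tokens instead of printing them.
    collected.append((tokenize.tok_name[type], lexeme, start, end))

source = io.StringIO("x = 1\n")
tokenize_with_callback(source.readline, collect)

for entry in collected:
    print(entry)
```

Any object with a suitable readline method (an open file, a StringIO) can supply the first callback.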

The class constructor and function accept three optional arguments.

squashop
The argument squashop defaults to 0 for the class and 1 for the function. If set, all operator and punctuation tokens are reported with the generic token type OP instead of their individual token types.
report_comments
The argument report_comments defaults to 0 for the class and 1 for the function. If set, comments are reported as COMMENT, and blank lines and mid-statement line ends are reported as NL.
split_multiline_strings
If set, multiline strings are reported using one token per line. The default is 0 because partial strings aren't really tokens.
If all the optional arguments are zero, the result is a 'pure' token stream suitable for parsing. If they are all set, the result is more suitable for pretty printing.
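
The difference between the two kinds of stream can be approximated as follows. This is a sketch only: it uses the Python 3 standard-library tokenize module as a stand-in, the helper token_stream is hypothetical, and split_multiline_strings is not modelled because the stand-in has no equivalent.

```python
import io
import tokenize

def token_stream(source, squashop=False, report_comments=False):
    """Yield (token name, lexeme) pairs for a source string.

    squashop:        report every operator as the generic OP token.
    report_comments: keep COMMENT and NL tokens instead of dropping them.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL) and not report_comments:
            continue
        # tok.type is OP for all operators; tok.exact_type is the
        # specific token (EQUAL, PLUS, ...).
        kind = tok.type if squashop else tok.exact_type
        yield tokenize.tok_name[kind], tok.string

text = "x = 1  # set x\n"
pure = list(token_stream(text))
pretty = list(token_stream(text, squashop=True, report_comments=True))
print(pure)
print(pretty)
```

The 'pure' stream carries the specific operator token and no comment; the pretty-printing stream carries OP and the comment.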

The class provides the following methods. The method reset resets the tokeniser state. The method write accepts arbitrary text data. The method writeline must be called with a single line including a trailing newline character, or with an empty string to indicate end of input. The method get_tokens fetches the tokens which have been produced so far and clears the token queue. The method close signals end of input and returns any trailing tokens. The method tokenize accepts any text data and returns the tokens from the queue. Tokens which span lines are reported after the line in which they terminate has been processed.
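
The queueing behaviour of these methods can be illustrated with a toy class. It is not the python_tokenize class: the name PushTokenizerSketch is hypothetical, the lexer merely splits lines on whitespace, and tokens are simplified to (line number, word) pairs. Only the method protocol follows the description above.

```python
class PushTokenizerSketch:
    """Toy push-style tokeniser showing the method protocol only."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.buffer = ''   # text received that is not yet a complete line
        self.tokens = []   # queue of tokens produced so far
        self.lineno = 0

    def writeline(self, line):
        if line == '':
            return         # end of input: this toy has nothing to flush
        self.lineno += 1
        for word in line.split():
            self.tokens.append((self.lineno, word))

    def write(self, text):
        # Accept arbitrary text and feed complete lines to writeline.
        self.buffer += text
        lines = self.buffer.split('\n')
        self.buffer = lines.pop()   # the last piece has no newline yet
        for line in lines:
            self.writeline(line + '\n')

    def get_tokens(self):
        # Return the queued tokens and clear the queue.
        tokens, self.tokens = self.tokens, []
        return tokens

    def tokenize(self, text):
        self.write(text)
        return self.get_tokens()

    def close(self):
        # Treat any unterminated text as a final line, then signal end of input.
        if self.buffer:
            self.writeline(self.buffer + '\n')
            self.buffer = ''
        self.writeline('')
        return self.get_tokens()

t = PushTokenizerSketch()
print(t.tokenize("a b\nc"))   # 'c' is held back until its line is complete
print(t.tokenize(" d\n"))
print(t.close())
```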

A token consists of five items: an integer token type corresponding to the Python tokens listed in the file token.py, the lexeme which the token represents, the starting and ending positions of the lexeme as (line, column) pairs, and the source line containing the lexeme. Lines are numbered from 1.
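
As an illustration of this layout, the following sketch unpacks a hand-built token and formats it in the same way as the printtoken routine in the listing, rewritten for Python 3. The function name format_token and the sample token are hypothetical, and columns are assumed to count from 0 as in the standard-library tokenize module.

```python
import token

def format_token(tok):
    """Format a five-item token as 'srow,scol-erow,ecol:<TAB>NAME<TAB>lexeme'."""
    kind, lexeme, (srow, scol), (erow, ecol), line = tok
    return "%d,%d-%d,%d:\t%s\t%r" % (
        srow, scol, erow, ecol, token.tok_name[kind], lexeme)

# A NAME token for 'spam' at the start of line 1 (assumed 0-based columns).
example = (token.NAME, 'spam', (1, 0), (1, 4), 'spam = 1\n')
print(format_token(example))
```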

Start python section to interscript/tokenisers/python.py[1 /4 ]
#line 57 "python_tokeniser.ipk"
__version__ = "Ka-Ping Yee 1997/10/26; GvR 1998/3/20, Skaller 1998/11/21"

import string, re
from token import *

COMMENT = N_TOKENS
tok_name[COMMENT] = 'COMMENT'

NL = N_TOKENS + 1
tok_name[NL] = 'NL'

WHITESPACE = N_TOKENS+2
tok_name[WHITESPACE] = 'WHITESPACE'

MULTILINE_STRING_FIRST = N_TOKENS+3
tok_name[MULTILINE_STRING_FIRST]= 'MULTILINE_STRING_FIRST'

MULTILINE_STRING_MIDDLE = N_TOKENS+4
tok_name[MULTILINE_STRING_MIDDLE]= 'MULTILINE_STRING_MIDDLE'

MULTILINE_STRING_LAST = N_TOKENS+5
tok_name[MULTILINE_STRING_LAST]= 'MULTILINE_STRING_LAST'

# Changes from 1.3:
#     Ignore now accepts \f as whitespace.  Operator now includes '**'.
#     Ignore and Special now accept \n or \r\n at the end of a line.
#     Imagnumber is new.  Expfloat is corrected to reject '0e4'.
# Note: to quote a backslash in a regex, it must be doubled in a r'aw' string.

def group(*choices): return '(' + string.join(choices, '|') + ')'
def any(*choices): return apply(group, choices) + '*'
def maybe(*choices): return apply(group, choices) + '?'

Whitespace = r'[ \f\t]*'
Comment = r'#[^\r\n]*'
Ignore = Whitespace + any(r'\\\r?\n' + Whitespace) + maybe(Comment)
Name = r'[a-zA-Z_]\w*'

Hexnumber = r'0[xX][\da-fA-F]*[lL]?'
Octnumber = r'0[0-7]*[lL]?'
Decnumber = r'[1-9]\d*[lL]?'
Intnumber = group(Hexnumber, Octnumber, Decnumber)
Exponent = r'[eE][-+]?\d+'
Pointfloat = group(r'\d+\.\d*', r'\.\d+') + maybe(Exponent)
Expfloat = r'[1-9]\d*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)
Imagnumber = group(r'0[jJ]', r'[1-9]\d*[jJ]', Floatnumber + r'[jJ]')
Number = group(Imagnumber, Floatnumber, Intnumber)

Single = any(r"[^'\\]", r'\\.') + "'"
Double = any(r'[^"\\]', r'\\.') + '"'
Single3 = any(r"[^'\\]",r'\\.',r"'[^'\\]",r"'\\.",r"''[^'\\]",r"''\\.") + "'''"
Double3 = any(r'[^"\\]',r'\\.',r'"[^"\\]',r'"\\.',r'""[^"\\]',r'""\\.') + '"""'
Triple = group("[rR]?'''", '[rR]?"""')
String = group("[rR]?'" + any(r"[^\n'\\]", r'\\.') + "'",
               '[rR]?"' + any(r'[^\n"\\]', r'\\.') + '"')

Operator = group('\+', '\-', '\*\*', '\*', '\^', '~', '/', '%', '&', '\|',
                 '<<', '>>', '==', '<=', '<>', '!=', '>=', '=', '<', '>')
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`]')
Funny = group(Operator, Bracket, Special)

PlainToken = group(Number, Funny, String, Name)
Token = Ignore + PlainToken

ContStr = group("[rR]?'" + any(r'\\.', r"[^\n'\\]") + group("'", r'\\\r?\n'),
                '[rR]?"' + any(r'\\.', r'[^\n"\\]') + group('"', r'\\\r?\n'))
PseudoExtras = group(r'\\\r?\n', Comment, Triple)
PseudoToken = Whitespace + group(PseudoExtras, Number, Funny, ContStr, Name)

tokenprog, pseudoprog, single3prog, double3prog = map(
    re.compile, (Token, PseudoToken, Single3, Double3))
endprogs = {"'": re.compile(Single), '"': re.compile(Double),
            "'''": single3prog, '"""': double3prog,
            "r'''": single3prog, 'r"""': double3prog,
            "R'''": single3prog, 'R"""': double3prog, 'r': None, 'R': None}

opdict = {
  '(':LPAR,
  ')':RPAR,
  '[':LSQB,
  ']':RSQB,
  ':':COLON,
  ',':COMMA,
  ';':SEMI,
  '+':PLUS,
  '-':MINUS,
  '*':STAR,
  '/':SLASH,
  '|':VBAR,
  '&':AMPER,
  '<':LESS,
  '>':GREATER,
  '=':EQUAL,
  '.':DOT,
  '%':PERCENT,
  '`':BACKQUOTE,
  '{':LBRACE,
  '}':RBRACE,
  '==':EQEQUAL,
  '!=':NOTEQUAL,
  '<>':NOTEQUAL,
  '<=':LESSEQUAL,
  '>=':GREATEREQUAL,
  '~':TILDE,
  '^':CIRCUMFLEX,
  '<<':LEFTSHIFT,
  '>>':RIGHTSHIFT,
  '**':DOUBLESTAR
  }

tabsize = 8
TokenError = 'TokenError'
def printtoken(type, token, (srow, scol), (erow, ecol), line): # for testing
    print "%d,%d-%d,%d:\t%s\t%s" % \
        (srow, scol, erow, ecol, tok_name[type], repr(token))

End python section to interscript/tokenisers/python.py[1]
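
The listing builds its regular expressions by composing small pattern strings with the helpers group, any and maybe. The sketch below restates those helpers and the numeric patterns in Python 3 syntax so they can be tried in isolation. It is an illustration, not part of the tangled file: any is renamed any_of to avoid shadowing the builtin, and leading_number is a hypothetical helper.

```python
import re

def group(*choices):
    # Alternation of the given patterns, wrapped in parentheses.
    return '(' + '|'.join(choices) + ')'

def any_of(*choices):
    # Zero or more repetitions (named `any` in the listing).
    return group(*choices) + '*'

def maybe(*choices):
    # Zero or one occurrence.
    return group(*choices) + '?'

# Numeric patterns copied from the listing.
Hexnumber = r'0[xX][\da-fA-F]*[lL]?'
Octnumber = r'0[0-7]*[lL]?'
Decnumber = r'[1-9]\d*[lL]?'
Intnumber = group(Hexnumber, Octnumber, Decnumber)
Exponent = r'[eE][-+]?\d+'
Pointfloat = group(r'\d+\.\d*', r'\.\d+') + maybe(Exponent)
Expfloat = r'[1-9]\d*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)
Imagnumber = group(r'0[jJ]', r'[1-9]\d*[jJ]', Floatnumber + r'[jJ]')
Number = group(Imagnumber, Floatnumber, Intnumber)

number_prog = re.compile(Number)

def leading_number(text):
    """Return the numeric lexeme at the start of text, or None."""
    m = number_prog.match(text)
    return m.group(0) if m else None

for sample in ('0x1F + 2', '3.14j', '1e10', '42L', '0e4', 'spam'):
    print(repr(sample), '->', repr(leading_number(sample)))
```

The order of alternatives in Number matters: imaginary and floating-point forms are tried before integers so that the longest numeric form wins. For '0e4' only the leading '0' is matched, as an integer, which reflects the comment in the listing that Expfloat rejects '0e4'.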


6.17.1.1. Callback Interface
6.17.1.2. Server Interface