__version__ = "Ka-Ping Yee, 26 October 1997; patched, GvR 3/30/98"

This module provides tokenisation of Python source code.
The module provides a class 'python_tokenize' and a function 'tokenize'.
The function tokenize is provided for compatibility with the original tokenize.py. It accepts up to four arguments. The first argument, readline, is required: it is a callback which fetches one line of input for tokenisation, and must return a line with a trailing newline character, or an empty string to indicate end of input. The second argument, tokeneater, is a callback invoked once for each token produced; if omitted, it defaults to a pretty-printing routine which writes a formatted display of each token to sys.stdout.
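The same readline/tokeneater calling convention lives on in the standard library's tokenize module, which can stand in for a concrete illustration (this sketch drives the stdlib tokenizer, not this module's own tokenize, and collects tokens into a list instead of pretty-printing):

```python
import io
import tokenize as stdlib_tokenize

source = "x = 1 + 2\n"

# readline-style callback: returns one line per call, '' at end of input.
readline = io.StringIO(source).readline

# Collect (type-name, lexeme) pairs instead of printing to sys.stdout.
tokens = []
for tok in stdlib_tokenize.generate_tokens(readline):
    tokens.append((stdlib_tokenize.tok_name[tok.type], tok.string))
```

Any object with a suitable readline method works here: a file object, an io.StringIO, or a bound method of your own buffering class.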
The class constructor and function accept three optional arguments.
The class provides the following methods. The method reset() resets the tokenizer state. The method write accepts arbitrary text data. The method writeline shall be called with a single line including a trailing newline character, or with an empty string to indicate end of input. The method get_tokens returns the tokens produced so far and clears the token queue. The method close signals end of input and returns any trailing tokens. The method tokenize accepts any text data and returns the tokens from the queue. Tokens which span multiple lines are reported after the line on which they terminate has been processed.
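The push-style protocol described above can be illustrated with a hypothetical stand-in (LineTokenizer below is an invented name, and it merely splits lines on whitespace rather than performing real Python tokenisation; only the reset/writeline/get_tokens/close calling convention mirrors the description):

```python
class LineTokenizer:
    """Sketch of the writeline/get_tokens/close protocol only.

    Splits each line on whitespace instead of doing real Python
    tokenisation; the calling convention is the point, not the lexing.
    """
    def __init__(self):
        self.reset()

    def reset(self):
        # Clear all tokenizer state, including the pending-token queue.
        self.tokens = []
        self.closed = False

    def writeline(self, line):
        # A single line with trailing newline, or '' for end of input.
        if line == '':
            self.closed = True
        else:
            self.tokens.extend(line.split())

    def get_tokens(self):
        # Return queued tokens and clear the queue.
        result, self.tokens = self.tokens, []
        return result

    def close(self):
        # Signal end of input and return any trailing tokens.
        self.writeline('')
        return self.get_tokens()


t = LineTokenizer()
t.writeline("x = 1\n")
queued = t.get_tokens()   # ['x', '=', '1']
trailing = t.close()      # []
```

A real implementation would queue deferred tokens inside writeline until a multi-line construct terminates, which is why get_tokens and close are separate operations.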
A token is a tuple consisting of: an integer token index corresponding to the Python tokens listed in the file token.py; the lexeme which the token represents; the starting and ending positions of the lexeme as (line, column) pairs; and the source text containing the lexeme. Lines are numbered from 1.
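This five-field layout matches the 5-tuples produced by the standard library's tokenize module, which can be unpacked to see concrete values (note that the stdlib numbers lines from 1 and columns from 0):

```python
import io
import tokenize

# Take the first token of a one-line program and unpack the 5-tuple:
# (token index, lexeme, start position, end position, source line).
first = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
tok_type, lexeme, start, end, line = first
```

For the input above the first token is a NAME with lexeme 'x', spanning (1, 0) to (1, 1) on the source line "x = 1\n".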
#line 57 "python_tokeniser.ipk"
__version__ = "Ka-Ping Yee 1997/10/26; GvR 1998/3/20, Skaller 1998/11/21"

import string, re
from token import *

COMMENT = N_TOKENS
tok_name[COMMENT] = 'COMMENT'

NL = N_TOKENS + 1
tok_name[NL] = 'NL'

WHITESPACE = N_TOKENS + 2
tok_name[WHITESPACE] = 'WHITESPACE'

MULTILINE_STRING_FIRST = N_TOKENS + 3
tok_name[MULTILINE_STRING_FIRST] = 'MULTILINE_STRING_FIRST'

MULTILINE_STRING_MIDDLE = N_TOKENS + 4
tok_name[MULTILINE_STRING_MIDDLE] = 'MULTILINE_STRING_MIDDLE'

MULTILINE_STRING_LAST = N_TOKENS + 5
tok_name[MULTILINE_STRING_LAST] = 'MULTILINE_STRING_LAST'

# Changes from 1.3:
# Ignore now accepts \f as whitespace.  Operator now includes '**'.
# Ignore and Special now accept \n or \r\n at the end of a line.
# Imagnumber is new.  Expfloat is corrected to reject '0e4'.
# Note: to quote a backslash in a regex, it must be doubled in a r'aw' string.

def group(*choices): return '(' + string.join(choices, '|') + ')'
def any(*choices): return apply(group, choices) + '*'
def maybe(*choices): return apply(group, choices) + '?'

Whitespace = r'[ \f\t]*'
Comment = r'#[^\r\n]*'
Ignore = Whitespace + any(r'\\\r?\n' + Whitespace) + maybe(Comment)
Name = r'[a-zA-Z_]\w*'

Hexnumber = r'0[xX][\da-fA-F]*[lL]?'
Octnumber = r'0[0-7]*[lL]?'
Decnumber = r'[1-9]\d*[lL]?'
Intnumber = group(Hexnumber, Octnumber, Decnumber)
Exponent = r'[eE][-+]?\d+'
Pointfloat = group(r'\d+\.\d*', r'\.\d+') + maybe(Exponent)
Expfloat = r'[1-9]\d*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)
Imagnumber = group(r'0[jJ]', r'[1-9]\d*[jJ]', Floatnumber + r'[jJ]')
Number = group(Imagnumber, Floatnumber, Intnumber)

Single = any(r"[^'\\]", r'\\.') + "'"
Double = any(r'[^"\\]', r'\\.') + '"'
Single3 = any(r"[^'\\]", r'\\.', r"'[^'\\]", r"'\\.", r"''[^'\\]", r"''\\.") + "'''"
Double3 = any(r'[^"\\]', r'\\.', r'"[^"\\]', r'"\\.', r'""[^"\\]', r'""\\.') + '"""'
Triple = group("[rR]?'''", '[rR]?"""')
String = group("[rR]?'" + any(r"[^\n'\\]", r'\\.') + "'",
               '[rR]?"' + any(r'[^\n"\\]', r'\\.') + '"')

Operator = group('\+', '\-', '\*\*', '\*', '\^', '~', '/', '%', '&', '\|',
                 '<<', '>>', '==', '<=', '<>', '!=', '>=', '=', '<', '>')
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`]')
Funny = group(Operator, Bracket, Special)

PlainToken = group(Number, Funny, String, Name)
Token = Ignore + PlainToken

ContStr = group("[rR]?'" + any(r'\\.', r"[^\n'\\]") + group("'", r'\\\r?\n'),
                '[rR]?"' + any(r'\\.', r'[^\n"\\]') + group('"', r'\\\r?\n'))
PseudoExtras = group(r'\\\r?\n', Comment, Triple)
PseudoToken = Whitespace + group(PseudoExtras, Number, Funny, ContStr, Name)

tokenprog, pseudoprog, single3prog, double3prog = map(
    re.compile, (Token, PseudoToken, Single3, Double3))
endprogs = {"'": re.compile(Single), '"': re.compile(Double),
            "'''": single3prog, '"""': double3prog,
            "r'''": single3prog, 'r"""': double3prog,
            "R'''": single3prog, 'R"""': double3prog, 'r': None, 'R': None}

opdict = {
    '(': LPAR,
    ')': RPAR,
    '[': LSQB,
    ']': RSQB,
    ':': COLON,
    ',': COMMA,
    ';': SEMI,
    '+': PLUS,
    '-': MINUS,
    '*': STAR,
    '/': SLASH,
    '|': VBAR,
    '&': AMPER,
    '<': LESS,
    '>': GREATER,
    '=': EQUAL,
    '.': DOT,
    '%': PERCENT,
    '`': BACKQUOTE,
    '{': LBRACE,
    '}': RBRACE,
    '==': EQEQUAL,
    '!=': NOTEQUAL,
    '<>': NOTEQUAL,
    '<=': LESSEQUAL,
    '>=': GREATEREQUAL,
    '~': TILDE,
    '^': CIRCUMFLEX,
    '<<': LEFTSHIFT,
    '>>': RIGHTSHIFT,
    '**': DOUBLESTAR
}

tabsize = 8
TokenError = 'TokenError'

def printtoken(type, token, (srow, scol), (erow, ecol), line):  # for testing
    print "%d,%d-%d,%d:\t%s\t%s" % \
          (srow, scol, erow, ecol, tok_name[type], repr(token))
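The listing above is Python 1.5-era code: string.join, apply, and tuple parameters in printtoken no longer exist in current Python. The regex-building idiom still works, though; here is a sketch of modern equivalents of the group/any/maybe helpers, exercising the Number sub-patterns exactly as defined above (any is renamed any_of because it shadows a builtin today):

```python
import re

# Modern equivalents of the helpers in the listing: '|'.join replaces
# string.join(choices, '|'), and *-unpacking replaces apply().
def group(*choices): return '(' + '|'.join(choices) + ')'
def any_of(*choices): return group(*choices) + '*'   # 'any' shadows a builtin
def maybe(*choices): return group(*choices) + '?'

# Number sub-patterns copied verbatim from the listing.
Hexnumber = r'0[xX][\da-fA-F]*[lL]?'
Octnumber = r'0[0-7]*[lL]?'
Decnumber = r'[1-9]\d*[lL]?'
Intnumber = group(Hexnumber, Octnumber, Decnumber)
Exponent = r'[eE][-+]?\d+'
Pointfloat = group(r'\d+\.\d*', r'\.\d+') + maybe(Exponent)
Expfloat = r'[1-9]\d*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)

# Anchor with '$' so only complete lexemes count as matches.
number_re = re.compile(group(Floatnumber, Intnumber) + r'$')

matches = [bool(number_re.match(s)) for s in ('0x1f', '3.14', '1e10', '0e4')]
# '0e4' is rejected, as the Expfloat change note in the listing says.
```

The [lL] suffixes and octal-without-0o forms are Python 1/2 literal syntax; a tokenizer for modern Python would need different number patterns.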