Scanning

Scanner

We will read the proposed configuration file in the following way:

  • The configuration file contains «classes»
  • Every «class» begins with a token of type «class»…
  • … and contains a class ID (a token of type «classid»), followed by a «block»
  • A block contains the rest of the class's characteristics (a sample file is shown right below)
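
For instance, a file in this format might look like the following. This sample is reconstructed from the token output shown later in the article; the actual test file is not given, so treat it as a hypothetical example:

class 1:5 root {
    rate = 10240
    ceil = 20480
}

class 1:50 parent 1:5 {
    ceil = 2048
    rate = 1024
    descr = My_favorite_client
}

class 1:53 parent 1:50 {
    descr = My_other_client
    ceil = 2048
    rate = 1024
}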

With SPARK we subclass spark.GenericScanner and define the regular expressions for token recognition in the docstrings of its methods. So, for every token which matches the «[a-zA-Z_]+[0-9]*» regexp, t_string will be called, and so on.

#!/usr/bin/env python

import spark
import token   # our own module providing the Token class (see below)

class SimpleScanner(spark.GenericScanner):
    def __init__(self):
        spark.GenericScanner.__init__(self)
        self.keywords = [
            'class',
            'rate',
            'ceil',
            'descr',
            'root',
            'parent',
        ]

    def tokenize(self, input):
        self.rv = []
        spark.GenericScanner.tokenize(self, input)
        return self.rv

    def t_whitespace(self, s):
        r'\s+'
        pass                # whitespace is skipped

    def t_comment(self, s):
        r'\#.*'
        pass                # comments run to the end of the line

    def t_semicol(self, s):
        r';'
        self.rv.append(token.Token(type=s, attr=s))

    def t_openblock(self, s):
        r'{'
        self.rv.append(token.Token(type=s, attr=s))

    def t_closeblock(self, s):
        r'}'
        self.rv.append(token.Token(type=s, attr=s))

    def t_equal(self, s):
        r'='
        self.rv.append(token.Token(type=s, attr=s))

    def t_number(self, s):
        r'[0-9]+'
        self.rv.append(token.Token(type='number', attr=s))

    def t_classid(self, s):
        r'[0-9]+:[0-9]+'
        self.rv.append(token.Token(type='classid', attr=s))

    def t_keyword(self, s):
        # no docstring (e.g. r' class | rate | ceil | descr '), so this
        # is not a scanner rule; t_string calls it for known keywords
        self.rv.append(token.Token(type=s, attr=s))

    def t_string(self, s):
        r'[a-zA-Z_]+[0-9]*'
        if s in self.keywords:
            self.t_keyword(s)
        else:
            self.rv.append(token.Token(type='string', attr=s))

What will this relatively simple class do for our task? Let's try:

#!/usr/bin/env python
 
import spark
import scanner
 
def scan(f):
    input = f.read()
    scnr = scanner.SimpleScanner()
    return scnr.tokenize(input)
 
f = open('test.config')
 
scanned = scan(f)
 
print scanned

«print scanned» will print a list of tokens (I have broken the long lines):

[class, 1:5, root, {, rate, =, 10240, ceil, =, 20480, },
 class, 1:50, parent, 1:5, {, ceil, =, 2048, rate, =, 1024, descr, =, My_favorite_client, },
 class, 1:53, parent, 1:50, {, descr, =, My_other_client, ceil, =, 2048, rate, =, 1024, }
]

It’s not too bad: we have our configuration file scanned and returned as a list of tokens!

Great, let's move forward:

>>> print '\n'.join(['Token "%s" of type "%s"' % (x.attr, x.type) for x in scanned])
Token "class" of type "class"
Token "1:5" of type "classid"
Token "root" of type "root"
Token "{" of type "{"
Token "rate" of type "rate"
Token "=" of type "="
Token "10240" of type "number"
Token "ceil" of type "ceil"
 
<... etc-etc ...>
 
Token "1024" of type "number"
Token "}" of type "}"

So, our configuration file was scanned successfully: we now have a list of tokens together with their types. The class token.Token is responsible for that; I have not provided it, but it is almost identical to the one given by John Aycock.
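
For completeness, here is a minimal sketch of what the token module might contain. It is modeled on John Aycock's SPARK examples, not copied from the original code; note that SimpleScanner passes type and attr as keyword arguments:

# token.py -- a minimal sketch, modeled on John Aycock's Token class
class Token:
    def __init__(self, type, attr=None):
        self.type = type
        self.attr = attr

    # compare by type, so a future parser can match a token
    # against grammar symbols like 'class' or 'number'
    def __cmp__(self, o):
        return cmp(self.type, o)

    # printing a token shows its attribute, which is why
    # «print scanned» above produced such readable output
    def __repr__(self):
        return str(self.attr or self.type)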

Making lexical errors

Let's write «roo*» instead of «root» on the third line and test again. We will get this message:

Specification error: unmatched input

As you can see, SPARK does not report the line number where the error occurred. Read SPARK's documentation to learn how to do that properly.
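
For illustration, here is one possible way to add line tracking; this is a sketch built on assumptions, not SPARK's official recipe. It relies on two observations: in our grammar only whitespace can contain newlines, and the «Specification error» message above comes from a catch-all t_default rule in GenericScanner, which (assuming your SPARK version defines it with the docstring used below) we can override:

from scanner import SimpleScanner

class LineScanner(SimpleScanner):
    def __init__(self):
        SimpleScanner.__init__(self)
        self.lineno = 1

    def t_whitespace(self, s):
        r'\s+'
        # only whitespace can contain newlines in our grammar,
        # so counting them here is enough to track the line
        self.lineno += s.count('\n')

    def t_default(self, s):
        r'( . | \n )+'
        # replace the generic message with one that names the line
        print 'Lexical error on line %d near %r' % (self.lineno, s[:20])
        raise SystemExit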

But what if we write a configuration like this one?

class 1:50 parent 1:5 {
    ceil = foo moo bar
    rate = 1024
}

The tokens will be scanned and returned successfully:

. . .
Token "ceil" of type "ceil"
Token "=" of type "="
Token "foo" of type "string"
Token "moo" of type "string"
Token "bar" of type "string"
. . .

So, it is quite OK from the lexical point of view. But, surely, it is incorrect :-)

This is a task for the next step: Parsing.
