Skip to Content.
Sympa Menu

k-user - Re: [[K-user] ] Differences between Token and Lexer

k-user AT lists.cs.illinois.edu

Subject: K-user mailing list

List archive

Re: [[K-user] ] Differences between Token and Lexer


Chronological Thread 
  • From: Eric Huber <echuber2 AT illinois.edu>
  • To: Radu Mereuta <headness13 AT gmail.com>
  • Cc: k-user AT lists.cs.illinois.edu
  • Subject: Re: [[K-user] ] Differences between Token and Lexer
  • Date: Mon, 30 May 2016 15:55:14 -0500

Thank you for the clarification. Actually it's K4 that I have been playing around with, so I'm trying to understand the pieces of documentation that I can find and reading the examples packaged with the distribution, which is where I found this note. Is there a rundown of K4's lexer syntax anywhere?

Many thanks,
Eric Huber

On May 30, 2016 3:14 PM, "Radu Mereuta" <headness13 AT gmail.com> wrote:
Hi Eric,

Token is exactly like Lexer, with the addition that it tries to infer a follow restriction automatically (nothing too fancy, just look at the regex, and if it has a * or + group at the end, add a follow rule -/- with that group). This works in simple cases like Identifiers, but on more complex regular expressions, like MODNAME above, it requires user attention.

I'm assuming you are using K3.6, which uses SDF as a parser generator, so the behavior is manifested from there.
The follow restrictions are useful in cases like the following:
syntax Exp ::= Exp Exp | Id
syntax Id ::= Lexer{[a-zA-Z][a-zA-Z0-9]*}
A simple input like "ab", with the rules above, will be ambiguous, resulting in the following parses: '__(Id("a"), Id("b")) and Id("ab").
Adding the follow restriction:
syntax Id -/- [a-zA-Z0-9]
will tell the parser not to generate the Id("a") because it is followed by a 'b'. So it forces the parser to continue parsing greedily.
Eventually, Id will stop parsing at the first whitespace, punctuation or <eof>, and the follow restriction will be respected in that case.

I hope this clarifies things a bit.

If you are interested in more things about K, we are working on a new version, K4, which has a completely redesigned lexer,
which allows for more control, compactness and brevity.

Radu
K dev


On Fri, May 27, 2016 at 8:44 AM, Eric Huber <echuber2 AT illinois.edu> wrote:
Hi, I'm wondering if I understand correctly the differences between Token and Lexer. Lexer is not greedy, so it must be given the follow rule with -/- to show when it may NOT stop collecting characters for one token, correct? Whereas Token is greedy and can not use -/- rules.

Also, this detail from ref-manual.k seems counter-intuitive:
  The construction {<char set> "<separator>"}+ will match 'A', 'A-A',
  'A-A-A', but not 'A-'.

Doesn't the syntax suggest that the separator character must follow each character? I take it that the only reason this works is because in the particular example:

syntax MODNAME ::= Lexer{"#"?{[a-z0-9A-Z\_]+ "-"}+}
syntax MODNAME -/- [a-z0-9A-Z\-\_]

it is using Lexer with the construction, so it is not a greedy selection, and the follow exclusion rule says that - may not appear at the end. But the line in the documentation makes it sound like a special construction to specify ranges with a separator.

Thank you,
Eric Huber




Archive powered by MHonArc 2.6.16.

Top of Page