k-user AT lists.cs.illinois.edu

Subject: K-user mailing list

List archive

Re: [[K-user] ] Differences between Token and Lexer

From: Radu Mereuta <headness13 AT gmail.com>
To: Eric Huber <echuber2 AT illinois.edu>
Cc: k-user AT lists.cs.illinois.edu
Subject: Re: [[K-user] ] Differences between Token and Lexer
Date: Mon, 30 May 2016 23:14:52 +0300

Hi Eric,

Token is exactly like Lexer, with the addition that it tries to infer a follow restriction automatically (nothing too fancy, just look at the regex, and if it has a * or + group at the end, add a follow rule -/- with that group). This works in simple cases like Identifiers, but on more complex regular expressions, like MODNAME above, it requires user attention.

I'm assuming you are using K3.6, which uses SDF as a parser generator, so the behavior is manifested from there.

The follow restrictions are useful in cases like the following:

syntax Exp ::= Exp Exp | Id

syntax Id ::= Lexer{[a-zA-Z][a-zA-Z0-9]*}

A simple input like "ab", with the rules above, will be ambiguous, resulting in the following parses: '__(Id("a"), Id("b")) and Id("ab").

Adding the follow restriction:

syntax Id -/- [a-zA-Z0-9]

will tell the parser not to generate the Id("a") because it is followed by a 'b'. So it forces the parser to continue parsing greedily.

Eventually, Id will stop parsing at the first whitespace, punctuation or <eof>, and the follow restriction will be respected in that case.

I hope this clarifies things a bit.

If you are interested in more things about K, we are working on a new version, K4, which has a completely redesigned lexer,

which allows for more control, compactness and brevity.

Radu

K dev

On Fri, May 27, 2016 at 8:44 AM, Eric Huber <echuber2 AT illinois.edu> wrote:

Hi, I'm wondering if I understand correctly the differences between Token and Lexer. Lexer is not greedy, so it must be given the follow rule with -/- to show when it may NOT stop collecting characters for one token, correct? Whereas Token is greedy and can not use -/- rules.

Also, this detail from ref-manual.k seems counter-intuitive:
The construction {<char set> "<separator>"}+ will match 'A', 'A-A',
'A-A-A', but not 'A-'.

Doesn't the syntax suggest that the separator character must follow each character? I take it that the only reason this works is because in the particular example:

syntax MODNAME ::= Lexer{"#"?{[a-z0-9A-Z\_]+ "-"}+}
syntax MODNAME -/- [a-z0-9A-Z\-\_]

it is using Lexer with the construction, so it is not a greedy selection, and the follow exclusion rule says that - may not appear at the end. But the line in the documentation makes it sound like a special construction to specify ranges with a separator.

Thank you,
Eric Huber

[[K-user] ] Differences between Token and Lexer, Eric Huber, 05/27/2016
- Re: [[K-user] ] Differences between Token and Lexer, Radu Mereuta, 05/30/2016
  - Re: [[K-user] ] Differences between Token and Lexer, Eric Huber, 05/30/2016
    - Re: [[K-user] ] Differences between Token and Lexer, Radu Mereuta, 05/30/2016