Jj
@jjdelc
is there a way to hint or address how to resolve that conflict from the syntax?
Erez Shinan
@erezsh
Yes, you can break the loop apart to its own rule, and use priority. Higher priority will be chosen in case of conflict.
You'll know it worked if you no longer see the conflict.
Jj
@jjdelc
Something like this?
start.1: _WS? "RENDER" opts
opts.5: (_WS (decimals|axis))+
axis: "AXIS" _WS ("X"|"Y") axis_opts
axis_opts.10: (_WS (axis_title|axis_subtitle))+
axis_title.100: "TITLE" _WS "NULL"
axis_subtitle.100: "SUBTITLE" _WS "NULL"
decimals: "DECIMALS" _WS SIGNED_NUMBER
Erez Shinan
@erezsh
That might work, yes. Although generally I recommend that _WS not be the first terminal in any rule
(it's better if rules end with it)
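For illustration only, the convention being described looks roughly like this (a sketch, not a drop-in fix for the grammar above; whether it actually removes the conflict would still need checking against the real grammar):

// leading whitespace, as in the snippet above:
//   opts.5: (_WS (decimals|axis))+
// trailing whitespace instead:
opts.5: ((decimals|axis) _WS)+
decimals: "DECIMALS" _WS SIGNED_NUMBER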
Cisphyx
@Cisphyx
Does the lalr parser only raise UnexpectedToken exceptions? I'm converting a grammar from earley to lalr and wondering if I still need to catch UnexpectedCharacters/UnexpectedEOF exceptions
keith a westgate
@kawestgate_twitter

good news / bad news. the IT folks updated lark to 1.1.2. I'm trying the version with the plugins to attempt the include function.

I used the code from the example

but it's giving me this error:

> /grid/common/pkgs/python/v3.7.2/bin/python3 inc.py 
Traceback (most recent call last):
  File "inc.py", line 65, in <module>
    """)
  File "/grid/common/pkgs/python/v3.7.2/lib/python3.7/site-packages/lark/lark.py", line 625, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/grid/common/pkgs/python/v3.7.2/lib/python3.7/site-packages/lark/parser_frontends.py", line 95, in parse
    stream = self._make_lexer_thread(text)
  File "/grid/common/pkgs/python/v3.7.2/lib/python3.7/site-packages/lark/parser_frontends.py", line 90, in _make_lexer_thread
    return text if self.skip_lexer else cls.from_text(self.lexer, text)
AttributeError: type object 'RecursiveLexerThread' has no attribute 'from_text'

This example did work with a newer version of lark; I'm just not sure which version it was at the moment.

Erez Shinan
@erezsh
@Cisphyx These errors can still happen with LALR, they are just less common.
@kawestgate_twitter Try class RecursiveLexerThread(lexer.LexerThread): ...
keith a westgate
@kawestgate_twitter

I'm now getting:

/grid/common/pkgs/python/v3.9.6/bin/python3 inc.py 
<lark.lexer.ContextualLexer object at 0x2b7594fa42e0> <lark.lexer.LexerState object at 0x2b7594f82d40>
Traceback (most recent call last):
  File "/home/westgate/code/python/lark/inc.py", line 61, in <module>
    tree = parser.parse("""
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/lark.py", line 625, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/parser_frontends.py", line 96, in parse
    return self.parser.parse(stream, chosen_start, **kw)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 41, in parse
    return self.parser.parse(lexer, start)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 171, in parse
    return self.parse_from_state(parser_state)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 178, in parse_from_state
    for token in state.lexer.lex(state):
  File "/home/westgate/code/python/lark/inc.py", line 18, in lex
    token = next(lex)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/lexer.py", line 528, in lex
    yield lexer.next_token(lexer_state, parser_state)
  File "/grid/common/pkgs/python/v3.9.6/lib/python3.9/site-packages/lark/lexer.py", line 460, in next_token
    while line_ctr.char_pos < len(lex_state.text):
TypeError: object of type 'LexerState' has no len()

I can see that lex is supposed to be a generator

Erez Shinan
@erezsh
No, it seems that the text attribute is a LexerState instance, instead of a string
I think the problem is that the signature def __init__(self, lexer: Lexer, text: str): no longer holds; the second argument is now a LexerState
So you might want to rewrite it as
    def __init__(self, lexer: Lexer, lexer_state):
        self.lexer = lexer
        self.state_stack = [lexer_state]
keith a westgate
@kawestgate_twitter
@erezsh that did the trick. I'll post the updated code to work with files shortly
keith a westgate
@kawestgate_twitter

here's a working example of include:

import sys

from lark import Lark

from lark.lexer import Lexer, LexerState, LexerThread

class RecursiveLexerThread(LexerThread):

    def __init__(self, lexer: Lexer, lexer_state):
        self.lexer = lexer
        self.state_stack = [lexer_state]

    def lex(self, parser_state):
        while self.state_stack:
            lexer_state = self.state_stack[-1]
            lex = self.lexer.lex(lexer_state, parser_state)
            try:
                token = next(lex)
            except StopIteration:
                self.state_stack.pop()  # We are done with this file
            else:
                if token.type == "_INCLUDE":
                    name = token.value[8:].strip()  # get just the filename
                    self.state_stack.append(LexerState(open(name).read()))
                yield token  # The parser still expects this token either way

grammar = r"""
start: ((_INCLUDE|line)* _EOL)*

line: STRING+
STRING : /\S+/

_INCLUDE.1 : /include\s+\S+/i

_EOL : /(\n+)/

%ignore /[ \t]+/
"""

parser = Lark(grammar, _plugins={
    "LexerThread": RecursiveLexerThread
}, parser="lalr")

tree = parser.parse(open(sys.argv[1]).read())

print(tree.pretty())
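
For context, a hypothetical demo of the recipe (file names and contents invented for illustration):

# create two small files, then run:  python inc.py main.txt
with open("extra.txt", "w") as f:
    f.write("these lines come from the include\n")
with open("main.txt", "w") as f:
    f.write("hello world\ninclude extra.txt\ngoodbye\n")
# The pretty-printed tree should show the lines of extra.txt spliced in where
# the include directive appeared (assuming the grammar above).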

thanks for all your help

Erez Shinan
@erezsh
Looks nice! Maybe we can add it to the "recipes" part of the docs
keith a westgate
@kawestgate_twitter

Tweaked it just a hair: the file name is now its own token. Makes it a little cleaner, I think.

import sys

from lark import Lark

from lark.lexer import Lexer, LexerState, LexerThread

class RecursiveLexerThread(LexerThread):

    def __init__(self, lexer: Lexer, lexer_state):
        self.lexer = lexer
        self.state_stack = [lexer_state]

    def lex(self, parser_state):
        while self.state_stack:
            lexer_state = self.state_stack[-1]
            lex = self.lexer.lex(lexer_state, parser_state)
            try:
                token = next(lex)
            except StopIteration:
                self.state_stack.pop()  # We are done with this file
            else:
                print(token.type, token.value)
                if token.type == "INCLUDE_FILE_NAME":
                    self.state_stack.append(LexerState(open(token.value).read()))
                yield token  # The parser still expects this token either way

grammar = r"""
start: ((include|line)* _EOL)*

line: STRING+
STRING : /\S+/

include : "include"i INCLUDE_FILE_NAME
INCLUDE_FILE_NAME : /\S+/

_EOL : /(\n+)/

%ignore /[ \t]+/
"""

parser = Lark(grammar, _plugins={
    "LexerThread": RecursiveLexerThread
}, parser="lalr")

tree = parser.parse(open(sys.argv[1]).read())

print(tree.pretty())


Erez Shinan
@erezsh
Nice! I was going to suggest it, but didn't want to complicate matters.
P.S. this wouldn't work with the "basic" lexer, just fyi
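(For reference, a hedged note: the recipe relies on the contextual lexer, which as far as I know is already the default for LALR; pinning it explicitly would look like this:)

parser = Lark(grammar, parser="lalr", lexer="contextual",
              _plugins={"LexerThread": RecursiveLexerThread})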
keith a westgate
@kawestgate_twitter

I'm back to playing with error messages. I notice when lark raises an exception I get a nice error message like this:

lark.exceptions.UnexpectedCharacters: No terminal matches 'n' in the current parser context, at line 1 col 5

inc no-such-file
    ^
Expected one of: 
        * LPAR

I was hoping to pull that information out of the exception. I can find some of it, but I don't see where you get a copy of the current line, or the "No terminal matches..." text. Is there access to that?

MegaIng
@MegaIng
The text is just hardcoded into the __str__ function of the class UnexpectedCharacters. I would suggest checking https://github.com/lark-parser/lark/blob/master/lark/exceptions.py
The 'n' is in the .char attribute. The "context", i.e. the current line, is extracted in __init__ to avoid keeping a copy of the entire string alive, and only that line is stored in ._context
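A rough sketch of pulling those pieces out of the exception (the grammar and input here are invented; this assumes lark's documented UnexpectedCharacters attributes -- line, column, char, allowed -- plus the UnexpectedInput.get_context() helper):

from lark import Lark
from lark.exceptions import UnexpectedCharacters

demo_parser = Lark(r"""
start: "inc" "(" NAME ")"
NAME: /\w+/
%import common.WS
%ignore WS
""", parser="lalr")

text = "inc !no-such-file"
try:
    demo_parser.parse(text)
except UnexpectedCharacters as err:
    print(err.line, err.column, repr(err.char))  # position and offending character
    print(err.get_context(text))                 # rebuilds the "line + caret" view
    print(err.allowed)                           # terminal names that were expected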
keith a westgate
@kawestgate_twitter
thanks, I'll check this out
keith a westgate
@kawestgate_twitter
I'd like to store the file name in the lexer state stack when using the plugins. I can do it for the include when I hit it, but I don't see a good way to do it for the initial file. Any thoughts on this?
MegaIng
@MegaIng
You could do something like prefixing the text you pass into lark with the file name
Lark might also tolerate passing in a tuple, you could try that
(If the LexerThread then correctly sets the text attribute, of course)
Erez Shinan
@erezsh
Why not just initialize it with the proper name in the stack, instead of an empty one?
keith a westgate
@kawestgate_twitter
@erezsh I don't see a way to pass in the file name when I call the initial parse, it just takes a text:str argument.
keith a westgate
@kawestgate_twitter

@MegaIng and then just rip off the file name before parsing the rest? seems a bit forced, have to tweak the grammar, but could work.

the tuple idea I like, but when can I pass it in? I don't have access to the initial call that sets up the LexerThread. I'm looking at doing something with the LexerThread plugin with a function.

keith a westgate
@kawestgate_twitter

Here's what I did. I created a function that returns a class, pass that class in as the plugin, and call a class method just before parsing. I used a SimpleNamespace (SNS) to wrap the lexer state. I also had to duplicate the from_text method so I could pass in the file name.

I think I can pull the fname via the state during an error. that's next.

import lark
from types import SimpleNamespace as SNS
from lark.lexer import Lexer, LexerState, LexerThread
# (BasicError is defined elsewhere in the project)

class RecursiveLexerThread(LexerThread):

    def __init__(self, fname: str, lexer: Lexer, lexer_state):
        self.lexer = lexer
        self.state_stack = [SNS(ls=lexer_state, fname=fname)]

    @classmethod
    def from_text(cls, fname, lexer: 'Lexer', text: str):
        return cls(fname, lexer, LexerState(text))            

    def lex(self, parser_state):
        while self.state_stack:
            lexer_state = self.state_stack[-1].ls
            lex = self.lexer.lex(lexer_state, parser_state)
            try:
                token = next(lex)
            except StopIteration:
                self.state_stack.pop()  # We are done with this file
            else:
                if token.type == "INCLUDE_FILE_NAME":
                    fname = token.value
                    self.state_stack.append(SNS(fname=fname, ls=LexerState(open(fname).read())))
                yield token

def wrapper_fxn(cls):
    class WrapperCls:
        fname = None
        real_cls = cls

        @classmethod
        def set_fname(cls, fname):
            cls.fname = fname

        @classmethod
        def from_text(cls, lexer: 'Lexer', text: str):
            rv = cls.real_cls.from_text(cls.fname, lexer, text)
            return rv

    return WrapperCls

class Parser:
    def ast(self, fh):
        try:
            self.cls.set_fname(str(fh))
            if self.debug:
                parse = self.parser.parse(fh.read_text())
                print(parse.pretty())
                print()
                print('DBG: transform start\n')

                tree = self.transform.transform(parse)
            else:
                tree = self.parser.parse(fh.read_text())
        except FileNotFoundError as err:
            raise BasicError(err)
        except lark.exceptions.LarkError as err:
            raise BasicError('DBG got:' + str(err))

        return tree

    def __init__(self, *, grammar, transformer, debug=False):
        cls = wrapper_fxn(RecursiveLexerThread)
        if debug:
            self.parser = lark.Lark(grammar,
                                    debug=True,
                                    start='start',
                                    parser='lalr',
                                    _plugins={"LexerThread": cls},
                                    transformer=None
            )
        else:
            self.parser = lark.Lark(grammar,
                                    start='start',
                                    parser='lalr',
                                    _plugins={"LexerThread": cls}, 
                                    transformer=transformer
            )

        self.cls = cls
        self.debug=debug
        self.transform = transformer
Erez Shinan
@erezsh

I think you have the right idea, but there's a simpler way to write it. For example, this part:

        @classmethod
        def set_fname(cls, fname):
            cls.fname = fname

is not necessary. Just do whatever_cls_var.fname = fname.

You can also just create an instance of WrapperCls, and from_text can be a regular method. You don't have to create it in a closure.

P.S. for pasting long code, it's better to use a paste service like gist or pastebin
keith a westgate
@kawestgate_twitter
I'll try set_fname; I thought I tried something like this and ran into a read-only issue. I think the closure is necessary: I can't access the one that lark is using directly (the LexerThread), so I need to do this backdoor method.
Erez Shinan
@erezsh
I mean something like this:
    class WrapperCls:
        def __init__(self, real_cls):
            self.real_cls = real_cls
            self.fname = None

        def from_text(self, lexer: 'Lexer', text: str):
            rv = self.real_cls.from_text(self.fname, lexer, text)
            return rv
Why wouldn't it work?
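A hedged sketch of how that instance-based version might be wired up (the grammar placeholder and file name are invented; it assumes lark only ever calls from_text() on whatever object is passed as the "LexerThread" plugin, which is the same assumption the class-based wrapper already makes):

wrapper = WrapperCls(RecursiveLexerThread)   # an instance, not a class
parser = lark.Lark(grammar, parser="lalr", _plugins={"LexerThread": wrapper})

wrapper.fname = "main.bas"                   # hypothetical file name, set before each parse
tree = parser.parse(open(wrapper.fname).read())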
keith a westgate
@kawestgate_twitter
I'll give it a whirl
keith a westgate
@kawestgate_twitter
@erezsh - my code doesn't have access to the point where WrapperCls is initialized, at least I don't think it does. That's called when the lexer thread is initialized. Now if the plugin allowed a custom variable to be passed in, all would be great. That's what I'm doing with my wrapper fxn
keith a westgate
@kawestgate_twitter
I was expecting this to generate an EOF instead of an unexpected token. This is a simpler version of my real code. Why isn't EOF generated?
import sys
import lark

grammar = r"""
start: (STRING+ _EOL)+
STRING : /\S+/
_EOL : /(\n+)/
%ignore /[ \t]+/
"""

def main():
    parser = lark.Lark(grammar, parser="lalr")

    try: 
        tree = parser.parse('')
    except lark.exceptions.UnexpectedToken as err:
        sys.exit('Unexpected Token')
    else:
        print(tree.pretty())

main()
Erez Shinan
@erezsh
@kawestgate_twitter
To the best of my memory, LALR doesn't produce an UnexpectedEOF exception, only Earley does
But you get an UnexpectedToken carrying Token('$END', ''), which is kind of an EOF
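For example, the end-of-input case in the snippet above could be singled out by checking the token type (this assumes err.token is populated on UnexpectedToken, which matches the $END behaviour described here):

try:
    tree = parser.parse('')
except lark.exceptions.UnexpectedToken as err:
    if err.token.type == '$END':
        sys.exit('Unexpected end of input')
    raise
else:
    print(tree.pretty())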
Erez Shinan
@erezsh
But maybe it makes sense to change it. We can throw a different exception if the token is $END
keith a westgate
@kawestgate_twitter
it does make a little more sense to me to throw an EOF, maybe an EOB, instead of an unexpected token.
Johannes Künsebeck
@hnesk

I'm trying to build a lark parser for a shell-like syntax, which I want to use for syntax highlighting / error reporting / autocompletion. A contrived example:

ocrd-doxa-binarize -I IMG,MAX -O BIN -P model "../models en/model.tgz" -p '{"dpi":300}'

The parameters can be (shell-quoted) JSON, shell-quoted strings, or comma-separated values. The consuming software uses Python's shlex module for lexing and json.loads() for the JSON parts, but that doesn't give me the location of the tokens for syntax highlighting.
So what would be the best approach?
a) use Lark with shlex as a Lexer for the shell quoted strings or
b) use Lark for the whole parsing, if so, is there an example of a Lark grammar for shell-quoted strings
c) some other solution I didn't think of?

MegaIng
@MegaIng
Is the syntax well defined, context-free and unambiguous? Shell syntax often fails these criteria, in which case you probably need to write your own parsing loop. If you want positions, using shlex as a parser unmodified is not an option. You could maybe copy the shlex source code and add position information.
If it is context-free and mostly unambiguous, you can use Lark. You need to define regexes for the shell-quoted strings. I don't know the actual rules for shell-quoted strings, but I don't think they should be too hard to express as regexes
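For what it's worth, a rough sketch of option (b) under simplified quoting rules (this is nowhere near full POSIX quoting -- no escapes inside single quotes, no nesting -- just quoted chunks and bare words):

from lark import Lark

shellish = Lark(r"""
start: arg+
arg: SQ_STRING | DQ_STRING | WORD

SQ_STRING : /'[^']*'/
DQ_STRING : /"(\\.|[^"\\])*"/
WORD      : /[^\s'"]+/

%ignore /[ \t]+/
""", parser="lalr")

# every Token in the tree carries .line, .column and .start_pos, which is what
# the syntax highlighting needs
print(shellish.parse("""-P model "../models en/model.tgz" -p '{"dpi":300}'""").pretty())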
Erez Shinan
@erezsh
@hnesk For the example you contrived here, it would be pretty simple to do with Lark, without any outside help. However, as MegaIng pointed out, some shell syntax features aren't context-free (though I can't think of an example right now). So, best to make sure that everything you want to support is context-free, before you commit to using Lark for this.
Johannes Künsebeck
@hnesk

Thanks for the hints: I will take another approach now. I found out I can get the position information from shlex like this (simplified):

import shlex

# (runs inside a generator; `line` is the command line being processed)
lexer = shlex.shlex(line, posix=True, punctuation_chars=True)
start_offset = 0
for token in lexer:
    end_offset = lexer.instream.tell()
    yield line[start_offset:end_offset]
    start_offset = end_offset

and do the parsing manually. This way it stays close to the original implementation. For error reporting inside the JSON values I will still use lark. Thanks for the hints, again!