Drew Abbot
@drewabbot
@erezsh - a grammar-cache would be nice! but for now, it would be helpful to know if there was a release schedule. for example, a few of the recent functions in master regarding persistence are useful, and overlap with code we wrote against 0.8.1 before noticing them. since we use lark in a production system, if we knew ahead of time that version 0.8.2, say, was coming on or around a certain date, we could better prepare for the upgrade.
Erez Shinan
@erezsh
@drewabbot No problem, I will probably release 0.8.2 sometime next week
Drew Abbot
@drewabbot
thanks! :+1:
Erez Shinan
@erezsh
Version 0.8.2 is out
@drewabbot You probably noticed, but here's a ping anyway
Phoebus Giannopoulos
@phoebusg
<-- a complete Lark newbie, want to look into using Lark to read legacy software's strange parametric configuration files...
Pointed to Lark by some other Pythoner.
Erez Shinan
@erezsh
Welcome! I daresay Lark is optimal for strange configuration files :)
Phoebus Giannopoulos
@phoebusg
Just what I want to hear... I am not sure yet where to start however... to be able to deal with the weird configs :/
If you have a starter/tutorial in mind to get me there eventually let me know!!
Erez Shinan
@erezsh
You should read the two tutorials here: https://lark-parser.readthedocs.io/en/latest/ (json parser, and how to write a DSL)
And then see this example for an actual config format (although a very simple one): https://github.com/lark-parser/lark/blob/master/examples/conf_earley.py
Finally, you should look at https://mappyfile.readthedocs.io/en/latest/, which is written using Lark in order to parse a super complicated config file
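
For a flavor of what such a config parser can look like, here is a minimal sketch along the lines of the conf_earley example mentioned above; the toy "key = value" grammar and the sample input are illustrative assumptions, not taken from conf_earley.py or mappyfile.

from lark import Lark

# Toy "key = value" config grammar (illustrative only, not the real conf_earley.py)
config_parser = Lark(r"""
    start: line*
    line: NAME "=" VALUE _NL
    NAME: /[A-Za-z_]\w*/
    VALUE: /\S[^\n]*/
    _NL: /\r?\n/
    %ignore /[ \t]+/
""", parser="earley")

sample = "host = example.com\nport = 8080\n"
print(config_parser.parse(sample).pretty())
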
Phoebus Giannopoulos
@phoebusg
@erezsh thank you, copying these to my note and todo list... cheers!
Andrew Albershtein
@alberand
Hey guys, I am trying to use the Reconstructor and I'm hitting an error saying that regex reconstruction is not implemented yet. I use regexes to filter out comments in the source file. Is there a quick way to skip regex reconstruction, just to see if the Reconstructor still produces a valid result?
oh, never mind, it seems like it stops on _NL (newline)
Drew Abbot
@drewabbot
@erezsh thanks!
Şahin Akkaya
@Asocia

Hi all,

I'm trying to parse WhatsApp messages. I built a grammar but it is too slow. Can you tell me what I am doing wrong?

This is my sample input file: https://pastebin.com/Kafx3rup (I didn't include every case but this should be OK)
And here is the grammar that I wrote for it: https://pastebin.com/Hs7F61TB (This includes every rule; some of them do not appear in my sample file, but I will need them eventually, so I didn't remove them here.)
Lastly, this is my simple script that reads the input file and parses it with my grammar: https://pastebin.com/HLc61mtS

The problem is that even with this small input file it takes 0.34 seconds to build a tree. When I increase the line count from 15 to 100, for example, it takes forever to terminate.
And the real files that I'm using have thousands of messages, so this is a real problem
Şahin Akkaya
@Asocia
I tried it with regex by the way, but I don't like it. Besides failing in some cases, it reduces readability too much. I want to get rid of the WhatsApp-related issues as early as possible.
Erez Shinan
@erezsh
Hi @Asocia, you are probably experiencing the O(n^2) or even O(n^3) behavior of the Earley parser. There are a number of things you can do to fix it, in order of difficulty:
1) Turn everything you can into terminals. For example group_name into GROUP_NAME. You can strip the quotes later.
2) Split your input into lines. Maybe you can detect a multiple-line message by checking for the date prefix with a regexp.
3) Reduce ambiguity by being more specific whenever possible. You already did some of it by restricting the size with {...}, but specifying an ABC instead of /.+/ can also help, etc. (which is also why group_name is so inefficient).
4) Convert to LALR. That will bring the best performance, but might be impossible for your requirements. Tough to say just by looking.
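
To make tip #2 concrete, here is a rough sketch of pre-splitting a WhatsApp export into one chunk per message (detected by the date prefix) and parsing each chunk separately. The DATE_PREFIX regex, the split_messages helper, the tiny message_parser grammar, and the sample export are assumptions for illustration; they are not the grammar from the pastebin links above.

import re
from lark import Lark

# Hypothetical date prefix of a WhatsApp line, e.g. "3/5/18, 12:51 PM - "
DATE_PREFIX = re.compile(r"^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}( [AP]M)? - ", re.M)

def split_messages(text):
    """Yield one string per message, continuation lines included."""
    starts = [m.start() for m in DATE_PREFIX.finditer(text)]
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end]

# Minimal per-message grammar, purely as a placeholder
message_parser = Lark(r"""
    start: DATETIME TEXT
    DATETIME: /\d{1,2}\/\d{1,2}\/\d{2}, \d{1,2}:\d{2}( AM| PM)? - /
    TEXT: /(.|\n)+/
""", parser="earley")

export = ("3/5/18, 12:51 PM - +99 555 999 22 33: hello\nstill the same message\n"
          "3/5/18, 12:52 PM - +99 555 111 33 44: hi!\n")
for chunk in split_messages(export):
    print(message_parser.parse(chunk).pretty())

Parsing many short chunks keeps each Earley run small, which is where the quadratic/cubic cost bites.
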
Şahin Akkaya
@Asocia
Thank you @erezsh I will adjust my grammar considering these tips. Hope I can come up with a good complexity
Erez Shinan
@erezsh
It should be possible, so don't give up. I think #2 alone might do the trick.
Şahin Akkaya
@Asocia
I've started improving it. But I'm still not sure when to use a terminal or a rule. For example, person can be a terminal according to #1, but can I convert persons to a terminal too?
https://lark-parser.readthedocs.io/en/latest/grammar/#rules This should do the job actually. Everything is explained, but I still feel dumb sometimes :D
Şahin Akkaya
@Asocia
I've just completed #1 and I'm getting something similar to O(n):
5k lines: 2.79 seconds
20k lines: 7.35 seconds
Şahin Akkaya
@Asocia
I think this is acceptable, but now my question is: how do I know which line matched which rule? For example, I have something like this in my grammar:
group_chat_operation: ADD_PERSON | REMOVE_PERSON | LEFT_GROUP | ...
And in the WhatsApp input file there are lines that correspond to each of these operations. When I run the parser, I'm just getting the "group_chat_operation" part. It doesn't say whether the line matched ADD_PERSON or REMOVE_PERSON etc.
Am I missing something here?
Erez Shinan
@erezsh
There are three solutions for that. You can use aliases (with ->), you can access the token type with my_token.type, or you can create a rule that only contains that one token. It won't affect performance by much.
Probably aliases are the best way.
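
A hedged sketch of the alias approach: each alternative gets its own name with "->", so the resulting tree node tells you which operation matched. The ADD_PERSON / REMOVE_PERSON / LEFT_GROUP patterns and the sample line below are invented stand-ins, not the real grammar.

from lark import Lark

parser = Lark(r"""
    start: group_chat_operation
    group_chat_operation: ADD_PERSON    -> add_person
                        | REMOVE_PERSON -> remove_person
                        | LEFT_GROUP    -> left_group
    ADD_PERSON: /.+ added .+/
    REMOVE_PERSON: /.+ removed .+/
    LEFT_GROUP: /.+ left/
""", parser="earley")

tree = parser.parse("1 22 33 added +99 555 999 77 99")
print(tree.children[0].data)   # -> add_person
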
Şahin Akkaya
@Asocia
Thank you so much @erezsh, it worked!
I feel like I'll finally have a good grammar that operates in a reasonable time
But I have one little problem to solve
When I change this
persons: PERSON ((", " PERSON)* " and " PERSON)?
to this
PERSONS: PERSON ((", " PERSON)* " and " PERSON)?
I'm getting an error and I can't figure out why
Şahin Akkaya
@Asocia
1 22 33 added +99 555 999 77 99 and +99 555 111 33 44
                                        ^
Expecting: {Terminal('DATETIME')}
Error is something like this
Şahin Akkaya
@Asocia
My PERSON terminal captures any string up to a length of 25. So it matches 25 characters after the "added" part and then fails. But I think this should not happen, since I defined the rule as PERSON "added" PERSONS, and this rule does match when persons is a rule.
Şahin Akkaya
@Asocia

Here is a small code to reproduce:

from lark import Lark

GRAMMAR = """
PERSON:  /.{1,25}/
persons: PERSON ((", " PERSON)* " and " PERSON)?

DATETIME: /\d{1,2}\/\d{1,2}\/\d{1,2}, \d{1,2}:\d{1,2}( AM| PM)?/ " - "


start: DATETIME add_person

add_person: PERSON " added " persons

%import common.WS
%ignore WS"""

parser = Lark(GRAMMAR, lexer="dynamic_complete")
s = "3/5/18, 12:51 PM - +99 555 999 22 33 added +99 555 999 44 99 and +99 555 111 33 44"
print(parser.parse(s).pretty())

I don't understand why converting persons to a terminal causes an error.

Erez Shinan
@erezsh
either way, persons should stay a rule
Don't turn everything into terminals. It's a bit of an art, but not too hard. Use rules for anything complex or structured. But you also have to make sure terminals have cues, so they won't match just anything.
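
To illustrate "terminals need cues" with the example above: keep persons as a rule, and give PERSON an actual shape to latch onto instead of the catch-all /.{1,25}/. The phone-number pattern here is a guess based on the sample line, not the grammar from the pastebin.

from lark import Lark

parser = Lark(r"""
    start: DATETIME add_person
    add_person: PERSON " added " persons
    persons: PERSON ((", " PERSON)* " and " PERSON)?
    PERSON: /\+\d+( \d+)*/
    DATETIME: /\d{1,2}\/\d{1,2}\/\d{1,2}, \d{1,2}:\d{1,2}( AM| PM)? - /
""", parser="earley")

s = "3/5/18, 12:51 PM - +99 555 999 22 33 added +99 555 999 44 99 and +99 555 111 33 44"
print(parser.parse(s).pretty())

Because PERSON can only match a "+" followed by digit groups, it no longer swallows " added " and the persons rule composes the pieces as intended.
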
Şahin Akkaya
@Asocia
OK, thank you so much again :)
Ivan Belyaev
@ptiza-v-nebe
What is the actual difference between Transformer and Visitor? Is Transformer the typical way of processing the tree? Is it possible to use Visitor the same way as Transformer? How much performance is lost when using Transformer? How much extra work, and of what kind, is needed to use Visitor (or Visitor_Recursive)?
MegaIng
@MegaIng

@erezsh I would like to discuss a few things regarding #540 here. First: the question of whether value, expr, ... should be used as the argument. Currently I am using value, mainly because I use it to generate the name of the rule, which then gets a relatively nice name. If we use something else, we will have to do a lot more work: we have to figure out which argument lists are equivalent.

I personally would not use anything 'above' atom/expr, since I don't think that syntax looks very clean: sep{"A" | "B", "," | ";"}. atom would allow the same constructs, but force you to use parentheses: sep{("A" | "B"), ("," | ";")}. expr would allow you to use operators. I think that would be OK. (But I am not really sure how to detect equivalent patterns with either of these.)

Second: TOKEN templates: yes or no?
Erez Shinan
@erezsh
@ptiza-v-nebe Transformer recreates the tree. It's the most common way to work with an AST. It's a little slower than visitor, but you probably won't even notice. Visitor is most often used for preprocessing the AST. Visitor_Recursive is slightly faster, but has a depth limit due to stack.
In general, don't worry so much about it. It's fairly easy to switch between all of them, as the syntax and semantics are very similar.
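
For readers who want to see the difference in code, here is a small sketch: a Transformer that rebuilds the tree into a dict, and a Visitor that only walks the existing tree. The grammar, input, and class names (ToDict, PrintPairs) are made up for illustration.

from lark import Lark, Transformer, Visitor

parser = Lark(r"""
    start: pair+
    pair: NAME "=" NUMBER
    NAME: /[a-z]+/
    NUMBER: /\d+/
    %ignore /\s+/
""")

tree = parser.parse("a = 1 b = 2")

# Transformer: rebuilds the tree bottom-up; each method returns a new value.
class ToDict(Transformer):
    def pair(self, children):
        name, number = children
        return (str(name), int(number))
    def start(self, pairs):
        return dict(pairs)

print(ToDict().transform(tree))   # {'a': 1, 'b': 2}

# Visitor: walks the existing tree in place; methods receive the Tree node itself.
class PrintPairs(Visitor):
    def pair(self, tree):
        print("saw pair:", tree.children)

PrintPairs().visit(tree)
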
Erez Shinan
@erezsh
@MegaIng I agree, let's use atom. I don't think it should be difficult to change. If you're not sure how, you can leave it at value and I'll take it from there.
MegaIng
@MegaIng
@erezsh Ok, I will do a pull request with the current version. My second question still stands: Should we add TOKEN templates? This will be a little harder, but still possible (I think).
Erez Shinan
@erezsh
What would be the use for token templates?
MegaIng
@MegaIng
The one thing I could think of was strings/regexes with different delimiters, or 'numbers' with different allowed digits:
NUMBER{DIGITS, EXPONENT}: ("+"|"-")? (DIGITS+ "."?|DIGITS* "." DIGITS+)  (EXPONENT ("+"|"-")? DIGITS+)?
Erez Shinan
@erezsh
Yeah, I can see some utility in this. Also PY_STRING("'''") | PY_STRING("\"\"\"")
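
Purely to illustrate what the rule templates discussed above buy you, here is a sketch using a sep-style template with the braces syntax from the discussion; treat the exact syntax and the toy grammar as assumptions (it requires a Lark version that ships rule templates), and note that token templates are a separate, still-open question.

from lark import Lark

# A rule template: _separated{x, sep} expands to "x (sep x)*" for each use.
parser = Lark(r"""
    start: "[" _separated{NUMBER, ","} "]"
    _separated{x, sep}: x (sep x)*
    NUMBER: /\d+/
    %ignore " "
""")

print(parser.parse("[1, 2, 3]").pretty())
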