Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Repo info
Activity
Markus Westerlind
@Marwes

Is it okay to reuse combine parsers?

Yes

And why do they take &mut self for their methods anyways?

It allows them to take FnMut functions so that they can mutate things (say push to a Vec for the many parser)

eaglgenes101
@eaglgenes101
Okay good experience, but trying to make spans work was a pain
Im just interested in the location of the token where a parser began parsing, and of the token it either committed a success or ran into an error
Seems severely burdensome to have to replace half the types in my function signatures with spanned analogs just to get those
eaglgenes101
@eaglgenes101
(Also, I tried to implement the combinators myself. What's with all the parse_mode stuff in the library combinators, and do I need to concern myself with those?)
marwes
@marwes:matrix.org
[m]

Seems severely burdensome to have to replace half the types in my function signatures with spanned analogs just to get those

If you need the default errors to know about spans then I am afraid you need a custom stream that knows about spans (the stream would also tokenize so I'd expect a custom would be needed regardless). If you only need the span of a particular parser you can do something like (position(), my_parser, position()) to get the start end end of it

(Also, I tried to implement the combinators myself. What's with all the parse_mode stuff in the library combinators, and do I need to concern myself with those?)

The parse_mode stuff is quite a mess coming in from the outside but it necessary to support parsing incomplete/partial input. If you need a custom parser I'd recommend copying a parser as a base from the library itself or using the parsers in https://docs.rs/combine/4.6.0/combine/parser/function/index.html as they let you invoke all the parse machinery manually while still only needing to write one function

eaglgenes101
@eaglgenes101
And what's the idea of add_committed_expected_error? The implementations I could find in the library mostly just forward it or implement the method much like add_error
eaglgenes101
@eaglgenes101
Right now I'm trying to understand the choice combinator to see if I can implement one that has different error-merging behavior, returning the error corresponding to the longest matching alternative instead of (from what I can tell) the error the last alternative returns
eaglgenes101
@eaglgenes101
...and nevermind, that's already what the choice combinator does for errors. Argh.
eaglgenes101
@eaglgenes101
Okay, I finally figured out the issue I was having-- many of the individual parsers set the expected set in the errors, and the combinators kind of let this happen. Ideally, the many combinator and friends would make it so that setting the expected set of the wrapped combinator merely causes it to be added to the final combinator rather than completely overwriting it.
I assume this is a bit of an oversight regarding how different little parts of combine interact when put together
eaglgenes101
@eaglgenes101
As a workaround, I created my own combinator that, when an error occurs, holds onto a tracked error's value and just merges the inner combinator's error contribution on top
eaglgenes101
@eaglgenes101
Should I file an issue about this?
marwes
@marwes:matrix.org
[m]
Sure, though it may be difficult to change without breaking other assumptions
Ed Page
@epage

Wondering if someone can help me with a change I'm making to toml_edit. I'm updating the parser to handle un-escaped quotes right before/after triple-quotes. I'm assuming I need to wrap stuff with attempt but didn't get anywhere when sprinkling it around

I've updated toml_edits parser
https://github.com/ordian/toml_edit/pull/125/commits/37c3d8c82d38333dcbfdfe8dea8f18f653b4197c
(core is ml_basic_body and ml_literal_body, with ml_literal_body being the easier-to-read one)

to reflect the ABNF grammar
https://github.com/toml-lang/toml/blob/master/toml.abnf

To handle test cases like:

lit_one = ''''one quote''''
lit_two = '''''two quotes'''''
lit_one_space = ''' 'one quote' '''
lit_two_space = ''' ''two quotes'' '''

https://github.com/BurntSushi/toml-test/blob/master/tests/valid/string/multiline-quotes.toml

And get errors like

 thread 'parser::tests::values' panicked at 'Unexpected error for "'''I [dw]on't need \\d{2} apples'''": Parse error at line: 1, column: 35
Unexpected `end of input`
Expected `'''`
While parsing a Multiline Literal String

https://github.com/ordian/toml_edit/pull/125/checks?check_run_id=3414324671

Ed Page
@epage

Also, anyone have experience optimizing combine parsers? toml_edit takes twice as long to parse as toml-rs and all of that is wtihin the parsing and the deep stacks are making it harder to differentiate what is costing us.

This is part of the effort to switch cargo from toml-rs to toml_edit so we can mainline cargo-add.

marwes
@marwes:matrix.org
[m]
I'd look into parsing &[u8] instead of &str as the char decoding takes a fair bit of time
https://docs.rs/combine/4.6.1/combine/macro.dispatch.html may be useful in some really hot locations where there are a lot of alternative parsers
Use https://docs.rs/combine/4.6.1/combine/parser/range/index.html parsers when possible as they work with slices instead of constructing String/Vec etc
Ed Page
@epage
I've been considering &[u8]. With a format-preserving toml parser, almost all bytes will be passed back up, which for me, means needing to re-validate for UTF-8 which might negate the benefits (with how much there is to know about utf-8, I'm always cautious about conditions which I can construct a str without validation).
1 reply
When looking in a profiler, it felt like there was a lot of overhead from state management and error preparation (even though no errors occurred). I suspect moving away from the easy parser and constructing location information after-the-fact, rather than before,. would be a big help.
1 reply
We do use range and recognize(skip_many) in a lot of places, so we're covered there.
1 reply
I feel like dispatchs example isn't quite making its use clear. I'm assuming this is meant to help with parsing state machines so, for example, when trying similar alternates like numbers, date, and time values, we can branch off at the appropriate points rather than restarting from scratch.
1 reply
marwes
@marwes:matrix.org
[m]
I'd use it in say, parsing a JSON value where dispatching on {[ft" or a number is a very hot location
marwes
@marwes:matrix.org
[m]
Since ranges are still &str you can still have UTF-8 string literals and all
(skips the char decoding in many cases)
Though it can lose some error reporting
marwes
@marwes:matrix.org
[m]
6 replies
Ed Page
@epage

Byte processing gave a small, but measurable, improvement.
https://github.com/ordian/toml_edit/pull/219/files

I could probably remove some UTF-8 validation to speed it up further. I'm just cautious about assuming I'm as safe as I think I am.

My main feedback from making the switch

  • Getting a position out of a IndexPositioner is annoying because its only available through the Positioner trait which requires setting 1 generic type and 2 associated types
  • Generally converting errors is annoying. Some of my code for this could probably be upstreamed
Ed Page
@epage
Looks like Errors::merge is taking up 8% of the time in my profiled case (including all the process start overhead which is 30%, so the percentage of parse time is even higher)
marwes
@marwes:matrix.org
[m]
Figuring out what is actually creating errors to eagerly is a good plan then, because normally that shouldn't be the case
Could do a local patch of Errors::merge to panic in it and check the stacktrace