These are chat archives for mdedetrich/scala-json-ast

9th Apr 2016
Erik Osheim
@non
Apr 09 2016 00:14
@Ichoran nice!
Ichoran
@Ichoran
Apr 09 2016 00:25
@non - Thanks! It is nice to have something that actually puts Jackson to shame rather than just sorta keeps up with it :) (And is fast enough to justify making JSON the standard data storage format for a bunch of stuff instead of some weird binary thing with a converter for interoperability.)
Erik Osheim
@non
Apr 09 2016 00:37
haha yeah. although imo jawn does more than just "keep up with" jackson ;)
but yeah -- sounds like you've got things figured out pretty well
Ichoran
@Ichoran
Apr 09 2016 00:40
When I tried it last, Jawn was not faster than Jackson for parsing and using heavily numeric data. Maybe it's faster now?
Anyway, I still think we should just merge Jawn, once it's stable, into the standard library, with scala-json-ast as the standard AST.
Erik Osheim
@non
Apr 09 2016 00:41
huh, interesting. for me that was the case that it beat jackson/gson on. idk i'll have to look again.
i'm envious of jsonal though -- i like that direct style of code, makes the bounds-checking, etc a lot nicer
Ichoran
@Ichoran
Apr 09 2016 00:42
Meh, writing all the code four times was not very much fun.
Erik Osheim
@non
Apr 09 2016 00:42
but being able to chunk/buffer was a requirement when i was writing jawn so i had to support non-string input formats
Ichoran
@Ichoran
Apr 09 2016 00:43
JsonRecyclingParser is a chunking parser.
Erik Osheim
@non
Apr 09 2016 00:43
i just mean that since you're working from a String it's relatively easy to see what's going on
Ichoran
@Ichoran
Apr 09 2016 00:43
It doesn't do streaming, but it does do chunking reads.
Erik Osheim
@non
Apr 09 2016 00:43
ah right
i should investigate that approach sometime
Ichoran
@Ichoran
Apr 09 2016 00:44
It's about 20% slower than the String parser, IIRC.
Maybe 15%.
Erik Osheim
@non
Apr 09 2016 00:44
with jawn i get inconclusive results. sometimes parsing from a string is 5-10% faster, other times parsing from chunked buffers / arrays ends up being faster
but i think the code has a bit more indirection which slows down the string parser
Ichoran
@Ichoran
Apr 09 2016 00:45
I am not sure whether using unsafe was/is mandatory for it to be as fast as it was. I was worried about bounds checking being slow.
Erik Osheim
@non
Apr 09 2016 00:45
right, i think bounds-checking definitely has an impact
Ichoran
@Ichoran
Apr 09 2016 00:45
OTOH, I have to do a lot of Long/Int conversion stuff that is completely pointless except that that's the interface for unsafe.
Erik Osheim
@non
Apr 09 2016 00:45
that's a place where jawn could do better -- right now we do a lot of semi-necessary/unnecessary bounds-checking
due to the semi-OO interface
Ichoran
@Ichoran
Apr 09 2016 00:46
Well, you need some way to reduce code duplication. I should either be bald or white-haired by now :P
Erik Osheim
@non
Apr 09 2016 00:47
:)
Ichoran
@Ichoran
Apr 09 2016 00:47
But, anyway, Jsonal is as fast as Boon, and nothing is faster than Boon, so I'm happy. (And with Boon you can't rapidly get at your numbers.)
That 250 MB/s is reading a file full of numbers and multiplying every one by a scaling factor.
Erik Osheim
@non
Apr 09 2016 00:49
i'll have to add boon to my benchmarks (and jsonal)
Ichoran
@Ichoran
Apr 09 2016 00:50
You can get Jsonal now. It's up as a snapshot. libraryDependencies += "com.github.ichoran" %% "kse" % "0.4-SNAPSHOT"
With the standard snapshot resolver.
Jast.parse(whatever) is probably the best thing to benchmark.
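For reference, the full sbt incantation would look something like this (the resolver line is an assumption about how the snapshot is published; only the libraryDependencies line is quoted from the chat):

```scala
// build.sbt -- pulling in the kse snapshot that contains Jsonal
resolvers += Resolver.sonatypeRepo("snapshots")  // the "standard snapshot resolver"
libraryDependencies += "com.github.ichoran" %% "kse" % "0.4-SNAPSHOT"
```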
Erik Osheim
@non
Apr 09 2016 00:51
ok great. i'll add it and run it by you to be sure it seems reasonable
Ichoran
@Ichoran
Apr 09 2016 00:51
Okay, sounds good!
I will try not to break the snapshot too frequently :P
I probably should publish jsonal as its own thing so it is not necessary to pull in all my other weird stuff just to parse numeric Json quickly.
Oh, and if it breaks, file a bug report please :)
One more thing--it's really best for numbers that are 1-ish, and don't have huge numbers of digits (e.g. 15 or less). These can be converted to doubles really efficiently, and also happen to cover most scientific data (if you use appropriate units, and keep in mind the precision of your measurement).
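A minimal sketch of why short numbers are cheap (a hypothetical illustration, not jsonal's actual code): a Double carries roughly 15.95 decimal digits of precision, so up to 15 significant digits can be accumulated exactly in a Long and then scaled:

```scala
// Hypothetical fast path: accumulate up to 15 digits into a Long, then scale.
// The accumulation is exact; only the final scaling can round.
def fastParse(s: String): Double = {
  var i = 0
  var neg = false
  if (s.charAt(0) == '-') { neg = true; i = 1 }
  var mantissa = 0L
  var decimals = 0        // digits seen after the '.'
  var seenDot = false
  while (i < s.length) {
    val c = s.charAt(i)
    if (c == '.') seenDot = true
    else {
      mantissa = mantissa * 10 + (c - '0')
      if (seenDot) decimals += 1
    }
    i += 1
  }
  var d = mantissa.toDouble
  var k = 0
  while (k < decimals) { d /= 10.0; k += 1 }  // one divide per decimal digit; fine for a sketch
  if (neg) -d else d
}
```

A real parser would use a lookup table of powers of ten instead of repeated division, but the shape of the hot loop is the same.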
Now I have a train to catch.
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:17
@Ichoran Nice work, that's pretty impressive considering that Boon is using unsafe
Ichoran
@Ichoran
Apr 09 2016 03:21
@mdedetrich - I don't use it for strings, but I do use it for InputStream.
(I was too worried about bounds checking slowing things down to even try writing that without using unsafe, especially after the performance hit I got from ByteBuffer and CharBuffer compared to String.)
@mdedetrich - But you can do some crazy stuff with unsafe. I can parse true, false, and null insanely fast from InputStreams :)
(Because you just load one Int, and compare.)
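The single-load trick works because "true" and "null" are exactly four bytes (and "false" is four bytes after the leading 'f'), so one 32-bit read plus one compare validates the whole literal. A hedged sketch using ByteBuffer rather than Unsafe, with made-up names:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Pack four ASCII bytes into a single Int, once, up front.
def pack(s: String): Int =
  ByteBuffer.wrap(s.getBytes("US-ASCII")).order(ByteOrder.LITTLE_ENDIAN).getInt

val TrueBits = pack("true")
val NullBits = pack("null")
val AlseBits = pack("alse")   // "false" minus its leading 'f'

// One 32-bit load and one compare instead of four byte-by-byte compares.
def isTrueAt(buf: Array[Byte], i: Int): Boolean =
  ByteBuffer.wrap(buf, i, 4).order(ByteOrder.LITTLE_ENDIAN).getInt == TrueBits
```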
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:23
Actually while you are here, would you know a better way of detecting "if a String number fits into a double without losing precision"? I am currently doing https://github.com/mdedetrich/scala-json-ast/blob/master/shared/src/main/scala/scala/json/ast/package.scala#L26-L28, but it's not completely correct (won't work for E^(some massive number), since that will all reduce down to 0, and not some form of infinity)
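One way to make that check exact, including the huge-exponent cases that collapse to 0 or blow up to infinity, is to round-trip through BigDecimal (a sketch; `fitsDouble` is a hypothetical name, not the scala-json-ast code):

```scala
import java.math.{BigDecimal => JBigDecimal}

// True iff the decimal string converts to a Double with no precision loss.
// new JBigDecimal(d) expands the Double's exact binary value, so comparing it
// to the exactly-parsed string catches rounding, underflow, and overflow alike.
def fitsDouble(s: String): Boolean = {
  val d = s.toDouble
  !d.isInfinity && new JBigDecimal(s).compareTo(new JBigDecimal(d)) == 0
}
```

This is correct but not cheap (it allocates two BigDecimals), which is exactly the trade-off being debated below.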
Yeah, unsafe can do some crazy stuff; honestly, if you want speed, you want as little indirection as possible
Ichoran
@Ichoran
Apr 09 2016 03:24
I never found a good way to detect whether you really won't lose any precision at all, or whether you'll only lose the last digit.
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:24
Well it doesn't have to be perfect; it's for hashCode, which doesn't have to be a perfect hash
Ichoran
@Ichoran
Apr 09 2016 03:26
Honestly, I'm not sure why you're calling .toDouble at all. Just compute off the string directly.
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:26
The problem is that code is going to create a lot of collisions for numbers which are e^-(something massive)
Ichoran
@Ichoran
Apr 09 2016 03:26
Have you benchmarked .toDouble vs. .hashCode on a string?
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:26
@Ichoran That's a lot slower than just doing toDouble and converting to bits
The intention is that if the number can fit into a double without losing (too) much precision, and doing that check is cheap, we just get the bits of the double
Else we compute directly off the string
I'm trying to make the hash have higher performance for typical situations, which are going to be Double most of the time
Definitely open to suggestions to make it faster, this is my first attempt in a while, and I am fairly rusty at this
Ichoran
@Ichoran
Apr 09 2016 03:28
But you never have a Double hanging around. It's all String.
So you only want to call .toDouble if .toDouble is faster than computing the hash code, right?
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:29
I am calling toDouble to check if the number representation in the string can fit into a double without losing precision (or too much precision, to be accurate); if it does, we use val long = java.lang.Double.doubleToLongBits(asDouble), else we just compute off the string, which is much more expensive (but it will only do that for really large numbers)
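The two-tier hash described here, sketched out (hypothetical names; the fit test and the fallback are crude placeholders, not the actual scala-json-ast code):

```scala
// Placeholder for the expensive compute-off-the-string fallback.
def stringNumberHash(s: String): Int = s.hashCode

// Cheap path: when the value round-trips through Double, hash its bits;
// otherwise fall back to the string-based hash.
def jNumberHash(s: String): Int = {
  val d = s.toDouble
  if (!d.isInfinity && d.toString == s) {   // crude "fits exactly" test
    val bits = java.lang.Double.doubleToLongBits(d)
    (bits ^ (bits >>> 32)).toInt            // standard Long-to-Int fold
  } else stringNumberHash(s)
}
```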
Ichoran
@Ichoran
Apr 09 2016 03:30
Why do you think it's more expensive to compute off the string?
A hashCode done Java style is basically just treating the number as base 31 instead of base 10.
And you don't worry about wrapping or anything.
Overflow, I mean.
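Java's String.hashCode really is just Horner's rule in base 31, one multiply and one add per character, with Int overflow left to wrap:

```scala
// Equivalent to String.hashCode: h = 31*h + c for each char, overflow ignored.
def base31Hash(s: String): Int = {
  var h = 0
  var i = 0
  while (i < s.length) {
    h = 31 * h + s.charAt(i)
    i += 1
  }
  h
}
```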
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:32
I did some preliminary benchmarks and it was slower
At least doing it on a string, because it does a lot of bit shifts (even if the bitshifts are faster)
If you treat the string as a Double, it does a lot fewer bit shifts than computing it manually off each character
Ichoran
@Ichoran
Apr 09 2016 03:34
Yes, but you have to compute the Double value from the string
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:34
But I do need to do better benchmarking to figure out if this is the best strategy
Yes I do, but from microbenchmarks that seems fairly cheap
Ichoran
@Ichoran
Apr 09 2016 03:34
That isn't cheap. So, anyway, my advice is to use the same scheme I did with comparison but make it compute a hashCode instead.
It's fairly cheap, but it's less cheap than a couple of compares and a multiply per digit, which is about what you'd need.
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:35
Well that's basically what is here https://github.com/mdedetrich/scala-json-ast/blob/master/shared/src/main/scala/scala/json/ast/package.scala#L29-L55, but the treatment of 0 there is a bit silly, so it may need more work
Ichoran
@Ichoran
Apr 09 2016 03:35
Yeah, I'll write it for you if I get time, I guess. But I don't have time right now.
Basically, you want to encode chunks of zeros as something special--keep a count of them.
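The zero-run idea, sketched: instead of hashing each '0' separately, hash one marker plus the run length, so long runs of zeros contribute a compact, collision-resistant term (hypothetical code, not the eventual scala-json-ast implementation):

```scala
// Hash the digits of a number, folding each run of zeros into
// (marker, runLength) so different-length zero runs still hash apart.
def zeroRunHash(digits: String): Int = {
  var h = 0
  var i = 0
  while (i < digits.length) {
    if (digits.charAt(i) == '0') {
      var n = 0
      while (i < digits.length && digits.charAt(i) == '0') { n += 1; i += 1 }
      h = 31 * (31 * h + '0') + n   // one symbol for the run, plus its count
    } else {
      h = 31 * h + digits.charAt(i)
      i += 1
    }
  }
  h
}
```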
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:36
That's fine; the hashCode isn't top priority, and it doesn't break binary compatibility. I haven't had much time at all either, with moving and all, so I just did something quick that is "good enough for now"
Ichoran
@Ichoran
Apr 09 2016 03:36
nod
I just gave up on hashCoding big numbers. Everything big or really tiny goes to the hash of Infinity or 0.
Bug me in a week or so if you still want me to look at it. I will probably forget rather than do it. (Not that I'll have time.)
(But I might anyway, if I think it's fun or sufficiently useful.)
Matthew de Detrich
@mdedetrich
Apr 09 2016 03:39
That's what my code does now; it's just a question of how cheap/expensive doing toDouble is in the scheme of things (which is hard to calculate). The treatment of zeroes right now is fairly dumb though
But yeah, I'll bug you a bit later; just haven't had more than an hour's worth of time to do proper coding, most of my work lately has been "admin related" :wink2:
*doubles=zeroes