These are chat archives for frictionlessdata/chat

20th Jul 2017
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 01:24
Are there any standard type inference test case CSV files we want to make sure pass the test? Would be cool if all libraries passed the same minimum-requirement test cases.
Daniel Fireman
@danielfireman
Jul 20 2017 01:31
For validation, I used tests/test_validate.py as a baseline
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 02:03
Solid
tableschema-js has this approach in which the schema descriptor can be changed in place, with all changes requiring a commit, like so: table.schema.commit()
I haven't seen this approach in -py or -php. Am I missing something, or is this specific to -js?
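For illustration, a minimal sketch of that commit-based workflow, assuming the tableschema-js Table.load / table.infer entry points (the file path, field index, and new type are made-up examples, not from a real dataset):

    // Sketch of the descriptor-mutation + commit pattern described above.
    const {Table} = require('tableschema')

    async function example() {
      const table = await Table.load('data.csv')
      await table.infer()  // build table.schema from the data

      // Change the schema descriptor in place...
      table.schema.descriptor.fields[0].type = 'integer'

      // ...then commit so the schema object picks up the change.
      table.schema.commit()
    }

    example().catch(console.error)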
roll
@roll
Jul 20 2017 06:59
@georgeslabreche_twitter For now tableschema-js can be considered the reference implementation for dynamically typed languages. Changing the descriptor in place was a missing requirement from http://specs.frictionlessdata.io/implementation/. But for statically typed languages I think it could be something to discuss later.
I'm working on testsuite-basic for basic-level spec implementations - https://github.com/frictionlessdata/implementations#implementation - it will include testing data/schemas. It's happening in the background, so kinda slowly, but I'm sure we will have it soon. For now the best source of testing inspiration is probably the Python/JavaScript library tests.
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 07:14
Thanks man
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 08:14
If we ever build a Haskell library, this may be interesting to dig into for type inference: http://conscientiousprogrammer.com/blog/2015/12/12/24-days-of-hackage-2015-day-12-json-autotype-inferring-types-from-data/
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 08:42
Just curious, what's the limit people like to use on the number of rows to scan when inferring types: https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/infer.py#L57-L58
roll
@roll
Jul 20 2017 09:00
@georgeslabreche_twitter I use a configurable limit with a default of 100 rows. TBH I can't advise looking too closely at any existing infer implementation. They're good at type guessing, but the overall data flow could really be improved (at least in the JavaScript one). A small row limit is used for now because infer does a full scan, but it would be much more effective to stop scanning a field once you're confident enough about its type. That would allow increasing the overall limit (for controversial fields) while still being much faster. I really don't like premature optimization (especially at the stage the implementations are at now), but infer.py has already proven to be a bottleneck, e.g. for goodtables-py. So a fresh look at this problem in new implementations would be great.
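A rough sketch of that early-stopping idea, purely for illustration (the guessType helper, thresholds, and names here are hypothetical, not taken from infer.py or tableschema-js):

    // Keep scanning rows only for columns whose type guess is not yet
    // confident, instead of doing a full scan up to a fixed row limit.
    const MAX_ROWS = 1000        // overall cap, can be higher than today's default
    const CONFIDENT_AFTER = 100  // rows of agreement before a column stops being checked

    function inferTypes(headers, rows, guessType) {
      // One candidate state per column: current guess and how many rows agreed.
      const state = headers.map(() => ({type: null, agreed: 0, settled: false}))

      for (let r = 0; r < rows.length && r < MAX_ROWS; r++) {
        let unsettled = 0
        rows[r].forEach((value, i) => {
          const field = state[i]
          if (field.settled) return  // skip columns we are already confident about
          const guess = guessType(value)  // e.g. 'integer', 'number', 'string'
          if (guess === field.type) {
            field.agreed += 1
            if (field.agreed >= CONFIDENT_AFTER) field.settled = true
          } else {
            // Conflicting evidence: take the new guess and restart the count
            // (a real implementation would widen to a compatible type instead).
            field.type = guess
            field.agreed = 1
          }
          if (!field.settled) unsettled += 1
        })
        if (unsettled === 0) break  // every column is settled, stop early
      }

      return headers.map((name, i) => ({name, type: state[i].type || 'string'}))
    }

The point being that a column whose guess has stabilised stops costing anything, so the overall row cap can be raised for the controversial columns without paying for a full scan of every column.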
Georges L J Labrèche
@georgeslabreche_twitter
Jul 20 2017 09:04
Yes, the implementation of infer is really bothering me at the moment.
Groovy's own implementation of type inference is not that great.
Rufus Pollock
@rufuspollock
Jul 20 2017 12:19

@georgeslabreche_twitter I use a configurable limit with a default of 100 rows. […]

@roll this is really great info - the thoughts on the algorithm are super useful.