These are chat archives for frictionlessdata/chat

31st
Jul 2017
Daniel Fireman
@danielfireman
Jul 31 2017 13:07
Thanks @OriHoch .. the algorithm is similar to the one implemented in JS and Python, right? I mean, it tries to cast to the available types (code in FieldsFactory) and each type has a popularity score. At the end, the type with the best score is returned.
Ori Hoch
@OriHoch
Jul 31 2017 13:07
yes
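The popularity-score idea confirmed above can be illustrated with a minimal sketch. It is not the code from any of the tableschema libraries: `guess_type` and `infer_by_popularity` are hypothetical names, and only three types are considered, which is an assumption made to keep the example short.

```python
from collections import Counter


def guess_type(value):
    """Return the narrowest type name a single cell can be cast to."""
    for name, cast in (("integer", int), ("number", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    # Anything that is neither an integer nor a float is kept as a string.
    return "string"


def infer_by_popularity(column):
    """Pick the type that was guessed most often across the cells."""
    counts = Counter(guess_type(cell) for cell in column)
    if not counts:
        return "string"  # assumption: an empty column falls back to string
    return counts.most_common(1)[0][0]


column = ["1.5", "2.25", "n/a", "4.0"]
inferred = infer_by_popularity(column)
```

Here three of the four cells are guessed as "number", so that type wins even though one cell cannot be cast to it.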
Daniel Fireman
@danielfireman
Jul 31 2017 13:14
In frictionlessdata/tableschema-go#18 I mention an alternative solution (code), which has two main pros: it is asymptotically faster (benchmark results) and always returns field types that the whole column could be cast to. The main disadvantage is that it could end up inferring fields that are too generic (i.e. strings)
Paul Walsh
@pwalsh
Jul 31 2017 13:17
@danielfireman I definitely think the 2nd pro is actually a con. it means, to take an extreme example, a 20,000 row CSV, with a column where 19,999 cells are decimals, and 1 cell is a string, would be inferred as string type.
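The extreme case above can be reproduced with a small sketch of a "whole column must cast" strategy. This is only an illustration under stated assumptions, not the code linked from the issue: `can_cast`, `CANDIDATES` and `infer_strict` are hypothetical names, and the candidate list is limited to three types.

```python
def can_cast(cast, value):
    """Report whether `cast` accepts `value` without raising ValueError."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


# Ordered from most specific to most generic; str() accepts any cell.
CANDIDATES = (("integer", int), ("number", float), ("string", str))


def infer_strict(column):
    """Return the first candidate type that every cell can be cast to."""
    for name, cast in CANDIDATES:
        if all(can_cast(cast, cell) for cell in column):
            return name
    return "string"


# 19,999 decimal cells and a single stray string, as in the example above.
column = ["3.14"] * 19_999 + ["oops"]
strict = infer_strict(column)
```

With this strategy the single non-numeric cell is enough to push the whole column to "string", which is exactly the behaviour being debated.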
Daniel Fireman
@danielfireman
Jul 31 2017 13:17
this is the result of my thoughts on the algorithm, @rufuspollock .. I am totally keen to discuss, if needed.
@pwalsh got your point. I believe inferring a table and starting to use the inferred schema right away for manipulating (casting) values is a pro. Inferring the column as "number" will error out if I try to use it.
In any case, the user is not going to be able to use the schema right away. In one case, the returned value will be "number" and there is a string; in the other case, the inferred type will be "string" but the user will try to cast to "number" and will also get an error
Paul Walsh
@pwalsh
Jul 31 2017 13:23
correct. but, in the extreme case I gave, an error is good because the user likely has a broken CSV, and now she can know so
Daniel Fireman
@danielfireman
Jul 31 2017 13:25
Either way she would know, @pwalsh
When she tries to do "number" manipulations on an inferred string, or when she tries to do "number" manipulations on an inferred "number" which is actually a string
Ori Hoch
@OriHoch
Jul 31 2017 13:34
@pwalsh @danielfireman what's the use-case for inferring 20,000 rows from a CSV?
From my experience working with data sets a simple infer of number vs. string from a sample of the first few rows is enough
auto-inferring is just a first step before doing manual validation and adjustments of the schema
I guess I'm coming from the more technical users of datapackages..
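Inferring number vs. string from only the first few rows could look roughly like the sketch below. It uses Python's standard `csv` module; `looks_numeric`, `infer_from_sample` and the `sample_size` parameter are hypothetical names chosen for illustration, and the sample data is made up.

```python
import csv
import io
from itertools import islice


def looks_numeric(value):
    """Report whether float() accepts the cell."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_from_sample(csv_text, sample_size=5):
    """Infer "number" or "string" per column from the first rows only."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Read at most `sample_size` data rows; the rest of the file is ignored.
    sample = list(islice(reader, sample_size))
    schema = {}
    for index, name in enumerate(header):
        cells = [row[index] for row in sample]
        numeric = bool(cells) and all(looks_numeric(c) for c in cells)
        schema[name] = "number" if numeric else "string"
    return schema


data = "id,city\n1,Recife\n2,Tel Aviv\n3,London\n"
schema = infer_from_sample(data)
```

The result is only a starting point: rows past the sample are never checked, so the schema would still need manual validation and adjustment.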
Daniel Fireman
@danielfireman
Jul 31 2017 13:38
I totally agree. But even in simple cases, we could have inconsistencies. I believe my major point is that we need to focus on informing the user of those inconsistencies. Whether the inferred type is based on an appearance count or implicit casting only matters if we care about the number of rows to process before inferring. Which circles back to your question, which @pwalsh
could answer. (sorry, hit the enter button too soon)
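Whichever inference strategy is used, the inconsistencies themselves can be reported to the user. A minimal sketch, assuming the inferred type maps to a Python cast function; `find_cast_failures` is a hypothetical helper, not part of any tableschema library.

```python
def find_cast_failures(column, cast):
    """Return (row_number, value) pairs for cells that `cast` rejects."""
    failures = []
    for row_number, value in enumerate(column, start=1):
        try:
            cast(value)
        except ValueError:
            failures.append((row_number, value))
    return failures


column = ["10.5", "11", "unknown", "12.75"]
# Check the column against an inferred "number" type, modelled here by float().
failures = find_cast_failures(column, float)
```

Here `failures` holds the third row with the value "unknown", which tells the user where the data and the inferred type disagree.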