These are chat archives for frictionlessdata/chat

16th
Aug 2016
roll
@roll
Aug 16 2016 08:12

Hi all! Released a new version of Tabulator: v0.5.
Along with many changes and improvements, a user-friendly processing
interface has been introduced:

from pprint import pprint
from tabulator import topen

def pick_best_2015(extended_rows):
    for number, headers, row in extended_rows:
        krow = dict(zip(headers, row))
        if krow['2015'] != 'NA' and int(krow['2015']) > 85:
            yield (number, None, {'Country': krow['Jurisdiction'], 'Result': krow['2015']})

source = 'https://raw.githubusercontent.com/datasets/corruption-perceptions-index/master/data/cpi.csv'
with topen(source, headers='row1', post_parse=[pick_best_2015]) as table:
    print('The least corrupted countries in 2015:')
    pprint(sorted(table, key=lambda krow: krow['Result'], reverse=True))

For this example, the use of a processor could be replaced by direct processing
of the table rows, but once table.write is implemented and this functionality
is added to jsontableschema.Table (a new class under development), it will really shine
thanks to the much-requested flow: open-process-save.
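
For illustration, a minimal sketch of the same filtering done inline, without a post_parse processor. It assumes the table object returned by topen exposes a headers attribute; if the installed tabulator version does not, the headers would have to be captured from the first row instead.

from pprint import pprint
from tabulator import topen

source = 'https://raw.githubusercontent.com/datasets/corruption-perceptions-index/master/data/cpi.csv'

results = []
with topen(source, headers='row1') as table:
    for row in table:
        krow = dict(zip(table.headers, row))  # assumes a `headers` attribute on the table
        if krow['2015'] != 'NA' and int(krow['2015']) > 85:
            results.append({'Country': krow['Jurisdiction'], 'Result': krow['2015']})

print('The least corrupted countries in 2015:')
pprint(sorted(results, key=lambda krow: krow['Result'], reverse=True))

Note that this collects the filtered rows into a list, which is exactly what the processor-based, streaming flow above avoids.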

Daniel Fowler
@danfowler
Aug 16 2016 08:13
:+1:
roll
@roll
Aug 16 2016 08:59
Forgot to attach output of the script:
The least corrupted countries in 2015:
[{'Country': 'Denmark', 'Result': '91'},
 {'Country': 'Finland', 'Result': '90'},
 {'Country': 'Sweden', 'Result': '89'},
 {'Country': 'New Zealand', 'Result': '88'},
 {'Country': 'Netherlands', 'Result': '87'},
 {'Country': 'Norway', 'Result': '87'},
 {'Country': 'Switzerland', 'Result': '86'}]
Daniel Fowler
@danfowler
Aug 16 2016 09:25
Thanks @roll !
Does anyone use this product?
Jaakko Korhonen
@jaakkokorhonen
Aug 16 2016 09:34
@danfowler Power BI is used by a lot of companies and even the public sector running Azure. Export to Power BI might be relevant to achieve more reuse in commercial settings.
Daniel Fowler
@danfowler
Aug 16 2016 09:44
Thanks @jaakkokorhonen
Jaakko Korhonen
@jaakkokorhonen
Aug 16 2016 09:52
So basically we would be talking about a CSV-to-OData converter, I guess.
The quickest way would probably be to import the CSV into MSSQL and publish it from there as OData, and the other way around.
Daniel Fowler
@danfowler
Aug 16 2016 10:44
New topic of discussion for specs: observational error measurements in data: frictionlessdata/specs#281 submitted by @steko
roll
@roll
Aug 16 2016 11:16
And another example now from upcoming jsontableschema.Table:
import sqlalchemy as sa
from pprint import pprint
from jsontableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/okfn/jsontableschema-py/master/data/data_infer.csv'

# Create SQL database
db = sa.create_engine('sqlite://')

# Data processor
def skip_under_30(erows):
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        if krow['age'] >= 30:
            yield (number, headers, row)

# Work with table
table = Table(SOURCE, post_convert=[skip_under_30])
table.schema.save('tmp/persons.json') # Save INFERRED schema
table.save('persons', backend='sql', engine=db) # Save data to SQL
table.save('tmp/persons.csv')  # Save data to DRIVE

# Check the result
pprint(Table('persons', backend='sql', engine=db).read(keyed=True))
pprint(Table('tmp/persons.csv').read(keyed=True))
# Will print (twice)
# [{'age': 39, 'id': 1, 'name': 'Paul'},
#  {'age': 36, 'id': 3, 'name': 'Jane'}]
Everything goes through streams so we don't load the data into memory - using this approach, really big files can be opened, processed, and saved. And I hope with a really nice API) Looking forward to the jsontableschema-py release with this functionality!
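
To make the streaming point concrete, here is a minimal sketch in plain Python (no library assumptions, hypothetical file path) of why chained generator processors keep memory flat: each row is read, filtered, and emitted one at a time rather than collected into a list.

import csv

def read_rows(path):
    # Lazily yield (number, headers, row) tuples from a CSV file
    with open(path) as f:
        reader = csv.reader(f)
        headers = next(reader)
        for number, row in enumerate(reader, start=1):
            yield (number, headers, row)

def skip_under_30(erows):
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        if int(krow['age']) >= 30:
            yield (number, headers, row)

# Nothing is read until the consumer iterates, so arbitrarily large files
# pass through in constant memory.
for number, headers, row in skip_under_30(read_rows('tmp/persons.csv')):
    print(row)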
sirex
@sirex
Aug 16 2016 11:55
@roll can I use generator directly, instead of post_convert callback? Related: martinblech/xmltodict#88
roll
@roll
Aug 16 2016 11:56
@sirex you mean the table instance? of course
but these processors are not callbacks
they are generators with full flow-control power
sirex
@sirex
Aug 16 2016 11:58
I will try to rewrite your example to show what I mean.
roll
@roll
Aug 16 2016 11:59
great, thanks - here is the API: frictionlessdata/jsontableschema-py#90
you could also just operate in the normal flow using table.iter
sirex
@sirex
Aug 16 2016 12:03
from jsontableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/okfn/jsontableschema-py/master/data/data_infer.csv'

# Read -> Filter -> Export
for row in Table(SOURCE, target=['tmp/persons.csv', 'sqlite://']):
    if row['age'] >= 30:
        row.save()
@roll ^
roll
@roll
Aug 16 2016 12:05
ah, got it - no, it's not possible with this class
sirex
@sirex
Aug 16 2016 12:05
ok.
roll
@roll
Aug 16 2016 12:06
But what problems are you expecting with the injected processors?
sirex
@sirex
Aug 16 2016 12:07
No problems, it just does not look Pythonic (but this is my personal opinion).
roll
@roll
Aug 16 2016 12:08
Because of the power of Python generators, they allow skipping an iteration, stopping the iteration, raising/handling exceptions, even collecting/processing and re-emitting data, etc.
I suppose with @danfowler we will write some tutorials covering it)
The problem with things like row.save is that every row has to be wrapped in a whole Python class - so it could be really inefficient on big data. So rows here are just native Python objects
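
As an illustration of that flow-control point, a minimal plain-Python sketch (hypothetical thresholds and row shapes, not a jsontableschema API) of the kinds of control a generator processor has over the stream:

def example_processor(extended_rows):
    buffered = []
    for number, headers, row in extended_rows:
        if not row:
            continue                                 # skip an iteration (drop empty rows)
        if number > 10000:
            return                                   # stop the iteration early
        if len(row) != len(headers):
            raise ValueError('bad row %s' % number)  # raise an exception to the consumer
        buffered.append(row)                         # collect...
        if len(buffered) == 2:
            merged = [' / '.join(pair) for pair in zip(*buffered)]
            buffered = []
            yield (number, headers, merged)          # ...process and re-emit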
sirex
@sirex
Aug 16 2016 12:12
row.save is very close to ORMs; everyone knows how to use them, and row could be just a small wrapper around a named tuple, which is more efficient than a dict.
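
A tiny plain-Python illustration of that efficiency point (not tied to any jsontableschema API): a namedtuple row stores only the values, with the field names held once on the class, so it is smaller than a per-row dict while still allowing named access.

import sys
from collections import namedtuple

Row = namedtuple('Row', ['id', 'age', 'name'])

dict_row = {'id': 1, 'age': 39, 'name': 'Paul'}
tuple_row = Row(1, 39, 'Paul')

print(sys.getsizeof(tuple_row) < sys.getsizeof(dict_row))  # True: the namedtuple is smaller
print(tuple_row.age, dict_row['age'])                      # both allow access by field name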
roll
@roll
Aug 16 2016 12:13
So anyway it could be used like this:
from jsontableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/okfn/jsontableschema-py/master/data/data_infer.csv'

# Read -> Filter -> Export
for krow in Table(SOURCE).iter(keyed=True):
    if krow['age'] >= 30:
        pass  # save to sqlite here (but again, saving row by row to SQL is not efficient)
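To flesh out that inline comment, a hedged sketch, assuming Table.iter(keyed=True) behaves as in the snippet above and using the column names from the expected output earlier: the filtered rows are collected and written to SQLite in a single executemany call instead of one INSERT per row.

import sqlite3
from jsontableschema import Table

SOURCE = 'https://raw.githubusercontent.com/okfn/jsontableschema-py/master/data/data_infer.csv'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE persons (id INTEGER, age INTEGER, name TEXT)')

# Filter while streaming, but buffer the (small) result for a single batched insert
batch = []
for krow in Table(SOURCE).iter(keyed=True):
    if krow['age'] >= 30:
        batch.append((krow['id'], krow['age'], krow['name']))

conn.executemany('INSERT INTO persons VALUES (?, ?, ?)', batch)
conn.commit()
print(conn.execute('SELECT * FROM persons').fetchall())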
@pwalsh has recently started some ORM work for JTS - https://github.com/frictionlessdata/jsontableschema-models-js
sirex
@sirex
Aug 16 2016 12:14
Yes, saving row by row is really slow, that's why row.save should save many items at once, but then my example should be changed a bit:
# Read -> Filter -> Export
with Table(source, target=['tmp/persons.csv', 'sqlite://']) as rows:
    for row in rows:
        if row['age'] >= 30:
            row.save()
But this example really only covers a single use case: read - process/filter - write.
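For what it's worth, a minimal plain-Python sketch of that batched-save idea (hypothetical classes and toy data, not the jsontableschema API under discussion): a context-managed target buffers save() calls and flushes them in batches, with a final flush on exit.

import csv

class BatchingTarget:
    # Hypothetical helper: buffers saved rows and writes them in batches
    def __init__(self, path, batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self._pending = []

    def __enter__(self):
        self._file = open(self.path, 'w', newline='')
        self._writer = csv.writer(self._file)
        return self

    def save(self, row):
        self._pending.append(row)
        if len(self._pending) >= self.batch_size:
            self._flush()

    def _flush(self):
        self._writer.writerows(self._pending)
        self._pending = []

    def __exit__(self, exc_type, exc, tb):
        self._flush()  # write whatever is left
        self._file.close()

# Read -> Filter -> Export (rows here are toy tuples of id, age, name)
with BatchingTarget('tmp/persons.csv') as target:
    for row in [(1, 39, 'Paul'), (2, 23, 'Jess'), (3, 36, 'Jane')]:
        if row[1] >= 30:
            target.save(row)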
roll
@roll
Aug 16 2016 12:28
@sirex but looking at my initial example, you could just move this logic into a processor and let the lib handle all the saving
I think it's a pretty common case for people in data mining/analysis etc. with a functional programming background. ORM is more for programmers
But anyway, both ways are available
sirex
@sirex
Aug 16 2016 12:32
I am more of a programmer than a data miner, so probably that is why I see it that way. Also I like to use a lazy evaluation strategy, where you can pass a generator through a function composition to get your data processed.
I would like to see how this would look from the datapackage API?
As I understand it, jsontableschema should be considered a lower-level internal library and datapackage the higher-level public API?
roll
@roll
Aug 16 2016 14:44

@sirex I can't say jts is an internal library. It has value of its own in our stack, but yes, it will be wrapped by dp later. Right now the focus is more on data validation, so the next big thing for the project will be the goodtables update. It's mostly not about data packages but about general tabular data validation.

And also as a bonus (I think you've noticed the storage usage here):

from jsontableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/okfn/jsontableschema-py/master/data/data_infer.csv'

# Data processor
def prepare_data_for_pandas(erows):
    pass  # ...

# Export to pandas
table = Table(SOURCE, post_convert=[prepare_data_for_pandas])  # post_parse/post_convert - naming not final
storage = table.save('persons', backend='pandas')
print(storage['persons'])
# Will print (e.g.)
# id age name
# 1  39  Paul
# 3  36  Jane
Daniel Fowler
@danfowler
Aug 16 2016 15:18
@/all JSON Table Schema spec updated to support missingValues: http://specs.frictionlessdata.io/json-table-schema/
roll
@roll
Aug 16 2016 15:26
:+1: from one of my previous examples - krow['2015'] != 'NA' - the case for missingValues)
sirex
@sirex
Aug 16 2016 19:28
Regarding https://github.com/frictionlessdata/jsontableschema-pandas-py/pulls - I currently don't have enough time to properly review the jts-pandas pull requests, because during summer most of my weekends are planned for something else, so if someone is willing to review and merge the waiting pull requests, you are welcome to do so. Sooner or later I will review the pull requests myself, but it can take time (some weeks, probably). I'm hoping that during winter I will be more active.
roll
@roll
Aug 16 2016 20:33
@sirex ok thanks I'll take a look
about missing values - sorry, technically it's the missingValue property, singular. Link to the section - http://specs.frictionlessdata.io/json-table-schema/#missing-values