I'm having trouble getting dataflows to not infer types on load when I'm loading a file directly, without a datapackage. Can someone help me out?
From the documentation, the validate option should control that, but it isn't working, unless I'm misunderstanding.
Using dataflows Version: 0.0.32
cruise,station,date,time,lat,lon,cast,pump_serial_num
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-01
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML12371-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 10820-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11000-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11515-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11934-02
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-01
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 10820-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11515-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11000-01
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11491-02
from dataflows import Flow, load, printer

def flow():
    flow = Flow(
        load('data.csv', format='csv', validate=False, force_strings=True),
        printer(num_rows=1),
    )
    flow.process()

if __name__ == '__main__':
    flow()
data:
 #   cruise      station      date        time        lat         lon         cast        pump_serial_num
     (string)    (integer)    (string)    (string)    (number)    (number)    (string)    (string)
---  ----------  -----------  ----------  ----------  ----------  ----------  ----------  -----------------
  1  FK160115    4            1/19/2016   20:00       10          204         MP01        12665-01
  2  FK160115    4            1/19/2016   20:00       10          204         MP01        12665-02
...
 66  FK160115    14           2/4/2016    13:00       -4.23       142.23      MP14        ML12371-01
At the moment, load always tries to infer datatypes. The parameters affect this as follows:
force_strings only applies to Excel files, which sometimes carry internal type information (so it's irrelevant for your use case).
validate - load will validate and cast the inferred types. This is useful because inference is done on a sample of rows, and sometimes there's an offending row later on.
I suggest using validate=False, and then adding set_type as needed to fix the wrongly inferred types (if any), e.g.:
load('data.csv', format='csv', validate=False),
set_type("station", type="string"),
set_type("date", type="date", format="%m/%d/%Y"),
set_type("time", type="time", format="%H:%M"),

data:
 #   cruise      station     date        time      lat         lon         cast        pump_serial_num
     (string)    (string)    (date)      (time)    (number)    (number)    (string)    (string)
---  ----------  ----------  ----------  --------  ----------  ----------  ----------  -----------------
  1  FK160115    4           2016-01-19  20:00:00  10          204         MP01        12665-01
  2  FK160115    4           2016-01-19  20:00:00  10          204         MP01        12665-02
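For completeness, a minimal runnable sketch of that suggestion (set_type is imported from dataflows; the field names and date/time formats assume the CSV shown above):

from dataflows import Flow, load, set_type, printer

def flow():
    Flow(
        load('data.csv', format='csv', validate=False),
        # inference guesses these wrong, so override them explicitly
        set_type('station', type='string'),
        set_type('date', type='date', format='%m/%d/%Y'),
        set_type('time', type='time', format='%H:%M'),
        printer(num_rows=1),
    ).process()

if __name__ == '__main__':
    flow()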
@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.
That’s right @zaneselvans :smile:
@zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.
@ZviBaratz that should be fine. One question: are all these files different resources in one package or ...
There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that, called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that querying, extraction, and piping of the data are facilitated.
This is a really interesting use case - please tell us more. When you say a database of references what exactly do you mean?
@rufuspollock I mean a database that keeps references to the actual files. Some fields from the header that might be relevant to data aggregation in analyses etc. can also be saved for easy querying, but once you want the data itself (pixel data, or a header field that's not included in the database schema), you have to read it from the DICOM file.
Sorry for the slow reply @ZviBaratz!
To answer your question: I think you could a) pull out metadata and save it into datapackage.json; b) I'm understanding that you want to do specific post-processing on the data, e.g. to generate all of the info for a particular scan. I'd do that with a separate workflow after storing the basic packages.
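A minimal sketch of (a), assuming pydicom for reading headers; the dicom_index helper, the 'scans/' directory, and the chosen header fields are all hypothetical:

from pathlib import Path

import pydicom  # assumption: pydicom is installed for reading DICOM headers
from dataflows import Flow, dump_to_path, update_resource

def dicom_index(root):
    # one row per DICOM file: a file reference plus a few queryable header fields
    for path in Path(root).rglob('*.dcm'):
        ds = pydicom.dcmread(str(path), stop_before_pixels=True)
        yield {
            'path': str(path),
            'patient_id': str(ds.get('PatientID', '')),
            'study_date': str(ds.get('StudyDate', '')),
            'series_description': str(ds.get('SeriesDescription', '')),
        }

Flow(
    dicom_index('scans/'),
    # dataflows names iterable-based resources 'res_1' by default; rename it
    update_resource('res_1', name='dicom_index', path='dicom_index.csv'),
    dump_to_path('dicom-package'),  # writes datapackage.json next to the CSV
).process()

Querying then happens against the small index table, while the heavy pixel data stays in the original files.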
@akariv , I have been looking through the dataflows tutorials that use custom functions and nested flows, trying to figure out how I can use custom functions for one resource when there are many.
For example, I have a row processor:
What I want to do is this in a flow specifying one resource:
mycustomfcn(resources='mclane_log') and be able to specify somehow whether it is a package, row, or rows processor.
I can also do stuff directly in the flow like this, but again, I can't figure out how to specify one resource if there are many.
lambda row: dict(row, val=row['val']/5),
def mycustomfcn(package):
    # pass the datapackage descriptor through first
    yield package.pkg
    for resource in iter(package):
        if resource.name == 'my-resource-name':
            # do stuff here, e.g.:
            yield filter(lambda row: row['x'] in [1, 2, 3, 4, 5], resource)
        else:
            # yield all other resources through unchanged
            yield resource
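To tie it together, such a package processor can be dropped straight into a Flow after several loads, and only the resource whose name matches gets transformed (file names and resource names here are hypothetical):

from dataflows import Flow, load, printer

Flow(
    load('mclane_log.csv', name='my-resource-name'),  # the resource the processor filters
    load('other_data.csv', name='other-data'),        # passed through unchanged
    mycustomfcn,
    printer(),
).process()

As for choosing the processor type: per the dataflows tutorial, the library decides how to call a custom step by its argument name - a function taking row is called once per row, rows gets an iterator over one resource's rows, and package (as above) gets the whole package.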