Rufus Pollock
@rufuspollock
@jhofra11_twitter there is no formal query API there, but you can get JSON versions, for example
Amber York
@adyork

I'm having trouble getting dataflows not to infer types on load when I am loading the file directly without a datapackage. Can someone help me out?

From the documentation, the validate option should control that, but it isn't working unless I am misunderstanding it.

  • tried validate=True and False
  • also tried force_strings from tabulator since the doc also mentions tabulator options

Using dataflows Version: 0.0.32

data.csv

cruise,station,date,time,lat,lon,cast,pump_serial_num
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-01
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML12371-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 10820-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11000-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11515-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11934-02
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-01
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 10820-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11515-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11000-01
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11491-02
from dataflows import Flow, load, printer


def flow():
    flow = Flow(
        load('data.csv', format='csv', validate=False, force_strings=True),
        printer(num_rows=1),
    )
    flow.process()


if __name__ == '__main__':
    flow()

getting:

data:
#    cruise          station  date        time               lat         lon  cast        pump_serial_num
     (string)      (integer)  (string)    (string)      (number)    (number)  (string)    (string)
---  ----------  -----------  ----------  ----------  ----------  ----------  ----------  -----------------
1    FK160115              4  1/19/2016   20:00            10         204     MP01        12665-01
2    FK160115              4  1/19/2016   20:00            10         204     MP01        12665-02
...
66   FK160115             14  2/4/2016    13:00            -4.23      142.23  MP14        ML12371-01
Adam Kariv
@akariv

@adyork so:
at the moment load always tries to infer datatypes.

The parameters affect this as follows:

  • force_strings only applies to Excel files, which sometimes have internal type information (so it's irrelevant for your use case).
  • validate controls whether load will validate and cast the inferred types. This is useful because inferring is done on a sample of rows, and sometimes there's an offending row later on.

I suggest using load with validate=False, and then adding set_type as needed to fix the wrongly inferred types (if any), e.g.:
set_type('station', type='string')

(if a different behaviour is required, please open an issue on github.com/datahq/dataflows)
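
A minimal sketch of that pattern, assuming the column names from the data.csv sample above:

from dataflows import Flow, load, set_type, printer

# skip validation/casting on load, then fix any wrongly inferred columns explicitly
Flow(
    load('data.csv', format='csv', validate=False),
    set_type('station', type='string'),
    printer(num_rows=1),
).process()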
Amber York
@adyork

Thanks @akariv. validate=False and then using set_type works for this case.

I was thrown off because I didn't quite understand validate=True|False; I was thinking that validate=False, which does not "validate and cast", would return everything as strings.

        load('data.csv', format='csv', validate=False),
        set_type("station", type="string"),
        set_type("date", type="date", format="%m/%d/%Y"),
        set_type("time", type="time", format="%H:%M"),

data:
#    cruise         station  date        time             lat         lon  cast        pump_serial_num
     (string)      (string)  (date)      (time)      (number)    (number)  (string)    (string)
---  ----------  ----------  ----------  --------  ----------  ----------  ----------  -----------------
1    FK160115             4  2016-01-19  20:00:00       10         204     MP01        12665-01
2    FK160115             4  2016-01-19  20:00:00       10         204     MP01        12665-02
Adam Kariv
@akariv
Thanks for the feedback @adyork - I’ll make sure that the README is clearer on this subject.
I’ll also consider parametrizing the inferring of data types so it’s not mandatory.
Amber York
@adyork
Yes, that would help a lot, especially when we have cases where we need to preserve the date/time/datetime in a format other than "yyyy-mm-dd" "HH:MM:SS" and have to use "string". In the data.csv example above it wasn't an issue, but sometimes it does infer date/time and we have to go back, set it to string, and sometimes find/replace back to the original format.
Zvi Baratz
@ZviBaratz
Hello,
Oops, sorry - Hello, I'm curious regarding storage on DataHub. Where is published data saved? How much data can I upload?
Zvi Baratz
@ZviBaratz
Also, is it possible to create nested data packages?
Zane Selvans
@zaneselvans
@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.
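
A minimal descriptor sketch of that primary/foreign key relationship between tabular resources, written as a Python dict (resource and field names are hypothetical, loosely echoing the data.csv sample earlier):

descriptor = {
    "name": "example-package",
    "resources": [
        {
            "name": "stations",
            "path": "stations.csv",
            "schema": {
                "fields": [{"name": "station", "type": "integer"}],
                "primaryKey": "station",
            },
        },
        {
            "name": "casts",
            "path": "casts.csv",
            "schema": {
                "fields": [
                    {"name": "cast", "type": "string"},
                    {"name": "station", "type": "integer"},
                ],
                # each cast row references a row in the stations resource
                "foreignKeys": [
                    {"fields": "station",
                     "reference": {"resource": "stations", "fields": "station"}},
                ],
            },
        },
    ],
}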
Zvi Baratz
@ZviBaratz
@zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.
Zane Selvans
@zaneselvans
@ZviBaratz Hmm, I dunno about that. If they're spatial and/or time slices, is there some standard way to bundle them up into larger files that contain more dimensions of the data, and use those? I don't think Data Packages are really meant for dealing with thousands or tens of thousands of individual files.
Zvi Baratz
@ZviBaratz
There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that the querying, extraction, and piping of data is facilitated.
(References to the DICOM files, which are the raw data used in many neuroimaging labs.)
Rufus Pollock
@rufuspollock

> @ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.

That’s right @zaneselvans :smile:

> @zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.

@ZviBaratz that should be fine. One question: are all these files different resources in one package or ...

> There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that the querying, extraction, and piping of data is facilitated.

This is a really interesting use case - please tell us more. When you say a database of references what exactly do you mean?

Zvi Baratz
@ZviBaratz
@rufuspollock I mean a database that keeps references to the actual files. Some fields from the header that might be relevant to data aggregation in analyses etc. can also be saved for easy querying, but once you want the data itself (pixel data or a header field that's not included in the database schema), you have to read it from the DICOM file.
Zvi Baratz
@ZviBaratz
I am in touch with Paul Walsh regarding examining this use case further and hopefully finding a way to implement Frictionless Data ideas/projects into our data flow. In the meantime I am trying to get a better grasp of the possibilities.
Zvi Baratz
@ZviBaratz
Currently I'm implementing the reference database as a Django project backed by Postgres, so that the model classes include methods for easy aggregation and extraction (e.g. getting all the scans that make up a single DICOM series as a 3D numpy array).
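
A minimal sketch of what such a reference model might look like (assuming Django, pydicom, and numpy; field and method names are hypothetical):

import numpy as np
import pydicom
from django.db import models


class Series(models.Model):
    series_uid = models.CharField(max_length=64, unique=True)

    def to_volume(self):
        # read every referenced DICOM file and stack the slices into a 3D numpy array
        scans = self.scan_set.order_by("instance_number")
        return np.stack([pydicom.dcmread(scan.file_path).pixel_array for scan in scans])


class Scan(models.Model):
    series = models.ForeignKey(Series, on_delete=models.CASCADE)
    instance_number = models.IntegerField()
    # reference to the DICOM file on disk; pixel data stays out of the database
    file_path = models.CharField(max_length=500)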
Rufus Pollock
@rufuspollock
@Branko-Dj @adyork could we get a graph onto https://datahub.io/cryptocurrency/bitcoin asap?
Branko
@Branko-Dj
@rufuspollock I will create a graph for it
Rufus Pollock
@rufuspollock

> @rufuspollock I mean a database that keeps references to the actual files. Some fields from the header that might be relevant to data aggregation in analyses etc. can also be saved for easy querying, but once you want the data itself (pixel data or a header field that's not included in the database schema), you have to read it from the DICOM file.

Sorry for the slow reply @ZviBaratz!

To answer your question: I think you could a) pull out metadata and save it into datapackage.json; b) I understand that you want to do specific post-processing on the data, e.g. to generate all of the info for a particular scan. I'd do that with a separate workflow after storing the basic packages.
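
A minimal sketch of option a), assuming pydicom for header-only reads (the paths and header fields here are hypothetical):

import json
from pathlib import Path

import pydicom

resources = []
for path in sorted(Path("dicoms").glob("*.dcm")):
    ds = pydicom.dcmread(path, stop_before_pixels=True)  # headers only, skip pixel data
    resources.append({
        "name": path.stem.lower(),
        "path": str(path),
        "format": "dcm",
        # selected header fields kept queryable without reopening the file
        "dicom": {
            "PatientID": str(ds.get("PatientID", "")),
            "StudyDate": str(ds.get("StudyDate", "")),
            "SeriesInstanceUID": str(ds.get("SeriesInstanceUID", "")),
        },
    })

with open("datapackage.json", "w") as f:
    json.dump({"name": "mri-scans", "resources": resources}, f, indent=2)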

JavaScriptFamily
@JavaScriptFamily
Hi All,
Can you please help me? I have created a Frictionless data package using Data Package for PHP.
I want to validate it.
Is there any online platform to validate my package?
Thanks
Anuar Ustayev
@anuveyatsu
@JavaScriptFamily Hi :wave: please, have a look at this page - https://datahub.io/tools/validate
Branko
@Branko-Dj
@rufuspollock the graph for bitcoin is now available
Raja Sahe
@rajasahe
I need an Indian cities house prices dataset. How can I get that? Any ideas?
Rufus Pollock
@rufuspollock
@rajasahe please email us at support@datahub.io
Stephen Abbott Pugh
@StephenAbbott
@rufuspollock Sincere apologies. I failed to follow up and flag the issue I raised with you here on September 12th. I have now raised an issue on datahub-qa as you suggested datahq/datahub-qa#245
Amber York
@adyork

@akariv, I have been looking through the dataflows tutorials that use custom functions and nested flows, trying to figure out how I can use custom functions for one resource when there are many.

For example, I have a row processor:

def mycustomfcn(row):

What I want to do is this in a flow specifying one resource:
mycustomfcn(resources='mclane_log') and be able to specify whether it is a package, row, or rows processor somehow.

Any tips?

I can also do stuff directly in the flow like this but again, I can't figure out how to specify one resource if there are many.

lambda row: dict(row, val=row['val']/5),

Irakli Mchedlishvili
@zelima
@adyork how about
def mycustomfcn(package):
    yield package.pkg
    resources = iter(package)

    for resource in resources:
        if resource.name == 'my-resource-name':
            # do stuff here, e.g. keep only some rows of this resource:
            yield filter(lambda row: row['x'] in [1, 2, 3, 4, 5], resource)
        else:
            # pass the other resources through unchanged
            yield resource
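
For context, a minimal sketch (hypothetical file and resource names) of how a package processor like this slots into a flow:

from dataflows import Flow, load, printer

Flow(
    load('first.csv', name='my-resource-name'),
    load('second.csv', name='other-resource'),
    mycustomfcn,  # the package processor above; only 'my-resource-name' gets filtered
    printer(),
).process()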
Amber York
@adyork
Thanks @zelima! I will try that.
Amber York
@adyork

Found a bit simpler way to get one resource:

resource = package.pkg.get_resource('seagrass')
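
Assuming the datapackage-py API, get_resource returns the resource's metadata object (handy for inspecting its descriptor or schema); the rows themselves still come from iterating the package wrapper, e.g.:

def mycustomfcn(package):
    # look up one resource's metadata by name ('seagrass' as in the line above)
    seagrass = package.pkg.get_resource('seagrass')
    if seagrass is not None:
        print(seagrass.descriptor.get('schema', {}))

    yield package.pkg
    for resource in package:
        yield resource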

Irakli Mchedlishvili
@zelima
:+1:
Rakesh Kumar Devalapally
@devalapa_gitlab
Hi, I am working on the temperatures dataset and I see an entry like this
France
France(Europe)
Can anyone explain the difference between these two?
Irakli Mchedlishvili
@zelima
@devalapa_gitlab can you please paste the link to the dataset?
Rakesh Kumar Devalapally
@devalapa_gitlab
Irakli Mchedlishvili
@zelima
@devalapa_gitlab data is coming from https://data.giss.nasa.gov/gistemp/ I believe you will find answer there
Rakesh Kumar Devalapally
@devalapa_gitlab
thank you
@zelima I will check it out
Irakli Mchedlishvili
@zelima
:+1: