José Ferraz Neto
@netoferraz
Hello! Is there a way to use data over a company proxy?
lapidus
@lapidus
Hi! It would be appreciated if anyone could shed some more light on Data Packages and DBMSs.
We have hundreds of data packages with different schemas and want to make them accessible behind one API with at least basic filtering across the available dimensions (for production web application use). Are there some ready-made solutions for this? (Does, for example, https://datahub.io/data-factory help with parts of this, or are there other solutions to automatically ingest into Postgres, MongoDB, etc.?)
Anuar Ustayev
@anuveyatsu

Hi @lapidus great to hear from you!

I believe Data Factory could be a good fit for what you’re trying to accomplish. @akariv, could you please advise?

lapidus
@lapidus
This message was deleted
Thank you @anuveyatsu! @akariv: any recommendations would be very appreciated, here or via DM.
sebaswm
@sebaswm
@johanricher Wow, openfacts.or. is very interesting, thank you very much. Would it be possible to speak with the team?
Adam Kariv
@akariv
@lapidus hey, so - you can use dataflows to easily load data packages into a relational database (Postgres and the like). Just create a flow with the load and dump_to_sql processors.
For a good API, I recommend using babbage, which provides facts and aggregation endpoints over such DBs.
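
A minimal sketch of such a flow (the descriptor path, resource/table mapping, and connection string below are placeholders, not values from this conversation):

from dataflows import Flow, load, dump_to_sql

Flow(
    # load an existing data package by its descriptor
    load('datapackage.json'),
    # write each resource into the database; here one resource
    # ('my-resource', hypothetical) is mapped to the table 'my_table'
    dump_to_sql(
        {'my_table': {'resource-name': 'my-resource'}},
        engine='postgresql://user:password@localhost:5432/mydb',
    ),
).process()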
Zane Selvans
@zaneselvans
@akariv So is the idea with dump_to_sql that you can automatically reconstitute a relational database with all of the same constraints and types and inter-table relationships that are specified within a data package? Is there any mechanism for specifying relationships between tables that are originating from resources in different data packages?
Adam Kariv
@akariv
@zaneselvans data packages have the concept of foreign keys, which allows defining these sorts of inter-resource relationships. dump_to_sql will load all data tables into the DB for you, but you might need to add some of these constraints yourself (I'll check the code once I'm in front of my PC)
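
For reference, a foreign key is declared in a resource's Table Schema roughly like this (the field and resource names below are hypothetical); dump_to_sql loads the rows, but a constraint like this may still need to be added to the database by hand:

# hypothetical Table Schema for a 'station_readings' resource that
# references a separate 'stations' resource in the same data package
station_readings_schema = {
    "fields": [
        {"name": "station_id", "type": "integer"},
        {"name": "reading", "type": "number"},
    ],
    "foreignKeys": [
        {
            "fields": "station_id",
            "reference": {"resource": "stations", "fields": "id"},
        }
    ],
}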
lapidus
@lapidus
Thank you @akariv!
Jhon F.
@jhofra11_twitter
I am a beginner with DataHub. Could you explain to me how I can change the name of a dataset? Also, why does the platform only show 2 datasets?
Anuar Ustayev
@anuveyatsu
@jhofra11_twitter Hi! You simply change the name property in your datapackage.json and push it again. Note that you’ll still have a dataset with the old name, but you can make it unlisted.
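
A minimal sketch of that edit in Python (the new name below is hypothetical); after saving, the package is pushed again as before:

import json

# read the existing descriptor
with open('datapackage.json') as f:
    descriptor = json.load(f)

# change the dataset name (hypothetical value), then push the package again
descriptor['name'] = 'my-new-dataset-name'

with open('datapackage.json', 'w') as f:
    json.dump(descriptor, f, indent=2)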
Jhon F.
@jhofra11_twitter
Does the platform only show 2 datasets on the Dashboard? How can I show the other datasets?
Anuar Ustayev
@anuveyatsu

@jhofra11_twitter the best way to get an overview of a publisher's datasets is to go to the publisher page. Use your username on DataHub: https://datahub.io/username. On that page you can see all published datasets (and only the publisher can see unlisted/private datasets).

The number of datasets you can publish is unlimited, but the total size of your datasets is limited (e.g., 5GB for the basic plan).

Jhon F.
@jhofra11_twitter
@anuveyatsu I understand that. My problem: I published 10 datasets, but when I check my dashboard, it only shows 2 datasets (the only two datasets that I published when I started with DataHub).
Anuar Ustayev
@anuveyatsu
@jhofra11_twitter could you please share your publisher name with me?
Jhon F.
@jhofra11_twitter
@anuveyatsu the publisher name is https://datahub.io/jhon.herrera
Anuar Ustayev
@anuveyatsu
@jhofra11_twitter I’ll check this and get back to you. Thanks!
Jhon F.
@jhofra11_twitter
@anuveyatsu Have you had time to check my problem?
Anuar Ustayev
@anuveyatsu
@jhofra11_twitter Hi, I can confirm that you definitely have more than 2 datasets, but for some reason your publisher page isn’t showing all of them. We are having a look at our metastore service to identify the problem. Thank you for reporting this and for your patience :+1: I’ll get back to you once it’s resolved.
Irakli Mchedlishvili
@zelima
@jhofra11_twitter Hi, it seems that the metastore service was down for a while for reasons that are still unknown. We are trying to investigate the root cause, but right now it seems to be resolved. Could you please try re-pushing your datasets? They should appear in search once re-pushed.
clamar14
@clamar14
Hi, I have 2 questions. First of all, is it possible to search in DataHub, like in the old version, specifying parameters such as "organization", "tags", "formats", etc.? The second question is whether there is a SPARQL endpoint (like "http://semantic.ckan.net/sparql") to send queries to. Thank you all.
clamar14
@clamar14
Please, could someone answer me? It's very important.
Rufus Pollock
@rufuspollock
@clamar14 no, you can’t search in the old way, but we have full Elasticsearch-type search if you need it. However, we don’t have tags etc. in the same way.
@clamar14 and there is no SPARQL endpoint at present. If you or someone else were interested in creating one, let us know :smile:
@clamar14 and let us know the context of these questions - i.e. what you are trying to do :smile:
Jhon F.
@jhofra11_twitter
@zelima Thanks. I can see my datasets now.
Jhon F.
@jhofra11_twitter
I am working on an open and linked data project, and I have a question: how can I implement linked data in this new version of DataHub? How can I publish the URL? How do I have to upload the RDF files? In a previous post, I read that in this version of DataHub we cannot build endpoints.
Rufus Pollock
@rufuspollock
You can just upload them, but you don’t get a special endpoint.
Jhon F.
@jhofra11_twitter
@rufuspollock If I want to query my datasets, do I have to use Python, for instance?
Rufus Pollock
@rufuspollock
well, if you publish CSV as well, you can get queries ...
Jhon F.
@jhofra11_twitter
@rufuspollock Can I only query CSV using Python? What about RDF files?
Rufus Pollock
@rufuspollock
@jhofra11_twitter there is no formal query API there, but you can get JSON versions, for example.
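
A minimal sketch of querying a published dataset from Python with the datapackage library (the dataset URL below is a placeholder):

from datapackage import Package

# point at the datapackage.json of any published dataset
package = Package('https://datahub.io/username/my-dataset/datapackage.json')

for resource in package.resources:
    if resource.tabular:
        rows = resource.read(keyed=True)  # list of dicts, one per row
        print(resource.name, len(rows))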
Amber York
@adyork

I'm having trouble stopping dataflows from inferring types on load when I am loading a file directly without a data package. Can someone help me out?

From the documentation, the validate option should control that, but it isn't working unless I am misunderstanding it.

  • tried validate=True and False
  • also tried force_strings from tabulator, since the docs also mention tabulator options

Using dataflows Version: 0.0.32

data.csv

cruise,station,date,time,lat,lon,cast,pump_serial_num
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-01
FK160115,4,1/19/2016,20:00,10,204,MP01,12665-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML12371-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 10820-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11000-01
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11515-02
FK160115,4,1/19/2016,20:00,10,204,MP01,ML 11934-02
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-01
FK160115,5,1/20/2016,18:00,8,156,MP02,12665-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 10820-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11515-02
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11000-01
FK160115,5,1/20/2016,18:00,8,156,MP02,ML 11491-02
from dataflows import Flow, load, printer

def flow():
    flow = Flow(
        load('data.csv', format='csv', validate=False, force_strings=True),
        printer(num_rows=1),
    )
    flow.process()

if __name__ == '__main__':
    flow()

getting:

data:
#    cruise          station  date        time               lat         lon  cast        pump_serial_num
     (string)      (integer)  (string)    (string)      (number)    (number)  (string)    (string)
---  ----------  -----------  ----------  ----------  ----------  ----------  ----------  -----------------
1    FK160115              4  1/19/2016   20:00            10         204     MP01        12665-01
2    FK160115              4  1/19/2016   20:00            10         204     MP01        12665-02
...
66   FK160115             14  2/4/2016    13:00            -4.23      142.23  MP14        ML12371-01
Adam Kariv
@akariv

@adyork so: at the moment, load always tries to infer data types.

The parameters affect this as follows:

  • force_strings only applies to Excel files, which sometimes have internal type information (so it is irrelevant for your use case).
  • validate controls whether load will validate and cast the inferred types. This is useful because inference is done on a sample of rows, and sometimes there’s an offending row later on.

I suggest using load with validate=False, and then adding set_type as needed to fix the wrongly inferred types (if any), e.g.:
set_type('station', type='string')

(if a different behaviour is required, please open an issue on github.com/datahq/dataflows)
Amber York
@adyork

Thanks @akariv. validate=False and then using set_type works for this case.

I was thrown off because I guess I didn't quite understand validate=True|False: I was thinking that validate=False, which does not "validate and cast", would return everything as strings.

        load('data.csv', format='csv', validate=False),
        set_type('station', type='string'),
        set_type('date', type='date', format='%m/%d/%Y'),
        set_type('time', type='time', format='%H:%M'),

data:
#    cruise         station  date        time             lat         lon  cast        pump_serial_num
     (string)      (string)  (date)      (time)      (number)    (number)  (string)    (string)
---  ----------  ----------  ----------  --------  ----------  ----------  ----------  -----------------
1    FK160115             4  2016-01-19  20:00:00       10         204     MP01        12665-01
2    FK160115             4  2016-01-19  20:00:00       10         204     MP01        12665-02
Adam Kariv
@akariv
Thanks for the feedback @adyork - I’ll make sure that the README is clearer on this subject.
I’ll also consider parameterizing the inference of data types so it’s not mandatory.
Amber York
@adyork
Yes, that would help a lot, especially when we have cases where we need to preserve the date/time/datetime in a format other than "yyyy-mm-dd" "HH:MM:SS" and have to use "string". In the data.csv example above it wasn't an issue, but sometimes it does infer date/time and we have to go back, cast it to string, and sometimes find/replace back to the original format.
Zvi Baratz
@ZviBaratz
Hello,
Oops, sorry - Hello, I'm curious regarding storage on DataHub. Where is published data saved? How much data can I upload?
Zvi Baratz
@ZviBaratz
Also, is it possible to create nested data packages?
Zane Selvans
@zaneselvans
@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.
Zvi Baratz
@ZviBaratz
@zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.
Zane Selvans
@zaneselvans
@ZviBaratz Hmm, I dunno about that. If they're spatial and/or time slices, is there some standard way to bundle them up into larger files that contain more dimensions of the data, and use those? I don't think Data Packages are really meant for dealing with thousands or tens of thousands of individual files.
Zvi Baratz
@ZviBaratz
There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that the querying, extraction, and piping of data is facilitated.
(References to the DICOM files, which are the raw data used in many neuroimaging labs)
Rufus Pollock
@rufuspollock

@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.

That’s right @zaneselvans :smile: