Zane Selvans
@zaneselvans
@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.
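For illustration, a tabular data package descriptor with that kind of cross-resource reference might look roughly like this (just a sketch - the resource and field names are made up):

# Two tabular resources, where 'scans' points back at 'subjects' via a
# primary/foreign key pair (all names here are hypothetical).
descriptor = {
    "name": "example-package",
    "resources": [
        {
            "name": "subjects",
            "path": "subjects.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "subject_id", "type": "string"},
                    {"name": "age", "type": "integer"},
                ],
                "primaryKey": "subject_id",
            },
        },
        {
            "name": "scans",
            "path": "scans.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "scan_id", "type": "string"},
                    {"name": "subject_id", "type": "string"},
                ],
                "foreignKeys": [
                    {
                        "fields": "subject_id",
                        "reference": {"resource": "subjects", "fields": "subject_id"},
                    }
                ],
            },
        },
    ],
}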
Zvi Baratz
@ZviBaratz
@zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.
Zane Selvans
@zaneselvans
@ZviBaratz Hmm, I dunno about that. If they're spatial and/or time slices, is there some standard way to bundle them up into larger files that contain more dimensions of the data, and use those? I don't think Data Packages are really meant for dealing with thousands or tens of thousands of individual files.
Zvi Baratz
@ZviBaratz
There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that the querying, extraction, and piping of data is facilitated.
(References to the DICOM files, which are the raw data used in many neuroimaging labs)
Rufus Pollock
@rufuspollock

@ZviBaratz Ultimately I believe the data is being stored on AWS in some S3 storage buckets. A single data package can contain many different resources. If it is a tabular data package, containing tabular data resources, those resources can contain references to each other (primary/foreign key relationships as in a database). If you're using a basic data package (not the more specialized tabular data package) which can contain arbitrary kinds of data resources, I guess one of those resources could itself be a data package. But I don't think this is the intended arrangement.

That’s right @zaneselvans :smile:

@zaneselvans If I have many tens of thousands of files, wouldn't that become a problem to manage using JSON? I am working with MRI data, which in its raw format is delivered as DICOM files. Each subject may have a few thousand of those created in a full scanning session.

@ZviBaratz that should be fine. One question: are all these files different resources in one package or ...

There is the NIfTI format, which is often used to share MRI data (there's a whole specification just for that called BIDS). The problem is that I am looking for a way to not only publish the data but also maintain a database of references, so that the querying, extraction, and piping of data is facilitated.

This is a really interesting use case - please tell us more. When you say a database of references, what exactly do you mean?

Zvi Baratz
@ZviBaratz
@rufuspollock I mean a database that keeps references to the actual files. Some fields from the header that might be relevant to data aggregation in analyses etc. can also be saved for easy querying, but once you want the data itself (pixel data or a header field that's not included in the database schema), you have to read it from the DICOM file.
Zvi Baratz
@ZviBaratz
I am in touch with Paul Walsh about examining this use case further and hopefully finding a way to integrate Frictionless Data ideas/projects into our data flow. In the meantime I am trying to get a better grasp of the possibilities.
Zvi Baratz
@ZviBaratz
Currently I'm implementing the reference database as a Django project backed by Postgres, so that the model classes include methods for easy aggregation and extraction (e.g. get all the scans that make up a single DICOM series as a 3D numpy array).
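Something roughly along these lines (a heavily simplified sketch - the model and field names are just for illustration):

import numpy as np
import pydicom
from django.db import models

class Scan(models.Model):
    # One row per DICOM instance: a reference to the file on disk plus a few
    # header fields copied out for easy querying.
    file_path = models.CharField(max_length=500)
    series_uid = models.CharField(max_length=64, db_index=True)
    instance_number = models.IntegerField()

class Series(models.Model):
    series_uid = models.CharField(max_length=64, unique=True)

    def to_volume(self):
        # Read each instance of this series back from its DICOM file and
        # stack the slices into a single 3D numpy array.
        scans = (Scan.objects
                 .filter(series_uid=self.series_uid)
                 .order_by('instance_number'))
        return np.stack([pydicom.dcmread(s.file_path).pixel_array for s in scans])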
Rufus Pollock
@rufuspollock
@Branko-Dj @adyork could we get a graph onto https://datahub.io/cryptocurrency/bitcoin asap?
Branko
@Branko-Dj
@rufuspollock I will create a graph for it
Rufus Pollock
@rufuspollock

@rufuspollock I mean a database that keeps references to the actual files. Some fields from the header that might be relevant to data aggregation in analyses etc. can also be saved for easy querying, but once you want the data itself (pixel data or a header field that's not included in the database schema), you have to read it from the DICOM file.

Sorry for the slow reply @ZviBaratz!

To answer your question: I think you could a) pull out the metadata and save it into datapackage.json, and b) if I'm understanding correctly, you want to do specific post-processing on the data, e.g. to generate all of the info for a particular scan. I'd do that with a separate workflow after storing the basic packages.
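A rough sketch of (a), assuming a library like pydicom for reading the headers (the paths and the choice of fields are purely illustrative):

import json
from pathlib import Path

import pydicom

resources = []
for path in sorted(Path('dicom_root').rglob('*.dcm')):
    # Read only the header, skipping the pixel data.
    ds = pydicom.dcmread(str(path), stop_before_pixels=True)
    resources.append({
        'name': path.stem.lower(),
        'path': str(path),
        'format': 'dcm',
        'mediatype': 'application/dicom',
        # A few header fields promoted into the descriptor for easy querying.
        'series_uid': str(ds.SeriesInstanceUID),
        'modality': str(ds.Modality),
    })

with open('datapackage.json', 'w') as f:
    json.dump({'name': 'mri-scans', 'resources': resources}, f, indent=2)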

JavaScriptFamily
@JavaScriptFamily
Hi all,
Can you please help me? I have created a Frictionless data package using the Data Package library for PHP.
I want to validate it.
Is there any online platform to validate my package?
Thanks
Anuar Ustayev
@anuveyatsu
@JavaScriptFamily Hi :wave: please, have a look at this page - https://datahub.io/tools/validate
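If you'd also like to validate locally, the Python datapackage library can do a similar check (a sketch - I'm not sure what the PHP library offers here):

from datapackage import Package

# Load the descriptor and report whether it validates against the spec.
package = Package('datapackage.json')
print('valid!' if package.valid else package.errors)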
Branko
@Branko-Dj
@rufuspollock the graph for bitcoin is now available
Raja Sahe
@rajasahe
I need an Indian cities house prices dataset. How can I get that? Any ideas?
Rufus Pollock
@rufuspollock
@rajasahe please email us at support@datahub.io
Stephen Abbott Pugh
@StephenAbbott
@rufuspollock Sincere apologies. I failed to follow up and flag the issue I raised with you here on September 12th. I have now raised an issue on datahub-qa as you suggested datahq/datahub-qa#245
Amber York
@adyork

@akariv, I have been looking through the dataflows tutorials that use custom functions and nested flows, trying to figure out how I can use custom functions for one resource when there are many.

For example, I have a row processor:

def mycustomfcn(row):

What I want to do is this in a flow specifying one resource:
mycustomfcn(resources='mclane_log') and be able to specify whether it is a package, row, or rows processor somehow.

Any tips?

I can also do stuff directly in the flow like this, but again, I can't figure out how to specify one resource if there are many.

lambda row: dict(row, val=row['val']/5),

Irakli Mchedlishvili
@zelima
@adyork how about
def mycustomfcn(package):
    yield package.pkg
    resources = iter(package)

    for resource in resources:
        if resource.res.name == 'my-resource-name':
            # do stuff here, e.g.:
            yield filter(lambda row: row['x'] in [1, 2, 3, 4, 5], resource)
        else:
            # pass the other resources through unchanged
            yield resource
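Plugged into a flow it would look roughly like this (just a sketch - the file paths and the second resource are made up):

from dataflows import Flow, load, dump_to_path

Flow(
    load('data/my-resource-name.csv', name='my-resource-name'),
    load('data/other.csv', name='other'),
    mycustomfcn,          # the package processor above, filtering one resource
    dump_to_path('out'),
).process()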
Amber York
@adyork
Thanks @zelima! I will try that.
Amber York
@adyork

Found a bit simpler way to get one resource:

resource = package.pkg.get_resource('seagrass')

Irakli Mchedlishvili
@zelima
:+1:
Rakesh Kumar Devalapally
@devalapa_gitlab
Hi, I am working on the temperatures dataset and I see entries like these:
France
France(Europe)
Can anyone explain the difference between the two?
Irakli Mchedlishvili
@zelima
@devalapa_gitlab can you please paste the link to the dataset?
Rakesh Kumar Devalapally
@devalapa_gitlab
Irakli Mchedlishvili
@zelima
@devalapa_gitlab the data is coming from https://data.giss.nasa.gov/gistemp/ - I believe you will find the answer there.
Rakesh Kumar Devalapally
@devalapa_gitlab
thank you
@zelima I will check it out
Irakli Mchedlishvili
@zelima
:+1:
Shrif Rai
@joyryder
hello brothers
fabirubiru
@fabirubiru
Hi everyone
I'm new to this and I want to learn about DataHub. Could someone help me or share any documentation about it?
Anuar Ustayev
@anuveyatsu
@joyryder Hi there!
@fabirubiru Hi! Sure, you can start here - http://datahub.io/docs
David Cottrell
@david-cottrell_gitlab
Is there a way to delete a data package? I ended up pushing a package called "datapackage", renamed it and re-pushed, so now I have two. I have searched a lot but do not yet see how to delete one.
Stephen Abbott Pugh
@StephenAbbott
Hi. I've been trying to install version 0.4.5 of the Data publishing app for MacOS but keep getting an error message. Is there a different version I should try? My laptop OS is MacOS High Sierra (version 10.13.6)
Rufus Pollock
@rufuspollock

@david-cottrell_gitlab

Is there a way to delete a data package? I ended up pushing a package called "datapackage", renamed it and re-pushed, so now I have two. I have searched a lot but do not yet see how to delete one.

You can make it unpublished for the moment so no one can see it - we are working on a purge-type command, but for now making it unpublished is the way to go ...

Hi. I've been trying to install version 0.4.5 of the Data publishing app for MacOS but keep getting an error message. Is there a different version I should try? My laptop OS is MacOS High Sierra (version 10.13.6)

Can you give a bit more detail on the error message - and we can check that build :smile:

Stephen Abbott Pugh
@StephenAbbott

Can you give a bit more detail on the error message - and we can check that build :smile:

I've downloaded version 0.4.5. When I open the application, it says 'Please wait, we are installing the CLI tool on this machine'. The install reaches 100% and then I get asked to update permissions on the downloaded CLI. I grant these permissions and then see an error message which just says 'Something went wrong while CLI tool update. We will try again automatically in 1 minute'. I've tried installing this version a few times now.

Rufus Pollock
@rufuspollock
@StephenAbbott ok - can you open an issue on GitHub and we'll take a look. In the meantime, do you want to try installing the CLI tool directly?