Regarding the Imagesnippets dataset (https://old.datahub.io/dataset/imagesnippets): What exactly do you mean by "republish on datahub.io"? I cannot find anything to submit dataset metadata - only stuff to upload data. Are we supposed to make an out-of-sync copy of our triples here?
So the new datahub can do “metadata” only - you’d need to create a datapackage.json with empty resources array and push that. If you want you can do that :smile: - or you can push the dataset itself if that is possible (e.g. if it is bulk and reasonably static).
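For illustration, a minimal sketch of what such a metadata-only descriptor could look like when generated from Python. The field values and the helper names (`metadata_only_descriptor`, `write_descriptor`) are assumptions made up for this example, not something datahub prescribes; only the empty `resources` array comes from the message above.

```python
import json
import tempfile
from pathlib import Path


def metadata_only_descriptor(name, title):
    """Build a descriptor that carries metadata but lists no resources."""
    return {
        "name": name,        # assumed identifier for the dataset
        "title": title,      # assumed human-readable title
        "resources": [],     # empty on purpose: metadata only
    }


def write_descriptor(descriptor, directory):
    """Write the descriptor as datapackage.json inside the given directory."""
    path = Path(directory) / "datapackage.json"
    path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return path


# A temporary directory stands in for the dataset folder in this sketch.
descriptor = metadata_only_descriptor("imagesnippets", "Imagesnippets")
written_path = write_descriptor(descriptor, tempfile.mkdtemp())
print(written_path.read_text(encoding="utf-8"))
```

The folder containing the resulting datapackage.json is what would then be pushed.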
@rufuspollock Also I cannot find any link to upload that file. Do I have to install any of your software? Would the dataset be findable by other users of datahub.io after uploading the metadata (e.g. in https://datahub.io/search)?
Hi @michaelbrunnbauer yes, you need to install the data CLI tool to publish datasets - https://datahub.io/download. Once it is published, it will be findable by other users.
https://old.datahub.io/dataset/imagesnippets/datapackage.json You said the resources array has to be empty but in that case I would not be able to provide a single link (to triple dump, SPARQL endpoint, dataset homepage, etc.). Are you sure about it?
You could add links to the remote resources in the resources array - that should work, I think.
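A rough sketch of that idea, assuming each resource entry can simply carry a remote URL in its `path` field. The example.org URLs, the resource names, and the `remote_resource` helper are placeholders invented for this illustration, not the real Imagesnippets links.

```python
import json


def remote_resource(name, url, fmt=None):
    """Describe a resource that lives elsewhere and is only linked, not uploaded."""
    resource = {"name": name, "path": url}
    if fmt is not None:
        resource["format"] = fmt
    return resource


descriptor = {
    "name": "imagesnippets",
    "title": "Imagesnippets",
    "resources": [
        # Placeholder URLs: replace with the real dump and endpoint locations.
        remote_resource("triple-dump", "https://example.org/imagesnippets/dump.nt", "nt"),
        remote_resource("sparql-endpoint", "https://example.org/imagesnippets/sparql"),
    ],
}

print(json.dumps(descriptor, indent=2))
```

Whether datahub accepts every kind of link here (a SPARQL endpoint, for instance) is not confirmed in this thread, so treat it as something to try rather than a guarantee.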
rlinks, e.g., if you want to get this dataset https://datahub.io/core/finance-vix you’d use following URLs:
@cbenz one point of common interest would be how we’re building our data pipelines and what we could learn. We’ve been working a lot on a simple framework called “dataflows” built around tabular data packages and then running those in travis or gitlab runners if small (or in datahub itself as part of the SaaS data factory): https://github.com/datahq/dataflows - https://datahub.io/data-factory
In terms of monitoring, we currently have a monitoring and reporting system for dataflows run as part of datahub itself - but nothing for the travis/gitlab ones ...
I have read dataflows / data-factory blog posts and docs; never used them. We use GitLab CI pipelines with "download" and "convert" jobs.
Example for World Bank: https://git.nomics.world/dbnomics-fetchers/wb-fetcher/-/jobs
We built a dedicated dashboard (fetching data in GitLab API): https://db.nomics.world/dashboard/
DBnomics jobs are Python scripts (https://git.nomics.world/dbnomics-fetchers/wb-fetcher). They don't follow a common abstraction (like an AbstractClass or such), but a common pattern:
The source code of the jobs isn't committed with data.
Sometimes the "convert" jobs use Pandas, sometimes Json-Stat Python module, sometimes lxml directly, xlrd (Excel), or other ways.
We really like the fact that jobs write files and that we keep the history via Git (rather than writing directly in a database for example, in a more traditional approach). But using Git with such amounts of data doesn't come without problems (slow commits, slow pulls, excessive RAM consumption on GitLab server, etc.).
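To make the "download" then "convert" pattern concrete, here is a minimal sketch under stated assumptions: the CSV layout, the file names, the "source-data" directory name and the `fake_fetch` stand-in are all invented for illustration and are not taken from the actual DBnomics fetchers. Both jobs only read and write files, so the output directory can be committed to Git after each run.

```python
import csv
import json
import tempfile
from pathlib import Path


def download(fetch, source_dir):
    """'download' job: store the provider's raw payload as a file, unchanged."""
    source_dir.mkdir(parents=True, exist_ok=True)
    raw_path = source_dir / "raw.csv"
    raw_path.write_text(fetch(), encoding="utf-8")
    return raw_path


def convert(raw_path, target_dir):
    """'convert' job: read the raw file and write a normalized JSON file."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with raw_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    series = [{"period": row["year"], "value": float(row["value"])} for row in rows]
    out_path = target_dir / "series.json"
    # Stable key order and indentation keep Git diffs small between runs.
    out_path.write_text(json.dumps(series, indent=2, sort_keys=True), encoding="utf-8")
    return out_path


def fake_fetch():
    """Hypothetical stand-in for an HTTP call to the data provider."""
    return "year,value\n2017,1.5\n2018,2.0\n"


workdir = Path(tempfile.mkdtemp())
raw = download(fake_fetch, workdir / "source-data")
out = convert(raw, workdir / "json-data")
result = json.loads(out.read_text(encoding="utf-8"))
print(result)
```

Keeping the raw download as its own file means the convert step can be rerun without hitting the provider again.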
We have two other jobs (visible in the dashboard):
master branch) of each "json-data" repository
@cbenz really agree with you about patterns. In my experience of building stuff this is exactly how things start out: you have a download and extract/convert method or script.
All DataFlows is, really, is a way of having that common pattern plus a little bit of standardization and a library of common processors. If you have a moment to look at it, it would be great to have your thoughts: https://github.com/datahq/dataflows
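To illustrate the general idea of "a common pattern plus reusable processors", here is a plain-Python sketch. It deliberately does not use the dataflows library or mirror its API; `run_flow`, `rename_field` and `cast_float` are hypothetical names made up for this example.

```python
def run_flow(rows, *processors):
    """Pipe rows through each processor in order and collect the result."""
    for processor in processors:
        rows = processor(rows)
    return list(rows)


def rename_field(old, new):
    """Reusable processor: rename one field in every row."""
    def processor(rows):
        for row in rows:
            row = dict(row)  # copy so the caller's input rows are not mutated
            row[new] = row.pop(old)
            yield row
    return processor


def cast_float(field):
    """Reusable processor: convert one field from text to float."""
    def processor(rows):
        for row in rows:
            row = dict(row)
            row[field] = float(row[field])
            yield row
    return processor


raw_rows = [{"year": "2017", "val": "1.5"}, {"year": "2018", "val": "2.0"}]
clean_rows = run_flow(raw_rows, rename_field("val", "value"), cast_float("value"))
print(clean_rows)
```

Each source-specific script then shrinks to a list of processors, and the shared ones can live in a common library.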