    iainmwallace
    @iainmwallace
    This is a very exciting package! It will enable FAIR data sets in an organization. Do you have a roadmap that you could share? I am wondering specifically about versioning, storing additional metadata with a resource, and how best to administer a board on RStudio Connect.
    I just asked this as an issue.
    Javier Luraschi
    @javierluraschi
    Hi @iainmwallace, great to hear from you here! Versioning is currently only partially supported: pins does not cache multiple versions, but the different backends might. For instance, RStudio Connect always stores multiple versions, so it ought to be possible to find the correct URL from a previous version and then retrieve it with pins. That said, it is likely the first CRAN release will not help much with versioning. Hadley asked about this as well and we have a work item tracking this: rstudio/pins#6
    Regarding additional metadata, this is actually possible. There are probably many ways one would want to store the data you care about (and to read it back), so you can customize exactly how data is stored and how it is retrieved. We just wrote an article describing this here: https://rstudio.github.io/pins/articles/pins-extending.html
    iainmwallace
    @iainmwallace
    Brilliant, thanks. The other thought I had: would it be possible for table resources to be optionally stored in a database, so that queries can be run via dbplyr against the pinned resource? E.g., I pin a very large CSV file to Connect, and this is silently loaded into a new Athena/BigQuery table to enable quick retrieval of a subset of the table.
    Javier Luraschi
    @javierluraschi
    Databases are interesting. In fact, that was one of our first motivations to start this work; you can take a look at an early version of the README that supports databases: https://github.com/rstudio/pins/blob/83477850ba30e82660ca6cf273aa6f4feb308a72/README.md#databases. The challenge is that using DBs is a really advanced use case and it makes pins harder to understand for its first release. So the answer is that it should be possible with a bit of work. One option is to have a database board (without necessarily using RStudio Connect) which you would be able to pin datasets to. Then you could create a Database + RSC board, which all it does is create a pin in a Database board + the RSC board, or you could store the data in the database and only store a reference to the database in the RSC board, etc. However, I think we need to first implement the Database board to make your life easier. Another option is to get this implemented directly in RSC, so you don’t need to do anything, but all of a sudden pins can be queried with dplyr without having to fetch the entire dataset. Let me know when this starts to become urgent to make sure I prioritize this appropriately. My goal for now is to get the package out and get some usage/feedback going; then we can spend some time improving where it makes the most sense.
    iainmwallace
    @iainmwallace
    That sounds great! I like the RSC connection idea.
    iainmwallace
    @iainmwallace
    FYI, I explored using BigQuery as a repository a while ago and was very impressed: https://github.com/iainmwallace/DataDepository/blob/master/README.md
    Javier Luraschi
    @javierluraschi
    Ah, yes, BigQuery would be quite popular. Fortunately, dplyr is great at supporting multiple databases; therefore, I think we would need dplyr support in pins and then let dplyr do its magic with BigQuery :)
    iainmwallace
    @iainmwallace
    Agreed!!
    Carl Boettiger
    @cboettig
    Hi @javierluraschi, just hopping over here from the issue I mentioned in https://github.com/rstudio/pins/issues/168#issuecomment-582245462
    I'm still quite confused. I can confirm that your example works perfectly when I test it out in https://github.com/cboettig/pins-test
    However, if I try uploading some of my actual files (still way below 2 GB; I'm testing with .tsv.bz2 files that are between 10 and 100 MB), I get the same curl errors as I first reported.
    (you can see the files I'm trying to upload here, https://github.com/boettiger-lab/taxadb-cache/releases/tag/2019, where I had previously uploaded them to a release attachment by using the piggyback package)
    Carl Boettiger
    @cboettig
    I've added a simpler example to the issue thread: just by using a slightly larger data file (~30 MB), I'm back to getting the error.
    Javier Luraschi
    @javierluraschi
    Great! I can reproduce now...
    > library(pins)
    > board_register(board = "github", repo = "javierluraschi/datasets", branch = "rsconf")
    > pin("flights.tsv", board = "github")
    Error in curl::form_file(path, type) : is.character(type) is not TRUE
    Javier Luraschi
    @javierluraschi
    Uh oh, all large GitHub uploads seem broken :( A fix is on its way; I will push it to CRAN as well.
    Carl Boettiger
    @cboettig
    Awesome! Thanks much for this. I will continue banging away with pins!