    Tim Hopper
    @tdhopper
    @blackary and I wrote a driver we're going to open source sometime that lets you create a catalog by expanding patterns in paths
    we have customer data in parquet files that is in s3 where the customer name is in the path, and we only want to open one customer at a time
    so you can create a "pattern catalog" from s3://bucket/{customer}/data.parquet and get a dataset for each customer
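    As a rough sketch of that idea (the driver itself isn't released yet, so everything below is illustrative and not its actual API): expand a template such as s3://bucket/{customer}/data.parquet by globbing with fsspec and recovering the field value from each match.

        import re
        import fsspec

        def expand_pattern(pattern, storage_options=None):
            # Illustrative only: map each value of a single {field} in `pattern`
            # to the concrete path it was found at, e.g.
            # expand_pattern("s3://bucket/{customer}/data.parquet")
            # -> {"acme": "bucket/acme/data.parquet", ...}
            field = re.search(r"\{(\w+)\}", pattern).group(1)
            glob_form = pattern.replace("{" + field + "}", "*")
            fs, _, _ = fsspec.get_fs_token_paths(glob_form, storage_options=storage_options or {})
            bare = fs._strip_protocol(pattern)  # glob results come back without the protocol
            regex = re.compile(
                "^" + re.escape(bare).replace(re.escape("{" + field + "}"), "([^/]+)") + "$"
            )
            out = {}
            for path in fs.glob(fs._strip_protocol(glob_form)):
                m = regex.match(path)
                if m:
                    out[m.group(1)] = path
            return out

    Each resulting path could then be handed to a file-based driver to produce one dataset per customer.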
    Martin Durant
    @martindurant
    Is it specific to parquet, or could you use it in conjunction with any file-based driver?
    Either way, be sure to let us know when it’s out!
    Tim Hopper
    @tdhopper
    it started as specific, but it was easy to modify to take the driver as a parameter
    Martin Durant
    @martindurant
    good to hear
    CarpeFridiem
    @brentonmallen1
    Hey all, I was looking into passing an AWS profile through the s3 -> session/client kwargs, but it doesn't seem like aiobotocore implements the profile_name kwarg found in boto. https://github.com/aio-libs/aiobotocore/blob/0ae9be3d53c56d89c0162df204ee32fea134f232/aiobotocore/session.py#L50
    have y'all come across trying to specify an aws profile in the past?
    Martin Durant
    @martindurant
    dask/s3fs#324
    You want profile=
    This is a top-level kwarg, so you would have storage_options={"profile": "…"}, assuming you are using a driver where it’s called storage_options.
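    For example (bucket and profile names are placeholders; assumes s3fs and, for the second form, a driver such as intake-parquet that forwards storage_options):

        import s3fs
        # `profile` is handled at the top level of S3FileSystem, not inside client_kwargs:
        fs = s3fs.S3FileSystem(profile="my-profile")

        import intake
        # The same option routed through a driver's storage_options:
        source = intake.open_parquet(
            "s3://my-bucket/path/data.parquet",
            storage_options={"profile": "my-profile"},
        )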
    CarpeFridiem
    @brentonmallen1
    ah, I'll give that a try. thank you!
    Dan Allan
    @danielballan
    Martin Durant
    @martindurant
    I’ll have my full zoom account back, so we can take a full hour!
    I’ll point the pangeo group to these thoughts too, maybe they can send someone.
    Dan Allan
    @danielballan
    Awesome, would be great to see them.
    Sorry to dump this with little time for you to consider it. There just happens to be renewed interest in making a server that really works.
    Martin Durant
    @martindurant
    That’s a good thing!
    Dan Allan
    @danielballan
    As you can see the first commit in that second link is this morning haha.
    As you will also see, I have come around on the notion of DataSource.describe(). I concede that my Reader proposal was wrong to omit that hook.
    Martin Durant
    @martindurant

    The separation should be:

    • what we can know from the catalog (catalog server) and imports alone
    • data description that can be found with non-zero cost, but constant time
    • the full data, or parts thereof

    This mixes in a complicated way with parameters that you might want to choose/apply.
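    Roughly how those three tiers line up against the existing DataSource methods (a sketch; the path is made up):

        import intake

        source = intake.open_csv("s3://some-bucket/data/*.csv")  # hypothetical data

        source.describe()    # 1: what the catalog/imports alone already know (free)
        source.discover()    # 2: non-zero but constant-time cost: dtype, shape, npartitions
        df = source.read()   # 3: the full data, or source.read_partition(0) for part of it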

    I’ll have my full zoom account back

    Actually, the link I have in intake/intake#472 points to a BNL zoom

    Dan Allan
    @danielballan
    I agree with that formulation. I think of the example of grabbing data from a NASA website. There is what you know from the index page itself including a little metadata about each DataSource (1), what you might learn by clicking the link and reading a more detailed description of the data and its structure (2), and finally the chunks of data itself (3).
    Sounds good, we can continue to use the BNL Zoom if you like.
    If you'd rather take over that is fine with me too. :-D
    Martin Durant
    @martindurant
    The invite on my calendar still has Julia’s zoom, I don’t know why. If the BNL link that’s there works let’s use it. I’ll post it here too before the meeting.
    Dan Allan
    @danielballan
    Thanks.
    Martin Durant
    @martindurant
    Dr. Thomas A Caswell
    @tacaswell
    http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf <- this seems relevant to our discussion today
    Martin Durant
    @martindurant
    Hmmm… OK, so I’ve only read it for 30s, but it seems quite vapour-like so far. Is this Blaze???
    Dan Allan
    @danielballan
    Haha. Blaze is everything right? ;-)
    Dr. Thomas A Caswell
    @tacaswell
    they have benchmarks and such, so there is some running code under there. It looks like they are leveraging the last 5 or so years of advances in the world-at-large (arrow and ibis in particular)
    Martin Durant
    @martindurant
    Blaze did df-like syntax into DB-specific queries, so yes. I suppose ibis by itself does some of that too.
    Dan Allan
    @danielballan
    If you want to try the server Martin, I'm pushing as I go in https://github.com/danielballan/catalog-server-from-scratch. Just added instructions for running it to the README. It doesn't actually send data yet but you can list things.
    Dr. Thomas A Caswell
    @tacaswell
    but the connection is pushing the knowledge/decision of which exact implementation is being used as far away from the consumer as possible (and not exposing a zoo of to_* methods).
    Martin Durant
    @martindurant
    If datasets have an inherent computation system attached
    Dan Allan
    @danielballan
    Yeah, I guess there are a couple of factors to it. If your datasets have an inherent computation system, that's the one to standardize on. If not, you'll make the choice based on workload (sparse? GPU-friendly? parallel-friendly?) and available hardware (GPUs? a cluster suitable for dask?).
    Whichever factor drives the choice, in our experience so far, the choice is consistent in a given context, and being able to swap it in one place is useful. I'm thinking, "I ran this on cupy when I had GPUs, and now I'm running it on my laptop," or "I want to try this on dask now." You still have the option to explicitly choose your form (compute(), load(), cupy.array, etc.), but you also have the option to leave it implicit and not have explicitly type-specific invocations scattered through your code. Those scattered invocations run right into the problem that NEP-18 tries to solve.
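    A tiny sketch of the "leave it implicit" side of that argument, leaning on NEP-18 dispatch (assumes the array libraries involved implement __array_function__, as numpy, dask.array and cupy do):

        import numpy as np

        def total_counts(stack):
            # np.sum dispatches via __array_function__ (NEP-18), so this one line runs
            # unchanged whether `stack` is a numpy, dask, or cupy array; the backend
            # choice is made once, wherever the data was loaded.
            return np.sum(stack, axis=0)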
    Ryan Abernathey
    @rabernat

    Obscure fsspec question:

                with fs1.open(fname, mode="rb") as source:
                    with fs2.open(fname, mode="wb") as target:
                        target.write(source.read())

    Does this sort of thing "stream" if fs1 and fs2 are both fsspec implementations (like LocalFileSystem to s3fs)?

    Martin Durant
    @martindurant
    I’m afraid source in this snippet has no way to know what its intended use is. You can, of course, read by chunks
    chunksize = 2**20  # e.g. read 1 MiB at a time; any fixed size works
    while True:
        data = source.read(chunksize)
        if not data:
            break
        target.write(data)
    but the filesystem’s cp function would be better - you can test whether the two filesystems are compatible by class matching or by testing the protocol attribute.
    Ryan Abernathey
    @rabernat

    but the filesystem’s cp function would be better

    does that work if fs1 and fs2 are two different types of filesystem?

    Martin Durant
    @martindurant
    No, cp is within a filesystem. There is an issue somewhere about making a “dispatch” filesystem where you always provide the full URL and it decides what to call… but this is not implemented yet.
    Ryan Abernathey
    @rabernat
    so if I want to copy a huge file between two filesystems without loading it all into memory, I have to use the chunksize loop approach
    Martin Durant
    @martindurant
    Yes, if the filesystems are different. You could add an extra check to see whether using cp is appropriate (e.g., fs1 is fs2).
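    Putting the two suggestions together, a minimal sketch (the compatibility check and chunk size are just one possible choice, not an fsspec API):

        import fsspec

        def copy_between(fs1, fs2, src, dst, chunksize=16 * 2**20):
            if fs1 is fs2:
                # Same filesystem instance: let it do a native copy.
                fs1.cp(src, dst)
                return
            # Otherwise stream chunk by chunk so the whole file never sits in memory.
            with fs1.open(src, mode="rb") as source, fs2.open(dst, mode="wb") as target:
                while True:
                    data = source.read(chunksize)
                    if not data:
                        break
                    target.write(data)

        # e.g. local disk -> S3 (bucket name is a placeholder):
        # copy_between(fsspec.filesystem("file"), fsspec.filesystem("s3"),
        #              "/tmp/big.nc", "my-bucket/big.nc")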
    (linked on my Twitter and LinkedIn, for those who want something to interact with)
    Martin Durant
    @martindurant
    @/all : community meeting coming up in 30min
    @danielballan , is your BNL zoom link still good? You can repost it here.
    Martin Durant
    @martindurant
    perfect, thanks.