    Martin Durant
    @martindurant
    I’m not sure what you are asking, sorry. intake-server is a CLI included with Intake, and it listens for connections on some given port, returning the contents of a catalogue file you specify. You can connect to this from a python session with intake.open_catalog("intake://server:port").
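    For concreteness, a minimal sketch of that round trip (the port number and catalogue filename below are placeholder values; check intake-server --help for the exact CLI options):

        # shell: serve a catalogue file on a chosen port (example values)
        #   intake-server my_catalog.yml --port 5555

        import intake

        # from another Python session, connect to the running server
        cat = intake.open_catalog("intake://localhost:5555")
        list(cat)  # names of the entries the server exposes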
    Tim Hopper
    @tdhopper
    "by connecting to a third-party data service (e.g., SQL server) or through an Intake Server protocol" seems to distinguish between using Intake server and using a SQL catalog
    but I can't find anything else in the docs about using a "third-party data service" catalog
    Martin Durant
    @martindurant
    Yes, those are two things. The intake server is a specific intake protocol, which you can use as I suggest. However, you can also interact with other things that provide a catalogue-like interface. In the case of SQL, see https://intake-sql.readthedocs.io/en/latest/api.html#intake_sql.SQLCatalog
    If you go to the source, you’ll see it’s rather easy to write such drivers. What data service do you have in mind?
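    If the SQL route were what was wanted, usage would look roughly like this (a sketch only; the connection URI is a placeholder and the SQLCatalog signature should be checked against the intake-sql API docs linked above):

        from intake_sql import SQLCatalog

        # each table in the database is exposed as a data source entry
        cat = SQLCatalog("postgresql://user:pass@host:5432/mydb")  # placeholder URI
        list(cat)                     # table names
        df = cat["my_table"].read()   # load one table as a dataframe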
    Tim Hopper
    @tdhopper
    What I have in mind is storing what might otherwise be stored in YAML in a sql database
    not generating data sources from tables in a database
    Martin Durant
    @martindurant
    OK, such a thing has been done for mongo; you would need to write a driver which queries your DB and turns the results into Entries. Presumably the connection string for the DB and the query to run would be parameters. You could implement far more sophisticated search on such a driver.
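    A rough skeleton of such a driver, assuming the intake Catalog/LocalCatalogEntry base classes; the table layout, run_query helper, and driver names are purely illustrative:

        from intake.catalog.base import Catalog
        from intake.catalog.local import LocalCatalogEntry

        class DBCatalog(Catalog):
            """Catalog whose entries are stored as rows in a database (sketch)."""
            name = "db_cat"

            def __init__(self, conn_str, query, **kwargs):
                self.conn_str = conn_str  # where the catalogue rows live
                self.query = query        # e.g. "SELECT name, description, driver, args FROM sources"
                super().__init__(**kwargs)

            def _load(self):
                # run the query and turn each row into an Entry
                self._entries = {}
                for name, description, driver, args in run_query(self.conn_str, self.query):  # run_query is hypothetical
                    self._entries[name] = LocalCatalogEntry(
                        name=name,
                        description=description,
                        driver=driver,   # e.g. "csv", "parquet"
                        args=args,       # dict of arguments for that driver
                        catalog=self,
                    )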
    Tim Hopper
    @tdhopper
    okay, thanks. Do you have a link to the Mongo example?
    Martin Durant
    @martindurant
    There are also some examples at https://intake.readthedocs.io/en/latest/plugin-directory.html for various scientific metadata services, though those are probably further from your use case.
    This is the package (https://nsls-ii.github.io/intake-bluesky/), but it’s fairly specific to the archive service in question and includes a lot of extras, such as linking to processed or raw data.
    Tim Hopper
    @tdhopper
    my interest is in the same functionality as YAML but more maintainable
    Martin Durant
    @martindurant
    Yes, understood, seems worthwhile. I am happy to help review, and a driver like this could be included in the Intake main package, as being generally useful. Even a sqlite3 file might be much more compact and speedy compared to YAML. You will need to consider the schema for your table - which depends on how much the data sources have in common.
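    As one possible illustration of such a schema (entirely an assumption, not an agreed design), a single table with the driver arguments stored as JSON would hold the same information as a YAML catalogue:

        import json
        import sqlite3

        conn = sqlite3.connect("catalog.db")
        conn.execute(
            """CREATE TABLE IF NOT EXISTS sources (
                   name        TEXT PRIMARY KEY,
                   description TEXT,
                   driver      TEXT,   -- e.g. 'csv', 'parquet'
                   args        TEXT    -- JSON-encoded driver arguments
               )"""
        )
        conn.execute(
            "INSERT OR REPLACE INTO sources VALUES (?, ?, ?, ?)",
            ("ufo_sightings", "data around ufo sightings", "csv",
             json.dumps({"urlpath": "data/ufo_scrubbed.csv"})),
        )
        conn.commit()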
    Tim Hopper
    @tdhopper
    👍 I will keep it in mind
    Martin Durant
    @martindurant
    If you find that what you do is not easily generalisable, a public separate repo would be fine, and I’d be happy to give pointers.
    CarpeFridiem
    @brentonmallen1
    Hi all, I'm running into an issue that I can't seem to find information on. I'm attempting to load a csv file, and in the catalog yaml file I'm trying to describe the list of column names as well as the data types for each column. The issue I'm running into is that the first column in the csv is being used as the index, shifting the column names to the right by one. This results in a dataframe that has an unnamed index and the last column name dropped, since column_1 gets the name intended for column_0. In other words, given a csv that does not have an index column, is there a way to have an index column generated on read while leaving the column structure alone? I hope that makes sense.
    [screenshot: Capture.JPG]
    my catalog yaml looks partially like this:
    sources:
        ufo_sightings:
            description: data around ufo sightings
            driver: csv
            args:
                urlpath: "{{CATALOG_DIR}}/data/ufo_scrubbed.csv"
                csv_kwargs:
                    header: 0
                    names: ['dt', 'city', 'state', 'country', 'shape', 'duration_s', 'duration_hm', 'comments', 'date_posted', 'latitude']
                    dtype: {'dt': 'str', 'city': 'str', 'state': 'str', 'country': 'str', 'shape': 'str', 'duration_s': 'str', 'duration_hm': 'str', 'comments': 'str', 'date_posted': 'str', 'latitude': 'str'}
                    infer_datetime_format: true
    (that didn't format the way I was hoping)
    Martin Durant
    @martindurant
    Have you figured out how to do the equivalent of the load you want with pandas (or dask) alone? At first guess, index_col=False as an additional csv_kwarg.
    If you post on Stack Overflow, you’ll be able to format it better and someone (me?) should give you an answer pretty quickly.
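    A quick way to check that outside of Intake (a sketch; whether index_col=False actually fixes the shift here is only the first guess above):

        import pandas as pd

        cols = ['dt', 'city', 'state', 'country', 'shape', 'duration_s',
                'duration_hm', 'comments', 'date_posted', 'latitude']
        # reproduce the catalogue's csv_kwargs directly in pandas first
        df = pd.read_csv("data/ufo_scrubbed.csv", header=0, names=cols,
                         dtype=str, index_col=False)
        print(df.columns.tolist())   # should match cols, with no shift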
    CarpeFridiem
    @brentonmallen1
    Thanks for the reply. Dask seemingly doesn't support the index_col kwarg (although pandas does). As far as I can tell it's something to do with parallelization/partitioning
    I'll go ahead and try to post on stack overflow
    Martin Durant
    @martindurant

    Dask seemingly doesn't support the index_col kwarg

    Hm… I thought we had fixed a simpler way to do that.

    CarpeFridiem
    @brentonmallen1
    right now it just raises an error saying that the kwarg isn't supported
    Martin Durant
    @martindurant
    So your problem is more dask than Intake. As far as Intake is concerned, your question is really “why do I have to use Dask?”, which is very valid :)
    CarpeFridiem
    @brentonmallen1
    in short, yes haha
    if I could use pandas instead, then I think I'd be fine but Dask is limiting
    It probably makes total sense to use dask though from a speed/performance perspective
    the ability to choose between implementations would be nice though, kind of like the engine kwarg in read_csv, but I imagine that would be low priority
    CarpeFridiem
    @brentonmallen1
    I made a stack overflow post that's dask centric, but if anyone would like to follow along (or provide any insight) it can be found here: https://stackoverflow.com/questions/65254645/column-name-shift-using-read-csv-in-dask
    thanks for y'all's time
    Martin Durant
    @martindurant
    Sorry, not coming to the monthly meeting tomorrow, since I have a conflict, and in any case, not much has happened over December. I invite interested people to have a look at intake/intake#564 , which could be a big change for the future! @/all
    Dan Allan
    @danielballan
    Thanks, @martindurant! Will do.
    Zachary Blackwood
    @blackary
    Does filename-pattern-parsing only work for certain drivers? I've seen examples where it works with yml and csv, but can't seem to get it to work with a parquet file on S3. I'm trying to get something like this to work, but keep getting FileNotFound errors, even when files matching that pattern exist.
    name:
        driver: parquet
        args:
            urlpath: s3://BUCKET/PREFIX-{type}.parquet
    Martin Durant
    @martindurant

    Correct: whereas it’s typical to load a single dataset from many CSV files, for parquet a dataset is normally understood to be a single path and all the files under it.

    It would not be too far-fetched to apply the pattern matching to any URLs (see intake.source.base.PatternMixin), but Intake would have to explicitly call glob and concat on the resulting datasets, which is not implemented.

    Zachary Blackwood
    @blackary
    OK, thanks
    Martin Durant
    @martindurant
    Note that CSV has path_as_pattern=True as an argument.
    You can still template any argument, though, either requiring the user to provide the value at runtime, or setting a user_parameter with defaults/options to indicate what the possibilities are.
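    So one workaround, assuming the urlpath is written with jinja-style {{type}} templating rather than the {type} pattern form, is to supply the value when the entry is accessed (catalogue path and entry name below are placeholders):

        import intake

        cat = intake.open_catalog("catalog.yml")   # placeholder catalogue path
        # "type" is the templated parameter in the urlpath
        src = cat["name"](type="daily")            # fills in s3://BUCKET/PREFIX-daily.parquet
        df = src.read()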
    Davis Bennett
    @d-v-b
    I have a few questions about the ZarrArraySource, are the dtype and shape attributes referenced in the docstring missing from this class for a reason?
    and why are properties like chunks and npartitions set only in the _get_schema() method (which is only ever called in this class for its side effects?)
    Martin Durant
    @martindurant
    The first part is probably an omission - indeed I don’t see the dtype/shape being set anywhere, even though these are well-defined for a zarr source.
    As for things being set in _get_schema, this is a normal pattern, because we want to defer opening the actual target until the user wants some details or tries to read the data. Opening the target might take non-trivial time, so we don’t normally do it in a source’s init method.
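    The pattern looks roughly like this (a generic sketch of a DataSource subclass, not the actual ZarrArraySource code; open_the_target stands in for whatever slow open step applies):

        from intake.source.base import DataSource, Schema

        class MyLazySource(DataSource):
            """Illustrative source that defers opening the target until needed."""
            container = "ndarray"
            name = "my_lazy"
            version = "0.0.1"
            partition_access = True

            def __init__(self, urlpath, metadata=None):
                self.urlpath = urlpath
                self._arr = None          # nothing opened yet
                super().__init__(metadata=metadata)

            def _get_schema(self):
                if self._arr is None:
                    # potentially slow, so deferred until discover()/read()
                    self._arr = open_the_target(self.urlpath)  # hypothetical helper
                # chunks/npartitions/dtype/shape are set here, not in __init__
                return Schema(dtype=str(self._arr.dtype),
                              shape=self._arr.shape,
                              npartitions=1,
                              extra_metadata={})

            def read(self):
                self._get_schema()        # ensures the target is open
                return self._arr[:]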
    Davis Bennett
    @d-v-b
    and where does the output of that _get_schema method get used? because it's never called in the class methods for its return value
    Martin Durant
    @martindurant
    If you call .discover() - but the array object is stored as an attribute and used by the reading methods
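    For example (a sketch; the catalogue path and entry name are placeholders, and the keys shown are the usual ones discover() returns):

        import intake

        cat = intake.open_catalog("catalog.yml")   # placeholder
        src = cat["my_zarr_array"]                 # placeholder entry name
        info = src.discover()                      # triggers _get_schema under the hood
        print(info["dtype"], info["shape"], info["npartitions"])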
    Davis Bennett
    @d-v-b
    thanks, that's helpful.
    Martin Durant
    @martindurant
    The parent class(es) contain a lot of extra stuff that is done implicitly here. Please do check whether you get a sensible dtype, though - if not, a PR is more than welcome.
    Davis Bennett
    @d-v-b
    maybe i'm missing something but it looks like the zarr array source isn't tested
    Martin Durant
    @martindurant
    Intake.sources.tests.test_npy::test_zarr_minimal does, but not much
    Davis Bennett
    @d-v-b
    ya i just found that
    i will look into adding checks for dtype and shape
    Martin Durant
    @martindurant
    again, if you have appetite to flesh that out...
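    A minimal test along those lines might look like this (a sketch only: it assumes the class is importable as intake.source.zarr.ZarrArraySource and accepts a local path, and the assertions describe the intended behaviour once dtype/shape are populated):

        import numpy as np
        import zarr
        from intake.source.zarr import ZarrArraySource   # assumed import path

        def test_zarr_dtype_and_shape(tmp_path):
            data = np.arange(12, dtype="int32").reshape(3, 4)
            path = str(tmp_path / "arr.zarr")
            zarr.save(path, data)                 # write a small array to disk

            src = ZarrArraySource(path)
            info = src.discover()
            assert np.dtype(info["dtype"]) == data.dtype
            assert tuple(info["shape"]) == data.shape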
    Davis Bennett
    @d-v-b
    Martin Durant
    @martindurant
    Thanks for posting