Luciano
@lJoublanc
No idea about this but you need to bear in mind that the performance is asymmetric - writing records requires compression (which is slow), whereas reading them is fast, as decompression can approach memcpy speed. Again your performance will depend on the entropy of the data - if you have high compression it will be able to push stuff through much faster. Are you working with asset prices? Or something non-financial?
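A minimal sketch of the asymmetry Luciano describes, using the lz4 package arctic compresses segments with; the payload is synthetic and the timings only indicative:

    import time
    import numpy as np
    import lz4.block

    # Repetitive (low-entropy) data, so it compresses well.
    data = np.tile(np.arange(1000, dtype='f8'), 1000).tobytes()

    t0 = time.perf_counter()
    compressed = lz4.block.compress(data)        # slow path: the write side
    t1 = time.perf_counter()
    restored = lz4.block.decompress(compressed)  # fast path: the read side
    t2 = time.perf_counter()

    assert restored == data
    print('ratio %.1fx, compress %.4fs, decompress %.4fs'
          % (len(data) / len(compressed), t1 - t0, t2 - t1))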
Matteo Angeloni
@mattange
hi - it seems like my previous posts are not visible. Thanks for responding. Basically I'm looking after macroeconomic datasets, so for now at least the series are not strictly speaking financial time series, but I wanted to have a bit of both (or at least the flexibility to get there)
Bryant Moscon
@bmoscon
it could be reasonable
it's going to depend on a lot of factors - how many writes? how many symbols? is the mongo instance hosted locally, on a separate machine on the LAN, or somewhere on the internet? what are the machine specs, bandwidth, etc.?
if you look here: manahl/arctic#582
there was a spreadsheet posted with some read/write statistics
Ewan Higgs
@ehiggs
Hi. Is there a doc for the internal document layout of arctic documents?
Luciano
@lJoublanc
Note that the layout will depend on what storage engine you're using i.e. versionstore or tickstore.
Ewan Higgs
@ehiggs
yup. thanks
Luciano
@lJoublanc
Hi, @bmoscon @jamesblackburn may I ask if you have used arctic on mongodb 4.x versions? What version do you use in prod and/or consider the most stable? I have an old 2.6 DB I need to migrate :grimacing:
Bryant Moscon
@bmoscon
I haven't tried 4.0 yet
I will soon
been using 3.6
it works fine
Luciano
@lJoublanc
Thank you - much appreciated. The mongodb release notes say I need to upgrade major versions successively. I'm really looking forward to four consecutive upgrades :laughing:
Dimosthenis Pediaditakis
@dimosped
@bmoscon @TomTaylorLondon @jamesblackburn 1.68.0 is out
Luciano
@lJoublanc
I'm trying to figure out the use of rowmask in TickStore. It appears this isn't used for dataframes at all, only when the argument to write is a list of dicts.
I assume that this is because these can have 'columns'/fields missing, and so they need to be marked absent, whereas a dataframe will have all values present. Would that mean that double NaNs would still have a 'ones' bitmask?
Or have I entirely misunderstood the concept here ...?
Luciano
@lJoublanc
Another question about TickStore: the fields END_SEQ, START_SEQ, SEGMENT don't appear to be used in the python driver. Anything you are willing to share about those fields and what you use them for?
Bryant Moscon
@bmoscon
there are multiple rowmasks used in the code, which line? I also am not sure about those fields in tickstore. They may be legacy leftovers
Luciano
@lJoublanc
I'm referring to the ones inside the COLUMN field.
You have a triplet of data, dtype, rowmask
but rowmask only gets written if you pass in a list, not when you pass in a dataframe.
Sorry, I mean it gets set to all ones if you pass in a dataframe.
So if you pass a list of dicts, I suppose it sets zeros for the fields that are missing.
Bryant Moscon
@bmoscon
dicts can be sparse, but data frames cannot, hence the all 1s for DFs
that's what it's for, I believe
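A hedged sketch of the idea being discussed (not arctic's actual implementation): one packed bit per row and column, marking which ticks carried a value for that column.

    import numpy as np

    # Hypothetical ticks: the second one is missing the 'ask' field.
    ticks = [
        {'bid': 100.0, 'ask': 100.5},
        {'bid': 100.1},
        {'bid': 100.2, 'ask': 100.7},
    ]

    def rowmask(ticks, column):
        """One bit per tick: 1 if the tick has a value for `column`."""
        present = np.array([column in t for t in ticks], dtype=np.uint8)
        return np.packbits(present)  # packed into bytes for compact storage

    print(rowmask(ticks, 'bid'))  # all present -> [224] (0b11100000)
    print(rowmask(ticks, 'ask'))  # middle bit 0 -> [160] (0b10100000)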
Luciano
@lJoublanc
Morning. I've found that frames with columns of type numpy.int32 are written as numpy.int64 into arctic. And using my custom scala driver, if I store them as int32, then reading them using your python driver I get float64 back!
This seems to be a quirk with pandas rather than arctic. I'm aware you have java tick loggers, and was wondering if you've run into this problem before? Not sure how to handle it, besides raising upstream with pandas.
Luciano
@lJoublanc
Ah no, in fact it's this (in tickstore I should have mentioned):
    def _set_or_promote_dtype(self, column_dtypes, c, dtype):
        existing_dtype = column_dtypes.get(c)
        if existing_dtype is None or existing_dtype != dtype:
            # Promote ints to floats - as we can't easily represent NaNs
            if np.issubdtype(dtype, int):
                dtype = np.dtype('f8')
            column_dtypes[c] = np.promote_types(column_dtypes.get(c, dtype), dtype)
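A standalone demonstration of what that branch does to the int32 column mentioned above (mirroring the snippet's logic rather than calling arctic, and using np.integer instead of the bare int, whose behaviour varies across numpy versions):

    import numpy as np

    dtype = np.dtype('i4')                 # the int32 column as written
    if np.issubdtype(dtype, np.integer):   # ints can't represent NaN...
        dtype = np.dtype('f8')             # ...so it is promoted to float64
    print(dtype)  # float64 - matching the round-trip Luciano observed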
Luciano
@lJoublanc
Hi all, I've open-sourced my scala-based driver https://gitlab.com/lJoublanc/scarctic :grin:
There are currently no jars, so you'll need to build from source, but hopefully that will be remedied by the end of the week.
Luciano
@lJoublanc
Jars are now released on the bintray repo (see the readme for how to fetch them).
o4308105
@o4308105
Hey all. Is arctic a good place to store and retrieve real-time data? So maybe 1-min candle data (low, high, open, close, vol)? If not, what about 5 or 15 min? Is there a different project that is better for this? Or would this be a manual thing?
Luciano
@lJoublanc
@o4308105 yes, have a look in the wiki for some videos.
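For reference, a minimal sketch of storing candle data in a VersionStore; the host, library, and symbol names here are made up:

    import pandas as pd
    from arctic import Arctic

    store = Arctic('localhost')        # assumes a local mongod is running
    store.initialize_library('ohlcv')  # VersionStore by default
    lib = store['ohlcv']

    # Hypothetical 1-min bars, indexed by timestamp.
    bars = pd.DataFrame(
        {'open': [1.0, 1.1], 'high': [1.2, 1.3], 'low': [0.9, 1.0],
         'close': [1.1, 1.2], 'volume': [100, 150]},
        index=pd.date_range('2019-01-30 09:30', periods=2, freq='T'),
    )

    lib.write('EXAMPLE_SYMBOL', bars)
    item = lib.read('EXAMPLE_SYMBOL')  # returns a VersionedItem
    print(item.data)                   # the DataFrame round-trips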
ilpomo
@ilpomo
Hi all, how does Arctic performance compare to Timescaledb and Influxdb?
Bryant Moscon
@bmoscon
@ilpomo as far as I am aware no one has compared their performance. they are quite different, though: arctic stores the data in a format that can be read out as a pandas dataframe directly
the other datastores do not, as far as I am aware
ilpomo
@ilpomo
@bmoscon yeah I noticed that, really useful feature to get a pandas dataframe when querying the db. It makes Python developers feel comfortable. Query performance also seems quite good from my early tests: I have 50GB of historical 1-min OHLCV data for 500+ assets, and I can retrieve a 1-month date range in < 1 sec after storing the data with daterange='D'. I'm wondering how much changing daterange to 'M' or 'Y' will affect query performance. Also, it would be awesome if someone could update the tickstore docs. Have a nice day and good work.
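What ilpomo describes sounds like ChunkStore's chunk_size parameter, which sets the 'D'/'M'/'Y' chunk granularity. A minimal sketch, with the library and symbol names invented:

    import pandas as pd
    from arctic import Arctic, CHUNK_STORE
    from arctic.date import DateRange

    store = Arctic('localhost')
    store.initialize_library('ohlcv_chunks', lib_type=CHUNK_STORE)
    lib = store['ohlcv_chunks']

    df = pd.DataFrame({'close': range(10)},
                      index=pd.date_range('2019-01-01', periods=10, freq='D'))
    df.index.name = 'date'  # ChunkStore needs a 'date' index or column

    lib.write('EXAMPLE', df, chunk_size='D')  # one chunk per day
    subset = lib.read('EXAMPLE',
                      chunk_range=DateRange('2019-01-03', '2019-01-05'))
    print(subset)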
clickingbuttons
@clickingbuttons
Hello! Any plans to open-source the Arctic Java tick logger sometime soon? Would save me some time writing my own
Steffen
@SteffenNa
Have a quick question. Is there any built-in way to avoid writing doubles when appending to a versionstore? Writing daily PnL data to a versionstore.
Bryant Moscon
@bmoscon
what do you mean by doubles? duplicate data? for version store you should only write changes. it's append-only
Steffen
@SteffenNa
yes sorry. I mean duplicate data
when I append and the data is already there, it seems to append anyway. was just wondering if there is a built-in way to prevent duplicates when writing or if I need to check for that myself
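There doesn't seem to be a built-in guard, so one manual option is to filter against what is already stored before appending. A hedged sketch assuming a datetime-indexed frame; the helper and names are hypothetical:

    def append_new_rows(lib, symbol, df):
        """Append only the rows whose index isn't already stored."""
        if lib.has_symbol(symbol):
            # for large symbols, read just a date_range instead of everything
            existing = lib.read(symbol).data
            df = df[~df.index.isin(existing.index)]
        if not df.empty:
            lib.append(symbol, df)

    # e.g. append_new_rows(store['pnl'], 'DESK_PNL', todays_pnl)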
kliao
@kliao
Hi. Playing around with Tickstore and getting WARNING:arctic.tickstore.tickstore:NB treating all values as 'exists' - no longer sparse. What does it mean?
Bryant Moscon
@bmoscon
you can ignore that
Adam Li
@adam2392

Hi, I'm a PhD student considering exploring arctic for some time series storage and analysis. However, I'm not going to store financial data. It'll essentially be health time series w/ metadata, possibly multivariate.

Looked online and didn't seem to find anyone ever exploring this.

Was wondering:

  1. has anyone previously tried this and run into issues?
  2. are there any problems I should maybe look into?
  3. any other resources I can read?
Steffen
@SteffenNa
@adam2392
I have been using arctic (and love the simple API) but changed to flat files, which IMHO are better suited than MongoDB for time series. You might want to have a look at Apache Parquet or HDF5.
https://wesmckinney.com/blog/python-parquet-update/
Also Dask is amazing for big data.
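For comparison, a minimal sketch of the flat-file route Steffen suggests, using pandas' parquet support (needs pyarrow or fastparquet installed; the file name and columns are invented):

    import pandas as pd

    # Hypothetical multivariate health time series.
    ts = pd.DataFrame(
        {'heart_rate': [72, 75, 71], 'spo2': [98, 97, 98]},
        index=pd.date_range('2019-01-01', periods=3, freq='H'),
    )

    ts.to_parquet('series.parquet')  # columnar, compressed on disk
    restored = pd.read_parquet('series.parquet')
    assert restored.equals(ts)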