I have an issue pulling a large table (it can't fit in memory) from an Azure database or another server; I need to divide that table into multiple CSVs.
So basically I have no transformation to apply, except for dividing it into equal parts.
I think Dask is the right tool for this?
I tried many ways to make a simple connection to the SQL server, but I just can't get it working:
```python
import dask.dataframe as dd
import sqlalchemy as sa

uri = 'mssql+pyodbc://VM/Data?driver=SQL+Server+Native+Client+11.0'
engine = sa.create_engine(uri)
metadata = sa.MetaData()
posts = sa.Table('posts', metadata, schema='dbo',
                 autoload=True, autoload_with=engine)
query = sa.select([posts])
# read_sql_table takes the connection URI as a string, not an Engine object
sql_reader = dd.read_sql_table('posts', uri=uri, npartitions=16,
                               index_col='userId', schema='dbo')
```
Any help with this?
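For the dividing part, Dask's to_csv writes one file per partition when the target name contains a '*', so something like the sketch below should produce the equal parts (assuming the reader above connects):

```python
# '*' is replaced by the partition number, giving 16 roughly equal CSVs
sql_reader.to_csv('posts-*.csv', index=False)
```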
User \"system:serviceaccount:nublado-athornton:dask\" cannot get resource \"pods\" in API group \"\" in the namespace \"nublado-athornton\""but I have what look like the right rules in my role:
```yaml
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - create
  - delete
```
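Would adding the missing verb be all it takes? The error complains about get, which isn't in that verbs list. A sketch of the same rule with it added, assuming get is really what the service account needs:

```yaml
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get    # the verb the error message complains about
  - list
  - create
  - delete
```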
RandomForestRegressor() constructor, but I still end up with some of my Dask workers using 2000% CPU, which looks really weird.
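2000% CPU means roughly twenty busy threads inside one worker process, which usually points at nested parallelism: the estimator spawning its own threads on top of Dask's. A hedged sketch, assuming a joblib-backed scikit-learn estimator, of keeping each task single-threaded:

```python
from sklearn.ensemble import RandomForestRegressor

# n_jobs=1 keeps the estimator itself single-threaded, so the only
# parallelism left is whatever Dask schedules across its workers.
model = RandomForestRegressor(n_estimators=100, n_jobs=1)
```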
The last time I looked, pandas loaded everything. It would be reasonable to implement that iteratively, and fastparquet does have a specific method for that.
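A minimal sketch of that method, assuming a local file named data.parquet:

```python
import fastparquet

pf = fastparquet.ParquetFile('data.parquet')
# iter_row_groups yields one pandas DataFrame per row-group, so only a
# single row-group is held in memory at a time
for i, chunk in enumerate(pf.iter_row_groups()):
    chunk.to_csv(f'part-{i}.csv', index=False)  # e.g. spill each chunk to disk
```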
Is there a way to do that query without knowing in advance that row-group 1 is where you want to look?
Parquet optionally stores per-column max and min values for each row-group, so maybe those statistics can be used to skip row-groups that can't match the filter.
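fastparquet exposes those statistics through its filters argument; a sketch, assuming a column named userId and the same data.parquet file:

```python
import fastparquet

pf = fastparquet.ParquetFile('data.parquet')
# filters= is checked against each row-group's min/max statistics, so
# row-groups whose range cannot satisfy the predicate are never read
df = pf.to_pandas(filters=[('userId', '>=', 1000), ('userId', '<', 2000)])
```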
split_every equal to or lower than the length of