Uwe L. Korn
@xhochy
Probably needs a change in line 1019 to no_result: Any = object()
Marco Gorelli
@MarcoGorelli

@xhochy if I do

$ git checkout master
$ git fetch upstream master
$ git reset --hard upstream/master
$ mypy pandas

then I get a really long list of errors, starting with

pandas/io/common.py:503: error: Argument 1 to "writestr" of "ZipFile" has incompatible type "Optional[str]"; expected "Union[str, ZipInfo]"

and ending with

pandas/__init__.py:347: error: Incompatible default for argument "dummy" (default has type "int", argument has type "__SparseArraySub")
Found 201 errors in 24 files (checked 957 source files)

Do you think this means there is something wrong with my development environment? If you run mypy pandas from the master branch, do you get no errors?

Uwe L. Korn
@xhochy
The mypy version needs to be pinned to the same one as on the master branch (mypy==0.730); with a newer mypy you get more errors.
The pre-commit hook, though, should use exactly that version and should thus report the same errors.
Marco Gorelli
@MarcoGorelli
@xhochy great, thanks! With mypy 0.730, I get no errors :)
Marco Gorelli
@MarcoGorelli

Sorry to keep asking, but I still can't get around this during pre-commit.

I have run

$ pre-commit clean

I have mypy 0.730 installed:

$ mypy --version
mypy 0.730

The following returns no errors:

$ mypy pandas
Success: no issues found in 957 source files

but, when I try to commit the diff I showed above, I still get

mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

pandas/core/groupby/generic.py:1064: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1073: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1074: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1109: error: "object" has no attribute "shape"
pandas/core/groupby/generic.py:1112: error: "object" has no attribute "iloc"
Found 5 errors in 1 file (checked 1 source file)

Might have to develop without pre-commit (at least without mypy) for now then

Uwe L. Korn
@xhochy
I can reproduce this behaviour also. This is because the pre-commit hook is only running a stripped-down type check and this seems to miss some inference information.
Joris Van den Bossche
@jorisvandenbossche
Yes, from time to time, it fails on unrelated lines. Then doing a --no-verify when committing might be needed
Uwe L. Korn
@xhochy
Is this then really a useful check, or should we rather get rid of it? I spent a bit of time making it faster, but if it leads to false positives, it is of no help either.
Joris Van den Bossche
@jorisvandenbossche
Most of the time it is still helpful, I would say
Uwe L. Korn
@xhochy
An alternative would be to use dmypy run -- --follow-imports=skip pandas (probably with --timeout 3600). This would run the checks on the whole scope but, through the daemon, should be much faster than reloading the cache every time. I'm not sure, though, how large the slowdown from branch switching will be.
Marco Gorelli
@MarcoGorelli

If I'm working with an ArrayLike (which is defined as

ArrayLike = TypeVar("ArrayLike", "ExtensionArray", np.ndarray)

) is there a way to tell mypy that it is specifically of Categorical type, i.e. that it has .categories and .codes attributes? Adding

assert is_categorical(my_object)

doesn't work

William Ayd
@WillAyd
Not at the moment; we don’t define ExtensionArray as a generic type (though ideally maybe we should)
Also you can’t currently narrow types using functions like is_categorical (if you search the mypy issue tracker there is a request to allow functions to declare that they narrow types, but no one has implemented it)
Your best bet for now would be to keep the assert and cast directly thereafter to the type you need
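A minimal sketch of that assert-then-cast pattern (describe_categorical is a made-up name, and pandas._typing is internal API, so treat the ArrayLike import as illustrative):

from typing import cast

from pandas import Categorical
from pandas._typing import ArrayLike  # the TypeVar quoted above; internal API
from pandas.api.types import is_categorical

def describe_categorical(arr: ArrayLike):
    assert is_categorical(arr)    # runtime check; mypy does not narrow on this
    cat = cast(Categorical, arr)  # explicit cast so the attributes below type-check
    return cat.categories, cat.codes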
Marco Gorelli
@MarcoGorelli
Thanks @WillAyd , can confirm that works!
Vishesh Mangla
@XtremeGood
import pandas as pd
df = pd.read_json("/kaggle/input/caselaw-dataset-illinois/xml.data.jsonl.xz")
what would be the right way to open this file?
mukul-agrawal-09
@mukul-agrawal-09
Hey guys, can somebody help me with an agglomerative clustering issue? I have spatial data with weights assigned to each location/object. I want to build clusters using thresholds on both the weights and the distance between locations. Can somebody suggest a technique or a research paper related to this approach?
Basically, any clustering algorithm which takes two parameters for the threshold.
matrixbot
@matrixbot

hannsen94 > <@gitter_xtremegood:matrix.org> what would be the right way to open this file?

It seems like the file is a compressed one (based on the *.xz ending). Maybe you should decompress it either before even using Python (with the xz command-line tool) or by using the Python package lzma.
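For what it's worth, a hedged alternative: since the file looks like xz-compressed JSON Lines (.jsonl.xz), pandas can usually read it directly without decompressing first (the path is just the one from the question above):

import pandas as pd

# lines=True parses one JSON object per line; compression is inferred
# from the ".xz" suffix, or can be passed explicitly as below.
df = pd.read_json(
    "/kaggle/input/caselaw-dataset-illinois/xml.data.jsonl.xz",
    lines=True,
    compression="xz",
)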

Vishesh Mangla
@XtremeGood
Done that, thanks @matrixbot
Jim Klo
@jimklo
Hi, I’ve got a bunch of 2-column DataFrames that I want to join by their index, but I’d like to add a discriminator to the column indices… e.g. the columns in each DF are “date” and “value”, but when joining I’d like the column MultiIndex to be (state, country, date), (state, country, value)…. Can anyone suggest how I do this?
Mr. Motanovici
@Motanovici
@jimklo I have found a link that might be of use, I hope: Pandas MultiIndex. In essence, you make an array and zip the elements into tuples; you can then build a MultiIndex from those tuples and assign names to the levels. This is what I understood your problem to be.
Jim Klo
@jimklo
@Motanovici I know a MultiIndex is what I want… the problem I was having was how to construct it from the data I had. This is how I ultimately did what I wanted https://bpaste.net/6TGQ, but I think there is a more efficient way.
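A possibly more efficient route, sketched under the assumption that each frame is identified by a (state, country) pair (df_ca and df_tx are made-up demo frames):

import pandas as pd

idx = pd.to_datetime(["2020-01-01", "2020-01-02"])
df_ca = pd.DataFrame({"date": idx, "value": [1.0, 2.0]}, index=idx)
df_tx = pd.DataFrame({"date": idx, "value": [3.0, 4.0]}, index=idx)

frames = {("CA", "US"): df_ca, ("TX", "US"): df_tx}

combined = pd.concat(
    list(frames.values()),
    axis=1,                      # align on the index, side by side
    keys=list(frames.keys()),    # the tuples become the outer column levels
    names=["state", "country"],  # names for the key levels; the original
                                 # columns form the innermost level
)
# combined.columns is now a MultiIndex like
# ("CA", "US", "date"), ("CA", "US", "value"), ("TX", "US", "date"), ...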
Gautam
@gautam1858

Hosting TFUG Mysore's first meetup: TensorFlow JS Show and Tell by Jason Mayes, Senior Developer Advocate at Google. 8 presenters showing what they have #MadeWithTFJS, with epic demos lined up + more. RSVP now

http://meetu.ps/e/HTwBV/jYwqF/a

RokoMijic
@RokoMijic

Hi everyone;
I have an issue with merge operations in Pandas.

If you take two pandas dataframes where every column is of int8 dtype and merge them, what you get back is a dataframe of int64 and float64 columns. The float type appears where an outer join introduces missing values.

This is a problem because the size in memory grows by a factor of 8 or more, and I don't see an option to prevent this behavior.

Is this something that's worth opening an issue about or is there a known solution?

(The merge operation also seems to crash Dask, which was the first workaround I tried, when it runs out of memory.)

Thanks,

R
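For illustration, a tiny reproduction of the dtype blow-up described above (left, right, and the key values are made-up demo data):

import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": [10, 20]}, dtype="int8")
right = pd.DataFrame({"key": [2, 3], "b": [30, 40]}, dtype="int8")

merged = left.merge(right, on="key", how="outer")
# Columns with join-introduced missing values come back as float64,
# since NaN cannot be stored in an integer column.
print(merged.dtypes)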

Mr. Motanovici
@Motanovici
@RokoMijic If I understand your problem correctly, you could try using astype() to solve it. On the resulting merged DataFrame, call this method and specify the dtype to be used: 'int8' in your case. Here is a link to the documentation: Astype
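A minimal sketch of that suggestion (again with made-up demo frames); note that it only works when the join introduces no missing values, since NaN cannot be cast to int8:

import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": [10, 20]}, dtype="int8")
right = pd.DataFrame({"key": [1, 2], "b": [30, 40]}, dtype="int8")

merged = left.merge(right, on="key", how="outer")
# Downcast everything back to int8; this raises if the merge introduced NaNs.
merged = merged.astype("int8")
print(merged.dtypes)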
matrixbot
@matrixbot
Raf Not sure if that would work. By that time the resulting merged df is already big enough to cause problems. I think the dfs should be pre-processed for missing values. If they're int8 now, what represents their missing value? If there are none, but missing values result from the join, perhaps you have to be more selective/crafty with the join.
Manikaran
@Manikaran20
Hi everyone, I'm a Django developer and Python lover who has been working for a software development company. It took me a while to realize that, even though I'm doing fine at it, I don't really enjoy working in the web domain. So I'm thinking of shifting towards data science. I've been learning NumPy and pandas to start with; can you guys share a platform where I can learn and practice NumPy and pandas at the same time?
Thanks!
matrixbot
@matrixbot
Raf Read the pandas book, but skip the first 3 chapters or so. Then read some stats books. Your coding is good already
RokoMijic
@RokoMijic

Raf Not sure if that would work. By that time the resulting merged df is already big enough to cause problems.

Exactly. As soon as pandas creates this monstrosity, my machine dies. Doing astype() after your machine has died is not helpful.

However

@RokoMijic If I understand your problem correctly, you could try using astype() to solve it. On the resulting merged DataFrame, call this method and specify the dtype to be used: 'int8' in your case.

See above

matrixbot
@matrixbot
Raf What's the source of the dfs?
Raf Perhaps you could be more selective with what you load initially
RokoMijic
@RokoMijic
The dataframes are read from CSV files.
However, I did come up with a solution: I wrote my own merge() function that preserves the input frames' data types and fills NaN with 0, -1, or some other small int of your choice.
This was successful. I am considering documenting it on Stack Overflow.
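RokoMijic's actual code isn't shown in the chat, but a rough sketch of such a dtype-preserving wrapper might look like this (merge_preserve_dtypes and fill_value are made-up names; columns renamed by merge suffixes simply keep the merged dtype):

import pandas as pd

def merge_preserve_dtypes(left, right, fill_value=-1, **merge_kwargs):
    """Merge two frames, fill the NaNs the join introduces with a sentinel,
    then cast each column back to the dtype it had in the inputs."""
    original_dtypes = {**left.dtypes.to_dict(), **right.dtypes.to_dict()}
    merged = left.merge(right, **merge_kwargs)
    merged = merged.fillna(fill_value)
    for col, dtype in original_dtypes.items():
        if col in merged.columns:
            merged[col] = merged[col].astype(dtype)
    return merged

# Example usage with made-up demo frames:
left = pd.DataFrame({"key": [1, 2], "a": [10, 20]}, dtype="int8")
right = pd.DataFrame({"key": [2, 3], "b": [30, 40]}, dtype="int8")
merged = merge_preserve_dtypes(left, right, on="key", how="outer")
print(merged.dtypes)  # all columns back to int8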
matrixbot
@matrixbot
Raf I was going to recommend trying Apache Drill to query and do the join before loading it all into pandas. That's how I would do it. But your solution sounds good!
RokoMijic
@RokoMijic
The only thing I'm confused about is why it also caused Dask to crash.
Does Dask also need to create the entire merged dataframe in memory?
matrixbot
@matrixbot
Raf The point of Dask is to handle dfs that are too big for memory by incorporating disk. So I am confused too.
RokoMijic
@RokoMijic
I am on AWS, so it's possible the problem is related to S3 or local storage.
But I have a series of dataframes of increasing size: the first 5 fit in memory for the pandas merge, and Dask also handles them. With the 6th or 7th, pandas crashes with a MemoryError and Dask just crashes the kernel with no error.
matrixbot
@matrixbot
Raf Nice work
RokoMijic
@RokoMijic
Thanks! The code is untested, so if anyone knows how to hook it up to the test suite for the pandas merge function, that would be nice.
or if anyone knows a less hacky way of doing this
lokendra singh chouhan
@loks15

Python Cheatsheet - learn the basics of Python without any book or course, or brush up on the basic concepts

https://cheatsheets.tutorials24x7.com/programming/python