@xhochy if I do
$ git checkout master
$ git fetch upstream master
$ git reset --hard upstream/master
$ mypy pandas
then I get a really long list of errors, starting with
pandas/io/common.py:503: error: Argument 1 to "writestr" of "ZipFile" has incompatible type "Optional[str]"; expected "Union[str, ZipInfo]"
and ending with
pandas/__init__.py:347: error: Incompatible default for argument "dummy" (default has type "int", argument has type "__SparseArraySub")
Found 201 errors in 24 files (checked 957 source files)
Do you think this means there is something wrong with my development environment? If you do
mypy pandas from the master branch do you get no errors?
Sorry to keep asking, but I still can't get around this during pre-commit.
I have run
$ pre-commit clean
I have mypy 0.730 installed:
$ mypy --version
mypy 0.730
The following returns no errors:
$ mypy pandas
Success: no issues found in 957 source files
but, when I try to commit the diff I showed above, I still get
mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

pandas/core/groupby/generic.py:1064: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1073: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1074: error: "object" has no attribute "_data"
pandas/core/groupby/generic.py:1109: error: "object" has no attribute "shape"
pandas/core/groupby/generic.py:1112: error: "object" has no attribute "iloc"
Found 5 errors in 1 file (checked 1 source file)
Might have to develop without pre-commit (or at least without the mypy hook) for now, then.
Try dmypy run -- --follow-imports=skip pandas (probably with --timeout 3600). This would run the checks over the whole scope, but going through the daemon should be much faster than reloading the cache every time. I'm not sure, though, how large the branch-switching slowdown will be.
If I'm working with an
ArrayLike (which is defined as
ArrayLike = TypeVar("ArrayLike", "ExtensionArray", np.ndarray)
) is there a way to tell mypy that it is specifically of Categorical type, i.e. that it has .categories and .codes attributes? Adding is_categorical?
is_categorical won't narrow the type (if you search the mypy issue tracker there is a request to allow functions to declare that they narrow types, but no one has implemented it).
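A minimal sketch of the usual workaround: mypy does narrow types inside an isinstance() check, so testing against pd.Categorical directly (rather than calling is_categorical) makes .categories visible to the checker. The function name here is illustrative:

```python
import numpy as np
import pandas as pd


def n_categories(arr) -> int:
    # mypy narrows the type on this branch, so .categories
    # is known to exist
    if isinstance(arr, pd.Categorical):
        return len(arr.categories)
    # fall back for plain ndarrays / other array-likes
    return len(np.unique(np.asarray(arr)))
```

Alternatively, typing.cast("Categorical", arr) asserts the type without a runtime check, at the cost of no actual verification.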
I have an issue with merge operations in Pandas.
If you take two Pandas dataframes where every column is an int8 datatype and merge them, what you get back is a dataframe of int64 and float64 columns. The float type appears wherever an outer join introduces missing values.
This is a problem because the size in memory grows by a factor of 8 or more, and I don't see an option to prevent this behavior.
Is this something that's worth opening an issue about or is there a known solution?
(The merge operation also seems to crash Dask when it runs out of memory, which was the first workaround I tried.)
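A small reproduction of the upcast (the frame and column names are illustrative):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": [0, 1, 2], "a": [1, 2, 3]}, dtype="int8")
right = pd.DataFrame({"key": [1, 2, 3], "b": [4, 5, 6]}, dtype="int8")

# outer join: "a" gets no match for key 3, "b" none for key 0
merged = left.merge(right, on="key", how="outer")

# the NaNs introduced by the join force "a" and "b" up to float64
print(merged.dtypes)
```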
Raf: Not sure if that would work. By that time the resulting merged df is already big enough to cause problems. I think the dfs should be pre-processed for missing values. If they're int8 now, what represents their missing value? If there are none, and missingness only results from the join, perhaps you have to be more selective/crafty with the join.
@RokoMijic If I understand your problem correctly, you could try astype() to solve the problem: on the resulting merged DataFrame, call this method and specify the dtype to be used, 'int8' in your case.
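A sketch of that suggestion. Note the NaNs have to be filled first, since NaN cannot be stored in an int8 column; -1 is an assumed sentinel here (any value outside the real data would do):

```python
import pandas as pd

left = pd.DataFrame({"key": [0, 1, 2], "a": [1, 2, 3]}, dtype="int8")
right = pd.DataFrame({"key": [1, 2, 3], "b": [4, 5, 6]}, dtype="int8")

merged = left.merge(right, on="key", how="outer")  # "a"/"b" are now float64

# replace the NaNs introduced by the join with a sentinel, then downcast
compact = merged.fillna(-1).astype("int8")
print(compact.dtypes)
```

This only helps after the fact, though; the merge itself still materialises the wider dtypes in memory.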
Raf: Perhaps you could be more selective with what you load initially.