razou
@razou
Hello
I'm using PySpark in a Databricks notebook and facing an error from the dataframe.copy() method, and need some help to understand it:
df_c = df.copy()
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5811, in copy
    data = self._data.copy(deep=deep)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
Olga Matoula
@olgarithms

Hello everyone!

My name is Olga Matoula and I am a software engineer at Bloomberg London and a member of the Python community there. I recently attended a PyLadies workshop led by @MarcoGorelli guiding the participants towards their first contribution to pandas.

The week of 10-14 May Bloomberg is organizing "Guild Week", which is a week full of internal events hosted by the different communities for our Engineering department. On Friday the 7th of May there will be a Python hackathon and our idea is to have it open source focused. I want to encourage more engineers to attempt making a contribution, as well as other folks who used to contribute in sprints but have done less the past year due to cancelled conferences or other reasons. As I have found the pandas documentation very well written and easy to navigate, I have proposed the project as one of the few we will be focusing on. Other Bloomberg engineers have shared a lot of positive feedback about the project too.

It would be great if we could have the participation of a couple of the pandas maintainers during the event, to kick it off, navigate attendees through the process and suggest some issues to be worked on. It would be extremely valuable for everyone attending as problems get resolved more quickly and there's a general encouragement to have someone sharing their experience and feedback. Ideally we would need someone or some ones(!) to cover 1-2 hours each in two timezones (morning BST and EDT).

From our side, we can definitely see good participation, and even though there will be many first time contributors we can expect that good outcomes will come out of it and more advanced issues will be worked on.

Please, let me know your thoughts on this and if there are any questions. Would someone be able to help out on the day? We can always organize a chat to go through the idea in more detail and discuss further!

Olga

A.G.
@Divide-By-0
any way to do pandas map operations on a Series, but for methods on the values? i.e. I want to call .total_seconds() on each value of a pd Series
Xavier Olive
@xoolive
@Divide-By-0 how about df.durations.dt.total_seconds() ?
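For context, the .dt accessor is the general way to apply datetime/timedelta methods element-wise, with no explicit map needed. A minimal sketch with a made-up Series of timedeltas:

```python
import pandas as pd

# Hypothetical timedeltas: 90 and 30 minutes
durations = pd.Series(pd.to_timedelta([90, 30], unit="min"))

# The .dt accessor applies the timedelta method to each element
seconds = durations.dt.total_seconds()
print(seconds.tolist())  # [5400.0, 1800.0]
```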
Michael Li 🚀Publish Reproducible Jupyter Notebook
@tianhuil_twitter
Does anyone else find Matplotlib's API hard to remember? I have friends who just export their Pandas data and plot with Excel. I spend a lot of time googling Matplotlib help. How do you make your Python Data Plots?
3 replies
Sweta Rauniyar
@rauniyars
Hello, I am working with my team on Issue #39845. The issue consists of working on the documentation to locations: pandas.core.indexing.IndexingMixin.loc &
pandas.DataFrame.setitem . There is currently no API documentation at all for setitem, and we were wondering if there is a reason behind that? And does anyone have suggestions on fixing this issue?
boris
@pkarpesis:chat.avlikos.gr
[m]
looking at this: https://stackoverflow.com/a/44311454/3058542, I can see that I can change a value to zero if it is bigger than 20000. How can I set it to a different value if it's below 20000, though? Obviously using another .loc would be an option, but is this the preferred way? Iterating over the same data twice doesn't seem efficient 🙂
cloudy
@cloudy:cloud-oak.com
[m]
You could try doing it using np.where (as demonstrated in the third answer on the linked post, https://stackoverflow.com/a/48795837)
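For reference, np.where evaluates both branches in a single vectorised pass, so the data is only scanned once. A small sketch with a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "amount" stands in for the real column
df = pd.DataFrame({"amount": [5000, 25000, 15000, 30000]})

# One value where the condition holds, another everywhere else
df["amount"] = np.where(df["amount"] > 20000, 0, -1)
print(df["amount"].tolist())  # [-1, 0, -1, 0]
```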
boris
@pkarpesis:chat.avlikos.gr
[m]
thank you cloudy. I was curious why the extra import of numpy; isn't there a pandas solution for what I need? And if not, maybe I am not using the correct lib (pandas)?
I mean, without pandas I would iterate over the whole data set line by line
I am very new with pandas so bear with me 🙂
cloudy
@cloudy:cloud-oak.com
[m]
Pandas is using numpy in the background anyways, so I wouldn't worry too much about the additional import.
I believe that if you want to use only pandas methods, doing two .locs back-to-back is already pretty efficient
boris
@pkarpesis:chat.avlikos.gr
[m]
any other way besides pandas maybe cloudy ?
cloudy
@cloudy:cloud-oak.com
[m]
Depends largely on your use case of course. But for general data crunching like this, pandas+numpy is probably the best way to go 🙂
boris
@pkarpesis:chat.avlikos.gr
[m]
Well I need to iterate over all rows in a column and change data based on conditions
cloudy
@cloudy:cloud-oak.com
[m]
For some use cases, there are shortcut functions: e.g. if you want to replace values based on a dictionary, DataFrame.replace is your friend. For ranges of numerical data, pd.cut might be helpful.
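A quick sketch of both shortcuts, with hypothetical values:

```python
import pandas as pd

s = pd.Series([5, 15, 25, 35])

# Series.replace maps exact values via a dictionary
mapped = s.replace({5: "low", 35: "high"})

# pd.cut bins numeric ranges into labels ((0, 20] and (20, 40] here)
binned = pd.cut(s, bins=[0, 20, 40], labels=["small", "large"])
```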
cameron
@cameron:otsuka.haus
[m]
I'm attempting to calculate the best and worst rolling percentage gain across a timeseries dataset, given a 1-year window (and it must be 1 year). Is there a way to do this easily within pandas? The documentation for the rolling function only specifies a minimum number of observations, but I don't see a way to enforce that the window starts on a date 1 year back.
Krishna Chaitanya
@Chaitu17
Hello!
hasan-yaman
@hasan-yaman
Hello!
I have a question related to contributing to pandas.
I rebased the master branch locally, but now I can't import pandas. I get AttributeError: module 'pandas._libs.internals' has no attribute 'NumpyBlock'
4 replies
Brijesh Soni
@ibrijeshsoni

Hello @All

I am working on PDF reading, but not able to find the right package to use. I have previously used pdfplumber for extracting tables. For this task I have to extract the numeric bullet points, e.g.

2. Data
2.1 New Data
2.1.1 Child data
     conclusion: askjkajsjslajskjak
Andrei Berceanu
@berceanu
How can I interpolate a column of floats at integer positions?
Andrei Berceanu
@berceanu
    csv_df:              dN_over_dE
            E_MeV_float            
            71.750300       1.96984
            71.814646      -0.10848
            71.878948      -0.72212
            71.943210      -0.92436
            72.007436      -0.34520
E_MeV_float is my float64 index
And I'd like to have an integer index, 71, 72, .. with the values interpolated from the current df.
Andrei Berceanu
@berceanu
My current workaround is this
    new_index = np.linspace(
        from_energy, to_energy, to_energy - from_energy, endpoint=False
    )
    df_ = csv_df.reindex(new_index, method="nearest")
Which doesn't interpolate, but just takes the nearest value.
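One way to get actual interpolation rather than nearest values (a sketch, assuming the frame shown above): union the target integer positions into the index, interpolate by index value, then select only the targets. Note that interpolate does not extrapolate, so integer positions outside the original range (like 71 here) would stay NaN.

```python
import pandas as pd

csv_df = pd.DataFrame(
    {"dN_over_dE": [1.96984, -0.10848, -0.72212, -0.92436, -0.34520]},
    index=pd.Index(
        [71.750300, 71.814646, 71.878948, 71.943210, 72.007436],
        name="E_MeV_float",
    ),
)

# Target integer positions that fall inside the data range
targets = pd.Index([72.0], name="E_MeV_float")

interp = (
    csv_df.reindex(csv_df.index.union(targets))
    .interpolate(method="index")  # linear in the index values
    .loc[targets]
)
```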
Henry Goodwin
@ClariNerd617

Is there a way to use a with statement when, say, extracting data from a DataFrame that is instantiated using, say, S3?

E.g.

with pd.read_csv("s3://aws-gsod/isd-history.csv") as data:
    history = data[data["ICAO"].astype("str").str.match("KASH")][["WBAN", "USAF"]].to_dict()

kash_usaf, kash_wban = history.get("USAF"), history.get("WBAN")
valeriozhang
@valeriozhang
I'm using Postgres, and every time I use to_sql my dates become timestamps. How can I cast them as date only?
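to_sql has a dtype parameter for overriding the column type. A hedged sketch using an in-memory SQLite connection; with Postgres you would pass sqlalchemy.types.Date instead of the string, after dropping the time component with .dt.date:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"d": pd.to_datetime(["2021-05-01", "2021-05-02"])})
con = sqlite3.connect(":memory:")

# Drop the time component, then declare the column type explicitly
df.assign(d=df["d"].dt.date).to_sql("t", con, index=False, dtype={"d": "DATE"})
```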
Erfan Nariman
@erfannariman
What could be the reason tests are failing in CI but are passing locally? In #41022 the test pandas/tests/reductions/test_reductions.py::TestSeriesReductions::test_all_any_params fails, but on my local machine the test passes.
1 reply
Daniel Ratzlaff
@Initialwave
Anybody know of a python 3 community in here?
ldacey
@ldacey_gitlab
is astype("string") still experimental or is it okay to use?
I have been using the nullable integers for a long time already
also curious if anyone knows how <NA> or pd.NA is represented in pyarrow or a Parquet file - is it just null and takes up no space?
Sweta Rauniyar
@rauniyars
Working with my team on issue #41072. To resolve the issue, does anyone know what method we should be looking into to create the time stamp?
razou
@razou

hello
I wanted to convert a given column (containing some NaN values) into integer like this: df['x'].astype(int) and I got this error

  raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

Is there any workaround for this?
Thanks
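One workaround (a hedged sketch with a made-up column): the nullable integer dtype, spelled with a capital I, which represents missing values as pd.NA instead of raising:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# "Int64" (capital I) is the nullable integer dtype; astype(int) would raise here
df["x"] = df["x"].astype("Int64")
```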

Daniel Saxton
@dsaxton
@jorisvandenbossche re: https://github.com/pandas-dev/pandas/issues/40603#issuecomment-830860731 i think this workshop looks really interesting, any idea how to sign up (if there are still slots)? tried posting in the Jupyter discourse but hadn't gotten a response yet
5 replies
Michael Hsieh
@mdhsieh

Hi, I'm new to open source, working on #41421. I'm trying to create the development environment but have an error on the Windows build tools path setup.
Ran:
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" -vcvars_ver=14.16 10.0.17763.0

And error:

**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.8.9
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[ERROR:vcvars.bat] Toolset directory for version '14.16' was not found.
[ERROR:VsDevCmd.bat] *** VsDevCmd.bat encountered errors. Environment may be incomplete and/or incorrect. ***

How do I fix this?

4 replies
Noora Husseini
@nooraLeila
Hi there. I am trying to contribute to the pandas library. After cloning the pandas-dev repo, at the last step after I have activated the conda env, I try to import pandas and get an error: ImportError: C extension: 'lib' from 'pandas._libs' (/Users/noonie/Desktop/pandas-noor/pandas/_libs/__init__.py) not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first. I then try running that command and get another error: command '/usr/bin/clang' failed with exit code 1. I am a little stuck on what to do next.
11 replies
Michael Hsieh
@mdhsieh

Hi, I'm working on #41423 and would like to confirm my understanding of view vs. copy when creating a Series with copy=False.

For example, creating Series s with array r as data:

r = [1,2]
s = pd.Series(r)
s.iloc[0] = 999
print(r)
print(s)

r is unchanged since s is a copy of the original data?

But when creating Series s with numpy.array as data:

r = np.array([1,2])
s = pd.Series(r)
s.iloc[0] = 999
print(r)
print(s)

r is changed to [999, 2] since Series s is a view on original numpy array r?
s shares the data of r.

Appreciate any help.
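A direct way to check the view-vs-copy question above is np.shares_memory, which reports whether two arrays reference the same buffer. A sketch; note that pandas versions with copy-on-write enabled may copy ndarray input in the constructor, so the first result is version-dependent:

```python
import numpy as np
import pandas as pd

r = np.array([1, 2])
s = pd.Series(r)
# True when the Series is a view on r's buffer
# (the historical default; copy-on-write builds may copy instead)
print(np.shares_memory(r, s.to_numpy()))

r_list = [1, 2]
s_list = pd.Series(r_list)
# A Python list is always copied into a new array, so never shared
print(np.shares_memory(np.asarray(r_list), s_list.to_numpy()))
```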

1 reply
deepakdinesh1123
@deepakdinesh1123
I was working on issue #41485; this is my first time contributing to pandas. I added a new test. How can I run that particular test, or the entire test suite, without building it from scratch?
4 replies
Michael Hsieh
@mdhsieh
Hi, I'm making some suggested changes on #41423.
I'm only changing one file's comments, pandas/core/series.py,
but I have CI checks failing after I push my changes.
I don't know why, or how to make them pass?
2 replies
boris
@pkarpesis:chat.avlikos.gr
[m]
hello!
what is the difference between df = df.assign(new_col=another_df['name'].values) vs df = df.assign(new_col=another_df['name'])? It seems that when using .values there are times new_col = NaN. Can someone please explain?
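The difference is index alignment. Passing a Series aligns on the index, so labels missing from df become NaN; passing .values drops the index and assigns purely by position. A sketch with made-up frames:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
another_df = pd.DataFrame({"name": [10, 20]}, index=[1, 2])

# Series input aligns on index: label 0 has no match, so it becomes NaN
aligned = df.assign(new_col=another_df["name"])

# .values ignores the index entirely and assigns by position
positional = df.assign(new_col=another_df["name"].values)
```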
Michael Hsieh
@mdhsieh
I'm adding a test on #35603. My problem is that even when I remove all my changes, there's still 1 CI check failing after pushing.
Failing check here
Some errors in pandas/tests/arithmetic/test_datetime64.py?
2 replies
Balaji G
@BalajiG2000
Hi everyone. I need to understand how the object dtype and the DataFrame slicing feature have been implemented in pandas, specifically the logic. Is there any source for understanding how the object dtype or DataFrame slicing is implemented internally? I understand reading the source code on GitHub could help, but are there any other suggestions?
mocquin
@mocquin
Hello there. I would like to know if there is a way for pandas to automatically detect a custom dtype that I previously created using the pandas extension interface, subclassing pandas.core.dtypes.base.ExtensionDtype (I also created the corresponding ExtensionArray). For now pd.Series(my_custom_obj, dtype=MyCustomPdDtype) works, but I would like pd.Series(my_custom_obj), which I find simpler, given that I know the dtype of my object.
mocquin
@mocquin
As a comparison, Categorical objects "suffer" from the same behavior: s = pd.Series(["a", "b", "c", "a"], dtype="category") will indeed cast the values to a Categorical series, but pd.Series(["a", "b", "c", "a"]) will use object as the dtype. I would like to be able to tell pandas how to behave when the passed object is of type MyCustomClass, kinda like a lookup table: ThisObject -> ThisDtype, ThatObject -> ThatDtype...
mocquin
@mocquin
I created a FR on GH : pandas-dev/pandas#41848
Michael Waskom
@mwaskom
Hi, are there any guidelines for writing type hints for code that interfaces with pandas? There are some third-party stub packages out there but they seem rather incomplete / abandoned.
4 replies