These are chat archives for data-8/datascience

2nd
Dec 2015
Carl Boettiger
@cboettig
Dec 02 2015 00:41
Curious if any connectors are touching on sql databases. Just need some very simple imports from postgres, doesn't look like psycopg2 is available.
Guess I could just export the data to csv for them, but torn between wanting to give a light exposure to databases vs just streamlining things
Stefan van der Walt
@stefanv
Dec 02 2015 00:44
@cboettig Is postgres the system you have to use? Because Python has built in sqlite, if you can port the DB to that.
Carl Boettiger
@cboettig
Dec 02 2015 00:48
yeah, in this particular case data is already in postgres, though pedagogically maybe sqlite makes more sense then
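A minimal sketch of the sqlite route mentioned above, using only the `sqlite3` module from Python's standard library. The table name `stocks`, its columns, and the sample rows are hypothetical and chosen only for illustration; nothing here reflects the actual postgres schema.

```python
import sqlite3

# An in-memory database stands in for a ported copy of the real data.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, purely for illustration.
conn.execute("CREATE TABLE stocks (assessid TEXT, year INTEGER, ssb REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [
        ("stock-A", 2000, 100.0),
        ("stock-A", 2001, 40.0),
        ("stock-A", 2002, 5.0),
    ],
)
conn.commit()

# A parameterized query: the ? placeholder is filled from the tuple.
cursor = conn.execute(
    "SELECT assessid, year, ssb FROM stocks WHERE ssb > ? ORDER BY year",
    (10.0,),
)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(column_names)
print(rows)
```

The rows come back as plain tuples, so they could be handed to whatever table structure the class already uses.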
tap2k
@tap2k
Dec 02 2015 01:55
ok got it - chalk it up to the stupid question dept
henryem
@henryem
Dec 02 2015 05:08
@cboettig Tables support a lot of SQL-like things (join, group by, where) that they learn about in the first month or so of the base class. Could be easier to introduce database operations that way, without a new language. (Sorry if you're already aware of that!)
Carl Boettiger
@cboettig
Dec 02 2015 05:29
@henryem right, yes, and I'd probably stick with doing most of these manipulations in tables (though I'm still learning those myself!). More just wondering about it as a data ingest step -- I often see students struggle with importing data from databases even when they are already well equipped to manipulate the data within a given framework once the data are imported. Teaching all that is of course beyond the scope of what I can get into a connector, but just wondering if it's worth giving some glimpse of data read/parse command that isn't csv. More a pedagogy issue than a technical one I suppose, and I'm still on the fence.
henryem
@henryem
Dec 02 2015 05:33
Ah, I see. That sounds cool.
Carl Boettiger
@cboettig
Dec 02 2015 05:50
Hmm, struggling to figure out the best python way to do an operation that is pretty simple in R's dplyr.... I have a table in which one column is a grouping factor, so for each group I want to apply a summary function. Here's my R version: https://gist.github.com/cboettig/7ce0f311daa428b023f9
henryem
@henryem
Dec 02 2015 06:07
I'm not 100% familiar with the dplyr syntax, but I think you would say:
values.select(['assessid', 'ssb']).group('assessid', collapsed)
where collapsed is
def collapsed(an_array):
    return an_array[-1] < 0.1*max(an_array)
the main difference, as far as I can tell, is that the tables group() will apply the summary function to every column, whereas group_by lets you apply it only to some columns
though I'm not sure what happens to the columns that are not summarized
in dplyr's group_by, I mean
anyway, the .select(['assessid', 'ssb']) pares down the columns to just the grouping factor and the column you wanted to summarize
if you want to summarize several columns in different ways (or the same column in multiple ways) it takes several steps
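To make the group-then-summarize idea concrete, here is a rough sketch in plain Python rather than the datascience Table API. The helper `group_apply`, the group labels, and the numbers are all made up for illustration; only the `collapsed` function comes from the discussion above.

```python
from collections import defaultdict


def group_apply(rows, key, value, summarize):
    """Collect `value` per distinct `key`, then summarize each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return {label: summarize(values) for label, values in grouped.items()}


def collapsed(values):
    # True when the last value is below 10% of the group's maximum.
    return values[-1] < 0.1 * max(values)


# Hypothetical data: two groups, already in time order.
rows = [
    {"assessid": "stock-A", "ssb": 100.0},
    {"assessid": "stock-A", "ssb": 40.0},
    {"assessid": "stock-A", "ssb": 5.0},
    {"assessid": "stock-B", "ssb": 20.0},
    {"assessid": "stock-B", "ssb": 18.0},
]

result = group_apply(rows, "assessid", "ssb", collapsed)
print(result)
```

Here stock-A ends at 5.0 against a maximum of 100.0, so it counts as collapsed, while stock-B does not.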
Carl Boettiger
@cboettig
Dec 02 2015 17:08
@henryem Thanks! That looks very promising. However, I'm a bit puzzled why I get different results in R vs python now! How is max handling the nan values in python?
henryem
@henryem
Dec 02 2015 17:15
Ah, it propagates them, so it will return nan. Looks like there are two options: for max in particular there is nanmax, which ignores nans. In general you could use np.ma.masked_array(my_array, np.isnan(my_array)) to get a view of my_array that doesn't include the nans, and then do whatever computation you wanted on that view.
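A small sketch of the two options just described, assuming NumPy is installed and the column is a float array. The values in `ssb` are invented for illustration.

```python
import numpy as np

# Hypothetical column with one missing value.
ssb = np.array([4.0, np.nan, 10.0, 0.5])

# np.max propagates nan.
plain_max = np.max(ssb)

# np.nanmax skips nan entries.
nan_max = np.nanmax(ssb)

# A masked array hides the entries where the mask is True.
masked = np.ma.masked_array(ssb, mask=np.isnan(ssb))
masked_max = masked.max()

print(plain_max, nan_max, masked_max)
```

Note that `np.nanmax` emits a RuntimeWarning and returns nan when every entry is nan, which matters if a whole group is missing.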
Carl Boettiger
@cboettig
Dec 02 2015 17:24
thanks, that sounds handy. Curiously I don't get any nans in the output from the original python version, but I get a different set of True/False values in the new column...
Carl Boettiger
@cboettig
Dec 02 2015 19:10
hmm, looks like I just get an error on calling np.ma.masked_array on a datascience Table object
Stefan van der Walt
@stefanv
Dec 02 2015 19:10
I’d steer clear of masked arrays unless you really need them. It’s another layer of complexity on an already complex operation.
Yes, that almost certainly won’t work. NumPy does not know anything about Tables.
Carl Boettiger
@cboettig
Dec 02 2015 19:13
right, okay, will avoid that. Meanwhile still puzzled by the handling of nas and the different results between R and python here.
e.g. starting from the gist, https://gist.github.com/cboettig/7ce0f311daa428b023f9 , I see the group
x = values.select(["assessid", "ssb"]).where("assessid", "AFSC-BKINGCRABPI-1960-2008-JENSEN")
collapsed(x["ssb"])
returns False
note that x has nan values, so I'd have expected it to return nan. And in R, when dropping nans, it returns true.
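One possible explanation for seeing False rather than nan, sketched with made-up numbers rather than the gist's data: in Python, any ordered comparison involving nan evaluates to False, and the built-in `max` on a list depends on where the nan sits. The nan-dropping variant of `collapsed` below is an assumption about the intended behaviour, not a copy of the R code.

```python
import math

nan = math.nan

# Any ordered comparison involving nan is False, never nan.
last_vs_nan = 0.5 < 0.1 * nan

# The built-in max keeps its first item unless a later one compares
# greater, so the result depends on the position of the nan.
max_nan_first = max([nan, 3.0, 10.0])
max_nan_later = max([3.0, nan, 10.0])


def collapsed(values):
    """Is the last non-nan value below 10% of the non-nan maximum?

    Dropping nans before picking the last value is an assumption made
    for this sketch; the chat only says the maximum should ignore nans.
    """
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        return None  # nothing to compare in an all-nan group
    return kept[-1] < 0.1 * max(kept)


print(last_vs_nan, max_nan_first, max_nan_later)
print(collapsed([4.0, nan, 10.0, 0.5]))
```

Returning `None` for an all-nan group is one design choice among several; raising an error or returning nan would be equally defensible.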
Stefan van der Walt
@stefanv
Dec 02 2015 19:15
Let me install R quickly and take a look at what you’re expecting
What is “collapsed” supposed to do? Check whether the last element is smaller than 0.1 * max of the array, ignoring nans?
Carl Boettiger
@cboettig
Dec 02 2015 19:18
yup
Stefan van der Walt
@stefanv
Dec 02 2015 19:19
Try replacing max(an_array) with np.nanmax(an_array)
Carl Boettiger
@cboettig
Dec 02 2015 19:20
throws error
Stefan van der Walt
@stefanv
Dec 02 2015 19:20
Can you show me the error?
(think the error shows up there, from In [14])
Stefan van der Walt
@stefanv
Dec 02 2015 19:23
Hah, I did not expect an_array to be “a_list” :)
I’ll take a quick look at what’s happening underneath the hood
Carl Boettiger
@cboettig
Dec 02 2015 19:25
yeah, guess columns in Tables are list objects? I'm still a bit foggy on the difference between a list and an array. is an array a numpy object? for doubles only?
and thanks much for the help!
Stefan van der Walt
@stefanv
Dec 02 2015 19:25
I’ve just tried it with the latest version of datascience and it seems to work OK for me
Are you using the same dataset as in the gist?
Carl Boettiger
@cboettig
Dec 02 2015 19:26
yup
Stefan van der Walt
@stefanv
Dec 02 2015 19:27
So, it looks like there’s at least one column with all NaNs
Carl Boettiger
@cboettig
Dec 02 2015 19:27
is the latest version what is on ds8.berkeley.edu? I could switch to that; I'm running Jupyter from the jupyter/datascience-notebook docker image, just did a pip install datascience.... not quite sure how to check my module version info
Stefan van der Walt
@stefanv
Dec 02 2015 19:27
Ah, I’m running the latest dev version, 0.3.dev21
You can check the version with:
import datascience as ds
ds.__version__
At least, you can do that in the latest version ;)
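For older releases that lack a `__version__` attribute, one alternative is to ask the packaging metadata instead. This is a sketch using `importlib.metadata` from the standard library (Python 3.8 and later, so newer than this conversation); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("datascience"))
```

Because it reads the installed distribution's metadata, this does not need to import the package itself.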
Carl Boettiger
@cboettig
Dec 02 2015 19:28
errors for me. guess that confirms I have an earlier version
Stefan van der Walt
@stefanv
Dec 02 2015 19:29
Also, in this version “an_array” is an array
Carl Boettiger
@cboettig
Dec 02 2015 19:29
that's good
Stefan van der Walt
@stefanv
Dec 02 2015 19:29
You should be able to pip install the latest version directly from git, but I guess it is important for you that your students can make it work, and I presume they’ll be running the latest released version.
Perhaps this is a good time to release a new version; there have been quite a few bug fixes etc.
Carl Boettiger
@cboettig
Dec 02 2015 19:30
looks like the berkeley server isn't up to date either, though I'm sure it will be by spring.
hmm, how do I tell pip to install from git?
Stefan van der Walt
@stefanv
Dec 02 2015 19:30
Let me find the magic incantation
pip install git+git://github.com/dsten/datascience.git
Untested
Carl Boettiger
@cboettig
Dec 02 2015 19:32
:thumbsup: seems to be working
looks like I'm missing some dependencies. apt-get time
Stefan van der Walt
@stefanv
Dec 02 2015 19:33
You’ll need:
folium
sphinx
numpy
scipy
matplotlib
pandas
IPython
I’d apt-get numpy, scipy, matplotlib, and pandas, and pip install the rest
IPython should be replaced by jupyter
Unless you are running anaconda, in which case you should prefer conda install package_name
Carl Boettiger
@cboettig
Dec 02 2015 19:36
Stefan van der Walt
@stefanv
Dec 02 2015 19:36
@cboettig I’m trying to test your R snippet. How do I install dplyr?
Carl Boettiger
@cboettig
Dec 02 2015 19:36
install.packages("dplyr")
(From within R)
Stefan van der Walt
@stefanv
Dec 02 2015 19:37
In that case, you want to do conda install numpy scipy matplotlib pandas jupyter
and sphinx
and then pip install folium
Carl Boettiger
@cboettig
Dec 02 2015 19:45
hmm, conda installs worked, installing from git still failing I think
complaining it cannot find blas libraries, which seems unlikely...
Stefan van der Walt
@stefanv
Dec 02 2015 19:54
Sorry, can you paste the error message here?
Carl Boettiger
@cboettig
Dec 02 2015 19:56
@stefanv are you in BIDS? maybe I can swing by
Stefan van der Walt
@stefanv
Dec 02 2015 19:56
Sure, that’d be good!