These are chat archives for FreeCodeCamp/DataScience

31st
Dec 2017
Eric Leung
@erictleung
Dec 31 2017 09:21
Interesting take on the containerization technology company Docker Inc (the company itself) being "dead". The article praises the software but criticizes the management. I did get wind of the s/Docker/moby/ change, but didn't think much of it. This article puts it into context, though. Simply put, Silicon Valley drama :sweat_smile:
evaristoc
@evaristoc
Dec 31 2017 10:13

@erictleung interesting... I was not expecting Docker to fail...

@GoldbergData it seems simple, but each chart took me a good number of hours to prepare. Not to discourage you! It's just that I am not as good at JS yet. I am improving, though. Last year a simple chart took me a week to make; this time those two charts took me less than a week, so I am getting better.
However, @GoldbergData, my scripts were very much sketched and still messy. Looking forward to the time when I can code cleanly in one go.

By the way: checking the charts I noticed I made an error in my Python code (!!). My beeswarm chart was apparently wrong. The corrected chart looks a bit different.

evaristoc
@evaristoc
Dec 31 2017 11:03

Oops... I think I found an error in my analyses that might explain a trend that wasn't convincing me much... If so, it will mean lots of work to correct it: I'll have to run the whole project again from beginning to end. It runs very slowly. With 4 GB of data, my computer doesn't have enough capacity to handle it all in memory in one go, and to recover memory I have to quit Python every time it finishes with one segment of the datasets.
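(For anyone curious, a minimal sketch of the kind of segment-by-segment workflow I mean, so you don't have to quit Python between segments — the filename, column names, and chunk size here are hypothetical examples, not my actual project:)

```python
import gc
import pandas as pd

def process_in_chunks(path, chunksize=100_000):
    """Aggregate a large CSV without ever holding the whole file in memory."""
    totals = {}
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # do the per-segment work on a small DataFrame
        for key, value in chunk.groupby("category")["value"].sum().items():
            totals[key] = totals.get(key, 0) + value
        # drop the chunk and ask Python to reclaim memory before the next one
        del chunk
        gc.collect()
    return totals
```

Each chunk is a small DataFrame, so peak memory stays roughly at `chunksize` rows instead of the whole 4 GB.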

People

I am not sure about your experiences with big files, but if you want to go into data analysis, and especially big data, be aware of this:

Debugging big datasets is HARD. Probably harder than you would like. It is easy to work with example files that someone else has already cleaned, but at work you may well have to deal with uncleaned data. Big data files with a lot of variables are not easy to debug. It is very easy to miss things, because not all possible errors show up in a sample file.

If you have millions and millions of records with thousands of variables, there is no PySpark or Scala or anything else that will solve that for you. A job might run for days before you find you missed something, meaning you have to run the code AGAIN. And I can tell you that might happen MORE THAN ONCE with the same dataset.
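(One defensive habit that helps here, sketched very roughly: validate every chunk BEFORE the expensive computation, so a bad record fails in seconds instead of after days. The column names and rules below are hypothetical examples:)

```python
import pandas as pd

def validate_chunk(chunk: pd.DataFrame) -> None:
    """Fail fast on records a clean sample file would never have shown you."""
    assert chunk["value"].notna().all(), "unexpected missing values"
    assert (chunk["value"] >= 0).all(), "negative values found"
    assert chunk["category"].isin({"a", "b", "c"}).all(), "unknown category"
```

Run it on each chunk before aggregating; a surprise in record fifty million then stops the job immediately with a message, instead of silently corrupting the final result.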

In fact, it can happen that you never spot the error and end up arriving at wrong conclusions (as I maybe did...).

So this is one of the not-so-nice parts of data analysis and data science. I personally hate this part, to be honest, but it is necessary if you want to deliver the most faithful results possible.

evaristoc
@evaristoc
Dec 31 2017 11:30

@erictleung already mentioned something about "being skeptical about the data being of good quality" :point_up: .

Well, based on my experience, I would add: be skeptical about analyses of the data, including YOURS. Unless you are OK with fake news. Believe it or not, there are A LOT of serious organisations, and even scientific publications, that have ended up supporting incorrect results, sometimes by omission.

Science, people, is NOT immune to subjectivity. It is not totally impartial. Sometimes it is possible to convince "experts" just because you reported something they already believe. This is even easier with non-scientific audiences, which are more prone to reject results opposed to what they think is true.

What science actually does is find the most objective result it can through replication. If results cannot be replicated, they shouldn't be accepted, even if in practice they are true. When it comes to the social sciences, hypothesis tests are even harder, not to say impossible.

But ideally, if you have some results on a dataset, you cannot fully confirm you are right until several people can replicate them.

If I am usually skeptical, this is why. If you have ever read my responses and comments, I usually try not to say "I know". I usually write "maybe", "I am not sure", "it might". Because I don't believe me...

Josh Goldberg
@GoldbergData
Dec 31 2017 17:05
@evaristoc great share man, thank you for putting in the time. So you can mix Python code with JS to make the chart? Or did you use Python for the data wrangling?
CamperBot
@camperbot
Dec 31 2017 17:05
goldbergdata sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 393 | @evaristoc |http://www.freecodecamp.org/evaristoc