These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
s/Docker/moby/ change, but didn't think much of it. But this article puts it into context. Simply put, Silicon Valley drama :sweat_smile:
@erictleung interesting... I was not expecting Docker to fail...
@GoldbergData it seems simple, but each chart took me a good number of hours to prepare. Not to discourage you! It is just that I am not that good at JS yet. I am improving, though: last year a simple chart took me a week to make, and this time those two charts took me less than a week, so I am doing better.
However, @GoldbergData, my scripts were very much sketches and still messy. Looking forward to the time when I can code cleanly in one go.
By the way: while checking the charts I noticed I had made an error in my Python code (!!). My beeswarm chart was apparently wrong; the corrected chart looks a bit different.
Oops... I think I found an error in my analyses that might explain a trend that was not convincing me much... If so, it will mean lots of work to correct it, having to run the whole project again from beginning to end. It runs very slowly: with 4 GB of data, my computer doesn't have enough memory to handle it all in one go, and to recover memory I have to quit Python every time it finishes with one segment of the datasets.
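To give an idea of what I mean by working in segments (just a sketch, with a made-up file and made-up column names, not my actual project code), you can stream a big CSV in chunks with pandas so only a small running aggregate ever lives in memory:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a multi-GB file (hypothetical data).
csv_data = io.StringIO(
    "category,value\n"
    "a,1\na,2\nb,3\na,4\nb,5\nc,6\n"
)

# Read in chunks so only `chunksize` rows are in memory at a time;
# between chunks we keep only the small aggregate, not the raw rows.
totals = {}
for chunk in pd.read_csv(csv_data, chunksize=2):
    for key, count in chunk["category"].value_counts().items():
        totals[key] = totals.get(key, 0) + int(count)

print(totals)  # {'a': 3, 'b': 2, 'c': 1}
```

In real use you would point `pd.read_csv` at the file path and use a much bigger `chunksize`; the point is that you never have to quit Python to reclaim memory, because the raw chunk is discarded on every loop iteration.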
I am not sure about your experience with big files, but if you want to go into data analysis, and especially big data, be aware of this:
Debugging big datasets is HARD. Probably harder than you would like. It is easy to work with example files when almost everything has been cleaned by someone else, but at work you may well have to deal with uncleaned data. Big data files with a lot of variables are not that easy to debug. It is very easy to miss things, because not all the possible errors show up in a sample file.
If you have millions and millions of records with thousands of variables, there is no PySpark or Scala or anything that will solve that for you - the job might run for days before you find you missed something, meaning you have to run the code AGAIN. And I can tell you that can happen MORE THAN ONCE with the same dataset.
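One habit that has saved me some reruns (again just a sketch, the column names and rules here are invented for illustration): run cheap sanity checks on every segment BEFORE the expensive analysis, so bad records fail fast instead of surfacing after days of compute:

```python
import pandas as pd

def validate(df):
    """Cheap sanity checks to run on each segment before the
    expensive analysis; returns a list of problems found."""
    problems = []
    if df["age"].isna().any():
        problems.append("missing ages")
    if (df["age"] < 0).any():
        problems.append("negative ages")
    if df["user_id"].duplicated().any():
        problems.append("duplicate user ids")
    return problems

# Hypothetical segment with three data-quality issues planted in it.
segment = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "age": [25, -3, 40, None],
})

print(validate(segment))
# ['missing ages', 'negative ages', 'duplicate user ids']
```

It won't catch everything (that is exactly the problem with sample files), but every check you add is one class of error you will never again discover on day three of a run.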
In fact, it can be that you never spot the error and end up arriving at wrong conclusions (as I maybe did...).
So this is one of the not-so-nice parts of data analysis and data science. I personally hate this part, to be honest, but it is necessary if you want to deliver the most faithful results possible.
@erictleung already mentioned something about "being skeptical about the data being of good quality" :point_up: .
Well, based on my experience, I would add: be skeptical about analyses of the data, including YOURS. Unless you are OK with fake news. Believe it or not, there are A LOT of serious organisations and even scientific publications that have ended up supporting incorrect results, sometimes by omission.
Science, people, is NOT immune to subjectivity. It is not totally impartial. Sometimes it is possible to convince "experts" just because you reported something they already believe. This is even easier with non-scientific audiences, which are more prone to reject results opposed to what they think is true.
What science actually does is find the best objective result through replication. If results cannot be replicated, they shouldn't be accepted, even if in practice they are true. When it comes to the social sciences, hypothesis tests are even harder, not to say impossible.
But ideally, if you have some results from a dataset, you cannot fully confirm you are right until several people can replicate the result.
If I am usually skeptical, this is why. If you have ever read my responses and comments, I usually try not to say "I know". I usually write "maybe", "I am not sure", "it might". Because I don't even believe myself...
goldbergdata sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles: