These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
I thought the author was a little too “I figured this out”-ish. I do see where he is coming from, as I work in the field and deal with some of the issues he’s brought up, especially reproducibility. But I’m not sure I’m on board with “software developers can do anything, including data science. They’re just not interested.” That statement strikes me as careless, and it makes it seem as if true foundational knowledge and expertise in statistics is not necessary. I’ve been to talks where the presenter worked for a data science consulting firm. He had no clue how the algorithms worked (only the arguments specified in the documentation). He couldn’t answer moderate-depth questions that seemed reasonable and that he should know, especially when he’s charging for the models he’s deploying.
I personally know a software developer who’s looking to get into data science. And I wouldn’t say he’s “one online class away” from displacing seasoned statisticians with less coding experience. While functions and packages are built out in popular software domains to execute these tasks with little effort, it is quite dangerous to treat these tools as black boxes or to understand them only thinly. I also find the shotgun approach to model selection troubling — trying everything and hoping something sticks — since you may not know why one model outperforms another.
I will note that the best practices of software development (all the things mentioned in the article) are in the process of being carried over to data science. So I don’t think it’s necessary to become a junior developer first to gain the skills needed to be a modern data scientist. The author kind of mentioned this, but a little too softly, I felt. Anyway, it’s a good dialogue, as I don’t think it’s wise to be at either extreme (only software with vague stats, or only statistics with vague software); somewhere in the middle is probably best here.
timjavins sends brownie points to @becausealice2 :sparkles: :thumbsup: :sparkles:
timjavins sends brownie points to @goldbergdata and @mcbarlowe :sparkles: :thumbsup: :sparkles:
becausealice2 sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
timjavins sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
goldbergdata sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
goldbergdata sends brownie points to @timjavins :sparkles: :thumbsup: :sparkles:
quincylarson sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
quincylarson sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
@erictleung I'm personally fond of:
What do you call a statistician who lives in San Francisco?
A Data Scientist
erictleung sends brownie points to @quincylarson :sparkles: :thumbsup: :sparkles:
@erictleung @becausealice2 @QuincyLarson and rest. My opinion about Data Scientists?
I try to put attention to the term "science".
I already mentioned here some time ago what I heard from the Data Science Leader of Amazon Europe when asked a similar question by the audience. Instead of explaining, he posed a real problem Amazon was facing and asked the audience to think of a solution. The problem was all the more interesting because, at the time they had to solve it, no data was available.
He asked people to keep their answers to themselves (we were about 200), continued his talk, and came back to the answer later in the talk.
Then he gave the solution to the problem: they had to design an experiment to collect data. The purpose was to formulate and test a model. Details aside, some people were surprised.
He mentioned that many applicants are mainly trained in applying "recipes" to those problems but are unable to think "outside the box".
If you ask what my solution was: I at least passed the test of realizing an experiment was required :). I didn't get the exact details of the experiment right, though, sorry.
The conclusion of the talk was that to fulfill the role of a data scientist, ideally you should be able to "design" experiments that can extract valid information from large amounts of data, and even to produce relevant data when it doesn't exist.
In a simple, common scenario, those experiments could be for example model comparisons, and for bringing relevant data where it doesn't exist, feature selection and feature engineering.
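To make the "experiment as model comparison" idea concrete, here's a toy sketch of my own (not from the talk, and all names are made up): k-fold cross-validation comparing a mean baseline against a hand-rolled least-squares line on synthetic data, using only the Python standard library.

```python
import random

random.seed(42)
# Synthetic data: y is roughly linear in x, plus Gaussian noise.
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

def mean_model(train_x, train_y):
    # Baseline: always predict the training mean.
    m = sum(train_y) / len(train_y)
    return lambda x: m

def linear_model(train_x, train_y):
    # Ordinary least squares for y = a*x + b, computed by hand.
    n = len(train_x)
    mx = sum(train_x) / n
    my = sum(train_y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    var = sum((x - mx) ** 2 for x in train_x)
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

def cv_mse(fit, xs, ys, k=5):
    # k-fold cross-validation: average held-out mean squared error.
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        tr_x = [xs[i] for i in idx if i not in held_out]
        tr_y = [ys[i] for i in idx if i not in held_out]
        model = fit(tr_x, tr_y)
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
    return total / k

baseline_err = cv_mse(mean_model, xs, ys)
linear_err = cv_mse(linear_model, xs, ys)
print(f"baseline CV MSE: {baseline_err:.3f}")
print(f"linear   CV MSE: {linear_err:.3f}")
```

The point is that the held-out error, not the training fit, is the designed "experiment" that tells you one model is actually better than the other.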
Of course, you should be able to discard results that make no sense, e.g. spurious correlations (some funny examples here).
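A quick illustration of my own (not from the talk) of why spurious correlations appear in the first place: if you screen enough unrelated random series against a short target, one will correlate strongly purely by chance. Standard library only; the numbers are synthetic.

```python
import random

def pearson(a, b):
    # Pearson correlation coefficient, computed by hand.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

random.seed(7)
# A short target series, like ten years of some annual statistic.
target = [random.random() for _ in range(10)]

# 1000 unrelated random "variables": the best match looks impressive
# even though every series is pure noise.
candidates = [[random.random() for _ in range(10)] for _ in range(1000)]
best = max(abs(pearson(c, target)) for c in candidates)
print(f"best |r| among 1000 random series: {best:.2f}")
```

This is exactly the multiple-comparisons trap: with enough candidate variables and few observations, a high correlation alone tells you nothing.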
In practice, though, how much of this ideal data science gets done is probably decided by the practitioner's knowledge, business goals and budget, and whether you buy or hire. IMO those are the key factors determining the spectrum (@erictleung's words). I can certainly confirm the existence of that spectrum based on what I have seen and heard.