These are chat archives for FreeCodeCamp/DataScience

17th
Mar 2016
evaristoc
@evaristoc
Mar 17 2016 21:59

Hi People,

Just came back from a meetup about Data Science. Ran into some old friends...
http://www.meetup.com/Digital-Analytics/events/229406157/

Representatives of important companies like booking.com (online hotel bookings, probably the biggest nowadays) and Transavia (airline, Europe) gave talks, as did a CQM data scientist. The host company was Travix (a European sister company of Travelocity, from Expedia).

I summarise:

  • Corporate, as it should be in these cases: not much information was disclosed
  • Judging only by the presentations, you could claim that the Coursera course would be enough to get into the business... It looked like there is some standardisation and model simplicity (= cost reduction)
  • Guess what? Natural Language Processing is gaining momentum again. The available tools are also more advanced and mature than before. Anyway: we (actually I) have been doing some NLP here (yea!!!...?). In particular, user reviews (= user experience) and ratings were very important for them.
  • Those of you expecting to see a lot of supervised machine learning: you would be surprised how often a simple k-means clustering is preferred...
  • For those who have worked on recommender systems: you would be surprised how hard a non-personalised recommender is to beat...
  • The difference from what we have? The amount of data they use (Hugeeee!) AND the need to run a project in batch mode => non-stop. Our projects are occasional demos with not much data.
    • There is a myth-buster: they don't get every detail about users (user privacy policy; scope of online data; etc.) and have hugeeee sparse matrices (a lot of cells with no data to fill in).
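
To make the recommender and sparse-matrix points concrete, here is a minimal sketch in Python. It is not what any of the speakers showed: the ratings, the hotel names and the `damping` parameter are all made up for illustration. It stores only the observed cells of a user-by-item matrix and ranks items by a damped mean rating, so every user gets the same list.

```python
from collections import defaultdict

# Hypothetical sparse ratings: only observed (user, item) pairs are stored,
# so the empty cells of the user-by-item matrix take no memory.
ratings = {
    ("u1", "hotel_a"): 5.0,
    ("u2", "hotel_a"): 4.0,
    ("u3", "hotel_a"): 5.0,
    ("u1", "hotel_b"): 5.0,
    ("u2", "hotel_c"): 2.0,
    ("u3", "hotel_c"): 3.0,
}


def top_items(ratings, k=2, damping=2.0):
    """Rank items by a damped mean rating; the result is the same for every user."""
    global_mean = sum(ratings.values()) / len(ratings)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (_user, item), value in ratings.items():
        totals[item] += value
        counts[item] += 1
    # The damping term pulls items with few ratings towards the global mean
    # (an assumption of this sketch, not something from the talks).
    scores = {
        item: (totals[item] + damping * global_mean) / (counts[item] + damping)
        for item in totals
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def density(ratings):
    """Fraction of user-by-item cells that actually hold a rating."""
    users = {user for user, _ in ratings}
    items = {item for _, item in ratings}
    return len(ratings) / (len(users) * len(items))


print(top_items(ratings))
print(density(ratings))
```

Even in this toy data a third of the cells are empty; with real traffic the filled fraction would be far smaller.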

About standardisation and model simplicity:

I must admit that I was surprised by the common use of k-means clustering or classification trees, though. I guess either there were differences in some details that were not mentioned or, simply put, the simple models were much better than their more expensive counterparts...
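
For anyone who has not met k-means yet, a bare-bones sketch of the usual Lloyd iteration in plain Python (needs Python 3.8+ for `math.dist`). The points are invented, and using the first `k` points as starting centroids is a simplification of this sketch; real libraries pick starting points more carefully.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Made-up 2D points forming two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Each pass costs roughly (number of points) x k distance computations, which is part of why it stays cheap on large data.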

Another driver for preferring the simplest possible model could have been cost. When I was practising on Kaggle I was actually discussing one project based on that premise: more precise algorithms require A LOT more hardware (remember: I said before that solving Data Mining problems is analogous to solving an NP-hard problem using approximation algorithms!!!). So a company that has to use the results of any Data Mining / Machine Learning implementation would have to invest a lot more in hardware to really go for precision. That precision comes with a cost that MUST be justified, not only in money, but culturally (for those who still don't know: organisations have CULTURES).

evaristoc
@evaristoc
Mar 17 2016 22:12

Something else:

In the next few days I will prepare a page to show the projects of the data science room... again: if you are interested please come and share


Finally:

@QuincyLarson is sharing a tweet about the latest Stack Overflow report. Something that may concern us more directly (apart from the "still looking for jobs..."):

techs_for_data.png

In general JS is everywhere, followed by Python. Check also the current salary expectations per technology...

Alice Jiang
@becausealice2
Mar 17 2016 22:37
It's worth mentioning that the same report listed the most common tech stacks per occupation, and the languages most frequently included for data science were Python, R, SQL, Java, and JavaScript, with one (random) appearance of Hadoop
evaristoc
@evaristoc
Mar 17 2016 23:45

I was going to bed when I found an article that I still want to share with you, but first...

@alicejiang1 perhaps... Spark, together with Cassandra, Kafka, and other Apache projects, has been taking over. Anyway, in general there are interesting trends but the report should be taken cautiously. Example: there shouldn't be many questions on Stack Overflow about technology used specifically in niche sectors like Big Data or parallel computation. There also shouldn't be much about closed-source software. By the way: a nice link about distributed parallel computing: https://computing.llnl.gov/tutorials/parallel_comp/


And now the link I found on Medium, for developers, about the rise of technologies around Natural Language Processing and AI:
https://medium.com/swlh/the-future-of-conversational-ui-belongs-to-hybrid-interfaces-8a228de0bdb5#.3u0qj8lxw