These are chat archives for FreeCodeCamp/DataScience

Jan 2017
Xavier Sumba
Jan 14 2017 03:03 UTC
Hello, I'm working on finding the relatedness of words. The problem is the following: given 2 groups of words, I need an output that is some kind of confidence score indicating whether or not there is a relationship between g1 and g2.
Does anyone know a way to do that, or where I can start? I was thinking of word2vec.
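A minimal sketch of one possible starting point for the question above: average the word vectors of each group and take the cosine similarity of the two averages. The vectors in `TOY_VECTORS` are invented for illustration only, and the names `mean_vector`, `cosine` and `group_similarity` are hypothetical helpers; real vectors would come from a trained model such as word2vec. The result is a similarity in [-1, 1], not a calibrated confidence.

```python
import math

# Toy 3-d vectors, made up purely for illustration. Real vectors would
# come from a trained embedding model (e.g. word2vec).
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.8],
}


def mean_vector(words, vectors):
    """Average the vectors of the words that exist in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words are in the vocabulary")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_similarity(g1, g2, vectors):
    """Cosine similarity between the averaged vectors of two word groups."""
    return cosine(mean_vector(g1, vectors), mean_vector(g2, vectors))


if __name__ == "__main__":
    print(group_similarity(["cat", "dog"], ["car", "truck"], TOY_VECTORS))
    print(group_similarity(["cat"], ["dog"], TOY_VECTORS))
```

With these toy vectors the animal words score much closer to each other than to the vehicle words. How meaningful the score is depends heavily on the corpus the vectors were trained on.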
Alice Jiang
Jan 14 2017 03:54 UTC
I just noticed there's an intro to stats book in that Syncfusion link I posted a while back on Statistics by Katie Kormanik. She's one of the statistics instructors from the Udacity stats courses.
Just a bit of trivia if anyone is interested at all :)
Hèlen Grives
Jan 14 2017 07:27 UTC
@cuent I'm very new to the topic. I think you need to code the relationship first. Let's say you have set up the relationship matrix as in a social network analysis. Then you can calculate the co-appearance of the words. Maybe another way is to use existing corpora and then estimate the likelihood of a relationship. I believe Google n-grams can do it for you. Confidence intervals are only for your intuition. They are not written in stone. Hope it helps a bit.😬
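A rough sketch of the co-appearance idea above, assuming the corpus is just a list of sentences: count the sentences that mention both groups and divide by the sentences that mention either one. The function name `cooccurrence_score` and the toy corpus are hypothetical, and this ratio is only one of many possible ways to score co-appearance.

```python
def cooccurrence_score(g1, g2, sentences):
    """Share of sentences mentioning either group that mention both groups."""
    g1, g2 = set(g1), set(g2)
    both = 0
    either = 0
    for sentence in sentences:
        # Naive whitespace tokenisation; a real corpus needs proper cleaning.
        tokens = set(sentence.lower().split())
        has_g1 = bool(tokens & g1)
        has_g2 = bool(tokens & g2)
        both += int(has_g1 and has_g2)
        either += int(has_g1 or has_g2)
    return both / either if either else 0.0


# Toy corpus, invented for illustration.
CORPUS = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the car needs fuel",
    "the mouse ate cheese",
]

if __name__ == "__main__":
    print(cooccurrence_score(["cat", "dog"], ["mouse"], CORPUS))
```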
Koustuv Sinha
Jan 14 2017 16:55 UTC
hello! does anyone here have experience with running LDA to get topics? (specifically the gensim LDA module) Need some help! :)
Alice Jiang
Jan 14 2017 18:49 UTC
If I have a big ass dataset that includes a categorical feature and I want to do some maths-type stuff on each of the categories, would it be best to count the number of times each category appears in that one feature, or to create new binary features for each category?
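To make the two options in the question concrete, here is a small sketch using only the standard library. The toy `ROWS` data and the helper names `one_hot` and `mean_by_category` are assumptions for illustration. Counting gives one number per category, binary features give one column per category for every row, and grouping is a third route when the goal is a per-category statistic.

```python
from collections import Counter, defaultdict

# Toy rows of (category, value), invented for illustration.
ROWS = [("red", 1.0), ("blue", 4.0), ("red", 3.0), ("green", 2.0)]

# Option 1: count how often each category appears.
COUNTS = Counter(category for category, _ in ROWS)


def one_hot(rows):
    """Option 2: one binary column per category, one output row per input row."""
    categories = sorted({category for category, _ in rows})
    encoded = [[int(category == c) for c in categories] for category, _ in rows]
    return categories, encoded


def mean_by_category(rows):
    """Group values by category and average them."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: sum(vals) / len(vals) for category, vals in groups.items()}


if __name__ == "__main__":
    print(COUNTS)
    print(one_hot(ROWS))
    print(mean_by_category(ROWS))
```

Which option fits best depends on what the "maths-type stuff" is: counts summarise the feature, while binary features keep row-level information that a model can use.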
Jan 14 2017 22:21 UTC


Amazon and one other main attribute of a DS: being able to make "metric linkages"

I recently attended a talk by the DS manager of Amazon Berlin. Very interesting: he mentioned a couple of key things that no one has mentioned before as key skills for a data scientist. One of them he called "metric linkage". He said that a data scientist should be able to create them. But what is that?

Here is an example, posed as a challenge to you:
As you know, Amazon also sells perishable products. In order to meet quality standards, they have hired a group of QA employees whose main role is to verify the quality of those products before they are sent to buyers.
However, the company is aiming to automate almost everything. One of the targets is to find a way to introduce a system that could also automate the QA process for perishable products.
The example was for strawberries.

If you were asked to implement a system to evaluate the quality of the fruit, what would you do?

The QA activity is very manual and based on experience. It involves evaluating not only visual cues but also tactile and other organoleptic attributes of the fruit.

  • Considering current advances, what do you think is the best single, simple attribute to measure and track? Select only one.
  • After selecting the main attribute, how would you design a metric that also gives you a proxy for the other main sensory attributes of quality?
  • You will be using machine learning.
@koustuvsinha How are you!!!!???
@cuent word2vec is good but as far as I know you cannot apply any algebraic operation between the distances you are going to get.
Koustuv Sinha
Jan 14 2017 22:27 UTC
hey @evaristoc !! i'm fine! :) how are things here?
Jan 14 2017 22:28 UTC
@cuent I actually haven't worked on trying to get a value representing the distance between two groups of words, but I guess that GloVe might be a better option as it is more statistically based? I can try to investigate that for you.
@koustuvsinha no change since you left us here abandoned man!
@koustuvsinha Shall I contact you tomorrow about LDA?
I did something with it already, just a bit, not much... But I don't think LDA is the best choice, as far as I remember... IMO... Tomorrow?
Koustuv Sinha
Jan 14 2017 22:35 UTC
@evaristoc sure! and don't say "abandoned", it's harsh! i never really went away ;)
Alice Jiang
Jan 14 2017 22:39 UTC
@koustuvsinha @evaristoc Has abandonment issues because of you, now. Whenever I leave he sits by the door and barks :joy:
Jan 14 2017 22:43 UTC


Again, I haven't had to measure a distance like this myself yet, but from what I have discussed with other people, I think there are those who would refuse to claim that such a measurement should be taken seriously. Why? The position of the groups will rely A LOT on your corpus, and words are actually NOMINAL variables, so in theory they lack ordering. In theory, you can always re-group them according to the meaning you want to give them.

However, some sort of distance is always implemented in practice. One simple example is Sentiment Analysis. A clearer example, in my opinion, is the following:

If you can check what they do to create those groupings, you might get a good starting point, I think...

@becausealice2 well... I actually always bark... @koustuvsinha : it doesn't have to do with you...
Jan 14 2017 22:52 UTC
@cuent I mean: synonyms... I've never worked with that before... I guess it is word2vec... I will check what they implement too... I am curious...