These are chat archives for FreeCodeCamp/DataScience

5th
Feb 2018
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 11:44 UTC
I have documents in different languages. How can I cluster those documents?
evaristoc
@evaristoc
Feb 05 2018 11:47 UTC
@Ankitnau25 Not sure I understand your question. Do you mean clustering by language?
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 11:51 UTC
I need to apply a clustering algorithm to a set of documents, but they are in different languages. @evaristoc how can I do that?
Is there any Python library which converts different languages to English?
evaristoc
@evaristoc
Feb 05 2018 11:56 UTC

That is a different matter. Not per se. You can use Python to connect to translation APIs, but depending on the length of the documents it might take a long time or cost money, and the translations are still far from reliable. I don't know.

Then you can read up on vector space modelling (the simplest approach) and cluster over those vectors. k-means (again, the simplest) is the most widely used.
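
A minimal sketch of that idea, using only the Python standard library: each document becomes a length-normalised word-count vector and a plain k-means loop groups the vectors. The sample documents, the whitespace tokenisation, and the naive seeding from the first k documents are all assumptions made for illustration; a real project would more likely reach for an established library.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn each document into a length-normalised term-count vector."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokens
    vocab = sorted({t for toks in tokenized for t in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [float(counts[t]) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample documents, two rough topics.
docs = [
    "patient reports fever and cough",
    "invoice total amount due payment",
    "patient has cough and high fever",
    "payment received for invoice amount",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

With these four sample documents the two medical sentences share one label and the two billing sentences share the other. Documents in different languages share almost no vocabulary, so word-count vectors like these tend to separate by language before they separate by topic.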

If you are clustering by language, you might not even need to apply any clustering algorithm at all. Just find a way to identify which language each document is in.
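
One rough way to do that without any clustering, sketched with the standard library only: label each document by the Unicode script most of its letters belong to and group on that label. This is an assumption-heavy shortcut. Script is not language, so English and French both come out as LATIN, and the function names and sample sentences here are made up for the example.

```python
import unicodedata
from collections import Counter, defaultdict


def dominant_script(text):
    """Return a rough script label such as 'LATIN' or 'DEVANAGARI'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script,
            # e.g. 'DEVANAGARI LETTER KA' or 'LATIN SMALL LETTER A'.
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


# Hypothetical sample documents.
docs = [
    "The patient has a fever",
    "मरीज को बुखार है",
    "Bonjour le monde",
]

groups = defaultdict(list)
for index, doc in enumerate(docs):
    groups[dominant_script(doc)].append(index)

print(dict(groups))
```

The English and French samples end up in the same group, which shows the limit of the shortcut: telling apart languages that share a script needs a proper language-identification step.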

@Ankitnau25

Ankit Nautiyal
@Ankitnau25
Feb 05 2018 11:58 UTC
I know k-means, but the issue is when the documents contain a mixture of languages
@evaristoc
evaristoc
@evaristoc
Feb 05 2018 11:59 UTC
@Ankitnau25 A mix?? Even more complicated.
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 12:01 UTC
Yes, a mixture of both. For example, in India patients' reports are in both Hindi and English
evaristoc
@evaristoc
Feb 05 2018 12:02 UTC
@Ankitnau25 When you say both languages, you are saying the same document contains both languages, aren't you? That is what I say is more complicated.
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 12:02 UTC
yes same document
evaristoc
@evaristoc
Feb 05 2018 12:04 UTC
Tough. You either need to standardise the language or separate the languages and analyse them independently. The second might not be the best option, but it would let you focus on English only, for example, if you think that could take you to a partial solution.
That is my opinion. You might find other people suggesting other solutions, probably better than mine. I suggest you ask not only here but in other places too.
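
A small sketch of the "separate them" option for a single mixed document, assuming the two languages use different scripts (as Hindi in Devanagari and English in Latin letters do). Each whitespace-separated token is routed by the script of its first letter. The function names and the sample report line are hypothetical, and Hindi typed in Latin letters would be filed under LATIN by this approach.

```python
import unicodedata
from collections import defaultdict


def token_script(token):
    """Script label of the first alphabetic character, or 'OTHER'."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"


def split_by_script(text):
    """Split one mixed-language text into one string per script."""
    parts = defaultdict(list)
    for token in text.split():
        parts[token_script(token)].append(token)
    return {script: " ".join(tokens) for script, tokens in parts.items()}


# Hypothetical line from a mixed Hindi/English report.
report = "Patient को fever है since Monday"
parts = split_by_script(report)
print(parts)
```

The LATIN part could then go through an English-only pipeline while the DEVANAGARI part is handled separately or set aside, which is the partial solution described above.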
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 12:06 UTC
Different places?
Any other data science groups you know?
@evaristoc
evaristoc
@evaristoc
Feb 05 2018 13:53 UTC
No, sorry, @Ankitnau25. Try to see if you can find something on Stack Overflow. Sorry I can't help you more.
Ankit Nautiyal
@Ankitnau25
Feb 05 2018 15:59 UTC
Thanks @evaristoc
CamperBot
@camperbot
Feb 05 2018 15:59 UTC
ankitnau25 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 401 | @evaristoc |http://www.freecodecamp.org/evaristoc
Alice Jiang
@becausealice2
Feb 05 2018 21:37 UTC
Remember the etymological trees I posted not too long ago? The Daily Mail stole them and the original author is suing.
They've been doing it to Nathan Yau for some time, as well.
evaristoc
@evaristoc
Feb 05 2018 21:50 UTC
Nathan Yau is my visualization hero, by the way...
Alice Jiang
@becausealice2
Feb 05 2018 23:15 UTC
He seems like a cool guy. I'm pissed on his behalf, tbh.