These are chat archives for FreeCodeCamp/DataScience

2nd
Aug 2016
Lightwaves
@Lightwaves
Aug 02 2016 02:23
@evaristoc
evaristoc
@evaristoc
Aug 02 2016 08:07

@Lightwaves ! Thanks!

I also wanted to share this one:
http://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00003

I will read that one, although for the project I am preparing I think I have found something that is enough to detect the main topic of the week, based on some code I had written previously. Even if everything goes well, I won't be able to pick every single conversation and describe it step by step, for example:

X said this. Then Y replied suggesting this. Then X was asking about that...

However, I think it will be possible to detect fragments of conversations (my disentangling is fragmenting conversations between different clusters :(), pick the one fragment that looks more relevant than any other, and make a summary, for example:

X and Y were talking about a, b, c, and d.

Where 'a, b, c and d' could be relevant words to the topic of the room.
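A minimal sketch of that kind of summary, assuming a fragment is already available as a list of (user, text) pairs. The `ROOM_WORDS` vocabulary and the `summarize_fragment` helper are hypothetical names for illustration, not the code used in the project:

```python
import re

# Hypothetical vocabulary of words relevant to the room's topic (an assumption).
ROOM_WORDS = {"python", "r", "statistics", "api", "data"}

def summarize_fragment(messages, room_words=ROOM_WORDS):
    """Return a one-line summary for a fragment given as (user, text) pairs."""
    users = sorted({user for user, _ in messages})
    tokens = set()
    for _, text in messages:
        # Lowercase and keep alphabetic tokens only.
        tokens.update(re.findall(r"[a-z]+", text.lower()))
    # Keep only the tokens that belong to the room vocabulary.
    topics = sorted(tokens & room_words)
    return f"{' and '.join(users)} were talking about {', '.join(topics)}."

fragment = [
    ("X", "Is the API easier to use from Python?"),
    ("Y", "I would pull the data with R instead."),
]
print(summarize_fragment(fragment))
# X and Y were talking about api, data, python, r.
```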

CamperBot
@camperbot
Aug 02 2016 08:07
evaristoc sends brownie points to @lightwaves :sparkles: :thumbsup: :sparkles:
:cookie: 202 | @lightwaves |http://www.freecodecamp.com/lightwaves
Albert Jonathan
@albert2309
Aug 02 2016 13:27
This message was deleted
evaristoc
@evaristoc
Aug 02 2016 20:53

@Lightwaves :

This is the best I have managed to do so far:

val and selcluster  2.2120221841984256 0
val and selcluster  4.853553778366113 0
the following people {'SamAI-Software', 'DanStockham', 'bitgrower', 'jacobbogers', 'evaristoc'}  were talking about  {'better', 'looking', 'math', 'm', 'links', 'seems', 'idea', 'given', 'back', 'key', 'part', 'need', 'stats', 'book', 'found', 'r', 'want', 'depth', 'could', 'week', 'go', 'using', 'leaderboard', 'api', 're', 'wanted', 'based', 'foundational', 'statistics', 'books', 'page', 'project', 'starting', 'coding', 'anything', 'much', 'python', 'analysis', 'bit', 'add', 'looks'}  at  2016-06-12
val and selcluster  2.7977316347158996 2
the following people {'erictleung', 'alicejiang1', 'gayathry2612', 'evaristoc'}  were talking about  {'looking', 'viz', 'm', 'links', 'name', 'back', 'look', 'fcc', 'need', 'missing', 'caching', 'r', 'facebook', 'enough', 'trying', 'group', 'take', 'around', 'go', 'js', 'll', 'pure', 'api', 're', 'easy', 'data', 'way', 'page', 'things', 'project', 'rules', 'long', 'starting', 'care', 'wiki', 'python', 'bit', 'file', 'add', 'looks'}  at  2016-06-13
val and selcluster  2.499714511158921 4
val and selcluster  2.9053663638437164 2
val and selcluster  2.9269721660145147 2
val and selcluster  4.119383636019986 2
val and selcluster  4.137491084998594 4
the following people {'alicejiang1', 'bvi1994', 'evaristoc'}  were talking about  {'ekman', 'm', 'links', 'name', 'far', 'look', 'get', 'fcc', 'little', 'sleep', 'room', 'man', 'cookies', 'found', 'r', 'want', 'else', 'trying', 'something', 'care', 'anything', 'take', 'fbi', 'give', 'imagination'}  at  2016-06-14
val and selcluster  3.7746607157980177 3
val and selcluster  4.882648177701309 3
the following people {'alicejiang1', 'evaristoc', 'Alloffices'}  were talking about  {'repo', 'looking', 'better', 'viz', 'm', 'idea', 'name', 'given', 'back', 'able', 'fcc', 'get', 'need', 'part', 'little', 'master', 'update', 'personality', 'found', 'caching', 'facebook', 'enough', 'trying', 'around', 'googlemaps', 'could', 'js', 'll', 'json', 'using', 'api', 'wanted', 'someone', 'based', 'ds', 'data', 'way', 'point', 'page', 'search', 'current', 'room', 'project', 'rules', 'long', 'every', 'care', 'starting', 'something', 'wiki', 'google', 'bit', 'concern', 'feel', 'give', 'population', 'coordinates', 'file', 'add', 'looks', 'credit'}  at  2016-06-15
val and selcluster  4.058976366690988 5
val and selcluster  4.378106317323027 5
the following people {'alicejiang1', 'evaristoc'}  were talking about  {'repo', 'today', 'care', 'wiki', 'take', 'tried', 'json', 'part', 'fcc', 'coordinates', 'file', 'room'}  at  2016-06-16
val and selcluster  2.2196294192265333 5
val and selcluster  3.036450178392109 5
the following people {'pdurbin', 'alicejiang1', 'jacobbogers', 'evaristoc'}  were talking about  {'m', 'forum', 'idea', 'data', 'interested', 'fcc', 'point', 'things', 'room', 'scraper', 'today', 'enough', 'want', 'discourse', 'take', 'could', 'market', 'free', 'file', 'json', 'api'}  at  2016-06-17
val and selcluster  3.5955265812849024 6
the following people {'shawniscool', 'pdurbin', 'gayathry2612', 'jacobbogers'}  were talking about  {'discourse', 'market', 'bit', 'key', 'interested', 'way', 'data', 'get', 'free', 'leaderboard', 'api'}  at  2016-06-18
As you can see, it is still a bit dirty... you can still use other tools to clean this even further, but this project doesn't handle false positives very well, so there is a lot of room for some unrelated words to slip through.
Furthermore, it is better to know what the conversation is about in order to reveal the content. I would accompany this kind of approximation with a message selected from the whole conversation (that could be done better now!!!)
So it is more an invitation to go further than a way of showing the content.
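For anyone who wants to clean the raw output further with other tools, here is a hedged sketch of how the "the following people ... were talking about ... at ..." lines could be read back into Python. The `parse_summary` helper and its regular expression are assumptions based only on the lines pasted above:

```python
import ast
import re

# Pattern inferred from the pasted output; an assumption, not the project's own format spec.
LINE_RE = re.compile(
    r"the following people (\{.*?\})\s+were talking about\s+(\{.*?\})\s+at\s+(\d{4}-\d{2}-\d{2})"
)

def parse_summary(line):
    """Return (date, people, words) for a summary line, or None for any other line."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # The braces hold Python set literals, so literal_eval can read them safely.
    people = ast.literal_eval(match.group(1))
    words = ast.literal_eval(match.group(2))
    return match.group(3), people, words

line = (
    "the following people {'shawniscool', 'pdurbin', 'gayathry2612', 'jacobbogers'}  "
    "were talking about  {'discourse', 'market', 'bit', 'key', 'interested', 'way', "
    "'data', 'get', 'free', 'leaderboard', 'api'}  at  2016-06-18"
)
date, people, words = parse_summary(line)
print(date, sorted(people), sorted(words))
```

The "val and selcluster" lines do not match the pattern, so `parse_summary` returns `None` for them and they can be skipped.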
Lightwaves
@Lightwaves
Aug 02 2016 20:57
@evaristoc would dropping words that occur too often help?
I have to take a good look at this. This looks intriguing.
evaristoc
@evaristoc
Aug 02 2016 21:00

Good point! I haven't implemented that... but it should be words that occur too often... in which context? When you train a classifier you must also train the model to understand what "too often" means... Anyway: mine is just an exercise that hasn't involved additional classification.

I am just testing the concept... but it looks nicer than I thought, seriously!

Give me a sec to clean the data to show you.
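One simple reading of "too often", sketched under assumptions: treat each daily fragment as a document and drop any word that appears in more than a chosen share of the fragments. The `drop_frequent_words` helper and the `max_doc_ratio` cutoff are hypothetical; this is not something the project implements, and the cutoff is exactly the "in which context?" choice that would still have to be tuned:

```python
from collections import Counter

def drop_frequent_words(fragments, max_doc_ratio=0.5):
    """Remove words that appear in more than max_doc_ratio of the fragments.

    fragments: list of sets of words, one set per conversation fragment.
    max_doc_ratio: hypothetical cutoff for what counts as "too often".
    """
    # Each fragment is a set, so this counts fragments per word (document frequency).
    doc_freq = Counter(word for words in fragments for word in words)
    limit = max_doc_ratio * len(fragments)
    return [{w for w in words if doc_freq[w] <= limit} for words in fragments]

# Small word sets borrowed from the output above, trimmed for illustration.
fragments = [
    {"m", "links", "math", "python"},
    {"m", "links", "viz", "python"},
    {"m", "links", "ekman", "fbi"},
    {"m", "repo", "json", "coordinates"},
]
cleaned = drop_frequent_words(fragments)
for words in cleaned:
    print(sorted(words))
```

With this cutoff 'm' and 'links' are dropped everywhere, while 'python' survives because it appears in exactly half of the fragments. A stricter ratio would also remove on-topic words, which is the risk of deciding "too often" without any context.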
Lightwaves
@Lightwaves
Aug 02 2016 21:01
I'm actually surprised how well that works
evaristoc
@evaristoc
Aug 02 2016 21:13
  • At 2016-06-12 the following people: 'SamAI-Software', 'DanStockham', 'bitgrower', 'jacobbogers', and 'evaristoc' were talking about:

    • 'math', 'links', 'R', 'leaderboard', 'apis', 'foundational', 'statistics', 'books', 'coding', 'python', 'analysis'
  • At 2016-06-13 the following people: 'erictleung', 'alicejiang1', 'gayathry2612', and 'evaristoc' were talking about:

    • 'viz' (visualizations), 'links', 'fcc', 'R', 'facebook', 'js', 'apis', 'data', 'wiki', 'python'
  • At 2016-06-14 the following people: 'alicejiang1', 'bvi1994', and 'evaristoc' were talking about:

    • 'Ekman' (a recognised researcher), 'links', 'fcc', 'cookies', 'R', 'FBI'
  • At 2016-06-15 the following people: 'alicejiang1', 'evaristoc', and 'Alloffices' were talking about:

    • 'repos', 'viz', 'masters', 'personality', 'facebook', 'googlemaps', 'js', 'json', 'apis', 'ds' (DataScience), 'data', 'search', 'Wiki', 'Google', 'population', 'coordinates'

And so on...

I know there are some errors in the selection of the classifier, and I am leaving them in for the sake of comparison (for example, Alloffices didn't really engage in a conversation, and I could recognise some words like FBI that were actually part of a trivial, off-topic conversation), but that is to show the scope of the project.

If you can clean the words as much as possible, even with some false positives, but add a well-selected post that clearly speaks about the topic of a conversation AND add some stats, you can contribute to improving the versatility of the chat for its users, as long as the chat has at least a mid-level of traffic.
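A rough sketch of picking that "well selected post", assuming the cleaned keywords for a fragment are already known. Scoring a post by how many keywords it contains is only one possible heuristic, and `pick_representative` is a hypothetical helper, not part of the project:

```python
import re

def pick_representative(messages, keywords):
    """Return the (user, text) pair whose text shares the most words with keywords."""
    def score(message):
        _, text = message
        return len(set(re.findall(r"[a-z]+", text.lower())) & keywords)
    # max() keeps the first message when several share the top score.
    return max(messages, key=score)

# Invented messages with placeholder users, for illustration only.
messages = [
    ("X", "brb, need sleep"),
    ("Y", "I updated the repo with the json file of coordinates"),
    ("Z", "hello room"),
]
keywords = {"repo", "json", "coordinates", "file"}
best = pick_representative(messages, keywords)
print(best)
```

Here the post from "Y" wins because it contains all four keywords. The score itself (4 out of 4) could double as one of the stats shown next to the summary.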

evaristoc
@evaristoc
Aug 02 2016 21:21

@Lightwaves the above is a cleaner version... I tried to exclude the words that looked more off topic in the majority of the cases and keep a trend emerging.

The first question is:

Is this enough to give an idea of the topic of the conversation?

My selection is biased: I am assuming that the conversation is about a topic of interest (I don't actually know why the first group was apparently mentioning math or statistics as a topic, or who did it, etc.)

Let's assume however that it is enough and the risk of bias is accepted given the nature of the room...

Now assume that another person, not me, is doing the cleaning and that that person doesn't know anything about the content of the communication but has an idea of the topic of the chatroom... Would he/she come up with a similar or better "summary"?

Now assume it is a machine...

Anyway I have to go... take care man!!
evaristoc
@evaristoc
Aug 02 2016 21:27
@Lightwaves by the way! One way to test it is to start a search using the search engine of the Gitter Chatroom...
Lightwaves
@Lightwaves
Aug 02 2016 21:28
I'd love to see the code for this once you release it
evaristoc
@evaristoc
Aug 02 2016 21:30
So not too bad... although too raw...
Lightwaves
@Lightwaves
Aug 02 2016 21:31
This is actually a pretty tough problem actively being researched
evaristoc
@evaristoc
Aug 02 2016 21:32
I will let you know... I will send that to you as soon as I have something cleaner (you know how I code... wrack!) so you will have to keep your eyes open!! @Lightwaves
Lightwaves
@Lightwaves
Aug 02 2016 21:32
Getting these results I'd say is worth a beer or your favorite drink
evaristoc
@evaristoc
Aug 02 2016 21:32
:) :) :)
Let's make two! I pay the next round! :)
Alice Jiang
@becausealice2
Aug 02 2016 21:42
You guys should make a script to change "alicejiang1" to literally anything else
evaristoc
@evaristoc
Aug 02 2016 21:43
:) :) :) The ever-changing name...
Or actually I can imagine your name, @alicejiang1, as something flexible
so you can actually shape it into anything
@aaaaaaaaaaaaaaaaaaaaaalicccccccccccccceeeeeeeeeeejiaaaaaaaaaaaaaaaaaannnnnnnnnggggggggggggggg111
Alice Jiang
@becausealice2
Aug 02 2016 21:45
Just make it so it's not me because I don't need people to see how stupid I get in these chat rooms
Lightwaves
@Lightwaves
Aug 02 2016 21:46
No!!!
The world must know!
evaristoc
@evaristoc
Aug 02 2016 21:46
Something that you can stretch or just make very small
''
What is that? Your today's username, of course!!!
The invisible you...
@alicejiang1 take care friend! you are doing good! Go for a walk if this is too much for today... People! Take care!!!
Alice Jiang
@becausealice2
Aug 02 2016 21:52
Why does the world need to know? It's not like I'm running for PotUS
Lightwaves
@Lightwaves
Aug 02 2016 21:53
Because you are Alice
Everyone must hear of your awesome.
Alice Jiang
@becausealice2
Aug 02 2016 21:53
@evaristoc You take care as well, and thanks for all the enthusiasm and support you constantly have :)
CamperBot
@camperbot
Aug 02 2016 21:53
alicejiang1 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 302 | @evaristoc |http://www.freecodecamp.com/evaristoc
Alice Jiang
@becausealice2
Aug 02 2016 21:53
"becauseAlice" has been my internet name before....
But still, I get incredibly stupid and don't need that to get out
I'm a role model
Lightwaves
@Lightwaves
Aug 02 2016 21:54
I totally got demoralized over here. I saw evaristoc's results and now I'm like: my project, noooooo
even though it's slightly different
When he releases his code I'll study and learn from it.
In the meantime I'll look at an intro to machine learning course