These are chat archives for FreeCodeCamp/DataScience

1st
Aug 2016
evaristoc
@evaristoc
Aug 01 2016 15:58

Hi People:

Until today I was working on a small exercise on how to summarise the content of chatrooms. My target has been this room, the DSR. For that I have been using:

  • tfidf
  • lsa
  • clustering (k-means, minibatches)

The clustering aimed to separate the posts using the tfidf values of the words in each post. The assumption was that the tfidf, after a few standardisation steps, was enough to determine the relevance of the content when using an additional related corpus as the main input. Some other projects out there use tfidf as a relevance measure too (e.g. search engines).
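A minimal sketch of the kind of pipeline described above (tfidf, then LSA, then k-means), assuming scikit-learn is available. The posts, the number of LSA components and the number of clusters are made up for illustration; this is not the code used in the project.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy posts standing in for chat messages (invented for this example).
posts = [
    "how do I plot a histogram in pandas",
    "pandas histogram plot with matplotlib",
    "plot a pandas dataframe as a histogram",
    "which neural network course do you recommend",
    "recommend a course on neural network basics",
    "a good neural network course for beginners",
]

# tf-idf -> LSA (truncated SVD) -> length normalisation.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
features = lsa.fit_transform(posts)

# k-means on the reduced vectors; n_clusters=2 is an assumption for the toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

for label, post in zip(labels, posts):
    print(label, post)
```

On real chat data the number of clusters is not known in advance, and the clusters overlap far more than in this toy case.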

I am arriving at conclusions similar to those I reached when I was exploring an automated system to detect questions in the chat, plus some other important findings. Summarising:

  • the clustering helps, but not much, with disentangling (separating) the conversations: I first clustered the whole data of one week, where the clusters tended to overlap a lot... this overlap was slightly reduced when looking at the points of the clusters per day; however, there were situations where the overlap didn't disappear.
  • there is a lot of overlap because the content of the posts is not homogeneous in terms of topic.
  • also important: the level of English differs per person, even among native speakers; this affects the homogeneity of the content and makes it difficult for any word-based rating to determine its relevance.
  • then there is the question of relevance: I am using a simple rating that looks at previously used words and tries to find new ones... this is not enough. Relevance of content could actually be based on:
    • who speaks - the ranking of the person who posts would determine the relevance of the content; keep in mind that that person's English might be poorly expressed BUT the fact that the person is important already determines the importance of what is being said/written
    • what is relevant for the group - the popularity of a post alone is not enough, and even that is difficult to measure... but determining the relevance to the group is perhaps something that should be done arbitrarily (e.g. assigning more weight to words that are considered relevant to that particular group)
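The two relevance factors in the list above could be combined in a very simple score. This is only a sketch under stated assumptions: the role names, the hand-picked weights and the additive formula are all hypothetical, chosen so that a post from an important person still scores even when it contains none of the "relevant" words.

```python
# Hypothetical, hand-assigned weights (not learned from data).
AUTHOR_WEIGHT = {"mentor": 2.0, "regular": 1.0}
GROUP_WORD_WEIGHT = {"clustering": 3.0, "tfidf": 3.0, "pandas": 2.0}


def relevance(post_text, author_role):
    """Author weight plus the mean group weight of the words in the post."""
    words = post_text.lower().split()
    if not words:
        return 0.0
    word_score = sum(GROUP_WORD_WEIGHT.get(w, 0.0) for w in words) / len(words)
    return AUTHOR_WEIGHT.get(author_role, 1.0) + word_score


print(relevance("tfidf clustering", "mentor"))   # on-topic post by an important person
print(relevance("hello there", "mentor"))        # off-topic post by an important person
print(relevance("hello there", "regular"))       # off-topic post by a regular member
```

The additive form is a design choice: a multiplicative score would give zero to an important person who uses none of the weighted words, which is the situation the first sub-point warns about.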

This project as it is can help to approximate some summaries but still requires intervention. If I find something more interesting I'll let you know!

Take care!

Albert Jonathan
@albert2309
Aug 01 2016 16:00
@evaristoc Oh wow. That's an ambitious project. Good luck with that project.
I haven't even touched any machine learning materials yet XD
evaristoc
@evaristoc
Aug 01 2016 16:05
@albert2309 thanks
CamperBot
@camperbot
Aug 01 2016 16:05
:cookie: 361 | @albert2309 |http://www.freecodecamp.com/albert2309
evaristoc sends brownie points to @albert2309 :sparkles: :thumbsup: :sparkles:
Lightwaves
@Lightwaves
Aug 01 2016 16:43
@evaristoc Hey man, just saying hello, and wanted to ask a question. Have you figured out how you quantify the relevance?
evaristoc
@evaristoc
Aug 01 2016 17:14

@Lightwaves hey! How are you doing? It's been a long time since I've seen you here... What are you doing now? Did you finish your inter?

About the relevance... nope. I mean: I have an idea of what affects it, but writing code that "learns" how to determine the relevance, not yet... I think it requires more than just neural networks, and I don't have annotated data (again!) so it is hard... I am trying unsupervised methods at the moment but they are not really aware of the dynamics of a dialogue (for example speech order in the conversation): only structure.

I am also looking for the simplest model possible...

@Lightwaves are you going to be around at FCC for a while? Let me know!

Lightwaves
@Lightwaves
Aug 01 2016 17:18

@evaristoc
Let me send some of my jibberish your way.
If a topic is relevant to a group, wouldn't the group talk about said topic and use said words more often?

So say important person x uses some words.
If everyone excluding x also uses those words, then wouldn't those words be considered more relevant to the group?

You'd also be able to take other people who are important to the group into account: make the weight of the words larger if they match those used by other important people.

probably just jibberish
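The weighting idea in this message can be sketched roughly as follows. Everything here is an illustration: the function name `group_word_weights`, the `boost` factor and the sample posts are hypothetical, and counting distinct users is just one way to read "everyone excluding x also uses those words".

```python
from collections import defaultdict


def group_word_weights(posts, speaker, important=frozenset(), boost=2.0):
    """Weight each word used by `speaker` by how many *other* users also use it.

    Users listed in `important` count `boost` times instead of once.
    """
    users_by_word = defaultdict(set)
    for user, text in posts:
        for word in set(text.lower().split()):
            users_by_word[word].add(user)

    weights = {}
    for word, users in users_by_word.items():
        if speaker not in users:
            continue  # only score words that the speaker actually used
        others = users - {speaker}
        weights[word] = sum((boost if u in important else 1.0 for u in others), 0.0)
    return weights


# Invented sample posts as (user, text) pairs.
posts = [
    ("x", "try clustering with tfidf"),
    ("ana", "clustering sounds good"),
    ("bob", "tfidf and clustering both work"),
    ("cy", "lunch anyone"),
]
weights = group_word_weights(posts, speaker="x", important={"bob"})
print(weights)
```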

evaristoc
@evaristoc
Aug 01 2016 18:06

@Lightwaves :

Let me send some of my jibberish your way.

Of course! Your jibberish is always welcome! :)

If a topic is relevant to a group, wouldn't the group talk about said topic and use said words more often?

Agreed, this is what the clustering should detect; I'll tell you later about something I just found... ;)

So say important person x uses some words.
If everyone excluding x also uses those words, then wouldn't those words be considered more relevant to the group?

Yes... and no. The focus here would be, in principle, to detect the conversation. If everyone is referring to similar objects and content, they could be using similar words, unless the content is already known from previous contacts. We are assuming conversations where no pre-determined content has been communicated.
Additionally, an important person could be communicating something important within a conversation without using the relevant words, but whose meaning is implicitly understood. Think of the Dalai Lama joining the DSR, for example. His message might be relevant, even if out of context. Now suppose everyone starts talking to the Dalai Lama... is the conversation relevant? Or should we consider it a trivial one as a topic for the room, for example? How can my machine recognise that that is the Dalai Lama and that the conversation is NOT trivial? This is why "relevance" is so hard to get... in the end, I think you simply have to decide what is relevant arbitrarily or based on main topics...

You'd also be able to take other people who are important to the group into account: make the weight of the words larger if they match those used by other important people.

Excellent point! If that is jibberish, then I don't know what it's like when you talk seriously!! :)


People:

(@Lightwaves...)
About something that I found:

  • Adding the (standardised) timestamp of the post as an additional variable for the clustering procedure seems to improve the clustering. Why? In our case, conversations usually occur within a relatively short time span. Neither long waits for an answer nor substantial overlapping are the norm in this chatroom. That regularity in time helps in disentangling a couple of conversations that occurred relatively close in time when their content turned out to be different (Nice...!).
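A minimal sketch of adding the standardised timestamp as one more column next to the tfidf features, assuming scikit-learn and NumPy. The posts, the timestamps (seconds since the first post) and the `time_weight` knob are hypothetical; the original procedure may have combined the variables differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# (text, seconds since the first post): two invented conversations two hours apart.
posts = [
    ("how do I read a csv file", 0),
    ("use pandas read_csv for the file", 60),
    ("thanks, read_csv worked", 120),
    ("any tips for the kaggle competition", 7200),
    ("start with the kaggle tutorials", 7260),
    ("thanks, the tutorials helped", 7320),
]
texts = [text for text, _ in posts]
seconds = np.array([[sec] for _, sec in posts], dtype=float)

tfidf = TfidfVectorizer().fit_transform(texts).toarray()
time_z = StandardScaler().fit_transform(seconds)  # standardised timestamp

time_weight = 1.0  # hypothetical knob: how strongly time influences the clusters
features = np.hstack([tfidf, time_weight * time_z])

labels_text_only = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
labels_with_time = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print("text only:", labels_text_only)
print("with time:", labels_with_time)
```

With short posts that share few words, the tfidf columns alone carry little signal, so the time column largely decides the grouping here; a smaller `time_weight` would shift the balance back towards content.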

I guess other variables should be added in order to disentangle conversations like those in the main chatroom, where the overlapping is almost unbearable, but let's see...

@Lightwaves: I'm planning to send you a PM soon. I need to ask you something...

Lightwaves
@Lightwaves
Aug 01 2016 18:15

Wow, that's a pretty amazing correlation and it seems to make sense.

Conceptually, if two people or a group are having a conversation about some topic, then the pauses would be short because they are replying to each other. That's pretty dang cool.

On a chat or a message board this wouldn't always hold: someone may reply to me on the same topic, yet it'll be quite some time afterwards, for example, so time alone won't be the smoking gun to figure out if two people are conversing about similar things.
Still pretty dang cool.
evaristoc
@evaristoc
Aug 01 2016 18:17
:)
It doesn't happen in all chatrooms, OR it happens in other chatrooms rather differently. If you post, let's say, in the GameDev chatroom today, you might receive an answer several days later... sometimes there is an exceptional conversation in "real time"... well: I don't know how a procedure like this one would behave for that chatroom, but I guess other clues would have to be included in the model, otherwise it won't do the clustering properly...
Lightwaves
@Lightwaves
Aug 01 2016 18:18
Right, or someone could cause the conversation to move in a totally different direction.
evaristoc
@evaristoc
Aug 01 2016 18:18
Exactly... we were thinking the same thing, indeed...
That's also true... I expect that the method should detect that too, so it would be able to split a conversation that covers different topics and separate conversations by topic... that would be great!
Lightwaves
@Lightwaves
Aug 01 2016 23:46
Hey @evaristoc you here?