These are chat archives for FreeCodeCamp/DataScience

16th
Jan 2017
Xavier Sumba
@cuent
Jan 16 2017 02:32

@mesmoiron Thanks. I don't necessarily need a confidence metric, but I need to know whether the 1st group is related to the 2nd.
@evaristoc Thanks. I've used WordNet and WordNet Domains a little, and it's cumbersome because I have groups of words from a variety of domains. I've also used the cortical API (SimService seems similar). It's really good, they use fingerprints to find relationships, but I wouldn't like to depend on an API. The problem is that I want to disambiguate authors, and there are cases where I can't say they are different just by comparing their names; I need another attribute. I'm using some subjects, which also vary in language. So I have two groups of words, and I want to determine whether those words are similar or not.
So far I have a small implementation of Google Distance, but it takes a great amount of time; I've also tried some syntactic similarities and the cortical API.
What I'm going to do now is see exactly what the SimService API and the SEMILAR Project offer, and review this paper.
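For context, here is a minimal sketch of the Normalized Google Distance formula, assuming the hit counts are already available. The function name `ngd` and its parameters are illustrative and not taken from the implementation mentioned above; fetching the counts (from a search engine or a local corpus) is typically the slow part and is left out here.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance between two terms x and y.

    fx, fy -- number of documents containing x, and containing y
    fxy    -- number of documents containing both x and y
    n      -- total number of documents indexed (assumed known)
    """
    # Terms that never occur, or never co-occur, are treated as infinitely far apart.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator <= 0:
        # Degenerate case: the rarer term appears in every document.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator
```

Caching the counts per term and per pair would avoid repeated lookups when comparing two groups of words, though whether that helps depends on where the time is actually spent.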

Thanks for your help. If anyone has other ideas, thanks.

CamperBot
@camperbot
Jan 16 2017 02:32
cuent sends brownie points to @mesmoiron and @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 325 | @evaristoc |http://www.freecodecamp.com/evaristoc
:cookie: 320 | @mesmoiron |http://www.freecodecamp.com/mesmoiron
Suzanne Atkinson
@AdventureBear
Jan 16 2017 04:37
Hey guys & gals, I published my first npm package to help create React component files via command line. Please check it out if you are using React and let me know what you think. https://github.com/AdventureBear/trot
Amelia
@apottr
Jan 16 2017 04:42
very cool! ty @AdventureBear !
CamperBot
@camperbot
Jan 16 2017 04:42
apottr sends brownie points to @adventurebear :sparkles: :thumbsup: :sparkles:
:cookie: 560 | @adventurebear |http://www.freecodecamp.com/adventurebear
Suzanne Atkinson
@AdventureBear
Jan 16 2017 04:46
Thanks! I hope someone else finds it useful too.
evaristoc
@evaristoc
Jan 16 2017 11:05

@cuent No worries! Nice project!

@cuent just out of curiosity:

The problem is that I want to disambiguate authors, and there are cases where I can't say they are different just by comparing their names; I need another attribute.

And the other attribute is...? Text written by the author? Are they scientific authors? Are you trying to work with citations?

I'm using some subjects, which also vary in language.

Does the project involve translations? Or do you mean the writing style? Or rather the content, assuming that the content is usually specific to the author?
Is Google Distance giving you good accuracy?

@AdventureBear Congrats!!!!
Hèlen Grives
@mesmoiron
Jan 16 2017 11:58
@cuent while searching I came across forensic linguistics. If you search for that you will get resources about author identification using textual properties. I haven't used any of the techniques yet (not my immediate goal); however, their insights might be valuable in creating textual attributes we haven't thought about. It is also used for plagiarism detection. For me it helps to get a fast overview of possibilities, workflow and areas to look at that might be helpful in tackling the problem at hand. I will definitely check out your resources :+1: for that.
Xavier Sumba
@cuent
Jan 16 2017 12:11
@evaristoc yes, mostly working with research papers. Attributes like affiliation, but the problem is that an author could belong to many affiliations and have published different publications in each one. That's why I wanna know if the author Juan Perez B. is the same as Juan Perez by comparing their subjects.
Yes, it involves translation: I have text in ES and EN. Yes, I have good results, but the time complexity is the problem.
@mesmoiron thanks I'll check!
CamperBot
@camperbot
Jan 16 2017 12:12
cuent sends brownie points to @mesmoiron :sparkles: :thumbsup: :sparkles:
:cookie: 321 | @mesmoiron |http://www.freecodecamp.com/mesmoiron
evaristoc
@evaristoc
Jan 16 2017 14:40

@cuent: I assume "subject (categories)" of the article ~= "keywords"

@koustuvsinha and I did something similar for another problem, without the translations. @koustuvsinha can you contact @cuent?

I think you are already heading in the right direction. Yes: you probably need better attributes. Institution is definitely one. If you can get additional data about Main Research Area per institution, that could help.

Another problem that you are confronting is that some authors are invited to participate in articles not related to their field. In that case, I suggest taking into account the importance of the position in the author list (if I am not wrong, the importance convention changes per field).

I have the impression that you might need other methods for a better evaluation but I guess you have to keep it simple? :)

Before getting fancier, try simpler solutions?
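As one example of a simpler baseline, here is a minimal sketch that scores two author records by the Jaccard overlap of their subject keywords. The helper names `normalize` and `jaccard` and the sample subjects are hypothetical, and this is only an assumption about what "simpler" could look like; it compares exact strings, so ES and EN subjects would need to be mapped to one language first.

```python
def normalize(words):
    """Lowercase and trim each keyword, dropping empty entries."""
    return {w.strip().lower() for w in words if w.strip()}


def jaccard(subjects_a, subjects_b):
    """Jaccard similarity of two keyword groups: |A & B| / |A | B|."""
    a, b = normalize(subjects_a), normalize(subjects_b)
    if not a and not b:
        # No keywords on either side: no evidence that the records match.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical subject lists for two records with similar names.
record_1 = ["Machine Learning", "NLP"]
record_2 = ["nlp", "Databases"]
score = jaccard(record_1, record_2)  # one shared keyword out of three distinct ones
```

A threshold on the score would still have to be chosen against labelled examples, and noisy or missing keywords will pull the score down.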

@mesmoiron forensic linguistics: interesting topic!

Xavier Sumba
@cuent
Jan 16 2017 14:47
@evaristoc Yes, it could be keywords from the papers or keywords written by the secretaries, or worse, sometimes they don't have keywords (we could use topic models to extract some). I'd like to have the best solution because afterwards I use those authors to find collaborative work. Yes, it could be a problem if I find subjects from another area. Besides that, the data is really noisy.
evaristoc
@evaristoc
Jan 16 2017 15:14
@cuent I don't remember well but I think we were working on a semi-supervised approach.
Hèlen Grives
@mesmoiron
Jan 16 2017 18:03
@evaristoc yes, ever since my forensic accounting course I've had a crush on forensic topics, especially because of the legal reasoning. It gives an interesting perspective on doing analytics and sharpens skills. It is also a nice topic which makes it easier to identify key researchers and other people. A professor once gave the advice that searching for nice areas can actually benefit you. With languages considered fun curricula :rage: entering the computational linguistics arena kills that argument. :gun: :japanese_ogre: :smile: