    Dennis.Chen
    @DennisChen0307
    Hi there. Are there any roadmaps for the next release of gensim?
    AMaini503
    @AMaini503
    Should I expect Doc2Vec to use all the cores if I pass workers = #cpus?
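The short answer, hedged since it depends on gensim version and how the corpus is fed: `workers` sets the number of training threads, but with a plain Python iterable the corpus is fed from a single thread, so all cores often won't saturate (the `corpus_file` input path scales better in newer versions). A minimal sketch, assuming the gensim 4.x API and a made-up toy corpus:

```python
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny made-up corpus; real training needs far more data.
corpus = [TaggedDocument(words=["some", "tokens"], tags=[0]),
          TaggedDocument(words=["more", "tokens"], tags=[1])]

# `workers` sets the training thread count. With an ordinary Python
# iterable, corpus iteration stays single-threaded, so CPU utilisation
# can plateau below workers * 100%.
model = Doc2Vec(corpus, workers=multiprocessing.cpu_count(),
                vector_size=50, min_count=1)
```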
    matanster
    @matanster
    Apologies for adding a 4th question in a row here...
    Does gensim have anything built in for transforming a document into a bag-of-n-grams representation, or does it in fact only do bag of words? (words being 1-grams...)
    Radim Řehůřek
    @piskvorky
    @matanster Gensim doesn't actually do the transformation; it already expects (feature_id, weight) bag-of-whatever pairs on input. How you split the documents into words/n-grams/something else is up to you.
    @DennisChen0307 our aim is one release per month, but the last few months have been busy at RARE, with not much time for open source. We plan a release for the end of this month.
    matanster
    @matanster
    @piskvorky oh, sorry then. I just thought maybe corpora.Dictionary.doc2bow might have a usage form for that... I could swear I saw it computing the bag-of-words in my code, but I should probably start reading the source to answer my own questions.
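For illustration, a minimal sketch of the split piskvorky describes, with made-up tokens: `Dictionary.doc2bow` only counts whatever tokens it is handed, so feeding it n-gram tokens yields a bag of n-grams. The closest built-in helper is `gensim.models.Phrases`, which promotes frequent collocations to single tokens, though that isn't a general bag-of-n-grams transform.

```python
from gensim.corpora import Dictionary

docs = [["new", "york", "is", "big"],
        ["i", "love", "new", "york"]]

def with_bigrams(tokens):
    # Append 2-grams alongside the original 1-grams; gensim itself
    # never splits text, it only counts the tokens you provide.
    return tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]

ngram_docs = [with_bigrams(d) for d in docs]
dictionary = Dictionary(ngram_docs)
bows = [dictionary.doc2bow(d) for d in ngram_docs]
print(bows[0])  # (feature_id, count) pairs, including bigram features
```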
    Jesse Talavera-Greenberg
    @JesseTG
    When training a word2vec model, I need to give it a list of documents. How does word2vec treat unknown words? By giving an unknown word a vector close to a known word?
    phalexo
    @phalexo
    It does an initial pass, compiling a corpus dictionary. If a word does not make it into the dictionary, I believe it is totally ignored thereafter.
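A minimal sketch of that behaviour, assuming a recent gensim and made-up toy sentences: words below `min_count` never enter the vocabulary, and out-of-vocabulary lookups fail rather than return a vector close to a known word.

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["hello", "gensim"]]

# min_count controls the vocabulary cut-off made during the initial
# pass; min_count=1 keeps every word seen at least once.
model = Word2Vec(sentences, min_count=1)

print("hello" in model.wv)   # True: made it into the vocabulary
print("unseen" in model.wv)  # False: no vector exists for it
# model.wv["unseen"]         # would raise KeyError, not return a
                             # vector "close to" a known word
```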
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo So wait, how can I use word2vec to analyze Tweets if slang and trending topics are always changing? Or am I missing something?
    phalexo
    @phalexo
    You would have to continue training. Maybe there is a way to update the dictionary.
    In any case, with Twitter you have a huge problem because people abbreviate everything, make up their own words, etc...
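There is a way, hedged on gensim version: `Word2Vec.build_vocab` accepts `update=True` to extend an existing vocabulary, after which training can continue. A minimal sketch with made-up tweets:

```python
from gensim.models import Word2Vec

base = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(base, min_count=1)

# New tweets arrive carrying previously unseen slang.
new_sentences = [["covfefe", "lol"], ["hello", "covfefe"]]

# update=True adds the new words to the existing vocabulary...
model.build_vocab(new_sentences, update=True)
# ...so further training can learn vectors for them too.
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)
```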
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo What would you suggest?
    phalexo
    @phalexo
    This sounds like the problem posed on the experfy.com site.
    Jesse Talavera-Greenberg
    @JesseTG
    What do you mean?
    phalexo
    @phalexo
    Well, you have to experiment. Maybe there is a large repository of tweets. I'd train on that.
    Jesse Talavera-Greenberg
    @JesseTG
    I have lots of data already. Like, terabytes. My problem is deciding what to do with it.
    phalexo
    @phalexo
    There was a project posted in which they wanted to track groups of people by the language they use (slang).
    Jesse Talavera-Greenberg
    @JesseTG
    Specifically, I'm trying to detect whether or not a given user is a sock puppet or bot (because Russia screwing with our elections pisses me off)
    I have a list of known bots and I'm currently combing through my data to get some (but not all!) of the tweets made by these bots
    phalexo
    @phalexo
    Well, train on the general corpus first,
    including the bots and known bad actors.
    Jesse Talavera-Greenberg
    @JesseTG
    Here's a catch: I also want to consider URLs and usernames that these bots commonly post. Should I consider those to be words?
    Also, not all bots will have the same number of tweets available. For some bots I might have hundreds, for others I might have tens. I don't know yet; the job is still running.
    phalexo
    @phalexo
    Mark every tweet with a tag: "bad dude", "maybe not bad".
    Ignore URLs; too brittle.
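In Doc2Vec terms that means wrapping each tweet in a `TaggedDocument` whose tags carry the label. A minimal sketch with made-up tweets and placeholder labels, assuming the gensim 4.x API where tag vectors live in `model.dv`:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw = [("rt follow me click here", "bad_dude"),
       ("nice weather in town today", "maybe_not_bad")]

# Each tweet becomes tokens plus a label tag; Doc2Vec learns one
# vector per distinct tag.
tagged = [TaggedDocument(words=text.split(), tags=[label])
          for text, label in raw]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
print(model.dv["bad_dude"])  # vector learned for the "bad_dude" tag
```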
    Jesse Talavera-Greenberg
    @JesseTG
    Technically I'm evaluating users, not tweets
    How so?
    phalexo
    @phalexo
    Strip URLs out; they're junk. They just waste time and space.
    Jesse Talavera-Greenberg
    @JesseTG
    But part of the process of spreading misinformation is posting links...
    phalexo
    @phalexo
    URLs will change all the time; they tell you nothing.
    And URLs have no natural location within English.
    It's junk for what you want.
    Tweets are supposed to persuade, to communicate a message that entices someone to click.
    You have to look for language patterns, not URLs.
    Jesse Talavera-Greenberg
    @JesseTG
    I think I read in an article once that, based on analysis by other people, a lot of these bots tend to post similar links (but I'll look again to be sure).
    phalexo
    @phalexo
    Their stuff does not work well or for long.
    Jesse Talavera-Greenberg
    @JesseTG
    "Their" being?
    phalexo
    @phalexo
    People who rely on URLs to filter.
    Jesse Talavera-Greenberg
    @JesseTG
    I don't want to rely too heavily on URLs or @mentions. But I don't want to throw them out either.
    phalexo
    @phalexo
    Those algos break easily; thus "brittle".
    Jesse Talavera-Greenberg
    @JesseTG
    Ah, one other thing I forgot to consider. I have a list of users known to be bots...but I don't have a list of users known not to be bots.
    phalexo
    @phalexo
    You don't need doc2vec or word2vec to use mentions or urls.
    Jesse Talavera-Greenberg
    @JesseTG
    What would you suggest, then? The attributes of a given user I want to consider are:
    phalexo
    @phalexo
    You are missing the POINT.
    doc2vec allows one to assess semantic similarity.
    Mentions and URLs are junk data, which you could use separately if you want.
    For training you should strip them out.
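A minimal preprocessing sketch along those lines: strip URLs and @mentions before training, but keep them aside as separate features if wanted. The regexes are rough approximations, not robust tweet parsers:

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def split_tweet(text):
    """Return (tokens for training, URLs and mentions kept separately)."""
    urls = URL_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    cleaned = MENTION_RE.sub(" ", URL_RE.sub(" ", text))
    return cleaned.lower().split(), urls, mentions

tokens, urls, mentions = split_tweet(
    "Vote now! https://example.com/poll @someuser")
print(tokens)    # ['vote', 'now!']
print(urls)      # ['https://example.com/poll']
print(mentions)  # ['@someuser']
```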
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo I must be missing it, then. Given that I want to evaluate users, would my document be all of their tweets (that I have available) concatenated?
    phalexo
    @phalexo
    A single user is likely to use the same vocabulary and similar abbreviations, etc. If one takes one user's tweets, one expects them to aggregate close to each other, especially if they're about a similar subject like a political campaign.
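To check that aggregation claim empirically, one could train on tagged tweets and compare cosine similarities of inferred vectors. A minimal sketch with made-up users and tweets, assuming the gensim 4.x API (results will be noisy on a toy corpus this small):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets_by_user = {
    "user_a": [["vote", "now", "maga"], ["maga", "vote", "today"]],
    "user_b": [["cute", "cat", "pics"], ["look", "at", "this", "cat"]],
}

# One document per tweet, tagged with a unique per-tweet ID.
docs = [TaggedDocument(words=toks, tags=[f"{user}_{i}"])
        for user, tweets in tweets_by_user.items()
        for i, toks in enumerate(tweets)]
model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=50)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = {user: [model.infer_vector(t) for t in tweets]
        for user, tweets in tweets_by_user.items()}

# If same-user tweets really cluster, this should tend to be higher...
print(cos(*vecs["user_a"]))
# ...than this cross-user similarity.
print(cos(vecs["user_a"][0], vecs["user_b"][0]))
```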