    phalexo
    @phalexo
    including the bots and known bad actors.
    Jesse Talavera-Greenberg
    @JesseTG
    Here's a catch; I also want to consider URLs and usernames that these bots commonly post. Should I consider those to be words?
    Also, not all bots will have the same number of tweets available. For some bots I might have hundreds, for others I might have tens. I don't know yet, the job is still running.
    phalexo
    @phalexo
    mark every tweet with a tag "bad dude" "maybe not bad"
    Jesse Talavera-Greenberg
    @JesseTG
    Technically I'm evaluating users, not tweets
    phalexo
    @phalexo
    Ignore URLs, they are too brittle.
    Jesse Talavera-Greenberg
    @JesseTG
    How so?
    phalexo
    @phalexo
    Strip URLs out, junk stuff. Just wastes time and space.
    Jesse Talavera-Greenberg
    @JesseTG
    But part of the process of spreading misinformation is posting links...
    phalexo
    @phalexo
    URLs will change all the time, they tell you nothing.
    And URLs have no natural location within English.
    it is junk.
    for what you want.
    Tweets are supposed to persuade, communicate a message to entice someone to click.
    Have to look for language patterns, not urls.
    Jesse Talavera-Greenberg
    @JesseTG
    I think I read in an article once that based on analysis by other people, a lot of these bots tend to post similar links (but I'll look again to be sure)
    phalexo
    @phalexo
    Their stuff does not work well or for long.
    Jesse Talavera-Greenberg
    @JesseTG
    "Their" being?
    phalexo
    @phalexo
    People who rely on urls to filter.
    Jesse Talavera-Greenberg
    @JesseTG
    I don't want to rely too heavily on URLs or @mentions. But I don't want to throw them out either.
    phalexo
    @phalexo
    Those algos break easily, thus "brittle"
    Jesse Talavera-Greenberg
    @JesseTG
    Ah, one other thing I forgot to consider. I have a list of users known to be bots...but I don't have a list of users known not to be bots.
    phalexo
    @phalexo
    You don't need doc2vec or word2vec to use mentions or urls.
    Jesse Talavera-Greenberg
    @JesseTG
    What would you suggest, then? The attributes of a given user I want to consider are:
    phalexo
    @phalexo
    You are missing the POINT.
    doc2vec allows one to assess semantic similarity.
    mentions and urls are junk data, which you could use separately if you want.
    For training you should strip them out.
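A minimal sketch of the stripping step described above, using only Python's standard `re` module. The function name `strip_urls_and_mentions` and the two regex patterns are illustrative assumptions, not something from gensim or from either participant; the removed URLs and mentions are returned separately so they can still be used as side features if wanted.

```python
import re

# Assumed patterns: a simple approximation of URLs and Twitter-style @mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def strip_urls_and_mentions(tweet):
    """Return (clean_text, urls, mentions) for a single tweet string."""
    urls = URL_RE.findall(tweet)
    without_urls = URL_RE.sub(" ", tweet)
    # Look for mentions only after URLs are gone, so an "@" inside a URL is not counted.
    mentions = MENTION_RE.findall(without_urls)
    text = MENTION_RE.sub(" ", without_urls)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split()), urls, mentions
```

The cleaned text would go into training, while the `urls` and `mentions` lists could feed a separate, simpler model.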
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo I must be missing it, then. Given that I want to evaluate users, would my document be all of their tweets (that I have available) concatenated?
    phalexo
    @phalexo
    A single user is likely to use the same vocabulary and similar abbreviations, etc... If one takes one user's tweets one expects them to aggregate close to each other, especially if about a similar subject like a political campaign.
    So, if the same user is using multiple accounts, one still expects that similarity to persist across accounts. It is not about 100% accuracy here.
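One possible reading of the exchange above is to treat each account as a single document built from all of its available tweets. The sketch below assumes tweets arrive as `(user_id, text)` pairs; the helper name `build_user_documents` and the tokenizing regex are hypothetical choices for illustration.

```python
import re
from collections import defaultdict

# Assumed noise pattern: URLs and @mentions are dropped before tokenizing.
NOISE_RE = re.compile(r"https?://\S+|www\.\S+|@\w+")
# Assumed token pattern: lowercase letters, digits, hashtags, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9#']+")


def build_user_documents(tweets):
    """Group (user_id, text) pairs into one token list per user."""
    docs = defaultdict(list)
    for user_id, text in tweets:
        cleaned = NOISE_RE.sub(" ", text).lower()
        docs[user_id].extend(TOKEN_RE.findall(cleaned))
    return dict(docs)
```

With gensim, each entry could then be wrapped as `TaggedDocument(words=tokens, tags=[user_id])` before training a `Doc2Vec` model, so that accounts with similar vocabulary end up with nearby vectors.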
    I am curious, how would you feel about a bot disseminating accurate, truthful information?
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo With respect to my own opinions, or with the project I'm working on? In my opinion, such a bot would be fine and worthy of existence. With respect to my project, I'm not interested in detecting them (I'm focusing on foreign agents trying to spread misinformation or pretend to be American citizens)
    @phalexo You could say that I'm technically interested in detecting fake accounts rather than fake people (since I understand that one person can run multiple fake accounts)
    ilmucio
    @ilmucio
    Are there any front-end web apps that work well with gensim? I wonder if there are open source projects that have built a search app that can easily integrate with gensim-type output.
    eugines
    @eugines
    hey, I was wondering if anyone might be familiar with the following error
    AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py'>
    it seems to pop up when I'm trying to load a pretrained model from the following project: https://github.com/sb1992/NETL-Automatic-Topic-Labelling-
    the gensim GitHub issue that I found seems to suggest that the error should've been fixed in a later version (I have the latest version of gensim currently)
    any help would be greatly appreciated!
    emnajaoua
    @emnajaoua
    I cannot install gensim without pip. I downloaded the tar.gz file, unzipped it, and installed the wheel as well, but when I run python setup.py test I get an error.
    Has anyone encountered this issue before?
    Radim Řehůřek
    @piskvorky
    @emnajaoua Did the installation from source proceed correctly? What's the error? What do you mean by "wheel" -- that's a binary distribution, not source. Btw the mailing list may be a better place :)
    @eugines that's an interesting project! I was not aware of that. Tweeted it out now :)
    Regarding your question: there has been some refactoring in the past 2-3 releases. We tried to maintain backward compatibility, but if NETL used some older version, there may have been an issue. @gojomo do you know whether the DocvecsArray class moved (and where)?
    If so, we should at least keep an alias in the old location, to maintain backward compatibility.
    phalexo
    @phalexo
    I had to retrain my model (Doc2Vec) from scratch because of the changes in gensim. That repository is 2 years old.
    matanster
    @matanster
    Do we have any convenience function (in gensim) for getting a word vector's length? (as a proxy for its frequency in the data the embedding was trained over)
    Of course this question is motivated by laziness....
    matanster
    @matanster
    ? :)
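For the vector-length question, a minimal sketch that computes the Euclidean (L2) norm by hand. It assumes the vector has already been looked up, for example via `model.wv[word]` in gensim; the helper name `vector_length` is hypothetical, and whether the norm is a good proxy for word frequency is the asker's premise rather than something checked here.

```python
import math


def vector_length(vec):
    """Euclidean (L2) norm of any iterable of numbers, e.g. a word vector."""
    return math.sqrt(sum(float(x) * float(x) for x in vec))
```

For NumPy arrays, `numpy.linalg.norm(vec)` computes the same quantity.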