    phalexo
    @phalexo
    This sounds like the problem posed on the experfy.com site.
    Jesse Talavera-Greenberg
    @JesseTG
    What do you mean?
    phalexo
    @phalexo
    Well, you have to experiment. Maybe there is a large repository of tweets. I'd train on that.
    Jesse Talavera-Greenberg
    @JesseTG
    I have lots of data already. Like, terabytes. My problem is deciding what to do with it.
    phalexo
    @phalexo
    There was a project posted in which they wanted to track groups of people by the language they use (slang).
    Jesse Talavera-Greenberg
    @JesseTG
    Specifically, I'm trying to detect whether or not a given user is a sock puppet or bot (because Russia screwing with our elections pisses me off)
    I have a list of known bots and I'm currently combing through my data to get some (but not all!) of the tweets made by these bots
    phalexo
    @phalexo
    Well, train on the general corpus first,
    including the bots and known bad actors.
    Jesse Talavera-Greenberg
    @JesseTG
    Here's a catch; I also want to consider URLs and usernames that these bots commonly post. Should I consider those to be words?
    Also, not all bots will have the same number of tweets available. For some bots I might have hundreds, for others I might have tens. I don't know yet, the job is still running.
    phalexo
    @phalexo
    Mark every tweet with a tag: "bad dude" or "maybe not bad".
    Jesse Talavera-Greenberg
    @JesseTG
    Technically I'm evaluating users, not tweets
    phalexo
    @phalexo
    Ignore URLs, they're too brittle.
    Jesse Talavera-Greenberg
    @JesseTG
    How so?
    phalexo
    @phalexo
    Strip URLs out, junk stuff. Just wastes time and space.
    Jesse Talavera-Greenberg
    @JesseTG
    But part of the process of spreading misinformation is posting links...
    phalexo
    @phalexo
    URLs change all the time; they tell you nothing.
    And URLs have no natural location within English.
    They are junk for what you want.
    Tweets are supposed to persuade, to communicate a message that entices someone to click.
    You have to look for language patterns, not URLs.
    Jesse Talavera-Greenberg
    @JesseTG
    I think I read in an article once that based on analysis by other people, a lot of these bots tend to post similar links (but I'll look again to be sure)
    phalexo
    @phalexo
    Their stuff does not work well or for long.
    Jesse Talavera-Greenberg
    @JesseTG
    "Their" being?
    phalexo
    @phalexo
    People who rely on URLs to filter.
    Jesse Talavera-Greenberg
    @JesseTG
    I don't want to rely too heavily on URLs or @mentions. But I don't want to throw them out either.
    phalexo
    @phalexo
    Those algos break easily, thus "brittle".
    Jesse Talavera-Greenberg
    @JesseTG
    Ah, one other thing I forgot to consider. I have a list of users known to be bots...but I don't have a list of users known not to be bots.
    phalexo
    @phalexo
    You don't need doc2vec or word2vec to use mentions or URLs.
    Jesse Talavera-Greenberg
    @JesseTG
    What would you suggest, then? The attributes of a given user I want to consider are:
    phalexo
    @phalexo
    You are missing the POINT.
    doc2vec allows one to assess semantic similarity.
    Mentions and URLs are junk data, which you could use separately if you want.
    For training you should strip them out.
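The advice above (strip URLs and mentions from the training text, but keep them around for separate use) could be sketched roughly as follows. This is only an illustration: the regular expressions, the `split_tweet` name, and the lowercase whitespace tokenization are assumptions, not anything specified in the conversation.

```python
import re

# Assumed patterns; real tweets may need something more careful.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def split_tweet(text):
    """Return (tokens, urls, mentions) for one tweet.

    tokens holds the text with URLs and @mentions removed, lowercased and
    split on whitespace. urls and mentions are returned separately so they
    can be used as their own features outside the language model.
    """
    urls = URL_RE.findall(text)
    no_urls = URL_RE.sub(" ", text)
    # Look for mentions only after URLs are gone, so "@" inside a link is ignored.
    mentions = MENTION_RE.findall(no_urls)
    cleaned = MENTION_RE.sub(" ", no_urls)
    return cleaned.lower().split(), urls, mentions
```

The tokens would go to doc2vec training, while the URL and mention lists stay available for whatever separate analysis seems useful.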
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo I must be missing it, then. Given that I want to evaluate users, would my document be all of their tweets (that I have available) concatenated?
    phalexo
    @phalexo
    A single user is likely to use the same vocabulary, similar abbreviations, etc. If one takes one user's tweets, one expects them to cluster close to each other, especially if they are about a similar subject like a political campaign.
    So, if the same user is using multiple accounts, one still expects that similarity to persist across accounts. It is not about 100% accuracy here.
    I am curious, how would you feel about a bot disseminating accurate, truthful information?
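One way to read the "one document per user" idea discussed above is to concatenate each account's tokenized tweets and tag the result with the account ID. The sketch below assumes gensim 4.x is installed; `build_user_documents` and `train_user_model` are hypothetical helper names, and the `vector_size`, `min_count`, and `epochs` values are placeholders rather than recommendations.

```python
from collections import defaultdict


def build_user_documents(tweets):
    """Group tokenized tweets into one document per account.

    tweets is an iterable of (user_id, token_list) pairs. The result is a
    list of (words, [user_id]) pairs, ordered by user_id.
    """
    words_by_user = defaultdict(list)
    for user_id, tokens in tweets:
        words_by_user[user_id].extend(tokens)
    return [(words_by_user[user_id], [user_id]) for user_id in sorted(words_by_user)]


def train_user_model(user_documents, vector_size=50, epochs=20):
    """Train a Doc2Vec model with one tagged document per account."""
    # Imported here so the grouping helper works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=words, tags=tags) for words, tags in user_documents]
    return Doc2Vec(documents=corpus, vector_size=vector_size, min_count=1, epochs=epochs)


# Hypothetical usage (gensim 4.x exposes document vectors as model.dv):
#   model = train_user_model(build_user_documents(tweets))
#   model.dv.most_similar("some_account_id")
```

Accounts whose vectors land near those of known bots would then be candidates for a closer look, not confirmed matches.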
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo With respect to my own opinions, or with the project I'm working on? In my opinion, such a bot would be fine and worthy of existence. With respect to my project, I'm not interested in detecting them (I'm focusing on foreign agents trying to spread misinformation or pretend to be American citizens)
    @phalexo You could say that I'm technically interested in detecting fake accounts rather than fake people (since I understand that one person can run multiple fake accounts)
    ilmucio
    @ilmucio
    Is there a front-end web app that works well with gensim? I wonder if there is an open source project that has built a search app that can easily integrate with gensim-type output.
    eugines
    @eugines
    hey, I was wondering if anyone might be familiar with the following error
    AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py'>
    it seems to pop up when I'm trying to load a pretrained model from the following project: https://github.com/sb1992/NETL-Automatic-Topic-Labelling-
    the gensim GitHub issue that I found seems to suggest that the error should've been fixed in a later version (I have the latest version of gensim currently)
    any help would be greatly appreciated!
    emnajaoua
    @emnajaoua
    I cannot install gensim without pip. I downloaded the tar.gz file, unzipped it, and installed the wheel as well, but when I run python setup.py test I get an error.
    Has anyone encountered this issue before?