    phalexo
    @phalexo
    And URLs have no natural location within English.
    It is junk for what you want.
    Tweets are supposed to persuade, communicate a message to entice someone to click.
    Have to look for language patterns, not urls.
    Jesse Talavera-Greenberg
    @JesseTG
    I think I read in an article once that based on analysis by other people, a lot of these bots tend to post similar links (but I'll look again to be sure)
    phalexo
    @phalexo
    Their stuff does not work well or for long.
    Jesse Talavera-Greenberg
    @JesseTG
    "Their" being?
    phalexo
    @phalexo
    People who rely on urls to filter.
    Jesse Talavera-Greenberg
    @JesseTG
    I don't want to rely too heavily on URLs or @mentions. But I don't want to throw them out either.
    phalexo
    @phalexo
    Those algos break easily, thus "brittle"
    Jesse Talavera-Greenberg
    @JesseTG
    Ah, one other thing I forgot to consider. I have a list of users known to be bots...but I don't have a list of users known not to be bots.
    phalexo
    @phalexo
    You don't need doc2vec or word2vec to use mentions or urls.
    Jesse Talavera-Greenberg
    @JesseTG
    What would you suggest, then? The attributes of a given user I want to consider are:
    phalexo
    @phalexo
    You are missing the POINT.
    doc2vec allows one to assess semantic similarity.
    mentions and urls are junk data, which you could use separately if you want.
    For training you should strip them out.
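    A minimal sketch of the stripping step described above, using plain regular expressions (the patterns and function name are illustrative assumptions, not a gensim API):

    ```python
    import re

    # Illustrative patterns; real-world tweet URLs and mentions may need more care.
    URL_RE = re.compile(r"https?://\S+")
    MENTION_RE = re.compile(r"@\w+")

    def strip_urls_and_mentions(tweet):
        """Remove URLs and @mentions before training doc2vec on tweet text."""
        cleaned = MENTION_RE.sub("", URL_RE.sub("", tweet))
        return " ".join(cleaned.split())  # collapse leftover whitespace

    print(strip_urls_and_mentions("RT @bot123 check this out https://t.co/abc now"))
    # -> "RT check this out now"
    ```

    The stripped-out URLs and mentions can still be kept aside as separate features if wanted.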
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo I must be missing it, then. Given that I want to evaluate users, would my document be all of their tweets (that I have available) concatenated?
    phalexo
    @phalexo
    A single user is likely to use the same vocabulary and similar abbreviations, etc... If one takes one user's tweets one expects them to aggregate close to each other, especially if about a similar subject like a political campaign.
    So, if the same user is using multiple accounts, one still expects to see that similarity to persist across accounts. It is not about 100% accuracy here.
    I am curious, how would you feel about a bot disseminating accurate, truthful information?
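    The per-user similarity idea above can be sketched with gensim's Doc2Vec. The user names and tweets below are made up, and with a toy corpus this small the similarity values are noisy (`model.dv` is the gensim 4.x accessor; in the 3.x releases current at the time of this chat it was `model.docvecs`):

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Hypothetical per-user documents: all of a user's tweets concatenated,
    # tagged with the user id.
    user_tweets = {
        "user_a": "vote now polls close soon vote early".split(),
        "user_b": "vote today polls are open vote now".split(),
        "user_c": "my cat sat on the keyboard again".split(),
    }
    docs = [TaggedDocument(words, [user]) for user, words in user_tweets.items()]

    model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=40, seed=1)

    # Cosine similarity between per-user vectors; accounts run by the same
    # operator should tend to land closer together.
    print(model.dv.similarity("user_a", "user_b"))
    print(model.dv.similarity("user_a", "user_c"))
    ```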
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo With respect to my own opinions, or with the project I'm working on? In my opinion, such a bot would be fine and worthy of existence. With respect to my project, I'm not interested in detecting them (I'm focusing on foreign agents trying to spread misinformation or pretend to be American citizens)
    @phalexo You could say that I'm technically interested in detecting fake accounts rather than fake people (since I understand that one person can run multiple fake accounts)
    ilmucio
    @ilmucio
    Are there any front-end web apps that work well with gensim? I wonder if there are open-source projects that have built a search app that can easily integrate with gensim-style output?
    eugines
    @eugines
    hey, I was wondering if anyone might be familiar with the following error
    AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py'>
    it seems to pop up when I'm trying to load a pretrained model from the following project: https://github.com/sb1992/NETL-Automatic-Topic-Labelling-
    the gensim GitHub issue that I found seems to suggest that the error should've been fixed in a later version (I have the latest version of gensim currently)
    any help would be greatly appreciated!
    emnajaoua
    @emnajaoua
    I cannot install gensim via a non-pip installation. I downloaded the tar.gz file, unzipped it, and installed the wheel as well, but when I run python setup.py test I get an error
    Has anyone encountered this issue before?
    Radim Řehůřek
    @piskvorky
    @emnajaoua Did the installation from source proceed correctly? What's the error? What do you mean by "wheel" -- that's a binary distribution, not source. Btw the mailing list may be a better place :)
    @eugines that's an interesting project! I was not aware of that. Tweeted it out now :)
    Regarding your question: there has been some refactoring in the past 2-3 releases. We tried to maintain backward compatibility, but if NETL used some older version, there may have been an issue. @gojomo do you know whether the DocvecsArray class moved (and where)?
    If so, we should at least keep an alias in the old location, to maintain backward compatibility.
    phalexo
    @phalexo
    I had to retrain my model (Doc2Vec) from scratch because of the changes in gensim. That repository is 2 years old.
    matanster
    @matanster
    Do we have any convenience function (in gensim) for getting a word vector's length? (as a proxy for its frequency in the data the embedding was trained over)
    Of course this question is motivated by laziness....
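    There doesn't appear to be a dedicated gensim helper for this, but the Euclidean length of a word vector is a one-liner in NumPy (the vector below is a made-up stand-in for something like `model.wv["word"]`):

    ```python
    import numpy as np

    # Stand-in for a trained word vector, e.g. model.wv["word"] in gensim.
    vec = np.array([3.0, 4.0])

    # Euclidean (L2) length; vector norms loosely track word frequency,
    # since frequent words receive more training updates.
    length = np.linalg.norm(vec)
    print(length)  # -> 5.0
    ```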
    matanster
    @matanster
    ? :)
    phalexo
    @phalexo
    Apparently laziness is not the mother of all invention as I thought. You have wasted 2 days waiting for someone else to do your work.
    matanster
    @matanster
    @phalexo no I have not, I was just curious how deep the gensim toolchain goes
    @phalexo thanks for being aggressive though
    Now ignoring your snarky comment, I'll post another question for others if they care to respond
    I am probably not getting the point of corpus2dense().
    Don't people typically want to deal with sparse matrices in language processing? e.g. when training classification models over sparse language data; for example you typically can't efficiently train/fit a model on a large bag-of-wordish dense matrix in sklearn, without huge memory. What am I missing about it?
    estathop
    @estathop
    Greetings chat, it's nice to have joined you. I have a relatively simple question, I suppose. I want to use tf-idf: I executed the example in the comments and the code successfully returned a vector of (word index, weight) pairs. How can I map a word's index back to the actual word? Is there a function I am not aware of, or can I implement it somehow?
    estathop
    @estathop
    I am seeing that sklearn has this method implemented already
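    Assuming the tf-idf pipeline uses a gensim Dictionary, the integer ids map back to words by indexing the dictionary itself (the toy corpus below is made up):

    ```python
    from gensim import corpora, models

    docs = [["human", "computer", "interaction"],
            ["computer", "system", "interface"],
            ["human", "system"]]

    dictionary = corpora.Dictionary(docs)              # token <-> integer id
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    tfidf = models.TfidfModel(corpus)

    # Map each (term_id, weight) pair back to the actual word.
    for term_id, weight in tfidf[corpus[0]]:
        print(dictionary[term_id], round(weight, 3))
    ```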
    Radim Řehůřek
    @piskvorky
    @matanster for NumPy vectors, you can get their length using vector.shape. See the # numpy vector of a word example at https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors
    @matanster regarding corpus2dense: it's true NLP produces sparse matrices using the bag-of-words representation. But some later transformations, such as LSI, transform the sparse vectors into a dense space. So using a dense representation actually saves you memory (less overhead than representing a dense matrix using sparse structures).
    Plus, there are many other external tools in the Data Science ecosystem that still cannot handle sparse inputs. So you have to export/supply a dense structure => need corpus2dense. Even scikit-learn didn't have sparse support for a while ;)
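    For reference, a minimal corpus2dense round-trip on toy data; gensim.matutils.corpus2dense returns a (num_terms x num_docs) NumPy array:

    ```python
    from gensim import corpora, matutils

    docs = [["cat", "sat"], ["cat", "cat", "mat"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # Sparse bag-of-words -> dense (num_terms x num_docs) matrix,
    # e.g. for external tools that cannot consume sparse input.
    dense = matutils.corpus2dense(corpus, num_terms=len(dictionary))
    print(dense.shape)  # -> (3, 2)
    ```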