    Evgeny Denisov
    @eIGato
    F@#$%! All that time I was inferring documents from a generator. A one-time generator. They did not re-infer at all. Bleen.
    Evgeny Denisov
    @eIGato
    [image: image.png]
    This is the similarity between the first inference and alpha = min_alpha = 0.19, steps = 4500.
    Evgeny Denisov
    @eIGato
    Replaced the document vectors with copies of the inferred ones, and re-inferred. Similarity is about 1.0, like before, because infer_vector() doesn't use the old docvecs at all.
    But I still don't know if it makes sense.
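A minimal sketch of the effect being described, with toy data; parameter names follow gensim 4.x (older releases spell vector_size as size and the inference epochs as steps), so treat the exact keywords as assumptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# Tiny toy corpus, just enough to build a model.
docs = [TaggedDocument(words=["some", "tokens", "here"], tags=[0]),
        TaggedDocument(words=["other", "tokens", "there"], tags=[1])]
d2v = Doc2Vec(docs, vector_size=10, min_count=1, epochs=20)

words = ["some", "tokens", "here"]

# infer_vector() starts from a fresh random vector and never reads the stored
# docvecs, so replacing the stored vectors cannot change what it returns.
v1 = d2v.infer_vector(words, alpha=0.19, min_alpha=0.19, epochs=4500)
v2 = d2v.infer_vector(words, alpha=0.19, min_alpha=0.19, epochs=4500)

# Cosine similarity between two independent inferences of the same text.
sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(sim)
```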
    phalexo
    @phalexo
    I don't either. Makes no sense to me. Considering that inferred vectors should be different based on parameters, it would seem odd.
    Evgeny Denisov
    @eIGato
    d2v.sample = 1.0 / word_count
    Is it reasonable?
    Evgeny Denisov
    @eIGato
    word_count is the number of distinct words (roughly, the length of d2v.wv.index2word).
    Evgeny Denisov
    @eIGato
    How to pick a reasonable sample value?
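For reference, a hedged sketch of where that heuristic would plug in: sample is the threshold for downsampling very frequent words, gensim's default is 1e-3, and the original word2vec paper suggests around 1e-5 for large corpora; the 1.0 / distinct-word-count rule above is the poster's own idea, not a library recommendation.

```python
from gensim.models.doc2vec import Doc2Vec

# Assumed number of distinct words; in practice this would be
# len(d2v.wv.index2word) after the vocabulary has been built.
distinct_words = 50_000

d2v = Doc2Vec(
    vector_size=100,
    sample=1.0 / distinct_words,  # the heuristic from the chat; common values are 1e-5 .. 1e-3
    min_count=5,
    workers=4,
)
```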
    Evgeny Denisov
    @eIGato
    What if 90% of the words are distinct from each other? Is it possible to train a *2Vec model on such a corpus?
    phalexo
    @phalexo
    Clearly that is not a natural language application.
    Evgeny Denisov
    @eIGato
    @phalexo The purpose is to predict phrases, not words. So I use phrases as d2v words, and full texts as d2v docs.
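A minimal sketch of that setup, assuming each text has already been segmented into phrases (the data and names here are made up):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each "word" fed to Doc2Vec is actually a whole phrase; each document is a full text.
texts_as_phrases = [
    ["new york city", "is great", "in the summer"],
    ["machine learning", "is fun", "in the summer"],
]

corpus = [TaggedDocument(words=phrases, tags=[i])
          for i, phrases in enumerate(texts_as_phrases)]
d2v = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
```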
    Saurabh Vyas
    @saurabhvyas
    Is there a pretrained LDA model available for gensim, just for tinkering?
    Matan Shenhav
    @ixxie
    @saurabhvyas can LDA even be used in a supervised mode?
    Anyway, it's been pretty easy for us to train and predict on a given data set.
    Dennis.Chen
    @DennisChen0307
    Hi there. Is there a roadmap for the next release of gensim?
    AMaini503
    @AMaini503
    Should I expect Doc2Vec to use all the cores if I pass workers = #cpus ?
    matanster
    @matanster
    Apologies for adding a 4th question in a row here...
    Does gensim have anything built-in for transforming a document to a bag-of-n-grams representation, or does it in fact only do bag of words? (words being 1-grams...)
    Radim Řehůřek
    @piskvorky
    @matanster Gensim doesn't actually do the transformation; it already expects (feature_id, weight) bag-of-whatever pairs on input. How you split the documents into words/ngrams/something else is up to you.
    @DennisChen0307 Our aim is one release per month, but the last few months have been busy at RARE, with not much time for open source. We plan a release for the end of this month.
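A hedged sketch of doing the n-gram split yourself and then letting Dictionary/doc2bow do the counting; the bigram helper is just an illustration, not a gensim API:

```python
from gensim.corpora import Dictionary

def to_bigrams(tokens):
    # Turn a token list into bigram "features"; gensim doesn't care what the
    # features are, only that each document is a list of hashable items.
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

docs = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
bigram_docs = [to_bigrams(d) for d in docs]

dictionary = Dictionary(bigram_docs)
bows = [dictionary.doc2bow(d) for d in bigram_docs]  # lists of (feature_id, count)
print(bows)
```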
    matanster
    @matanster
    @piskvorky oh, sorry then, I just thought maybe corpora.Dictionary.doc2bow might have some usage form for that... I could swear I saw it computing the bag-of-words in my code, but I should probably start reading the source to answer my own questions
    Jesse Talavera-Greenberg
    @JesseTG
    When training a word2vec model, I need to give it a list of documents. How does word2vec treat unknown words? By giving an unknown word a vector close to a known word?
    phalexo
    @phalexo
    It does an initial pass, compiling a corpus dictionary. If a word does not make it into the dictionary, I believe it is totally ignored thereafter.
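A small sketch of checking that behaviour, assuming gensim 4.x attribute names (key_to_index; older versions expose wv.vocab instead):

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, vector_size=10, min_count=1)

# Words that never made it into the vocabulary are simply skipped during
# training and are not present for lookup afterwards.
print("hello" in model.wv.key_to_index)    # True
print("covfefe" in model.wv.key_to_index)  # False
```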
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo So wait, how can I use word2vec to analyze Tweets if slang and trending topics are always changing? Or am I missing something?
    phalexo
    @phalexo
    You would have to continue training. Maybe there is a way to update the dictionary.
    In any case, with Twitter you have a huge problem because people abbreviate everything, make up their own words, etc...
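For what it's worth, gensim does let you grow the vocabulary and continue training; a hedged sketch, leaving aside whether the incrementally learned vectors are as good as a full retrain:

```python
from gensim.models import Word2Vec

model = Word2Vec([["old", "tweets", "here"]], vector_size=10, min_count=1)

new_tweets = [["fresh", "slang", "here"]]
model.build_vocab(new_tweets, update=True)   # add newly seen words to the vocabulary
model.train(new_tweets, total_examples=len(new_tweets), epochs=model.epochs)
```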
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo What would you suggest?
    phalexo
    @phalexo
    This sounds like the problem posed on the experfy.com site.
    Jesse Talavera-Greenberg
    @JesseTG
    What do you mean?
    phalexo
    @phalexo
    Well, you have to experiment. Maybe there is a large repository of tweets. I'd train on that.
    Jesse Talavera-Greenberg
    @JesseTG
    I have lots of data already. Like, terabytes. My problem is deciding what to do with it.
    phalexo
    @phalexo
    There was a project posted in which they wanted to track groups of people by the language they use (slang).
    Jesse Talavera-Greenberg
    @JesseTG
    Specifically, I'm trying to detect whether or not a given user is a sock puppet or bot (because Russia screwing with our elections pisses me off)
    I have a list of known bots and I'm currently combing through my data to get some (but not all!) of the tweets made by these bots
    phalexo
    @phalexo
    Well, train on the general corpus first,
    including the bots and known bad actors.
    Jesse Talavera-Greenberg
    @JesseTG
    Here's a catch: I also want to consider URLs and usernames that these bots commonly post. Should I consider those to be words?
    Also, not all bots will have the same amount of tweets available. For some bots I might have hundreds, for others I might have tens. I don't know yet, the job is still running
    phalexo
    @phalexo
    Mark every tweet with a tag: "bad dude" or "maybe not bad".
    Jesse Talavera-Greenberg
    @JesseTG
    Technically I'm evaluating users, not tweets
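One way to reconcile the two (per-tweet documents, user-level judgments) is to tag each tweet with both its user id and a coarse label; a hedged sketch with made-up data:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets = [
    ("user_123", "known_bot", ["click", "this", "link", "now"]),
    ("user_456", "unlabeled", ["lovely", "weather", "today"]),
]

# Each tweet is one document; tagging it with the user id and a label lets the
# model learn a vector per user as well as per label.
corpus = [TaggedDocument(words=toks, tags=[user, label])
          for user, label, toks in tweets]
d2v = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
```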
    phalexo
    @phalexo
    Ignore URLs; too brittle.
    Jesse Talavera-Greenberg
    @JesseTG
    How so?
    phalexo
    @phalexo
    Strip URLs out; they're junk. They just waste time and space.
    Jesse Talavera-Greenberg
    @JesseTG
    But part of the process of spreading misinformation is posting links...
    phalexo
    @phalexo
    URLs will change all the time; they tell you nothing.
    And URLs have no natural location within English.
    It is junk for what you want.
    Tweets are supposed to persuade, to communicate a message that entices someone to click.
    You have to look for language patterns, not URLs.
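A minimal sketch of the kind of preprocessing being suggested, stripping URLs and @mentions before tokenizing; the regexes are rough illustrations, not a robust tweet tokenizer:

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    # Drop URLs and @mentions, keep only the language patterns.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return text.lower().split()

print(clean_tweet("Check this out https://t.co/abc123 @someone #election"))
```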