    Kataev Egor
    @Lavabar
    thank you)
    phalexo
    @phalexo
    That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
    Ramin
    @transfluxus
    @Lavabar @phalexo I think the document is often chunked into lines, with one sentence per line
    phalexo
    @phalexo
    Sure, but one would lose some information about the relationships between adjacent thoughts/ideas. Even a hybrid approach could be used, feeding in both self-contained documents and individual sentences.
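    A minimal sketch of that hybrid approach, assuming gensim's Doc2Vec/TaggedDocument API; the tags and tokenization below are illustrative only:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = ["the cat sat on the mat. it purred.",
             "dogs bark at strangers."]

    corpus = []
    for i, text in enumerate(texts):
        tokens = text.lower().replace(".", "").split()
        corpus.append(TaggedDocument(tokens, ["doc_%d" % i]))      # whole document
        for j, sent in enumerate(text.split(".")):                 # plus each sentence
            if sent.strip():
                corpus.append(TaggedDocument(sent.lower().split(), ["doc_%d_sent_%d" % (i, j)]))

    model = Doc2Vec(corpus, size=50, min_count=1, iter=20)  # newer gensim renames size/iter to vector_size/epochs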
    Ramin
    @transfluxus
    Hi, I just found that of all the en_core_web models, only the small one detects stopwords?
    Evgeny Denisov
    @eIGato
    Hi guys. I've thought up a hack for Doc2Vec inference, but I don't know if it makes any sense.
    The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document.
    The hack is that after the bulk training I just re-infer all vectors and replace all the document vectors with the inferred ones.
    Does that make sense?
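    For the record, a rough sketch of that hack, assuming the documents were tagged with the integers 0..N-1 (so that row i of doctag_syn0 belongs to document i) and that corpus is a re-iterable list of TaggedDocument objects; the inference parameters are placeholders:
    for i, doc in enumerate(corpus):
        # re-infer the vector of an already-trained document...
        new_vec = d2v.infer_vector(doc.words, alpha=0.01, min_alpha=0.0001, steps=1000)
        # ...and overwrite the bulk-trained row in place
        d2v.docvecs.doctag_syn0[i] = new_vec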
    phalexo
    @phalexo
    And if you do it a second time, all the inferred vectors are going to be different again, because you changed the model.
    Do it a few hundred times, maybe it will converge to something stable. :-)
    Evgeny Denisov
    @eIGato
    @phalexo you are wrong about it.
    In [13]: result = [np.dot(old_infer, new_infer) for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]  # cosine similarities of unit-normed vectors
    In [14]: hist = plt.hist(result, bins=sim_borders)
    In [15]: plt.show(block=False)
    [image.png: histogram of the similarity values]
    I've calculated the similarity of the first and second inference: it's about 1.0 for all the vectors.
    phalexo
    @phalexo
    What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
    Evgeny Denisov
    @eIGato
    @phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha. d2v.min_alpha = 0
    phalexo
    @phalexo
    What are the actual values? I don't know what you have in your variables.
    Evgeny Denisov
    @eIGato
    print(d2v.alpha, d2v.iter ** 2)
    0.00221920956360714 25
    phalexo
    @phalexo
    To get a reasonable inferred vector, steps should be around 500-1000.
    Evgeny Denisov
    @eIGato
    @phalexo Tried alpha = min_alpha = 0.19, steps = 4500. Got same result. That's weird.
    phalexo
    @phalexo
    Try alpha = 0.01, min_alpha = 0.0001
    with steps = 1000
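    As a concrete call, that suggestion would look like this (tokens being the document's word list):
    vec = d2v.infer_vector(tokens, alpha=0.01, min_alpha=0.0001, steps=1000)  # alpha decays linearly to min_alpha over the steps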
    Evgeny Denisov
    @eIGato
    @phalexo Same result((
    phalexo
    @phalexo
    Is this from before your tweaking of the vectors?
    Evgeny Denisov
    @eIGato
    No. I do discard the bulk-trained vectors.
    This is the similarity between the first inference (steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha) and the last result from the interactive shell.
    Evgeny Denisov
    @eIGato
    F@#$%! All that time I was inferring documents from a generator, a one-shot generator. They were never re-inferred at all. Damn.
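    For anyone reading along, the pitfall in a nutshell: a generator is exhausted after one pass, so a second round of inference over it silently does nothing. A list avoids this (corpus.txt is a stand-in for the real data):
    docs = (line.split() for line in open("corpus.txt"))   # one-shot generator
    first = [d2v.infer_vector(d) for d in docs]            # consumes the generator
    second = [d2v.infer_vector(d) for d in docs]           # empty list: nothing was re-inferred
    docs = [line.split() for line in open("corpus.txt")]   # a list can be iterated again and again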
    Evgeny Denisov
    @eIGato
    [image.png: histogram of the similarity values]
    This is the similarity between the first inference and alpha = min_alpha = 0.19, steps = 4500.
    Evgeny Denisov
    @eIGato
    Replaced the document vectors with a copy of the inferred ones, and re-inferred. The similarity is about 1.0, like before, because infer_vector() doesn't use the old docvecs at all.
    But I still don't know if it makes sense.
    phalexo
    @phalexo
    I don't either; it makes no sense to me. Considering that inferred vectors should differ depending on the parameters, getting the same result seems odd.
    Evgeny Denisov
    @eIGato
    d2v.sample = 1.0 / word_count
    Is it reasonable?
    Evgeny Denisov
    @eIGato
    word_count is the number of distinct words (i.e. the prospective length of d2v.wv.index2word).
    Evgeny Denisov
    @eIGato
    How do I pick a reasonable sample value?
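    For context, gensim's docstring describes sample as a corpus-frequency threshold for randomly downsampling high-frequency words (default 1e-3, with the useful range quoted as (0, 1e-5)), not a quantity tied to vocabulary size. A hedged sketch, the exact value being a matter of tuning:
    from gensim.models.doc2vec import Doc2Vec

    # words whose relative corpus frequency exceeds `sample` are randomly
    # downsampled during training; rare words are unaffected
    d2v = Doc2Vec(corpus, size=100, sample=1e-5, min_count=2, iter=20)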
    Evgeny Denisov
    @eIGato
    What if 90% of the words are different from each other? Is it possible to train a *2Vec model on such a corpus?
    phalexo
    @phalexo
    Clearly that is not a natural language application.
    Evgeny Denisov
    @eIGato
    @phalexo The purpose is to predict phrases, not words. So I use phrases as d2v words, and full texts as d2v docs.
    Saurabh Vyas
    @saurabhvyas
    Is there a pretrained LDA model available for gensim, just for tinkering?
    Matan Shenhav
    @ixxie
    @saurabhvyas Can LDA even be used in a supervised mode?
    Anyway, it's been pretty easy for us to train + predict on a given data set.
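    For anyone looking for a starting point, a minimal train-and-predict sketch with gensim's LdaModel (the corpus and topic count are toy values):
    from gensim import corpora, models

    texts = [["human", "computer", "interaction"],
             ["graph", "trees", "minors"],
             ["graph", "minors", "survey"]]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

    # "predict": the topic distribution of a new document
    new_bow = dictionary.doc2bow(["graph", "survey"])
    print(lda[new_bow])  # e.g. [(0, 0.87), (1, 0.13)]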
    Dennis.Chen
    @DennisChen0307
    Hi there. Is there any roadmap for the next release of gensim?
    AMaini503
    @AMaini503
    Should I expect Doc2Vec to use all the cores if I pass workers = #cpus?
    matanster
    @matanster
    Apologies for adding a fourth question in a row here...
    Does gensim have anything built-in for transforming a document into a bag-of-n-grams representation, or does it in fact only do bag-of-words? (words being 1-grams...)
    Radim Řehůřek
    @piskvorky
    @matanster Gensim doesn't actually do the transformation; it already expects (feature_id, weight) bag-of-whatever pairs on input. How you split the documents into words/ngrams/something else is up to you.
    @DennisChen0307 Our aim is one release per month, but the last few months have been busy at RARE, with not much time for open source. We plan a release for the end of this month.
    matanster
    @matanster
    @piskvorky Oh, sorry then. I just thought maybe corpora.Dictionary.doc2bow might have some usage form for that... I could swear I saw it computing the bag-of-words in my code, but I should probably start reading the source to answer my own questions.
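    To illustrate Radim's point: doc2bow just counts whatever tokens you hand it, so a bag of n-grams is simply a matter of tokenizing into n-grams first. A toy sketch (the ngrams helper is ad hoc, not a gensim function):
    from gensim.corpora import Dictionary

    def ngrams(tokens, n=2):
        # turn a token list into overlapping n-gram strings
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    doc = "the quick brown fox".split()
    dictionary = Dictionary([ngrams(doc)])
    print(dictionary.doc2bow(ngrams(doc)))  # [(0, 1), (1, 1), (2, 1)]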
    Jesse Talavera-Greenberg
    @JesseTG
    When training a word2vec model, I need to give it a list of documents. How does word2vec treat unknown words? By giving an unknown word a vector close to a known word?
    phalexo
    @phalexo
    It does an initial pass, compiling a corpus dictionary. If a word does not make it into the dictionary, I believe it is totally ignored thereafter.
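    A quick way to see that behavior (attribute names follow the gensim of that era, where the vocabulary lives in model.wv.vocab):
    from gensim.models import Word2Vec

    sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    w2v = Word2Vec(sents, min_count=2, size=10)  # "cat" and "dog" occur once, so they are dropped

    print("the" in w2v.wv.vocab)  # True
    print("cat" in w2v.wv.vocab)  # False: ignored during training and at lookup time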
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo So wait, how can I use word2vec to analyze Tweets if slang and trending topics are always changing? Or am I missing something?
    phalexo
    @phalexo
    You would have to continue training. Maybe there is a way to update the dictionary.
    In any case, with Twitter you have a huge problem, because people abbreviate everything, make up their own words, etc.
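    For what it's worth, later gensim versions do support updating the vocabulary and continuing training; a hedged sketch (available from roughly gensim 1.0 on, and the exact train() signature varies by version):
    new_sents = [["new", "slang", "words"], ["another", "trending", "topic"]]

    w2v.build_vocab(new_sents, update=True)  # add previously unseen words to the vocabulary
    w2v.train(new_sents, total_examples=len(new_sents), epochs=w2v.iter)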
    Jesse Talavera-Greenberg
    @JesseTG
    @phalexo What would you suggest?