    Ivan Menshikh
    @menshikh-iv
    phalexo
    @phalexo
    Any of `learn_doctags`, `learn_words`, and `learn_hidden` may be set False to prevent learning-updates to those respective model weights, as if using the (partially-)frozen model to infer other compatible vectors.
    This seems pretty close within the unoptimized path.
    phalexo
    @phalexo
    Presumably I could use the optimized Doc2Vec for initial training, lock a subset of vectors and continue training using the Python version of Doc2Vec. Is the switch between versions automatic or can I control that?
    Are the locks also implemented in the cython version?
    Matan Shenhav
    @ixxie
    @menshikh-iv I don't know exactly if this is useful for others but it should be easy to add a method to Dictionary that does:
    def build_vocab_from_freq(self, freq_dict):
        # overwrite each token's document frequency with an externally supplied count
        for token_id in self:
            self.dfs[token_id] = freq_dict[self[token_id]]
    Ivan Menshikh
    @menshikh-iv

    @phalexo

    Is the switch between versions automatic or can I control that?

    Automatic only (if you have the compiled extensions and the import is successful, the Cython version will be used)
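    A quick way to check which path is active (assuming a gensim 3.x install) is to look at FAST_VERSION, which is -1 when the compiled routines could not be imported:
    # a non-negative value means the optimized Cython routines are in use,
    # -1 means the slow pure-Python fallback
    from gensim.models.doc2vec import FAST_VERSION
    print(FAST_VERSION)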

    Are the locks also implemented in the cython version?

    Yes

    @ixxie I see no serious reason for it (this doesn't look very useful); anyway, you can do it manually (if needed).
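    Something like this, for example (the tokens and counts below are made up, and it assumes your external frequencies can stand in for document frequencies):
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    texts = [["human", "machine", "interface"], ["graph", "trees", "human"]]
    external_freqs = {"human": 120, "machine": 45, "interface": 30, "graph": 80, "trees": 15}

    dictionary = Dictionary(texts)

    # overwrite the document frequencies with the external counts -
    # exactly what the proposed method would do
    for token_id in dictionary.keys():
        dictionary.dfs[token_id] = external_freqs[dictionary[token_id]]

    # num_docs should also be kept consistent with the external counts,
    # otherwise the IDF values will be skewed (the max() here is only a guess)
    dictionary.num_docs = max(external_freqs.values())

    tfidf = TfidfModel(dictionary=dictionary)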
    Matan Shenhav
    @ixxie
    @menshikh-iv - we are using gensim to vectorize and cluster a relatively small set of sentences; the frequency counts inside our dataset are misleading for the purposes of TFIDF, so we need to use external frequencies. I imagine we are not the only ones. I was just wrapping FastText, Doc2Vec and LsiModel into our model, and for the first two I could use .build_vocab_from_freq(), while for LSI I had to do this workaround.
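    For reference, the path for the first two looks roughly like this (the counts are made up; parameter names as in gensim 3.x):
    from gensim.models import FastText

    word_freq = {"graph": 120, "trees": 45, "minors": 30}

    model = FastText(size=100, min_count=1)
    model.build_vocab_from_freq(word_freq)
    # model.train(...) can then be run on the actual sentences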
    Ivan Menshikh
    @menshikh-iv
    @ixxie feel free to create a PR - we can discuss it on GitHub with others from the community
    Matan Shenhav
    @ixxie
    @menshikh-iv I think overall gensim is awesome, and it would be great if it could have a more uniform interface between the models, to make it easier to use them as drop-in replacements for one another
    Ivan Menshikh
    @menshikh-iv
    @ixxie about the interface (in the general sense) - completely agree
    Matan Shenhav
    @ixxie
    @menshikh-iv alright, I will do so when I have time!
    Ivan Menshikh
    @menshikh-iv
    :+1:
    Matan Shenhav
    @ixxie
    @menshikh-iv - are there any other models in gensim you would recommend for short sentence vectorization besides LSI and FastText?
    FastText seems to perform very well in quality and speed so far, and requires less parameter tuning
    I think it will be the winner in the end
    I guess I might pull that Sent2Vec PR though because that might have potential too
    Ivan Menshikh
    @menshikh-iv
    probably :) I can also suggest some techniques for a "mixture of word-vectors"
    Sent2Vec is not really good with short documents
    Matan Shenhav
    @ixxie
    oh okay, the name seemed to suggest otherwise xD
    Ivan Menshikh
    @menshikh-iv
    @ixxie you need something like what is suggested here - RaRe-Technologies/gensim#1879
    @ixxie truth is learned in practice, feel free to try of course!
    Matan Shenhav
    @ixxie
    @menshikh-iv we are on a tight schedule, so I will only explore it if I get some recommendations for it :)
    @menshikh-iv - mixing word vectors is an interesting idea..... doesn't FastText do this internally somewhat?
    Ivan Menshikh
    @menshikh-iv
    @ixxie FastText averages the vectors of n-grams; this is the simplest way to solve this task
    Matan Shenhav
    @ixxie
    I see
    is it correct that the right choice of number of topics for LSI is sensitive to the size of the dataset / diversity of the tokens?
    Matan Shenhav
    @ixxie
    @menshikh-iv - does the vocab which FastText / Doc2Vec use provide a way to do something like TFIDF?
    phalexo
    @phalexo
    @menshikh-iv word_locks and doctag_locks default to None obviously. So, what would be the appropriate syntax to lock a list? If I do word_locks['ischemia'] = False, have I locked that vector to be immutable during additional training? Seems to me that's how it should work.
    Ivan Menshikh
    @menshikh-iv
    @phalexo not false, probably 0.0 :)
    this contains float values
    @ixxie I don't think so, for TFIDF please use https://radimrehurek.com/gensim/corpora/dictionary.html
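    A minimal sketch of that route (toy texts only):
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    texts = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = TfidfModel(bow_corpus)                  # IDF statistics come from the corpus itself
    weighted = [tfidf[bow] for bow in bow_corpus]   # lists of (token_id, weight) pairs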
    Matan Shenhav
    @ixxie
    @menshikh-iv I also discovered a comment on fasttext's GitHub about how you can simply do the weighting manually, because you add up the word vectors to find the doc vector; since I need to use custom frequencies anyway, I have been playing around with custom weighting of the word vectors this morning.
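    Roughly what I have been playing with (the helper name and the weighting scheme are only a sketch):
    import numpy as np

    def weighted_sentence_vector(model, tokens, weights):
        # scale each FastText word vector by a per-token weight, then average;
        # FastText can compose vectors even for out-of-vocabulary tokens
        vecs = [weights[tok] * model.wv[tok] for tok in tokens if tok in weights]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)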
    Ivan Menshikh
    @menshikh-iv
    :+1:
    phalexo
    @phalexo
    @menshikh-iv word_locks and doctag_locks have float elements? Why? The comments seemed clear enough that these control whether weights can be further changed during infer_vector or train execution. Boolean seems the most appropriate type. Would there be any reason for this? Are the values simply used elsewhere to multiply weights? What would be the True value then, 1.0?
    Ivan Menshikh
    @menshikh-iv
    No, float is better here, because this is used directly in training (in multiplication operations)
    If you used a boolean, you would need to write some "if <statement>" (and forget about vectorized calculation) or cast the boolean to float (because the original matrices are float). This is a well-thought-out optimization.
    phalexo
    @phalexo
    OK, I see the rationale now. Is 1.0 then used to mean "weights are mutable"?
    Ivan Menshikh
    @menshikh-iv
    yes
    phalexo
    @phalexo
    Would I need to set up the arrays only once for multiple training epochs or refresh them every time?
    Ivan Menshikh
    @menshikh-iv
    I think only once
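    Putting it together, a rough sketch (the exact attribute names differ between gensim versions - older 3.x releases expose the word locks as model.syn0_lockf and the doctag locks as model.docvecs.doctag_syn0_lockf, while newer ones use vectors_lockf arrays - so check your installed version first):
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(["ischemia", "stroke", "patient"], ["d0"]),
            TaggedDocument(["patient", "recovered", "quickly"], ["d1"])]
    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=5)

    # one float per word: 1.0 = fully trainable, 0.0 = frozen
    # (the factor simply multiplies each update, hence float rather than boolean)
    idx = model.wv.vocab["ischemia"].index     # row of that word in the weight matrix
    model.syn0_lockf[idx] = 0.0                # assumed 3.x-era attribute, see note above

    # set once - the lock array persists across further model.train(...) epochs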
    Matan Shenhav
    @ixxie
    @menshikh-iv do you know of models that focus on the level of token bigrams?
    Ivan Menshikh
    @menshikh-iv
    @ixxie I don't quite catch the question - you can add bigrams to your corpus with Phrases and pass it to any model
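    A minimal sketch of that (toy sentences only):
    from gensim.models.phrases import Phrases, Phraser

    sentences = [["machine", "learning", "is", "fun"],
                 ["machine", "learning", "with", "gensim"],
                 ["gensim", "is", "fun"]]

    phrases = Phrases(sentences, min_count=1, threshold=1)
    bigram = Phraser(phrases)                        # lighter, frozen version of the detector
    bigram_sentences = [bigram[s] for s in sentences]
    # frequent pairs are now joined, e.g. "machine_learning", and the result
    # can be fed to FastText, Doc2Vec, LSI, etc. like any other token list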
    Matan Shenhav
    @ixxie
    @menshikh-iv I see... I was wondering more whether there are word / document embedding models which are trained on the level of token bigrams for ALL of the corpus
    Ivan Menshikh
    @menshikh-iv
    no, the nearest variant is Sent2Vec, which is trained on token n-grams
    Matan Shenhav
    @ixxie
    I see
    so that is the main difference
    @menshikh-iv has sent2vec been merged yet?
    Ivan Menshikh
    @menshikh-iv
    no, it's almost done but not yet ready (though you can already use it)
    Matan Shenhav
    @ixxie
    cool