    Matan Shenhav
    @ixxie
    cool thing about this approach is it works for any vectorizer
    Ivan Menshikh
    @menshikh-iv
    @ixxie how did you install gensim?
    Matan Shenhav
    @ixxie
    used pip
    I have 3.2.0
    I am trying to upgrade
    @menshikh-iv that fixed it :) thanks
    Ivan Menshikh
    @menshikh-iv
    @ixxie :+1:
    @ixxie where are you from?
    Matan Shenhav
    @ixxie
    @menshikh-iv I am Dutch/Israeli living in Finland
    What kind of data science do you work on @menshikh-iv ?
    Ivan Menshikh
    @menshikh-iv
    @ixxie I prefer NLP and graphs. I'm a gensim maintainer.
    phalexo
    @phalexo
    @menshikh-iv Since you maintain Gensim, can you shed some light on my previous questions? Is it possible to mark some vectors immutable (at some stage of training) so that they are no longer updated with further training?
    Matan Shenhav
    @ixxie
    @menshikh-iv yeah I noticed :) thanks for that work!
    Ivan Menshikh
    @menshikh-iv
    @phalexo hm, why do you need this? For Doc2Vec?
    phalexo
    @phalexo
    Yes for Doc2Vec.
    Matan Shenhav
    @ixxie
    @menshikh-iv is it possible to populate the gensim.corpora.Dictionary from a frequency dict, similar to .build_vocab_from_freq()?
    I am using the wordfreq module to build a custom corpus from multiple languages and multiple sources ^^
    Ivan Menshikh
    @menshikh-iv
    @ixxie as far as I remember, there is no standard way; you can only create a Dictionary and fill its inner fields manually.
    @phalexo why do you need this? It looks strange, and a better way probably exists.
    phalexo
    @phalexo
    @menshikh-iv I am not at liberty to talk about other people's plans/goals. I can certainly keep an array of vectors and overwrite any updates that happened, but that is really ugly.
    I should be able to fix a set of vectors so that more training just changes other vectors around the fixed ones.
    Timofey Yefimov
    @anotherbugmaster
    @phalexo, hi. Why do you think it's ugly to overwrite updates? Seems like the easiest way to achieve what you want.
    Ivan Menshikh
    @menshikh-iv
    @phalexo as far as I know, we have some kind of lock factor for word-vectors, but it is really badly documented
    the easiest solution is the best solution here
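A rough illustration of the lock-factor idea mentioned above, not gensim's actual code: each vector gets a multiplier between 0.0 and 1.0 that scales its training update, so a factor of 0.0 effectively freezes that vector. The helper name `apply_update` and the toy numbers are made up for this sketch.

```python
def apply_update(vectors, lock_factors, index, gradient):
    """Add a training update to one vector, scaled by its lock factor."""
    scale = lock_factors[index]
    vectors[index] = [v + scale * g for v, g in zip(vectors[index], gradient)]


vectors = [[0.5, 0.5], [1.0, -1.0]]
lock_factors = [0.0, 1.0]  # vector 0 is frozen, vector 1 is trainable

# The same update is offered to both vectors; only the unlocked one moves.
apply_update(vectors, lock_factors, 0, [0.25, 0.25])
apply_update(vectors, lock_factors, 1, [0.25, 0.25])
```

With this scheme the frozen vector is never changed in the first place, which is the behaviour being asked for, as opposed to updating it and restoring a saved copy afterwards.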
    phalexo
    @phalexo
    Being able to lock certain vectors in place would be ideal.
    @anotherbugmaster It is ugly because it would waste computing resources.
    Timofey Yefimov
    @anotherbugmaster
    How is that?
    phalexo
    @phalexo
    1) time to update 2) then overwrite. Is it not obvious?
    Timofey Yefimov
    @anotherbugmaster
    Nope, it's not. Training of a model has higher complexity than overwriting anyway, @menshikh-iv correct me if I'm wrong
    Ivan Menshikh
    @menshikh-iv
    phalexo
    @phalexo
    Any of `learn_doctags`, `learn_words`, and `learn_hidden` may be set False to
    prevent learning-updates to those respective model weights, as if using the
    (partially-)frozen model to infer other compatible vectors.
    This seems pretty close within the unoptimized path.
    phalexo
    @phalexo
    Presumably I could use the optimized Doc2Vec for initial training, lock a subset of vectors and continue training using the python version of Doc2Vec. Is the switch between versions automatic or can I control that?
    Are the locks also implemented in the cython version?
    Matan Shenhav
    @ixxie
    @menshikh-iv I don't know exactly if this is useful for others but it should be easy to add a method to Dictionary that does:
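    def build_vocab_from_freq(self, freq_dict):
        # Overwrite each token's document frequency with the external count.
        for token_id in self:
            token = self[token_id]
            if token in freq_dict:  # keep the existing count for unknown tokens
                self.dfs[token_id] = freq_dict[token]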
    Ivan Menshikh
    @menshikh-iv

    @phalexo

    Is the switch between versions automatic or can I control that?

    Automatic only (if you have the compiled extensions and the import succeeds, the cython version will be used).

    Are the locks also implemented in the cython version?

    Yes

    @ixxie I see no serious reason for it (it doesn't look very useful); anyway, you can do it manually if needed.
    Matan Shenhav
    @ixxie
    @menshikh-iv - we are using gensim to vectorize and cluster a relatively small set of sentences; the frequency counts inside our data set are misleading for the purposes of TF-IDF, so we need to use external frequencies. I imagine we are not the only ones. I was just wrapping FastText, Doc2Vec and LsiModel into our model, and for the first two I could use .build_vocab_from_freq(), while for LSI I had to do this workaround.
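To make the "external frequencies" point concrete, here is a minimal pure-Python sketch of TF-IDF-style weighting where the IDF-like term comes from an outside frequency table instead of the data set itself. It does not use gensim; `tfidf_with_external_freq`, the add-one smoothing, and the counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_with_external_freq(tokens, external_freq, total_count):
    """Weight tokens by in-sentence count times an IDF-like term from external counts."""
    counts = Counter(tokens)
    weights = {}
    for token, tf in counts.items():
        # Add-one smoothing: tokens missing from the table get the largest weight.
        freq = external_freq.get(token, 0) + 1
        weights[token] = tf * math.log((total_count + 1) / freq)
    return weights


external_freq = {"the": 900, "cat": 9}  # made-up counts from an external source
weights = tfidf_with_external_freq(["the", "cat", "sat"], external_freq, total_count=1000)
```

Here the rare words end up weighted above "the" even though every word occurs exactly once in the sentence, which is the effect a small data set cannot provide on its own.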
    Ivan Menshikh
    @menshikh-iv
    @ixxie feel free to create a PR; we can discuss it on GitHub with others from the community
    Matan Shenhav
    @ixxie
    @menshikh-iv I think overall gensim is awesome, and it would be great if it could have a more uniform interface between the models to make it easier to use them as drop-in replacements for one another
    Ivan Menshikh
    @menshikh-iv
    @ixxie about the interface (in the general sense): completely agree
    Matan Shenhav
    @ixxie
    @menshikh-iv alright, I will do so when I have time!
    Ivan Menshikh
    @menshikh-iv
    :+1:
    Matan Shenhav
    @ixxie
    @menshikh-iv - are there any other models in gensim you would recommend for short sentence vectorization besides LSI and FastText?
    FastText seems to perform very well in quality and speed so far, and requires less parameter tuning
    I think it will be the winner in the end
    I guess I might pull that Sent2Vec PR though because that might have potential too
    Ivan Menshikh
    @menshikh-iv
    probably :) I can also suggest techniques based on a "mixture of word-vectors"
    Sent2Vec is not really good with short documents
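One common reading of "mixture of word-vectors" is simply averaging the vectors of the words in a sentence; whether that is exactly what was meant here is an assumption. A minimal sketch with toy 2-d vectors (`average_vector` and the `word_vectors` table are hypothetical, standing in for any trained embedding lookup):

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # toy embeddings
# "tonight" is out of vocabulary and is skipped.
sentence_vector = average_vector(["good", "movie", "tonight"], word_vectors)
```

Weighted variants, for example scaling each word by a TF-IDF weight before averaging, follow the same pattern.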
    Matan Shenhav
    @ixxie
    oh okay, the name seemed to suggest otherwise xD
    Ivan Menshikh
    @menshikh-iv
    @ixxie you need something like what is suggested here: RaRe-Technologies/gensim#1879