    Matan Shenhav
    @ixxie
    @phalexo you could just omit those documents from the training I guess?
    phalexo
    @phalexo
    @ixxie That is not the goal here. I want these documents to be present in the model, but I want other document vectors trained with these constraints in place.
    Matan Shenhav
    @ixxie
    @phalexo I guess I was thinking something like: train constrained docs -> keep vectors -> train other docs
    but I am not sure that works for you
    or at all xD
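    A minimal sketch of that pipeline, assuming the 3.x-era Doc2Vec API used elsewhere in this chat and hypothetical corpora constrained_docs / other_docs; the phase-1 vectors are copied out of the model because phase-2 training would otherwise mutate them in place:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical corpora
constrained_docs = [['fixed', 'doc', 'one'], ['fixed', 'doc', 'two']]
other_docs = [['free', 'doc', 'one'], ['free', 'doc', 'two']]

tagged = [TaggedDocument(words, [i])
          for i, words in enumerate(constrained_docs + other_docs)]

# phase 1: build the vocab over everything, train on the constrained docs only
model = Doc2Vec(size=50, min_count=1)  # `size` became `vector_size` in gensim 4
model.build_vocab(tagged)
model.train(tagged[:len(constrained_docs)],
            total_examples=len(constrained_docs), epochs=10)

# keep the phase-1 vectors before further training drifts them
kept = [model.docvecs[i].copy() for i in range(len(constrained_docs))]

# phase 2: train on the other docs; `kept` still holds the originals
model.train(tagged[len(constrained_docs):],
            total_examples=len(other_docs), epochs=10)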
    Matan Shenhav
    @ixxie
    I am getting:
    AttributeError: 'FastText' object has no attribute 'syn0_vocab_lockf'
    does anybody know what that means?
    Matan Shenhav
    @ixxie
    @menshikh-iv - is this a bug due to a transition to a new API in 4.0?
    Ivan Menshikh
    @menshikh-iv
    @ixxie can you share full code
    phalexo
    @phalexo
    @ixxie I know what I want to do already. I need to know how to do it specifically with Gensim Doc2Vec. Models trained on different sets of documents will have no relationship to each other. And any later training will completely mutate vectors for any original documents.
    The question is how to keep a certain set of vectors fixed while training.
    Matan Shenhav
    @ixxie
    sure:
    from gensim.models import FastText
    import numpy as np

    ncores = 4  # number of worker threads
    docs = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
    model = FastText(workers=ncores, min_count=1)
    corpus = {'this': 1, 'is': 200, 'some': 5, 'example': 1}
    model.build_vocab_from_freq(corpus, corpus_count=len(docs))
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    # average the token vectors of each doc to get a document vector
    vecs = []
    for doc in docs:
        docVecs = []
        for token in doc:
            docVecs.append(model[token])
        vecs.append(np.mean(docVecs, axis=0))
    @menshikh-iv the full error:
    Exception in thread Thread-7:
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/usr/lib64/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib64/python3.6/site-packages/gensim/models/word2vec.py", line 992, in worker_loop
        tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
      File "/usr/lib64/python3.6/site-packages/gensim/models/fasttext.py", line 469, in _do_train_job
        tally += train_batch_cbow(self, sentences, alpha, work, neu1)
      File "gensim/models/fasttext_inner.pyx", line 384, in gensim.models.fasttext_inner.train_batch_cbow (./gensim/models/fasttext_inner.c:5078)
    AttributeError: 'FastText' object has no attribute 'syn0_vocab_lockf'
    for what it's worth, a similar approach worked perfectly for Doc2Vec
    Matan Shenhav
    @ixxie
    @phalexo maybe you can use vector concatenation somehow to achieve your desired result; keep two models in play, train on one set of docs and continuously retrain the other, then combine the vectors.
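    A rough sketch of that combination step, with model_fixed and model_live as hypothetical stand-ins for the frozen and the continuously-retrained models, sharing document tags:
import numpy as np

# model_fixed: trained once on the pinned docs, never touched again
# model_live: retrained whenever new documents arrive
vec_fixed = model_fixed.docvecs['doc_1']
vec_live = model_live.docvecs['doc_1']

# one combined representation per shared tag
combined = np.concatenate([vec_fixed, vec_live])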
    Ivan Menshikh
    @menshikh-iv
    so, I checked it with 3.3.0, and it all works fine:
    from gensim.models import FastText
    import numpy as np
    
    ncores = 4
    docs = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
    model = FastText(workers=ncores, min_count=1)
    
corpus = {'this': 1, 'is': 200, 'some': 5, 'example': 1}
model.build_vocab_from_freq(corpus, corpus_count=len(docs))
    
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    vecs = []
    for doc in docs:
        docVecs = []
        for token in doc:
            docVecs.append(model[token])
        vecs.append(np.mean(docVecs, axis=0))
    Matan Shenhav
    @ixxie
    the cool thing about this approach is that it works for any vectorizer
    Ivan Menshikh
    @menshikh-iv
    @ixxie how did you install gensim?
    Matan Shenhav
    @ixxie
    used pip
    I have 3.2.0
    I am trying to upgrade
    @menshikh-iv that fixed it :) thanks
    Ivan Menshikh
    @menshikh-iv
    @ixxie :+1:
    @ixxie where are you from?
    Matan Shenhav
    @ixxie
    @menshikh-iv I am Dutch/Israeli living in Finland
    What kind of data science do you work on @menshikh-iv ?
    Ivan Menshikh
    @menshikh-iv
    @ixxie I prefer NLP and graphs; I'm a gensim maintainer.
    phalexo
    @phalexo
    @menshikh-iv Since you maintain Gensim, can you shed some light on my previous questions? Is it possible to mark some vectors immutable (at some stage of training) so that they are no longer updated with further training?
    Matan Shenhav
    @ixxie
    @menshikh-iv yeah I noticed :) thanks for that work!
    Ivan Menshikh
    @menshikh-iv
    @phalexo hm, why do you need this? For Doc2Vec?
    phalexo
    @phalexo
    Yes for Doc2Vec.
    Matan Shenhav
    @ixxie
    @menshikh-iv is it possible to populate the gensim.corpora.Dictionary from a frequency dict, similar to .build_vocab_from_freq()?
    I am using the wordfreq module to build a custom corpus from multiple languages and multiple sources ^^
    Ivan Menshikh
    @menshikh-iv
    @ixxie as I remember, there's no standard way; you can only create the dictionary and fill its inner fields manually.
    @phalexo why do you need this (it looks strange; a better way probably exists)?
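    On the Dictionary point, a minimal sketch of filling the inner fields manually; the bookkeeping counts are crude stand-ins, since a bare frequency dict carries no per-document information:
from gensim.corpora import Dictionary

freqs = {'this': 1, 'is': 200, 'some': 5, 'example': 1}

d = Dictionary()
d.token2id = {token: i for i, token in enumerate(sorted(freqs))}
# treat each frequency as a document frequency (an assumption)
d.dfs = {d.token2id[token]: freq for token, freq in freqs.items()}
# fields Dictionary normally maintains itself, approximated here
d.num_docs = max(freqs.values())
d.num_pos = sum(freqs.values())
d.num_nnz = sum(freqs.values())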
    phalexo
    @phalexo
    @menshikh-iv I am not at liberty to talk about other people's plans/goals. I can certainly keep an array of vectors and overwrite any updates that happened, but that is really ugly.
    I should be able to fix a set of vectors so that more training just changes other vectors around the fixed ones.
    Timofey Yefimov
    @anotherbugmaster
    @phalexo, hi. Why do you think it's ugly to overwrite updates? Seems like the easiest way to achieve what you want.
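    In sketch form, the overwrite loop is just a per-pass copy-back; model, documents, passes, and the integer ids in fixed_ids are hypothetical, and this leans on model.docvecs[i] returning a view into the doctag array, which holds in the 3.x line:
# snapshot the vectors that must stay fixed
saved = {i: model.docvecs[i].copy() for i in fixed_ids}

for _ in range(passes):
    model.train(documents, total_examples=model.corpus_count, epochs=1)
    # undo whatever this pass did to the pinned vectors
    for i, vec in saved.items():
        model.docvecs[i][:] = vec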
    Ivan Menshikh
    @menshikh-iv
    @phalexo as far as I know, we have some kind of lock factor for word-vectors, but it is really badly documented
    easiest solution - best solution here
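    A sketch of that lock factor on the word side; the array has been renamed across versions (model.syn0_lockf in the 3.3-era releases discussed here, model.trainables.vectors_lockf after the 3.4 refactor, model.wv.vectors_lockf in 4.x), so treat the attribute name as version-dependent:
from gensim.models import Word2Vec

sentences = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
model = Word2Vec(sentences, min_count=1)

# one lock factor per vocab word multiplies its gradient updates:
# 1.0 (the default) trains normally, 0.0 freezes the vector
idx = model.wv.vocab['example'].index
model.syn0_lockf[idx] = 0.0  # 3.3-era attribute name; see note above

# further training now leaves the vector for 'example' untouched
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)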
    phalexo
    @phalexo
    Being able to lock certain vectors in place would be ideal.
    @anotherbugmaster It is ugly because it would waste computing resources.
    Timofey Yefimov
    @anotherbugmaster
    How is that?
    phalexo
    @phalexo
    1) time to update, 2) then time to overwrite. Is it not obvious?
    Timofey Yefimov
    @anotherbugmaster
    Nope, it's not. Training a model has higher complexity than overwriting anyway, @menshikh-iv correct me if I'm wrong
    phalexo
    @phalexo
    Any of `learn_doctags`, `learn_words`, and `learn_hidden` may be set False to
    prevent learning-updates to those respective model weights, as if using the
    (partially-)frozen model to infer other compatible vectors.
    This seems pretty close within the unoptimized path.
    phalexo
    @phalexo
    Presumably I could use the optimized Doc2Vec for initial training, lock a subset of vectors, and continue training using the Python version of Doc2Vec. Is the switch between versions automatic, or can I control that?
    Are the locks also implemented in the cython version?
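    The doctag vectors appear to have the same lock-factor mechanism, and, assuming the 3.3-era attribute name doctag_syn0_lockf (renamed in later versions), the optimized code paths should read it as well, so a sketch like this would not require dropping to the pure-Python implementation:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, [i]) for i, words in enumerate(
    [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']])]
model = Doc2Vec(docs, min_count=1)

# zero the lock factor for doc 0: later passes no longer update its vector
model.docvecs.doctag_syn0_lockf[0] = 0.0

model.train(docs, total_examples=model.corpus_count, epochs=model.iter)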
    Matan Shenhav
    @ixxie
    @menshikh-iv I don't know if this is useful for others, but it should be easy to add a method to Dictionary that does:
    def build_vocab_from_freq(self, freq_dict):
        # iterating a Dictionary yields token ids; self[key] maps
        # each id back to its token string
        for key in self:
            self.dfs[key] = freq_dict[self[key]]
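    Hypothetical usage, calling it as a free function on an already-populated Dictionary, since the loop only rewrites dfs for tokens that already exist in token2id:
from gensim.corpora import Dictionary

docs = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
freqs = {'this': 1, 'is': 200, 'some': 5, 'example': 1}

d = Dictionary(docs)              # fills token2id from the docs
build_vocab_from_freq(d, freqs)   # then overwrite dfs from the frequency dict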