    Ivan Menshikh
    @menshikh-iv
    @ixxie are you sure? maybe after conversion to bag-of-words format some documents are empty (because part of the tokens was filtered out)
    you don't pass a list of lists of strings to the model directly, you pass a list of lists of (int, int)
    Matan Shenhav
    @ixxie
    @menshikh-iv:
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    model = LsiModel(corpus, id2word=dictionary, num_topics=10)
    for doc in docs:
        if len(doc) == 0:
            print("doc: ", doc)
    vecs = []
    for doc in list(model[corpus]):
        x = [vec[1] for vec in doc]
        if len(x) == 0:
            print("vec: ", x)
        vecs.append(x)

    vecs = np.array(vecs, float)
    this produces:
    vec:  []
    vec:  []
    vec:  []
    vec:  []
    I tried to follow your instructions as best I could
    Matan Shenhav
    @ixxie
    @menshikh-iv we have discovered that this error seems to occur precisely when the document consists of a single token that occurs only once in the corpus
    Ivan Menshikh
    @menshikh-iv
    @ixxie LSI filters out values that are very low (almost zero); for this reason you can receive empty vectors
    Matan Shenhav
    @ixxie
    is there a way to change the threshold @menshikh-iv ?
    Ivan Menshikh
    @menshikh-iv
    @ixxie no way for [corpus] (also, eps=1e-9 already, so tweaking it makes no sense)
    Matan Shenhav
    @ixxie
    so there is an alternative approach to [corpus]?
    Ivan Menshikh
    @menshikh-iv
    probably only by "monkey-patching" the full2sparse function https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/lsimodel.py#L486 (but believe me, you do not need it; these values are filtered for a reason, values that small are like noise and don't contain any useful information).
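The filtering behaviour described here can be sketched as follows; this is a simplified stand-in for what a full2sparse-style conversion does, not gensim's actual implementation:

```python
# Simplified sketch: entries whose magnitude is at or below eps are
# dropped, so a document whose LSI projection is near-zero on every
# topic comes back as an empty list of (index, value) pairs.
def full2sparse_sketch(dense, eps=1e-9):
    return [(i, v) for i, v in enumerate(dense) if abs(v) > eps]

print(full2sparse_sketch([0.5, 1e-12, -0.3]))    # -> [(0, 0.5), (2, -0.3)]
print(full2sparse_sketch([1e-12, 0.0, -1e-15]))  # -> []
```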
    Matan Shenhav
    @ixxie
    @menshikh-iv so what vectors do you normally substitute for such missing vectors? If we use all zeros the clustering algorithm will put them together
    Ivan Menshikh
    @menshikh-iv
    @ixxie exactly, the better way is to exclude this type of vector (anyway, even if you disable the "threshold", the clustering algo will "glue" all of these vectors into one cluster).
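Excluding the empty vectors before clustering can be sketched like this (plain Python, toy data; real vectors would come from `model[corpus]`). Keeping the surviving indices lets you map cluster labels back to the original documents afterwards:

```python
# Toy stand-in for LSI output: two documents produced empty vectors
# and should be dropped before handing the rest to a clustering algo.
vecs = [[0.1, 0.9], [], [0.4, 0.2], []]

kept_indices = [i for i, v in enumerate(vecs) if len(v) > 0]
kept_vecs = [vecs[i] for i in kept_indices]

print(kept_indices)  # -> [0, 2]
print(kept_vecs)     # -> [[0.1, 0.9], [0.4, 0.2]]
```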
    Matan Shenhav
    @ixxie
    alright, thanks for all the help @menshikh-iv
    phalexo
    @phalexo
    Does anyone know if there is a mechanism in Doc2Vec model to choose and keep some document (and/or word) vectors fixed while training all others?
    Matan Shenhav
    @ixxie
    @phalexo you could just omit those documents from the training I guess?
    phalexo
    @phalexo
    @ixxie That is not the goal here. I want these documents to be present in the model, but I want other document vectors trained with these constraints in place.
    Matan Shenhav
    @ixxie
    @phalexo I guess I was thinking something like: train constrained docs -> keep vectors -> train other docs
    but I am not sure that works for you
    or at all xD
    Matan Shenhav
    @ixxie
    I am getting:
    AttributeError: 'FastText' object has no attribute 'syn0_vocab_lockf'
    does anybody know what that means?
    Matan Shenhav
    @ixxie
    @menshikh-iv - is this a bug due to a transition to a new API in 4.0?
    Ivan Menshikh
    @menshikh-iv
    @ixxie can you share full code
    phalexo
    @phalexo
    @ixxie I know what I want to do already. I need to know how to do it specifically with Gensim Doc2Vec. Models trained on different sets of documents will have no relationship to each other. And any later training will completely mutate vectors for any original documents.
    The question is how to keep a certain set of vectors fixed while training.
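For context: gensim's word2vec-family models keep a per-vector "lock factor" array internally (the `syn0_lockf` family of attributes, one of which appears in the traceback further down), and a lock factor of 0.0 freezes that vector during training. The mechanism can be sketched in plain Python (toy numbers, not the gensim API):

```python
# Each vector's gradient update is multiplied by its lock factor,
# so a factor of 0.0 leaves that vector untouched while others train.
vectors = [[1.0, 1.0], [2.0, 2.0]]
lockf = [1.0, 0.0]          # second vector is frozen
gradient = [0.5, -0.5]      # same toy update applied to every vector

for i in range(len(vectors)):
    vectors[i] = [v + lockf[i] * g for v, g in zip(vectors[i], gradient)]

print(vectors)  # -> [[1.5, 0.5], [2.0, 2.0]]
```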
    Matan Shenhav
    @ixxie
    sure:
    docs = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
    model = FastText(workers=ncores, min_count=1)
    corpus = {'this': 1, 'is': 200, 'some': 5, 'example': 1}
    model.build_vocab_from_freq(corpus, corpus_count=len(docs))
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    vecs = []
    for doc in docs:
        docVecs = []
        for token in doc:
            docVecs.append(model[token])
        vecs.append(np.mean(docVecs, axis=0))
    @menshikh-iv the full error:
    Exception in thread Thread-7:
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/usr/lib64/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib64/python3.6/site-packages/gensim/models/word2vec.py", line 992, in worker_loop
        tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
      File "/usr/lib64/python3.6/site-packages/gensim/models/fasttext.py", line 469, in _do_train_job
        tally += train_batch_cbow(self, sentences, alpha, work, neu1)
      File "gensim/models/fasttext_inner.pyx", line 384, in gensim.models.fasttext_inner.train_batch_cbow (./gensim/models/fasttext_inner.c:5078)
    AttributeError: 'FastText' object has no attribute 'syn0_vocab_lockf'
    for what it's worth, a similar approach worked perfectly for Doc2Vec
    Matan Shenhav
    @ixxie
    @phalexo maybe you can use vector concatenation somehow to achieve your desired result; keep two models in play, train on one set of docs and continuously retrain the other, then combine the vectors.
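The concatenation idea can be sketched with toy vectors (plain Python; in practice each vector would come from one of the two models):

```python
fixed_vec = [0.1, 0.2]       # from the model whose vectors are kept fixed
retrained_vec = [0.9, 0.8]   # from the continuously retrained model

combined = fixed_vec + retrained_vec  # list concatenation
print(combined)  # -> [0.1, 0.2, 0.9, 0.8]
```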
    Ivan Menshikh
    @menshikh-iv
    so, I checked it with 3.3.0, all works fine
    from gensim.models import FastText
    import numpy as np
    
    ncores = 4
    docs = [['this', 'is', 'some'], ['this', 'example'], ['example', 'is', 'this']]
    model = FastText(workers=ncores, min_count=1)
    
    corpus = {'this': 1, 'is': 200, 'some': 5, 'example': 1}
    model.build_vocab_from_freq(corpus, corpus_count=len(docs))
    
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    vecs = []
    for doc in docs:
        docVecs = []
        for token in doc:
            docVecs.append(model[token])
        vecs.append(np.mean(docVecs, axis=0))
    Matan Shenhav
    @ixxie
    the cool thing about this approach is it works for any vectorizer
    Ivan Menshikh
    @menshikh-iv
    @ixxie how did you install gensim?
    Matan Shenhav
    @ixxie
    used pip
    I have 3.2.0
    I am trying to upgrade
    @menshikh-iv that fixed it :) thanks
    Ivan Menshikh
    @menshikh-iv
    @ixxie :+1:
    @ixxie where are you from?
    Matan Shenhav
    @ixxie
    @menshikh-iv I am Dutch/Israeli living in Finland
    What kind of data science do you work on @menshikh-iv ?
    Ivan Menshikh
    @menshikh-iv
    @ixxie I prefer NLP and graphs; I'm a gensim maintainer.
    phalexo
    @phalexo
    @menshikh-iv Since you maintain Gensim, can you shed some light on my previous questions? Is it possible to mark some vectors immutable (at some stage of training) so that they are no longer updated with further training?
    Matan Shenhav
    @ixxie
    @menshikh-iv yeah I noticed :) thanks for that work!
    Ivan Menshikh
    @menshikh-iv
    @phalexo hm, why do you need this? For Doc2Vec?
    phalexo
    @phalexo
    Yes for Doc2Vec.
    Matan Shenhav
    @ixxie
    @menshikh-iv is it possible to populate the gensim.corpora.Dictionary from a frequency dict, similar to .build_vocab_from_freq()?
    I am using the wordfreq module to build a custom corpus from multiple languages and multiple sources ^^
    Ivan Menshikh
    @menshikh-iv
    @ixxie as I remember, there is no standard way; you can only create the dictionary & fill its inner fields manually.
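As a rough sketch of "filling the inner fields manually": a gensim `Dictionary`'s core state is its `token2id` mapping plus frequency counters (`dfs`/`cfs`). The mapping itself can be built from a frequency dict like this (pure-Python stand-in; with gensim you would assign the results onto a `gensim.corpora.Dictionary` instance, and whether counts belong in `dfs` or `cfs` depends on what your frequencies actually measure):

```python
freqs = {'this': 1, 'is': 200, 'some': 5, 'example': 1}

# Deterministic ids: sort tokens, then number them.
token2id = {token: i for i, token in enumerate(sorted(freqs))}
# Map each id to its count (assumption: counts used as document frequencies).
dfs = {token2id[tok]: count for tok, count in freqs.items()}

print(token2id)  # -> {'example': 0, 'is': 1, 'some': 2, 'this': 3}
```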