    Thanos Papaoikonomou
    @erwtokritos_twitter
    @phalexo You could install (system-wide or in a virtualenv) the specific version of Gensim which was state of the art in Feb. 2017 and work with that.
    rambo-yuanbo
    @rambo-yuanbo
    Hi guys, in the trim_rule I pass to the Doc2Vec constructor, I discard all numbers (plain numbers and percentages) and all words of length 1 (a consideration that might only apply to Chinese words). However, I wonder: is it good practice to discard all numbers? What is the effect of discarding a word? Is it the same as if I cut all appearances of that word out of the corpus? If so, would it be better not to discard the number words, and instead map them all to a single special word, say "NUMBER"? And is there a built-in way to do that with Doc2Vec, instead of replacing all the number words in the corpus myself?
    rambo-yuanbo
    @rambo-yuanbo
    sorry, I mean replace number words with something like "some-special-prefix-NUMBER-post-fix"; this input box just showed a bold "NUMBER" instead
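    A minimal sketch of both options, assuming gensim's trim_rule protocol (the is_number helper and the NUM placeholder are illustrative, not gensim built-ins); note that a trim_rule can only keep or discard a word, so remapping numbers to one token has to happen while preprocessing the corpus:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD

    def is_number(word):
        # Illustrative heuristic: treats "42", "3.5" and "12%" as numbers.
        return word.rstrip('%').replace('.', '', 1).isdigit()

    def trim_rule(word, count, min_count):
        # RULE_DISCARD removes the word from the vocabulary; gensim then
        # skips its occurrences during training, much as if they had been
        # cut out of the corpus. RULE_DEFAULT falls back to min_count.
        if is_number(word) or len(word) == 1:
            return RULE_DISCARD
        return RULE_DEFAULT

    # Alternative: normalize tokens before training, mapping every number
    # to a single placeholder token (no built-in Doc2Vec option for this).
    def normalize(tokens):
        return ['NUM' if is_number(t) else t for t in tokens]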
    Mohit Rathore
    @markroxor
    I am using gensim's w2v model to create word embeddings. I am training the model myself. After the model is trained and I analyze the model text file, I find that certain words are not "vectorized". Why is that? Am I missing something?
    @menshikh-iv
    Ivan Menshikh
    @menshikh-iv
    @markroxor w2v trims infrequent words; look at the min_count=5 argument of the __init__ method
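    A minimal sketch of the effect on a toy corpus (gensim 3.x API, where the vocabulary lives in model.wv.vocab):
    from gensim.models import Word2Vec

    sentences = [["hello", "world"], ["hello", "gensim"]]

    # With the default min_count=5 every word here would be dropped (gensim
    # even raises an error about an empty vocabulary); min_count=1 keeps all.
    model = Word2Vec(sentences, min_count=1)
    print("world" in model.wv.vocab)  # True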
    Mohit Rathore
    @markroxor
    thanks, that was a parameter that was set to default so I missed it. Silly.
    Maryam
    @mjahanshahi
    Hi everyone. I tried using the awesome new Soft Cosine Similarity model. I had a problem with the command similarity_matrix = w2v_model.similarity_matrix(dictionary). The error I got was AttributeError: 'KeyedVectors' object has no attribute 'similarity_matrix'. I couldn't find references to similarity_matrix in the docs, but I could be wrong. Can anyone better versed help me?
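    For context: similarity_matrix was only added to KeyedVectors in gensim 3.3.0, so an AttributeError like this usually means an older gensim is installed. A self-contained sketch of the intended usage, assuming gensim >= 3.3 (the toy docs are illustrative):
    from gensim.corpora import Dictionary
    from gensim.models import Word2Vec
    from gensim.similarities import SoftCosineSimilarity

    docs = [["cat", "sat"], ["dog", "sat"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    w2v_model = Word2Vec(docs, min_count=1)

    # similarity_matrix() lives on the KeyedVectors (w2v_model.wv).
    similarity_matrix = w2v_model.wv.similarity_matrix(dictionary)
    index = SoftCosineSimilarity(corpus, similarity_matrix)
    sims = index[dictionary.doc2bow(["cat", "sat"])]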
    Matan Shenhav
    @ixxie
    Hey; I have been trying to use doc2vec on a corpus of company names. The model seems to perform well generally, but fails for names consisting of a single word. Now this may be an issue with the clustering algorithm (HDBSCAN), but I don't see how it could be. Is there a trick to training/inferring when expecting one-word documents?
    Ivan Menshikh
    @menshikh-iv
    @ixxie Doc2Vec typically shows bad performance on really short documents; I suggest you try LSI or FastText instead.
    Matan Shenhav
    @ixxie
    cheers @menshikh-iv, I will give it a shot
    Matan Shenhav
    @ixxie
    @menshikh-iv - which clustering algorithms generally work well with such vectorizers? We have been exploring HDBSCAN and Affinity Propagation because they don't require knowing the number of clusters beforehand, but we were wondering if there is anything better
    Ivan Menshikh
    @menshikh-iv
    @ixxie to be honest, I have no idea what's better in this case. I recommend trying as many different algorithms as possible and choosing the best one.
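    A minimal sketch of that comparison loop, using scikit-learn's AffinityPropagation and DBSCAN with silhouette score as one possible yardstick (HDBSCAN would slot in the same way via the separate hdbscan package; the random vectors stand in for real document embeddings):
    import numpy as np
    from sklearn.cluster import AffinityPropagation, DBSCAN
    from sklearn.metrics import silhouette_score

    vecs = np.random.rand(100, 10)  # stand-in for document vectors

    for algo in (AffinityPropagation(), DBSCAN(eps=0.5)):
        labels = algo.fit_predict(vecs)
        # silhouette_score is only defined for 2..n_samples-1 clusters
        if 1 < len(set(labels)) < len(vecs):
            print(type(algo).__name__, silhouette_score(vecs, labels))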
    Matan Shenhav
    @ixxie
    yeah that's what I am starting to see.... better to try as many combinations as possible
    Matan Shenhav
    @ixxie
    @menshikh-iv it's a bit unclear how FastText is used to vectorize documents; the API reference mentions how to get word vectors but not document vectors
    same with LSI
    oh, with LSI I see now I can use the [ ] method to get the vector embedding
    Ivan Menshikh
    @menshikh-iv
    LSI gives embeddings for documents
    you can use model[['my', 'document']]
    for FastText, infer a vector for each word and calculate the document vector as the average of the word vectors
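    A minimal sketch of the FastText averaging approach (gensim 3.x argument names; size became vector_size in gensim 4):
    import numpy as np
    from gensim.models import FastText

    data = [["hello", "world"], ["hello", "gensim"]]
    model = FastText(data, size=10, min_count=1)

    def doc_vector(tokens):
        # FastText can build vectors even for out-of-vocabulary tokens
        # from their character n-grams, so every word can contribute.
        return np.mean([model.wv[t] for t in tokens], axis=0)

    vec = doc_vector(["hello", "world"])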
    Matan Shenhav
    @ixxie
    cheers @menshikh-iv
    Matan Shenhav
    @ixxie
    @menshikh-iv what kind of corpus does LSI expect? I tried supplying a list of token lists and I get a value error
    Ivan Menshikh
    @menshikh-iv
    @ixxie
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel

    data = [["a", "a", "b"], ["c", "d"]]
    dictionary = Dictionary(data)  # token -> integer id mapping
    corpus = [dictionary.doc2bow(doc) for doc in data]  # lists of (id, count)

    model = LsiModel(corpus, id2word=dictionary)
    list(model[corpus])  # [[(0, 2.236067977499789)], [(1, -1.4142135623730951)]]
    Matan Shenhav
    @ixxie
    @menshikh-iv thank you very much ^^
    Ivan Menshikh
    @menshikh-iv
    @ixxie :+1:
    phalexo
    @phalexo
    I have not followed Gensim threads for a while, was there ever progress made in porting Doc2Vec to run on GPUs?
    I recently retrained Doc2Vec with a pretty large corpus of around 30 million documents (about 1 page long on average). I used 20 epochs and it took about 2.5-3 days. How many epochs are usually good enough? Thanks.
    Ivan Menshikh
    @menshikh-iv
    @phalexo no, this can't be parallelized on GPU in an effective way (on CPU it is already fast).
    About the number of epochs: you can evaluate with your downstream task (whatever you use Doc2Vec for); if you want a better result, you can update the model with one more epoch.
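    A sketch of that incremental update in the gensim 3.x API (the toy corpus is illustrative):
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(["hello", "world"], [0]),
              TaggedDocument(["hello", "gensim"], [1])]
    model = Doc2Vec(corpus, vector_size=10, min_count=1, epochs=5)

    # One more pass over the same corpus; re-evaluate the downstream task
    # after each such update to decide when to stop.
    model.train(corpus, total_examples=model.corpus_count, epochs=1)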
    phalexo
    @phalexo
    Do you have any insight into why the algorithm is so resistant to GPU porting? Where is the bottleneck? Communication between GPUs? Is there any algorithm that produces similar embeddings but lends itself better to parallel execution?
    Matan Shenhav
    @ixxie
    @menshikh-iv - I followed your approach for LSI but I get some empty elements in the resulting list
    Ivan Menshikh
    @menshikh-iv
    @ixxie what does the input bag-of-words look like?
    Matan Shenhav
    @ixxie
    @menshikh-iv a list of lists of strings, as you suggested it should be
    none of them are empty
    but they do vary in length
    Ivan Menshikh
    @menshikh-iv
    @ixxie are you sure? Maybe after conversion to bag-of-words format some documents are empty (because some of the tokens were filtered)
    you don't pass the list of lists of strings to the model directly; you pass a list of lists of (int, int) tuples
    Matan Shenhav
    @ixxie
    @menshikh-iv:
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    model = LsiModel(corpus, id2word=dictionary, num_topics=10)
    for doc in docs:
        if len(doc) == 0:
            print("doc: ", doc)
    vecs = []
    for doc in list(model[corpus]):
        x = [vec[1] for vec in doc]
        if len(x) == 0:
            print("vec: ", x)
        vecs.append(x)

    vecs = np.array(vecs, float)
    this produces:
    vec:  []
    vec:  []
    vec:  []
    vec:  []
    I tried to follow your instructions as best I could
    Matan Shenhav
    @ixxie
    @menshikh-iv we have discovered that this error seems to occur precisely when the document consists of a single token that occurs only once in the corpus
    Ivan Menshikh
    @menshikh-iv
    @ixxie LSI filters out values that are very low (almost zero); that is why you can receive empty vectors
    Matan Shenhav
    @ixxie
    is there a way to change the threshold @menshikh-iv ?
    Ivan Menshikh
    @menshikh-iv
    @ixxie no way when using [corpus] (also, eps is already 1e-9, so tuning it makes no sense)
    Matan Shenhav
    @ixxie
    so is there an alternative approach to [corpus]?
    Ivan Menshikh
    @menshikh-iv
    probably only by monkey-patching the full2sparse function https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/lsimodel.py#L486 (but believe me, you do not need it; these values are filtered for a reason: values that small are like noise and don't contain any useful information).
    Matan Shenhav
    @ixxie
    @menshikh-iv so what vectors do you normally substitute for such missing vectors? If we use all zeros, the clustering algorithm will put them together
    Ivan Menshikh
    @menshikh-iv
    @ixxie exactly; the better way is to exclude this type of vector (anyway, even if you disabled the threshold, the clustering algorithm would "glue" all of these vectors into one cluster).
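    Continuing from the snippet above, a sketch of that exclusion step; gensim.matutils.sparse2full densifies each LSI vector to a fixed length (so near-zero topics become 0.0 instead of silently shrinking the vector), and the kept indices map cluster labels back to the original documents:
    import numpy as np
    from gensim.matutils import sparse2full

    kept_idx, kept_vecs = [], []
    for i, doc in enumerate(model[corpus]):
        if doc:  # skip documents whose LSI vector was filtered to nothing
            kept_idx.append(i)
            kept_vecs.append(sparse2full(doc, model.num_topics))
    vecs = np.array(kept_vecs)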
    Matan Shenhav
    @ixxie
    alright, thanks for all the help @menshikh-iv