    rambo-yuanbo
    @rambo-yuanbo
    Hi @anotherbugmaster, I'm still trying to understand the tags here. Reading the stackoverflow link: if one tag is shared across multiple training documents, like the 'action' tag in that post, then model.docvecs['action'] is a feature vector shared by all the documents carrying that tag? And thinking about the model, does it try to predict the current word from the context plus any one of the tag vectors (whichever one stochastic gradient descent happens to pick), or from all of the tag vectors together? Thanks a lot!
    My guess: since the NN's structure is fixed, only a single tag's vector is combined with the context words' vectors to predict the current word?
    Timofey Yefimov
    @anotherbugmaster
    Don't know for sure, but I suppose that all of a document's tags are used simultaneously in one vector
    Let's say we have a set of documents with tags ((a, b), (b, c), (c, d)). Then the vectors will be ((1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1))
    Just like with a word context vector
    rambo-yuanbo
    @rambo-yuanbo
    Hi @anotherbugmaster, I don't think there's a single vector for the whole list of tags of a document, unless there's a problem in gojomo's reply in the post you linked. The reply suggests you can get a vector for each of the tags, like model.docvecs['UID_1'] and model.docvecs['action'].
    Timofey Yefimov
    @anotherbugmaster
    I meant the in-model input vector, sorry if I didn't make that clear. Of course you can get embeddings for each tag.
    When you train a classic doc2vec model, you give it the word id vector + the document id vector as input, right? In the case of multiple tags, your input vector is basically the word id vector + the tag vector
    rambo-yuanbo
    @rambo-yuanbo
    @anotherbugmaster hi, I think I'm starting to get your point. So for a list of tags on a document, say ['UID_1', 'action'], each unique tag gets a 0/1 slot in the tag/doc vector, so a doc tagged ['UID_1', 'action'] might get a vector like [1, 1, 0, 0], and a doc tagged ['UID_1'] a vector like [1, 0, 0, 0], right?
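    A minimal sketch of the multi-tag setup being discussed, assuming a gensim 3.x-style API; the corpus and tag names here are made up for illustration:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # each document may carry several tags; every distinct tag gets its own trainable vector
    docs = [
        TaggedDocument(words=["buy", "now", "save", "big"], tags=["UID_1", "action"]),
        TaggedDocument(words=["quarterly", "report", "released"], tags=["UID_2", "action"]),
        TaggedDocument(words=["weather", "update"], tags=["UID_3"]),
    ]
    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20)

    # one vector per tag, shared by all documents carrying that tag
    print(model.docvecs["action"])
    print(model.docvecs["UID_1"])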
    rambo-yuanbo
    @rambo-yuanbo
    Another issue: the doc id in the similarity tuples returned by model.docvecs.most_similar() is of type int64, and I got an error when I queried my doc corpus with the returned tag, because my corpus is tagged with plain Python ints (int32). So first of all, how do I find out what data type a gensim function returns, other than by debugging and stepping into the gensim code?
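    One way to see this without stepping through gensim, plus the cast that works around the mismatch (sketch only; my_corpus and the query tag 0 are placeholders):
    sims = model.docvecs.most_similar(0)   # list of (tag, score) tuples
    tag, score = sims[0]
    print(type(tag), type(score))          # e.g. <class 'numpy.int64'>, <class 'numpy.float64'>
    doc = my_corpus[int(tag)]              # cast back to a plain Python int before indexing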
    Thanos Papaoikonomou
    @erwtokritos_twitter
    Hello! I would like to build a kind of regression suite for Gensim models, i.e. ensure that pairs (and pairs of pairs) of words have the expected semantic relationships, as in the original Google paper. So far I have found evaluate_word_pairs and accuracy for this purpose. Is there anything else I could use? Thanks
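    For reference, a sketch of how those two checks are usually wired up, assuming a trained model with a gensim 3.x KeyedVectors and the evaluation files shipped with gensim's test data:
    from gensim.test.utils import datapath

    # word-pair similarity: Pearson/Spearman correlation against human judgements
    pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))

    # analogy accuracy, section by section, as in the original word2vec paper
    sections = model.wv.accuracy(datapath("questions-words.txt"))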
    phalexo
    @phalexo
    Hello. I tried to load a model that I trained in Feb. 2017 and use it to infer a vector, and I got an exception. Should the current version of Gensim be able to use a year-old model, or do I have to retrain it? Thanks.
    Thanos Papaoikonomou
    @erwtokritos_twitter
    @phalexo You could install (system-wide or in a virtualenv) the specific version of Gensim that was current in Feb. 2017 and work with that.
    rambo-yuanbo
    @rambo-yuanbo
    Hi guys, when I write a trim_rule for the Doc2Vec constructor, I discard all numbers (plain numbers and percentages) and all words of length 1 (a consideration that probably only applies to Chinese words). HOWEVER, I doubt whether discarding all numbers is good practice. What is the effect of discarding a word? Is it the same as if I had cut every occurrence of that word out of the corpus? If so, would it be better not to discard the number words, and instead map every number word to a single special word, say "NUMBER"? And is there a built-in way to do that with Doc2Vec, instead of replacing all the number words in the corpus myself?
    rambo-yuanbo
    @rambo-yuanbo
    Sorry, I meant replacing number words with something like "some-special-prefix-NUMBER-postfix"; this input box just showed a bold "NUMBER" instead.
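    A sketch of the trim_rule being described, plus the preprocessing step asked about; gensim has no built-in token remapping, so the mapping has to happen before the corpus reaches Doc2Vec (the regex and the "NUMBER" token are illustrative):
    import re
    from gensim import utils

    def trim_rule(word, count, min_count):
        # discard bare numbers/percentages and single-character tokens;
        # everything else falls back to gensim's default min_count handling
        if re.fullmatch(r"\d+%?", word) or len(word) == 1:
            return utils.RULE_DISCARD
        return utils.RULE_DEFAULT

    def map_numbers(tokens):
        # alternative to discarding: collapse every number into one shared token
        return ["NUMBER" if re.fullmatch(r"\d+%?", t) else t for t in tokens]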
    Mohit Rathore
    @markroxor
    I am using gensim's w2v model to create word embeddings, and I am training the model myself. After the model is trained and I inspect the model text file, I find that certain words are not vectorized. Why is that? Am I missing something?
    @menshikh-iv
    Ivan Menshikh
    @menshikh-iv
    @markroxor w2v trims infrequent words; look at the min_count=5 default argument of the __init__ method
    Mohit Rathore
    @markroxor
    Thanks, that parameter was left at its default, so I missed it. Silly.
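    For completeness, a minimal sketch of the parameter in question (toy corpus; min_count=1 keeps every word, at the cost of noisier vectors for rare terms):
    from gensim.models import Word2Vec

    sentences = [["hello", "world"], ["hello", "gensim"]]
    model = Word2Vec(sentences, min_count=1)   # the default min_count=5 silently drops rare words
    print("gensim" in model.wv.vocab)          # True; with the default it would have been trimmed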
    Maryam
    @mjahanshahi
    Hi everyone. I tried implementing the awesome new Soft Cosine Similarity model. I had a problem with the command similarity_matrix = w2v_model.similarity_matrix(dictionary). The error I got was AttributeError: 'KeyedVectors' object has no attribute 'similarity_matrix'. I couldn't find references to similarity_matrix in the docs, but I could be wrong. Can anyone better versed help me?
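    For anyone hitting the same error: that attribute only exists in sufficiently recent releases, so upgrading gensim is the usual fix. A sketch of the intended workflow, assuming gensim >= 3.3, an already-trained w2v_model (a KeyedVectors) and a tokenized corpus_tokens (the names are illustrative):
    from gensim.corpora import Dictionary
    from gensim.similarities import SoftCosineSimilarity

    dictionary = Dictionary(corpus_tokens)
    bow_corpus = [dictionary.doc2bow(doc) for doc in corpus_tokens]

    similarity_matrix = w2v_model.similarity_matrix(dictionary)   # term-term similarity matrix
    index = SoftCosineSimilarity(bow_corpus, similarity_matrix)
    sims = index[dictionary.doc2bow(["some", "query", "tokens"])]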
    Matan Shenhav
    @ixxie
    Hey; I have been trying to use doc2vec on a corpus of company names. The model seems to perform well in general, but it fails for names consisting of a single word. This may be an issue with the clustering algorithm (HDBSCAN), but I don't see how it could be. Is there a trick to training/inferring when expecting one-word documents?
    Ivan Menshikh
    @menshikh-iv
    @ixxie Doc2Vec typically shows poor performance on really short documents; I'd suggest trying LSI or FastText instead.
    Matan Shenhav
    @ixxie
    cheers @menshikh-iv, I will give it a shot
    Matan Shenhav
    @ixxie
    @menshikh-iv - which clustering algorithms generally work well with such vectorizers? We have been exploring HDBSCAN and Affinity Propagation because they don't require knowing the number of clusters beforehand, but we were wondering if there is anything better
    Ivan Menshikh
    @menshikh-iv
    @ixxie to be honest, I have no idea what works best in this case. I recommend trying as many different algorithms as possible and picking the best one
    Matan Shenhav
    @ixxie
    yeah, that's what I am starting to see... better to try as many combinations as possible
    Matan Shenhav
    @ixxie
    @menshikh-iv it's a bit unclear how FastText is used to vectorize documents; the API reference mentions how to get word vectors but not document vectors
    same with LSI
    oh, with LSI I see now that I can use the [] method to get the vector embedding
    Ivan Menshikh
    @menshikh-iv
    LSI gives embeddings per document
    you can use model[dictionary.doc2bow(['my', 'document'])]
    for FastText - infer a vector for each word and compute the document vector as the average of the word vectors
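    A sketch of that averaging idea (toy corpus; assumes gensim 3.x, where FastText can still produce vectors for out-of-vocabulary words from character n-grams):
    import numpy as np
    from gensim.models import FastText

    sentences = [["acme", "consulting", "group"], ["globex", "industries"]]
    ft = FastText(sentences, min_count=1)

    def doc_vector(tokens, model):
        # document vector = mean of the word vectors
        return np.mean([model.wv[t] for t in tokens], axis=0)

    vec = doc_vector(["acme", "industries"], ft)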
    Matan Shenhav
    @ixxie
    cheers @menshikh-iv
    Matan Shenhav
    @ixxie
    @menshikh-iv what kind of corpus does LSI expect? I tried supplying a list of token lists and I get a value error
    Ivan Menshikh
    @menshikh-iv
    @ixxie
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel
    
    data = [["a", "a", "b"], ["c", "d"]]
    dictionary = Dictionary(data)
    corpus = [dictionary.doc2bow(doc) for doc in data]
    
    model = LsiModel(corpus, id2word=dictionary)
    list(model[corpus])  # [[(0, 2.236067977499789)], [(1, -1.4142135623730951)]]
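    As a follow-up to the same snippet: an unseen document goes through the same dictionary before being projected (the tokens are illustrative):
    new_doc = ["a", "d", "d"]
    print(model[dictionary.doc2bow(new_doc)])   # LSI vector for the new document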
    Matan Shenhav
    @ixxie
    @menshikh-iv thank you very much ^^
    Ivan Menshikh
    @menshikh-iv
    @ixxie :+1:
    phalexo
    @phalexo
    I have not followed the Gensim threads for a while; was any progress ever made on porting Doc2Vec to run on GPUs?
    I recently retrained Doc2Vec on a pretty large corpus of around 30 million documents (about 1 page long on average). I used 20 epochs and it took roughly 2.5-3 days. How many epochs are usually good enough? Thanks.
    Ivan Menshikh
    @menshikh-iv
    @phalexo no, this isn't parallelized on the GPU in an effective way (on the CPU it is already fast).
    About the number of epochs - evaluate on your downstream task (whatever you use Doc2Vec for); if you want a better result, you can update the model with one more epoch.
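    A sketch of the incremental update suggested here, assuming the same tagged corpus used for the original training (gensim 3.x-style train() call):
    model.train(tagged_corpus,
                total_examples=model.corpus_count,
                epochs=1)   # one additional pass over the corpus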
    phalexo
    @phalexo
    Do you have any insight into why the algorithm is so resistant to GPU porting? Where is the bottleneck? Communication between GPUs? Is there any algorithm that produces similar embeddings but lends itself better to parallel execution?
    Matan Shenhav
    @ixxie
    @menshikh-iv - I followed your approach for LSI, but I get some empty elements in the resulting list
    Ivan Menshikh
    @menshikh-iv
    @ixxie what does the input bag-of-words look like?
    Matan Shenhav
    @ixxie
    @menshikh-iv a list of lists of strings, as you suggested it should be
    none of them are empty
    but they do vary in length
    Ivan Menshikh
    @menshikh-iv
    @ixxie are you sure? Maybe after conversion to bag-of-words format some documents are empty (because some of the tokens were filtered)
    you don't pass a list of lists of strings to the model directly, you pass a list of lists of (int, int) tuples
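    A quick check along those lines, assuming docs is the list of token lists and dictionary is the gensim Dictionary built from it:
    empty = [i for i, doc in enumerate(docs) if len(dictionary.doc2bow(doc)) == 0]
    print(empty)   # indices of documents that lost all their tokens in the conversion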
    Matan Shenhav
    @ixxie
    @menshikh-iv:
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    model = LsiModel(corpus, id2word=dictionary, num_topics=10)
    for doc in docs:
        if len(doc) == 0:
            print("doc: ", doc)
    vecs = []
    for doc in list(model[corpus]):
        x = [vec[1] for vec in doc]
        if len(x) == 0:
            print("vec: ", x)
        vecs.append(x)

    vecs = np.array(vecs, float)
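    One way to sidestep both issues hinted at above (empty LSI output for some documents and the ragged list breaking np.array): gensim's matutils.corpus2dense pads missing topics with zeros and returns a rectangular matrix. A sketch, assuming the model and corpus defined just above:
    from gensim import matutils

    # shape (num_docs, num_topics); documents that came back empty become all-zero rows
    vecs = matutils.corpus2dense(model[corpus], num_terms=model.num_topics).T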