    Radim Řehůřek
    @piskvorky
    @emnajaoua Did the installation from source proceed correctly? What's the error? What do you mean by "wheel" -- that's a binary distribution, not source. Btw the mailing list may be a better place :)
    @eugines that's an interesting project! I was not aware of that. Tweeted it out now :)
    Regarding your question: there has been some refactoring in the past 2-3 releases. We tried to maintain backward compatibility, but if NETL used some older version, there may have been an issue. @gojomo do you know whether the DocvecsArray class moved (and where)?
    If so, we should at least keep an alias in the old location, to maintain backward compatibility.
    phalexo
    @phalexo
    I had to retrain my model (Doc2Vec) from scratch because of the changes in gensim. That repository is 2 years old.
    matanster
    @matanster
    Do we have any convenience function (in gensim) for getting a word vector's length? (as a proxy for its frequency in the data the embedding was trained over)
    Of course this question is motivated by laziness....
    matanster
    @matanster
    ? :)
    phalexo
    @phalexo
    Apparently laziness is not the mother of all invention, as I thought. You've wasted 2 days waiting for someone else to do your work.
    matanster
    @matanster
    @phalexo no I've not, I was just curious how deep the gensim toolchain goes
    @phalexo thanks for being aggressive though
    Ignoring the snarky comment, I'll post another question, for others if they care to respond
    I am probably not getting the point of corpus2dense().
    Don't people typically want to deal with sparse matrices in language processing, e.g. when training classification models over sparse language data? For example, you typically can't efficiently train/fit a model on a large bag-of-words-ish dense matrix in sklearn without huge memory. What am I missing?
    estathop
    @estathop
    Greetings chat, it's nice to have joined you. I have a relatively simple question, I suppose. I want to use tf-idf; I executed the example in the comments and the code successfully returned a vector of (word index, weight) pairs. How can I map a word's index back to the actual word? Is there a function I am not aware of, or can I implement it somehow?
    estathop
    @estathop
    I am seeing that sklearn has this method implemented already
    Radim Řehůřek
    @piskvorky
    @matanster for NumPy vectors, you can get their length using vector.shape. See the # numpy vector of a word example at https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors
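    A minimal sketch (the vectors file and lookup word below are placeholders):

    import numpy as np
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("vectors.txt")  # hypothetical file

    vec = wv["word"]            # numpy vector of a word
    print(vec.shape)            # dimensionality of the vector, e.g. (100,)
    print(np.linalg.norm(vec))  # vector magnitude, the "length" asked about above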
    @matanster regarding corpus2dense: it's true NLP produces sparse matrices using the bag-of-words representation. But some later transformations, such as LSI, transform the sparse vectors into a dense space. So using a dense representation actually saves you memory (less overhead than representing a dense matrix using sparse structures).
    Plus, there are many other external tools in the Data Science ecosystem that still cannot handle sparse inputs. So you have to export/supply a dense structure => need corpus2dense. Even scikit-learn didn't have sparse support for a while ;)
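    A minimal corpus2dense sketch on a toy corpus:

    from gensim import matutils
    from gensim.corpora import Dictionary

    docs = [["human", "computer", "interaction"], ["graph", "trees"]]  # toy data
    dct = Dictionary(docs)
    bow = [dct.doc2bow(doc) for doc in docs]

    # Dense numpy matrix of shape (num_terms, num_docs).
    dense = matutils.corpus2dense(bow, num_terms=len(dct))
    print(dense.shape)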
    @estathop converting between words and ids is the job of the Dictionary: https://radimrehurek.com/gensim/corpora/dictionary.html
    And see also the Tutorials: https://radimrehurek.com/gensim/tutorial.html
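    For example, mapping TF-IDF term ids back to words with the Dictionary (a toy sketch):

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    docs = [["cat", "sat", "mat"], ["cat", "dog"]]  # toy data
    dct = Dictionary(docs)
    corpus = [dct.doc2bow(doc) for doc in docs]
    model = TfidfModel(corpus)

    for term_id, weight in model[corpus[0]]:
        print(dct[term_id], weight)  # dct[term_id] returns the word itself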
    estathop
    @estathop
    I want to perform topic extraction with the use of TF-IDF, not similarity measures
    estathop
    @estathop
    I am curious why this doesn't work. I suppose it has to do with the "fake-news" dataset being in a different format than "text8". Is there an easy way to handle it and perform the exact same thing?

    import gensim.downloader as api
    from gensim.models import TfidfModel
    from gensim.corpora import Dictionary

    dataset = api.load("fake-news")
    dct = Dictionary(dataset) # fit dictionary
    corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format

    model = TfidfModel(corpus) # fit model
    vector = model[corpus[0]] # apply model to the first corpus document

    the vector seems to be empty, and every element in corpus seems to be the same
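    One possible adaptation (a sketch, not verified here; it assumes each "fake-news" record is a dict with a "text" field, whereas "text8" already yields lists of tokens):

    import gensim.downloader as api
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel
    from gensim.utils import simple_preprocess

    # Tokenize the raw text of each record first; the "text" key is an
    # assumption -- print one record from the iterator to confirm the layout.
    dataset = [simple_preprocess(doc["text"]) for doc in api.load("fake-news")]

    dct = Dictionary(dataset)                       # fit dictionary on token lists
    corpus = [dct.doc2bow(doc) for doc in dataset]  # convert corpus to BoW format

    model = TfidfModel(corpus)
    vector = model[corpus[0]]                       # should no longer be empty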
    matanster
    @matanster
    @piskvorky thanks a lot for the enlightening comments!
    @piskvorky and I do not envy those who need to convert a big corpus into a dense representation with corpus2dense; I had it overflow the memory of a large machine on just 250,000 documents...
    Then again, that was a bag-of-words drill; with word vectors it might be harmless
    matanster
    @matanster
    One last thing: I'm probably just not good with Pythonic documentation, but does the corpus object easily expose a sparse representation as well?
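    For what it's worth, gensim.matutils.corpus2csc converts a streamed BoW corpus into a scipy sparse matrix; a minimal sketch:

    from gensim import matutils
    from gensim.corpora import Dictionary

    docs = [["sparse", "is", "nice"], ["nice", "and", "compact"]]  # toy data
    dct = Dictionary(docs)
    bow = [dct.doc2bow(doc) for doc in docs]

    # scipy.sparse CSC matrix of shape (num_terms, num_docs) -- never densified.
    sparse = matutils.corpus2csc(bow, num_terms=len(dct))
    print(sparse.shape, sparse.nnz)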
    tastyminerals
    @tastyminerals
    Hi guys. I am trying to use gensim's FastText to train custom word embeddings and then load them into a network. However, the embedding bins created by gensim don't seem to work, while the bin from Facebook's GitHub version does. Whenever I use a bin pretrained by gensim, my loss does not drop; it stays the same every single time, regardless of the params I use for the fasttext model. The Facebook version gives me a 2x loss drop. I wonder if I am using gensim FastText correctly:
    
    from gensim.models import FastText

    def read_dataset(fpath):
        with open(fpath, "r") as f:
            for line in f:
                # <sentence creation logic>
                yield sentence  # yield one tokenized sentence per line

    model = FastText(min_count=1, size=50, window=5, workers=8, sg=1, word_ngrams=1, min_n=3, max_n=6, iter=5, negative=0)
    model.build_vocab(read_dataset(args.fpath))
    model.train(read_dataset(args.fpath), total_examples=model.corpus_count, epochs=model.iter)
    model.save("custom_model")
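    (An aside, not a confirmed diagnosis: with negative=0 and the default hs=0, neither negative sampling nor hierarchical softmax is enabled, so there is effectively no training objective and the vectors barely move. Also, a plain generator is consumed after one pass, while multi-epoch training needs a restartable iterable. A sketch of both fixes:)

    from gensim.models import FastText

    class SentenceStream:
        """Restartable iterable: reopens the file on every pass/epoch."""
        def __init__(self, fpath):
            self.fpath = fpath

        def __iter__(self):
            with open(self.fpath, "r") as f:
                for line in f:
                    yield line.split()  # placeholder for the real tokenization

    sentences = SentenceStream("corpus.txt")  # hypothetical path
    model = FastText(min_count=1, size=50, window=5, workers=8, sg=1,
                     min_n=3, max_n=6, iter=5, negative=5)  # negative > 0, or hs=1
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.save("custom_model")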
    estathop
    @estathop
    I have a question about gensim's word2vec model. There are numerous ways to estimate similarity, but is there something implemented where I give it a word and it returns the most distant, most dissimilar word?
    estathop
    @estathop
    is this the answer?
    model.wv.most_similar(negative=['myword'])
    i.e. the antonym?
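    A toy sketch of what that call returns (toy data, just for illustration):

    from gensim.models import Word2Vec

    sents = [["hot", "sun", "fire"], ["cold", "ice", "snow"], ["hot", "cold"]]
    model = Word2Vec(sents, size=10, min_count=1, iter=50)

    # Words closest to the *negation* of the vector -- the most dissimilar
    # direction in cosine space, not necessarily a linguistic antonym.
    print(model.wv.most_similar(negative=["hot"]))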
    Pranav Subramani
    @PranavSubramani_twitter
    Hey, I have a question. I've used gensim for a while now and was using infer_vector for an application I wrote, and started noticing vastly different values for the document vector (between dm=1 and dm=0). This is in version 3.5.0; compared to all prior versions, the results are very different. I checked the logs and there was a learning-rate bug fix in 3.5. Could this be the reason for the vastly different values in the document vectors?
    Pranav Subramani
    @PranavSubramani_twitter
    Was hoping anyone else could chime in? The differences are really, really noticeable
    Radim Řehůřek
    @piskvorky
    @PranavSubramani_twitter this Gitter is rarely visited by the Gensim devs. You may have a better chance on the mailing list, https://groups.google.com/forum/#!forum/gensim
    Somnath Rakshit
    @somnathrakshit
    hey, I have a question. I am using fastText with the pretrained English Wikipedia model.
    is it possible to get the most similar words for some unknown word that might not have existed in the dictionary before?
    phalexo
    @phalexo
    @somnathrakshit The model would have to read minds to know what a never seen before word means.
    Somnath Rakshit
    @somnathrakshit
    how about looking for substrings and then deciding?
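    For the record, FastText's character n-grams are meant for exactly this: an out-of-vocabulary word's vector is assembled from its subword n-grams, provided the full model (with n-grams) is loaded, e.g. via load_fasttext_format, rather than as plain word vectors. A toy sketch:

    from gensim.models import FastText

    sents = [["machine", "learning", "models"], ["deep", "learning", "networks"]]
    model = FastText(sents, size=10, min_count=1, min_n=3, max_n=6, iter=20)

    # "learnings" never occurred in training; its vector is built from the
    # character n-grams it shares with seen words, so the query still works.
    print(model.wv.most_similar("learnings"))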
    AMaini503
    @AMaini503
    I'm training an LDA model on a corpus of ~20k documents. I have read about a few heuristics that indicate convergence of the model, like "variational params stop changing" and the number of documents converged.
    I'm not sure how to interpret the variational params, so I was looking to use the second heuristic. With passes = 50, I see that nearly 90% of the documents converge on the held-out set.
    Is the proportion of documents converged a good heuristic for convergence?
    And how can I use these heuristics to adjust passes/iterations to ensure convergence when the model is updated with a new batch (online training)?
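    Not an authoritative answer, but one practical way to watch both heuristics: gensim's LDA logs per-chunk "x/y documents converged" counts at DEBUG level, plus the topic diff after each pass, so you can raise iterations until most documents converge and raise passes until the topic diff flattens. A toy sketch:

    import logging
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    logging.basicConfig(level=logging.DEBUG)  # surfaces the convergence counts

    docs = [["graph", "trees"], ["human", "computer"], ["graph", "minors"]]  # toy
    dct = Dictionary(docs)
    corpus = [dct.doc2bow(doc) for doc in docs]

    # Tune `iterations` until most documents converge within a chunk, and
    # `passes` until the logged topic diff stops changing between passes.
    lda = LdaModel(corpus, id2word=dct, num_topics=2, passes=50, iterations=100)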
    ghost
    @ghost_intern_twitter
    Hi, I've never worked with Cython before. Are there docs on how to run the project?
    Gregory Werbin
    @gwerbin
    is there a way to "fit" and "transform" a corpus in one shot?
    e.g. with TfidfModel or Dictionary
    it seems like right now I need to make 2 passes, which is extremely inefficient
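    For the Dictionary at least there is a single-pass option: doc2bow(doc, allow_update=True) grows the vocabulary while converting, so fit and transform happen in one pass. TfidfModel genuinely needs the global document frequencies first, so a second pass is hard to avoid there. A sketch:

    from gensim.corpora import Dictionary

    docs = [["one", "pass"], ["one", "shot"]]  # toy data

    dct = Dictionary()
    # allow_update=True adds unseen tokens to the dictionary on the fly,
    # returning each document's BoW vector in the same single pass.
    bow = [dct.doc2bow(doc, allow_update=True) for doc in docs]
    print(bow, dct.token2id)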
    Sidharth Bansal
    @SidharthBansal
    Hi
    I am new here