    phalexo
    @phalexo
    Apparently laziness is not the mother of all invention as I thought. You have wasted 2 days waiting for someone else to do your work.
    matanster
    @matanster
    @phalexo no I haven't, I was just curious how deep the gensim toolchain goes
    @phalexo thanks for being aggressive though
    Ignoring your snarky comment, I'll post another question for others if they care to respond
    I am probably not getting the point of corpus2dense().
    Don't people typically want to deal with sparse matrices in language processing? E.g. when training classification models over sparse language data, you typically can't efficiently train/fit a model on a large bag-of-words-style dense matrix in sklearn without huge memory. What am I missing about it?
    estathop
    @estathop
    Greetings chat, it's nice to have joined you. I have a relatively simple question, I suppose. I want to use tf-idf. I executed the example in the comments and the code successfully returned a vector with the index of each word and its weight. How can I map the index of a word back to the actual word? Is there a function I am not aware of, or can I implement it somehow?
    estathop
    @estathop
    I am seeing that sklearn has this method implemented already
    Radim Řehůřek
    @piskvorky
    @matanster for NumPy vectors, you can get their length using vector.shape. See the # numpy vector of a word example at https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors
    @matanster regarding corpus2dense: it's true NLP produces sparse matrices using the bag-of-words representation. But some later transformations, such as LSI, transform the sparse vectors into a dense space. So using a dense representation actually saves you memory (less overhead than representing a dense matrix using sparse structures).
    Plus, there are many other external tools in the Data Science ecosystem that still cannot handle sparse inputs. So you have to export/supply a dense structure => need corpus2dense. Even scikit-learn didn't have sparse support for a while ;)
    @estathop converting between words and ids is the job of the Dictionary: https://radimrehurek.com/gensim/corpora/dictionary.html
    And see also the Tutorials: https://radimrehurek.com/gensim/tutorial.html
    estathop
    @estathop
    I want to perform topic extraction with the use of TF-IDF, not similarity measures
    estathop
    @estathop
    I am curious why this doesn't work. I suspect it's because the "fake-news" dataset is in a different format than "text8". Is there an easy way to handle it and perform the exact same thing?

    import gensim.downloader as api
    from gensim.models import TfidfModel
    from gensim.corpora import Dictionary

    dataset = api.load("fake-news")
    dct = Dictionary(dataset) # fit dictionary
    corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format

    model = TfidfModel(corpus) # fit model
    vector = model[corpus[0]] # apply model to the first corpus document

    the vector seems to be empty and every element in corpus seems to be the same
    matanster
    @matanster
    @piskvorky thanks a lot for the enlightening comments!
    @piskvorky and I do not envy those who need to convert a big corpus into a dense representation with corpus2dense; I had it overflow the memory of a large machine on just 250,000 documents...
    Then again, that was a bag-of-words drill; with word vectors it might be harmless
    matanster
    @matanster
    One last, I'm probably not very good with pythonic documentation, does the corpus object easily expose a sparse representation as well?
    tastyminerals
    @tastyminerals
    Hi guys. I am trying to use gensim's FastText to train custom word embeddings and then load them into a network. However, the embedding bins created by gensim don't seem to work, while the bin from the Facebook GitHub version does. Why? Whenever I use a bin pretrained by gensim, my loss does not drop; it stays the same every single time, regardless of the params I use for the fasttext model. The Facebook version gives me a 2x loss drop. I wonder if I am using gensim's FastText correctly:
    
    def read_dataset(fpath):
        with open(fpath, "r") as f:
            for line in f:
                # <sentence creation logic>
                yield sentence
    
    model = FastText(min_count=1, size=50, window=5, workers=8, sg=1, word_ngrams=1, min_n=3, max_n=6, iter=5, negative=0)
    model.build_vocab(read_dataset(args.fpath))
    model.train(read_dataset(args.fpath), total_examples=model.corpus_count, epochs=model.iter)
    model.save("custom_model")
    estathop
    @estathop
    I have a question about gensim's word2vec model. There are numerous ways to estimate similarity, but is there something implemented where I give it a word and it returns the most distant, most dissimilar word?
    estathop
    @estathop
    is this the answer ?
    model.wv.most_similar(negative=['myword'])
    the antonym
    Pranav Subramani
    @PranavSubramani_twitter
    Hey, I have a question. I've used gensim for a while now and was using infer_vector for an application I wrote, and I started noticing vastly different values for the document vector (between dm=1 and dm=0). This is in version 3.5.0; in all prior versions the results are very different. I checked the logs and there was a learning rate bug fix in 3.5. Could this be the reason for the vastly different document vectors?
    Pranav Subramani
    @PranavSubramani_twitter
    Was hoping anyone else could chime in, because the differences are really, really noticeable
    Radim Řehůřek
    @piskvorky
    @PranavSubramani_twitter this gitter is rarely visited by the gensim devs. You may have a better chance at the mailing list, https://groups.google.com/forum/#!forum/gensim
    Somnath Rakshit
    @somnathrakshit
    hey, I have a question. I am using fasttext and using the wikipedia pretrained english model
    is it possible to get the most similar words for some unknown word which might not have existed in the dictionary before?
    phalexo
    @phalexo
    @somnathrakshit The model would have to read minds to know what a never seen before word means.
    Somnath Rakshit
    @somnathrakshit
    how about looking for substrings and then deciding?
    AMaini503
    @AMaini503
    I'm training an LDA model on a corpus of ~20k documents. I have read about a few heuristics that indicate convergence of the model, like "variational params stop changing" and #documents converged.
    I'm not aware of the interpretation of the variational params, so I was looking to use the second heuristic. With passes = 50, I see that nearly 90% of the documents converge on the held-out set.
    Is proportion of documents converged a good heuristic for convergence ?
    And, how can I use these heuristics to adjust passes/iterations to ensure convergence when the model is updated with a new batch (online training) ?
    ghost
    @ghost_intern_twitter
    Hi, I've never worked with Cython before, are there docs on how to run the project?
    Gregory Werbin
    @gwerbin
    is there a way to "fit" and "transform" a corpus in one shot?
    e.g. with TfidfModel or Dictionary
    it seems like right now I need to make 2 passes, which is extremely inefficient
    Sidharth Bansal
    @SidharthBansal
    Hi
    I am new here
    Is this machine learning related stuff here?
    estathop
    @estathop
    nlp mostly
    Rohit Kumar
    @aquatiko
    While fixing an issue, I have been told to add a test for the case. Can someone clarify this?
    Julian Gonggrijp
    @jgonggrijp
    Hi, I'm about to start creating w2v models from a large corpus (up to 50GB of text per model, English and Dutch). I'm told gensim may require large-ish amounts of memory. Assuming reading the text from disk can be done in small chunks at a time, could somebody give me a ballpark estimate of how much RAM I'll need to request for the VPS in order for gensim to be able to do its job at least somewhat smoothly? Thanks in advance!
    Stergiadis Manos
    @steremma
    One of gensim's most important properties is the ability to perform out-of-core computation, using generators instead of, say, lists. This means you might not even need to write the chunking logic yourself, and RAM is not a consideration, at least not in terms of gensim's ability to complete the task. That being said, the more RAM you have, the faster the training will presumably go!
    Stergiadis Manos
    @steremma
    @aquatiko You would need to add a set of unit tests that proves your feature works (yields the expected output for a sample input). For example, if you wrote a tokenizer, something like assert tokenize("That's a sentence") == ["that", "a", "sentence"]. You can see examples in how most modules are currently tested.
    Julian Gonggrijp
    @jgonggrijp
    @steremma Thanks for answering my question. So in general, for a Debian machine running gensim, would you say that 4GB of RAM is an OK amount, or likely to be a bit on the tight side?
    Stergiadis Manos
    @steremma
    I'm far from being an expert, but on Ubuntu I didn't have any trouble with 4GB (of course, another machine with 16GB was faster, but that one also had a better CPU etc.). In theory the out-of-core feature guarantees completion regardless of RAM, but when it comes to timing estimates I would experiment myself. For example, estimate the complexity by running on 10, 100, and 1000 docs and then extrapolate to your real dataset. Or look in the literature for the given model, since gensim in most cases closely follows the papers mentioned in the docstrings