    is this the answer?
    model.wv.most_similar(negative=['myword'])
    the antonym
    Pranav Subramani
    Hey, I have a question. I've used gensim for a while now, and was using infer_vector in an application I wrote, and started noticing vastly different values for the document vector (between dm=1 and dm=0). This is in version 3.5.0; compared to all prior versions, the results are very different. I checked the logs and there was a learning-rate bug fix in 3.5. Could this be the reason for these vastly different document vectors?
    Pranav Subramani
    Was hoping anyone else could chime in? Because the differences are really really observable
    Radim Řehůřek
    @PranavSubramani_twitter this Gitter is rarely visited by the Gensim devs. You may have a better chance on the mailing list: https://groups.google.com/forum/#!forum/gensim
    Somnath Rakshit
    hey, I have a question. I am using fasttext and using the wikipedia pretrained english model
    is it possible to get the most similar words for some unknown word which might not have existed in the dictionary before?
    @somnathrakshit The model would have to read minds to know what a never seen before word means.
    Somnath Rakshit
    how about looking for substrings and then deciding?
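That substring intuition is essentially what FastText does: each word is represented as a bag of character n-grams (plus the word itself), so an out-of-vocabulary word gets a vector by summing the vectors of its n-grams, and `wv.most_similar` on an unseen word can work as long as the model was trained with subword information. A rough sketch of the n-gram extraction only, with FastText-style boundary markers (the exact n-gram range is a model hyperparameter, typically 3 to 6):

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams with FastText-style boundary markers < and >."""
    w = f"<{word}>"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# An unseen word still produces n-grams, many of which the model has
# seen inside training words, so a vector can be composed for it.
print(char_ngrams("ab", nmin=3, nmax=3))  # ['<ab', 'ab>']
```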
    I'm training a LDA Model on a corpus of ~20k documents. I have read about a few heuristics that indicate convergence of the model like "variational params stop changing", #documents converged.
    I'm not familiar with the interpretation of the variational params, so I was looking to use the second heuristic. With passes = 50, I see that nearly 90% of the documents converge on the held-out set.
    Is proportion of documents converged a good heuristic for convergence ?
    And, how can I use these heuristics to adjust passes/iterations to ensure convergence when the model is updated with a new batch (online training) ?
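For what it's worth, the "#documents converged" heuristic can also be computed by hand: compare each document's topic distribution between two consecutive passes and count how many moved less than some tolerance. A minimal sketch of that idea (the function name and tolerance are mine, not a gensim API; gensim's LdaModel also reports a "documents converged" line in its INFO-level logs, I believe):

```python
def fraction_converged(prev, curr, tol=1e-3):
    """Fraction of documents whose topic distribution moved less than
    `tol` (L1 distance) between two passes. `prev` and `curr` are lists
    of equal-length topic-probability vectors, one per document."""
    converged = 0
    for p, c in zip(prev, curr):
        if sum(abs(a - b) for a, b in zip(p, c)) < tol:
            converged += 1
    return converged / len(curr)

prev = [[0.5, 0.5], [0.9, 0.1]]
curr = [[0.5, 0.5], [0.6, 0.4]]
print(fraction_converged(prev, curr))  # 0.5: only the first doc is stable
```

Tracking this fraction after each update could then drive the decision of whether another pass is needed for an online batch.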
    Hi, I've never worked with Cython before, are there docs on how to run the project?
    Gregory Werbin
    is there a way to "fit" and "transform" a corpus in one shot?
    e.g. with TfidfModel or Dictionary
    it seems like right now i need to make 2 passes which is extremely inefficient
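A note on why it's two passes: for TF-IDF the second pass is inherent to the weighting itself, since document frequencies over the whole corpus must be known before any single document can be weighted. A pure-Python sketch (not the gensim API) of a combined fit-and-transform that still has to iterate twice:

```python
import math
from collections import Counter

def fit_transform_tfidf(docs):
    """One call, but still two passes over `docs`:
    pass 1 collects document frequencies, pass 2 weights each doc."""
    df = Counter()
    for doc in docs:                       # pass 1: global statistics
        df.update(set(doc))
    n = len(docs)
    out = []
    for doc in docs:                       # pass 2: per-document weights
        tf = Counter(doc)
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

docs = [["human", "interface"], ["human", "computer"]]
weighted = fit_transform_tfidf(docs)
```

If the corpus is a one-shot generator it can only be consumed once, so either materialize it as a list or wrap it in an object whose `__iter__` restarts the stream.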
    Sidharth Bansal
    I am new here
    Is this machine learning related stuff here?
    nlp mostly
    Rohit Kumar
    While fixing an issue I have been told to add a test for the case. Can someone clarify this?
    Julian Gonggrijp
    Hi, I'm about to start creating w2v models from a large corpus (up to 50GB of text per model, English and Dutch). I'm told gensim may require large-ish amounts of memory. Assuming reading the text from disk can be done in small chunks at a time, could somebody give me a ballpark estimate of how much RAM I'll need to request for the VPS in order for gensim to be able to do its job at least somewhat smoothly? Thanks in advance!
    Stergiadis Manos
    One of gensim's most important properties is the ability to perform out-of-core computation, using generators instead of, say, lists. This means you might not even need to write the chunking logic yourself, and RAM is not a constraint, at least not in terms of gensim's ability to complete the task. That being said, the more RAM you have, the faster the training will presumably go!
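The usual gensim pattern is a corpus object whose `__iter__` re-opens the underlying file, so the model can make several passes without the corpus ever being fully in RAM; memory is then dominated by the vocabulary and the weight matrices (roughly vocab_size x vector_size x 4 bytes per matrix, and there are at least two such matrices). A sketch, with the file path and the one-sentence-per-line layout as assumptions:

```python
class SentenceStream:
    """Streams one tokenized sentence per line. Re-opening the file in
    __iter__ lets gensim make multiple passes over the corpus without
    holding it in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.split()

# With gensim you would pass the stream directly, something like:
#   model = gensim.models.Word2Vec(SentenceStream("corpus.txt"))
# (exact parameter names depend on your gensim version)
```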
    Stergiadis Manos
    @aquatiko You would need to add a set of unit tests that proves that your feature works (yields the expected output for a sample input). For example, if you wrote a tokenizer, something like assert tokenize("That's a sentence") == ["that", "a", "sentence"]. You can look at how most modules are currently tested for examples.
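As a concrete illustration, a gensim-style test is a `unittest.TestCase` in a `test_*.py` module (gensim's own tests typically live under `gensim/test/`). The tokenizer below is purely hypothetical, just a stand-in for the feature under test:

```python
import unittest

def simple_tokenize(text):
    """Hypothetical feature under test: lowercase alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]

class TestSimpleTokenize(unittest.TestCase):
    def test_keeps_alphabetic_tokens(self):
        self.assertEqual(simple_tokenize("That is a sentence"),
                         ["that", "is", "a", "sentence"])

    def test_empty_input(self):
        self.assertEqual(simple_tokenize(""), [])

# run with: python -m unittest test_tokenize.py
```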
    Julian Gonggrijp
    @steremma Thanks for answering my question. So in general, for a Debian machine running gensim, would you say that 4GB of RAM is an OK amount, or likely to be a bit on the tight side?
    Stergiadis Manos
    I'm far from being an expert, but on Ubuntu I didn't have any trouble with 4GB (of course, another machine with 16GB was faster, but that one also had a better CPU etc.). In theory the out-of-core feature guarantees completion regardless of RAM, but when it comes to timing estimates I would experiment myself: for example, estimate the complexity by running on 10, 100, and 1000 docs and then extrapolate to your real dataset. Or look in the literature for the given model, since gensim in most cases closely follows the papers mentioned in the docstrings.
    You can also check the callbacks package and pass some of them to your model for inspection purposes. For example, you can use a logging callback in on_epoch_end to visualize the progress at runtime.
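For reference, gensim exposes this via `gensim.models.callbacks.CallbackAny2Vec`, whose `on_epoch_end(model)` hook is invoked after each pass over the corpus. Below is a dependency-free sketch of the same shape, with a stand-in loop instead of a real model so it runs on its own:

```python
class EpochLogger:
    """Mimics gensim's CallbackAny2Vec interface: the trainer calls
    on_epoch_end(model) after every pass over the corpus."""
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"epoch {self.epoch} finished")

# With gensim you would pass it at training time, e.g.:
#   Word2Vec(sentences, callbacks=[EpochLogger()])
# Minimal stand-in loop so the sketch is runnable on its own:
logger = EpochLogger()
for _ in range(3):
    logger.on_epoch_end(model=None)
print(logger.epoch)  # 3
```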
    Julian Gonggrijp
    Stergiadis Manos
    np :)
    Stergiadis Manos

    Hello everyone. I spent some time over the weekend playing with a news dataset I found on Kaggle, and thought of showing Gensim's capabilities to the machine learning community. I was inspired by the fact that Kaggle recently partnered up with SpaCy by promoting kernels that use this package. So I thought of also showing an alternative. The kernel is only the result of 6-7 hours of Sunday work; however, it achieves some interesting results with no more than 5 lines of Gensim code.

    You can take a look here: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn

    PS: It's my first kaggle kernel ever so please go easy on me :D

    Radim Řehůřek
    Hi @steremma that's very interesting :) What's the advantage of "promoting kernels", how does that work?
    Stergiadis Manos
    Well, they added a custom tag called SpaCy, and kernels featuring the package with many votes would win some reward or something.
    actually I took my kernel private for now because I realized that the topics produced are not reproducible unless I set the random state, I will fix and publish tomorrow
    as a result there was an explosion of SpaCy featuring kernels. However I don't know how much those affect popularity
    Stergiadis Manos

    Here is the new public link: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn?scriptVersionId=6727890

    Looking forward to any feedback in case you have time to go through it @piskvorky

    Nabanita Dash
    Hi!! I am new to gensim
    I want to understand the code base. I need help!!
    Jeff Schneider
    Are there any good examples for the creation of a topic model on a specific taxonomy such as IAB or IPTC?
    Hi there! First of all, sorry for my poor English. I read about word2vec, and I think I found a mistake in movie-plots-by-genre, in the Jupyter notebook. I opened an issue and created a pull request on GitHub (RaRe-Technologies/movie-plots-by-genre#14). For now I am trying to understand whether it is a real mistake or my fault.
    Joseph Bullock

    Hi, I am trying to run dynamic topic modelling through the wrapper DtmModel. I am running this on a dataset of around 1.5M documents, each several sentences long. I'm getting the error:

    subprocess.CalledProcessError: Command '['/efs/data/jpb/Qatalog/dtm/dtm/main', '--ntopics=10', '--model=dtm', '--mode=fit', '--initialize_lda=true', '--corpus_prefix=/tmp/2beb46_train', '--outname=/tmp/2beb46_train_out', '--alpha=0.01', '--lda_max_em_iter=10', '--lda_sequence_min_iter=6', '--lda_sequence_max_iter=10', '--top_chain_var=0.005', '--rng_seed=0']' died with <Signals.SIGABRT: 6>.

    Any thoughts would be appreciated.

    By the way, I know it works on smaller datasets

    Syed Farhan Ahmad
    Hi everyone, I am an engineering student from Bangalore, India looking forward to contributing to gensim as part of GSoC 2019. Please guide me with the steps to get started.
    Harshal Mittal
    Hi, I am a junior at IIT Roorkee India, and would like to contribute to the Gensim project for Gsoc19. May I get some beginner's guidance for the same, also some idea about the current year's projects for Gsoc would help. Thanks :)
    Julian Gonggrijp
    The first line of a w2v file contains two numbers in plaintext. What do these numbers mean? Number of unique tokens and vector size?
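Yes: in the word2vec text format the header line is exactly "<number of vocabulary entries> <vector dimensionality>", followed by one "word value value ..." line per entry; this is the format gensim's `KeyedVectors.load_word2vec_format` reads. A small self-contained sketch that writes and parses a tiny example file:

```python
import os
import tempfile

def read_w2v_header(path):
    """First line of a word2vec-format text file: vocab size, then
    vector dimensionality, space-separated."""
    with open(path, encoding="utf-8") as fh:
        vocab_size, vector_size = map(int, fh.readline().split())
    return vocab_size, vector_size

# Tiny example file: 2 vocabulary entries, 3-dimensional vectors.
path = os.path.join(tempfile.gettempdir(), "tiny.vec")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("2 3\n")
    fh.write("cat 0.1 0.2 0.3\n")
    fh.write("dog 0.4 0.5 0.6\n")

print(read_w2v_header(path))  # (2, 3)
```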
    Philippe Rivière
    hello everyone; what kind of clustering and visualization techniques do you usually apply to the embeddings you compute with gensim?
    I'm trying UMAP+HDBSCAN with various parameters
    Ahmed T. Hmmad
    Hi guys, any idea how to map the topic vectors generated by the HDP model back onto the list of documents? I have seen many examples with LDA but none with HDP.
    V.Prasanna kumar
    Hey everyone, can anyone help me with labelling the topics using the Gensim module?