    Stergiadis Manos

    Hello everyone. I spent some time over the weekend playing with a news dataset I found on Kaggle, and thought of showing Gensim's capabilities to the machine learning community. I was inspired by the fact that Kaggle recently partnered up with SpaCy by promoting kernels that use this package, so I thought of also showing an alternative. The kernel is only the result of 6-7 hours of Sunday work; however, it achieves some interesting results with no more than 5 lines of Gensim code.

    You can take a look here: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn

    PS: It's my first kaggle kernel ever so please go easy on me :D

    Radim Řehůřek
    Hi @steremma that's very interesting :) What's the advantage of "promoting kernels", how does that work?
    Stergiadis Manos
    Well, they added a custom tag called SpaCy, and then kernels featuring the package with many votes would win some reward or something.
    actually I took my kernel private for now because I realized that the topics produced are not reproducible unless I set the random state, I will fix and publish tomorrow
    as a result there was an explosion of SpaCy featuring kernels. However I don't know how much those affect popularity
    Stergiadis Manos

    Here is the new public link: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn?scriptVersionId=6727890

    Looking forward to any feedback in case you have time to go through it @piskvorky

    Nabanita Dash
    Hi!! I am new to gensim
    I want to understand the code base. I need help!!
    Jeff Schneider
    Are there any good examples for the creation of a topic model on a specific taxonomy such as IAB or IPTC?
    Hi there! First of all, sorry for my poor English. I read about word2vec, and I think I found a mistake in the movie-plots-by-genre Jupyter notebook. I opened an issue and created a pull request on GitHub (RaRe-Technologies/movie-plots-by-genre#14). For now I am trying to understand whether it is a real mistake or my fault.
    Joseph Bullock

    Hi, I am trying to run dynamic topic modelling through the wrapper DtmModel. I am running this on a dataset of around 1.5M documents, each several sentences long. I'm getting the error:

    subprocess.CalledProcessError: Command '['/efs/data/jpb/Qatalog/dtm/dtm/main', '--ntopics=10', '--model=dtm', '--mode=fit', '--initialize_lda=true', '--corpus_prefix=/tmp/2beb46_train', '--outname=/tmp/2beb46_train_out', '--alpha=0.01', '--lda_max_em_iter=10', '--lda_sequence_min_iter=6', '--lda_sequence_max_iter=10', '--top_chain_var=0.005', '--rng_seed=0']' died with <Signals.SIGABRT: 6>.

    Any thoughts would be appreciated.

    By the way, I know it works on smaller datasets

    Syed Farhan Ahmad
    Hi everyone, I am an engineering student from Bangalore, India looking forward to contributing to gensim as part of GSoC 2019. Please guide me with the steps to get started.
    Harshal Mittal
    Hi, I am a junior at IIT Roorkee India, and would like to contribute to the Gensim project for Gsoc19. May I get some beginner's guidance for the same, also some idea about the current year's projects for Gsoc would help. Thanks :)
    Julian Gonggrijp
    The first line of a w2v file contains two numbers in plaintext. What do these numbers mean? Number of unique tokens and vector size?
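    In the plain-text word2vec format, the header line holds the number of vectors (the vocabulary size) followed by the dimensionality of each vector. A minimal sketch of reading it by hand follows; the sample data and the helper name `read_w2v_header` are made up for illustration and are not part of gensim.

```python
import io


def read_w2v_header(stream):
    """Return (vocab_size, vector_size) parsed from the first line of a text-format w2v file."""
    first_line = stream.readline()
    vocab_size, vector_size = (int(field) for field in first_line.split())
    return vocab_size, vector_size


# Hypothetical file content: 3 words, 4 dimensions each.
sample = io.StringIO(
    "3 4\n"
    "king 0.1 0.2 0.3 0.4\n"
    "queen 0.2 0.1 0.4 0.3\n"
    "apple 0.9 0.8 0.1 0.0\n"
)

vocab_size, vector_size = read_w2v_header(sample)

# The remaining lines are "<token> <v1> ... <vN>"; check them against the header.
rows = [line.split() for line in sample]
header_matches = len(rows) == vocab_size and all(
    len(row) - 1 == vector_size for row in rows
)
print(vocab_size, vector_size, header_matches)
```

    The sketch only covers the text variant of the format, where the header is readable as two whitespace-separated integers.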
    Philippe Rivière
    hello everyone; what kind of clustering and visualization techniques do you usually apply to the embeddings you compute with gensim?
    I'm trying UMAP+HDBSCAN with various parameters
    Ahmed T. Hmmad
    Hi guys, any idea on how to bring the prototype vector generated from the HDP model function to the list of documents? I have seen many examples with LDA but none with HDP
    V.Prasanna kumar
    Hey everyone, can anyone help me with labelling the topics using the Gensim module?
    Brendan Reed
    hi all, does anyone have experience in debugging a windows install? have mingw in my path but gensim still can't find the C compiler, using win10 py 3.7
    @piskvorky Normally documents are tagged with a single unique identifier when training a Doc2Vec model. That said, one can use multiple tags. If I use a single tag associated with multiple documents, a vector is generated for that multi-document tag. My current understanding is that a multi-document tag vector will end up somewhere in the center of documents' vectors (close to the center of mass so to speak). Is my understanding correct? Thanks.
    @/all Has this channel died? If so, where is the community support now? Thanks.
    Brenner Haverlock
    Radim Řehůřek
    @officialbrenner why not install from the precompiled Python Wheel?
    other than that, I'm afraid we have little ability to debug Windows: none of us have Windows, or use that platform.
    @phalexo the primary support channel is the mailing list: https://radimrehurek.com/gensim/support.html . I personally don't check Gitter at all.
    Hi everyone, I'm using gensim to create a word embedding model that includes additional linguistic features such as POS tags, lemmas, named entities, ... Are there any options for me to implement this idea?
    Hi everyone, can somebody tell me why the gensim LDA model outputs a different topic distribution for the same sentence? Every time I run it, the result is different. Please?
    Joseph Bullock
    @ggqshr the model is randomly seeded every time before performing clustering - this means that sometimes sentences can belong to different clusters. However, each sentence should have a probability of being assigned to each possible topic, so you might also be seeing that the probability isn't changing massively, but it is changing enough to alter the most likely topic
    @JosephPB it's very helpful! Thank you very much!
    @JosephPB hey, I want to ask another stupid question: how can I avoid the previous situation? I have increased the value of the passes parameter, but the same sentence still gets different results, and the results vary greatly. Can someone tell me how to avoid this, and why it happens?
    Joseph Bullock

    Hi @ggqshr No problem. If you want to be able to reproduce the same result each time, then you can set the random_state to an integer value. See the parameters on the gensim page: https://radimrehurek.com/gensim/models/ldamodel.html

    Hope this helps :)

    @JosephPB unfortunately, I have set random_state, but the results are still different each time. My situation is the same as the following page, but the passes parameter does not work.
    Philippe Rivière
    hello, I'm using gensim to generate an LDA model of my documents. Then I export the vectors to matrixmarket format, and create a 2D embedding with UMAP in JavaScript. So far so good. Now I would like to do this UMAP transform in python, but I can't find out how to "convert" the document vectors into the LDA topic space… It should be "obvious" in the sense that what I need is an n * m matrix where n is the number of documents and m the number of topics.
    Philippe Rivière

    I'm blocked here:

    transformed = lda[corpus_lda]
    X = np.array(transformed)
    embedding = umap.UMAP().fit_transform(X)

    the value of X is an array of lists instead of a numpy array expected by umap.

    Philippe Rivière
    I built the np.array by hand and it works
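    Building the array by hand can be sketched as follows. Indexing an LDA model with a corpus yields, per document, a list of (topic_id, probability) pairs, and topics below the probability threshold are left out by default, so the lists have different lengths and do not stack into a rectangular array. The function name `densify` and the sample values are assumptions made for illustration.

```python
def densify(sparse_docs, num_topics):
    """Turn per-document [(topic_id, prob), ...] lists into n rows of m floats."""
    dense = []
    for doc in sparse_docs:
        row = [0.0] * num_topics          # topics missing from the list stay at 0.0
        for topic_id, prob in doc:
            row[topic_id] = float(prob)
        dense.append(row)
    return dense


# Hypothetical stand-in for list(lda[corpus_lda]) with 3 documents and 3 topics.
transformed = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
    [(0, 0.25), (1, 0.25), (2, 0.5)],
]

X = densify(transformed, num_topics=3)
print(X)
```

    The resulting list of lists is rectangular, so np.array(X) gives the n * m array that umap expects. gensim also ships gensim.matutils.corpus2dense for this kind of conversion; it returns the array with documents as columns, so it would need transposing first.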
    Herli Menezes
    Hi, is there any gensim module for portuguese language?
    Herli Menezes
    More specifically. How to manage diacritics in gensim?
    Hi All, is this channel active?
    @piskvorky Quick question. I know that LSI can return less than requested number of topics (for short texts, usually). I think LDA does that, too. How about HDP? Could it ever return less than the requested number of topics (in my interpretation, that is the m_T property)?
    Andrew M Olney
    Greetings, I'm teaching a class using gensim at this very moment. All my windows users have hit “OverflowError: Python int too large to convert to C long” when executing this line of code: fakeDataset = downloader.load('fake-news') I could try to distribute the dataset manually, but are there any other suggestions?
    Andrew M Olney
    I'll put an issue on GitHub. Thanks :)
    I am having an issue with an LDA model: after training, when I try to see the topic distribution of some terms, it gives an empty list []. Could anyone tell me why this is happening? Thanks in advance :)
    Rob Creel

    Good day. I'm going through the tutorials and I'm getting an error. On the run_corpora_and_vector_spaces.ipynb notebook, in the cell with the following code

    for vector in corpus_memory_friendly:  # load one vector into memory at a time

    I get this error

    HTTPError: 404 Client Error: Not Found for url: https://radimrehurek.com/gensim/mycorpus.txt

    The code does not look like it should be calling/visiting a URL, but it seems to be trying and failing to. What's going on here? How may I run the tutorial?

    Machine specs:
    Operating System: Manjaro Linux
    Processors: 4 × Intel® Core™ i5-3320M CPU @ 2.60GHz
    Memory: 15.5 GiB of RAM
    Notebook is running in Jupyter Lab in Firefox 77.0.1 (64-bit)

    Hi. In word2vec it can be useful to distinguish the context to the left of a word and the context to the right.
    does gensim support this?
    Qi Wang
    Has sent2vec been merged yet?
    Data Knight 🎠
    How can I start contributing to gensim?