Hello everyone. I spent some time over the weekend playing with a news dataset I found on Kaggle, and thought of showing Gensim's capabilities to the machine learning community. I was inspired by the fact that Kaggle recently partnered up with SpaCy by promoting kernels that use this package, so I thought of showing an alternative as well. The kernel is the result of only 6-7 hours of Sunday work, but it achieves some interesting results with no more than 5 lines of Gensim code.
You can take a look here: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn
PS: It's my first Kaggle kernel ever, so please go easy on me :D
Here is the new public link: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn?scriptVersionId=6727890
Looking forward to any feedback in case you have time to go through it @piskvorky
Hi, I am trying to run dynamic topic modelling through the DtmModel wrapper. I am running this on a dataset of around 1.5M documents, each several sentences long. I'm getting the error:
subprocess.CalledProcessError: Command '['/efs/data/jpb/Qatalog/dtm/dtm/main', '--ntopics=10', '--model=dtm', '--mode=fit', '--initialize_lda=true', '--corpus_prefix=/tmp/2beb46_train', '--outname=/tmp/2beb46_train_out', '--alpha=0.01', '--lda_max_em_iter=10', '--lda_sequence_min_iter=6', '--lda_sequence_max_iter=10', '--top_chain_var=0.005', '--rng_seed=0']' died with <Signals.SIGABRT: 6>.
Any thoughts would be appreciated.
By the way, I know it works on smaller datasets
stigjb: That seems right, Julian. Here is the relevant line of code: https://github.com/RaRe-Technologies/gensim/blob/949213a58db8532bf86fd4bd424e256a83474d3e/gensim/models/utils_any2vec.py#L283
Hi @ggqshr, no problem. If you want to reproduce the same result on every run, you can set random_state to an integer value. See the parameters on the gensim page: https://radimrehurek.com/gensim/models/ldamodel.html
Hope this helps :)
fakeDataset = downloader.load('fake-news')
I could try to distribute the dataset manually, but are there any other suggestions?
Good day. I'm going through the tutorials and I'm getting an error. In the run_corpora_and_vector_spaces.ipynb notebook, in the cell with the following code:
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
I get this error:
HTTPError: 404 Client Error: Not Found for url: https://radimrehurek.com/gensim/mycorpus.txt
The code does not look like it should be calling/visiting a URL, but it seems to be trying and failing to. What's going on here? How may I run the tutorial?
Operating System: Manjaro Linux
Processors: 4 × Intel® Core™ i5-3320M CPU @ 2.60GHz
Memory: 15.5 GiB of RAM
Notebook is running in Jupyter Lab in Firefox 77.0.1 (64-bit)
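A possible explanation, assuming the tutorial's corpus class opens its input through smart_open (which can read remote URLs transparently): iterating over the corpus then triggers an HTTP request, and a 404 would mean the hosted file has moved or been removed. A minimal sketch of a workaround is to stream from a local copy instead. Everything below is illustrative: the file name mycorpus_local.txt, the class LocalCorpus, the toy documents, and the plain-dict token2id mapping (a stand-in for a gensim Dictionary) are assumptions, not the tutorial's code.

```python
import os
import tempfile
from collections import Counter


class LocalCorpus:
    """Stream bag-of-words vectors from a local file, one document per line."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # hypothetical stand-in for a gensim Dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(line.lower().split())
                # Keep only known tokens, as sorted (token_id, count) pairs.
                yield sorted(
                    (self.token2id[tok], n)
                    for tok, n in counts.items()
                    if tok in self.token2id
                )


# Write a tiny made-up corpus to a temporary file for demonstration.
docs = ["human machine interface", "graph of trees", "graph minors trees graph"]
path = os.path.join(tempfile.mkdtemp(), "mycorpus_local.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(docs) + "\n")

# Build the vocabulary in a first streaming pass over the file.
token2id = {}
with open(path, encoding="utf-8") as handle:
    for line in handle:
        for tok in line.lower().split():
            token2id.setdefault(tok, len(token2id))

corpus_memory_friendly = LocalCorpus(path, token2id)
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
```

If the tutorial's class does look like this, pointing its open() call at a downloaded copy of the file should avoid the network request altogether.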