With corpus2dense, I had it overflow the memory of a large machine on just 250,000 documents.
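The memory blow-up makes sense: a dense term-document matrix costs num_terms * num_docs entries no matter how sparse the data is, while a CSC matrix (what gensim's `matutils.corpus2csc` produces) stores only non-zeros. A minimal sketch of the difference, using a tiny stand-in corpus in gensim's `(term_id, count)` bag-of-words format (the corpus contents here are invented for illustration):

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for a gensim bag-of-words corpus: a list of documents,
# each a list of (term_id, count) pairs. A real corpus would have ~250k docs.
corpus = [[(0, 2), (3, 1)], [(1, 1), (2, 4)], [(0, 1), (3, 3)]]
num_terms = 4

# Dense layout: every document stores a value for EVERY vocabulary term,
# so memory grows as num_terms * num_docs regardless of sparsity.
dense = np.zeros((num_terms, len(corpus)), dtype=np.float64)
for doc_id, doc in enumerate(corpus):
    for term_id, count in doc:
        dense[term_id, doc_id] = count

# Sparse CSC layout: only the non-zero entries are stored.
data, rows, cols = [], [], []
for doc_id, doc in enumerate(corpus):
    for term_id, count in doc:
        rows.append(term_id)
        cols.append(doc_id)
        data.append(count)
sparse_mat = sparse.csc_matrix((data, (rows, cols)),
                               shape=(num_terms, len(corpus)))

print(dense.nbytes)            # full num_terms * num_docs * 8 bytes
print(sparse_mat.data.nbytes)  # bytes for just the stored non-zero values
```

With a 250k-document corpus and a realistic vocabulary the dense version is gigabytes of mostly zeros, which is why `corpus2csc` (or simply keeping the streamed corpus) is the usual route.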
from gensim.models import FastText

def read_dataset(fpath):
    with open(fpath, "r") as f:
        for line in f:
            # <sentence creation logic>
            yield sentence

model = FastText(min_count=1, size=50, window=5, workers=8, sg=1,
                 word_ngrams=1, min_n=3, max_n=6, iter=5, negative=0)
model.build_vocab(read_dataset(args.fpath))
model.train(read_dataset(args.fpath), total_examples=model.corpus_count,
            epochs=model.iter)
model.save("custom_model")
assert tokenize("That's a sentence") == ["that", "a", "sentence"]. You can see examples of this style, since most modules are currently tested.
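For illustration, a hypothetical `tokenize` that would satisfy that assertion (the real implementation is not shown here; this sketch just lowercases, drops the possessive/contraction "'s", and keeps alphabetic runs):

```python
import re

def tokenize(text):
    # Hypothetical sketch matching the asserted behaviour:
    # lowercase, strip "'s", then extract alphabetic tokens.
    text = text.lower().replace("'s", "")
    return re.findall(r"[a-z]+", text)

assert tokenize("That's a sentence") == ["that", "a", "sentence"]
```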
callbacks package and pass some of them to your model for inspection purposes. For example, you can use a logging callback in on_epoch_end so that you can visualize the progress at runtime.
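A minimal sketch of that pattern, assuming the callback protocol gensim uses (the real base class is CallbackAny2Vec in gensim.models.callbacks; the EpochLogger and FakeModel names here are invented for illustration, and the fake trainer just halves a dummy loss each epoch):

```python
import logging

logging.basicConfig(level=logging.INFO)

class EpochLogger:
    """Hypothetical per-epoch logging callback, following the same hook
    shape as gensim's CallbackAny2Vec (on_epoch_end receives the model)."""

    def __init__(self):
        self.epoch = 0
        self.history = []

    def on_epoch_end(self, model):
        # In gensim, `model` would be the Word2Vec/FastText instance
        # being trained; here we just record a loss-like value.
        loss = model.get_latest_training_loss()
        self.history.append(loss)
        logging.info("epoch %d finished, loss=%.3f", self.epoch, loss)
        self.epoch += 1

class FakeModel:
    """Stand-in trainer, only to show how the hook gets invoked."""
    def __init__(self):
        self._loss = 10.0

    def get_latest_training_loss(self):
        self._loss *= 0.5   # pretend the loss halves every epoch
        return self._loss

callback = EpochLogger()
model = FakeModel()
for _ in range(3):              # pretend 3 training epochs
    callback.on_epoch_end(model)

print(callback.history)         # the per-epoch values you could plot
```

With the real library you would pass such a callback via the model's `callbacks` argument instead of calling the hook yourself.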
Hello everyone. I spent some time over the weekend playing with a news dataset I found on Kaggle, and thought I'd show Gensim's capabilities to the machine learning community. I was inspired by the fact that Kaggle recently partnered up with SpaCy by promoting kernels that use this package, so I thought I'd show an alternative. The kernel is only the result of 6-7 hours of Sunday work; however, it achieves some interesting results with no more than 5 lines of Gensim code.
You can take a look here: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn
PS: It's my first kaggle kernel ever so please go easy on me :D
Here is the new public link: https://www.kaggle.com/steremma/news-exploration-using-gensim-and-sklearn?scriptVersionId=6727890
Looking forward to any feedback in case you have time to go through it @piskvorky
Hi, I am trying to run dynamic topic modelling through the wrapper DtmModel. I am running this on a dataset of around 1.5M documents, each several sentences long. I'm getting the error:
subprocess.CalledProcessError: Command '['/efs/data/jpb/Qatalog/dtm/dtm/main', '--ntopics=10', '--model=dtm', '--mode=fit', '--initialize_lda=true', '--corpus_prefix=/tmp/2beb46_train', '--outname=/tmp/2beb46_train_out', '--alpha=0.01', '--lda_max_em_iter=10', '--lda_sequence_min_iter=6', '--lda_sequence_max_iter=10', '--top_chain_var=0.005', '--rng_seed=0']' died with <Signals.SIGABRT: 6>.
Any thoughts would be appreciated.
By the way, I know it works on smaller datasets.