Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB: should I write the table to a file first and then feed it to Gensim with LineSentence, or how? Thanks
phalexo
@phalexo
@rilut If you have a Python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. You might want to write a generator function for that purpose.
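A rough sketch of what that could look like, assuming the DataStax cassandra-driver package and made-up keyspace/table/column names; since gensim makes several passes over the corpus (vocabulary scan plus training epochs), a restartable iterable class is safer than a one-shot generator:

from cassandra.cluster import Cluster
from gensim.models import Word2Vec

class CassandraSentences(object):
    # Restartable iterable: each call to __iter__ opens a fresh result set,
    # so gensim can scan the corpus for the vocabulary and again per epoch.
    def __init__(self, hosts, keyspace, query):
        self.hosts = hosts
        self.keyspace = keyspace
        self.query = query

    def __iter__(self):
        cluster = Cluster(self.hosts)
        session = cluster.connect(self.keyspace)
        for row in session.execute(self.query):
            yield row.text.lower().split()  # assumes a 'text' column; naive tokenization
        cluster.shutdown()

sentences = CassandraSentences(['127.0.0.1'], 'my_keyspace', 'SELECT text FROM sentences')
model = Word2Vec(sentences, workers=4)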
phalexo
@phalexo
If you don't have a Python API to Cassandra, you could use a different language and dump the results into a pipe, from which Python can read and feed the data into Gensim models.
Kataev Egor
@Lavabar
Hello, I have a question about word2vec. As for skip-grams, do you build them within just one sentence (separately from other sentences), or do skip-grams use parts of other sentences too?
phalexo
@phalexo
@Lavabar I think the usual thing to do would be to strip all the punctuation and use an entire text as a single blob. Effective window size is around 15-17 words; most sentences are probably shorter, unless you're playing with German. :-)
Kataev Egor
@Lavabar
thank you)
phalexo
@phalexo
That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
Ramin
@transfluxus
@Lavabar @phalexo I think the document is often chunked into lines, where you have one sentence per line
phalexo
@phalexo
Sure, but one would lose some information about relationships between adjacent thoughts/ideas. A hybrid approach could even be used, feeding in both self-contained documents and individual sentences.
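For what it's worth, a naive sketch of that hybrid idea (the sentence splitting, function name, and my_documents are made up for illustration):

def hybrid_corpus(documents):
    # Yield each document once as a single blob and once sentence by sentence.
    for doc in documents:
        sentences = [s.split() for s in doc.split('.') if s.strip()]  # naive split
        yield [word for sent in sentences for word in sent]  # whole text as one blob
        for sent in sentences:                                # then individual sentences
            yield sent

# materialize it (or wrap it in a class with __iter__) so gensim can iterate more than once
corpus = list(hybrid_corpus(my_documents))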
Ramin
@transfluxus
Hi, I just found that of all the en_core_web models, only the small one detects stopwords?
Evgeny Denisov
@eIGato
Hi guys. I've thought up a hack for Doc2Vec inference, but I don't know if it makes any sense. The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document. The hack is that, after the bulk training, I just re-infer all vectors and replace all document vectors with the inferred ones. Does that make sense?
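If I understand the hack right, it amounts to something like this sketch (assuming an already-trained model d2v, the original tagged_docs it was trained on with integer tags 0..N-1, and the older docvecs arrays used elsewhere in this thread; parameter values are arbitrary examples):

for i, doc in enumerate(tagged_docs):
    inferred = d2v.infer_vector(doc.words, alpha=0.025, min_alpha=0.0001, steps=100)
    d2v.docvecs.doctag_syn0[i] = inferred   # overwrite the bulk-trained vector

d2v.docvecs.doctag_syn0norm = None          # drop cached normalized vectors so they get recomputed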
phalexo
@phalexo
And if you do it a second time, all the inferred vectors are going to be different again, because you changed the model.
Do it a few hundred times, maybe it will converge to something stable. :-)
Evgeny Denisov
@eIGato
@phalexo you are wrong about it.
In [13]: result = [np.dot(old_infer, new_infer) for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]
In [14]: hist = plt.hist(result, bins=sim_borders)
In [15]: plt.show(block=False)
I've calculated the similarity of the first and second inference: it's about 1.0 for all the vectors.
phalexo
@phalexo
What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
Evgeny Denisov
@eIGato
This is the similarity between the first inference (steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha) and the last result from the interactive shell.
Evgeny Denisov
@eIGato
F@#$%! All that time I was inferring documents from a generator. A one-time generator. They did not get re-inferred at all. Damn.
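The pitfall in miniature, with toy data:

docs = (words for words in [['a', 'b'], ['c', 'd']])  # one-shot generator
first_pass = list(docs)    # [['a', 'b'], ['c', 'd']]
second_pass = list(docs)   # []  -- already exhausted, so nothing got re-inferred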
Evgeny Denisov
@eIGato
This is the similarity between the first inference and one with alpha = min_alpha = 0.19, steps = 4500.
Evgeny Denisov
@eIGato
Replaced the document vectors with a copy of the inferred ones, and re-inferred. Similarity is about 1.0, like before, because infer_vector() doesn't use the old docvecs at all.
But I still don't know if it makes sense.
phalexo
@phalexo
I don't either. It makes no sense to me. Considering that inferred vectors should differ based on the parameters, it would seem odd.
Evgeny Denisov
@eIGato
d2v.sample = 1.0 / word_count
Is it reasonable?
Evgeny Denisov
@eIGato
word_count is the number of distinct words (roughly the length of d2v.wv.index2word).
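Restating that heuristic in code, just to pin down what is being asked (not an endorsement of the value):

vocab_size = len(d2v.wv.index2word)   # number of distinct words in the vocabulary
d2v.sample = 1.0 / vocab_size         # the heuristic in question: downsampling threshold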
Evgeny Denisov
@eIGato
How do I pick a reasonable sample value?
Evgeny Denisov
@eIGato
What if 90% of the words are different from each other? Is it possible to train a *2Vec model on such a corpus?
phalexo
@phalexo
Clearly that is not a natural language application.
Evgeny Denisov
@eIGato
@phalexo The purpose is to predict phrases, not words. So I use phrases as d2v words, and full texts as d2v docs.
Saurabh Vyas
@saurabhvyas
Is there a pretrained LDA model available for gensim, just for tinkering?
Matan Shenhav
@ixxie
@saurabhvyas can LDA even be used in a supervised mode?
anyway, it's been pretty easy for us to train+predict on a given data set
Dennis.Chen
@DennisChen0307
Hi there. Is there any roadmap for the next release of gensim?
AMaini503
@AMaini503
Should I expect Doc2Vec to use all the cores if I pass workers = #cpus ?
matanster
@matanster
Apologies for adding a 4th question in a row here... Does gensim have anything built in for transforming a document into a bag-of-n-grams representation, or does it in fact only do bag of words? (words being 1-grams...)
Radim Řehůřek
@piskvorky
@matanster Gensim doesn't actually do the transformation; it already expects (feature_id, weight) bag-of-whatever pairs on input. How you split the documents into words/n-grams/something else is up to you.
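For example, one way to get a bag-of-n-grams is to build the n-gram "tokens" yourself and let Dictionary/doc2bow do the counting (a sketch; the toy corpus and helper function are made up):

from gensim.corpora import Dictionary

def ngrams(tokens, n=2):
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'dog', 'sat', 'on', 'the', 'rug']]
ngram_docs = [doc + ngrams(doc) for doc in docs]   # unigrams plus bigrams

dictionary = Dictionary(ngram_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in ngram_docs]
# each document is now a list of (feature_id, count) pairs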
@DennisChen0307 our aim is one release per month, but the last few months have been busy at RARE, with not much time for open source. We plan a release for the end of this month.
matanster
@matanster
@piskvorky oh, sorry then, I just thought maybe corpora.Dictionary.doc2bow might have some usage form for that... I could swear I saw it computing the bag-of-words in my code, but I should probably start reading the source to answer my own questions
Jesse Talavera-Greenberg
@JesseTG
When training a word2vec model, I need to give it a list of documents. How does word2vec treat unknown words? By giving an unknown word a vector close to a known word?