Evgeny Denisov
@eIGato
And how large should `negative` be? I use the default of 5.
Ivan Menshikh
@menshikh-iv
@eIGato this is an OK value, but are you sure that your corpus doesn't contain only very similar documents?
rambo-yuanbo
@rambo-yuanbo
thanks a lot! @phalexo @ixxie
rambo-yuanbo
@rambo-yuanbo
hi guys. As suggested by many, I simply average the word2vec vectors of the words in a document, and calculate the inner product between the average vector and some query word's vector as a similarity/relevance measure. As some results look weird (the similarity is obviously too high), I also tried averaging the similarity between each of the document's words and the query word. The two measures are quite different, that is, average vectors then inner product VS inner product then average. Somehow I feel the latter makes more sense, but why does most of the literature I found suggest the former way?
Average vectors then inner product VS inner product then average: which makes more sense?
And what does a word2vec vector's norm mean, anyway?
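A minimal sketch of the two measures being compared (variable names are illustrative, not from any library):

```python
import numpy as np

def avg_then_dot(word_vecs, query_vec):
    # Average the document's word vectors, then take one inner product.
    return np.dot(np.mean(word_vecs, axis=0), query_vec)

def dot_then_avg(word_vecs, query_vec):
    # Inner product of each word vector with the query, then average the scores.
    return np.mean([np.dot(w, query_vec) for w in word_vecs])
```

Note that for raw (unnormalized) vectors these two are algebraically identical, because the dot product is linear; they only diverge once vectors are normalized somewhere, e.g. per-word cosine similarity versus cosine against the raw average.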
Evgeny Denisov
@eIGato
@menshikh-iv, thanks. It was not only alpha; the window was too small as well. When it is less than 3, nothing works properly.
@rambo-yuanbo, norm means "vector with length of 1".
rambo-yuanbo
@rambo-yuanbo
@eIGato surely I understand that the norm means a vector's length. But I was confused about whether I should average the vectors first, then normalize the average and calculate the inner product, OR normalize each vector, calculate the inner products, and then take the average (or, equivalently, average and then take the inner product)?
The results of the two ways are quite different.
The difference is whether to normalize the average vector, OR to normalize each vector and then take the average.
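A sketch of the two normalization orders in question (assuming plain numpy arrays):

```python
import numpy as np

def unit(v):
    # Scale a vector to length 1.
    return v / np.linalg.norm(v)

def normalize_the_average(word_vecs, query_vec):
    # Average the raw vectors, normalize the result, then inner product.
    return np.dot(unit(np.mean(word_vecs, axis=0)), unit(query_vec))

def average_the_normalized(word_vecs, query_vec):
    # Normalize each vector first, then average; equivalent to averaging
    # the per-word cosine similarities against the query.
    avg = np.mean([unit(w) for w in word_vecs], axis=0)
    return np.dot(avg, unit(query_vec))
```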
Evgeny Denisov
@eIGato
@rambo-yuanbo do you think that longer vectors are more significant?
rambo-yuanbo
@rambo-yuanbo
Exactly. If some word's vector is long, it will have more impact on the average. If a word2vec vector's length does carry some significant meaning, it's more justified to average the raw vectors. But some of my documents' average word vectors give unexpectedly high similarities to some obviously irrelevant query words.
That is why I am wondering whether averaging raw word2vec vectors makes more sense than averaging each word's similarity.
Evgeny Denisov
@eIGato
You may, for example, try to monkey-patch the target method and see whether there are any improvements.
phalexo
@phalexo
@eIGato use window size 13-15
@rambo-yuanbo What are you trying to do anyway? Forget the mechanics of it, what is the general goal?
phalexo
@phalexo
@eIGato "norm" actually means simply vector' length. When people talk about "normalizing" a vector they usually mean A/A.dot(A), which would be an unit vector with A's direction, i.e. a vector divided by its length.
Evgeny Denisov
@eIGato
@phalexo you are right. I just thought about syn0norm when I answered. And by the way, A.dot(A) equals the length squared, not just the length.
phalexo
@phalexo
@eIGato Yep, I forgot the sqrt.
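For the record, the corrected formula as a one-liner:

```python
import numpy as np

A = np.array([3.0, 4.0])
unit = A / np.sqrt(A.dot(A))  # equivalently: A / np.linalg.norm(A)
```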
pramodith
@pramodith
Hi everyone, my name is Pramodith and I'm a graduate student at the Georgia Institute of Technology. I'm interested in contributing to the gensim library as a part of GSoC 2018. I would really like to work on neural networks, and to evaluate and implement a published paper. Am I too late to the party? And can anyone give me more guidance on how to move forward?
Yu-Sheng Su
@CoolSheng
@menshikh-iv I'm interested in the GSoC project: neural networks (similarity learning, state-of-the-art language models). I would like to know more in advance. The main purpose is to get better performance than the current gensim results, right?
Ivan Menshikh
@menshikh-iv
@CoolSheng that too, but gensim is more about "create an embedding for later similarity indexing". Here we don't worry about the embedding itself; the target is the similarity.
singhsanjee
@singhsanjee
I have installed gensim and all other supporting libraries in an environment on macOS. It is in pip list, it is updated and upgraded, and everything is OK, but I always get an import error in a Jupyter notebook.
Any solution?
Evgeny Denisov
@eIGato
@singhsanjee does jupyter run in the same env?
phalexo
@phalexo
Is there a way to ensure that initial random weights are always initialized the same way? With numpy, for example, I can set a seed, and the random number generator then always produces the same data.
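For gensim specifically, Word2Vec accepts a `seed` parameter; note that fully reproducible runs typically also require a single worker thread (and a fixed `PYTHONHASHSEED`), since multithreaded scheduling is nondeterministic. A minimal sketch:

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["repeatable", "training"]]

# seed fixes the RNG used for weight initialization; workers=1 removes
# thread-scheduling nondeterminism during training.
model = Word2Vec(sentences, min_count=1, seed=42, workers=1)
```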
Evgeny Denisov
@eIGato
phalexo
@phalexo
@eIGato Ok, that's interesting. Thanks for the reference.
Previously I was asking about fixing certain vectors in place, and found out that there was a mechanism to do that. I am thinking it should be possible to have a model converge to more or less the same state, if I were to keep certain words/documents' vectors fixed while mutating all others.
singhsanjee
@singhsanjee
@eIGato yes jupyter and all other libraries run perfectly
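One generic way to diagnose this kind of mismatch is to check which interpreter the notebook kernel is actually using (a general sketch, not specific to this setup):

```python
import sys

# Compare this path with `which python` in the environment where gensim
# was installed; if they differ, the kernel lives in another env.
print(sys.executable)
```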
Rizky Luthfianto
@rilut
Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB, should I write the table to a file first and then feed it to gensim with LineSentence, or how?
Thanks
phalexo
@phalexo
@rilut If you have a python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. Might want to write a generator function for that purpose.
phalexo
@phalexo
If you don't have a python API to Cassandra, you could use a different language and dump the results into a pipe, from where python can read it and feed data into Gensim models.
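A sketch of the approach phalexo describes, assuming the cassandra-driver package and a hypothetical `sentences` table with a `text` column. One caveat: gensim iterates over the corpus once per epoch, so a restartable iterable class is safer than a one-shot generator:

```python
from cassandra.cluster import Cluster  # assumes the cassandra-driver package
from gensim.models import Word2Vec

class CassandraCorpus(object):
    """Restartable iterable of token lists (table and column names are hypothetical)."""

    def __init__(self, keyspace, query="SELECT text FROM sentences"):
        self.keyspace = keyspace
        self.query = query

    def __iter__(self):
        session = Cluster().connect(self.keyspace)
        try:
            for row in session.execute(self.query):
                yield row.text.lower().split()
        finally:
            session.shutdown()

model = Word2Vec(CassandraCorpus("my_keyspace"), workers=4)
```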
Kataev Egor
@Lavabar
Hello, I have a question about word2vec. As for skip-grams, do you make them on just one sentence (separately from other sentences), or do skip-grams use parts of other sentences too?
phalexo
@phalexo
@Lavabar I think the usual thing to do would be to strip all the punctuation and use an entire text as a single blob. The effective window size is around 15-17 words, and most sentences are probably shorter, unless you're playing with German. :-)
Kataev Egor
@Lavabar
thank you)
phalexo
@phalexo
That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
Ramin
@transfluxus
@Lavabar @phalexo I think the document is often chunked into lines, so that you have one sentence per line
phalexo
@phalexo
Sure, but one would lose some information about relationships between adjacent thoughts/ideas. Even a hybrid approach could be used, feeding in self-contained documents and individual sentences.
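A small sketch of the two feeding strategies discussed above (text and parameters are illustrative):

```python
import re
from gensim.models import Word2Vec

text = "First thought here. A second thought follows. Then a third one."

def tokenize(s):
    return re.findall(r"[a-z']+", s.lower())

# Strategy 1: the whole text as a single blob; context windows
# can span sentence boundaries.
blob_corpus = [tokenize(text)]

# Strategy 2: one sentence per training example; windows stop
# at sentence boundaries.
sentence_corpus = [tokenize(s) for s in re.split(r"[.!?]", text) if s.strip()]

model_blob = Word2Vec(blob_corpus, window=15, min_count=1)
model_sent = Word2Vec(sentence_corpus, window=5, min_count=1)
```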
Ramin
@transfluxus
Hi, I just found that of all the en_core_web models, only the small one detects stopwords?
Evgeny Denisov
@eIGato
Hi guys. I've thought up a hack for Doc2Vec inference, but I don't know if it makes any sense.
The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document.
The hack is that after the bulk training I just re-infer all vectors and replace all document vectors with the inferred ones.
Does that make sense?
phalexo
@phalexo
And if you do it a second time, all the inferred vectors are going to be different again, because you changed the model.
Do it a few hundred times, maybe it will converge to something stable. :-)
Evgeny Denisov
@eIGato
@phalexo you are wrong about it.
```python
result = [np.dot(old_infer, new_infer)
          for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]
hist = plt.hist(result, bins=sim_borders)
plt.show(block=False)
```
I've calculated the similarity between the first and second inference: it's about 1.0 for all the vectors.
phalexo
@phalexo
What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
Evgeny Denisov
@eIGato
@phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha (and the model's d2v.min_alpha = 0).
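Putting the hack together, a hedged sketch against the gensim 3.x-era API (attribute names like `docvecs.doctag_syn0` changed in later releases; `tagged_docs` is the assumed training corpus of TaggedDocument objects):

```python
steps = d2v.iter ** 2  # the parameters described above

# Re-infer every training document and overwrite its bulk-trained vector.
# Any cached normalized copy (doctag_syn0norm) would need recomputing afterwards.
for i, doc in enumerate(tagged_docs):
    d2v.docvecs.doctag_syn0[i] = d2v.infer_vector(
        doc.words, steps=steps, alpha=d2v.alpha, min_alpha=d2v.alpha)
```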