Evgeny Denisov
@eIGato
@rambo-yuanbo do you think that longer vectors are more significant?
rambo-yuanbo
@rambo-yuanbo
Exactly. If some word's vector is long, it has more impact on the average. If the word2vec vector's length does carry some significant meaning, then averaging the raw vectors is more justified. But the average word vector of some of my documents gives unexpectedly high similarities to some obviously irrelevant query words.
That is why I am wondering whether averaging raw word2vec vectors makes more sense than averaging each word's similarity.
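The two strategies under discussion can be contrasted in a minimal sketch. The toy vectors and the names `vecs`, `doc`, and `query` below are made up purely for illustration; the point is how a long vector skews one strategy but not the other:

```python
import numpy as np

# Toy word vectors (hypothetical values, just for illustration).
vecs = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.8, 0.6]),
    "car": np.array([0.0, 3.0]),  # note the much larger norm
}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = ["cat", "dog", "car"]
query = vecs["car"]

# Strategy 1: average the raw vectors, then compute one similarity.
avg_vec = np.mean([vecs[w] for w in doc], axis=0)
sim_of_avg = cos(avg_vec, query)

# Strategy 2: compute each word's similarity, then average those.
avg_of_sims = np.mean([cos(vecs[w], query) for w in doc])

print(sim_of_avg, avg_of_sims)  # ~0.894 vs ~0.533
```

In strategy 1 the long "car" vector dominates the averaged vector, pulling the similarity up; in strategy 2 every word contributes equally regardless of its norm, which is exactly the trade-off being debated here.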
Evgeny Denisov
@eIGato
You may, for example, try to monkey-patch the target method and see whether there are any improvements.
phalexo
@phalexo
@eIGato use window size 13-15
@rambo-yuanbo What are you trying to do anyway? Forget the mechanics of it, what is the general goal?
phalexo
@phalexo
@eIGato "norm" actually just means the vector's length. When people talk about "normalizing" a vector they usually mean A/A.dot(A), which would be a unit vector with A's direction, i.e. a vector divided by its length.
Evgeny Denisov
@eIGato
@phalexo you are right. I was just thinking of syn0norm when I answered. And by the way, A.dot(A) equals the length squared, not just the length.
phalexo
@phalexo
@eIGato Yep, I forgot the sqrt.
pramodith
@pramodith
Hi everyone, my name is Pramodith and I'm a graduate student at the Georgia Institute of Technology. I'm interested in contributing to the gensim library as part of GSoC 2018. I would really like to work on neural networks and evaluate and implement a published paper. Am I too late to the party? And can anyone give me more guidance on how to move forward?
Yu-Sheng Su
@CoolSheng
@menshikh-iv I'm interested in the GSoC project on neural networks (similarity learning, state-of-the-art language models). I would like to know more in advance. The main purpose is to get better performance than the current gensim results, right?
Ivan Menshikh
@menshikh-iv
@CoolSheng That too. gensim is more about "create an embedding for later similarity indexing"; here we don't worry about the embedding, the target is the similarity itself.
singhsanjee
@singhsanjee
I have installed gensim and all other supporting libraries in an environment on macOS. It shows up in pip list, it is updated and upgraded and everything looks OK, but I always get an import error in a Jupyter notebook.
Any solutions?
Evgeny Denisov
@eIGato
@singhsanjee does jupyter run in the same env?
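A quick way to check, using nothing beyond the stdlib — run this both in the notebook and in the shell where pip works, and compare the paths; if they differ, the kernel is running in a different environment:

```python
import sys

print(sys.executable)  # the interpreter this kernel/shell is running
print(sys.prefix)      # root of the active environment
```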
phalexo
@phalexo
Is there a way to ensure that the initial random weights are always initialized the same way? With numpy, for example, I can set a seed, and the random number generator then always produces the same data.
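For what it's worth, gensim's Word2Vec constructor does accept a `seed` parameter, and its documentation notes that full run-to-run determinism also requires a single worker thread and a fixed PYTHONHASHSEED. The seeding idea itself can be sketched with plain numpy; `init_weights` below is a made-up name mimicking the small-random-uniform scheme, not gensim's actual code:

```python
import numpy as np

def init_weights(vocab_size, vector_size, seed=1):
    """Seeded small-random initialization: the same seed always
    produces the same starting weight matrix."""
    rng = np.random.RandomState(seed)
    return (rng.rand(vocab_size, vector_size) - 0.5) / vector_size

a = init_weights(100, 50, seed=42)
b = init_weights(100, 50, seed=42)
print(np.allclose(a, b))  # True: identical initial weights
```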
Evgeny Denisov
@eIGato
phalexo
@phalexo
@eIGato Ok, that's interesting. Thanks for the reference.
Previously I asked about fixing certain vectors in place, and found out that there is a mechanism for that. I am thinking it should be possible to have a model converge to more or less the same state if I keep certain words'/documents' vectors fixed while mutating all the others.
singhsanjee
@singhsanjee
@eIGato yes, Jupyter and all the other libraries run perfectly
Rizky Luthfianto
@rilut
Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB, should I write the table to a file first and then feed it to gensim with LineIterator, or what?
Thanks
phalexo
@phalexo
@rilut If you have a python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. Might want to write a generator function for that purpose.
phalexo
@phalexo
If you don't have a python API to Cassandra, you could use a different language and dump the results into a pipe, from where python can read it and feed data into Gensim models.
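One caveat with the generator approach: Word2Vec/Doc2Vec iterate over the corpus several times (a vocabulary pass plus every training epoch), so a plain one-shot generator is not enough — the model needs a re-iterable object. A sketch, where `fetch_rows` stands in for whatever callable re-runs your database query:

```python
class DbSentences:
    """Restartable corpus: each call to __iter__ starts a fresh read,
    so the model can make its vocab pass plus every training epoch."""

    def __init__(self, fetch_rows):
        self.fetch_rows = fetch_rows  # callable returning a fresh row iterator

    def __iter__(self):
        for row in self.fetch_rows():
            yield row.split()  # one tokenized sentence per row

corpus = DbSentences(lambda: iter(["the cat sat", "the dog ran"]))
print(list(corpus))
print(list(corpus))  # the second pass works, unlike a plain generator
```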
Kataev Egor
@Lavabar
Hello, I have a question about word2vec. Are skip-grams built from just one sentence (separately from the other sentences), or do skip-grams use parts of other sentences too?
phalexo
@phalexo
@Lavabar I think the usual thing to do would be to strip all the punctuation and use an entire text as a single blob. The effective window size is around 15-17 words, and most sentences are probably shorter, unless you're playing with German. :-)
Kataev Egor
@Lavabar
thank you)
phalexo
@phalexo
That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
Ramin
@transfluxus
@Lavabar @phalexo I think the document is often chunked into lines, with one sentence per line
phalexo
@phalexo
Sure, but one would lose some information about the relationships between adjacent thoughts/ideas. A hybrid approach could even be used: feeding in both self-contained documents and individual sentences.
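The hybrid idea can be sketched in a few lines — yield each document once as a whole blob and once per sentence. This is only an illustration of the feeding strategy (the `.`-based sentence split is deliberately naive):

```python
def hybrid_corpus(documents):
    """Yield each document as one blob, then each of its sentences alone,
    so the model sees both cross-sentence and within-sentence contexts."""
    for doc in documents:
        sentences = [s.split() for s in doc.split(".") if s.strip()]
        yield [w for sent in sentences for w in sent]  # whole doc as one blob
        for sent in sentences:
            yield sent                                 # each sentence alone

docs = ["the cat sat. the dog ran"]
print(list(hybrid_corpus(docs)))
```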
Ramin
@transfluxus
Hi, I just found that of all the en_core_web models, only the small one detects stopwords?
Evgeny Denisov
@eIGato
Hi guys. I've thought up a hack for Doc2Vec inference, but I don't know if it makes any sense.
The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document.
The hack is that after the bulk training I just re-infer all vectors and replace all the document vectors with the inferred ones.
Does that make sense?
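The hack described above amounts to a short loop. The sketch below assumes a Doc2Vec-like object exposing `infer_vector()` and an indexable `docvecs` array (as older gensim versions did); a stub model stands in so the sketch runs without gensim, and the corpus must be re-iterable, not a one-shot generator:

```python
import numpy as np

def reinfer_all(model, corpus, **infer_kwargs):
    """Re-infer every training document and overwrite the bulk-trained
    vector in place. `corpus` must be a re-iterable of token lists."""
    for i, words in enumerate(corpus):
        model.docvecs[i] = model.infer_vector(words, **infer_kwargs)

class StubModel:
    """Stand-in for a Doc2Vec model, so the sketch runs without gensim."""
    def __init__(self, n_docs, dim):
        self.docvecs = np.zeros((n_docs, dim))
    def infer_vector(self, words, **kwargs):
        return np.full(2, float(len(words)))  # dummy "inference"

m = StubModel(2, 2)
reinfer_all(m, [["a", "b"], ["a", "b", "c"]])
print(m.docvecs)
```

With a real model, `infer_kwargs` would carry the inference parameters discussed below (alpha, min_alpha, number of steps).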
phalexo
@phalexo
And if you do it a second time, all the inferred vectors are going to be different again, because you changed the model.
Do it a few hundred times, maybe it will converge to something stable. :-)
Evgeny Denisov
@eIGato
@phalexo you are wrong about it.
```python
result = [np.dot(old_infer, new_infer)
          for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]
hist = plt.hist(result, bins=sim_borders)
plt.show(block=False)
```
I've calculated the similarity between the first and second inference: it's about 1.0 for all the vectors.
phalexo
@phalexo
What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
Evgeny Denisov
@eIGato
@phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha; d2v.min_alpha = 0
phalexo
@phalexo
What are the actual values? I don't know what you have in those variables.
Evgeny Denisov
@eIGato
```python
print(d2v.alpha, d2v.iter ** 2)
# 0.00221920956360714 25
```
phalexo
@phalexo
To get a reasonable inferred vector, steps should be around 500-1000.
Evgeny Denisov
@eIGato
@phalexo Tried alpha = min_alpha = 0.19, steps = 4500. Got the same result. That's weird.
phalexo
@phalexo
Try alpha = 0.01, min_alpha = 0.0001
with steps = 1000
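The point of pairing a higher alpha with a lower min_alpha is that the learning rate decays over the inference steps. A linear decay (which is one common schedule, assumed here for illustration) over the suggested values looks like this:

```python
def alpha_schedule(alpha, min_alpha, steps):
    """Linearly decaying learning rate across `steps` inference passes."""
    return [alpha - (alpha - min_alpha) * i / steps for i in range(steps)]

rates = alpha_schedule(0.01, 0.0001, 1000)
print(rates[0], rates[-1])  # starts near 0.01, ends near 0.0001
```

Setting alpha = min_alpha instead (as tried above) makes every step use the same rate, which removes the fine-tuning effect of the later, smaller steps.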
Evgeny Denisov
@eIGato
@phalexo Same result((
phalexo
@phalexo
Is this before you tweaked the vectors?
Evgeny Denisov
@eIGato
No. I do discard the bulk-trained vectors.
This is the similarity between the first inference (steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha) and the last result from the interactive shell.
Evgeny Denisov
@eIGato
F@#$%! All that time I was inferring documents from a generator. A one-time generator. They were never re-inferred at all. Darn.
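That failure mode is easy to reproduce: a Python generator is exhausted after one pass, so every pass after the first silently sees nothing — which is exactly why the similarities never changed:

```python
def gen():
    yield ["some", "words"]

g = gen()
print(len(list(g)))  # 1: the first pass consumes the generator
print(len(list(g)))  # 0: every later pass silently gets nothing
```

Wrapping the source in a re-iterable object (or materializing it as a list) avoids this.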