    Evgeny Denisov
    @eIGato
    @rambo-yuanbo, norm means "vector with length of 1".
    rambo-yuanbo
    @rambo-yuanbo
    @eIGato Surely I understand that norm means a vector's length. But I was confused about whether I average the vectors first, then normalize the average and compute the inner product, OR normalize each vector, compute the inner product for each, and then take the average (or equivalently, average the unit vectors and then take the inner product)?
    The result will be quite different between the two ways.
    The difference is whether to normalize the average vector, or to normalize each vector and then take the average.
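    A minimal numpy sketch of the two orderings being discussed (the toy vectors and query are illustrative, not from any real model):

    ```python
    import numpy as np

    # Two toy word vectors with very different lengths.
    vecs = np.array([[3.0, 0.0], [0.0, 0.3]])
    query = np.array([1.0, 0.0])

    def unit(v):
        return v / np.linalg.norm(v)

    # Order 1: average the raw vectors, then normalize the average.
    sim_avg_then_norm = unit(vecs.mean(axis=0)).dot(unit(query))

    # Order 2: normalize each vector first, then average the similarities.
    sim_norm_then_avg = np.mean([unit(v).dot(unit(query)) for v in vecs])

    print(sim_avg_then_norm)  # close to 1.0: the long vector dominates
    print(sim_norm_then_avg)  # 0.5: each word contributes equally
    ```

    In order 1 the long vector dominates the average, so the result is pulled toward its direction; in order 2 each word contributes equally regardless of length.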
    Evgeny Denisov
    @eIGato
    @rambo-yuanbo do you think that longer vectors are more significant?
    rambo-yuanbo
    @rambo-yuanbo
    Exactly. If some word's vector is long, it will have more impact on the average. If the word2vec vector's length does carry some significant meaning, it's more justified to average the raw vectors. But some of my documents' averaged word vectors give unexpectedly high similarities to some obviously irrelevant query words.
    That is why I am wondering whether averaging raw word2vec vectors makes more sense than averaging each word's similarity.
    Evgeny Denisov
    @eIGato
    You may, for example, try to monkey-patch the target method and look whether there are any improvements.
    phalexo
    @phalexo
    @eIGato use window size 13-15
    @rambo-yuanbo What are you trying to do anyway? Forget the mechanics of it, what is the general goal?
    phalexo
    @phalexo
    @eIGato "norm" actually means simply vector' length. When people talk about "normalizing" a vector they usually mean A/A.dot(A), which would be an unit vector with A's direction, i.e. a vector divided by its length.
    Evgeny Denisov
    @eIGato
    @phalexo you are right. I just thought about syn0norm when I answered. And by the way, A.dot(A) equals the length squared, not just the length.
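    Putting the correction together, the unit vector is A divided by the square root of A.dot(A); a quick numpy check with an arbitrary example vector:

    ```python
    import numpy as np

    A = np.array([3.0, 4.0])
    length = np.sqrt(A.dot(A))    # same as np.linalg.norm(A)
    unit_A = A / length

    print(length)                 # 5.0 for this 3-4-5 example
    print(unit_A.dot(unit_A))     # the unit vector has length 1
    ```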
    phalexo
    @phalexo
    @eIGato Yep, I forgot the sqrt.
    pramodith
    @pramodith
    Hi everyone, my name is Pramodith and I'm a graduate student at the Georgia Institute of Technology. I'm interested in contributing to the gensim library as part of GSoC 2018. I would really like to work on neural networks and evaluate and implement a published paper. Am I too late to the party? And can anyone give me more guidance on how to move forward?
    Yu-Sheng Su
    @CoolSheng
    @menshikh-iv I'm interested in the GSoC project: Neural networks (similarity learning, state-of-the-art language models). I would like to know more in advance. The main purpose is to get better performance than the current gensim result, right?
    Ivan Menshikh
    @menshikh-iv
    @CoolSheng That too. gensim is more about "create embeddings for later similarity indexing"; here we don't worry about the embedding, the target is the similarity itself.
    singhsanjee
    @singhsanjee
    I have installed gensim and all other supporting libraries in an environment on macOS. It shows up in pip list, it is updated and upgraded, and everything looks OK, but I always get an ImportError in Jupyter notebook.
    Any solution?
    Evgeny Denisov
    @eIGato
    @singhsanjee does jupyter run in the same env?
    phalexo
    @phalexo
    Is there a way to ensure that initial random weights are always initialized the same way? With numpy, for example, I can set a seed, and the random number generator then always produces the same data.
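    For what it's worth, gensim's Word2Vec constructor does accept a seed parameter; as far as I know you also need workers=1 (and a fixed PYTHONHASHSEED across processes) to get fully reproducible runs, since multithreaded training is nondeterministic. A minimal sketch with a toy corpus:

    ```python
    from gensim.models import Word2Vec

    sentences = [["hello", "world"], ["hello", "gensim"], ["word", "vectors"]]

    # seed fixes the initial random weights; workers=1 removes the
    # nondeterminism of multithreaded training.
    m1 = Word2Vec(sentences, min_count=1, seed=42, workers=1)
    m2 = Word2Vec(sentences, min_count=1, seed=42, workers=1)

    print((m1.wv.vectors == m2.wv.vectors).all())  # identical weights
    ```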
    phalexo
    @phalexo
    @eIGato Ok, that's interesting. Thanks for the reference.
    Previously I was asking about fixing certain vectors in place, and found out that there was a mechanism to do that. I am thinking it should be possible to have a model converge to more or less the same state, if I were to keep certain words/documents' vectors fixed while mutating all others.
    singhsanjee
    @singhsanjee
    @eIGato yes jupyter and all other libraries run perfectly
    Rizky Luthfianto
    @rilut
    Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB, should I write the table to a file first and then feed it to gensim with LineIterator, or what?
    Thanks
    phalexo
    @phalexo
    @rilut If you have a python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. Might want to write a generator function for that purpose.
    phalexo
    @phalexo
    If you don't have a python API to Cassandra, you could use a different language and dump the results into a pipe, from where python can read it and feed data into Gensim models.
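    One detail worth noting for the generator approach: gensim iterates over the corpus several times (the vocab scan plus each training epoch), so a one-shot generator gets exhausted after the first pass. A restartable wrapper sketch, where fetch_rows is a hypothetical callable standing in for your real database query:

    ```python
    class DBSentences:
        """Restartable corpus wrapper: __iter__ starts a fresh query on
        every call, so gensim can make multiple passes over the data."""

        def __init__(self, fetch_rows):
            # fetch_rows is a hypothetical callable returning rows of text,
            # e.g. lambda: session.execute("SELECT text FROM sentences")
            self.fetch_rows = fetch_rows

        def __iter__(self):
            for text in self.fetch_rows():
                yield text.lower().split()

    # Stand-in for a real database query:
    corpus = DBSentences(lambda: ["Hello world", "Hello gensim"])
    print(list(corpus))  # usable on every pass, not just the first
    ```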
    Kataev Egor
    @Lavabar
    Hello, I have a question about word2vec. As for skipgrams, do you make it on just one sentence(separately from other sentences) or skipgrams use parts of other sentences too?
    phalexo
    @phalexo
    @Lavabar I think the usual thing to do would be to strip all the punctuation and use an entire text as a single blob. The effective window size is around 15-17 words, and most sentences are probably shorter, unless you're playing with German. :-)
    Kataev Egor
    @Lavabar
    thank you)
    phalexo
    @phalexo
    That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
    Ramin
    @transfluxus
    @Lavabar @phalexo I think often the document is chunked into lines, where you have a sentence per line
    phalexo
    @phalexo
    Sure, but one would lose some information about relationships between adjacent thoughts/ideas. Even a hybrid approach could be used, feeding in self-contained documents and individual sentences.
    Ramin
    @transfluxus
    Hi, I just found that of all the en_core_web models, only the small one detects stopwords?
    Evgeny Denisov
    @eIGato
    Hi guys. I've thought up a hack for Doc2Vec inference. But I don't know if it makes any sense.
    The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document.
    The hack is that after the bulk training I just re-infer all vectors and replace all document vectors with the inferred ones.
    Does that make sense?
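    A sketch of the hack on a tiny toy corpus (attribute names follow the current gensim 4.x API, model.dv.vectors, rather than the syn0norm-era names used elsewhere in this chat):

    ```python
    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus; a real one would be far larger.
    corpus = [TaggedDocument(words=s.split(), tags=[i])
              for i, s in enumerate(["human machine interface",
                                     "graph of trees",
                                     "user response time"])]
    d2v = Doc2Vec(corpus, vector_size=10, min_count=1, epochs=20)

    # The hack: re-infer every document and overwrite the
    # bulk-trained document vectors with the inferred ones.
    d2v.dv.vectors = np.vstack([d2v.infer_vector(doc.words)
                                for doc in corpus])
    ```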
    phalexo
    @phalexo
    And if you do it a second time, all the inferred vectors are going be different again, because you changed the model.
    Do it a few hundred times, maybe it will converge to something stable. :-)
    Evgeny Denisov
    @eIGato
    @phalexo you are wrong about it.
    In [13]: result = [np.dot(old_infer, new_infer) for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]
    In [14]: hist = plt.hist(result, bins=sim_borders)
    In [15]: plt.show(block=False)
    [image: histogram of similarities between first- and second-pass inferred vectors]
    I've calculated the similarity of the first and second inference: it's about 1.0 for all the vectors.
    phalexo
    @phalexo
    What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
    Evgeny Denisov
    @eIGato
    @phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha. d2v.min_alpha = 0
    phalexo
    @phalexo
    What are the actual values? I don't know what you have in your variables.
    Evgeny Denisov
    @eIGato
    print(d2v.alpha, d2v.iter ** 2)
    0.00221920956360714 25
    phalexo
    @phalexo
    To get a reasonable inferred vector, steps should be around 500-1000.
    Evgeny Denisov
    @eIGato
    @phalexo Tried alpha = min_alpha = 0.19, steps = 4500. Got same result. That's weird.
    phalexo
    @phalexo
    Try alpha = 0.01, min_alpha = 0.0001
    with steps = 1000
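    For reference, the suggested call on a toy model (in gensim 4.x the steps parameter of infer_vector was renamed to epochs; older releases in use at the time of this chat still called it steps):

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=s.split(), tags=[i])
              for i, s in enumerate(["human machine interface",
                                     "user response time"])]
    d2v = Doc2Vec(corpus, vector_size=10, min_count=1, epochs=10)

    # steps=1000 in older gensim; epochs=1000 in gensim 4.x.
    vec = d2v.infer_vector(["human", "interface"],
                           alpha=0.01, min_alpha=0.0001, epochs=1000)
    ```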
    Evgeny Denisov
    @eIGato
    @phalexo Same result((