    Ivan Menshikh
    @menshikh-iv
no, the nearest variant is Sent2Vec trained on token n-grams
    Matan Shenhav
    @ixxie
    I see
    so that is the main difference
    @menshikh-iv has sent2vec been merged yet?
    Ivan Menshikh
    @menshikh-iv
no, it is almost done but not ready yet (you can already use it, though)
    Matan Shenhav
    @ixxie
    cool
    I might try it out
    Matan Shenhav
    @ixxie
    @menshikh-iv - what do you recommend for doing record linkage? As our talk over previous days suggests, we have been trying to cluster vectors produced by doc2vec / FastText / LSI for this
    rambo-yuanbo
    @rambo-yuanbo
Hello guys, I read word2vec.py, and it seems that it simply iterates over my corpus several times with a decreasing learning rate. Does the order of samples in the corpus matter here? I didn't realize this until now: I'm trying to mix texts from 2 different sources and give them some weighting by randomly iterating over the 1st or 2nd text source.
    Matan Shenhav
    @ixxie
@rambo-yuanbo I believe the order makes a difference to each training pass individually, but not to the whole training run, because the list is shuffled between training epochs; I am not 100% sure, though.
    phalexo
    @phalexo
@rambo-yuanbo The more you train over source 1, the better the results will be for those documents and the worse for source 2. You have to mix them thoroughly, especially if the vocabularies are different.
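A minimal sketch of the "mix them thoroughly" advice above: load both sources, optionally oversample one of them, shuffle, then train. The file names, the oversampling factor of 2, and the whitespace tokenization are hypothetical placeholders, not anything from the thread.

```python
import random
from gensim.models import Word2Vec

def load_tokenized(path):
    """Read one file and return a list of whitespace-tokenized sentences."""
    with open(path, encoding="utf-8") as fh:
        return [line.split() for line in fh if line.strip()]

source1 = load_tokenized("source1.txt")
source2 = load_tokenized("source2.txt")

# Oversample source1 (here by a factor of 2) and shuffle, so the two sources
# are interleaved throughout the stream instead of being seen back to back.
mixed = source1 * 2 + source2
random.Random(42).shuffle(mixed)

model = Word2Vec(mixed, workers=4)
```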
    Evgeny Denisov
    @eIGato
    Hello everyone
I use a Doc2Vec model, and after training almost all the vectors in the docvecs section have the same (±1%) values. What's the matter and how do I fix that?
    phalexo
    @phalexo
    It is impossible to even make a conjecture based on the information you've provided.
    Evgeny Denisov
    @eIGato
@phalexo, thanks for the answer. After some experiments I found that my alpha value was too high. What is the optimal value for a corpus of 10k-100k documents?
    phalexo
    @phalexo
    I'd go with 0.01
    Evgeny Denisov
    @eIGato
And how large should negative be? I use the default of 5.
    Ivan Menshikh
    @menshikh-iv
@eIGato that is an OK value; are you sure that your corpus doesn't contain only very similar documents?
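For reference, a sketch of where the parameters discussed above are set; the tiny corpus is a hypothetical placeholder, and the values (alpha 0.01, negative 5) simply mirror the suggestions in this exchange rather than universally optimal settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; each document needs a unique tag.
documents = [
    TaggedDocument(words=["some", "tokens"], tags=["doc_0"]),
    TaggedDocument(words=["other", "tokens"], tags=["doc_1"]),
]

model = Doc2Vec(
    documents,
    alpha=0.01,       # lower starting learning rate, as suggested above
    min_alpha=0.001,  # value the learning rate decays towards
    negative=5,       # default number of negative samples
)
```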
    rambo-yuanbo
    @rambo-yuanbo
    thanks a lot! @phalexo @ixxie
    rambo-yuanbo
    @rambo-yuanbo
hi guys. As suggested by many, I simply average the word2vec vectors of the words in a document and calculate the inner product between the average vector and some query word's vector as a similarity/relevance measure. As some results look weird (the similarity is obviously too high), I also tried averaging the similarity between each of the document's words and the query word. The two measures are quite different, that is, averaging the vectors and then taking the inner product vs. taking the inner products and then averaging. Somehow I feel the latter makes more sense, but why does most of the literature I've found suggest the former?
    averaging vectors then inner product vs. inner product then averaging: which makes more sense?
    what does a word2vec vector's norm mean, anyway?
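To make the two measures being compared concrete, here is a small numpy sketch; `model` is assumed to be a trained Word2Vec model, and `doc_tokens` / `query` are placeholders. The second measure is equivalent to averaging unit-normalized word vectors before the dot product, which is exactly the difference discussed below.

```python
import numpy as np

def unit(v):
    """Return the unit vector in the direction of v."""
    return v / np.linalg.norm(v)

def avg_then_dot(model, doc_tokens, query):
    """Average the raw word vectors, then take cosine with the query vector."""
    vecs = [model.wv[w] for w in doc_tokens if w in model.wv]
    return float(np.dot(unit(np.mean(vecs, axis=0)), unit(model.wv[query])))

def dot_then_avg(model, doc_tokens, query):
    """Take cosine of each word with the query, then average the similarities."""
    q = unit(model.wv[query])
    sims = [np.dot(unit(model.wv[w]), q) for w in doc_tokens if w in model.wv]
    return float(np.mean(sims))
```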
    Evgeny Denisov
    @eIGato
@menshikh-iv, thanks. It was not only alpha; the window was too small as well. When it is less than 3, nothing works properly.
    @rambo-yuanbo, norm means "vector with length of 1".
    rambo-yuanbo
    @rambo-yuanbo
@eIGato surely I understand that the norm means a vector's length. But I was confused about whether I should average the vectors first, then normalize the average and calculate the inner product, OR normalize each vector, calculate the inner products, and then take the average (or equivalently, average the normalized vectors and then take the inner product)?
    the result will be quite different between the two ways
    the difference is to normalize the average vector, OR to normalize each vector and then take the average
    Evgeny Denisov
    @eIGato
    @rambo-yuanbo do you think that longer vectors are more significant?
    rambo-yuanbo
    @rambo-yuanbo
Exactly. If some word's vector length is high, it will have more impact on the average. If the word2vec vector's length does have some significant meaning, it's more justified to average the raw vectors. But some of my documents' average word vectors do give unexpectedly high similarities to some obviously irrelevant query words.
    That is why I am wondering whether averaging raw word2vec vectors makes more sense than averaging each word's similarity.
    Evgeny Denisov
    @eIGato
You may, for example, try to monkey-patch the target method and see whether there are any improvements.
    phalexo
    @phalexo
    @eIGato use window size 13-15
    @rambo-yuanbo What are you trying to do anyway? Forget the mechanics of it, what is the general goal?
    phalexo
    @phalexo
    @eIGato "norm" actually means simply vector' length. When people talk about "normalizing" a vector they usually mean A/A.dot(A), which would be an unit vector with A's direction, i.e. a vector divided by its length.
    Evgeny Denisov
    @eIGato
@phalexo you are right. I just thought about syn0norm when I answered. And by the way, A.dot(A) equals the length squared, not just the length.
    phalexo
    @phalexo
    @eIGato Yep, I forgot the sqrt.
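Putting the corrected formula in one place, as a tiny numpy illustration: the norm is sqrt(A.dot(A)), and the "normalized" (unit) vector is A divided by that norm.

```python
import numpy as np

A = np.array([3.0, 4.0])
norm = np.sqrt(A.dot(A))   # == np.linalg.norm(A) == 5.0
A_unit = A / norm          # unit vector with A's direction, length 1.0
```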
    Ivan Menshikh
    @menshikh-iv
    1
    pramodith
    @pramodith
Hi everyone, my name is Pramodith and I'm a graduate student at the Georgia Institute of Technology. I'm interested in contributing to the gensim library as part of GSoC 2018. I would really like to work on neural networks and evaluate and implement a published paper. Am I too late to the party? And can anyone give me more guidance on how to move forward?
    Yu-Sheng Su
    @CoolSheng
@menshikh-iv I'm interested in the GSoC project: neural networks (similarity learning, state-of-the-art language models). I would like to know more in advance. The main purpose is to get better performance than the current gensim results, right?
    Ivan Menshikh
    @menshikh-iv
@CoolSheng that too, but gensim is more about "creating embeddings for later similarity indexing"; here we don't worry about the embedding, the target is similarity.
    singhsanjee
    @singhsanjee
I have installed gensim and all other supporting libraries in an environment on macOS. It is in pip list, it is updated and upgraded and everything is OK, but I always get an import error in a Jupyter notebook.
    Any solution?
    Evgeny Denisov
    @eIGato
    @singhsanjee does jupyter run in the same env?
    phalexo
    @phalexo
Is there a way to ensure that the initial random weights are always initialized the same way? With numpy, for example, I can set a seed, and the random number generator then always produces the same data.
    phalexo
    @phalexo
    @eIGato Ok, that's interesting. Thanks for the reference.
Previously I was asking about fixing certain vectors in place, and found out that there is a mechanism to do that. I am thinking it should be possible to have a model converge to more or less the same state if I were to keep certain words'/documents' vectors fixed while mutating all others.
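On the seeding question above, a minimal sketch of fixing the random initialization in gensim; the placeholder corpus is hypothetical. Note that gensim's documentation also points out that a fully reproducible run additionally needs a single worker thread and a fixed PYTHONHASHSEED, since hashing affects the initialization.

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["hello", "gensim"]]  # placeholder corpus

model = Word2Vec(
    sentences,
    seed=42,    # fixes the random initialization of the weight vectors
    workers=1,  # multi-threaded scheduling jitter would break determinism
)
```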
    singhsanjee
    @singhsanjee
@eIGato yes, Jupyter and all other libraries run perfectly
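A common first check for this kind of import error is to confirm, from inside the notebook, which interpreter the kernel is actually using; if it is not the environment where gensim was installed, registering that environment as a kernel usually fixes it. The environment name below is a hypothetical placeholder.

```python
# Run this inside the notebook: if the path does not point at the environment
# where gensim was pip-installed, the kernel is using a different interpreter.
import sys
print(sys.executable)

# If it differs, one common fix is to register the environment as a kernel
# (run in a terminal with the environment activated):
#   python -m ipykernel install --user --name myenv
```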
    Rizky Luthfianto
    @rilut
Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB, should I write the table to a file first and then feed it to Gensim with LineIterator, or what?
    Thanks
    phalexo
    @phalexo
@rilut If you have a Python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. You might want to write a generator function for that purpose.
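One caveat on the suggestion above: Word2Vec iterates over the corpus more than once (a vocabulary scan plus each training epoch), so a plain generator would be exhausted after the first pass. A restartable iterable class avoids that. In this sketch, `fetch_rows` and the `sentence` column name are hypothetical stand-ins for your actual Cassandra driver calls.

```python
from gensim.models import Word2Vec

class DbSentences:
    def __init__(self, fetch_rows):
        self.fetch_rows = fetch_rows   # callable returning a fresh row iterator

    def __iter__(self):
        # Re-queries the table on every pass, yielding tokenized sentences.
        for row in self.fetch_rows():
            yield row["sentence"].split()

# model = Word2Vec(DbSentences(my_fetch_rows_function))
```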