    phalexo
    @phalexo
    @rambo-yuanbo What are you trying to do anyway? Forget the mechanics of it, what is the general goal?
    phalexo
    @phalexo
    @eIGato "norm" actually means simply vector' length. When people talk about "normalizing" a vector they usually mean A/A.dot(A), which would be an unit vector with A's direction, i.e. a vector divided by its length.
    Evgeny Denisov
    @eIGato
    @phalexo you are right. I just thought about syn0norm when I answered. And by the way, A.dot(A) equals the length squared, not just the length.
    phalexo
    @phalexo
    @eIGato Yep, I forgot the sqrt.
    pramodith
    @pramodith
    Hi everyone, my name is Pramodith and I'm a graduate student at the Georgia Institute of Technology. I'm interested in contributing to the gensim library as part of GSoC 2018. I would really like to work on neural networks, and to evaluate and implement a published paper. Am I too late to the party? Can anyone give me guidance on how to move forward?
    Yu-Sheng Su
    @CoolSheng
    @menshikh-iv I'm interested in the GSoC project on neural networks (similarity learning, state-of-the-art language models). I would like to know more in advance. The main purpose is to get better performance than gensim's current results, right?
    Ivan Menshikh
    @menshikh-iv
    @CoolSheng Not quite: gensim is more about "creating embeddings for later similarity indexing". Here we don't worry about the embedding itself; the target is the similarity.
    singhsanjee
    @singhsanjee
    I have installed gensim and all other supporting libraries in an environment on macOS. It shows up in pip list, it is updated and upgraded, and everything looks fine, but I always get an ImportError in the Jupyter notebook.
    Any solutions?
    Evgeny Denisov
    @eIGato
    @singhsanjee does jupyter run in the same env?
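    A quick way to check which interpreter the notebook kernel actually uses (a minimal sketch; compare the printed path against the environment where gensim was installed):

        # Run this inside the Jupyter notebook.
        import sys
        print(sys.executable)  # path of the Python binary the kernel runs on

        # Then, in a terminal with the gensim environment activated:
        #   $ which python
        # If the two paths differ, the kernel runs outside that environment,
        # which would explain the ImportError despite gensim being in `pip list`.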
    phalexo
    @phalexo
    Is there a way to ensure that the initial random weights are always initialized the same way? With numpy, for example, I can set a seed, and the random number generator then always produces the same data.
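    For reference, the Word2Vec and Doc2Vec constructors do take a seed argument; note that fully reproducible runs also require a single worker thread (and a fixed PYTHONHASHSEED on Python 3), because multithreaded training is not deterministic. A minimal sketch:

        from gensim.models import Word2Vec

        sentences = [["hello", "world"], ["hello", "gensim"]]  # toy corpus

        # seed fixes the random initialization of the weights;
        # workers=1 removes scheduling nondeterminism during training.
        model = Word2Vec(sentences, min_count=1, seed=42, workers=1)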
    phalexo
    @phalexo
    @eIGato Ok, that's interesting. Thanks for the reference.
    Previously I was asking about fixing certain vectors in place, and found out that there was a mechanism to do that. I am thinking it should be possible to have a model converge to more or less the same state, if I were to keep certain words/documents' vectors fixed while mutating all others.
    singhsanjee
    @singhsanjee
    @eIGato yes, Jupyter and all the other libraries run perfectly
    Rizky Luthfianto
    @rilut
    Hi, I'm curious about large-scale training of word vectors. If I have a lot of rows of sentences in a Cassandra DB, should I write the table to a file first and then feed it to gensim with LineIterator, or is there another way?
    Thanks
    phalexo
    @phalexo
    @rilut If you have a Python API to the database, you should be able to feed data directly into a Word2Vec or Doc2Vec constructor. You might want to write a generator function for that purpose.
    phalexo
    @phalexo
    If you don't have a Python API to Cassandra, you could use a different language and dump the results into a pipe, from which Python can read them and feed the data into gensim models.
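    A sketch of that direct-feed idea using the DataStax cassandra-driver package (keyspace, table, and column names are hypothetical). One caveat: gensim iterates over the corpus once per epoch, so it needs a restartable iterable rather than a one-shot generator:

        from cassandra.cluster import Cluster
        from gensim.models import Word2Vec

        class CassandraSentences:
            """Restartable iterable of tokenized sentences from a Cassandra table."""

            def __init__(self, keyspace, query, hosts=("127.0.0.1",)):
                self.session = Cluster(list(hosts)).connect(keyspace)
                self.query = query

            def __iter__(self):
                # The query re-runs on every pass, so multiple epochs work.
                for row in self.session.execute(self.query):
                    yield row.text.split()  # assumes a 'text' column of plain sentences

        sentences = CassandraSentences("my_keyspace", "SELECT text FROM sentences")
        model = Word2Vec(sentences)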
    Kataev Egor
    @Lavabar
    Hello, I have a question about word2vec. For skip-grams, are they built within just one sentence (separately from the other sentences), or do skip-grams use parts of other sentences too?
    phalexo
    @phalexo
    @Lavabar I think the usual thing to do would be to strip all the punctuation and use an entire text as a single blob. The effective window size is around 15-17 words, and most sentences are probably shorter, unless you're playing with German. :-)
    Kataev Egor
    @Lavabar
    thank you)
    phalexo
    @phalexo
    That said, there is no harm in trying to feed in text sentence by sentence. Maybe you'll get better results. :-)
    Ramin
    @transfluxus
    @Lavabar @phalexo I think the document is often chunked into lines, with one sentence per line
    phalexo
    @phalexo
    Sure, but one would lose some information about the relationships between adjacent thoughts/ideas. A hybrid approach could even be used, feeding in both self-contained documents and individual sentences.
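    A sketch of that hybrid idea: yield each document both as one long token blob and sentence by sentence (the sentence splitting here is deliberately crude and hypothetical):

        def hybrid_corpus(documents):
            # documents: iterable of raw text strings
            for doc in documents:
                sentences = [s.split() for s in doc.split(".") if s.strip()]
                # the whole document as a single blob...
                yield [token for sentence in sentences for token in sentence]
                # ...and each sentence on its own
                for sentence in sentences:
                    yield sentence

        # Materialize it so gensim can iterate once per epoch:
        # model = Word2Vec(list(hybrid_corpus(raw_docs)))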
    Ramin
    @transfluxus
    Hi, I just found that, of all the en_core_web models, only the small one detects stopwords?
    Evgeny Denisov
    @eIGato
    Hi guys. I've thought up a hack for Doc2Vec inference, but I don't know if it makes any sense.
    The problem was that infer_vector() produces a vector that is very different from the bulk-trained vector of the same document.
    The hack: after the bulk training, I just re-infer all the vectors and replace all the document vectors with the inferred ones.
    Does that make sense?
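    What that hack might look like against the gensim API used in this thread (doctag_syn0 and friends); corpus is a hypothetical list of TaggedDocument objects in training order:

        # corpus: list of gensim.models.doc2vec.TaggedDocument, in training order
        for i, doc in enumerate(corpus):
            # overwrite the bulk-trained vector with a freshly inferred one
            d2v.docvecs.doctag_syn0[i] = d2v.infer_vector(doc.words)

        # drop the cached normalized vectors so they get recomputed on demand
        d2v.docvecs.doctag_syn0norm = None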
    phalexo
    @phalexo
    And if you do it a second time, all the inferred vectors are going to be different again, because you changed the model.
    Do it a few hundred times, and maybe it will converge to something stable. :-)
    Evgeny Denisov
    @eIGato
    @phalexo you are wrong about it.
    In [13]: result = [np.dot(old_infer, new_infer) for old_infer, new_infer in zip(d2v.docvecs.doctag_syn0norm, infer_syn0norm)]
    In [14]: hist = plt.hist(result, bins=sim_borders)
    In [15]: plt.show(block=False)
    [image.png: histogram of the similarity values]
    I've calculated the similarity of the first and second inference: it's about 1.0 for all the vectors.
    phalexo
    @phalexo
    What parameters are you using with infer_vector? How many steps, alpha, min_alpha?
    Evgeny Denisov
    @eIGato
    @phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha. d2v.min_alpha = 0
    phalexo
    @phalexo
    What are the actual values? I don't know what you have in those variables.
    Evgeny Denisov
    @eIGato
    print(d2v.alpha, d2v.iter ** 2)
    0.00221920956360714 25
    phalexo
    @phalexo
    To get a reasonable inferred vector, steps should be around 500-1000.
    Evgeny Denisov
    @eIGato
    @phalexo Tried alpha = min_alpha = 0.19, steps = 4500. Got the same result. That's weird.
    phalexo
    @phalexo
    Try alpha = 0.01, min_alpha = 0.0001
    with steps = 1000
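    With the infer_vector() signature gensim had at the time (a steps parameter, later renamed epochs), that suggestion would read:

        # doc_words: token list of the document to infer (hypothetical name)
        vec = d2v.infer_vector(doc_words, alpha=0.01, min_alpha=0.0001, steps=1000)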
    Evgeny Denisov
    @eIGato
    @phalexo Same result. :(
    phalexo
    @phalexo
    Is this before you tweaked the vectors?
    Evgeny Denisov
    @eIGato
    No. I do discard the bulk-trained vectors.
    This is the similarity between the first inference (steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha) and the last result from the interactive shell.
    Evgeny Denisov
    @eIGato
    F@#$%! All that time I was inferring documents from a generator. A one-time generator. They did not re-infer at all. Darn.
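    For anyone hitting the same trap: a generator is exhausted after one pass, so a second loop over it silently does nothing. A minimal illustration, with corpus as a hypothetical list of TaggedDocument objects:

        docs = (d for d in corpus)  # one-shot generator
        first = [d2v.infer_vector(d.words) for d in docs]
        second = [d2v.infer_vector(d.words) for d in docs]  # docs is exhausted now
        assert second == []  # nothing was re-inferred

        docs = list(corpus)  # materialize once; can be iterated repeatedly
        first = [d2v.infer_vector(d.words) for d in docs]
        second = [d2v.infer_vector(d.words) for d in docs]  # works as intended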
    Evgeny Denisov
    @eIGato
    [image.png: histogram of the similarity values]
    This is the similarity between the first inference and alpha = min_alpha = 0.19, steps = 4500.
    Evgeny Denisov
    @eIGato
    Replaced the document vectors with a copy of the inferred ones, and re-inferred. The similarity is about 1.0, like before, because infer_vector() doesn't use the old docvecs at all.
    But I still don't know if that makes sense.
    phalexo
    @phalexo
    I don't either. Makes no sense to me. Considering that inferred vectors should be different based on parameters, it would seem odd.