    Ivan Menshikh
    @menshikh-iv
    @ixxie about the interface (in the general sense) - completely agree
    Matan Shenhav
    @ixxie
    @menshikh-iv alright, I will do so when I have time!
    Ivan Menshikh
    @menshikh-iv
    :+1:
    Matan Shenhav
    @ixxie
    @menshikh-iv - are there any other models in gensim you would recommend for short sentence vectorization besides LSI and FastText?
    FastText seems to perform very well in quality and speed so far, and requires less parameter tuning
    I think it will be the winner in the end
    I guess I might pull that Sent2Vec PR though because that might have potential too
    Ivan Menshikh
    @menshikh-iv
    probably :) I can also suggest some techniques for a "mixture of word-vectors"
    Sent2Vec is not really good with short documents
    Matan Shenhav
    @ixxie
    oh okay, the name seemed to suggest otherwise xD
    Ivan Menshikh
    @menshikh-iv
    @ixxie you need something like what is suggested here - RaRe-Technologies/gensim#1879
    @ixxie the truth is learned in practice, feel free to try of course!
    Matan Shenhav
    @ixxie
    @menshikh-iv we are on a tight schedule, so I will only explore it if I get some recommendations for it :)
    @menshikh-iv - mixing word vectors is an interesting idea... doesn't FastText do this internally somewhat?
    Ivan Menshikh
    @menshikh-iv
    @ixxie FastText averages the vectors of n-grams; this is the simplest way to solve this task
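A minimal sketch of that averaging approach, assuming the gensim 3.x FastText API (`size`/`iter` were renamed in later versions); the toy corpus is made up:

```python
import numpy as np
from gensim.models import FastText

# Toy corpus; FastText composes word vectors from char n-gram vectors.
sentences = [["the", "patient", "has", "ischemia"],
             ["short", "sentences", "vectorize", "fast"]]
model = FastText(sentences, size=100, min_count=1, iter=10)

def sentence_vector(tokens, model):
    # Average the word vectors; FastText can still produce a vector
    # for out-of-vocabulary words via their char n-grams.
    return np.mean([model.wv[t] for t in tokens], axis=0)

vec = sentence_vector(["patient", "with", "ischemia"], model)
```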
    Matan Shenhav
    @ixxie
    I see
    is it correct that the right choice of number of topics for LSI is sensitive to the size of the dataset / diversity of the tokens?
    Matan Shenhav
    @ixxie
    @menshikh-iv - does the vocab which FastText / Doc2Vec use provide a way to do something like TFIDF?
    phalexo
    @phalexo
    @menshikh-iv word_locks and doctag_locks default to None obviously. So, what would be the appropriate syntax to lock a list? If I do word_locks['ischemia'] = False, have I locked that vector to be immutable during additional training? Seems to me that's how it should work.
    Ivan Menshikh
    @menshikh-iv
    @phalexo not False, probably 0.0 :)
    these contain float values
    @ixxie I don't think so, for TFIDF please use https://radimrehurek.com/gensim/corpora/dictionary.html
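A minimal TF-IDF sketch with gensim's Dictionary, per the link above; the toy documents are made up:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [["cardiac", "ischemia", "treatment"],
        ["ischemia", "risk", "factors"]]
dct = Dictionary(docs)                    # token <-> id mapping
bow = [dct.doc2bow(d) for d in docs]      # bag-of-words corpus
tfidf = TfidfModel(bow)                   # fits IDF weights from the corpus
weighted = tfidf[bow[0]]                  # [(token_id, tfidf_weight), ...]
```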
    Matan Shenhav
    @ixxie
    @menshikh-iv I also discovered a comment in fasttext's GitHub about how you can simply do the weighting manually, because you add up the word vectors to find the doc vector; since I need to use custom frequencies anyway, I have been playing around with custom weighting of the word vectors this morning.
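A hedged sketch of that manual weighting idea: a custom-weighted average of word vectors as the document vector. `model` stands for any trained gensim FastText model and `weights` for a hypothetical token-to-weight mapping (e.g. TF-IDF or custom frequencies):

```python
import numpy as np

def weighted_doc_vector(tokens, model, weights):
    # Weighted average of word vectors; unknown tokens default to weight 1.0.
    vecs = np.array([model.wv[t] for t in tokens])
    w = np.array([weights.get(t, 1.0) for t in tokens])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()
```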
    Ivan Menshikh
    @menshikh-iv
    :+1:
    phalexo
    @phalexo
    @menshikh-iv word_locks and doctag_locks have float elements? Why? The comments seemed clear enough that these control whether weights can be further changed during infer_vector or train execution. Boolean seems the most appropriate type. Would there be any reason for this? Are the values simply used elsewhere to multiply weights? What would be the True value then, 1.0?
    Ivan Menshikh
    @menshikh-iv
    No, float is better here, because it is used directly in training (in multiplication operations)
    If you used booleans, you would need to write some "if <statement>" (and forget about vectorized calculation) or cast the booleans to float (because the original matrices are float). This is a well thought out move to optimize
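A pure-numpy illustration of why float locks vectorize nicely (this is not gensim's actual training code): the lock factor scales each row's update in a single multiplication, with no per-word branching:

```python
import numpy as np

vectors = np.random.rand(5, 3).astype(np.float32)  # word vectors
update = np.random.rand(5, 3).astype(np.float32)   # fake training updates
lockf = np.ones(5, dtype=np.float32)               # 1.0 = trainable
lockf[2] = 0.0                                     # freeze word #2

vectors += lockf[:, None] * update                 # row 2 stays unchanged
```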
    phalexo
    @phalexo
    OK, I see the rationale now. Is 1.0 then used as "weights are mutable"?
    Ivan Menshikh
    @menshikh-iv
    yes
    phalexo
    @phalexo
    Would I need to set up the arrays only once for multiple training epochs or refresh them every time?
    Ivan Menshikh
    @menshikh-iv
    I think only once
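A hedged sketch of the whole exchange, assuming gensim 3.x attribute names (the lock arrays have moved between releases, e.g. `syn0_lockf` in older versions, `trainables.vectors_lockf` in the 3.x series); the saved model and extra documents are made up:

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

model = Doc2Vec.load("pretrained.d2v")       # hypothetical saved model
idx = model.wv.vocab["ischemia"].index       # gensim 3.x vocab lookup
model.trainables.vectors_lockf[idx] = 0.0    # 0.0 = frozen, 1.0 = trainable

# Set once; the lock array persists across subsequent train() calls.
more_docs = [TaggedDocument(["ischemia", "study"], ["doc_new"])]
model.train(more_docs, total_examples=len(more_docs), epochs=5)
```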
    Matan Shenhav
    @ixxie
    @menshikh-iv do you know of models that focus on the level of token bigrams?
    Ivan Menshikh
    @menshikh-iv
    @ixxie I don't catch the question; you can add bigrams to your corpus with Phrases and pass it to any model
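A minimal sketch of that suggestion with gensim's Phrases; the toy corpus and thresholds are made up:

```python
from gensim.models.phrases import Phrases, Phraser

sentences = [["new", "york", "is", "big"],
             ["i", "love", "new", "york"]]
phrases = Phrases(sentences, min_count=1, threshold=1)  # detect collocations
bigram = Phraser(phrases)                    # smaller, faster wrapper
bigrammed = [bigram[s] for s in sentences]   # e.g. ["i", "love", "new_york"]
# `bigrammed` can now be fed to Word2Vec / FastText / LSI, etc.
```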
    Matan Shenhav
    @ixxie
    @menshikh-iv I see... I was wondering more whether there are word / document embedding models which are trained at the level of token bigrams for ALL of the corpus
    Ivan Menshikh
    @menshikh-iv
    no, the nearest variant is Sent2Vec, which is trained on token n-grams
    Matan Shenhav
    @ixxie
    I see
    so that is the main difference
    @menshikh-iv has sent2vec been merged yet?
    Ivan Menshikh
    @menshikh-iv
    no, it is almost done, but not ready (though you can already use it)
    Matan Shenhav
    @ixxie
    cool
    I might try it out
    Matan Shenhav
    @ixxie
    @menshikh-iv - what do you recommend for doing record linkage? As our talk over the previous days suggests, we have been trying to cluster vectors produced by doc2vec / FastText / LSI for this
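A hedged sketch of the clustering approach just described: KMeans over inferred Doc2Vec vectors. `model` stands for a hypothetical trained Doc2Vec model; the records and cluster count are made up, and any clustering algorithm over the vectors follows the same pattern:

```python
import numpy as np
from sklearn.cluster import KMeans

records = [["john", "smith", "main", "street"],
           ["jon", "smith", "main", "st"],
           ["jane", "doe", "elm", "road"]]
# `model` is a trained gensim Doc2Vec model (hypothetical).
vectors = np.array([model.infer_vector(r) for r in records])
labels = KMeans(n_clusters=2).fit_predict(vectors)  # linked records share a label
```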
    rambo-yuanbo
    @rambo-yuanbo
    Hello guys, I read word2vec.py; it seems that it simply iterates over my corpus several times with a decreasing learning rate. So does the order of samples in the corpus matter here? Actually, I didn't realize this until now, when I'm trying to mix texts from 2 different sources and give them some weighting by randomly iterating over the 1st or 2nd text source.
    Matan Shenhav
    @ixxie
    @rambo-yuanbo I believe the order makes a difference to each individual training pass, but not to the whole training run, because the list is shuffled between training epochs; I am not 100% sure though.
    phalexo
    @phalexo
    @rambo-yuanbo The more you train on source 1, the better the results will be for those documents, and the worse for source 2. You have to mix them thoroughly, especially if the vocabulary is different.
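A hedged sketch of "mix them thoroughly": interleave both sources into one list and reshuffle it before each epoch, with an explicit alpha decay per pass (gensim 3.x parameter names; the corpora and hyperparameters are made up):

```python
import random
from gensim.models import Word2Vec

source1 = [["text", "from", "source", "one"]] * 100
source2 = [["text", "from", "source", "two"]] * 100
corpus = source1 + source2

model = Word2Vec(size=100, min_count=1)  # no corpus yet: train manually below
model.build_vocab(corpus)

alpha, min_alpha, epochs = 0.025, 0.0001, 5
step = (alpha - min_alpha) / epochs
for _ in range(epochs):
    random.shuffle(corpus)  # thorough mixing of both sources each epoch
    model.train(corpus, total_examples=len(corpus), epochs=1,
                start_alpha=alpha, end_alpha=alpha - step)
    alpha -= step
```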
    Evgeny Denisov
    @eIGato
    Hello everyone
    I use the Doc2Vec model, and after training almost all the vectors in the docvecs section have the same (±1%) value. What's the matter and how do I fix that?
    phalexo
    @phalexo
    It is impossible to even make a conjecture based on the information you've provided.
    Evgeny Denisov
    @eIGato
    @phalexo, thanks for the answer. After some experiments I found that my alpha value was too high. What is the optimal value for a corpus of 10k-100k?
    phalexo
    @phalexo
    I'd go with 0.01
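A hedged sketch of that suggestion, assuming the gensim 3.x Doc2Vec API; the toy corpus and other parameters are placeholders:

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = [TaggedDocument(["some", "tokens"], [i]) for i in range(10)]  # toy corpus
model = Doc2Vec(docs,
                vector_size=100,
                alpha=0.01,        # lower starting learning rate
                min_alpha=0.0001,  # floor the rate decays to
                epochs=20)
```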