Evgeny Denisov
@eIGato
@phalexo steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha. d2v.min_alpha = 0
phalexo
@phalexo
What are the actual values? I don't know what you have in your variables.
Evgeny Denisov
@eIGato
print(d2v.alpha, d2v.iter ** 2)
0.00221920956360714 25
phalexo
@phalexo
To get a reasonable inferred vector, steps should be around 500-1000.
Evgeny Denisov
@eIGato
@phalexo Tried alpha = min_alpha = 0.19, steps = 4500. Got same result. That's weird.
phalexo
@phalexo
Try alpha = 0.01, min_alpha = 0.0001
with steps = 1000
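To make the two parameter choices in this exchange concrete, here is a minimal sketch of a learning-rate schedule. It assumes a plain linear decay from `alpha` toward `min_alpha` over `steps` iterations; the helper name `linear_alpha_schedule` is hypothetical, and gensim's exact internal schedule may differ.

```python
def linear_alpha_schedule(alpha, min_alpha, steps):
    """Return the learning rate at each step, assuming linear decay (an assumption)."""
    return [alpha - (alpha - min_alpha) * i / steps for i in range(steps)]


# Suggested settings: the rate shrinks step by step toward min_alpha.
decaying = linear_alpha_schedule(0.01, 0.0001, 1000)

# alpha == min_alpha: the rate never changes, so every step uses the same value.
constant = linear_alpha_schedule(0.19, 0.19, 4500)
```

With `alpha = min_alpha` there is no decay at all, which is one reason the two setups are not directly comparable.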
Evgeny Denisov
@eIGato
@phalexo Same result((
phalexo
@phalexo
Is this before you tweaked the vectors?
Evgeny Denisov
@eIGato
No. I do discard the bulk-trained vectors.
This is the similarity between the first inference (steps = d2v.iter ** 2, alpha = min_alpha = d2v.alpha) and the last result from the interactive shell.
Evgeny Denisov
@eIGato
F@#$%! All this time I was inferring documents from a generator. A one-time generator. They were not re-inferred at all. Bleen.
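The one-time generator pitfall can be reproduced in plain Python. This is a minimal sketch with made-up sample texts; `tokenized_docs` and `RestartableDocs` are hypothetical names, not part of gensim.

```python
def tokenized_docs(texts):
    """Yield each text as a list of tokens (a generator: usable once)."""
    for text in texts:
        yield text.split()


texts = ["first sample document", "second sample document"]

one_time = tokenized_docs(texts)
first_pass = list(one_time)   # consumes the generator
second_pass = list(one_time)  # nothing left: the generator is exhausted


class RestartableDocs:
    """Iterable that creates a fresh generator on every iteration."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        return tokenized_docs(self.texts)


restartable = RestartableDocs(texts)
pass_a = list(restartable)
pass_b = list(restartable)  # same documents again
```

Any loop that needs to see the documents more than once should receive the restartable iterable (or a list), not the bare generator.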
Evgeny Denisov
@eIGato
This is the similarity between the first inference and alpha = min_alpha = 0.19, steps = 4500.
Evgeny Denisov
@eIGato
Replaced the document vectors with a copy of the inferred ones and re-inferred. Similarity is about 1.0, like before, because infer_vector() doesn't use the old docvecs at all.
But I still don't know if it makes sense.
phalexo
@phalexo
I don't either. Makes no sense to me. Considering that inferred vectors should be different based on parameters, it would seem odd.
Evgeny Denisov
@eIGato
d2v.sample = 1.0 / word_count
Is it reasonable?
Evgeny Denisov
@eIGato
word_count is the number of distinct words (provisionally, the length of d2v.wv.index2word).
Evgeny Denisov
@eIGato
How to pick a reasonable sample value?
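One way to reason about a `sample` value is to look at how likely a word is to be kept at a given corpus frequency. The sketch below uses the keep-probability formula as I recall it from gensim's word2vec vocabulary scaling; treat the formula and the helper name `keep_probability` as assumptions and check them against the gensim version in use.

```python
import math


def keep_probability(word_fraction, sample):
    """Probability of keeping one occurrence of a word during downsampling.

    word_fraction: the word's share of all tokens in the corpus (0..1).
    sample: downsampling threshold; 0 is assumed to disable downsampling.
    """
    if sample <= 0:
        return 1.0
    ratio = sample / word_fraction
    return min(1.0, (math.sqrt(word_fraction / sample) + 1) * ratio)


# A word making up 1% of all tokens, with sample = 1e-3, is kept less than half the time.
frequent = keep_probability(0.01, 1e-3)

# A word whose share equals the threshold is always kept.
rare = keep_probability(1e-3, 1e-3)
```

Under this formula only words whose share of the corpus is well above `sample` are thinned out, so in a corpus where most words are rare, a very small `sample` mainly affects the few frequent ones.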
Evgeny Denisov
@eIGato
What if 90% of words are different from each other? Is it possible to train *2Vec model with such a corpus?
phalexo
@phalexo
Clearly that is not a natural language application.
Evgeny Denisov
@eIGato
@phalexo The purpose is to predict phrases, not words. So I use phrases as d2v words, and full texts as d2v docs.
Saurabh Vyas
@saurabhvyas
Is there a pretrained LDA model available for gensim, just for tinkering?
Matan Shenhav
@ixxie
@saurabhvyas can LDA even be used in a supervised mode?
Anyway, it's been pretty easy for us to train+predict on a given data set.
Dennis.Chen
@DennisChen0307
Hi there. Is there a roadmap for the next release of gensim?
AMaini503
@AMaini503
Should I expect Doc2Vec to use all the cores if I pass workers = #cpus?
matanster
@matanster
Apologies for adding a 4th question in a row here...
Does gensim have anything builtin for transforming a document to a bag of n-grams representation, or does it in fact only do bag of words? (words being 1-grams...)
@piskvorky
@matanster Gensim doesn't actually do the transformation; it already expects bag-of-whatever (feature_id, weight) pairs as input. How you split the documents into words/ngrams/something else is up to you.
@DennisChen0307 Our aim is one release per month, but the last few months have been busy at RARE, leaving not much time for open source. We plan a release for the end of this month.
matanster
@matanster
@piskvorky oh, sorry then, I just thought maybe corpora.Dictionary.doc2bow might have some usage form for that... I could swear I saw it computing the bag-of-words in my code, but I should probably start reading the source to answer my own questions
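The n-gram question can be sketched without any library: split the document into n-grams yourself, then count them into (feature_id, weight) pairs. The helpers `ngrams` and `to_bag` below are hypothetical and only imitate the shape of a bag-of-features document; the same n-gram tokens could instead be passed to `corpora.Dictionary.doc2bow`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_bag(features, feature_ids):
    """Count features and return sorted (feature_id, count) pairs.

    feature_ids is a dict that is extended in place with unseen features.
    """
    counts = Counter(features)
    for feature in counts:
        feature_ids.setdefault(feature, len(feature_ids))
    return sorted((feature_ids[f], c) for f, c in counts.items())


tokens = "the cat sat on the cat".split()
feature_ids = {}
bag = to_bag(ngrams(tokens, 2), feature_ids)
```

Here `bag` holds one pair per distinct bigram, with "the_cat" counted twice.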
Jesse Talavera-Greenberg
@JesseTG
When training a word2vec model, I need to give it a list of documents. How does word2vec treat unknown words? By giving an unknown word a vector close to a known word?
phalexo
@phalexo
It does an initial pass, compiling a corpus dictionary. If a word does not make it into the dictionary, I believe it is totally ignored thereafter.
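The behaviour described here can be illustrated with a simplified sketch: one pass builds the vocabulary, and later any token outside it is silently dropped. This is not gensim's implementation; `build_vocab`, `drop_unknown`, and the `min_count` cutoff value are assumptions for illustration.

```python
from collections import Counter


def build_vocab(docs, min_count=2):
    """First pass: keep only words seen at least min_count times."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


def drop_unknown(tokens, vocab):
    """Later passes: tokens missing from the vocabulary are ignored."""
    return [token for token in tokens if token in vocab]


docs = [["good", "morning", "twitter"], ["good", "night", "twitter"]]
vocab = build_vocab(docs)

# "yeet" was never seen during the vocabulary pass, so it disappears.
filtered = drop_unknown(["good", "yeet", "twitter"], vocab)
```

An unknown word therefore gets no vector at all, rather than a vector close to some known word.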
Jesse Talavera-Greenberg
@JesseTG
@phalexo So wait, how can I use word2vec to analyze Tweets if slang and trending topics are always changing? Or am I missing something?
phalexo
@phalexo
You would have to continue training. Maybe there is a way to update the dictionary.
In any case, with Twitter you have a huge problem because people abbreviate everything, make up their own words, etc...
Jesse Talavera-Greenberg
@JesseTG
@phalexo What would you suggest?
phalexo
@phalexo
This sounds like the problem posed on the experfy.com site.
Jesse Talavera-Greenberg
@JesseTG
What do you mean?
phalexo
@phalexo
Well, you have to experiment. Maybe there is a large repository of tweets. I'd train on that.
Jesse Talavera-Greenberg
@JesseTG
I have lots of data already. Like, terabytes. My problem is deciding what to do with it.
phalexo
@phalexo
There was a project posted in which they wanted to track groups of people by the language they use (slang).
Jesse Talavera-Greenberg
@JesseTG
Specifically, I'm trying to detect whether or not a given user is a sock puppet or bot (because Russia screwing with our elections pisses me off)
I have a list of known bots and I'm currently combing through my data to get some (but not all!) of the tweets made by these bots
phalexo
@phalexo
Well, train on the general corpus first,
including the bots and known bad actors.
Jesse Talavera-Greenberg
@JesseTG
Here's a catch; I also want to consider URLs and usernames that these bots commonly post. Should I consider those to be words?
Also, not all bots will have the same number of tweets available. For some bots I might have hundreds, for others I might have tens. I don't know yet, the job is still running.
phalexo
@phalexo