wordfreq module to build a custom corpus from multiple languages and multiple sources ^^
gensim to vectorize and cluster a relatively small set of sentences; the frequency counts inside our data set are misleading for the purposes of TF-IDF, so we need to use external frequencies. I imagine we are not the only ones. I was just wrapping
LsiModel into our model, and for the first two I could use
.build_vocab_from_freq() while for LSI I had to do this workaround.
Sent2Vec PR though, because that might have potential too.
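
To make the "external frequencies" idea above concrete, here is a rough sketch of weighting a sentence with an IDF-like term taken from an outside frequency table instead of from the data set itself. This is not our actual workaround and it does not use gensim. The names EXTERNAL_FREQ, DEFAULT_FREQ, external_idf and weight_sentence are made up for illustration, and the numbers are placeholders rather than real wordfreq output. In practice the table could be filled from something like wordfreq's word_frequency(word, lang).

```python
import math
from collections import Counter

# Hypothetical external relative frequencies (placeholder values, not real
# wordfreq output). Values are assumed to lie in (0, 1].
EXTERNAL_FREQ = {
    "the": 5e-2,
    "cat": 1e-4,
    "sat": 5e-5,
}
# Assumed floor for words missing from the external table.
DEFAULT_FREQ = 1e-8


def external_idf(word, freqs=EXTERNAL_FREQ, default=DEFAULT_FREQ):
    """IDF-like weight from an external frequency: rarer words score higher."""
    return -math.log(freqs.get(word, default))


def weight_sentence(tokens, freqs=EXTERNAL_FREQ, default=DEFAULT_FREQ):
    """Return {word: tf * external_idf}, L2-normalised, for one tokenised sentence."""
    counts = Counter(tokens)
    raw = {w: c * external_idf(w, freqs, default) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    # An all-zero vector (every word has frequency 1.0) is returned unnormalised.
    return {w: v / norm for w, v in raw.items()} if norm else raw


weights = weight_sentence(["the", "cat", "sat", "the"])
```

Even though "the" occurs twice in that sentence, it ends up with the lowest weight because the external table says it is common everywhere, which is the effect the in-corpus counts fail to give on a small data set.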