AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py'>
Has the DocvecsArray class moved (and where)?
corpus2dense()
.vector.shape. See the "# numpy vector of a word" example at https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors
corpus2dense: it's true that NLP produces sparse matrices under the bag-of-words representation. But some later transformations, such as LSI, map the sparse vectors into a dense space. There, a dense representation actually saves memory, because storing a dense matrix in sparse structures adds overhead.
corpus2dense. Even scikit-learn didn't have sparse support for a while ;)
import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
dataset = api.load("fake-news")
dct = Dictionary(dataset) # fit dictionary
corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format
model = TfidfModel(corpus) # fit model
vector = model[corpus[0]] # apply model to the first corpus document
corpus2dense, I had it overflow the memory of a large machine on just 250,000 documents...
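A rough back-of-the-envelope check shows why this can happen on a raw bag-of-words corpus. The sketch below assumes a vocabulary of 100,000 terms (a made-up figure, since the real vocabulary size isn't given here) and corpus2dense's default float32 output at 4 bytes per cell.

```python
def dense_bytes(num_terms, num_docs, itemsize=4):
    """Bytes needed for a dense num_terms x num_docs matrix.

    itemsize=4 corresponds to float32, the default dtype of corpus2dense.
    """
    return num_terms * num_docs * itemsize


num_docs = 250_000     # figure quoted above
num_terms = 100_000    # hypothetical vocabulary size, not from the thread

total = dense_bytes(num_terms, num_docs)
gib = total / 2**30
print(f"{total:,} bytes = {gib:.1f} GiB")
```

If the matrix really is mostly zeros, gensim.matutils.corpus2csc keeps it as a SciPy sparse matrix instead, which is usually the safer choice before any dimensionality reduction.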
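from gensim.models import FastText

def read_dataset(fpath):
    # Stream the corpus line by line so it never has to fit in memory.
    with open(fpath, "r") as f:
        for line in f:
            # <sentence creation logic>
            yield sentence

# Parameter names (size, iter) follow the gensim 3.x API.
model = FastText(min_count=1, size=50, window=5, workers=8, sg=1, word_ngrams=1, min_n=3, max_n=6, iter=5, negative=0)
# A generator is exhausted after one pass, so read_dataset() is called again for training.
model.build_vocab(read_dataset(args.fpath))
model.train(read_dataset(args.fpath), total_examples=model.corpus_count, epochs=model.iter)
model.save("custom_model")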