IDS and PAD_AMOUNT are global constants constructed by loading a txt file of Crime and Punishment. You can instead load multiple txt files, apply the tokenizer, and then generate the corresponding token ids for each one.
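To make the construction concrete, here is a minimal self-contained sketch of how IDS and PAD_AMOUNT could be built from one text. The ToyTokenizer, the sample text, and the 512 * 1024 sequence length are stand-ins (the real notebook uses a SentencePiece model over the novel's full text), so treat the constants as illustrative:

```python
import numpy as np

# Stand-in tokenizer: the real example uses a trained SentencePiece
# model; a toy char-level encoder keeps this sketch self-contained.
class ToyTokenizer:
    def EncodeAsIds(self, text):
        return [ord(c) % 320 for c in text]  # 320-entry vocab, an assumption

TOKENIZER = ToyTokenizer()

text = "In the beginning was the word..."  # stands in for the novel's text
IDS = np.asarray(TOKENIZER.EncodeAsIds(text), dtype=np.int32)

# Pad so the whole tokenized text fits one fixed-length training
# sequence (the TPU notebook trains on a single very long sequence).
SEQ_LEN = 512 * 1024
PAD_AMOUNT = SEQ_LEN - len(IDS)
```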
Hi, thank you for making Reformer and Trax available.
I have a question regarding the TPU Crime and Punishment example. The language model obviously learns made-up words: scandlchedness, raggong, innatummed, quisten... Some great words there, but...
Is this an artifact of the hashing, or what do you think causes it?
import os
import random

from tensorflow.io.gfile import GFile

def my_inputs(n_devices):
    while True:
        # Pick a random text file from the data directory and tokenize it
        file = random.choice(os.listdir('files'))
        with GFile(os.path.join('files', file)) as f:
            text = f.read()
        IDS = TOKENIZER.EncodeAsIds(text)
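The fragment above stops after tokenizing; a working generator also has to yield batches. Here is one hedged way to finish it, with a plain list of token ids standing in for the tokenized file and illustrative names (toy_inputs, seq_len); the (inputs, targets, mask) tuple shape follows the usual language-modeling convention where inputs double as targets:

```python
import numpy as np

def toy_inputs(n_devices, ids, seq_len=1024):
    """Sketch of a complete input generator: crop a random window of
    token ids, pad it to seq_len, and yield one row per device."""
    ids = np.asarray(ids, dtype=np.int32)
    while True:
        batch = []
        for _ in range(n_devices):
            start = np.random.randint(0, max(1, len(ids) - seq_len))
            chunk = ids[start:start + seq_len]
            # Pad short chunks up to the fixed sequence length.
            batch.append(np.pad(chunk, (0, seq_len - len(chunk))))
        x = np.stack(batch).astype(np.int32)
        # Language modeling: inputs double as targets; mask is all ones.
        yield x, x, np.ones_like(x, dtype=np.float32)
```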
@nkitaev I'm using this to feed in the multiple text files. Do you think I can tweak any of the hyperparameters in the parse_config to run the model for longer than half an hour without running into memory issues?
MultifactorSchedule controls the learning rate schedule, which only affects how long training takes, not how much memory is used. You can try running with a few more warmup steps, and more steps_per_cycle in the cyclic cosine schedule.
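For intuition, a warmup plus cyclic-cosine schedule of the kind described can be sketched in a few lines. This is an illustrative stand-alone formula in the spirit of MultifactorSchedule, not Trax's implementation, and the constants (base_lr, warmup_steps, steps_per_cycle) are assumptions:

```python
import math

def multifactor_lr(step, base_lr=0.5, warmup_steps=100, steps_per_cycle=1000):
    """Sketch: linear warmup multiplied by a cyclic cosine factor.
    Increasing warmup_steps or steps_per_cycle stretches the schedule
    without changing memory use."""
    warmup = min(1.0, step / warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * (step % steps_per_cycle) / steps_per_cycle))
    return base_lr * warmup * cosine
```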
my_inputs will let you feed in your own data, and you can tune the model hyperparameters as well.