@friesel - Let me ask the person who did this to maybe make a Colab and share.
Re: problem.py
Trax can (and does) consume T2T problems, so in a Trax gin config, just set `inputs.dataset_name = 't2t_<your problem name>'`,
and Trax should pick that up as the input.
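For concreteness, here is a minimal sketch of setting that binding from Python via `gin.parse_config` instead of a `.gin` file. The problem name and data directory are placeholders, and the `data_dir` binding is assumed to exist alongside `dataset_name`:

```python
# Hedged sketch: point Trax's input pipeline at a T2T problem via gin.
# 't2t_my_problem' and the data directory below are placeholders.
import gin
import trax  # importing trax registers its gin-configurable input functions

gin.parse_config("""
inputs.dataset_name = 't2t_my_problem'
inputs.data_dir = '/path/to/generated/t2t_data'
""")
# After this, the trainer's input function should read batches from the
# T2T problem's generated data.
```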
@dimeldo - The Reformer architecture, which is implemented in Trax, has a few experiments that you can check out: https://arxiv.org/pdf/2001.04451.pdf
The novelty is being able to train over longer sequences, so the authors pushed that aspect. I'm sure that making it deeper (the Reformer is already more memory-efficient than other Transformers) and doing MLM pre-training (coming, hopefully soon) will push it toward SOTA on those tasks.
@nikitakit, the first author, can comment more here.
@johngrabner - There aren't any examples of image-to-text right now (that is your task, right? We have image-to-label using Reformer/Transformer), but one way to proceed would be to add your dataset to TFDS (this should be easy, it has very nice documentation) and then pose it as a text generation problem with the image as input.
Trax is a library of deep learning models that allows you to do these kinds of things.
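If it helps, here is a rough, hedged sketch of what such a TFDS dataset could look like; the class name, file paths, and caption format are all hypothetical placeholders, not an existing dataset:

```python
# Hedged sketch of a TFDS builder for an image -> text task.
# All names and paths below are placeholders.
import tensorflow_datasets as tfds


class MyImageCaptions(tfds.core.GeneratorBasedBuilder):
  """Hypothetical image-to-text dataset."""
  VERSION = tfds.core.Version('1.0.0')

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Image(),
            'text': tfds.features.Text(),
        }),
        supervised_keys=('image', 'text'),
    )

  def _split_generators(self, dl_manager):
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={'examples_path': '/path/to/train.tsv'},
        ),
    ]

  def _generate_examples(self, examples_path):
    # Yield (key, example) pairs; replace with your own parsing logic.
    with open(examples_path) as f:
      for i, line in enumerate(f):
        image_path, caption = line.rstrip('\n').split('\t')
        yield i, {'image': image_path, 'text': caption}
```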
Regarding `_data_rotation_farthest`: although it was not mentioned in the paper, I was wondering if choosing either of the two made a difference. I was also wondering about `self._allow_duplicate_attention` and `hard_k`.
`_data_rotation_farthest` was added after submitting the paper; check out the gin configs under trax/configs: some of the Reformer configs use `_data_rotation_farthest` (enwik8 doesn't, since `_data_rotation_farthest` works slightly worse than random there for some reason; on the other tasks it works better than random). Maybe @lukaszkaiser and @nikitakit can tell you more about the other settings.
The data-dependent hashing option (`_data_rotation_farthest`) seems to help a bit for imagenet64, but it performs a bit worse on enwik8. There are also some open questions about how one would sample from such a model, because inference starts with an empty sequence and there is no way to initialize the data-dependent hash. I would say this option is still in the research/experimental phase.
The hard top-k option (`hard_k`) never yielded any demonstrable benefits in our experiments. At the start of the project we had this idea that if one could identify the top-k attention targets for each query, then doing sparse attention within the top-k only would be faster than dense attention. The problem is that modern hardware is designed for dense computation to such a large degree that the sparse version usually ends up being slower (sometimes substantially slower).
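To make the idea concrete, here is a small illustrative NumPy sketch of the "hard top-k" notion (not the Trax implementation): each query keeps only its k highest-scoring keys, but the matrix multiplications underneath remain dense.

```python
# Illustrative only: "hard top-k" attention where each query attends to its
# k highest-scoring keys. Note that both matmuls are still dense.
import numpy as np


def hard_topk_attention(q, k_mat, v, k=8):
  """Each query attends only to its k highest-scoring keys."""
  logits = q @ k_mat.T                                     # [n_q, n_kv], dense matmul
  kth_largest = np.partition(logits, -k, axis=-1)[:, -k][:, None]
  masked = np.where(logits >= kth_largest, logits, -1e9)   # mask scores outside top-k
  weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)
  return weights @ v                                       # still a dense matmul
```

Even with the mask, the `q @ k_mat.T` and `weights @ v` products run over the full sequence, so nothing is saved unless the kernel can actually exploit the sparsity, which is exactly the problem described above.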
Hi, I am trying to get acquainted with Trax through the quickstart notebooks and wanted to use trax/rl. How do I use ppo_trainer with a custom gym env? I looked at trax/rl/ppo_trainer_test.py for reference.
(Issues similar to this: https://colab.research.google.com/drive/1TnYMIt7Zm-iCN-Az3jeO8QoIpQ7YvHiD)
Also, I eventually want to build a simple DQN that can use transformer_decoder as the model. How should I go about it? Does the transformer always expect inputs as in trax/rl/supervised/inputs.py? How do I include both states and actions in the training stream? Any guidance/resources would be very helpful, TIA.
Hi @PranavMahajan25 - thanks for trying it out! We'd be very interested in taking the colab as an example of RL once it works.
I ran the colab and it doesn't error out for me! So it looks like you got it working (maybe don't reuse the same object with a different net, since it looks like there may have been some caching of weights?).
PS: I like the idea of starting with the test and modifying it in place to get what you want :)
Re: states and actions both in the training stream: @koz4k added some code for doing similar things in the rl/ directory, mostly related to simple.py. That part of the code is under heavy development; we want to try something similar as well.
PPS: The colab uses trax 1.0.0, maybe upgrade to the latest 1.2.2?
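For the custom-env part of the question, here is a hedged sketch of a minimal environment following the standard `gym.Env` interface; the environment and all names are hypothetical, and hooking it up to ppo_trainer depends on the Trax version, so only the gym side is shown:

```python
# Minimal custom gym environment (gym API only; wiring it into trax.rl's
# ppo_trainer is version-dependent and not shown here).
import gym
import numpy as np
from gym import spaces


class ToyEnv(gym.Env):
  """Hypothetical toy environment: reach a target value by moving +/-1."""

  def __init__(self, target=5, horizon=20):
    self.action_space = spaces.Discrete(2)  # 0: move -1, 1: move +1
    self.observation_space = spaces.Box(
        -100.0, 100.0, shape=(1,), dtype=np.float32)
    self._target, self._horizon = target, horizon

  def reset(self):
    self._pos, self._t = 0, 0
    return np.array([self._pos], dtype=np.float32)

  def step(self, action):
    self._pos += 1 if action == 1 else -1
    self._t += 1
    reward = -abs(self._pos - self._target)   # closer to target = higher reward
    done = self._t >= self._horizon
    return np.array([self._pos], dtype=np.float32), reward, done, {}
```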
- `IDS` and `PAD_AMOUNT` are global constants that are constructed by loading a txt file of Crime and Punishment. You can instead load multiple txt files, apply the tokenizer, and then generate corresponding token ids for each one, as in the sketch below.
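A hedged sketch of that multi-file variant, assuming a SentencePiece tokenizer (the model file name is a placeholder) and the colab's 512 * 1024 sequence length:

```python
# Hedged sketch: tokenize several txt files and compute per-file padding,
# mirroring how the colab builds IDS and PAD_AMOUNT for a single book.
import numpy as np
from sentencepiece import SentencePieceProcessor

SEQ_LEN = 512 * 1024  # the colab's full sequence length

TOKENIZER = SentencePieceProcessor()
TOKENIZER.Load('cp.320.model')  # placeholder: path to your tokenizer model


def ids_and_pad(path):
  """Return token ids for one file and how much padding it needs."""
  with open(path) as f:
    ids = np.asarray(TOKENIZER.EncodeAsIds(f.read()), dtype=np.int32)
  return ids, max(0, SEQ_LEN - len(ids))

corpora = [ids_and_pad(p) for p in ['book1.txt', 'book2.txt']]  # your files
```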
Hi, thank you for making Reformer and Trax available.
I have a question regarding the TPU Crime and Punishment example. The language model obviously learns made-up words: scandlchedness, raggong, innatummed, quisten... Some great words there, but...
Is this an artifact of the hashing, or what do you think causes it?