    hairzooc
    @hairzooc
    If there's some document or something, it'll be helpful
    I've just run this command "python -m trax.trainer --config_file=$PWD/trax/configs/reformer_enwik8.gin"
    Lukasz Kaiser
    @lukaszkaiser
    @hairzooc: can you try with tf 1.15? It may be that some incompatibility with tf2 slipped through...
    hairzooc
    @hairzooc
    @lukaszkaiser I got it. I will try it and let you know.
    It's working well for now. Thank you for your swift help :)
    Brandon Koerner
    @ReewassSquared
    @dimeldo An autoregressive decoder-only LM is satisfactory for deep conversational models, in my experience. I have yet to test the Reformer; I'm still getting dependencies fixed. It's been a while since I've updated things.
    skwonlee
    @skwonlee
    There is no 'predict' mode in the Reformer model for the translation task. Are there any plans to implement it?
    Lukasz Kaiser
    @lukaszkaiser
    @skwonlee : yes! We hope to have some code for it next week, let's see how far we manage to get...
    dimeldo
    @dimeldo
    @lukaszkaiser @ReewassSquared Thank you!
    gofimofi
    @gofimofi
    @lukaszkaiser Thank you for making this available. I have a dataset and I want to train the Reformer on my own dataset to create a generic LM, then fine-tune that model on another dataset so I can embed sentences that are not in my dataset. Is there any sample I can follow to accomplish this?
    Madison May
    @madisonmay
    Looks like DeepMind has gifted you all a dataset for your Reformer experiments and issued a challenge!
    gofimofi
    @gofimofi
    At least, I would like to learn how to train the Reformer on wiki data so that it can provide embeddings for each token in the given input text. Can anyone suggest a sample script? Thank you
    Aran Komatsuzaki
    @arankomatsuzaki_twitter
    @madisonmay PG19 is cool! On a slightly related note, I've recently been thinking of creating a dataset whose samples are concatenations of publicly available info (news etc.) about a company during a given time interval, paired with the near-future stock price. This not only requires a long-range attention span but may also generate a profit for training a larger model :)
    afuller187
    @afuller187
    Cool group! What's the latest on using much larger vocabs with the reformer model?
    afuller187
    @afuller187
    Also has anyone tried increasing the batch size of the reformer? What was the outcome?
    Phillip Bock
    @friesel
    (quoting the earlier reply "@friesel - Let me ask the person who did this to maybe make a Colab and share.")
    @afrozenator Did you happen to get to this? (The Trax implementation of the "A friend played with it on the TFDS scientific papers dataset and it does generate reasonable summaries (even if it was a little repetitive at first try).") Would be awesome. Still struggling heavily to get back to what we did in t2t. Thx
    Suraj Patil
    @patil-suraj
    @friesel I'm also playing with this problem. Haven't got any significant results yet but the model has learned to generate reasonable text
    Dyn4mi
    @Dyn4mi
    Hello. It is my first time trying to contribute to a project. Where do I start?
    I know how to build the program and how to clone the repo, but I can't understand the issues.
    WTPlumb
    @WTPlumb
    Hi again. I've trained 3 separate Reformers, where I aim to see the differences between the influential or most connected words from each model. Is there a way of extracting a list of the most strongly connected words from a model, or something similar? Thanks again!
    Tenoke
    @Tenoke
    some samples from the Reformer I'm training on wikipedia (still training, and I need to change the sampling) in case anyone here is curious https://pastebin.com/tMrRcDQv
    dimeldo
    @dimeldo
    Nice one, @Tenoke! How many data samples did you feed it? What was the size of the sequences? And the size of the model?
    Also, what kind of sampling did you use?
    Tenoke
    @Tenoke

    a bit over 100k steps, at 8 samples per step; 1 sample is 1 full Wikipedia article plus random padding. I'm mostly using the basic sampling from the notebooks with:
    gin.parse_config("""
        TimeBinCausalAttention.bin_length = 128
        TimeBinCausalAttention.n_bins = None
        LSHCausalAttention.n_hashes = 8
        LSHCausalAttention.bucket_capacity_for_inference = 256
    """)
    I'm going to do top_p sampling next and see what I can change in the config at inference to see if I get better results (e.g. more hashes, bigger bins; not sure what else yet). Advice welcome.
    I'll write it up as a tutorial afterwards, including the helper functions I've added, etc., so people can follow it.
    dimeldo
    @dimeldo
    Cool! Thanks for the info. Also, what dataset are you using? What's the dataset size?
    Lukasz Kaiser
    @lukaszkaiser
    @madisonmay and @arankomatsuzaki_twitter : I really like the PG19 dataset! It'd be great to have it hooked up in TFDS or T2T for easy use (I'll try to do it, but I have a 2-month-old daughter, so I'll probably be strapped for time for another month or two). I strongly believe that pre-training on long text will show cool things.
    @skwonlee : Nikita managed to get a full bi-directional Reformer going and got good translation results (just in time for the final ICLR paper). Decoding works but is still hacky, needs a little cleanup before checking in. The BLEU is 29.1 for a big Reformer, so no degradation :)
    Aminzai Wardak
    @Aminzai1
    Hello! Has anyone implemented the 'Attention Is All You Need' paper using Trax? If so, can you share the code with me? Thank you.
    dimeldo
    @dimeldo
    @lukaszkaiser Congratulations! Good luck (:
    Lukasz Kaiser
    @lukaszkaiser
    @Aminzai1 : yes, the original Transformer code is in Trax here: https://github.com/google/trax/blob/master/trax/models/transformer.py
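    A minimal sketch of instantiating that model (not from the chat; the vocab size is an illustrative placeholder and the keyword names follow the linked transformer.py, but may differ between Trax versions):
        import trax

        # Builds the classic encoder-decoder Transformer from the linked file.
        model = trax.models.Transformer(
            input_vocab_size=32000,   # placeholder; set to match your tokenizer
            d_model=512,
            d_ff=2048,
            n_heads=8,
            n_encoder_layers=6,
            n_decoder_layers=6,
            mode='train',
        )
        print(model)  # prints the layer structure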
    Tenoke
    @Tenoke
    @dimeldo I downloaded the most recent full dump of Wikipedia and turned it into text files using https://radimrehurek.com/gensim/scripts/segment_wiki.html (I had to fix an issue with the script, but they've accepted my PR upstream now). That's ~19 GB of text containing all English Wikipedia articles. The model works out to ~99M params, so a bigger dataset likely won't benefit me that much at that size, and I couldn't fit a model as big as the ones from the paper on a TPU. After trying different hyperparameters for a while, I basically ended up using https://github.com/google/trax/blob/master/trax/configs/reformer_enwik8.gin but with half the d_ff. I'm not sure whether I could have trained with the full 4096 d_ff if I had abandoned the Trainer altogether, though.
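    A hypothetical sketch of the "half the d_ff" tweak as a gin override, in the same gin.parse_config style used earlier in this chat; it assumes the base config is loaded as usual and only overrides this one value:
        import gin

        gin.parse_config("""
        ReformerLM.d_ff = 2048  # reformer_enwik8.gin uses the full 4096; halved here
        """)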
    ijanus
    @Akaa_gitlab
    Hello All
    James Proud
    @jamesproud
    Hi, are there any examples with Trax or Jax of defining custom connections for dense layers? i.e. forcing certain weights and activations to always be zero.
    Michal Auersperger
    @michal_au_twitter
    Hi! Looking at the code for Reformer it seems to me only regular self-attention is used (tl.SelfAttention, tl.EncDecAttention). I was expecting to see tl.LSHSelfAttention so that locality-sensitive hashing is used as mentioned in the paper. Am I missing something? Thanks a lot!
    Tenoke
    @Tenoke
    You are looking at the default arguments to the Reformer function, but it's mostly not used with the defaults. You typically pass a list of LSH and/or TimeBin causal attentions using attention_type. They mostly do it using gin config; you can find some of the configs they used here: https://github.com/google/trax/blob/master/trax/configs/
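    A hypothetical sketch of what such an override can look like; the exact layer names, values, and whether a list or a single attention type is expected depend on the Trax version, so treat the linked configs as authoritative:
        import gin

        gin.parse_config("""
        ReformerLM.attention_type = [@TimeBinCausalAttention, @LSHCausalAttention]
        TimeBinCausalAttention.bin_length = 64
        LSHCausalAttention.n_hashes = 4
        """)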
    Michal Auersperger
    @michal_au_twitter
    Thanks for the reply! I can see this is the case (gin configs) for language models (e.g. ReformerLM), but the full-blown reformer seems to have the attention type hard-coded within. I am wondering why this is the case...
    Tenoke
    @Tenoke
    it says it right in the docstring - "At the moment, this model supports dot-product attention only. For the attention types in the Reformer paper, see ReformerLM."
    nkitaev
    @nkitaev
    @michal_au_twitter To add on to that: ReformerLM is the model that we study in the paper. The encoder-decoder one is designed for machine translation, where sequences are generally shorter than 128 tokens so there isn't any benefit from using LSH
    skwonlee
    @skwonlee
    @lukaszkaiser : Thanks for the good information. But I failed to run inference for machine translation when applying gin. @nkitaev : May I have an inference example for Reformer MT?
    Michal Auersperger
    @michal_au_twitter
    @Tenoke: I actually managed to skip reading the docstring, thanks :D! @nkitaev: Thanks for the clarification. Regarding MT, people are trying to translate larger pieces of text than individual sentences with the hope of capturing some extra-sentential dependencies, so I thought LSH attention might be useful for that...
    Danil
    @blizda
    Hi guys. Is it possible to implement a custom training loop somehow? I'm trying to train ReformerLM like BERT, but I'm not sure how to implement loss calculation only on particular (masked) tokens.
    Lukasz Kaiser
    @lukaszkaiser
    @blizda : it should be easy to do a custom training loop: model.pure_fn is a purely functional version of the model, trax.math.grad can give you gradients, and you can go wild from there. On the other hand, the Trax Trainer has the (widely used) option has_weights=True, in which case it assumes the training inputs are (input, target, loss_weights), and you can set loss weights as for masked models like BERT. Even the intro example (https://colab.research.google.com/github/google/trax/blob/master/trax/intro.ipynb) with copying sequences has loss weights so the LM does not get loss for the prefix; you can set these weights for BERT too.
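    A minimal sketch of building such (input, target, loss_weights) triples for BERT-style masking; MASK_ID and the 15% rate are illustrative placeholders, not Trax constants:
        import numpy as np

        MASK_ID = 4        # hypothetical id of the [MASK] token in your vocab
        MASK_RATE = 0.15   # BERT-style: compute loss on ~15% of positions

        def make_mlm_example(token_ids, rng=np.random):
            """Mask a fraction of tokens; loss is taken only where loss_weights == 1."""
            token_ids = np.asarray(token_ids)
            target = token_ids.copy()
            mask = rng.random(token_ids.shape) < MASK_RATE
            inputs = np.where(mask, MASK_ID, token_ids)
            loss_weights = mask.astype(np.float32)  # 1.0 at masked positions, 0.0 elsewhere
            return inputs, target, loss_weights
    With has_weights=True, the loss should then be accumulated only at the masked positions, which is the same mechanism as the prefix weights in the copy-task intro.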
    Tenoke
    @Tenoke
    I want to use top_k/top_p sampling, however I get AssertionError: Fast inference with hard_k is not implemented. What should I look at to actually use top_k sampling? I can't seem to find any examples or configs where it was used.
    nkitaev
    @nkitaev
    I'm still working on decoding and the relevant code hasn't been added to GitHub yet. You can look at the demo Reformer colabs and modify the sampling function to do top_k/top_p pretty easily, though.
    hard_k has nothing to do with decoding; in fact, that flag only exists as a remnant of a research direction that yielded no results.
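    A framework-agnostic sketch of that modification (plain NumPy over a single logits vector; not the colab's exact sampling function):
        import numpy as np

        def sample_top_k_top_p(logits, top_k=0, top_p=1.0, temperature=1.0, rng=np.random):
            """Sample one token id after optional top-k and nucleus (top-p) filtering."""
            logits = np.asarray(logits, dtype=np.float64) / temperature
            if top_k > 0:
                kth_largest = np.sort(logits)[-top_k]
                logits = np.where(logits < kth_largest, -np.inf, logits)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            if top_p < 1.0:
                order = np.argsort(-probs)
                keep = np.cumsum(probs[order]) <= top_p
                keep[0] = True                      # always keep the most likely token
                filtered = np.zeros_like(probs)
                filtered[order[keep]] = probs[order[keep]]
                probs = filtered / filtered.sum()
            return int(rng.choice(len(probs), p=probs))
    For example, next_id = sample_top_k_top_p(model_logits, top_p=0.9) in place of a plain argmax or multinomial draw.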
    Michael Albergo
    @malbergo
    Hey everyone! Quick question for y'all: could someone clarify the difference between setting "ReformerLM(mode='predict')" and "ReformerLM(mode='evaluate')"? I am trying to benchmark a very simple task with a decoder-only transformer (e.g. just training on binary sequences and sampling binary sequences of fixed length).
    nkitaev
    @nkitaev
    mode='eval' is for evaluating perplexity or accuracy on entire sequences at a time. mode='predict' is for autoregressive decoding, where you need the model output for one token before you can move on to the next token
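    A schematic sketch of the two call patterns; the model below is a stand-in callable returning dummy logits, not the Trax API, just to show the shapes involved:
        import numpy as np

        VOCAB = 256

        def model(tokens):
            """Stand-in for a trained LM: returns dummy logits for every position."""
            return np.zeros(tokens.shape + (VOCAB,))

        # mode='eval' style: score a whole sequence with one call.
        sequence = np.array([[5, 17, 42, 7]])
        logits_per_position = model(sequence)              # shape (1, 4, 256)

        # mode='predict' style: decode autoregressively, one token per call,
        # feeding each newly generated token back in.
        generated = [0]                                    # assumed start token
        for _ in range(4):
            logits = model(np.array([[generated[-1]]]))[0, -1]
            generated.append(int(np.argmax(logits)))       # greedy for illustration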