    AlDante
    @AlDante
    @mahesh21aug_twitter Join the club :-)
    Jérémie C. Wenger
    @jchwenger
    I'd also like to add that the combination of LSH and reversible layers was mind-blowing, and holds huge promise! Especially given the amount of work I've seen people doing in order to fit recent language models in memory.
    NGamma
    @NGamma
    # Assumes the colab's setup, where GFile and a SentencePiece TOKENIZER are already defined.
    import os, random

    def my_inputs(n_devices):
      while True:
        # Pick a random file from the corpus directory and tokenize it.
        file = random.choice(os.listdir('files'))
        with GFile(os.path.join('files', file)) as f:
          text = f.read()
        ids = TOKENIZER.EncodeAsIds(text)
        yield ids
    @nkitaev I'm using this to feed in multiple text files. Do you think I can tweak any of the hyperparameters in parse_config to run the model for longer than half an hour without running into memory issues?
    Suraj Patil
    @patil-suraj
    @nikitakit @NGamma also, how do I batchify my_inputs?
    Suraj Patil
    @patil-suraj
    Sorry, a better question is how to create batches for feeding the TPU. In the original Reformer example each TPU core runs one example; instead of that, can we run batches of different examples?
    nkitaev
    @nkitaev
    @patil-suraj In the example my_inputs returns 8 x 0.5M tensors, but you can change it to be 16x?, 32x?, etc.
    As long as the total batch size is a multiple of the number of devices (8), the code will work. Half a million tokens is close to the limit of what fits in 8GB, so if you increase the batch size you'll almost certainly need to shorten the length.
    @NGamma The parameters of MultifactorSchedule control the learning rate schedule, which only affects how long training takes, not how much memory is used. You can try running with a few more warmup steps and more steps_per_cycle in the cyclic cosine schedule.
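    A minimal sketch of that batch-size constraint (the (inputs, targets, mask) convention and the constants here are assumptions modeled on the colab, not its exact code):
    import numpy as np

    SEQ_LEN = 65536        # shorter sequences leave room for a bigger batch
    BATCH_PER_DEVICE = 2   # total batch = 8 devices * 2 = 16 examples

    def my_inputs(n_devices):
      """Yields (inputs, targets, mask) with batch size a multiple of n_devices."""
      batch_size = n_devices * BATCH_PER_DEVICE
      while True:
        # Stand-in data; replace with your own tokenized corpus.
        ids = np.random.randint(2, 320, size=(batch_size, SEQ_LEN), dtype=np.int32)
        mask = np.ones((batch_size, SEQ_LEN), dtype=np.float32)
        yield (ids, ids, mask)  # language modeling: targets == inputs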
    nkitaev
    @nkitaev
    Colabs do time out after a while, and I'm not sure how long of a training time you can reliably get without being preempted
    nkitaev
    @nkitaev
    @jchwenger Thanks! The 0.5M tokens in the demo is definitely at the limit of what fits in 8 GB. Past that, you can chunk the data, get more memory (TPUv3 on Google Cloud has 16GB per core instead of 8GB), or write custom code to split a single example across multiple TPU cores.
    Suraj Patil
    @patil-suraj
    Hello, I'm running the colab Reformer notebook with a new dataset. I changed the number of tokens to 281525, which gives me the following error:
    TypeError: reshape total size must be unchanged, got new_sizes (1, 281525, 256) for shape (1, 512, 1024, 256).
    Any ideas? TYIA!
    nkitaev
    @nkitaev
    @patil-suraj I think the fix is to change ReformerLM.axial_pos_shape in the config. The product of all numbers in the tuple must be equal to the padded length.
    Storing a position embedding vector for each of 0.5M positions is too wasteful, so the approach is to pretend that the text has a 2D 512x1024 shape and concatenate an "x-embedding" and a "y-embedding"
    @dimeldo You can try modifying the text generation demo. Redefining my_inputs will let you feed in your own data, and you can tune the model hyperparameters as well
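    A minimal numpy sketch of that axial trick (illustrative only, not the trax implementation; the 64/192 split is an assumption): positions live on a 512x1024 grid, and each token gets its row embedding concatenated with its column embedding.
    import numpy as np

    d_model = 256
    rows, cols = 512, 1024        # axial_pos_shape: rows * cols == padded sequence length
    d_row, d_col = 64, 192        # the two embedding widths must sum to d_model

    row_emb = np.random.randn(rows, d_row)   # "x-embedding" table (512 vectors)
    col_emb = np.random.randn(cols, d_col)   # "y-embedding" table (1024 vectors)

    def axial_position_embedding(position):
      """Embedding for a flat position in [0, rows * cols)."""
      r, c = divmod(position, cols)
      return np.concatenate([row_emb[r], col_emb[c]])  # shape (d_model,)

    # Only 512 + 1024 vectors are stored instead of ~0.5M.
    assert axial_position_embedding(123456).shape == (d_model,)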
    Suraj Patil
    @patil-suraj
    @nkitaev Yes, that seems to be the issue. I'm not sure how I should select the shape for ReformerLM.axial_pos_shape. What would ReformerLM.axial_pos_shape be for 281525?
    nkitaev
    @nkitaev
    I think you'll need to pad it out to something that factors nicely
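    As a worked example (a sketch, not a prescription): 281525 padded to the next multiple of 512 is 281600 = 512 x 550.
    import math

    n_tokens, rows = 281525, 512
    padded_len = rows * math.ceil(n_tokens / rows)
    print(padded_len, padded_len // rows)   # 281600 550

    # Then, in the gin config (the d_axial_pos_embs line is an assumption about the
    # related knob; its entries must sum to d_model, which is 256 in the colab):
    # ReformerLM.axial_pos_shape = (512, 550)
    # ReformerLM.d_axial_pos_embs = (64, 192)
    # The inputs themselves also need to be padded out to 281600 tokens.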
    Suraj Patil
    @patil-suraj
    I tried factoring it as shown in the colab, but as my vocab size is 16K, I ran into memory errors
    nkitaev
    @nkitaev
    Oh, vocab size 16K won't work with that length at the moment. We used to have support for this, but it got removed in a refactoring
    The problem is that output logits are size (281K, 16K), which is more than 8GB. They need to be chunked to fit in memory.
    This is on my agenda because I'm dealing with the exact same problem with MT (vocab size 32K)
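    A quick back-of-the-envelope check of that, in float32:
    seq_len, vocab = 281525, 16384
    logits_gb = seq_len * vocab * 4 / 2**30   # 4 bytes per float32
    print(f"{logits_gb:.1f} GB")              # ~17.2 GB for one example's logits, well over 8 GB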
    Suraj Patil
    @patil-suraj
    Aah, thank you @nkitaev. I'll have to try a smaller vocab then.
    Phil Wang
    @lucidrains
    @nkitaev hi, I'm curious: do you have any encouraging results for MT? I read on OpenReview that it is a work in progress.
    Madison May
    @madisonmay
    @nkitaev curious whether there are plans to scale up the text Reformer model and release a pre-trained model à la BERT / RoBERTa, etc.? Is the primary concern the availability of appropriate corpora to train on where long-term context is useful in reducing an MLM loss?
    I'm dealing with a domain where I have a strong indicator that context more than 512 tokens away (long-form scanned documents) is useful and am interested in building off of the work of you and your collaborators.
    Lukasz Kaiser
    @lukaszkaiser
    @patil-suraj : to clarify the TF2 question: some lower-level things are quite hard to do in TF. One of them is really memory-efficient reversible layers; hashing is not trivial either. Trax in general can run with the TF2 backend (and, e.g., the Trax Transformer runs using TF), but some things we needed in Reformer were just very hard to do without JAX. That's why we went with JAX, and it's been working really well for us!
    @madisonmay : We'd really like to train on a larger dataset and then release a model. Currently we're thinking of using the C4 dataset; what would you think? It's not very long-context, though...
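    For readers new to the reversible-layer trick mentioned above, a minimal numpy sketch (not the trax code): each block's inputs can be recomputed from its outputs, so activations don't have to be stored for backprop.
    import numpy as np

    def reversible_block(x1, x2, f, g):
      """Forward pass: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
      y1 = x1 + f(x2)
      y2 = x2 + g(y1)
      return y1, y2

    def invert_block(y1, y2, f, g):
      """Recover the block's inputs from its outputs."""
      x2 = y2 - g(y1)
      x1 = y1 - f(x2)
      return x1, x2

    # Stand-in sub-layers; in the Reformer these are attention and feed-forward.
    f = lambda x: np.tanh(x)
    g = lambda x: 0.5 * x
    x1, x2 = np.random.randn(4), np.random.randn(4)
    assert np.allclose((x1, x2), invert_block(*reversible_block(x1, x2, f, g), f, g))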
    Jérémie C. Wenger
    @jchwenger
    @patil-suraj @nkitaev the first error I encountered did indeed have to do with shapes, and if I remember correctly, adding multiples of 256 to 1024 before multiplying by 512 worked... but then I ran out of memory. The Colab example fits a single TPU's memory really well, and it seems much easier to work with this input length in this context.
    Madison May
    @madisonmay
    @lukaszkaiser thanks for the reply, and glad to hear you do have plans to train on a larger dataset! Barring introducing a new dataset, it seems like C4 or one of its subsets is one of the few decent options out there. The T5 paper also benchmarked on Toronto Book Corpus + Wikipedia -- I imagine because of TBC you'd have a much longer avg. sequence length at the expense of operating over a much narrower domain.
    Madison May
    @madisonmay
    I'm sure there would be serious interest from industry in something trained on the text of long-form PDFs from Common Crawl, and it does feel like there is long-term information to exploit there (section headers + subheaders, page headers, references to defined terms and other clauses in legal docs, etc.), but that comes with the added burden of dealing with poor OCR, poor detected reading order, and all sorts of other warts to be sorted out in preprocessing, which means it's probably more trouble than it's worth.
    nkitaev
    @nkitaev
    @lucidrains I checked in a config for WMT en-de. The learning curves look the same as the regular Transformer. There's still work left in putting together beam search decoding and making sure the evaluation conditions match prior work (+ some hyperparameter tuning). We're still a few weeks out from having official results.
    Suraj Patil
    @patil-suraj
    @nkitaev since you are training WMT en-de, it seems that Reformer seq2seq is available. Or are you doing it using just the decoder, as suggested by @afrozenator?
    @lukaszkaiser Thank you for the clarification. Indeed, Trax with JAX seems very powerful and easier than other DL tools. I'm hoping the community will get bigger and more diverse. Thank you for making it open source.
    @cwbeitel
    Just wanted to chime in to say we are really excited to be making use of Trax in the Project Clarify (https://github.com/projectclarify/clarify) codebase, and would like to invite anyone with experience with Trax or JAX and an interest in mentoring to come give a tutorial session at one of our upcoming hackathon/training days (Jan 25th and Feb 29th): https://forms.gle/oFWkN7UuAxS7NUGJ9 Especially you @afrozenator; I didn't have your email to send you an invite. Looking forward to adding value to Trax as we get up to speed with using it.
    Afroz Mohiuddin
    @afrozenator
    Thanks @cwbeitel - that looks exciting, will follow up over email.
    @patil-suraj - Nikita wrote the encoder/decoder version. That is probably what you should start with, I guess.
    Suraj Patil
    @patil-suraj
    @afrozenator great, I'll take a look at the encoder/decoder version. I have already trained a summarization model using your approach, but I'm not able to decode properly. I'm using the sampling code given in the colab with <text>[doc]<summary> as the prompt, but it's not generating a summary; instead I see the original text. Can you help me with this?
    Afroz Mohiuddin
    @afrozenator
    Did you by any chance train a copy model? Can you try Nikita's colab as well? Just changing the inputs there will probably get you what you want.
    Phil Wang
    @lucidrains
    @nikitakit thank you! That was what I was hoping to hear. Congrats on a great paper.
    Phillip Bock
    @friesel
    @nkitaev When implementing the beam search: we had to hack into t2t in order to get all beams computed and returned so we could work with them. Would it be feasible in this implementation to allow users/devs to optionally get all beams returned?
    Suraj Patil
    @patil-suraj
    @afrozenator I am not using the copy model. For sampling I used the same code given in the Reformer colab, where I changed the prompt to <text>[doc]<summary>, adding the document text in [doc]. Still not able to get results. Does anyone else here have any ideas about this? The goal is to sample a summary conditioned on <text>[doc]<summary>. TYIA!
    Lukasz Kaiser
    @lukaszkaiser
    @patil-suraj : did you make sure to make a loss mask, so the loss is only on the summary? The intro colab has an example of how to do that for the copy task: https://colab.research.google.com/github/google/trax/blob/master/trax/intro.ipynb
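    A minimal sketch of such a mask for the <text>[doc]<summary> setup discussed above (doc_ids and summary_ids are placeholder token arrays): the weights are 0 over the document prompt and 1 over the summary, so only summary tokens contribute to the loss.
    import numpy as np

    def lm_example_with_mask(doc_ids, summary_ids):
      """Concatenate document and summary; weight the loss on the summary only."""
      inputs = np.concatenate([doc_ids, summary_ids]).astype(np.int32)
      weights = np.concatenate([np.zeros(len(doc_ids), dtype=np.float32),
                                np.ones(len(summary_ids), dtype=np.float32)])
      return inputs, inputs, weights   # (inputs, targets, loss weights)

    _, _, w = lm_example_with_mask(np.arange(5), np.arange(3))
    print(w)   # [0. 0. 0. 0. 0. 1. 1. 1.]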
    Suraj Patil
    @patil-suraj
    @lukaszkaiser yes, the loss mask is correct. Solved the problem: the sampling length was very small, so the model was only processing the input prompt and sampling stopped before it could generate the summary.
    Rodrigo Baron
    @rodrigobaron
    heya o/
    I was doing some toy tests with Trax; a ResNet50 trained on a custom Cifar10 dataset reported some weird results:
    Step   1563: train                   accuracy |  0.21875000
    Step   1563: train                       loss |  13.42460442
    -
    Step   3126: train                   accuracy |  0.15625000
    Step   3126: train                       loss |  2.90936065
    -
    Step   4689: train                   accuracy |  0.28125000
    Step   4689: train                       loss |  1.86861885
    -
    Step   6252: train                   accuracy |  0.09375000
    Step   6252: train                       loss |  20935.30468750
    -
    Step   7815: train                   accuracy |  0.46875000
    Step   7815: train                       loss |  1.39475393
    my first thought is the data (always), but I'm not sure if data alone can make the model report these ~random~ results...
    now testing using gin_config
    Rodrigo Baron
    @rodrigobaron
    yes, got the same behavior using gin_config:
    I0123 00:01:49.315093 140688036902784 trainer_lib.py:752] Step      1: train                   accuracy |  0.07968750
    Step      1: train                   accuracy |  0.07968750
    I0123 00:01:49.315915 140688036902784 trainer_lib.py:752] Step      1: train                       loss |  4623.56787109
    -
    I0123 00:03:25.528132 140688036902784 trainer_lib.py:752] Step   2000: train                   accuracy |  0.16718750
    Step   2000: train                   accuracy |  0.16718750
    I0123 00:03:25.528812 140688036902784 trainer_lib.py:752] Step   2000: train                       loss |  5.00325918
    -
    I0123 00:04:39.171037 140688036902784 trainer_lib.py:752] Step   4000: train                   accuracy |  0.19687501
    Step   4000: train                   accuracy |  0.19687501
    I0123 00:04:39.171646 140688036902784 trainer_lib.py:752] Step   4000: train                       loss |  3.77166605
    -
    I0123 00:05:53.784074 140688036902784 trainer_lib.py:752] Step   6000: train                   accuracy |  0.13750000
    Step   6000: train                   accuracy |  0.13750000
    I0123 00:05:53.784580 140688036902784 trainer_lib.py:752] Step   6000: train                       loss |  4.01593637
    -
    I0123 00:07:08.749417 140688036902784 trainer_lib.py:752] Step   8000: train                   accuracy |  0.33906251
    Step   8000: train                   accuracy |  0.33906251
    I0123 00:07:08.749950 140688036902784 trainer_lib.py:752] Step   8000: train                       loss |  1.65593171
    -
    ...
    -
    I0123 00:10:54.362399 140688036902784 trainer_lib.py:752] Step  14000: train                   accuracy |  0.13125001
    Step  14000: train                   accuracy |  0.13125001
    I0123 00:10:54.362976 140688036902784 trainer_lib.py:752] Step  14000: train                       loss |  78.67562866