
    Hi all, I am trying to run this code for text summarization:

    SEP = 0  # Padding or separator token
    EOS = 1  # End of sentence token

    # Concatenate tokenized inputs and targets using 0 as separator.
    def preprocess(stream):
        for (article, summary) in stream:
            joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
            mask = [0] * (len(list(article)) + 2) + [1] * (len(list(summary)) + 1)  # Accounting for EOS and SEP
            yield joint, joint, np.array(mask)

    You can combine a few data preprocessing steps into a pipeline like this.

    input_pipeline = trax.data.Serial(
        # Tokenizes
        # Uses function defined above
        # Filters out examples longer than 2048
    )

    Apply preprocessing to data streams.

    train_stream = input_pipeline(train_stream_fn())

    eval_stream = input_pipeline(eval_stream_fn())

    train_input, train_target, train_mask = next(train_stream)
    assert sum((train_input - train_target)**2) == 0

    but I got this error:
    ValueError: too many values to unpack (expected 2)
    Can anybody help, please?
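    [Editor's note] For reference, here is a self-contained run of that generator (with the `*` multiplications restored, which markdown often strips) on a toy stream. It shows that each item yielded is a 3-tuple of (input, target, mask):

    ```python
    import numpy as np

    SEP = 0  # Padding or separator token
    EOS = 1  # End of sentence token

    def preprocess(stream):
        for (article, summary) in stream:
            joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
            mask = [0] * (len(list(article)) + 2) + [1] * (len(list(summary)) + 1)
            yield joint, joint, np.array(mask)

    # Toy stream with one (article, summary) pair of token ids:
    stream = iter([([5, 6, 7], [8, 9])])
    inp, tgt, mask = next(preprocess(stream))
    print(inp)   # [5 6 7 1 0 8 9 1]
    print(mask)  # [0 0 0 0 0 1 1 1] -- only the summary (plus its EOS) is weighted
    ```

    Note that because each item is a 3-tuple, any downstream stage that unpacks the stream as `for (x, y) in stream` would raise exactly `ValueError: too many values to unpack (expected 2)` - worth checking which pipeline stage consumes this generator's output.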
    Mike Azatov

    Hey guys, I'm enjoying learning about Trax and I got stuck on this issue. As I understand it, there is a key difference between Trax and Keras/TensorFlow/PyTorch in the way they handle RNN layers like GRU and LSTM. In Keras, n_units is the number of individual cells we have in the sequence layer, and the dimension of the weight matrix is automatically obtained from the dimension of the input/embedding layer. The number of individual cells (n_units in Keras) needs to be equal to the max_len of all your input sequences.

    Now in Trax, as I understand it, the number of cells does not need to be fixed, and n_units is actually the dimension of the weight matrix. It looks like the number of cells can actually differ from batch to batch. It needs to be constant within a batch so as not to mess up the gradients, which makes sense.

    This difference is quite confusing and I can't find any detailed documentation on it. I have so many questions about this. Why do we need to specify n_units in Trax if it has to be equal to the embedding dimension? What happens with the number of cells? How can it be different from batch to batch? What happens in a many-to-many type problem, where we'll have a different number of outputs for each batch? What happens when we convert the model to Keras, which expects a constant number of cells? Etc. To me, it feels like this is a huge fundamental difference between Trax and more established environments, yet I can't find any documentation on it.

    If somebody could shed some light on this, I'd really appreciate it. I'd love to read some thoughts from the authors on why they designed it this way, as well as the pros/cons of each approach. It's still blowing my mind that you can get away with doing it this way! Thanks!
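    [Editor's note] The point in question can be illustrated with a sketch in plain NumPy (not Trax's actual implementation): a recurrent cell's weight shapes depend only on the feature dimension (what Trax calls n_units), so the very same weights can be scanned over sequences of any length.

    ```python
    import numpy as np

    def rnn_scan(xs, W_x, W_h, b):
        """Run a simple tanh RNN cell over a sequence of any length."""
        h = np.zeros(W_h.shape[0])
        for x_t in xs:                      # one step per timestep
            h = np.tanh(x_t @ W_x + h @ W_h + b)
        return h

    d_feature, n_units = 8, 8               # feature dim fixes all weight shapes
    rng = np.random.default_rng(0)
    W_x = rng.normal(size=(d_feature, n_units))
    W_h = rng.normal(size=(n_units, n_units))
    b = np.zeros(n_units)

    # The same weights handle a length-5 and a length-50 sequence:
    h_short = rnn_scan(rng.normal(size=(5, d_feature)), W_x, W_h, b)
    h_long = rnn_scan(rng.normal(size=(50, d_feature)), W_x, W_h, b)
    print(h_short.shape, h_long.shape)  # (8,) (8,)
    ```

    This is why the sequence length can vary from batch to batch: it only changes how many times the cell is applied, never the shapes of the learned parameters.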

    riccardo ughi
    After losing a few days to testing, I looked at this chat and decided to ask, hoping for help.
    The thing is, I have tried training with 4 GPUs (NVIDIA RTX A4000) and I have seen no improvement whether the system uses 2, 3 or 4 of them: the time taken for each epoch is practically the same. But doesn't JAX have a mechanism (pmap) to distribute the job across cards in parallel, which should cut down the time each epoch takes? Since I can't get it to work properly, is there any trick that is essential to make it work? (I hope I managed to be clear.) Thanks for the help.
    riccardo ughi
    I'll also note that I can see with the OS (via nvidia-smi) that all 2, 3 or 4 GPUs (depending on the test in progress) are working, in the sense that their memory is occupied and their % utilization grows. However, the total time did not improve in any of the three cases.
    Lukasz Kaiser
    @sunvod : I think the default setting in Trax is to specify batch size per core - so if you have more GPUs, you'll be training with a larger batch size, but the time per step should be the same or even slightly worse. Did you try lowering batch size with more GPUs?
    riccardo ughi
    Yes, I tried, but the time remains independent of the number of GPUs involved. Is there anything I can do ?
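    [Editor's note] A minimal jax.pmap sketch (standard JAX API) illustrating the per-device batch convention Lukasz describes: the leading axis of the input must equal the device count, so adding GPUs grows the global batch rather than shrinking step time, unless the per-device batch is reduced accordingly.

    ```python
    import jax
    import jax.numpy as jnp

    n_dev = jax.local_device_count()        # 1 on a CPU-only machine
    per_device_batch = 8

    # Leading axis indexes devices; each device sees a (per_device_batch, 16) slice.
    x = jnp.ones((n_dev, per_device_batch, 16))

    @jax.pmap
    def step(batch):
        return (batch * 2.0).sum(axis=-1)   # runs on every device in parallel

    y = step(x)
    print(y.shape)  # (n_dev, per_device_batch)
    ```

    With a fixed per-device batch, each device does the same amount of work per step regardless of how many devices participate, which matches the observation that step time stays constant as GPUs are added.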
    Ahmed Baruwa

    Hi everyone!

    I'm new to Trax. Has anyone ever fine-tuned the Reformer model on a point-cloud dataset? Any suggestions on how I can go about this?

    Seonpyo Kim

    Hello everyone!
    I'm new not only to Trax but also to ML, so I decided to use a complete package for my work.
    In my case, two models are instantiated, one in 'eval' and one in 'train' mode.

    The 'eval' mode model generates samples, which serve as training data. I need fast_model = tl.Accelerate(eval_model) to sample from the model in each training step.
    Also, fast_model.replicate_weights(model_train) is called in my training-data generator to synchronize the two separate models.
    However, I got stuck on the error Invalid argument: CopyToDevice called on deleted or donated buffer.

    What is the best way to sample during the training process?

    Amr Shahin
    Hello everyone!

    I'm training a Reformer model on a custom dataset; the code is pretty much identical to https://github.com/google/trax/blob/master/trax/examples/NMT_with_Transformers_Reformers_using_Trax.ipynb
    When I run the training, I keep getting "nan" for CrossEntropyLoss after 200 steps.
    here's a sample of the log:

    Step 2300: Ran 100 train steps in 21.84 secs
    Step 2300: train CrossEntropyLoss | nan
    Step 2300: eval CrossEntropyLossWithLogSoftmax | nan
    Step 2300: eval WeightedCategoryAccuracy | 0.00000000

    Step 2400: Ran 100 train steps in 21.36 secs
    Step 2400: train CrossEntropyLoss | nan
    Step 2400: eval CrossEntropyLossWithLogSoftmax | nan
    Step 2400: eval WeightedCategoryAccuracy | 0.00000000

    here's my model:

    model = trax.models.Reformer(
        d_model=512, d_ff=2048, dropout=0.1,
        n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
        max_len=2048, mode='train')
    train_task = training.TrainTask(
        # use the train batch stream as labeled data
        # use the cross entropy loss with LogSoftmax
        # use the Adafactor optimizer with learning rate of 0.001
        optimizer=trax.optimizers.Adafactor(learning_rate=0.001, epsilon1=1e-30),
        # have 500 warmup steps
        lr_schedule=trax.lr.multifactor(constant=1.0, warmup_steps=500),
        # have a checkpoint every 100 steps
        # saving a checkpoint every 1000 steps on the output_dir
    )
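    [Editor's note] Not a diagnosis of this particular run, but one common source of NaN losses, shown in plain NumPy: when logits grow very large (e.g. from too high a learning rate), a naively computed log-softmax overflows, while the shifted log-sum-exp form stays finite.

    ```python
    import numpy as np

    logits = np.array([1000.0, 0.0])        # overflow-prone logits

    with np.errstate(over='ignore', invalid='ignore'):
        # exp(1000) overflows to inf, so inf/inf gives nan:
        naive = np.log(np.exp(logits) / np.exp(logits).sum())

    m = logits.max()
    stable = logits - (m + np.log(np.exp(logits - m).sum()))  # log-sum-exp trick

    print(naive)   # [ nan -inf]
    print(stable)  # [    0. -1000.]
    ```

    Common remedies for NaN at this point in training are lowering the learning rate or extending the warmup, and checking the data for out-of-range token ids; which one applies here would need to be verified against the actual run.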
    Hello everyone, is Trax currently able to support multi-node, multi-GPU training? If not, is there a workaround for this?
    Yan Virin
    Hello! I am wondering what the difference is between JAX and TensorFlow-NumPy - are these basically two competing projects doing the same thing, or are there some differences?
    Lukasz Kaiser
    You can run multi-node multi-gpu training in Trax with the newest jaxlib and JAX. For that, you need to set these flags: https://github.com/google/trax/blob/master/trax/trainer_flags.py#L54
    So for example python trainer.py --gpu_cluster_chief_ip= --gpu_cluster_n_hosts=2 --gpu_cluster_host_id=0 on one host, and the same on the other with gpu_cluster_host_id=1.
    (And the chief IP must be the one with id 0 and of course they must be able to find each other on the net.)
    Yan Virin
    I have a question about the Dropout layer in Trax. It has a parameter "train", which indicates whether the layer is part of a network that is being trained or not. In the code, the layer behaves as an identity function if its internal _mode parameter is not set to "train". My question is: will Trax manage the internal parameter correctly when training and when evaluating the model that I created? Should I use that parameter at all, or just leave it with the default True value? When I load the model and want to use it for predictions, do I need to construct the model with that parameter in mind? Thanks
    And in general I see that the parameter "train" is used in any layer that uses Dropout internally. So it is the more general question of how we need to treat the "train" parameter when we create the model, and whether Trax will internally manage that parameter of the layers when doing evaluation.
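    [Editor's note] The semantics being asked about can be sketched in plain NumPy (this is not Trax's code): inverted dropout is active only in 'train' mode and is an exact identity otherwise, which is why the mode must be set correctly when switching between training and prediction.

    ```python
    import numpy as np

    def dropout(x, rate, mode, rng):
        """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
        if mode != 'train':
            return x                                  # eval/predict: identity
        keep = rng.random(x.shape) >= rate
        return x * keep / (1.0 - rate)                # keeps E[output] == x

    rng = np.random.default_rng(0)
    x = np.ones((2, 4))
    print(dropout(x, 0.5, 'eval', rng))   # unchanged
    print(dropout(x, 0.5, 'train', rng))  # some zeros, survivors scaled to 2.0
    ```

    The rescaling by 1/(1-rate) at train time is what lets the eval-time branch be a pure identity instead of needing its own scaling.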
    Andreas Antoniou
    Hello people. I have a small issue and was wondering if you could shed some light on it.

    I ran this tutorial to develop an NMT model: https://colab.research.google.com/github/OmarAlsaqa/trax/blob/master/trax/examples/NMT_with_Transformers_Reformers_using_Trax.ipynb.

    I would like to export the model as a tf object so as to deploy it to tensorflow serving. I followed this tutorial:

    However, I got this error: StagingError: Exception encountered when calling layer "as_keras_3" (type AsKeras).

    Clearly something I am missing here
    Omar Alsaqa
    @andreas_antoniou:matrix.org I have not tried to export the model as TF, since I faced multiple issues before. I posted an issue on the trax-ml GitHub and got an answer, but I haven't had time to try it. google/trax#1504
    Please check if it helps you, and let me know either way.
    Andreas Antoniou
    Hello @OmarAlsaqa , thank you for the insight and also thank you for this wonderful tutorial :)
    Will check it out and revert back
    Andreas Antoniou
    @OmarAlsaqa: I have tried to output the model as a TF model to serve it, although I found many obstacles. Even after I made it work with some duct-taping, the model's performance is pretty bad.
    I also changed the mode to 'predict' in order to adapt the infrastructure, but further issues came up. Someone has to shed some light on this.
    It takes at least 2 minutes for one simple prediction.
    Omar Alsaqa
    @andreas_antoniou:matrix.org Hopefully someone would help us both. @lukaszkaiser
    Rafid K. Al-Humaimidi

    What is the benefit of providing a signature when calling the init() method of a model? Depending on the definition of the model, the signature is usually implied by it. For example, when defining an Embedding layer, surely the input is of shape (batch_size,), with values ranging from 0 to the vocabulary size. In fact, in the code of Embedding, they simply delete the input signature:


    So, what is the point behind it?

    Similarly, if I define a TransformerEncoder and provide all the necessary information like vocab_size, n_classes, etc., why do I need to further provide input signature when I call init()?
    Rafid K. Al-Humaimidi
    I am new to Trax so forgive the basic question. I noticed that when I use the Loop class to train my model, it takes one batch of training data per step. For example, if I have 1000 batches in my training data, I would have to call run(1000) to finish a training epoch. Usually, one provides an iterator and doesn't know how many batches there are in the training data, so how do I go about telling Loop to go through the training data X epochs?
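    [Editor's note] One workaround, as a plain-Python sketch rather than a Trax API: materialize one epoch to learn its length, then cycle it forever (Loop expects an endless stream) and request steps_per_epoch * n_epochs steps. The `epoch_iterator` name below is hypothetical.

    ```python
    import itertools

    def as_infinite_stream(batches):
        """Cycle a finite list of batches forever."""
        return itertools.cycle(batches)

    # Hypothetical: `epoch_iterator` yields one pass over the training data.
    epoch_iterator = iter(['batch%d' % i for i in range(5)])
    batches = list(epoch_iterator)          # one full pass; now we know its length
    steps_per_epoch = len(batches)
    stream = as_infinite_stream(batches)

    n_epochs = 3
    total_steps = steps_per_epoch * n_epochs
    # loop = training.Loop(model, train_task, ...)  # train_task fed by `stream`
    # loop.run(total_steps)                         # == n_epochs passes
    print(total_steps)  # 15
    ```

    Materializing a pass only works when an epoch fits in memory; for larger data you would count batches in a first streaming pass instead of storing them.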
    Rafid K. Al-Humaimidi
    Does Trax simply reserve all GPU memory available? I defined an embedding layer of size 1000x64 and by the time I call init, almost all GPU memory is completely used. Why is that?
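    [Editor's note] This behavior comes from JAX (which Trax runs on), not from Trax itself: by default JAX preallocates roughly 90% of GPU memory at startup, regardless of model size. The documented JAX environment flags below change that:

    ```shell
    # Disable upfront preallocation (allocate on demand instead):
    export XLA_PYTHON_CLIENT_PREALLOCATE=false

    # Or cap the preallocated fraction of GPU memory:
    export XLA_PYTHON_CLIENT_MEM_FRACTION=.50
    ```

    On-demand allocation can fragment GPU memory over a long run, which is why preallocation is the default.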
    How could I create my own subword representations, like the ende_32k.subword used in the Trax Quick Intro?
    # Create a Transformer model.
    # Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
    model = trax.models.Transformer(
        input_vocab_size=33300,
        d_model=512, d_ff=2048,
        n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
        max_len=2048, mode='predict')

    # Initialize using pre-trained weights.
    model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                         weights_only=True)

    # Tokenize a sentence.
    sentence = 'It is nice to learn new things today!'
    tokenized = list(trax.data.tokenize(iter([sentence]),  # Operates on streams.
                                        vocab_dir='gs://trax-ml/vocabs/',
                                        vocab_file='ende_32k.subword'))[0]

    # Decode from the Transformer.
    tokenized = tokenized[None, :]  # Add batch dimension.
    tokenized_translation = trax.supervised.decoding.autoregressive_sample(
        model, tokenized, temperature=0.0)  # Higher temperature: more diverse results.

    # De-tokenize,
    tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
    translation = trax.data.detokenize(tokenized_translation,
                                       vocab_dir='gs://trax-ml/vocabs/',
                                       vocab_file='ende_32k.subword')
    Fernando Costa

    [Attention Visualization]

    Hi guys,

    How can I visualize attention weights/scores on my input sequences? I want to see where each attention head is paying attention in my input sequence for each input processed. I'm using a customized Transformer decoder in my model, containing only the Causal Attention, and this layer has a name that I can use if needed.

    In Keras we can do this: https://stackoverflow.com/questions/53867351/how-to-visualize-attention-weights. Is there some related way to do this in Trax?

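    [Editor's note] A possible starting point while waiting for a Trax-specific answer: the matrix usually plotted as an attention heatmap is softmax(QK^T / sqrt(d)), computed here in plain NumPy. Given a layer's Q and K activations, this matrix can be visualized the same way as in the Keras link above.

    ```python
    import numpy as np

    def attention_weights(Q, K):
        """Rows: query positions; columns: where each query attends."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        return w / w.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 16))   # 5 query positions, head dim 16
    K = rng.normal(size=(5, 16))   # 5 key positions
    W = attention_weights(Q, K)    # shape (5, 5); each row sums to 1
    print(W.shape)
    ```

    For a causal-attention head, positions above the diagonal would additionally be masked to zero before the softmax.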
    Jake Searcy

    Hi All,

    Has anyone else had trouble with initializing a model from a file? When I start with a file the model isn't callable without a JAX error, and I see the weights have a different type than they would from running init.

    >> <class 'jaxlib.xla_extension.DeviceArray'>
    >> <class 'numpy.ndarray'>
    Can I use this library for languages other than English, for example the Semitic languages?
    Ken Otwell
    Does trax run the Branch or Parallel layer splits actually in parallel in the GPU? My timing tests suggest not.
    Hi all! New to Trax; I was trying to figure out where the pre-trained models live. Is there a way to get an overview of all the pre-trained models available? The tutorial has ende_wmt32k.pkl.gz, but what about other languages? Or, e.g., a ResNet50? Would it be possible to use pre-trained models from TF Hub?
    Ken Otwell
    Anyone know how to get trax to actually use the GPU? If I call jax directly, it works fine - but when I train a trax model, there's a bit of activity at startup then nothing. Here's a tensorflow capture:
    Ryan Greenblatt
    Does anyone know how it would be possible to implement stop-gradient such that it only stops the gradient w.r.t. certain inputs?
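    [Editor's note] One way, using standard JAX (which Trax's default backend exposes): wrap only the chosen arguments in jax.lax.stop_gradient inside the function, so gradients flow through the other inputs untouched.

    ```python
    import jax
    import jax.numpy as jnp

    def f(x, y):
        # Gradient w.r.t. y is blocked in the first term, flows in the second.
        return (x * jax.lax.stop_gradient(y)).sum() + (y ** 2).sum()

    x = jnp.ones(3)
    y = 2.0 * jnp.ones(3)
    gx, gy = jax.grad(f, argnums=(0, 1))(x, y)
    print(gx)  # [2. 2. 2.]  -- d/dx = y, unaffected by the stop
    print(gy)  # [4. 4. 4.]  -- only the y**2 term contributes: 2*y
    ```

    Because stop_gradient is applied per use site rather than per variable, the same input can have its gradient blocked in one term and kept in another.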
    Francisco Javier Estrella Rodriguez
    Hello, is there any Trax community on discord?
    Jess Edmund Fan
    Does anyone have a minimal example of converting terraformer to keras? I have this, but it won't get past creating the hidden layer https://colab.research.google.com/drive/1Yiss6NKEimwcU9QAD4Vk7vsw59ZyTKz-?usp=sharing
    Peter Dippold
    Hello, I recently installed jax (0.3.7), jaxlib (0.3.7+cuda11.cudnn82) and trax (1.4.1) in a Linux Docker container (CUDA compilation tools release 11.5, V11.5.119, build cuda_11.5.r11.5/compiler.30672275_0, cuDNN 8.3.1), and I get no complaints (like "GPU/TPU not found") after importing trax packages into a Jupyter notebook, so everything looks fine. But when I try to train a model, I can see that about 90% of my GPU's memory is allocated, yet the calculations are done by the CPU only (100% CPU load, 0% GPU load). Is this due to the different cuDNN versions (8.2 vs. 8.3.1), or could there be another reason?
    Himanshu Chaturvedi
    Hi, I couldn't understand how to use the Parallel layer in the data module. Can somebody elaborate on that?
    Alaa Shaker
    I have a problem: I can't import trax on Colab. Just yesterday everything worked; I searched StackOverflow and GitHub for a solution, but there is none.
    Please, can anyone help me?
    Alaa Shaker
    Even on the official Trax Colab page I can't import it.