@lukaszkaiser, Colab is running 1.3.4, but the GitHub tags only go up to 1.3.3?

2 replies

I was asking because I can reference tl.LSHSelfAttention(), but not hash_vecs(), which appears to be in the same file on master. I see, however, using this (new to me) feature of Colab that you can call up the source. Nice. Apparently this was refactored and incorporated into hash_values(). Still, it would be nice to be able to search on GitHub (or clone to local). [Also, ReverseHalfResidual seems to be in 'reverse.py' on GitHub, but in reformer.py in 1.3.4?]

@lukaszkaiser I am currently running some experiments with the Reformer and LSH attention. I tried running a single training step to see if things work, and with 1 hash that training step took about 30 seconds. However, after increasing the number of hashes to 2, the model has been running for about 15 minutes now and has still not completed. Is this normal? Or is it more likely I have made some error in my set-up that led to this explosion in time complexity? My parameters are LSH chunk len 64, max_len=2048, d_model=1024, d_ff=4096, 3 layers, ReLU activation, decoder-only model. I am training on a Google Colab TPU.

There does seem to be a similar phenomenon when increasing the hashes to 2 on this colab: https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb


5 replies

Hi, a Trax newbie here. In `trax/trax/models/transformer.py`, in the `Transformer` function, I see `tl.Branch([], tl.PaddingMask())` at https://github.com/google/trax/blob/master/trax/models/transformer.py#L374. Is it the same as `tl.Branch(None, tl.PaddingMask())`? Thank you!
Hello, I am trying to understand the difference between Trax and TensorFlow. It would be great to have a small bit in the README about the differences. "Trax is an end-to-end library for deep learning that focuses on clear code and speed." Is this implying TensorFlow is unclear and slow? I don't understand the differences; I'm sure lots of people here could explain why they're using Trax :)

1 reply

@weiddeng: `Branch([], x)` and `Branch(None, x)` are the same, and the same goes for `Parallel`. It's all handled in this line in the code: https://github.com/google/trax/blob/master/trax/layers/combinators.py#L235

(Both `[]` and `None` get transformed into an identity layer, taking 1 input and returning the same, which can be constructed as `Serial()` or `Serial(None)`.)
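That normalization is easy to see in a stand-alone sketch (a hypothetical plain-Python mimic, not the actual Trax source):

```python
# Minimal sketch (hypothetical, not the real Trax code) of how a
# combinator can normalize None / [] sublayers into an identity layer.

def identity(x):
    # Serial() / Serial(None) in Trax behave like this: 1 input in, same out.
    return x

def _normalize(layer):
    # Both None and [] stand for "do nothing" and become the identity.
    if layer is None or layer == []:
        return identity
    return layer

def branch(*layers):
    """Apply every (normalized) sublayer to the same input, like tl.Branch."""
    layers = [_normalize(l) for l in layers]
    def run(x):
        return tuple(l(x) for l in layers)
    return run

# branch([], f) and branch(None, f) produce identical results.
double = lambda x: 2 * x
b1 = branch([], double)
b2 = branch(None, double)
```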
I'm trying to use the BERT model in the Trax library. The BERT model checkpoint with its weights is in a folder ("bert_config.json", "*.ckpt.*", "vocab.txt"). To get the weights, Trax only has the "init_from_file" method, but it needs a gzipped pickle file. Is there a way to init the weights using the TensorFlow "*.ckpt.*" file, or to convert the model to a pickle file to use it in Trax?

Thank you very much, Lukasz. I have a question on datasets. I see C4 and SQuAD data here: https://github.com/google/trax/tree/ebb9aa01b70c02498b29f3f0a31d361f31caa395/trax/data/testdata I also see the pre-trained transformer uses `vocab_dir='gs://trax-ml/vocabs/'`. Do you know where I can access the commonly used NLP datasets from Trax?
@kujaomega: yes, in `bert.py` there is functionality to load from a TF checkpoint; see from this line on: https://github.com/google/trax/blob/master/trax/models/research/bert.py#L146
So you can load from a TF checkpoint and then save the model weights and state in Trax format (as model.pkl.gz).
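The Trax-format file is just a gzipped pickle, so the save step can be sketched like this (the dict layout below is an illustrative assumption, not the exact structure Trax writes):

```python
import gzip
import os
import pickle
import tempfile

# Illustrative weights/state; in practice these would come from the
# checkpoint-initialized Trax model (its weights and state attributes).
weights = {"layer_0/kernel": [[0.1, 0.2], [0.3, 0.4]]}
state = {"step": 0}

# Write a gzipped pickle, the container format that init_from_file reads.
path = os.path.join(tempfile.mkdtemp(), "model.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump({"weights": weights, "state": state}, f)

# Round-trip to confirm the file is readable.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)
```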

@weiddeng: Trax does not provide datasets on its own, but it has hooks into TFDS, so you can use those with `trax.data.TFDS`. An example of how to create a whole input pipeline for IMDB is here: https://github.com/google/trax/blob/master/trax/data/inputs.py#L23
Or maybe nicer to read here: https://trax-ml.readthedocs.io/en/latest/trax.data.html

We explain how to use it in the intro: https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html#Data

Note two things: (1) it'll work with any TFDS dataset (and TFDS will download it for you), and there are quite a few of them! (2) Or you can make your own from your data in a file, as explained in the second link above.
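The pipeline style used in those links is a composition of generator transforms; here is a self-contained sketch of the same idea (the helpers below mimic the flavor of `trax.data` but are reimplemented here, not imported):

```python
def serial(*transforms):
    """Compose generator transforms left to right, like trax.data.Serial."""
    def pipeline(gen):
        for t in transforms:
            gen = t(gen)
        return gen
    return pipeline

def tokenize(gen):
    # Toy tokenizer: split text on spaces, map words to integer ids.
    for text, label in gen:
        yield [hash(w) % 1000 for w in text.split()], label

def filter_by_length(max_len):
    def run(gen):
        for tokens, label in gen:
            if len(tokens) <= max_len:
                yield tokens, label
    return run

def batch(size):
    def run(gen):
        buf = []
        for example in gen:
            buf.append(example)
            if len(buf) == size:
                yield buf
                buf = []
    return run

pipeline = serial(tokenize, filter_by_length(4), batch(2))
examples = [("a good movie", 1), ("bad", 0),
            ("way too long a review here", 1), ("fine film", 1)]
batches = list(pipeline(iter(examples)))  # the 6-word review is filtered out
```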

Thanks @lukaszkaiser for the response. As I see it, the TF checkpoint is loaded when new_weights is called, but when I initialize the model with `model = BERT(init_checkpoint=BERT_MODEL_PATH + 'bert_model.ckpt')` and `weights = super().new_weights(input_signature)` is called, I'm getting `AttributeError: 'super' object has no attribute 'new_weights'`, as the tl.Serial object isn't calling new_weights.
@gladwig2: LSH hashing doesn't work with buckets if Q != K - check out the Reformer paper for a more detailed description of why (LSH generally works, but if you want to make it fast on TPUs/GPUs then bucketing is really good, and then you want Q=K). So the question was: does Q=K hurt? The answer seems to be: not at all, same performance as the general Transformer.
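For what it's worth, shared-QK attention itself is easy to sketch in NumPy; this is an illustrative reading of the Reformer setup, not Trax code (the paper also prevents tokens from attending to themselves in shared-QK mode, which this sketch omits):

```python
import numpy as np

def shared_qk_attention(x, w_q, w_v):
    """Sketch of shared-QK attention: keys are the unit-normalized queries,
    so a single projection w_q is learned for both Q and K."""
    q = x @ w_q
    k = q / np.linalg.norm(q, axis=-1, keepdims=True)  # K derived from Q
    v = x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = shared_qk_attention(x, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```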

In the Reformer paper you mention that instead of computing softmax(QK^T) all at once, one can also compute individual queries, i.e. softmax(q_i K^T). Does the Self-Attention layer in Trax use this by default? I haven't quite been able to figure it out from the code.
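For illustration, here is a NumPy sketch of that per-query formulation (not the Trax implementation); both functions return the same attention output, but the per-query version never materializes the full n×n score matrix:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_full(Q, K, V):
    # Materializes the full (n x n) score matrix softmax(Q K^T / sqrt(d)).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def attention_per_query(Q, K, V):
    # One query at a time: softmax(q_i K^T) V. Only an n-sized score
    # vector is alive at any moment, which is the memory saving.
    out = np.empty((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        out[i] = softmax(q @ K.T / np.sqrt(Q.shape[-1])) @ V
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
```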

And the Q=K is also not used unless I set shared_qk = True in gin, correct?


I have some questions on the theoretical concepts of machine translation/transformers and their implementation in Trax. I hope these questions are not completely stupid and somebody can help me…

When I have a transformer (used for machine translation), it is a sequence-to-sequence model, i.e. it consists of an encoder part and a decoder part. When an input (x, y) is fed into the network, where x corresponds to the encoder part and y corresponds to the decoder part, a vector of probabilities is computed. The highest probability corresponds to the next word, and so (if we search for the correct translation in the simplest way) we can expand y to y’ and work with (x, y’) in the next step. (I hope my understanding is correct so far - if not, please correct me!)

Now I want the following:

Problem 1: Let’s say I already have a pair (x, y). How can I modify my transformer in such a way that a SINGLE probability is computed, namely the probability P(y | x) that y is a "correct" translation of x (in the sense of the transformer as a relative language model, i.e. the probability that y occurs in the target language provided x is the corresponding input in the source language)? Intuitively, this should be done by taking just the component of the output vector at position 1 (provided this is the position for the <EOS> tag).

How can this be achieved in Trax? Intuitively, I should have to expand the transformer model by just one layer, but I do not know how to do that (which layer to use and how to expand).

Problem 2: Is it somehow possible to obtain just the decoder part of a trained transformer as a separate model?

Problem 3: Now I want to compute "absolute" probabilities P(y). So this should be a combination of problems 1 and 2, i.e. I want to have just the decoder part, but extended such that it yields single probabilities. If my intuition is right, I should be able to solve this in Trax once I know how to solve problems 1 and 2 (which I also need to solve in their own right).

2 replies

@SebastianThomas1_gitlab: The Transformer decoder is an auto-regressive model, so there's a catch to P(y | x): there are only P(y_{t+1} | y_t, y_{t-1}, ..., y_0, x). You can add them up and call it P(y | x) (as we do in the loss), but remember that it's always auto-regressive; y_{t+1} depends on all the previous ys. Nothing specific to Trax or any software here - it's how autoregressive models work.
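The "add them up" step (in log space, as a loss would do) can be sketched in NumPy; the logits below are random stand-ins for real decoder outputs, so only the mechanics, not the values, are meaningful:

```python
import numpy as np

def log_prob_of_sequence(logits, targets):
    """Sum of per-step log-probabilities: log P(y | x) = sum_t log P(y_t | y_<t, x).

    logits: (T, vocab) decoder outputs, one row per target position.
    targets: (T,) the observed token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    m = logits.max(axis=-1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of each observed target token and sum.
    return log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))  # stand-in decoder outputs (T=4, vocab=10)
score = log_prob_of_sequence(logits, np.array([3, 1, 4, 1]))
```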

You can also get the decoder if you pick the appropriate `model.sublayers` (as someone already said).
@lukaszkaiser Thank you very, very much for your hints!!

I think you have to take the product of the probabilities instead of the sum, right? With sums, one can easily get something greater than 1, but with products one should have

P(y_{t + 1} | x, y_0, …, y_t) \cdot P(y_t | x, y_0, …, y_{t - 1}) \cdot … \cdot P(y_0 | x)

= P(y_{t + 1} | x, y_0, …, y_t) \cdot P(y_0, …, y_t | x)

= P(y_0, …, y_{t + 1} | x),

where the first equality holds by induction. (I have not yet thought about whether the usual probability laws hold for this notation, but I would be surprised if this was not the case.)

Of course, the product of probabilities then translates to the sum of log-probabilities, which might be what you had in mind.
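A quick numeric check of that equivalence, with made-up per-step probabilities:

```python
import math

# Toy per-step conditional probabilities P(y_i | x, y_<i).
step_probs = [0.9, 0.5, 0.8]

product = math.prod(step_probs)                 # P(y | x) as a product
log_sum = sum(math.log(p) for p in step_probs)  # the same thing in log space
# exp(log_sum) recovers the product exactly (up to float rounding).
```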

However, I still have another problem. You told me to apply the transformer to the pair (x, y), since this would contain all probabilities. I tried my best by imitating the source code of `trax.supervised.decoding.autoregressive_sample` resp. `trax.supervised.decoding.autoregressive_sample_stream`, but got some confusing results.

Let’s take the example from the Trax intro (https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html). There `tokenized_translation` has the value `array([ 168, 24, 9358, 2, 352, 367, 2427, 18, 3580, 207])`. I used the following code to imitate `autoregressive_sample`:

```
start_symbol = np.full((1, 1), 0, dtype=np.int32)
model.state = initial_state
current_symbols = start_symbol
while current_symbols != np.array([[1]]):
    logits = model((tokenized, current_symbols))[0]
    sample = trax.layers.logsoftmax_sample(logits[:, -1, :], temperature=0.0)
    print(sample)
    current_symbols = sample[:, None]
```

Then, as expected, the following is printed to the screen:

```
[168]
[24]
[9358]
[2]
[352]
[367]
[2427]
[18]
[3580]
[207]
[1]
```

However, when I apply the model to the whole `tokenized_translation`, I obtain a different result: the code

```
model.state = initial_state
current_symbols = np.concatenate([start_symbol, tokenized_translation[None, :]], axis=1)
logits = model((tokenized, current_symbols))[0]
samples = trax.layers.logsoftmax_sample(logits[:, :, :], temperature=0.0)
print(samples)
```

yields

`[[207 24 207 207 33 207 207 207 207 207 1]]`

(instead of the expected `[[ 168 24 9358 2 352 367 2427 18 3580 207 1]]`).

What am I missing here?

I'm highly interested in contributing to the Trax package, as it really fits my area of interest well.

I have previously contributed to a few Python packages and frameworks, namely signac and signac-flow. If contributions are welcomed, then I'd really like to contribute. Hoping for a positive response, thank you. 😊

Hey guys,

I have a few questions on hyperparameter tuning when developing neural networks with Trax. It would be really great if you shared some of your current approaches or pointed me in the right direction to get information on the matter.

- What is the current approach to hyperparameter tuning? Are there any built-in tools, or can we easily utilize external tools?
- Do we want to see hyperparameter tuning tools, helper functions, etc. within Trax, or is this out of scope for the Trax library? If it is a subject for contribution, what would we want implemented first?
- I saw some mentions of TensorBoard in the Trax source code. How exactly can we integrate and use TensorBoard with Trax?
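Absent a built-in tuner, a plain external search loop works with any training setup; a sketch (the `train_fn` below is a stand-in returning a fake validation loss, not a Trax API):

```python
import itertools

def train_fn(learning_rate, d_model):
    # Stand-in for a real training run that returns a validation loss;
    # this fake loss is minimized at learning_rate=0.01, d_model=512.
    return abs(learning_rate - 0.01) + abs(d_model - 512) / 1024

# Hyperparameter grid: every combination is tried.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "d_model": [256, 512],
}

best = None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    loss = train_fn(**params)
    if best is None or loss < best[0]:
        best = (loss, params)
```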

3 replies

@lukaszkaiser Hi Lukasz! I am a beginner in machine translation. I want to use Trax to complete my first translation task. But I did not find any relevant examples of using Trax for translation on the Internet. Can you please provide an example for me? Thank you very much!

1 reply

@ameyas1 You may try byte-level BPE instead of character-level to deal with Asian languages like Korean. For translation, you can try joint BPE. Both are discussed in the BPE paper: https://arxiv.org/abs/1508.07909. I think you can find code in the Trax library to do this.
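The merge operation at the heart of BPE (from that paper) is small enough to sketch in plain Python; this is an illustrative reimplementation, not code from Trax:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, words stored as character tuples.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
pair = most_frequent_pair(words)   # e.g. ('l', 'o'), seen 10 times
words = merge_pair(words, pair)
```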

:point_up: September 17, 2020 3:25 PM Hey @SebastianThomas1_gitlab, what was your motivation for this question? Btw, thanks for the detailed description. My understanding is that for (x, y), training and inference are done differently. In training, one of the learning tasks is predicting the next or masked y_i; for this, you feed (x, y), and the y_i predictions can be done in parallel. In inference, you just feed x and generate y. In Problem 1, you can compute p(y | x) as the product of p(y_i | y_{<i}, x). You can use this probability as a ranking function to select the most likely y to match x if you are given a set of y's. In Problems 2 and 3, yes, but what do you want to do? In Problem 3, if you are interested in p(y), then this probably has nothing to do with translation but just with the target language Y. You can build a language model for language Y and compute p(y) as a product of p(y_i | y_{<i}). Theoretically, you can also estimate p(y) by randomly drawing an x and then using p(y | x) for p(y).

@lukaszkaiser @weiddeng I tried using SentencePiece BPE with the English-French dataset from http://www.manythings.org/anki/fra-eng.zip on the seq2seq attention model from the Coursera course, but it gave me very poor results. I think tokenization could be the problem. Now I don't know how to tackle this problem.

Hi, does anyone have a code reference (e.g. a notebook) for BERT pretraining? I am interested in learning the practice. So far the examples I know are relatively simple, like IMDB review sentiment classification. Thanks!

Hello, does anyone know if there is a layer like Conv2D for doing image processing? I could not find anything similar at https://trax-ml.readthedocs.io/en/latest/trax.layers.html.