    gladwig2
    @gladwig2
    @lukaszkaiser, Colab is running 1.3.4, but the GitHub tags only go to 1.3.3?
    gladwig2
    @gladwig2
    I was asking because I can reference tl.LSHSelfAttention(), but not hash_vecs(), which appears to be in the same file on master. I see, however, that with this 'new to me' feature of Colab you can call up the source. Nice. Apparently this was refactored and incorporated into hash_values(). Still, it would be nice to be able to search on GitHub (or clone it locally). [Also, ReverseHalfResidual seems to be in 'reverse.py' on GitHub, but in reformer.py in 1.3.4?]
    TomLeung
    @tomleung1996
    @lukaszkaiser Thank you so much! There must be numerous Colabs covering various tasks based on Trax, but I have difficulties finding them XD.
    Afroz Mohiuddin
    @afrozenator
    @gladwig2 - my bad, fixed!
    Yannick Wehr
    @YannickWehr
    @lukaszkaiser I am currently running some experiments with the Reformer and LSH attention. I tried running a single training step to see if things work, and with 1 hash that training step took about 30 seconds. However, after increasing the number of hashes to 2, the model has been running for about 15 minutes now and has still not completed. Is this normal, or is it more likely that I have made some error in my set-up that led to this explosion in time complexity? My parameters are LSH chunk length 64, max_len=2048, d_model=1024, d_ff=4096, 3 layers, ReLU activation, decoder-only model. I am training on a Google Colab TPU.
    There does seem to be a similar phenomenon when increasing the hashes to 2 on this colab: https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb
    Vishwaas
    @VishwaasHegde
    Hello, in transformers, are there any conventions for d_model and d_ff? I see that d_ff is almost always 4×d_model; my d_model = 2048, and setting d_ff = 4×2048 = 8192 would probably be too heavy for my model.
    weiddeng
    @weiddeng
    Hi, a Trax newbie here. In the Transformer function in trax/trax/models/transformer.py, I see
    tl.Branch([], tl.PaddingMask()) at https://github.com/google/trax/blob/master/trax/models/transformer.py#L374.
    Is it the same as tl.Branch(None, tl.PaddingMask())? Thank you!
    Ben B
    @ben-xD
    Hello, I am trying to understand the difference between Trax and TensorFlow. It would be great to have a small bit in the readme about the differences. "Trax is an end-to-end library for deep learning that focuses on clear code and speed." Is this implying TensorFlow is unclear and slow? I don't understand the differences; I'm sure lots of people here could explain why they're using Trax :)
    Lukasz Kaiser
    @lukaszkaiser
    @ben-xD : Trax uses TensorFlow as a backend. But TF is not a model library - in fact it has no models or datasets at all! So TF is like the lower-level infrastructure (think C) and Trax is like the standard library, with a lot of stuff to use. Does that help?
    Lukasz Kaiser
    @lukaszkaiser
    @weiddeng : Branch([], x) and Branch(None, x) are the same, and the same goes for Parallel. It's all handled in this line in code: https://github.com/google/trax/blob/master/trax/layers/combinators.py#L235
    (Both [] and None get transformed into an identity layer, taking 1 input and returning the same, which can be constructed as Serial() or Serial(None).)
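    To make this concrete, here is a minimal sketch (not from the chat; it assumes a recent Trax and uses tl.Fn to build a throwaway weightless layer) showing that [] and None both act as the identity branch:

        import numpy as np
        from trax import layers as tl, shapes

        x = np.arange(6, dtype=np.float32).reshape(2, 3)
        double = tl.Fn('Double', lambda v: v * 2.0)  # weightless layer for illustration

        for first_branch in ([], None):
            branch = tl.Branch(first_branch, double)  # first branch acts as identity
            branch.init(shapes.signature(x))
            out_identity, out_double = branch(x)      # Branch returns both outputs
            assert np.allclose(out_identity, x)
            assert np.allclose(out_double, 2.0 * x)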
    kujaomega
    @kujaomega
    I'm trying to use the BERT model in the Trax library. The BERT model checkpoint with its weights is in a folder ("bert_config.json", ".ckpt.", "vocab.txt"). To load the weights, Trax only has the "init_from_file" method, but it needs a gzipped pickle file. Is there a way to init the weights from the TensorFlow ".ckpt." file, or to convert the model to a pickle file so it can be used in Trax?
    weiddeng
    @weiddeng
    Thank you very much, Lukasz. I have a question on datasets. I see C4 and SQuAD data here: https://github.com/google/trax/tree/ebb9aa01b70c02498b29f3f0a31d361f31caa395/trax/data/testdata. I also see that the pre-trained transformer uses vocab_dir='gs://trax-ml/vocabs/'. Do you know where I can access the commonly used NLP datasets from Trax?
    Lukasz Kaiser
    @lukaszkaiser
    @kujaomega : yes, in bert.py there is the functionality to load from a TF checkpoint, see from this line on: https://github.com/google/trax/blob/master/trax/models/research/bert.py#L146
    So you can load from a TF checkpoint and then use the model weights and state to save it in Trax format (with model.pkl.gz).
    @weiddeng : Trax does not provide datasets on its own, but it has hooks into TFDS, so you can use those with trax.data.TFDS. An example of how to create a whole input pipeline for IMDB is here: https://github.com/google/trax/blob/master/trax/data/inputs.py#L23
    Note two things: (1) it'll work with any TFDS dataset (and TFDS will download it for you); there are quite a few of them!
    Lukasz Kaiser
    @lukaszkaiser
    (2) It's all pure Python generators. So you can pip install and import nlp from Hugging Face and use those datasets too.
    Or make your own from your data in a file, as explained in the second link above.
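    For reference, an input pipeline along the lines of that IMDB example might look like the following sketch (the dataset name and keys come from TFDS; the vocab file and batch sizes are illustrative values in the style of the Trax intro, not taken from this chat):

        from trax import data

        data_pipeline = data.Serial(
            data.TFDS('imdb_reviews', keys=('text', 'label'), train=True),
            data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
            data.Shuffle(),
            data.FilterByLength(max_length=2048, length_keys=[0]),
            data.BucketByLength(boundaries=[32, 128, 512, 2048],
                                batch_sizes=[512, 128, 32, 8, 1],
                                length_keys=[0]),
            data.AddLossWeights(),
        )
        train_batches = data_pipeline()  # a pure Python generator of batches
        inputs, targets, weights = next(train_batches)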
    kujaomega
    @kujaomega
    Thanks @lukaszkaiser for the response. As far as I can see, the TF checkpoint is loaded when new_weights is called, but when I initialize the model with model = BERT(init_checkpoint=BERT_MODEL_PATH + 'bert_model.ckpt') and weights = super().new_weights(input_signature) is called, I get AttributeError: 'super' object has no attribute 'new_weights', since the tl.Serial object no longer has new_weights.
    gladwig2
    @gladwig2
    Reformer compares Query vs Query rather than Query vs Key. Is there a noteworthy reference describing why this is a good idea?
    Lukasz Kaiser
    @lukaszkaiser
    @kujaomega : indeed, this file wasn't updated! We'll get to it in the next release, but before that, you may try to patch it on your own. The old new_weights is now called init_weights and, instead of returning the weights, it sets self.weights.
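    In code, that rename would look roughly like this sketch (the method names come straight from the message above; none of it is verified against a specific Trax release, so treat it as a hypothetical patch):

        from trax.models.research import bert

        class PatchedBERT(bert.BERT):  # assuming BERT is the class in bert.py
            def init_weights(self, input_signature):
                # Old bert.py (breaks on newer Trax):
                #   weights = super().new_weights(input_signature)
                #   ... overwrite `weights` from the TF checkpoint ...
                #   return weights
                # Renamed API: init_weights sets self.weights and returns nothing.
                super().init_weights(input_signature)
                # ... load the TF checkpoint and overwrite self.weights here ...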
    @gladwig2 : LSH hashing doesn't work with buckets if Q != K - check out the Reformer paper for a more detailed description of why (LSH generally works, but if you want to make it fast on TPUs/GPUs then bucketing is really good, and then you want Q=K). So the question was: does Q=K hurt? The answer seems to be: not at all, same performance as the general Transformer.
    kujaomega
    @kujaomega
    Thanks for the tip @lukaszkaiser
    Yannick Wehr
    @YannickWehr
    In the Reformer paper you mention that instead of computing softmax(QK) one can also compute individual queries, i.e. softmax(q_i K). Does the Self-Attention layer in Trax use this by default? I haven't quite been able to figure it out from the code.
    And Q=K is also not used unless I set shared_qk = True in gin, correct?
    Jindřich Prokop
    @jindrich.prokop_gitlab
    How can I run inference twice after I've initialized a model? E.g. with the Transformer from the intro, translating the first sentence works fine, but translating a second sentence produces just "." or "!". What am I missing?
    Sebastian Thomas
    @SebastianThomas1_gitlab

    I have some questions on the theoretical concepts of machine translation/transformers and their implementation in Trax. I hope these questions are not completely stupid and somebody can help me…

    When I have a transformer (used for machine translation), it is a sequence-to-sequence model, i.e. it consists of an encoder part and a decoder part. When an input (x, y) is fed into the network, where x corresponds to the encoder part and y corresponds to the decoder part, a vector of probabilities is computed. The highest probability corresponds to the next word, and so (if we search for the correct translation in the simplest way) we can expand y to y’ and work with (x, y’) in the next step. (I hope my understanding is correct so far - if not, please correct me!)

    Now I want the following:

    Problem 1: Let’s say I already have a pair (x, y). How can I modify my transformer in such a way that a SINGLE probability is computed, namely the probability P(y | x) that y is a "correct" translation of x (in the sense of the transformer as a relative language model, i.e. the probability that y occurs in the target language provided x is the corresponding input in the source language)? Intuitively, this should be done by taking just the component of the output vector at position 1 (provided this is the position for the <EOS> tag).

    How can this be achieved in Trax? Intuitively, I should only have to extend the transformer model by one layer, but I do not know how to do that (which layer to use and how to add it).

    Problem 2: Is it somehow possible to obtain just the decoder part of a trained transformer as a separate model?

    Problem 3: Now I want to compute "absolute" probabilities P(y). So this should be a combination of problems 1 and 2, i.e. I want to have just the decoder part, extended such that it yields single probabilities. If my intuition is right, I should be able to solve this in Trax once I know how to solve problems 1 and 2 (which I also need to solve for their own sake).

    Lukasz Kaiser
    @lukaszkaiser
    @YannickWehr : we do not compute q_i K separately now; we had code for this, and it worked, but it's so slow it's useless for anything but baselines.
    @SebastianThomas1_gitlab : The Transformer decoder is an auto-regressive model, so there's a catch to P(y | x): there are only P(y_{t+1} | y_t, y_{t-1}, ..., y_0, x). You can add them up and call it P(y | x) (as we do in the loss), but remember that it's always auto-regressive; y_{t+1} depends on all the previous ys. Nothing specific to Trax or any software here - it's how autoregressive models work.
    Lukasz Kaiser
    @lukaszkaiser
    You can obtain P(y | x) in Trax by running preds = transformer((x, y)) and then taking the sum or mean of preds, which contains all the P(y_{t+1} | ...).
    You can also get the decoder if you pick the appropriate model.sublayer (as someone already said).
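    In code, that suggestion might look like this sketch (assuming model is an initialized Trax Transformer whose first output holds per-position log-probabilities, as in the intro notebook, and x, y are tokenized numpy arrays with a batch dimension):

        import numpy as np

        log_probs = model((x, y))[0]  # shape (batch, len(y), vocab)
        # Log-probability the model assigns to each actual target token:
        token_lp = np.take_along_axis(log_probs, y[:, :, None], axis=-1)[..., 0]
        # Summing the per-token terms gives log P(y | x) for the whole sequence:
        log_p_y_given_x = token_lp.sum(axis=-1)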
    Sebastian Thomas
    @SebastianThomas1_gitlab

    @lukaszkaiser Thank you very, very much for your hints!!

    I think you have to take the product of the probabilities instead of the sum, right? With sums, one can easily get something greater than 1, but with products one should have
    P(y_{t+1} | x, y_0, …, y_t) \cdot P(y_t | x, y_0, …, y_{t-1}) \cdot … \cdot P(y_0 | x)
    = P(y_{t+1} | x, y_0, …, y_t) \cdot P(y_0, …, y_t | x)
    = P(x, y_0, …, y_t, y_{t+1}) / P(x, y_0, …, y_t) \cdot P(x, y_0, …, y_t) / P(x)
    = P(x, y_0, …, y_t, y_{t+1}) / P(x)
    = P(y_0, …, y_t, y_{t+1} | x),
    where the first equality holds by induction. (I have not yet thought about whether the usual probability laws hold for this notation, but I would be surprised if they did not.)
    Of course, the product of probabilities then translates to the sum of log-probabilities, which might be what you had in mind.

    However, I still have another problem. You told me to apply the transformer to the pair (x, y), since this would contain all probabilities. I tried my best by imitating the source code of trax.supervised.decoding.autoregressive_sample and trax.supervised.decoding.autoregressive_sample_stream, but got some confusing results.

    Let’s take the example from the Trax intro (https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html). There, tokenized_translation has the value
    array([ 168, 24, 9358, 2, 352, 367, 2427, 18, 3580, 207])
    I used the following code to imitate the code of autoregressive_sample:

    start_symbol = np.full((1, 1), 0, dtype=np.int32)
    
    model.state = initial_state
    current_symbols = start_symbol
    
    while current_symbols != np.array([[1]]):  # stop once EOS (token id 1) is sampled
        logits = model((tokenized, current_symbols))[0]
        sample = trax.layers.logsoftmax_sample(logits[:, -1, :], temperature=0.0)
        print(sample)
        current_symbols = sample[:, None]

    Then, as expected, the following is printed to the screen:

    [168]
    [24]
    [9358]
    [2]
    [352]
    [367]
    [2427]
    [18]
    [3580]
    [207]
    [1]

    However, when I apply the model to the whole tokenized_translation, I obtain a different result: The code

    model.state = initial_state
    current_symbols = np.concatenate([start_symbol, tokenized_translation[None, :]], axis=1)
    logits = model((tokenized, current_symbols))[0]
    samples = trax.layers.logsoftmax_sample(logits[:, :, :], temperature=0.0)
    print(samples)

    yields

    [[207  24 207 207  33 207 207 207 207 207   1]]

    (instead of the expected [[ 168 24 9358 2 352 367 2427 18 3580 207 1]]).

    What am I missing here?

    Hardik Ojha
    @kidrahahjo
    Hi team, my name is Hardik Ojha and I'm a student currently pursuing a bachelor's degree at the Indian Institute of Technology Roorkee, India.
    I'm highly interested in contributing to the Trax package, as it fits my area of interest really well.
    I have previously contributed to a few Python packages and frameworks, namely signac and signac-flow. If contributions are needed / welcomed, then I'd really like to contribute. Hoping for a positive response, thank you. 😊
    Andrei Nesterov
    @manifest

    Hey guys,

    I have a few questions on hyperparameter tuning when developing neural networks with Trax. It would be really great if you could share some of your current approaches or point me in the right direction to get information on that matter.

    • What is the current approach for hyperparameter tuning? Are there any built-in tools, or could we easily utilize some external tools?
    • Do we want to see any hyperparameter tuning tools, helper functions, etc. within Trax, or is this out of scope for the Trax library? If it is a subject for contribution, what would we want implemented first?
    • I saw some mentions of TensorBoard in the Trax source code. How exactly can we integrate and use TensorBoard with Trax?
    dabingooo
    @dabingooo
    @lukaszkaiser Hi Lukasz! I am a beginner in machine translation. I want to use Trax to complete my first translation task, but I did not find any relevant examples of using Trax for translation on the Internet. Could you please provide an example for me? Thank you very much!
    Saurav Maheshkar
    @SauravMaheshkar
    Hey, is there a ConvTranspose Layer in Trax?
    ameyas1
    @ameyas1
    @lukaszkaiser I am trying NMT for an Asian language pair, like Korean to English. How can I create a subword vocabulary for it?
    weiddeng
    @weiddeng
    @ameyas1 You may try BPE over bytes instead of characters to deal with Asian languages like Korean. For translation, you can try joint BPE. Both are discussed in the BPE paper: https://arxiv.org/abs/1508.07909. I think you can find code in the Trax library to do this.
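    For instance, a sketch with the sentencepiece package (not part of Trax; the file names and parameter values are made up for illustration):

        import sentencepiece as spm

        # Train a joint BPE model on a file holding both Korean and English lines.
        spm.SentencePieceTrainer.train(
            input='ko_en_corpus.txt',   # hypothetical training file
            model_prefix='ko_en_bpe',
            vocab_size=32000,
            model_type='bpe',
            character_coverage=0.9995,  # high coverage helps CJK scripts
        )
        sp = spm.SentencePieceProcessor(model_file='ko_en_bpe.model')
        print(sp.encode('안녕하세요', out_type=str))  # subword pieces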
    weiddeng
    @weiddeng
    Hey @SebastianThomas1_gitlab, what was your motivation for this question? Btw, thanks for the detailed description. My understanding is that for (x, y), training and inference are done differently. In training, one of the learning tasks is predicting the next or masked y_i; for this, you feed (x, y), and the y_i predictions can be done in parallel. In inference, you just feed x and generate y.

    In Problem 1, you can compute P(y|x) as the product of P(y_i | y_<i, x). You can use this probability as a ranking function to select the most likely y to match x, if you are given a set of y's.

    In Problems 2 and 3, yes, but what do you want to do? In Problem 3, if you are interested in P(y), then this probably has nothing to do with translation but just with the target language Y. You can build a language model for language Y and compute P(y) as a product of P(y_i | y_<i). Theoretically, you could also estimate P(y) by randomly drawing an x and then using P(y|x) for P(y).
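    The ranking idea might look like this sketch (model, x, and the candidate arrays are assumed to exist, tokenized and batched as in the earlier sketch):

        import numpy as np

        def log_p_y_given_x(model, x, y):
            # Sum of per-token log-probs under teacher forcing.
            log_probs = model((x, y))[0]
            tok = np.take_along_axis(log_probs, y[:, :, None], axis=-1)[..., 0]
            return float(tok.sum(axis=-1)[0])

        # Rank a handful of candidate translations for the same source x:
        candidates = [y_a, y_b, y_c]  # hypothetical tokenized candidates
        best = max(candidates, key=lambda y: log_p_y_given_x(model, x, y))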
    ameyas1
    @ameyas1
    @lukaszkaiser @weiddeng I tried using sentencepiece BPE with the English and French dataset from http://www.manythings.org/anki/fra-eng.zip on the seq2seq attention model from the Coursera course, but it gave me very poor results. I think tokenization could be the problem. Now I don't know how to tackle this problem.
    weiddeng
    @weiddeng
    @ameyas1 I don't know... maybe you can try different values for the number of merges?
    Hi, does anyone have a code reference (e.g. a notebook) for BERT pretraining? I am interested in learning the practice. So far the examples I know are relatively simple, like IMDB review sentiment classification. Thanks!
    ameyas1
    @ameyas1
    @lukaszkaiser can we visualize Trax models that we have created?
    Helly
    @HellyJain
    Hello, does anyone know if there is a layer like Conv2D that I could use for image processing? I could not find anything similar at "https://trax-ml.readthedocs.io/en/latest/trax.layers.html".