@friesel - Let me ask the person who did this to maybe make a Colab and share.
Did you happen to get to this? (I mean the Trax implementation you mentioned: "A friend played with it on the TFDS scientific papers dataset and it does generate reasonable summaries (even if it was a little repetitive at first try).") Would be awesome. Still struggling heavily to get back to what we did in t2t. Thx
A bit over 100k steps, at 8 samples per step; 1 sample is 1 full Wikipedia article plus random padding. I'm mostly using the basic sampling setup from the notebooks with:
```
TimeBinCausalAttention.bin_length = 128
TimeBinCausalAttention.n_bins = None
LSHCausalAttention.n_hashes = 8
LSHCausalAttention.bucket_capacity_for_inference = 256
```
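For intuition on why more hashes might help at inference: LSH attention buckets queries and keys by random-projection hashing, and running several hash rounds reduces the chance that two similar vectors end up in different buckets in every round. This is just a toy numpy sketch of the bucketing idea, not the Trax implementation (function name and shapes are mine):

```python
import numpy as np

def lsh_buckets(vecs, n_buckets, n_hashes, rng):
    """Random-rotation LSH as in the Reformer paper: project each vector onto
    n_buckets // 2 random directions, then take argmax over [proj; -proj]
    to pick a bucket. Returns an (n_hashes, n_vecs) array of bucket ids."""
    d = vecs.shape[-1]
    rotations = rng.standard_normal((n_hashes, d, n_buckets // 2))
    proj = np.einsum("nd,hdb->hnb", vecs, rotations)      # (n_hashes, n, b/2)
    proj = np.concatenate([proj, -proj], axis=-1)          # angular-LSH trick
    return proj.argmax(axis=-1)

rng = np.random.default_rng(0)
v = rng.standard_normal(16)
trio = np.stack([v, v + 0.01 * rng.standard_normal(16), -v])  # v, near-dup, opposite
buckets = lsh_buckets(trio, n_buckets=8, n_hashes=8, rng=rng)
# the near-duplicate should share v's bucket in (almost) every hash round
```

With a single hash round a near-duplicate pair can still get unlucky; with 8 rounds the pair attends to each other as long as they collide in any one of them, which is why bumping `n_hashes` at inference time is a plausible knob to try.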
I'm going to try top_p sampling next, and see what I can change in the config at inference time to get better results (e.g. more hashes, bigger bins; not sure what else yet). Advice welcome.
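For anyone following along, top_p (nucleus) sampling just restricts sampling to the smallest set of tokens whose cumulative probability exceeds `p`. A minimal standalone numpy sketch (not the Trax API; the function name is mine):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, renormalize, and sample from it."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # always keep >= 1 token
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```

Compared to plain temperature sampling, this cuts off the long tail of unlikely tokens, which is one common way to reduce the kind of repetition mentioned above.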
I'll write it up as a tutorial afterwards, including the helper functions I've added, so people can follow it.
hard_k has nothing to do with decoding; in fact, that flag only exists as a remnant of a research direction that yielded no results.