a) I think there is some interest. We had similar plans in the past, but this proved to be a bit complex to build and maintain over time. Other initiatives such as https://github.com/tensorflow/text could be an alternative for some use cases.
b) Sure. The Tokenizer is MIT-licensed, so you can do anything you want as long as you credit the original project.
Please keep me updated and let me know if there are any changes in the Tokenizer that would make your work easier.
sample_buffer_size can be used to configure the shuffle buffer size. Small values mean the buffer fills faster but the training data is shuffled less thoroughly, while large values mean slower filling but better shuffling. The default buffer size is the dataset size.
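
To make the trade-off concrete, here is a minimal pure-Python sketch of how a shuffle buffer generally behaves. It is only an illustration and not the actual implementation behind sample_buffer_size; the function name buffered_shuffle and its seed argument are made up for this example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer.

    Hypothetical helper for illustration only: it mimics the general
    shuffle-buffer idea, not any specific library's code.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        # Nothing is emitted until the buffer is full, which is why
        # a larger buffer takes longer to start producing examples.
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    data = list(range(20))
    print("buffer of 1: ", list(buffered_shuffle(data, 1)))
    print("buffer of 4: ", list(buffered_shuffle(data, 4)))
    print("buffer of 20:", list(buffered_shuffle(data, len(data))))
```

With a buffer of 1 the order is unchanged, with a small buffer each example can only move a few positions forward, and with a buffer as large as the dataset the result is a full shuffle, which corresponds to the default described above.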