I will make a new ticket today, but before doing that I wanted to discuss a few things.
- I feel that if the corpus of raw data contains any discrepancies, they will carry forward into later steps such as tokenization.
- I edited a few files and ran the tokenizer on both the existing files and the edited ones. Many of the meaningless, unnecessary characters present in the existing files were gone from the edited versions. I think this will lead to more efficient training and execution.
- I just wanted to know whether I can remove those unnecessary words/characters from the files or not. Most importantly, before making any changes to the files I will definitely mention the source.
And I would really love to fix this problem myself.
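To make the kind of cleanup I have in mind concrete, here is a rough sketch (just an illustration, not the project's actual tokenizer; the regex and the printable-ASCII assumption are mine, and would be too aggressive for non-English corpora):

```python
import re

def clean_text(text: str) -> str:
    # Strip control characters and other non-printable junk
    # (e.g. NUL bytes, zero-width spaces) while keeping
    # regular whitespace: space, tab, and newline.
    return re.sub(r"[^\x20-\x7E\t\n]", "", text)

raw = "Hello\x00 world\x07 \u200bexample"
cleaned = clean_text(raw)

# A crude proxy for the effect on tokenization:
# whitespace-splitting before and after cleanup.
print(repr(cleaned))
print(len(raw.split()), "->", len(cleaned.split()))
```

The point is just that stray bytes in the raw files otherwise survive into the token stream, so removing them once at the source is cleaner than handling them downstream.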