@sa-js sounds like you're well on your way to analyzing the data! You've already gotten the data into vector form. Stemming the words is a great idea, as you've mentioned, and NLTK should be able to do a lot of this for you, as you suggest. I'd agree this is a good enough approach for now.
Also, it looks like NLTK has a built-in classifier you can use https://pythonspot.com/natural-language-processing-prediction/
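In case it helps, here's a rough sketch of how the stemming and the built-in classifier could fit together. This is only an illustration under my own assumptions: the toy sentences, the `pos`/`neg` labels, and the `featurize` helper are all made up, and your real features will probably look different.

```python
from nltk import NaiveBayesClassifier
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def featurize(text):
    """Turn a string into a bag-of-stems feature dict (hypothetical helper)."""
    # A plain whitespace split avoids needing any extra NLTK data downloads.
    return {stemmer.stem(token): True for token in text.lower().split()}


# Hypothetical toy data, invented purely for illustration.
train_texts = [
    ("loved this great movie", "pos"),
    ("great acting and loved the plot", "pos"),
    ("hated this boring movie", "neg"),
    ("boring plot and terrible acting", "neg"),
]

# NLTK classifiers train on a list of (feature_dict, label) pairs.
train_set = [(featurize(text), label) for text, label in train_texts]
classifier = NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(featurize("loved the great plot"))
print(prediction)
```

With real data you'd want far more examples and a held-out test set before trusting the predictions.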
Here are some other resources that might help:
@mridul037 if you want some practice, you can work through the full workflow: collecting the data, cleaning/manipulating it, and visualizing it.
Here is one initiative you can use to practice this: https://github.com/rfordatascience/tidytuesday
Bioinformatics related, here are some I've found useful:
If you have specific questions about bioinformatics data science, feel free to ask around here :smile:
...there's not much to the guidelines here, just be friendly, don't veer too far off topic, and no self-promotion. You can have a quick read through the Code of Conduct if you'd like :)
"...no self-promotion." @becausealice2 Lol whoops. I hope my earlier link wasn't too self-promotion-y! (Aren't I a moderator too? I should know the bounds too haha.) Just sharing an analysis I thought others might be interested in seeing :smile: