    Titipat Achakulvisut
    @titipata
    For any comments related to pubmed_parser repo
    Titipat Achakulvisut
    @titipata
    This has been an empty room since 2015
    dterg
    @dterg
    Hi there! This room shall not be empty in 2016. May I ask if there's a reason why you chose to export to Spark DataFrames?
    Titipat Achakulvisut
    @titipata
    Hi @dterg ! It's not quite empty anymore, yay. So, it comes down to the speed of parsing the whole corpus with Spark and the ability to read Parquet files. But there are other options too. I just go for Spark since I have to process a lot of these files, and Spark is quite handy for parallelization.
    You can export to CSV too. However, you need quite a lot of memory to read the bigger files.
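A minimal sketch of the Spark export described above, assuming pubmed_parser's `parse_pubmed_xml` and a local directory of downloaded OA `.nxml` files; the `data/pubmed_oa/` path and partition count are hypothetical:

```python
import glob

import pubmed_parser as pp
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("pubmed_oa_parse").getOrCreate()

# Collect paths to the downloaded OA XML files (hypothetical layout).
paths = glob.glob("data/pubmed_oa/**/*.nxml", recursive=True)

# Distribute the paths, parse each article to a dict on the workers
# (pubmed_parser must be installed on every worker), then build a
# DataFrame and write it out as Parquet.
parsed = (spark.sparkContext
          .parallelize(paths, numSlices=256)
          .map(pp.parse_pubmed_xml)
          .map(lambda d: Row(**d)))
spark.createDataFrame(parsed).write.mode("overwrite").parquet("pubmed_oa.parquet")
```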
    dterg
    @dterg
    Do you have an estimate of how long it takes to process, say, 100 papers? I know this is hardware-dependent, but I was wondering how it would perform if I exported to JSON to use it with Elasticsearch
    dterg
    @dterg
    Also, correct me if I'm wrong, but from a quick glance it seems that when checking for updates it currently checks for the same file name, right? I'd be happy to integrate my code, which uses the provided filelist file to identify new updates, retrieve them, and extract them, if that's the case. I found that sometimes a new update has the same files as previous releases. Might be minor corrections? Haven't checked for differences yet when this happens.
    Titipat Achakulvisut
    @titipata
    Oh nice! So on my Mac with 8 cores, it takes about 30 mins to parse all PubMed Open Access articles
    dterg
    @dterg
    Ah, that's not too bad! I'm assuming you only process the bulk once and then update incrementally based on new files :)
    Titipat Achakulvisut
    @titipata
    @dterg, I would love to! I haven't fleshed out the scripts folder well yet. Right now, I'm running everything in my Spark notebook, so it needs a bit of code change :)
    @dterg, that's what I'm thinking for both MEDLINE and Open Access! In the end I would like to have code that figures out the last update time and updates only a few files
    Back to the earlier question about exporting to JSON. I'll take a look at whether there is a quick way to transform a Parquet file to JSON.
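For what it's worth, Spark can do that conversion directly, writing one JSON object per line (a format that also suits Elasticsearch bulk loading); a quick sketch with hypothetical file names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Parquet output back and rewrite it as line-delimited JSON.
spark.read.parquet("pubmed_oa.parquet") \
     .write.mode("overwrite").json("pubmed_oa_json")
```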
    dterg
    @dterg
    That's what I'm working on, actually. PubMed OA has an API to check which files were added since a specified date/time. So what I'm doing is updating, and every 24 hrs getting the list of new files and downloading those specifically
    I can implement the JSON export. I'm only wondering what the performance differences would be
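A rough sketch of that incremental-update idea, using the PMC OA web service (`oa.fcgi`), which lists records added or updated since a given date; treat the exact response layout here as an assumption to verify:

```python
import xml.etree.ElementTree as ET

import requests

# Ask the OA service for everything added/updated since a given date.
resp = requests.get(
    "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi",
    params={"from": "2016-01-01"},
)
root = ET.fromstring(resp.content)

# Each <record> should carry <link> elements pointing at the article's
# .tar.gz package on the FTP site (layout assumed, verify against the docs).
for record in root.iter("record"):
    for link in record.iter("link"):
        print(record.get("id"), link.get("href"))
```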
    Titipat Achakulvisut
    @titipata
    That's awesome! I can also add you to the repository directly, or a PR is good too. I guess JSON export should be good! I just use the data for topic modeling later on, so a dataframe is quite handy for that kind of analysis.
    dterg
    @dterg
    I wonder if parallelizing with Spark and then converting to JSON would be better performance-wise than going serially and directly to JSON. It probably is. I plan on using the parsed text for keyword searching, hence Elasticsearch
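To make the Elasticsearch angle concrete, a hedged sketch of streaming the line-delimited JSON from the earlier conversion into an index with the official Python client; the index name and directory are hypothetical:

```python
import glob
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def docs():
    # Spark writes a directory of part-* files, one JSON object per line.
    for part in glob.glob("pubmed_oa_json/part-*"):
        with open(part) as f:
            for line in f:
                yield {"_index": "pubmed_oa", "_source": json.loads(line)}

# Bulk-index every parsed article into the hypothetical "pubmed_oa" index.
helpers.bulk(es, docs())
```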
    Titipat Achakulvisut
    @titipata
    Oh, and I just saw that you are from Imperial College! I was there 3 weeks ago.
    dterg
    @dterg
    Oh were you? Visiting?
    Titipat Achakulvisut
    @titipata
    Yes! I was visiting Aldo Faisal's lab in the Bioengineering Department!
    dterg
    @dterg
    That's awesome. You work in academia as well?
    Titipat Achakulvisut
    @titipata
    I guess parallelizing with Spark, saving to any format, and then converting to JSON would be the fastest. Spark won't require a lot of memory to do that either.
    Yeah, we're from Northwestern >> http://kordinglab.com/
    dterg
    @dterg
    Yeah, I think that as well. I'll have a look to see if I can parallelize otherwise
    Oh sweet!
    I'm working on a Flask app to check the status as well, so I'll keep you updated
    Titipat Achakulvisut
    @titipata
    Thanks so much @dterg! Let me know if you have any problems with the parser.
    And let me know if you have any problems running Spark :smile: . It requires some setup (e.g. SPARK_HOME)
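A tiny sketch of that setup, pointing Python at a local Spark install before importing pyspark; the install path is hypothetical, and `findspark` is a small helper package that does the sys.path wiring:

```python
import os

# Wherever the Spark distribution is unpacked (hypothetical path).
os.environ["SPARK_HOME"] = "/opt/spark"

import findspark
findspark.init()  # reads SPARK_HOME and makes pyspark importable

import pyspark
print(pyspark.__version__)
```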
    dterg
    @dterg
    Thanks. I will try to do some benchmarks :) I'm only working with PubMed OA at the moment. From what I've seen, I need a license for MEDLINE
    Titipat Achakulvisut
    @titipata
    You can actually download MEDLINE right away :P It's basically a hidden folder on the FTP site: ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/
    But it's good to start with OA, and we can improve the MEDLINE script later on. Feel free to bug me here, or if I don't reply here, I'm always on my Gmail: titipat.a@u.northwestern.edu
    dterg
    @dterg
    Oh, good to know. I hope the FTP file structure is similar to PubMed 😆
    Titipat Achakulvisut
    @titipata
    pubmed_parser also has a function to parse the MEDLINE dataset; the structure is quite similar. I would say less complicated!
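A minimal sketch of that MEDLINE parsing function, assuming one of the gzipped baseline files has been downloaded (the file name is hypothetical):

```python
import pubmed_parser as pp

# parse_medline_xml reads a gzipped baseline file directly and returns
# a list of dictionaries, one per citation.
articles = pp.parse_medline_xml("medline16n0001.xml.gz")
print(len(articles))
print(articles[0].keys())  # e.g. title, abstract, journal, mesh_terms, ...
```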
    dterg
    @dterg
    Cheers! Same here
    Titipat Achakulvisut
    @titipata
    Cheers! I'll chat with you soon then!
    dterg
    @dterg
    See you, mate. And cheers for the library, by the way!
    Titipat Achakulvisut
    @titipata
    Thanks @dterg, and thanks for contributing to the library!
    Titipat Achakulvisut
    @titipata
    if you want to run the PubMed OA processing alone
    dterg
    @dterg
    Cheers. Tried the non-parallelized version (not the script) and it looks good ;)
    Titipat Achakulvisut
    @titipata
    Perfect! It's actually no problem if you don't want to parallelize it. It will just take 4-5 hours to process everything :)
    dterg
    @dterg
    I will submit a pull request with an alternative parallelization implementation that doesn't require Spark
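One way such a Spark-free parallelization might look, using only the standard library's multiprocessing pool; the paths, pool size, and output file are illustrative, not dterg's actual implementation:

```python
import glob
import json
from multiprocessing import Pool

import pubmed_parser as pp

# Hypothetical layout of the downloaded OA XML files.
paths = glob.glob("data/pubmed_oa/**/*.nxml", recursive=True)

if __name__ == "__main__":
    # Parse articles in parallel across 8 worker processes.
    with Pool(processes=8) as pool:
        parsed = pool.map(pp.parse_pubmed_xml, paths)

    # Write one JSON object per line, ready for bulk indexing.
    with open("pubmed_oa.json", "w") as f:
        for article in parsed:
            f.write(json.dumps(article) + "\n")
```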
    Titipat Achakulvisut
    @titipata
    @dterg, thanks so much! If you can help me improve the automatic-update part, that would be even more awesome
    dterg
    @dterg
    I have actually completed that, but I need to adapt it to the one you have already implemented because I use a different approach
    Kevin Henner
    @kjhenner
    Looks like this room has been empty for a couple of years, but no harm in saying "hello" here. I'm just exploring this project so it can hopefully replace the parser I'd been working on myself. Looks very promising so far!