    Jonathan Reeve
    @JonathanReeve
    Hello to everyone interested in the corpus-DB project! Feel free to introduce yourself.
    Anthony Durity
    @anthonydurity_twitter
    Hello Jonathan. I am interested. I have written a custom corpus builder for philosophical texts in Ruby on Rails but it is very much for my own personal use. I think that I would very much like to be involved in a corpus project that met my needs but that was also good enough to share with the world. I have Python experience, and a smidgen of Haskell.
    Jonathan Reeve
    @JonathanReeve
    Hey Anthony! That sounds awesome. Is your Rails app on GitHub somewhere? If the philosophy texts are public-domain, and aren't already in Project Gutenberg, maybe we can think of a way to integrate them into the corpus-db database?
    Anyway, check out the codebase when you get a chance and see what you might be interested in hacking around on. There are basically three areas: 1) Python scripts for getting book data from, e.g. Wikipedia and adding it to the database (like this one). 2) the engine that makes the web page and API, written in Haskell, and 3) analyses, in Python, made possible with corpus-db corpora.
    There are plenty of other opportunities for hacking around elsewhere, too. If you have an idea for a new feature, for instance, open an issue for it, and we'll discuss.
    Anthony Durity
    @anthonydurity_twitter
    Cheers Jonathan. My Rails app is on GitHub, but it's not up to date so I won't point to it just yet. I'd like to bring your attention to the CITE Architecture if you are not already aware of it. I am trying to leverage their CTS URN notation spec (http://cite-architecture.github.io/ctsurn/overview/) and am in touch with one of the architects, Christopher Blackwell; could be something for Corpus-DB to take advantage of. They have a Scala version, I'm making a Ruby one, Corpus-DB could make a Python one.
    Many of the texts I am using are in the public domain, yes. I'll add a copyright status field to my DB. Some texts I am using are not yet out of copyright; I was taking advantage of (a) fair use and (b) the fact that I was not publishing the texts, merely doing computational linguistics stuff on them…
    Have you thought about using metadata from Wikidata? That's what I've been doing. SPARQL-fu for the win.
    Just finishing writing up a potential IACAP paper, then I'll turn my attention to Corpus-DB. Keep prodding me, I assure you I am very interested, I've been following your work for a number of years.
    Jonathan Reeve
    @JonathanReeve
    I hadn't seen CITE or CTS URN before, this is awesome. Feel free to open an issue in the corpus-db repo about integrating it or parsing CTS URNs in the API. For instance, corpus-db.org/api/cts-urn/<cts-urn-here>.
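To make the idea of parsing CTS URNs in the API a little more concrete, here is a minimal sketch in Python. It assumes only the general shape `urn:cts:<namespace>:<work>[:<passage>]` described in the CTS URN overview linked above, with the work component split on dots (text group, work, version, exemplar). The names `CtsUrn` and `parse_cts_urn` are hypothetical and not part of corpus-db or the CITE libraries, and the sketch does not validate subreferences or passage ranges.

```python
from typing import List, NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the pieces of a CTS URN."""
    namespace: str
    work: List[str]          # text group, work, version, exemplar (as many as are present)
    passage: Optional[str]   # raw passage reference, e.g. "1.1-1.10", or None


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy, and passage reference."""
    parts = urn.split(":")
    # Expect urn:cts:<namespace>:<work> with an optional fifth :<passage> part.
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    if not namespace or not work:
        raise ValueError(f"missing namespace or work component: {urn!r}")
    # An empty trailing passage component is treated the same as no passage.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, work.split("."), passage)


if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1-1.10"))
```

An API route like the one suggested above could hand its `<cts-urn-here>` segment to a function of this kind and then look up the work and passage in the database; how that lookup would map onto corpus-db's schema is left open here.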
    Anthony Durity
    @anthonydurity_twitter
    Will do.
    Their work arose out of building a corpus for Classics texts and the need for a generic notation scheme, let's not reinvent the wheel!
    Jonathan Reeve
    @JonathanReeve
    I'd done the Wikipedia data parsing using DBpedia and SPARQL queries. It doesn't do the best job, since it relies on title/author matching, and isn't that fuzzy. As a result, I only got Wikipedia data for about 1.5K out of the total 45K Gutenberg volumes. If you can improve on the code in any way, it'd be much appreciated by everyone, I'm sure.
    Agreed. It's super similar to the thing I tried to do with AnnoTags, intended to make Tweetable hashtags for book locations: http://jonreeve.com/projects/annotags/
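One way the title/author matching mentioned above could be made fuzzier is sketched below, using only Python's standard-library `difflib`. This is not the code in the corpus-db repo; the function names, the 0.7/0.3 weighting, and the 0.85 threshold are all assumptions chosen for illustration and would need tuning against real Gutenberg and Wikipedia records.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str, sort_tokens: bool = False) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    With sort_tokens=True the words are sorted, so that
    "Melville, Herman" and "Herman Melville" normalize identically.
    """
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    if sort_tokens:
        tokens = sorted(tokens)
    return " ".join(tokens)


def similarity(a: str, b: str, sort_tokens: bool = False) -> float:
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(
        None, normalize(a, sort_tokens), normalize(b, sort_tokens)
    ).ratio()


def best_match(title, author, candidates, threshold=0.85):
    """Return the best (title, author) candidate, or None if nothing scores
    at or above the threshold. Weights and threshold are illustrative."""
    best, best_score = None, 0.0
    for cand_title, cand_author in candidates:
        score = (0.7 * similarity(title, cand_title)
                 + 0.3 * similarity(author, cand_author, sort_tokens=True))
        if score > best_score:
            best, best_score = (cand_title, cand_author), score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    records = [
        ("Moby-Dick; or, The Whale", "Melville, Herman"),
        ("Pride and Prejudice", "Austen, Jane"),
    ]
    print(best_match("Moby Dick, or The Whale", "Herman Melville", records))
```

Comparing every volume against every candidate is quadratic, so for tens of thousands of volumes some form of blocking (for example, only comparing records that share an author surname) would probably be needed.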
    Anthony Durity
    @anthonydurity_twitter
    I was able to match 13,000 philosophers or authors who have written major philosophical works using name matching and metadata matching and some eyeballing. I have done the same for only about 200 texts, I think I could do it for far more texts.
    Agreed. Okay, talk to you anon. Over 'n out!
    Jonathan Reeve
    @JonathanReeve
    Awesome. OK! Thanks for dropping in.
    Matthew Miller
    @matthewmmiller_twitter
    @JonathanReeve: This sounds like a fascinating project. I'm looking for a Project Gutenberg metadata database in order to analyze it using Tableau. How large are the SQLite databases? I'd be potentially willing to cover the bandwidth to transfer them.
    Anthony Durity
    @anthonydurity_twitter
    @matthewmmiller_twitter What is Tableau?
    Matthew Miller
    @matthewmmiller_twitter
    @anthonydurity_twitter
    It’s data visualization and analysis software.
    Jonathan Reeve
    @JonathanReeve
    Hi @matthewmmiller_twitter! Welcome. Sorry for the late response. I apparently don't have notifications turned on for Gitter yet. The databases are about 16GB without the full text search indices (FTS5), and around 33GB with them. I'd be happy to give you a copy, if we can figure out a way to transfer it. Maybe I can put the 16GB version up on a BitTorrent seedbox, so that way it'd be available to anyone.
    If you offer it to the public, you might want to square away the legal end of it, though. I've scrubbed all the licenses from the text files themselves, since these throw off the text analysis, but PG might require you to distribute the licenses along with the texts.
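For anyone redistributing the texts, one option is to separate the license material from the body instead of discarding it, so the body can be analyzed while the header and footer are kept for distribution. The sketch below is not how corpus-db scrubs its files; it assumes the common `*** START OF THIS/THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines, which not every Gutenberg file uses, and `split_gutenberg` is a hypothetical name.

```python
import re

# Marker lines found in many (not all) Project Gutenberg plain-text files.
START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def split_gutenberg(raw: str):
    """Return (header, body, footer).

    header: everything up to and including the START marker line
    body:   the text between the markers, stripped of surrounding whitespace
    footer: the END marker line and everything after it (usually the license)
    """
    start = START.search(raw)
    end = END.search(raw)
    if not start or not end or end.start() < start.end():
        # Markers missing or out of order: leave the text untouched
        # rather than guess where the boilerplate ends.
        return "", raw, ""
    header = raw[: start.end()]
    body = raw[start.end(): end.start()].strip()
    footer = raw[end.start():]
    return header, body, footer


if __name__ == "__main__":
    sample = (
        "The Project Gutenberg EBook of Example\n\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
        "Call me Ishmael.\n\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
        "License text here.\n"
    )
    header, body, footer = split_gutenberg(sample)
    print(body)
```

Storing the header and footer alongside each body would let a distributor reattach them later; whether that satisfies Project Gutenberg's terms is a legal question this sketch does not answer.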