Hello to everyone interested in the corpus-DB project! Feel free to introduce yourself.
Hello Jonathan. I am interested. I have written a custom corpus builder for philosophical texts in Ruby on Rails but it is very much for my own personal use. I think that I would very much like to be involved in a corpus project that met my needs but that was also good enough to share with the world. I have Python experience, and a smidgen of Haskell.
Hey Anthony! That sounds awesome. Is your Rails app on GitHub somewhere? If the philosophy texts are public-domain and aren't already in Project Gutenberg, maybe we can think of a way to integrate them into the corpus-db database?
There are plenty of other opportunities for hacking around elsewhere, too. If you have an idea for a new feature, for instance, open an issue for it, and we'll discuss.
Cheers Jonathan. My Rails app is on GitHub, but it's not up to date, so I won't point to it just yet. I'd like to bring your attention to the CITE Architecture, if you're not already aware of it. I'm trying to leverage their CTS URN notation (spec: http://cite-architecture.github.io/ctsurn/overview/) and am in touch with one of the architects, Christopher Blackwell; it could be something for corpus-DB to take advantage of. They have a Scala implementation, I'm making a Ruby one, and corpus-DB could make a Python one.
Many of the texts I am using are in the public domain, yes. I'll add a copyright status field to my DB. Some texts I am using are not yet out of copyright; I was taking advantage of (a) fair use and (b) the fact that I was not publishing the texts, merely doing computational linguistics work on them…
Have you thought about using metadata from Wikidata? That's what I've been doing. SPARQL-fu for the win.
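As a sketch of what that SPARQL-fu might look like, here's a minimal Python helper that builds a query for Wikidata's public endpoint (https://query.wikidata.org/sparql). The specific property and item IDs (P106 = occupation, Q4964182 = philosopher, P569 = date of birth) are illustrative of Wikidata's vocabulary, not taken from either of our codebases:

```python
# Hypothetical sketch: build a SPARQL query to fetch philosopher metadata
# from Wikidata. IDs used: P106 = occupation, Q4964182 = philosopher,
# P569 = date of birth. Actually sending the query (e.g. with urllib or
# the SPARQLWrapper library) is left to the caller.

def build_philosopher_query(limit: int = 10) -> str:
    """Return a SPARQL query for philosophers, with labels and birth dates."""
    return f"""
    SELECT ?person ?personLabel ?birth WHERE {{
      ?person wdt:P106 wd:Q4964182 .           # occupation: philosopher
      OPTIONAL {{ ?person wdt:P569 ?birth . }} # date of birth, if recorded
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

query = build_philosopher_query(limit=5)
print(query)
```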
Just finishing writing up a potential IACAP paper, then I'll turn my attention to corpus-DB. Keep prodding me; I assure you I am very interested, and I've been following your work for a number of years.
I hadn't seen CITE or CTS URN before, this is awesome. Feel free to open an issue in the corpus-db repo about integrating it or parsing CTS URNs in the API. For instance, corpus-db.org/api/cts-urn/<cts-urn-here>.
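To make the idea concrete, here's a rough sketch of how such an endpoint might parse a CTS URN. The field layout follows the general shape of the CTS URN spec (`urn:cts:<namespace>:<textgroup>.<work>[.<version>]:<passage>`), but the endpoint itself and the class/function names are assumptions, not existing corpus-db API:

```python
# Hypothetical parser for a corpus-db.org/api/cts-urn/<urn> endpoint.
# Follows the general CTS URN shape:
#   urn:cts:<namespace>:<textgroup>.<work>[.<version>][:<passage>]
from dataclasses import dataclass
from typing import Optional

@dataclass
class CtsUrn:
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None
    passage: Optional[str] = None

def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN string into its components, raising on malformed input."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"Not a CTS URN: {urn!r}")
    work_parts = parts[3].split(".")  # textgroup.work.version
    return CtsUrn(
        namespace=parts[2],
        textgroup=work_parts[0],
        work=work_parts[1] if len(work_parts) > 1 else None,
        version=work_parts[2] if len(work_parts) > 2 else None,
        passage=parts[4] if len(parts) > 4 else None,
    )

# e.g. a URN in the style CITE uses for Classics texts:
urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")
```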
Their work arose out of building a corpus of Classics texts and the need for a generic notation scheme. Let's not reinvent the wheel!
I'd done the Wikipedia data parsing using DBpedia and SPARQL queries. It doesn't do the best job, since it relies on title/author matching, and isn't that fuzzy. As a result, I only got Wikipedia data for about 1.5K of the total 45K Gutenberg volumes. If you can improve on the code in any way, it'd be much appreciated by everyone, I'm sure.
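One way to make the matching fuzzier than exact string comparison is the standard library's `difflib`. This is just a sketch under assumed thresholds and normalization; real Gutenberg/DBpedia records would need more cleaning:

```python
# Sketch: fuzzy title/author matching using only the stdlib.
# The 0.85 threshold and the normalization are assumptions to tune.
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase and drop punctuation so superficial differences don't count."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True if two titles/authors are close enough to count as a match."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Exact matching would miss the first pair; fuzzy matching catches it:
print(similar("Moby-Dick; or, The Whale", "Moby Dick, or The Whale"))  # True
print(similar("Walden", "The Picture of Dorian Gray"))                 # False
```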
I was able to match 13,000 philosophers, or authors who have written major philosophical works, using name matching, metadata matching, and some eyeballing. I have done the same for only about 200 texts so far, but I think I could do it for far more.
Agreed. Okay, talk to you anon. Over 'n out!
Awesome. OK! Thanks for dropping in.
@JonathanReeve: This sounds like a fascinating project. I'm looking for a Project Gutenberg metadata database in order to analyze it using Tableau. How large are the SQLite databases? I'd be potentially willing to cover the bandwidth to transfer them.
Hi @matthewmmiller_twitter! Welcome. Sorry for the late response. I apparently don't have notifications turned on for Gitter yet. The databases are about 16GB without the full text search indices (FTS5), and around 33GB with them. I'd be happy to give you a copy, if we can figure out a way to transfer it. Maybe I can put the 16GB version up on a BitTorrent seedbox, so that way it'd be available to anyone.
If you offer it to the public, you might want to square away the legal end of it, though. I've scrubbed all the licenses from the text files themselves, since these throw off the text analysis, but PG might require you to distribute the licenses along with the texts.