These are chat archives for cltk/cltk

18th
Jun 2016
Rob Jenson
@ferthalangur
Jun 18 2016 13:59
Re: JOSS ... In my opinion, having a little more text in the entries would be helpful. Not necessarily a whole scholarly paper about the software, but something standalone at a high level that talks about the goals, problem description, key people, etc. It might be slightly redundant with what is stashed in the GitHub repo, but it is not always clear to the non-technical reader how to navigate that.
@kylepjohnson ewww ... putting DOIs on individual point releases is a misunderstanding of what persistent identifiers are intended for, IMHO. A DOI should refer to a semi-permanent intellectual object, not a transient instantiation of that object.
James Tauber
@jtauber
Jun 18 2016 14:15
I think one reason for per-release DOIs is reproducibility. But I agree (and it was one of my early questions about Zenodo) that we really need a single DOI for a project, not (just) a point-in-time snapshot of that project
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 15:07
@darkthirst hello!
You can start a conversation here
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 15:22
@jtauber and @ferthalangur you both see the issues with the per-release DOI. I'm comfortable with the way Zenodo does it (for reason of reproducibility) and willing to accept the downside of making the DOI less trackable
Patrick J. Burns
@diyclassics
Jun 18 2016 15:31
re: reproducibility, should we add something to the docs about installing older versions of the CLTK with this in mind?
Maybe not as much of an issue now, but more so over time with more releases and (hopefully!) published research based on CLTK data/methods.
James Tauber
@jtauber
Jun 18 2016 15:46
the way I'd like to see it work is a DOI for each release AND overall project so you can cite either a release or the overall project
@diyclassics the releases on PyPI should be sufficient for that but no harm in making the documentation explicit about it
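To make the PyPI route concrete, here is a minimal Python sketch of recording which release a piece of research used, so that the same release can be reinstalled later with pip's `==` version specifier. The helper names are hypothetical and not part of CLTK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def pinned_requirement(package, release):
    """Format an exact-version requirement line that pip understands."""
    return f"{package}=={release}"


if __name__ == "__main__":
    found = installed_version("cltk")
    if found is None:
        print("cltk is not installed in this environment")
    else:
        # Keep this line in a requirements file alongside the research.
        print(pinned_requirement("cltk", found))
```

The printed line can be saved to a requirements file and later passed to `pip install -r` to get back to the same release.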
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 15:48
I'll add old version install to the docs today. That is a smart idea. James is right, though it isn't totally obvious
@jtauber the multiple DOI is really really compelling ... I'll look into other projects that do this (if they do) ... Anything you want to share or find is welcome
James Tauber
@jtauber
Jun 18 2016 15:50
for it to work, any dependencies in CLTK need to be pinned to a particular version too (which I think they are)
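As a rough illustration of the pinning point, the sketch below scans the text of a requirements file and reports entries that lack an exact `==` pin. The function name and the sample requirements are hypothetical (this is not CLTK's actual file), and the parsing is deliberately simple.

```python
def unpinned_requirements(requirements_text):
    """Return requirement lines that lack an exact `==` pin."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            loose.append(line)
    return loose


# Hypothetical requirements used only for illustration.
example = """
nltk==3.2.1
regex
requests>=2.0
"""
print(unpinned_requirements(example))
```

Anything the function returns could resolve to a different version on a later install, which is what undermines reproducibility.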
@kylepjohnson I asked Zenodo when their github integration first came out but never got a response from them. Other parties may mint DOIs for entire projects. Perhaps GitHub should just do it themselves
Given Arfon Smith works at GitHub it might not be outside the realm of possibility
I guess the other solution is to "publish" the project as a whole in some JOSS-style way
James Tauber
@jtauber
Jun 18 2016 15:55
NLTK itself asks people to cite the O'Reilly book
so perhaps you need to write a book :-)
Patrick J. Burns
@diyclassics
Jun 18 2016 16:03
As for the docs—I meant to say, not just instructions on how to install old versions, but an introductory comment that refers to reproducibility as a reason why you would want to install one.
James Tauber
@jtauber
Jun 18 2016 16:04
+1
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:09
@jtauber do you know Arfon Smith? I was connected with him a few months back, to talk about other DH ideas
Ha, I myself am not writing a book anytime soon … I do think we should collaborate on something
James Tauber
@jtauber
Jun 18 2016 16:10
I only know of him (through Zooniverse and then JOSS)
so one potential publishable collaboration relevant to these discussions would be something about applying scientific computing trends around openness and reproducibility to DH
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:13
We have a number of people writing tutorials here and there … over time, I think there could be enough to constitute a "book" or something publish-able
James Tauber
@jtauber
Jun 18 2016 16:14
yeah, an edited volume would be excellent!
Patrick J. Burns
@diyclassics
Jun 18 2016 16:14
I expect to write more. Interestingly (for this conversation), I've been waiting until the next official release to write blog posts so they won't break immediately with the next round of changes.
James Tauber
@jtauber
Jun 18 2016 16:15
:-)
Patrick J. Burns
@diyclassics
Jun 18 2016 16:15
i might be overthinking it
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:15
@james I haven't always pinned my releases, but have begun doing so
Patrick, I can make a release now if you want!
Patrick J. Burns
@diyclassics
Jun 18 2016 16:15
wait until I'm back from CHS
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:16
ok, just give me the word
Patrick J. Burns
@diyclassics
Jun 18 2016 16:16
One of the reasons I spend some time on easily loading the Latin Library with PlaintextCorpusReader is to easily set up tutorials.
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:17
Smart.
Patrick J. Burns
@diyclassics
Jun 18 2016 16:17
I'll let you know. (Other important CLTK work to do as well, as you know!)
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:18
James, what are you working on these days? I saw your Septuagint syntax book. Really fascinating stuff, I know only a little
@ferthalangur I missed your earlier comment about JOSS – that's a good idea. I'm going to look into this today
James Tauber
@jtauber
Jun 18 2016 16:20
still mostly working on my morphological lexicon (see jktauber.com for a blog post series I've just started about modelling principal parts)
besides that blog post series, I'm trying to clean up more of my code and data so others can use it
Patrick J. Burns
@diyclassics
Jun 18 2016 16:24
Following those posts with great interest, @jtauber.
James Tauber
@jtauber
Jun 18 2016 16:25
I too often code ahead of my writing so this blog series is designed to force me to stop coding and go back and describe what I'm doing :-)
Patrick J. Burns
@diyclassics
Jun 18 2016 16:26
A practice worth considering/imitating.
James Tauber
@jtauber
Jun 18 2016 16:26
which will in turn hopefully make it easier to write journal articles / conference presentations about this stuff later :-)
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:27
Agreed, James, these posts are terrific
James Tauber
@jtauber
Jun 18 2016 16:28
this afternoon I'm headed to a morphology conference in Lyon which I'm sure will generate a ton more ideas
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:28
Your docs of the JOSS process are really helpful.
James Tauber
@jtauber
Jun 18 2016 16:28
(I should say, this afternoon, evening and tomorrow morning :-) )
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:28
Quick question for you James, and you too Patrick –
I have met an undergrad (Jack, whom you met, Patrick) who is interested in Greek phonetics
James Tauber
@jtauber
Jun 18 2016 16:30
including phonology as well?
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:30
I had the idea that, for an intro project, he could make a map of Greek orthography to historical IPA transcription
yes, I think so … not sure I remember the distinction
(a) Do you think this would be valuable – to take a reconstruction and make a dict of {<greek_letters>: <IPA_string>}
(b) other phonetic/phonological projects that would be good for the cltk?
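For a sense of what such a dict might look like, here is a small Python sketch. The entries are an illustrative subset loosely following a common Classical Attic reconstruction, not a complete or authoritative table, and the `transcribe` helper is hypothetical. It matches longer keys first so that multi-letter entries could be added later.

```python
# Illustrative subset only: a real table would need vowels, diphthongs,
# diacritics, and a decision about which period is being reconstructed.
GREEK_TO_IPA = {
    "α": "a",
    "β": "b",
    "δ": "d",
    "θ": "tʰ",
    "κ": "k",
    "λ": "l",
    "μ": "m",
    "ν": "n",
    "π": "p",
    "σ": "s",
    "ς": "s",
    "τ": "t",
    "φ": "pʰ",
    "χ": "kʰ",
}


def transcribe(text, mapping=GREEK_TO_IPA):
    """Replace mapped Greek letters with IPA; pass other characters through."""
    keys = sorted(mapping, key=len, reverse=True)  # longest match first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unmapped character is left unchanged
            i += 1
    return "".join(out)


print(transcribe("καλα"))
```

Letters missing from the table are left as they are, which makes gaps in the mapping easy to spot in the output.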
James Tauber
@jtauber
Jun 18 2016 16:32
sounds like a great project, actually and one I'd love to help out on
Patrick J. Burns
@diyclassics
Jun 18 2016 16:32
I can ask him today.
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:33
OK! I will put you in touch with him. Smart kid, enthusiastic. He has an internship this summer at Harvard's Center for Hellenic Studies
Thank you, Patrick, please do. Any advice you have for Jack, please share with him.
Patrick J. Burns
@diyclassics
Jun 18 2016 16:34
And interested (actually, already using CLTK for some work)
James Tauber
@jtauber
Jun 18 2016 16:34
to be clear, there are plenty of people that know a LOT more about phonetics, phonology, and greek phonetics / phonology than I do but I know enough to be dangerous, especially if coding is involved :-)
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:35
Cool. And you know tons more than I do, so you're the expert here by a mile :D
James Tauber
@jtauber
Jun 18 2016 16:36
there are some interesting follow-on projects for later too, like looking for patterns in spelling variations
(as that tends to be a major source of evidence we have for sound changes in Hellenistic times)
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 16:38
Yes, I can imagine some great follow ups
I forget if I've mentioned this – I work with a PhD student doing ML + phonetics, where he mines pattern data out of IPA transcriptions
Anyways, he got me excited about the possibilities for this …
James Tauber
@jtauber
Jun 18 2016 16:40
very cool
Rob Jenson
@ferthalangur
Jun 18 2016 16:54

the way I'd like to see it work is a DOI for each release AND overall project so you can cite either a release or the overall project

That would work for me, at least “significant” releases. I realize that’s a really subjective term, but presumably, if someone were to write a paper about, just for example, improvements in algorithms used in two later releases and you wanted to point to a persistent ID for each release, you’d want a DOI or ARK for each.

Nelson Liu
@nelson-liu
Jun 18 2016 16:57
I think it makes a lot more sense for DOI to be only release based. If you want to make your research replicable, it’s important that people trying to replicate your results know exactly what version of whatever open source software you were using. Hence why a DOI needs to be assigned for every little incremental change.
James Tauber
@jtauber
Jun 18 2016 16:59
you could have it both ways, though, and that's what I'd like to see
like GitHub URLs support referring to a repo, a tag or a specific commit, it's possible for identifiers to operate at different levels
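The three levels mentioned here can be sketched in Python as URL builders. The function names are hypothetical, and any tag or hash passed in would be a placeholder supplied by the caller.

```python
BASE = "https://github.com"


def repo_url(owner, repo):
    """URL for the project as a whole."""
    return f"{BASE}/{owner}/{repo}"


def tag_url(owner, repo, tag):
    """URL for the tree at a tagged release."""
    return f"{repo_url(owner, repo)}/tree/{tag}"


def commit_url(owner, repo, sha):
    """URL for one specific commit."""
    return f"{repo_url(owner, repo)}/commit/{sha}"


print(repo_url("cltk", "cltk"))
```

A project-level identifier and a release-level identifier would play roles similar to the first two functions, with the commit URL being the most precise pointer.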
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 20:06
Gentlemen, for your review, some instructions I wrote for installing previous versions. Any ideas or PR welcome
Note I rm'd the dev-requirements.txt … logic being that anyone building from source is what I'd consider a dev
Patrick J. Burns
@diyclassics
Jun 18 2016 20:20
Looks good. Helpful feature.
Related—can anybody recommend a how-to on research design when git repos are involved?
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 20:41
You mean what kind of files should go into a git repo? If so: plain text is best, binary executables and databases are terrible; other file types in the middle (jpeg, pdf, etc.). gitattributes options can help, but I've never used them
Patrick J. Burns
@diyclassics
Jun 18 2016 20:43
I was thinking more like the opposite perspective of what you added to the docs—say, someone is going to set up a new project for a scientific study
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 20:45
ahh
but with explicit discussion of existing repos
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 20:47
That's a good question. A few things that would certainly be helpful: (a) mark commit hashes (which you can get through git log); (b) tag versions (git tag or through GitHub (what I do))
Something I could improve is the getting of a user's current hash for a given data repo.
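One possible way to get at the current hash of a cloned data repo, sketched in Python: read the `.git` directory directly rather than shelling out to git. This is an assumption about how it could be done, not how CLTK does it. The function name is hypothetical, and the sketch covers only ordinary clones where `.git` is a directory.

```python
from pathlib import Path


def current_commit_hash(repo_dir):
    """Return the commit hash that HEAD points to in a cloned repo."""
    git_dir = Path(repo_dir) / ".git"
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds the hash itself
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    if ref_file.exists():
        return ref_file.read_text().strip()  # loose ref file
    packed = git_dir / "packed-refs"
    if packed.exists():
        for line in packed.read_text().splitlines():
            if line.startswith(("#", "^")):
                continue  # header comment or peeled-tag line
            sha, _, name = line.partition(" ")
            if name == ref:
                return sha
    raise LookupError(f"could not resolve {ref} in {git_dir}")
```

Calling it with the path to a local clone returns the same value that `git rev-parse HEAD` would print there, which could then be noted alongside research results.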
Patrick, what kind of repo are you thinking of making? Something for academic work?
as in an application or collection of scripts?
Patrick J. Burns
@diyclassics
Jun 18 2016 20:52
Don't get me wrong—I think what you've included is great. But it got me wondering how others work with different versions.
nothing specific, just thinking through the problem
but let's say I was going to work on a data-driven article, how should I set the project up and how should I best make use of/document what you wrote up today
nothing pressing!
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 21:03
I gotcha. It's an open field, I think. James's point above about different types of unique identifiers is relevant here (for example, a GitHub URL sufficing in place of a DOI). This reminds me of what Sebastian Heath has talked about for years … the key concept is getting others back to the same thing you're looking at.
Patrick J. Burns
@diyclassics
Jun 18 2016 21:15
Another blog post to write this summer…
Kyle P. Johnson
@kylepjohnson
Jun 18 2016 21:30
Really! I'd love to read something like a "Digital Humanities checklist" of DOs and DON'Ts