Ben J. Ward
@BenJWard
I've just been pushing forward on the new v2 api, which I might make a branch on BioJulia/BioSequences.jl soon.
Jakob Nybo Nissen
@jakobnissen
Cool. Anything that interferes with this implementation? Otherwise I'll take a look at it.
Ben J. Ward
@BenJWard
The long and short of it is:
  • The abstract Sequence type is now called BioSequence{A}.
  • The type previously called BioSequence{A} is now LongSequence{A}
  • Kmers are now a parametric type: Skipmer{U<:Unsigned, A<:Alphabet, M, N, K}. A Skipmer is a more generic type of Kmer, where in addition to K, you have two other numeric parameters M and N. They are more fully described here: https://www.biorxiv.org/content/early/2017/09/19/179960.full.pdf+html
The U parameter in the Skipmer type then allows you to pick the size of the UInt that stores the nucleotides.
There are aliases defined so you can use say DNAKmer{27} in your scripts and such as shorthand and it will work.
Jakob Nybo Nissen
@jakobnissen
Neat! Are there any real advantages for the U parameter though? People usually don't store Kmers directly.
Ben J. Ward
@BenJWard
In general a Kmer is a Skipmer where M == 1 && N == 1
I'm not sure if other people will use U, but I know in my group we like to use both UInt64-sized kmers and UInt128-sized kmers and compare results, as there's a tradeoff between number of kmers and specificity/uniqueness as you increase or decrease the length.
And we've found the 64bit limit quite critical in some cases.
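A minimal sketch of the type layout described above, for readers following along. The type names are taken from the messages, but the definitions, the `bits` field and the `capacity` helper are illustrative assumptions, not the BioSequences.jl source. It assumes a 2-bit DNA alphabet, so an unsigned integer of `n` bits holds at most `n ÷ 2` nucleotides.

```julia
# Hypothetical stand-ins; these are NOT the BioSequences.jl definitions.
abstract type Alphabet end
struct DNAAlphabet{N} <: Alphabet end   # N = bits per symbol

# The abstract sequence type, parametrised by alphabet.
abstract type BioSequence{A<:Alphabet} end

# U: unsigned integer holding the packed symbols; M, N, K: skipmer shape.
struct Skipmer{U<:Unsigned, A<:Alphabet, M, N, K} <: BioSequence{A}
    bits::U
end

# A k-mer is the special case M == 1 && N == 1.
const Kmer{U<:Unsigned, A<:Alphabet, K} = Skipmer{U, A, 1, 1, K}

# Shorthand alias so that scripts can write DNAKmer{27}.
const DNAKmer{K} = Kmer{UInt64, DNAAlphabet{2}, K}

# Largest K that fits in U at 2 bits per nucleotide (assumed encoding).
capacity(::Type{U}) where {U<:Unsigned} = (8 * sizeof(U)) ÷ 2
```

Under these assumptions `capacity(UInt64)` is 32 and `capacity(UInt128)` is 64, which is the 64-bit limit being discussed.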
Jakob Nybo Nissen
@jakobnissen
Fair enough. Assemblers typically use Kmers up to K=127, too.
Also, there's probably zero performance penalty with having the U parameter, so why not.
Ben J. Ward
@BenJWard
I think I'm going to do a PR of the type renames now, and then subsequently add PRs for the "EachSkipmerIterator" I've been working on, a Codon type, and some other code refactors and tidying up afterwards.
Your kmer iteration methods I think will remain, as for Kmers, I'm hard pressed to come up with something neater or faster, so we can dispatch to your iterators for Kmers where M and N == 1, and then the EachSkipmerIterator I've written for other generic skipmers.
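The dispatch idea in the last message can be sketched with plain type parameters. `ToySkipmer` and `iterator_kind` are hypothetical names used only for illustration; the real iterators are not shown here.

```julia
# Hypothetical, stripped-down type carrying only the shape parameters.
struct ToySkipmer{M, N, K} end

# More specific method: chosen when M == 1 and N == 1, i.e. for plain k-mers.
iterator_kind(::Type{ToySkipmer{1, 1, K}}) where {K} = :kmer_iterator

# Generic fallback for every other skipmer shape.
iterator_kind(::Type{ToySkipmer{M, N, K}}) where {M, N, K} = :skipmer_iterator
```

Because the choice is made on type parameters, it is resolved by method dispatch rather than by a runtime branch on M and N.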
Ben J. Ward
@BenJWard
I've opened here: BioJulia/BioSequences.jl#53
Jakob Nybo Nissen
@jakobnissen
Turns out that no, it's not possible to create a macro pair similar to @boundscheck/@inbounds.
Ben J. Ward
@BenJWard
Is that because it needs some special support from the interpreter?
Jakob Nybo Nissen
@jakobnissen
Ben J. Ward
@BenJWard
Maybe we can do it with an API then
Passing Val{true} is described
Jakob Nybo Nissen
@jakobnissen
I'd just export unsafe_setindex!.
Ben J. Ward
@BenJWard
or we could simply have unsafe_* version of the methods that edit any sequences.
Jakob Nybo Nissen
@jakobnissen
Yep, that's neater I think. Probably add a small docstring saying that it skips all boundschecks and copy-on-write checks :)
Ben J. Ward
@BenJWard
I think this is smart
I'll make that the next PR after the current one then!
Jakob Nybo Nissen
@jakobnissen
Hm, what's the intended use of CharSequence? It seems like it's a complicated String.
Ben J. Ward
@BenJWard
Pretty much! :laughing:
I think it's just to show that with BioSequences you could store any element type, using an alphabet.
I think each char is stored in 8 bits, and so if you were doing bitops you could do things more quickly... what kind of things I've no idea
CharSequences are also mutable which String is not
But really it's something that is there, but isn't really used much right now.
I do envisage other alphabets in the future to store, for example a sequence of SNPs and that sort of thing as BioSequences would let you do that, but again there's no need or call for it right now.
Ben J. Ward
@BenJWard
@jakobnissen Do you think there's any merit in separating the process into three stages (boundscheck -> cowcheck -> unsafe operation), or do you think two steps (bounds & cow-check -> unsafe operation) is enough?
Jakob Nybo Nissen
@jakobnissen
I don't think the user cares that much about which internal checks BioSequences is doing. They probably just want a safe option (default) and an unsafe but fast option (where the documentation clearly states exactly which precautions they should take). However, I don't think it's a good idea to make the COW-check part of the boundscheck so it can be skipped with @inbounds, because the user might reasonably assume that @inbounds only skips a boundscheck and so might use it wrong.
One of the best things about the COW-principle is that the user benefits without even having to know it's there.
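A rough sketch of the two-step design discussed here: a safe default that does the bounds check and the copy-on-write check, then calls an unsafe method that does neither. `ToySeq`, its `shared` flag and `orphan!` are assumptions made up for this example; only the name `unsafe_setindex!` comes from the conversation, and this is not how LongSequence is implemented.

```julia
# Hypothetical toy type; not the LongSequence implementation.
mutable struct ToySeq
    data::Vector{UInt8}
    shared::Bool   # true while `data` may be aliased by another ToySeq
end

ToySeq(s::String) = ToySeq(collect(codeunits(s)), false)

# Cheap copy: both sequences point at the same buffer and are marked shared.
function Base.copy(seq::ToySeq)
    seq.shared = true
    return ToySeq(seq.data, true)
end

# Copy-on-write check: take a private buffer before the first mutation.
function orphan!(seq::ToySeq)
    if seq.shared
        seq.data = copy(seq.data)
        seq.shared = false
    end
    return seq
end

"""
    unsafe_setindex!(seq, x, i)

Write `x` at position `i`. Skips the bounds check AND the copy-on-write check.
"""
function unsafe_setindex!(seq::ToySeq, x::UInt8, i::Integer)
    @inbounds seq.data[i] = x
    return seq
end

# Safe default: check bounds and ownership, then do the unsafe operation.
function Base.setindex!(seq::ToySeq, x::UInt8, i::Integer)
    checkbounds(seq.data, i)
    orphan!(seq)
    return unsafe_setindex!(seq, x, i)
end
```

With `b = copy(a)`, the safe `b[1] = UInt8('T')` leaves `a` untouched, whereas calling `unsafe_setindex!` on `b` before any safe write would modify the buffer both sequences share. That is the precaution the docstring would need to spell out.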
Jakob Nybo Nissen
@jakobnissen
Just a heads-up: AminoAcidAlphabet()[1] doesn't work currently on the v2 branch. You need to convert i - 1 to UInt8 first. Same with the NucleicAcidAlphabet{4} subtypes.
Ben J. Ward
@BenJWard
Good catch!
Jakob Nybo Nissen
@jakobnissen
With the code coverage bot currently being untrustworthy, is there any way to figure out how many lines in your code you've written tests for?
Jakob Nybo Nissen
@jakobnissen
@BenJWard I get a deprecation warning when testing BioSequences due to a function in Twiddle:
┌ Warning: repeatbyte is deprecated, use repeatpattern instead
│ caller = count_01_bitpairs at counting.jl:79 [inlined]
└ @ Core ~/.julia/packages/Twiddle/MbfXo/src/counting.jl:79
Ben J. Ward
@BenJWard
@jakobnissen Yes I'm aware of that one, I think it's my fault, it's something in the bit-parallel site counting code I developed last year. I found a way to make some generic methods for bit-twiddling that work for all Unsigned sizes, with zero cost to performance, so I put them in Twiddle and also renamed some of them. I'll make a PR fixing this.
Guillermo Luque
@gluque
Hello there! I've installed Julia 1.3.0 but when installing BioSequences, it uses the 1.1 instead of the 2.0 version, do you know how to "force" the 2.0 version when installing?
Ben J. Ward
@BenJWard

@gluque Hiya,

Have you added the "BioJulia package registry" to your julia setup?

registry add https://github.com/BioJulia/BioJuliaRegistry.git
v2.0 of BioSequences, FASTX, Pseudoseq, and the newer packages (or newer versions of old packages) are being released through the "BioJulia package registry" now instead of the "General" registry.
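The same steps can also be scripted through the Pkg API instead of the Pkg REPL. This is a sketch under assumptions: the helper name is hypothetical, it needs network access, and it only works while the registry URL quoted above is still being served.

```julia
using Pkg

# Hypothetical helper name; the calls inside are the standard Pkg API.
function add_biojulia_registry_and_install()
    # Equivalent of `registry add <url>` in the Pkg REPL.
    Pkg.Registry.add(Pkg.RegistrySpec(url = "https://github.com/BioJulia/BioJuliaRegistry.git"))
    # Ask the resolver for a 2.0.x release of BioSequences.
    Pkg.add(Pkg.PackageSpec(name = "BioSequences", version = "2.0"))
end
```

Running `Pkg.status("BioSequences")` afterwards shows which version the resolver actually picked. If another installed package caps BioSequences at 1.x, the `Pkg.add` call will report the conflict rather than install 2.0.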
Guillermo Luque
@gluque
Hello @BenJWard Thanks! It is working after updating the registry.
Ciarán O'Mara
@CiaranOMara
@BenJWard and @jakobnissen, have you guys seen https://github.com/tkf/BenchmarkCI.jl ?
Ghost
@ghost~5ede2c26d73408ce4fe64346
Hello! I understand that I should use FASTX for parsing FASTA files but how do I convert FASTX's FASTA records into a BioSequences type? I'm having a hard time finding this in the documentation.
Ghost
@ghost~5ede2c26d73408ce4fe64346
I found that I can use BioSequences.sequence(FASTX.FASTA.Record) to do this, in case anyone else is wondering the same thing.
Ben J. Ward
@BenJWard
Yep that's right. In any upcoming releases this method may be deprecated in favour of getting the user to simply use a constructor e.g. LongSequence{DNAAlphabet{2}}(record).
Ghost
@ghost~5ede2c26d73408ce4fe64346
Another question: I have Julia installed on two separate computers. On my Linux machine, I have the latest version of BioSequences.jl but my MacBook has v1.1.0. I am running update but it doesn't do anything. I have Julia v1.4 installed on both machines. I am updating from the general registry on both machines and I shouldn't have to add the BioJulia registry anymore if I understand things correctly.
Ghost
@ghost~5ede2c26d73408ce4fe64346
Solved. I removed ~/.julia/registries/General, then I ran ]registry update General, and then I reinstalled all of my bio-related packages. I think the problem was that BioTools.jl depends on an older version of BioSequences.jl. Sorry for spamming the chat!