Jakob Nybo Nissen
@jakobnissen
I'll look into it, but I don't yet know how @inbounds works (I'd think @nonshared should use that as a template)
Ben J. Ward
@BenJWard
Yeah I'm not sure how it works either. I once thought of a generic macro pair, "@safetycheck" and "@assumesafe", which like "@boundscheck" and "@inbounds" would allow you to mark a function as being some kind of check for validity of assumptions and so on, but if you absolutely know you pass the check, then you could just waive the cost by using "@assumesafe".
The other way of doing it would be to strongly separate slicing methods vs view methods, and add a view type.
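For readers unfamiliar with the built-in pair being used as the template here, this is a minimal sketch of how @boundscheck and @inbounds cooperate in plain Julia. The names checked_get, safe_first and fast_first are hypothetical and are not part of BioSequences.

```julia
# Hypothetical helper: the check is wrapped in @boundscheck so that a caller
# who uses @inbounds may have it elided once this function is inlined.
@inline function checked_get(v::Vector{Int}, i::Int)
    @boundscheck checkbounds(v, i)  # throws BoundsError if i is out of range
    return @inbounds v[i]           # the raw access itself skips the check
end

# Default: the bounds check runs.
safe_first(v) = checked_get(v, 1)

# Opt-out: the caller promises the index is valid, so the compiler is
# allowed (not required) to drop the @boundscheck block.
fast_first(v) = @inbounds checked_get(v, 1)
```

Whether the check is actually removed depends on inlining and on flags such as --check-bounds, so the unchecked path should only be used when the index is known to be valid.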
Jakob Nybo Nissen
@jakobnissen
Hm, although I guess there already is unsafe_setindex! which basically does that. It's just hidden in the source code.
Ben J. Ward
@BenJWard
But I think this would be in a way less nice
Jakob Nybo Nissen
@jakobnissen
I agree, the COW-principle is much nicer.
Ben J. Ward
@BenJWard
There's a lot of slicing code in bioinformatics scripts I've seen written in Python and R and so on. People extracting a gene or doing something else with a large sequence "just do it", so it makes sense that BioSequences goes to all the COW trouble for them, to stop their scripts copying redundant data all over the place.
unsafe_setindex! does, but it would be nice if we had a more generic mechanism that uses macros and can do it for all mutating functions.
I've just been pushing forward on the new v2 api, which I might make a branch on BioJulia/BioSequences.jl soon.
Jakob Nybo Nissen
@jakobnissen
Cool. Anything that interferes with this implementation? Otherwise I'll take a look at it.
Ben J. Ward
@BenJWard
The long and short of it is:
  • The abstract Sequence type is now called BioSequence{A}.
  • The type previously called BioSequence{A} is now LongSequence{A}
  • Kmers are now a parametric type: Skipmer{U<:Unsigned, A<:Alphabet, M, N, K}. A Skipmer is a more generic type of Kmer, where in addition to K, you have two other numeric parameters M and N. They are more fully described here: https://www.biorxiv.org/content/early/2017/09/19/179960.full.pdf+html
The U parameter in the Skipmer type then allows you to pick the size of the UInt that stores the nucleotides.
There are aliases defined so you can use say DNAKmer{27} in your scripts and such as shorthand and it will work.
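To make the type layout above concrete, here is a rough sketch of how such a parametric type and its aliases could be declared. Only the names Skipmer, Alphabet and DNAKmer and the parameter order come from the description above; the field name bits, the DNAAlphabet{2} parameter, the default of UInt64 in DNAKmer, and the helper max_k are assumptions made for illustration, not the actual v2 definitions.

```julia
abstract type Alphabet end
struct DNAAlphabet{N} <: Alphabet end  # N = bits per symbol (assumed)

# U: unsigned integer holding the packed symbols, A: alphabet,
# M and N: skipmer parameters, K: number of symbols.
struct Skipmer{U<:Unsigned, A<:Alphabet, M, N, K}
    bits::U  # hypothetical field name
end

# A Kmer is a Skipmer with M == 1 and N == 1.
const Kmer{U<:Unsigned, A<:Alphabet, K} = Skipmer{U, A, 1, 1, K}

# Shorthand alias; choosing UInt64 as the default storage is an assumption.
const DNAKmer{K} = Kmer{UInt64, DNAAlphabet{2}, K}

# Largest K that fits in U, assuming 2 bits per nucleotide.
max_k(::Type{U}) where {U<:Unsigned} = (8 * sizeof(U)) ÷ 2
```

Under these assumptions DNAKmer{27} is just another name for Skipmer{UInt64, DNAAlphabet{2}, 1, 1, 27}, and switching U from UInt64 to UInt128 doubles the largest K that fits.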
Jakob Nybo Nissen
@jakobnissen
Neat! Are there any real advantages for the U parameter though? People usually don't store Kmers directly.
Ben J. Ward
@BenJWard
In general a Kmer is a Skipmer where M == 1 && N == 1
I'm not sure if other people will use U, but I know in my group we like to use both UInt64-sized kmers and UInt128-sized kmers and compare results, as there's a tradeoff between number of kmers and specificity/uniqueness as you increase or decrease the length.
And we've found the 64bit limit quite critical in some cases.
Jakob Nybo Nissen
@jakobnissen
Fair enough. Assemblers typically use Kmers up to K=127, too.
Also, there's probably zero performance penalty with having the U parameter, so why not.
Ben J. Ward
@BenJWard
I think I'm going to do a PR of the type renames now, and then subsequently add PRs for the "EachSkipmerIterator" I've been working on, a Codon type, and some other code refactors and tidying up afterwards.
Your kmer iteration methods I think will remain. For Kmers, I'm hard pressed to come up with something neater or faster, so we can dispatch to your iterators for Kmers where M and N == 1, and to the EachSkipmerIterator I've written for other, more generic skipmers.
Ben J. Ward
@BenJWard
I've opened here: BioJulia/BioSequences.jl#53
Jakob Nybo Nissen
@jakobnissen
Turns out that no, it's not possible to create a macro pair similar to @boundscheck/@inbounds.
Ben J. Ward
@BenJWard
Is that because it needs some special support from the interpreter?
Jakob Nybo Nissen
@jakobnissen
Ben J. Ward
@BenJWard
Maybe we can do it with an API then
Passing Val{true} is described
Jakob Nybo Nissen
@jakobnissen
I'd just export unsafe_setindex!.
Ben J. Ward
@BenJWard
Or we could simply have unsafe_* versions of the methods that edit any sequences.
Jakob Nybo Nissen
@jakobnissen
Yep, that's neater I think. Probably add a small docstring saying that it skips all boundschecks and copy-on-write checks :)
Ben J. Ward
@BenJWard
I think this is smart
I'll make that the next PR after the current one then!
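As a rough illustration of the safe/unsafe split agreed on above, here is a toy copy-on-write sequence. ToySeq and its fields are hypothetical and much simpler than the real BioSequences types; only the name unsafe_setindex! comes from the discussion.

```julia
# Toy sequence: `shared` marks that `data` may be aliased by another ToySeq.
mutable struct ToySeq
    data::Vector{UInt8}
    shared::Bool
end

"""
    unsafe_setindex!(s, x, i)

Write `x` at position `i` WITHOUT bounds checking and WITHOUT the
copy-on-write check. The caller must guarantee that `i` is valid and
that `s.data` is not shared with another sequence.
"""
function unsafe_setindex!(s::ToySeq, x::UInt8, i::Integer)
    @inbounds s.data[i] = x
    return s
end

# Safe default: bounds check, then copy the buffer if it is shared,
# then delegate to the unsafe method.
function Base.setindex!(s::ToySeq, x::UInt8, i::Integer)
    checkbounds(s.data, i)
    if s.shared
        s.data = copy(s.data)
        s.shared = false
    end
    return unsafe_setindex!(s, x, i)
end
```

With this layout, writing through `b[1] = 0x09` on a sequence that shares its buffer copies the buffer first and leaves the other sequence untouched, while unsafe_setindex! writes straight into whatever buffer it is given.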
Jakob Nybo Nissen
@jakobnissen
Hm, what's the intended use of CharSequence? It seems like it's a complicated String.
Ben J. Ward
@BenJWard
Pretty much! :laughing:
I think it's just to show that with BioSequences you could store any element type, using an alphabet.
I think each char is stored in 8 bits, and so if you were doing bitops you could do things more quickly... what kind of things I've no idea
CharSequences are also mutable which String is not
But really it's something that is there, but isn't really used much right now.
I do envisage other alphabets in the future to store, for example a sequence of SNPs and that sort of thing as BioSequences would let you do that, but again there's no need or call for it right now.
Ben J. Ward
@BenJWard
@jakobnissen Do you think there's any merit in separating the process into three stages (boundscheck -> cowcheck -> unsafe operation), or do you think two steps (bounds & cow-check -> unsafe operation) is enough?
Jakob Nybo Nissen
@jakobnissen
I don't think the user cares that much about which internal checks BioSequences is doing. They probably just want a safe option (default) and an unsafe but fast option (where the documentation clearly states exactly which precautions they should take). However, I don't think it's a good idea to make the COW check part of the bounds check so it can be skipped with @inbounds, because the user might reasonably assume that @inbounds only skips a bounds check and so might use it wrong.
One of the best things about the COW-principle is that the user benefits without even having to know it's there.
Jakob Nybo Nissen
@jakobnissen
Just a heads-up: AminoAcidAlphabet()[1] doesn't work currently on the v2 branch. You need to convert i - 1 to UInt8 first. Same with the NucleicAcidAlphabet{4} subtypes.
Ben J. Ward
@BenJWard
Good catch!
Jakob Nybo Nissen
@jakobnissen
With the code coverage bot currently being untrustworthy, is there any way to figure out how many lines in your code you've written tests for?
Jakob Nybo Nissen
@jakobnissen
@BenJWard I get a deprecation warning when testing BioSequences due to a function in Twiddle:
┌ Warning: repeatbyte is deprecated, use repeatpattern instead
│ caller = count_01_bitpairs at counting.jl:79 [inlined]
└ @ Core ~/.julia/packages/Twiddle/MbfXo/src/counting.jl:79
Ben J. Ward
@BenJWard
@jakobnissen Yes I'm aware of that one, I think it's my fault, it's something in the bit-parallel site counting code I developed last year. I found a way to make some generic methods for bit-twiddling that work for all Unsigned sizes, with zero cost to performance, so I put them in Twiddle and also renamed some of them. I'll make a PR fixing this.
Guillermo Luque
@gluque
Hello there! I've installed Julia 1.3.0 but when installing BioSequences, it uses the 1.1 instead of the 2.0 version, do you know how to "force" the 2.0 version when installing?
Ben J. Ward
@BenJWard

@gluque Hiya,

Have you added the "BioJulia package registry" to your julia setup?

registry add https://github.com/BioJulia/BioJuliaRegistry.git