Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Activity
    Jakob Nybo Nissen
    @jakobnissen
    So, I've finally found some time to hack on XAM.jl (the SAM/BAM parser etc). Probably won't be able to finish it before Easter. Anyway, I have question about API: SAM and BAM files can have missing fields. In BioAlignments, when querying a BAM for a missing value, a sentinel value is returned instead (e.g. a 0 for an index).
    I guess this is to avoid type instability. However, with union splitting, this is not really an issue anymore. So: What to return? missing seems obvious. But, that runs the risk of propagating through a chain of functions the users has, turning them all into missing. I'm sure that could lead to weird bugs.
    Currently, I've set them to nothing. I guess missing is more obvious, but nothing will almost immediately raise an error in most code which is more convenient.
    @BenJWard Thoughts?
    Ciarán O'Mara
    @CiaranOMara
    The user would need to opt into implementing Missing for propagation to occur. Passing the type Missing into a function that doesn't accept it will throw a MethodError.
    Ciarán O'Mara
    @CiaranOMara
    Without looking at any specifics, using Missing might lead to a neater implementation in that you would not need to include checks for nothings. You would only need to look at the final result to see if it were useful.
    Ciarán O'Mara
    @CiaranOMara
    I think the implementation of Missing is an optimistic approach that assumes the data is more likely available than not, which is proabably leads to quicker code as the checks and decision are not required.
    Ciarán O'Mara
    @CiaranOMara
    Also, it's sometimes useful to be able to count your missing values and see what they affect. Again, decisions about missing values can be defferred to the end. The user can filter them out or decide what else they might want to do with them.
    Jakob Nybo Nissen
    @jakobnissen
    I'm worried the user could very easily pass a missing to a function where its behaviour IS defined (such as arithmetic operations). For example, one might count coverage by getting the position of a read (which can be unavailable) and doing arithmetic on that.
    I'm fairly certain I'd have to check for missing as well as nothing in internal code, if I don't want the user to see confusing error messages.
    nothing and missing probably leads to equally fast code. If I want a nothing to be able to be returned, there'd be no check. If a returned nothing would wreck some internal XAM.jl function, and I'd have to put a check for that, then I'd probably also need a check for missing
    Jakob Nybo Nissen
    @jakobnissen
    I suppose it comes down to this: If you were a user and didn't read the documentation too dilligently, and expected an operation to return, say an Int:
    a) Would you be least surprised to find it returning a missing or a nothing, and
    b) Given the possibility of this value is completely unexpected, which one leads to the most horrible bugs for your code?
    It seems a) favors missing and b) favors nothing.
    Ciarán O'Mara
    @CiaranOMara
    @jakobnissen, you're correct about the coverage being able to accept missing and that missing there would wreak havock. However, that doesn't happen because the accessor functions for Record will throw a missing error exception.
    I notice that there is variations in the accessors for what constitutes as missing.
    Ciarán O'Mara
    @CiaranOMara
    The current implementation does not guard against direct access, and if directly accessed, it can return a variety of values that represent missing.
    Ciarán O'Mara
    @CiaranOMara
    I think this gets to the core of your original question. If the fields in Record were to represent missing values in a consistent fashion, which would be more favourable, missing or nothing?
    Ciarán O'Mara
    @CiaranOMara
    Would you say this is accurate?
    Ben J. Ward
    @BenJWard
    I would be happy for all Record types to return missing when the user accesses something which is not in the record, rather than throwing. We threw origionally as it seemed preferable to type instability, but julia supports nothing and missing very nicely now.
    I think for a user its perhaps easier for them to either: 1). propagate missing through their code if they don't care and they will find out at the end, or 2). Check for missing in their code because they are expecting it (either with an if or just using dispatch and union splitting). This might be more natural to them than using (an inefficient anyway) try catch method of trying to handle these missing fields.
    Ben J. Ward
    @BenJWard
    I think I might implement missing for FASTX too. And we can do this going forward with the major new versions of these packages, but keep BioAlignments.jl as it is for now.
    Jakob Nybo Nissen
    @jakobnissen
    Alright, I'll use missing. My repo for development is at https://github.com/jakobnissen/XAM.jl (can later be copied back into Bio-something.jl). Currently progressing at a glacial pace.
    Jakob Nybo Nissen
    @jakobnissen
    Damn, I just implemented BAM to SAM conversion. It's ~12 x slower than samtools view - and that is mostly due to the auxiliary data decoding and recoding, which I don't know how to speed up since it's inherently type unstable. That's really annoying.
    Ben J. Ward
    @BenJWard
    Without knowing exactly what you did, I imagine there's a way we can get around that, dealing with the raw UInt8 data, which is all htslib does anyway, encoding and decoding is pretty much the only reason BioJulia stuff was slower than htslib when we tested some years ago, so we made it so as encoding of fields is deferred for records.
    Is it too type unstable even for union splitting?
    Jakob Nybo Nissen
    @jakobnissen
    You're probably right that efficient BAM -> SAM conversion needs direct operations on the data array. What's causing the time is partly decoding the BAM record's fields and then re-encoding it, but mostly the auxiliary data. The auxiliary data is quite type unstable, too much for union splitting.
    Ben J. Ward
    @BenJWard
    Is aux data the same formatting for BAM as it is for SAM? With the key:type:value format thing?
    I'm wondering if we can just get away with copying the underlying data straight into a sam record or not.
    Jakob Nybo Nissen
    @jakobnissen
    Unfortunately, not quite the same. There are colons in SAM format (e.g. "NM:i:8"). Some types (all integers other than Int32) are not accepted in SAM, but are in BAM so must be converted. Lastly, arrays are represented in BAM directly bit-by-bit, whereas in SAM they must be converted to parsable text
    Ben J. Ward
    @BenJWard
    Yeah I think we're going to have to do all that using Vector{UInt8} instead of with strings then, to get the speed up. Inserting whatever the UInt8 version pf ':' is into appropriate places, converting all numbers to Int and so on.
    Jakob Nybo Nissen
    @jakobnissen
    Is that possible? How can one print integer arrays or integers without converting to string? The only way I can think e.g. reading a UInt16 is using unsafe_load to load the integer, then print(buffer, the_integer). That creates a String, right?
    Ben J. Ward
    @BenJWard
    Yep I think doing this kind of thing does involve some unsafe_read or unsafe_loads
    Ciarán O'Mara
    @CiaranOMara
    The write(buffer, the_integer) method will write without string conversion.
    Jakob Nybo Nissen
    @jakobnissen
    Yes, but it will not write the correct bytes. I.e. when writing 5, it will write \0\0\0\0\0\0\05, not UInt8('5')
    Ciarán O'Mara
    @CiaranOMara
    So UInt8 of the character literal?
    Ciarán O'Mara
    @CiaranOMara
    I’d need to see a specific case to understand.
    open("tmp-u8.txt", "w") do io
    write(io, UInt8(5)) #00000101
    write(io, UInt8('5’)) #00110101
    end
    image.png
    If you’re on unix/linux you can print the binary with xxd. The command used in the screenshot above was, xxd -b tmp-u8.txt.
    Ciarán O'Mara
    @CiaranOMara
    @jakobnissen, I’ve just come across Base.transcode in the docs and thought of your project — I don’t know if it’s useful to you.
    https://docs.julialang.org/en/v1/base/strings/#Base.transcode
    herbstk
    @herbstk
    I have a general question: Is BioAlignments.jl currently incompatible with BioSequences.jl@2.0.0? Am I restricted to BioSequences.jl@1.1.0 if I need to use it in conjunction with BioAlignments.jl?
    herbstk
    @herbstk
    I see that this issue has been raised before BioJulia/BioAlignments.jl#35. Thanks @CiaranOMara for pushing this to the next level. I will stay on the 1.X versions for both packages for the time being.
    Aadam
    @aadimator
    Can anyone help me with this? BioJulia/BioAlignments.jl#43 Any help would be much appreciated.