    arjunsrivatsa
    @arjunsrivatsa
    @arshajii The heap size thing doesn't seem to help. I am not sure if the gc is clearing anything at all. There never seems to be a reduction in memory, even for really simple program tests. Does this seem to happen for you? What would be the command to force the garbage collector to wipe everything?
    A. R. Shajii
    @arshajii
    @arjunsrivatsa Hmm that's odd, it should definitely be freeing things as you continue allocating more objects. Is it possible that you're keeping around references to things somehow, preventing the GC from collecting them? If you have some example(s) I can take a look
    arjunsrivatsa
    @arjunsrivatsa

    Here's a code chunk that has the same issue: once the function reads in the genome, the memory never seems to be released. The same thing happens in much longer-running programs, until memory runs out.

    from bio import *
    
    def wgz(floc: str, ob: List[str]):
            with open(floc, 'wb') as afile:
                    for item in ob:
                            afile.write(f'{item}\n')
    
    def rgz(floc: str) -> List[str]:
            return_list = List[str]()
            with open(floc, 'rb') as file:
                    return_list = [line.strip() for line in file]
            return return_list
    def readch(floc):
            anewchrom = rgz(floc)
            anewchrom.clear()
            del anewchrom
            return 0
    
    
    
    @python
    def getmemory():
            import psutil, os
            process = psutil.Process(os.getpid())
            print(process.memory_info().rss)
    
    
    reduced_chrom_dict = ['chr1', 'chr10', 'chr11', 'chr12', 'chr13','chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19','chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8','chr9']
    sex_chroms = ['chrX', 'chrY']
    chroms = List[str]()
    for r in FASTA('/home/assrivat/hg38.fa', fai = False):
            if(r.name in reduced_chrom_dict):
                    print(r.name)
                    chroms.append(str(r.seq).upper())
                    chroms.append(str(r.seq).upper())
            elif(r.name in sex_chroms):
                    print(r.name)
                    chroms.append(str(r.seq).upper())
            else:
                    print('skipped')
    getmemory()
    wgz('/media/drive2/assrivat/randomtest.gz',chroms)
    chroms.clear()
    del chroms
    getmemory()
    readch('/media/drive2/assrivat/randomtest.gz')
    getmemory()
    readch('/media/drive2/assrivat/randomtest.gz')
    getmemory()
    readch('/media/drive2/assrivat/randomtest.gz')
    getmemory()

    The memory readings (RSS in bytes, one per getmemory() call) were:

    6942056448
    6942244864
    12526358528
    14309744640
    14806085632
    Mark Henderson
    @markhend
    @arjunsrivatsa I grabbed the hg38 file and ran some tests too. I made some changes to see if I could stabilize the memory usage. Something like the following seems to avoid blowing up. FASTA can also accept the .gz format. I didn't convert the sequence to uppercase, I write only once in the if block, and I write inline instead of saving to a list first.
    from bio import *
    
    @python
    def getmemory():
        import psutil, os
        process = psutil.Process(os.getpid())
        print(process.memory_info().rss)
    
    reduced_chrom_dict = ['chr' + str(n) for n in range(1, 23)]
    sex_chroms = ['chrX', 'chrY']
    floc = 'assrivat_out.txt'
    
    getmemory()
    
    with open(floc, 'wb') as afile:
        for r in FASTA('data/hg38.fa.gz', fai=False):
            if (r.name in reduced_chrom_dict):
                print(r.name)
                afile.write(f'{r.seq}\n')
            elif (r.name in sex_chroms):
                print(r.name)
                afile.write(f'{r.seq}\n')
            else:
                # print('skipped')
                continue
            getmemory()
    
    getmemory()
    If I get more time, I'd like to try a better comparison with Python.
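The pattern in the script above (write each record as soon as it is read, rather than collecting every sequence in a list first) can be sketched in plain Python. This is only an illustration, not Seq's `FASTA` reader: `iter_fasta` and `write_selected` are hypothetical helpers, and the input is assumed to be uncompressed FASTA text.

```python
from typing import Iterable, Iterator, Set, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) one record at a time (hypothetical helper)."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record name is the first word after '>'
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_selected(records: Iterable[Tuple[str, str]],
                   wanted: Set[str], out: TextIO) -> int:
    """Write each wanted sequence on its own line; return how many were written."""
    written = 0
    for name, seq in records:
        if name in wanted:
            out.write(seq + "\n")
            written += 1
    return written
```

Only one record is held at a time, so peak memory is bounded by the largest single sequence rather than by the whole genome. A set such as `{"chr" + str(n) for n in range(1, 23)} | {"chrX", "chrY"}` would play the role of the two chromosome lists.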
    arjunsrivatsa
    @arjunsrivatsa
    I see, thanks. I don't think the file writes are the main memory issue. The issue seems to be when reading in a genome. That is, I haven't found a way to read in a genome, do some things with it, and then remove it/not reference it without permanently increasing the memory of the program.
    Out of curiosity, what did your memory calls give you for that script above?
    Mark Henderson
    @markhend
    249458688,
    chr1 1117270016, chr10 1136144384, chr11 1136144384, chr12 1104474112, chr13 1104474112,
    chr14 1104474112, chr15 1106362368, chr16 1106362368, chr17 1106362368, chr18 1106362368,
    chr19 1106362368, chr2 1350844416, chr20 1350844416, chr21 1378107392, chr22 1382301696,
    chr3 1384398848, chr4 1416495104, chr5 1416495104, chr6 1416495104, chr7 1416503296,
    chr8 1416503296, chr9 1416503296, chrX 1453768704, chrY 1453789184,
    1453789184
    Mark Henderson
    @markhend
    That's with seqc run. With seqc build -release it runs in ~30 secs and memory is chr1 638062592 ... chrY 847187968.
    A. R. Shajii
    @arshajii
    @arjunsrivatsa Just wanted to check on the latest status on this issue -- I can check it out if there's still a memory issue
    arjunsrivatsa
    @arjunsrivatsa
    @arshajii yeah it seems to be a pretty persistent bug on the memory which I haven't found a way to fix.
    A. R. Shajii
    @arshajii
    @arjunsrivatsa OK let me take a look over the next couple days -- I'll play around with the code you posted above
    A. R. Shajii
    @arshajii
    @arjunsrivatsa It seems list.clear() does not actually clear/free/reset anything, it just sets the list length to 0, but list elements are still referenced by the underlying array. This could be why the memory is never freed. Can you try adding the following snippet to the top of your program?
    @extend
    class List:
        def clear(self):
            from internal.gc import sizeof
            str.memset(self.arr.ptr.as_byte(), byte(0), self.len * sizeof(T))
            self.len = 0
    We can change the various clear() methods then to actually do this
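The difference between the two clear() behaviours can be shown with a toy list in plain Python. This is a sketch of the idea only, not Seq's `List` implementation: `ArrayList`, `Record`, and `survives` are hypothetical names, and the check relies on the runtime reclaiming an object once nothing refers to it.

```python
import gc
import weakref


class Record:
    """Stand-in for a large object such as a chromosome string."""


class ArrayList:
    """Toy growable list: a backing array plus a logical length."""

    def __init__(self, capacity: int = 8):
        self.arr = [None] * max(1, capacity)
        self.len = 0

    def append(self, item) -> None:
        if self.len == len(self.arr):
            self.arr.extend([None] * len(self.arr))  # double the capacity
        self.arr[self.len] = item
        self.len += 1

    def clear_length_only(self) -> None:
        # Only the length is reset; the slots still reference the old items
        self.len = 0

    def clear_and_reset(self) -> None:
        # Drop the references first, then reset the length
        for i in range(self.len):
            self.arr[i] = None
        self.len = 0


def survives(clear_name: str) -> bool:
    """Return True if a stored object is still alive after the given clear."""
    lst = ArrayList()
    obj = Record()
    ref = weakref.ref(obj)
    lst.append(obj)
    del obj  # the list now holds the only strong reference
    getattr(lst, clear_name)()
    gc.collect()
    return ref() is not None
```

Here `survives("clear_length_only")` reports that the object is still reachable through the backing array, while `survives("clear_and_reset")` reports that it has been reclaimed.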
    Mark Henderson
    @markhend
    Shows nice improvement for me. Memory use stays stable or at least increases much slower (versus not clearing the list after a write), and is similar whether I save to a list and then write, or just bypass the list and write directly.
    arjunsrivatsa
    @arjunsrivatsa
    Will try this once I get some time this weekend, thanks
    arjunsrivatsa
    @arjunsrivatsa
    I get the error "generics do not match" when I throw that at the top of my program
    Mark Henderson
    @markhend
    Maybe try not specifying a type when you declare the list. e.g. chroms = []
    A. R. Shajii
    @arshajii
    What version of Seq are you using (seqc --version)?
    arjunsrivatsa
    @arjunsrivatsa
    I just upgraded to 0.11, and it got rid of that bug, but now there's a new error saying "bad dedent". I am not sure what that means; I ended up reformatting all the indents with autopep8, but that didn't fix it.
    Mark Henderson
    @markhend
    I see that typically when I've copy/pasted code. Usually retyping, or dedenting then indenting again will fix.
    arjunsrivatsa
    @arjunsrivatsa
    Yeah, it's pretty strange. I redid all the tabs, and even used an auto-indenting tool. Maybe I should try 4 spaces
    arjunsrivatsa
    @arjunsrivatsa
    Hmm that didn't work either
    def speedSNP(seqs):
        import random
        BASES = ['A', 'C', 'T', 'G']
        c = list(range(len(seqs)))
        chrom = random.sample(c,1)
        seq = seqs[chrom[0]]
        target = GLOBAL_CHROM_NUM[chrom[0]]
        n = len(seq)
        pos = 0
        all_positions = []
        if(n==0):
            pos = -1
        elif(n==1):
            n_repeats = 1
            all_positions = [0]
            for i in range(n_repeats):
                oldseq = seq
                char = oldseq[pos]
                while(char == seq[pos]):
                    char = random.choice(BASES)
                seq = seq[:0]+char
        else:
            n_repeats = 1
            for i in range(n_repeats):
                pos = random.randint(0,n-1)
                all_positions.append(pos)
                oldseq = seq
                char = oldseq[pos]
                while(char == seq[pos]):
                    char = random.choice(BASES)                                                                                                                                                                                                                                            
        seq = seq[:pos] + char + seq[pos+1:]
        seqs[chrom[0]] = seq
        return [-1], [pos, chrom[0], target]
    Can anyone see an indentation problem with this? I keep getting this dedent error and I double checked all the indentation/formatting
    Mark Henderson
    @markhend
    Try adding a space after the if/elif/while keywords. Also note, seq[:0]+char might not be what you want. This just evaluates to the value of char.
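To illustrate the slicing point: `seq[:0]` is always the empty string, so `seq[:0] + char` is just `char`. A point substitution that keeps the rest of the sequence uses `seq[:pos] + char + seq[pos+1:]`. The helper below is a plain-Python sketch; the name `substitute_base` and the `rng` parameter are assumptions for illustration, not part of the code under discussion.

```python
import random

BASES = ["A", "C", "T", "G"]


def substitute_base(seq: str, pos: int, rng: random.Random) -> str:
    """Return seq with the base at pos replaced by a different base."""
    if not 0 <= pos < len(seq):
        raise IndexError("pos out of range")
    # Choose only among bases that differ from the current one,
    # so no retry loop is needed and the new base is always defined
    choices = [b for b in BASES if b != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]
```

Picking from the bases that differ from the current one also sidesteps the uninitialized-variable problem, because the replacement is computed in one expression. An empty sequence raises `IndexError` here; the caller decides how to handle that case.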
    arjunsrivatsa
    @arjunsrivatsa
    No luck with that
    Mark Henderson
    @markhend
    What error are you seeing? The only other changes I had to make were to define GLOBAL_CHROM_NUM, and if the if branch is taken, char is not initialized. You might give it a default.
    arjunsrivatsa
    @arjunsrivatsa
    The same error: "bad dedent"
    Gert Hulselmans
    @ghuls
    @arjunsrivatsa The second char = random.choice(BASES) has a lot of trailing spaces.
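Whitespace problems like this are hard to spot by eye. A small plain-Python check can list them; `find_whitespace_issues` is a hypothetical helper, and whether trailing spaces are what triggers "bad dedent" in Seq 0.11 is not confirmed in this thread.

```python
from typing import List, Tuple


def find_whitespace_issues(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for suspicious whitespace."""
    issues = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Anything removed by rstrip() is trailing whitespace
        if line != line.rstrip():
            issues.append((number, "trailing whitespace"))
        # The indent is everything before the first non-whitespace character
        indent = line[: len(line) - len(line.lstrip())]
        if " " in indent and "\t" in indent:
            issues.append((number, "mixed tabs and spaces in indent"))
    return issues
```

Calling it as `find_whitespace_issues(open(path).read())` on the source file, with `path` pointing at the script, gives the line numbers to inspect.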
    Jeff "Gent" Stone
    @jeffreykstone
    Is there a homebrew install for seq?
    A. R. Shajii
    @arshajii
    @jeffreykstone Not yet, we’re doing some cleanup internally right now and plan to do a release soon including some install options like brew
    Jeff "Gent" Stone
    @jeffreykstone
    @arshajii That would be great!
    Jacob Hilliard
    @jcbhl
    Hi all! I'm a student with some spare time on my hands and I'm interested in contributing to Seq. Is the project currently accepting PRs from external contributors?
    范兴国
    @FanXingGuo

    Found a bug in sequence alignment

    from bio import *
    
    # default parameters
    s1 = s'CGGAAGAGCGTTTTCAGTTCATCAGGTGTGAAT'
    s2 = s'CGGAAGAGCGTTTTCAGTTAATCAGGGGTGAAT'
    aln = s1 @ s2
    print(aln.cigar, aln.score)  # 33M -2
    
    # custom parameters
    # match = 2; mismatch = 4; gap1(k) = 2k + 4; gap2(k) = k + 13
    aln = s1.align(s2, a=2, b=4, gapo=4, gape=2, gapo2=13, gape2=1)
    print(aln.cigar, aln.score)  # 33M 54

    There is a mismatch in the sequences, as follows:

    CGGAAGAGCGTTTTCAGTT C ATCAGGTGTGAAT

    CGGAAGAGCGTTTTCAGTT A ATCAGGGGTGAAT

    But the alignment result says it's a match

    范兴国
    @FanXingGuo

    I know the reason, sorry.
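The reason is not spelled out in the thread. One likely explanation is that in a CIGAR string `M` marks an aligned column, which may be either a sequence match or a mismatch (`=` and `X` are the codes that distinguish the two), so `33M` does not claim that all 33 bases are identical. The plain-Python sketch below counts the mismatches and recomputes the custom-parameter score, assuming `a` is the match reward and `b` the mismatch penalty as described in the snippet's comment; `ungapped_score` is a hypothetical helper, not part of Seq.

```python
from typing import Tuple


def ungapped_score(s1: str, s2: str, match: int = 2,
                   mismatch: int = 4) -> Tuple[int, int]:
    """Score a gap-free, column-by-column alignment.

    Returns (score, number_of_mismatches).
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    mismatches = sum(1 for x, y in zip(s1, s2) if x != y)
    matches = len(s1) - mismatches
    return matches * match - mismatches * mismatch, mismatches


s1 = "CGGAAGAGCGTTTTCAGTTCATCAGGTGTGAAT"
s2 = "CGGAAGAGCGTTTTCAGTTAATCAGGGGTGAAT"
score, mismatches = ungapped_score(s1, s2)
print(mismatches, score)  # 2 54
```

The two sequences differ at two positions (C/A and, further along, T/G), and 31 * 2 - 2 * 4 = 54 agrees with the score reported for the custom parameters, so the mismatches were penalized even though the CIGAR reads `33M`.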

    A. R. Shajii
    @arshajii
    Hey @jcbhl sorry for the late reply -- let's reconnect in a week or two as we're doing some major internal revamps right now, if that sounds good!
    Gert Hulselmans
    @ghuls
    @arshajii Is all this happening in private? I don't see any new branches at https://github.com/seq-lang
    A. R. Shajii
    @arshajii
    @ghuls Yes that's right, we're planning to release it soon
    Gert Hulselmans
    @ghuls
    I am also running into some problems with porting my code from 0.10.3 to 0.11.0.
    Gert Hulselmans
    @ghuls
    # Read whitelisted barcodes from file and convert to a set of Kmers. (K = e.g. Kmers[16])
    bc_whitelist = read_barcode_whitelist_from_file[K](
             bc_whitelist_filename=bc_whitelist_filename,
             bc_column_idx=0,
             warning=True
     )
    
    # Gives an error:
    correct_barcode_in_fastq.seq:109:20: error: cannot find '__getitem__' in std.software.single_cell_toolkit.seq_lib.barcode_correction.read_barcode_whitelist_from_file[str,int,bool,kmer_type]
    
    # Function signature:
    def read_barcode_whitelist_from_file[kmer_type](bc_whitelist_filename: str, bc_column_idx: int = 0, warning: bool=True):
       ...
    In 0.11.0 Kmers seem to have changed a bit. What is the problem with this code?
    A. R. Shajii
    @arshajii
    Ah, we changed generic syntax a bit to be more Pythonic
    Basically as follows:
    # OLD:
    def foo[T](x: T):
        pass
    
    # NEW:
    def foo(x: T, T: type):
        pass
    Also, you can now pass k-mer lengths as arguments using Static[int]:
    def foo(k: Static[int]):
        ... Kmer[k] ...
    k will need to be a compile-time constant; you can also pass these on the command-line via -Dk=42 or the like
    Gert Hulselmans
    @ghuls
    Thanks. Got it fixed.