    Jiyan Yang
    @chocjy
    also I can do something like dict.keys() to get all the (tau, mz) values
    maybe I should use two dicts?
    Jey Kottalam
    @jey
    sure, that'd work
    I used a deterministic mapping from (tau, mz_idx) to raw_col, and I have a seen_cols[j] array that maps raw_col to actual_col j
    which I look up with bisect_left(seen_cols, raw_col)
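    (A minimal sketch of that lookup pattern, with a hypothetical mz bin count and example column ids:)
    import bisect

    N_MZ = 1000  # hypothetical number of mz bins

    def to_raw_col(tau, mz_idx):
        # deterministic mapping from (tau, mz_idx) to a raw column id
        return tau * N_MZ + mz_idx

    # sorted raw column ids actually seen; position j is the actual column index
    seen_cols = [7, 42, 1003, 5008]

    def to_actual_col(tau, mz_idx):
        raw_col = to_raw_col(tau, mz_idx)
        j = bisect.bisect_left(seen_cols, raw_col)
        if j < len(seen_cols) and seen_cols[j] == raw_col:
            return j
        raise KeyError((tau, mz_idx))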
    Jiyan Yang
    @chocjy
    ok thanks
    let me try
    Jiyan Yang
    @chocjy
    @ensemblearner
    Jiyan Yang
    @chocjy
    can you help me get something
    the cluster is completely down but Ben wants to compute something
    basically we want to compute the total intensities for each (tau, mz) bin
    this is very simple: say you have an RDD where each record corresponds to one (tau, mz) bin
    assume each record is of the format (row_idx, (non_zero_col_idx, elements) )
    then in the map function you just do something like map(lambda row: np.sum( row[1][1] )).collect()
    here I am assuming all the intensities are positive
    basically at the end the program should output a vector whose length is the number of (tau, mz) bins
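    (A self-contained sketch of that computation with made-up records, assuming the (row_idx, (non_zero_col_idx, elements)) format above:)
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName='bin_sums')

    # one record per (tau, mz) bin: (row_idx, (non_zero_col_idx, elements))
    bins = sc.parallelize([
        (0, (np.array([2, 5]), np.array([1.0, 3.0]))),
        (1, (np.array([0]), np.array([2.5]))),
    ])

    # total intensity per bin (all intensities assumed positive)
    sums = bins.map(lambda row: np.sum(row[1][1])).collect()  # [4.0, 2.5]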
    mohit1007
    @ensemblearner
    hey .. @chocjy just saw the ping
    mohit1007
    @ensemblearner
    So, can you clarify the idx (with regard to tensor T, matrix B, or matrix C)? basically the input rdd schema? then i would be able to compute these :)
    Jiyan Yang
    @chocjy
    forget about T and B for now
    everything is in C
    use the dataset you gave me
    that's C
    mohit1007
    @ensemblearner
    i see.. so basically, in the matrix C
    the input rdd is row_id_c, column_id_c, and the (mz, tau) indices?
    just to confirm mz_idx or mz_val?
    Jiyan Yang
    @chocjy
    the input format is (row_id_c, (col_id_c, values_c) )
    there is no mz here
    each record corresponds to some (tau, row) pair
    values_c is the intensity value of some pair (row_id) at some pixel (col_id)
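    (So a single record might look like this, with made-up numbers:)
    import numpy as np

    # (row_id_c, (col_id_c, values_c)): intensities of row row_id_c at the pixels in col_id_c
    record = (42, (np.array([3, 17, 256]), np.array([0.8, 1.2, 0.4])))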
    mohit1007
    @ensemblearner
    oh.. got it.. so this is basically the matrix we're running CX on??
    right?
    Jiyan Yang
    @chocjy
    yes
    mohit1007
    @ensemblearner
    got it.. let me get back to you..
    Jiyan Yang
    @chocjy
    ok thanks
    mohit1007
    @ensemblearner
    sure.. np
    mohit1007
    @ensemblearner
    @chocjy running the job now.. sorry it took some time.. monday can get busy at work.. plus the groupby operation seems a bit expensive.. i have started a batch job.. and it should be done soon.. will ping you with an update
    Jiyan Yang
    @chocjy
    ok thanks np
    mohit1007
    @ensemblearner
    @chocjy
    Jiyan Yang
    @chocjy
    mohit, I think you used the wrong matrix
    the one we used for CX was the tall matrix
    and we are computing the sum of elements in each row
    so the file should contain one number per row
    mohit1007
    @ensemblearner
    oh.. hang on..
    mohit1007
    @ensemblearner
    @chocjy shared via drive
    Jiyan Yang
    @chocjy
    ok thanks
    can you share the code as well
    mohit1007
    @ensemblearner
    from pyspark import SparkContext
    from pyspark import SparkConf
    import numpy as np
    conf = SparkConf().set('spark.eventLog.enabled','true').set('spark.driver.maxResultSize', '8g') 
    sc = SparkContext(appName='cx_exp',conf=conf)
    def clean(x):
        x = str(x)
        chunks = x.split(",")
        # take the transpose (we want tall matrix)
        return int(chunks[1]), int(chunks[0]), float(chunks[2])
    def prepare_matrix(rdd):
        # group each row's (col, value) pairs together
        gprdd = rdd.map(lambda x: (x[0], (x[1], x[2]))).groupByKey().map(lambda x: (x[0], list(x[1])))
        # pack each row into (col_indices, values) arrays
        flattened_rdd = gprdd.map(lambda x: (x[0], _indexed(x[1])))
        return flattened_rdd

    def _indexed(grouped_list):
        # split a list of (col, value) tuples into parallel index/value arrays
        indexed, values = [], []
        for tup in grouped_list:
            indexed.append(tup[0])
            values.append(tup[1])
        return np.array(indexed), np.array(values)
    
    
    data = sc.textFile('/scratch1/scratchdirs/msingh/sc_paper/experiments/striped_data/final_matrix').map(clean)
    # note: 'grouped' was never used below, so this extra (expensive) groupByKey can be dropped:
    # grouped = data.map(lambda x: (x[0], (x))).groupByKey().map(lambda x: (x[0], list(x[1])))
    srdd = prepare_matrix(data)
    # one sum per row of the tall matrix
    summation = srdd.map(lambda row: np.sum(row[1][1])).collect()
    np.savetxt('/global/homes/m/msingh/final_mappings/summation_tall', np.array(summation))
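    (Since the groupByKey felt expensive: a sketch of the same row sums with reduceByKey, which shuffles one float per row instead of a whole list; the output path is hypothetical and this variant is untested:)
    # reuse 'data' from above: records are (row, col, value) triples
    row_sums = (data
                .map(lambda x: (x[0], x[2]))       # keep (row_id, value)
                .reduceByKey(lambda a, b: a + b)   # per-row total, no list materialized
                .sortByKey()                       # deterministic row order
                .values()
                .collect())
    np.savetxt('/global/homes/m/msingh/final_mappings/summation_tall_v2', np.array(row_sums))  # hypothetical path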
    @chocjy
    Jiyan Yang
    @chocjy
    thanks