Bo
@BoPeng
Or you could put group_by inside output_from to have something like output_from('trimmomatic', group_by=1), output_from('index-reference'). Use print(_input) to test your step before running anything.
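For example, a rough sketch (the step name map_reads is just a placeholder, not from your workflow):

[map_reads]
input: output_from('trimmomatic', group_by=1), output_from('index-reference')
print(_input)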
Patrick Cudahy
@pgcudahy
Ah, with pairlabel, would it group the fastq files by 1?
When I've tried using multiple output_from inputs it usually complains if they're of unequal lengths
Bo
@BoPeng
Yes, and there are also pairlabel2, pairlabel3, etc. There is a summary table of these options in the documentation.
> When I've tried using multiple output_from inputs it usually complains if they're of unequal lengths
Patrick Cudahy
@pgcudahy
Got it! Thanks very much for your patient help
Bo
@BoPeng
When one of the output_from has a single group, it will propagate to match the number of groups of other output_from.
I think moving group_by inside output_from is the most natural method.
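So, as a sketch (assuming 'index-reference' produces a single group; the group_by values are illustrative):

input: output_from('trimmomatic', group_by=1), output_from('index-reference', group_by='all')

Here the single group from 'index-reference' would be repeated to match the number of groups coming from 'trimmomatic'.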
Milo Shields
@miloshields:matrix.org
Hi! I'm trying to link SoS with this in-browser JavaScript kernel: https://github.com/stdlib-js/jupyter-stdlib-browser-kernel. Since it's JavaScript, I was wondering if it would be as simple as adding the kernel's name to the sos-javascript module, although I'm not sure it's technically supported. Any pointers would be greatly appreciated!
Bo
@BoPeng
You can try it with
%use JavaScript --kernel kernel_name
where kernel_name should be the one listed by jupyter kernelspec list
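For example, if jupyter kernelspec list shows a kernel named stdlib (that name is just a guess), it would be

%use JavaScript --kernel stdlib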
If it works, I will be happy to add it to the supported kernel list for JavaScript.
@miloshields:matrix.org
Milo Shields
@miloshields:matrix.org
Thanks! I'll try that out right now.
Bo
@BoPeng
I guess it did not work?
Milo Shields
@miloshields:matrix.org
Yep... Getting this error:
Exception in event handler for select.Cell TypeError: Cannot read property 'send' of null
    at Kernel._send (main.min.js?v=931f67…555340d3f25c5:38180)
    at Kernel.send_shell_message (main.min.js?v=931f67…555340d3f25c5:38197)
    at Comm.send (main.min.js?v=931f67…555340d3f25c5:37361)
    at send_kernel_msg (kernel.js?v=20210616115701:1169)
    at window._Events.notify_cell_kernel (kernel.js?v=20210616115701:1264)
    at window._Events.dispatch (main.min.js?v=931f67…eb96555340d3f25c5:2)
    at window._Events.v.handle (main.min.js?v=931f67…eb96555340d3f25c5:2)
    at Object.trigger (main.min.js?v=931f67…eb96555340d3f25c5:2)
    at window._Events.<anonymous> (main.min.js?v=931f67…eb96555340d3f25c5:2)
    at Function.each (main.min.js?v=931f67…eb96555340d3f25c5:2) 
Arguments(2) ["select.Cell", {…}, callee: (...), Symbol(Symbol.iterator): ƒ]
I'll do some digging and see if I can find anything useful.
Bo
@BoPeng
OK, let me have a look.
The installation process of this kernel looks very dangerous (going into the Jupyter directory and running git clone)...
Bo
@BoPeng
Saw the error. The kernel was trying to execute something at https://github.com/stdlib-js/jupyter-stdlib-browser-kernel/blob/5844a62eca5ca4e0ffa992c5f307149b57ba3895/kernel.js#L174 but I am not sure what it is trying to do.
So, what exactly does "in-browser" mean? Does it use the browser's JS engine to compute stuff, and even control the Jupyter cell interface directly?
Milo Shields
@miloshields:matrix.org
The evil() function is an alias for eval(), so it's just running the code in the cell using that function and then printing the result.
In my version, I'm trying to replace that with a link to another JavaScript engine, but I don't think that's where the error is coming from.
In terms of communication, I'm pretty sure it follows the same spec that all kernels use, with the five (I think) sockets performing different functions
But yeah, the evaluation is in the browser itself
Bo
@BoPeng
I think the problem is that this kernel has a direct reference to the JS kernel object that calls its execute function.
However, SoS wraps each kernel in order to pass additional parameters to the SoS kernel before sending stuff to the underlying kernel.
Milo Shields
@miloshields:matrix.org
interesting
Bo
@BoPeng
  1. take whatever the user inputs
  2. add SoS options, such as the kernel of the current cell
  3. pass everything to the SoS kernel for execution
  4. SoS then processes the information and calls the underlying kernel
Milo Shields
@miloshields:matrix.org
Right. This is great, thanks!
Bo
@BoPeng
I am not sure at which step this kernel conflicts with SoS, but this line of code, https://github.com/stdlib-js/jupyter-stdlib-browser-kernel/blob/5844a62eca5ca4e0ffa992c5f307149b57ba3895/kernel.js#L95, namely kernel.execute = execute, basically replaces the SoS my_execute with its own version, so SoS will not be able to function properly.
Bo
@BoPeng
Note that even if SoS did not wrap execute, this kernel tries to execute the cell directly (instead of waiting for SoS to feed it the processed code), so it will not be able to handle SoS magics such as %get and %put.
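For context, those magics move variables between kernels; in a subkernel cell, something like

%get some_var --from SoS

(the variable name here is just a placeholder) asks SoS to transfer some_var into the subkernel, which only works when SoS gets to preprocess the cell first.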
Milo Shields
@miloshields:matrix.org
Got it. Thanks for your help! I may try modifying the IJavascript kernel instead... I basically just need to replace the actual execute step of any JavaScript kernel with my own method.
Bo
@BoPeng
Not sure what you are trying to achieve but good luck.
Milo Shields
@miloshields:matrix.org
Thank you, and thanks for the help!
Patrick Cudahy
@pgcudahy

Hello Bo, I hope you've been well. Could you please help me figure something out? I'm trying to run kraken2, which spends a lot of time loading a ~55 GB database called hash.k2d into memory before doing any work. I'm trying to figure out how to do this cleanly on a SLURM cluster so that the database gets loaded once into a ramdisk at /dev/shm, a series of jobs is run, and then the ramdisk is cleaned up. So far I've come up with:

[kraken2_single_hpc]
input: single_read, group_by=1
output: kraken2_single_classified_reads_1 = f'/home/pgc29/scratch60/helen_mixed_infection/data/kraken2/{_input.labels[0]}_classified_reads_1.fastq.gz',
        kraken2_single_output = f'/home/pgc29/scratch60/helen_mixed_infection/data/kraken2/{_input.labels[0]}.kraken2',
        kraken2_single_report = f'/home/pgc29/scratch60/helen_mixed_infection/data/kraken2/{_input.labels[0]}.report'
task: walltime='01:00:00', mem='700G', trunk_size=50, trunk_workers=1,
    workdir='/home/pgc29/scratch60/helen_mixed_infection/data/kraken2'

bash: expand=True
    if [ ! -f /dev/shm/pgc29/hash.k2d ]; then
        mkdir -p /dev/shm/pgc29/
        cp /home/pgc29/scratch60/helen_mixed_infection/data/kraken2/k2_pluspf_20210517/*k2d /dev/shm/pgc29/
    fi

    module load miniconda
    conda activate kraken2
    kraken2 --db /dev/shm/pgc29 \
        --classified-out {_output["kraken2_single_classified_reads_1"]:n} \
        --output {_output["kraken2_single_output"]} \
        --report {_output["kraken2_single_report"]} \
        --gzip-compressed \
        --memory-mapping \
        {_input[0]}
    gzip -f {_output["kraken2_single_classified_reads_1"]:n}
    rm -rf /dev/shm/pgc29/*

However, this cleans up the database after each sample is processed. Is there any way to run rm -rf /dev/shm/pgc29/* only after the 50th sample is run?

Bo
@BoPeng
I am not sure that "cleaning up" is the real problem here: /dev/shm is node-specific and is not shared across computing nodes, so you will have to do this on every computing node anyway, right?
Patrick Cudahy
@pgcudahy
Correct, but each group of 50 samples (the trunk_size) will get sent to one node. So once all 50 are run, I'd like to clean up on that node
Bo
@BoPeng

I see. You have rm -rf at the end. sos provides an option active to enable or disable actions for particular substeps, so it is possible to do something like

task: trunk_size=50
sh: active=slice(0, None, 50)
    cp stuff
sh:
    do stuff
sh: active=slice(49, None, 50)
    clean up

but you had better test it before submitting this to the cluster. sos may also regroup substeps when you run the step again if some of them have already been completed; in that case the per-substep active will not work at all, I think.
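Either way, you could first test the step locally on a handful of files before submitting (the script filename here is hypothetical):

sos run analysis.sos kraken2_single_hpc -q localhost

to confirm that the copy and clean-up actions fire on the substeps you expect.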

Patrick Cudahy
@pgcudahy
How about
task: trunk_size=50
sh: active=slice(0, None, 50)
    cp stuff
sh:
    do stuff
sh: active=-1
    clean up
Bo
@BoPeng
Then this will only clean up on the one computing node that executes the last substep, right?
Patrick Cudahy
@pgcudahy

Hello Bo, I have another question I'd like help with. Does skip_if() work with cluster workloads? I have a step where I validate samples with kraken, and if they're good, the sample name gets written to a "good_samples" folder. For the next step, I only want to apply it to samples with a matching entry in the good_samples folder.

[validate_fastq]
input: single_read, group_by=1
output: f'/home/pgc29/scratch60/helen_mixed_infection/data/fqtools/{_input[0]:bnn}_good.fastq.gz'
task: walltime='00:30:00', mem='1G', trunk_size=40, trunk_workers=10, 
    workdir='/home/pgc29/scratch60/helen_mixed_infection/data/fqtools'

skip_if(str(_input).split("/")[-1] not in os.listdir("/home/pgc29/scratch60/helen_mixed_infection/data/kraken2/good_samples/"))

run: expand=True
    module load miniconda
    source activate fqtools
    fqtools validate {_input} > {_output}

This will start running with many samples initially processed, but then one of the nodes will fail with

standard error:
================
INFO: t40t3859ba5a495d1 started
ERROR: Variable _output can only be set by SoS

and then all of the nodes seem to stop working until they finally time out. Am I doing something wrong, or is this not compatible with clusters?

Bo
@BoPeng
The Variable _output can only be set by SoS error looks like a bug, since your code does not explicitly set _output. skip_if is dangerous because an earlier version of SoS would assume the _output is not needed and even tried to remove it; maybe that was the problem. In this case, done_if looks like the right choice.
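That is, keeping your condition but swapping the action (an untested sketch of the same line):

done_if(str(_input).split("/")[-1] not in os.listdir("/home/pgc29/scratch60/helen_mixed_infection/data/kraken2/good_samples/"))

done_if tells SoS that the substep is already complete with its output in place, instead of implying that the output is not needed, so SoS should not try to remove _output.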