"Select jobs to execute..." runs literally forever - solver

I have a rather complex workflow with 750 samples and roughly 18.000 jobs, at first snakemake runs just fine but then after around 4.000 jobs it suddenly freezes and upon restart it hangs with "Select jobs to execute..." for 24h, after that I terminated it. The initial DAG building takes roughly 2-3 minutes, though.
When I run snakemake (v5.32.0 and v5.32.1) with the --verbose option, I get tons of lines similar to this one:
Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, best possible -52538.194 (7.08 seconds
I tried to delete the .snakemake folder in the hope that something went riot there, but that wasn't the case, unfortunately. To me it seems that the CBC MILP Solver somehow does not converge, and it keeps going and going to bring the best and the best possible solution closer together!?
Now I do not have any idea anymore, how to proceed and fix the problem. My possible solutions are somehow to change the convergence criteria or the solver itself. In the manual I found the option --scheduler-ilp-solver but it has apparently only one option, the default COIN_CMD.
After terminating a (shorter) run, I get this verbose output
Result - User ctrl-cuser ctrl-c
Objective value: 52534.79114334
Upper bound: 52538.202
Gap: -0.00
Enumerated nodes: 186926
Total iterations: 1807277
Time (CPU seconds): 1181.97
Time (Wallclock seconds): 1188.11
Next I will try to limit the number of samples in the workflow and see if this has any impact (for other datasets with 500 samples, it ran without any problems (with snakemake version 5.24), but there the DAG building took some hours. Hence, I am not very eager to try the old version.)
So, any idea how to fix the problem is highly appreciated. Also, I do not even know, if this is a bug!?
EDIT Actually, I believe it is a bug in the current version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...

I also ran into this issue with a smaller workflow (~1500 jobs total) and snakemake version 6.0.2. About half the jobs had run when the workflow got stuck, and refused to run any more jobs. Looks like it's a problem specific to the ILP solver, because when I re-ran with --scheduler greedy, it worked fine.

Actually, I believe it is a bug in the current snakemake version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...

Related

Rcpp in Rstudio, can't cache in memory when parallel if I don't open the cpp file in Rstudio

I met a wired problem but I wonder if I'm asking the correct question:
result = parLapply(cl, 1:4,
function(j,rho_list_needed,delta0_needed,
V_iter_s,Sigma_list_needed) {
rhoj = rho_list_needed[[j]]
delta0_in_cpp = delta0_needed
v = as.vector(V_iter_s[,,,j])
sigmaj = Sigma_list_needed[[j]]
sourceCpp('sample_Z.cpp')#first time complie slow,then cashed
return(Sample_Z(rhoj,delta0_in_cpp, v,sigmaj,A,Cmatrix))
},rho_list_needed,delta0_needed,
V_iter[[s]],Sigma_list_needed)
When I was testing my sample_Z.cpp with parallel through parLapply, the single calculation takes around 1 sec. By parallel, my 4 iterations takes around 1.2 secs, which is a big improvement compared to unparalleled version, which is 8 sec.
There's no problem at all when I run my program yesterday. Just now I noticed a bug and revised my program. To give my PC a fresh environment, I restarted my computer. When started to run my program, I only opened the .R file, and run. But it took 9 sec for that parallel, which used to be 1.2 sec. The 9 sec was after warming up my cores, i.e., already sourced the cpp before I time it.
I just don't know where is the bug. Then try to source the cpp file directly in my global merriment, and I found out that there was no caching at all. The second time took the same time as the first one.
But I accidentally opened the sample_Z.cpp in Rstudio, explicitly at the editor. And then, everything works correct now.
I don't know how to search this similar problem on google with what kind of key words and I don't know if opening the cpp file is a must, while I never known before.
Can anyone tell me what's the real issue? Thanks!
After restarting your PC, you probably had extra processes running which would have competed for CPU cores that slowed down your algorithm. The fact you're rebooting suggests to me you're not using Linux... but if you are, watch with top while starting your code, or equivalent for your platform.

RethinkDB 'make' fails on a Raspberry Pi

I have tried running the 'make ALLOW_WARNINGS=1' command multiple times, but it seems to fail every time.
I have made sure gcc is updated and so is the rethinkDB package.
Here's the error it produces:
It was down to the fact I didn't have enough swap. I ordered a new RPi2 and it worked (due to 1GB of ram built in), but I also found out that I didn't increase the swap file on the RPi1 correctly, I was following this tutorial and I didn't carry out the last two commands, so it might still be possible on a RPi1, I haven't checked.

mahout Seq2Sparse taking a very long time

I am using seq2sparse to convert sequence files to sparse vectors. Is it normal for this to take so long? Mine has been stuck on a monitorAndPrintJob for thirty minutes now. It seems to be doing something because any attempt to do anything else on the machine is very very sluggish. But it'd be nice to know if I should stop it and try again or if I should just wait.
Here's the command I used:
mahout home/bin/mahout seq2sparse -o outputDirectory -i inputDirectory -ml 10 -ng 2 -seq
Should I tweak some of the switches to help it run more efficiently?
This is on one local machine. The sequence file is 151MB.
Edit: it is no longer sluggish to do other things, but htop shows java is doing things to do with hadoop and mahout so I guess I should leave it? It's been at this one part of the process for forty minutes now.
Edit 2: ah not to worry, I went out and came back and it had finished without crashing or anything. Cheers anyway if you read all this!

Why is this Perl require line taking so much time?

I have a Perl script that runs via a system() command from C. On a specific site (SunOS 5.10), when that script is run, it nearly always takes 6 seconds or more. On other sites, it runs pretty much instantly (0.1s). If I run the script manually, i.e. not from the C code, it also runs instantly. I eventually tracked the slowness down (by spitting out the time a whole bunch in a lot of different places), to a single require line. The file that it is requiring is another Perl script we wrote. The script consists of a single require (this file here), 3 scalars that are assigned integer values, and a handful of time/date conversion routines. The file ends with a 1;. That single require appears to take as much as 6 seconds on occasion, but as I said, not always even on the same machine. I'm absolutely stumped here. My only last thought is to turn on profiling, but the site doesn't have Devel::Profiler and my only other option (that I know of) would be to add it to the Perl command which would require me altering and recompiling the C code (doable but non-trivial).
Anybody have ANY idea what could be going on here? I don't think I can/want to put the entire date.pl that is being required, but it's pretty much exactly as I described; I could answer any questions about it that you have.
Thanks in advance.
You might be interested in A Timely Start by Jean-Louis Leroy. He had a similar problem and tracked it down to a long and deep module search path where perl usually found the modules in the last entries in #INC.
Six seconds is a long time. Have you checked what your network is doing during this?
My first thought was that spawning the new process when using the system() command could be the problem, but six seconds is too long.
I don't know much about perl, but I could imagine that for any reason, the access of the time module could invoke a call to a network time server. Just to get synchronized. Maybe this takes so long or maybe it is getting a time out.
It could be that this only happens for a newly spawned process -- hence only when you use the system() command.
just wild guessing...
So, this does nothing to answer your question directly, but please tell me that you're not actually running on perl 4? Assuming you're on perl 5, you could remove the entire file and replace the require with use POSIX qw(ctime) to get the version that comes with Perl.
If you do have to support perl4, I'll merely grumble something about version 5 being fifteen years old now and go away. :)

Trouble with parallel make not always starting another job when one finishes

I'm working on a system with four logical CPS (two dual-core CPUs if it matters). I'm using make to parallelize twelve trivially parallelizable tasks and doing it from cron.
The invocation looks like:
make -k -j 4 -l 3.99 -C [dir] [12 targets]
The trouble I'm running into is that sometimes one job will finish but the next one won't startup even though it shouldn't be stopped by the load average limiter. Each target takes about four hours to complete and I'm wondering if this might be part of the problem.
Edit: Sometimes a target does fail but I use the -k option to have the rest of the make still run. I haven't noticed any correlation with jobs failing and the next job not starting.
I'd drop the '-l'
If all you plan to run the the system is this build I think the -j 4 does what you want.
Based on my memory, if you have anything else running (crond?), that can push the load average over 4.
GNU make ref
Does make think one of the targets is failing? If so, it will stop the make after the running jobs finish. You can use -k to tell it to continue even if an error occurs.
#BCS
I'm 99.9% sure that the -l isn't causeing the problem because I can watch the load average on the machine and it drops down to about three and sometimes as low as one (!) without starting the next job.

Resources