SageMaker Speed in different region - runtime

I am working on AWS SageMaker service. First, I ran my notebook file under the region , "Hong Kong", this work cost me around 5 minutes. However, after I switched to "Tokyo", the same script spend 8 minutes for executing.
Does region effect script's runtime ?
Thanks.

Based on the comments, this is no more an issue.

Related

"Select jobs to execute..." runs literally forever

I have a rather complex workflow with 750 samples and roughly 18.000 jobs, at first snakemake runs just fine but then after around 4.000 jobs it suddenly freezes and upon restart it hangs with "Select jobs to execute..." for 24h, after that I terminated it. The initial DAG building takes roughly 2-3 minutes, though.
When I run snakemake (v5.32.0 and v5.32.1) with the --verbose option, I get tons of lines similar to this one:
Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, best possible -52538.194 (7.08 seconds
I tried to delete the .snakemake folder in the hope that something went riot there, but that wasn't the case, unfortunately. To me it seems that the CBC MILP Solver somehow does not converge, and it keeps going and going to bring the best and the best possible solution closer together!?
Now I do not have any idea anymore, how to proceed and fix the problem. My possible solutions are somehow to change the convergence criteria or the solver itself. In the manual I found the option --scheduler-ilp-solver but it has apparently only one option, the default COIN_CMD.
After terminating a (shorter) run, I get this verbose output
Result - User ctrl-cuser ctrl-c
Objective value: 52534.79114334
Upper bound: 52538.202
Gap: -0.00
Enumerated nodes: 186926
Total iterations: 1807277
Time (CPU seconds): 1181.97
Time (Wallclock seconds): 1188.11
Next I will try to limit the number of samples in the workflow and see if this has any impact (for other datasets with 500 samples, it ran without any problems (with snakemake version 5.24), but there the DAG building took some hours. Hence, I am not very eager to try the old version.)
So, any idea how to fix the problem is highly appreciated. Also, I do not even know, if this is a bug!?
EDIT Actually, I believe it is a bug in the current version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...
I also ran into this issue with a smaller workflow (~1500 jobs total) and snakemake version 6.0.2. About half the jobs had run when the workflow got stuck, and refused to run any more jobs. Looks like it's a problem specific to the ILP solver, because when I re-ran with --scheduler greedy, it worked fine.
Actually, I believe it is a bug in the current snakemake version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...

mclapply and spark_read_parquet

I am relatively new as active user to the forum, but have to thank you all first your contributions because I have been looking for answers since years...
Today, I have a question that nobody has solved or I am not able to find...
I am trying to read files in parallel from s3 (AWS) to spark (local computer) as part of a test system. I have used mclapply, but when set more that 1 core, it fails...
Example: (the same code works when using one core, but fails when using 2)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 1)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 2)
Warning message:
In mclapply(seq(file_paths), function(i) { :
all scheduled cores encountered errors in user code
Any suggestion???
Thanks in advance.
Just read everything into one table via 1 spark_read_parquet() call, this way Spark handles the parallelization for you. If you need separate tables you can split them afterwards assuming there's a column that tells you which file the data came from. In general you shouldn't need to use mcapply() when using Spark with R.

Python subprocess.call or Popen limit CPU resources

I would like to set 4 additional properties on 8000 images using the Google Earth Engine command line tool. Properties are unique per image. I was using Python 2.7 and the subprocess.call(cmd) or subprocess.check_output(cmd) methods. Both are very slow (takes 9 seconds per image i.e. 20 hours total). Therefore I tried to send the commands without waiting for a response using supprocess.Popen(). It causes my PC to crash due to the amount of tasks (CPU close to 100%)
I was looking for ways to have my PC use say 80% of its CPU or even better, scale down CPU use if I am using other things. I found the os.nice() and nice argument for subprocess.Popen(["nice",20]) but struggle to use them in my code.
this is an example of a command I send using the subprocess method:
earthengine asset set -p parameter=DomWN_month users/rutgerhofste/PCRGlobWB20V04/demand/global_historical_PDomWN_month/global_historical_PDomWN_month_millionm3_5min_1960_2014I000Y1960M01

Elasticsearch bulk update is extremely slow

I am indexing a large amount of daily data ~160GB per index into elasticsearch. I am facing this case where I need to update almost all the docs in the indices with a small amount of data(~16GB) which is of the format
id1,data1
id1,data2
id2,data1
id2,data2
id2,data3
.
.
.
My update operations start happening at 16000 lines per second and in over 5 minutes it comes down to 1000 lines per second and doesnt go up after that. The update process for this 16GB of data is currently longer than the time it takes for my entire indexing of 160GB to happen
My conf file for the update operation currently looks as follows
output
{
elasticsearch {
action => "update"
doc_as_upsert => true
hosts => ["host1","host2","host3","host4"]
index => "logstash-2017-08-1"
document_id => "%{uniqueid}"
document_type => "daily"
retry_on_conflict => 2
flush_size => 1000
}
}
The optimizations I have done to speed up indexing in my cluster based on the suggestions here https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html are
Setting "indices.store.throttle.type" : "none"
Index "refresh_interval" : "-1"
I am running my cluster on 4 instances of the d2.8xlarge EC2 instances. I have allocated 30GB of heap to each nodes.
While the update is happening barely any cpu is used and the load is very less as well.
Despite everything the update is extremely slow. Is there something very obvious that I am missing that is causing this issue? While looking at the threadpool data I find that the number of threads working on bulk operations are constantly high.
Any help on this issue would be really helpful
Thanks in advance
There are a couple of rule-outs to try here.
Memory Pressure
With 244GB of RAM, this is not terribly likely, but you can still check it out. Find the jstat command in the JDK for your platform, though there are visual tools for some of them. You want to check both your Logstash JVM and the ElasticSearch JVMs.
jstat -gcutil -h7 {PID of JVM} 2s
This will give you a readout of the various memory pools, garbage collection counts, and GC timings for that JVM as it works. It will update every 2 seconds, and print headers every 7 lines. Spending excessive time in the FCT is a sign that you're underallocated for HEAP.
I/O Pressure
The d2.8xlarge is a dense-storage instance, and may not be great for a highly random, small-block workload. If you're on a Unix platform, top will tell you how much time you're spending in IOWAIT state. If it's high, your storage isn't up to the workload you're sending it.
If that's the case, you may want to consider provisioned IOP EBS instances rather than the instance-local stuff. Or, if your stuff will fit, consider an instance in the i3 family of high I/O instances instead.
Logstash version
You don't say what version of Logstash you're using. Being StackOverflow, you're likely to be using 5.2. If that's the case, this isn't a rule-out.
But, if you're using something in the 2.x series, you may want to set the -w flag to 1 at first, and work your way up. Yes, that's single-threading this. But the ElasticSearch output has some concurrency issues in the 2.x series that are largely fixed in the 5.x series.
With elasticsearch version 6.0 we had an exactly same issue of slow updates on aws and the culprit was slow I/O. Same data was upserting on a local test stack completely fine but once in cloud on ec2 stack, everything was dying after an initial burst of speedy inserts lasting only for few minutes.
Local test stack was a low-spec server in terms of memory and cpu but contained SSDs.
s3 stack was EBS volumes with default gp2 300 IOPS.
Converting the volumes to type io1 with 3000 IOPS solved the issue and everything got back on track.
I am using amazon aws elasticsearch service version 6.0 . I need heavy write/insert from serials of json file to the elasticsearch for 10 billion items . The elasticsearch-py bulk write speed is really slow most of time and occasionally high speed write . i tried all kinds of methods , such as split json file to smaller pieces , multiprocess read json files , parallel_bulk insert into elasticsearch , nothing works . Finally , after I upgraded io1 EBS volume , everything goes smoothly with 10000 write IOPS .

Rcpp in Rstudio, can't cache in memory when parallel if I don't open the cpp file in Rstudio

I met a wired problem but I wonder if I'm asking the correct question:
result = parLapply(cl, 1:4,
function(j,rho_list_needed,delta0_needed,
V_iter_s,Sigma_list_needed) {
rhoj = rho_list_needed[[j]]
delta0_in_cpp = delta0_needed
v = as.vector(V_iter_s[,,,j])
sigmaj = Sigma_list_needed[[j]]
sourceCpp('sample_Z.cpp')#first time complie slow,then cashed
return(Sample_Z(rhoj,delta0_in_cpp, v,sigmaj,A,Cmatrix))
},rho_list_needed,delta0_needed,
V_iter[[s]],Sigma_list_needed)
When I was testing my sample_Z.cpp with parallel through parLapply, the single calculation takes around 1 sec. By parallel, my 4 iterations takes around 1.2 secs, which is a big improvement compared to unparalleled version, which is 8 sec.
There's no problem at all when I run my program yesterday. Just now I noticed a bug and revised my program. To give my PC a fresh environment, I restarted my computer. When started to run my program, I only opened the .R file, and run. But it took 9 sec for that parallel, which used to be 1.2 sec. The 9 sec was after warming up my cores, i.e., already sourced the cpp before I time it.
I just don't know where is the bug. Then try to source the cpp file directly in my global merriment, and I found out that there was no caching at all. The second time took the same time as the first one.
But I accidentally opened the sample_Z.cpp in Rstudio, explicitly at the editor. And then, everything works correct now.
I don't know how to search this similar problem on google with what kind of key words and I don't know if opening the cpp file is a must, while I never known before.
Can anyone tell me what's the real issue? Thanks!
After restarting your PC, you probably had extra processes running which would have competed for CPU cores that slowed down your algorithm. The fact you're rebooting suggests to me you're not using Linux... but if you are, watch with top while starting your code, or equivalent for your platform.

Resources