I have a pipeline that processes 1000 images. Each image goes through a 4-step process to create input for a model, so there are 4000 data preparation tasks plus a final prediction task: 4001 tasks in total.
The 4000 data preparation tasks are parallelised by Luigi, so 4 tasks run at once on 4 CPUs. For this I set OMP_THREAD_LIMIT=1; otherwise it hangs due to a conflict between Luigi and OpenMP.
The final prediction task uses PyTorch. This is a single Luigi task but is parallelised by PyTorch via OpenMP, so I reset OMP_THREAD_LIMIT before starting that task.
This works, but during the first 4000 tasks I get hundreds or thousands of warning messages: "OMP: Warning #96 Cannot form a team with 4 threads using 1 instead" and "OMP: Hint consider unsetting ... OMP_THREAD_LIMIT".
How do I disable these messages? Or is there some other way to temporarily disable OpenMP without OMP_THREAD_LIMIT?
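For reference, a minimal sketch of the env-var juggling described above. The function names are placeholders (not the actual pipeline code), and the KMP_WARNINGS switch assumes the Intel OpenMP runtime, which is the one that numbers its warnings like "Warning #96":

import os

def prepare_image(path):
    # During the 4000 prep tasks: ask OpenMP for a single thread instead of
    # capping it with OMP_THREAD_LIMIT, which is what triggers "Warning #96"
    # when a library still requests a team of 4 threads.
    os.environ["OMP_NUM_THREADS"] = "1"
    # Assumption: with Intel's libiomp5, KMP_WARNINGS=FALSE silences these
    # informational warnings entirely - verify against your runtime.
    os.environ["KMP_WARNINGS"] = "FALSE"
    ...  # the 4-step preparation for one image

def run_prediction():
    # Before the single PyTorch task: remove the caps so OpenMP can use all cores.
    for var in ("OMP_THREAD_LIMIT", "OMP_NUM_THREADS", "KMP_WARNINGS"):
        os.environ.pop(var, None)
    ...  # PyTorch inference

Note that OpenMP runtimes typically read these variables when the native library is first initialised, so they need to be set before the relevant library is loaded in each worker process.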
Related
I have 8 threads in JMeter, which I am executing every 5 minutes using Task Scheduler.
Now I have included 2 threads which should run only 5 times per day (e.g. at 12am, 5am, 10am...).
When those moments come, the execution should be 8+2 threads; the rest of the time, it should be only 8 threads.
Is it possible to configure such a use case in JMeter?
If you're going to use the same .jmx script and want to execute either 8 or 10 "threads" (whatever it is), you can go for:
If Controller - for conditional execution of the relevant test elements
__groovy() function to check the current time; an example condition which triggers the extra threads at e.g. 5 AM would be:
${__groovy(Calendar.getInstance().get(Calendar.HOUR_OF_DAY) == 5 && Calendar.getInstance().get(Calendar.MINUTE) == 0,)}
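For the five-runs-per-day case the same condition can check a list of hours rather than a single one. A sketch using only the three hours listed above (the remaining two are truncated in the question); note that literal commas inside JMeter function arguments must be escaped:
${__groovy(Calendar.getInstance().get(Calendar.HOUR_OF_DAY) in [0\, 5\, 10] && Calendar.getInstance().get(Calendar.MINUTE) == 0,)}
Put this expression into the If Controller's condition so the 2 extra threads only execute at those hours.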
I am trying to query 15 days of data from S3. I tried querying each day separately and it works fine; it also works fine for 14 days together. But when I query all 15 days, the job keeps running forever (hangs) and the task count is not updating.
My settings:
I am using a 51-node r3.4xlarge cluster with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same with 14 days I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema(schema).json(n)
  df = df.union(tempDF) // append each day's DataFrame (union assumed here; "df(tempDF)" would not compile)
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it works fine now.
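For reference, a sketch of how that property can be set from the Spark side before reading the files. The value is in bytes, so 256 GB as stated above would be 274877906944; a larger minimum split size means far fewer, much larger input splits (and therefore far fewer tasks):

sqlSession.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.minsize", (256L * 1024 * 1024 * 1024).toString)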
Dynamic allocation and maximize resource allocation are different settings; one is disabled when the other is active. With maximize resource allocation in EMR, 1 executor per node is launched, and it is allocated all of the node's cores and memory.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, follow these rules of thumb to begin with, and you will get the hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
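For a rough sense of scale, assuming ~122 GB of memory per r3.4xlarge node: about 50 worker nodes give you roughly 6 TB of cluster memory, so by this rule the cluster is comfortable with somewhere around 3 TB of data.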
Now, assuming 51 nodes is what you require, try the settings below:
r3.4xlarge has 16 vCPUs - leave one for the OS and other processes and put the remaining 15 to use.
Set your number of executors to 150 - this will allocate 3 executors per node.
Set the number of cores per executor to 5 (3 executors × 5 cores = 15 of the 16 vCPUs per node).
Set your executor memory to roughly total host memory / 3 ≈ 35 GB.
You also have to control the parallelism (default partitions); set this to roughly the total number of cores you have, ~800.
Adjust shuffle partitions - make this twice the number of cores, i.e. 1600.
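Expressed as Spark properties, the sizing above would look roughly like this (a sketch of what you would pass via spark-submit --conf or put in spark-defaults.conf; exact EMR classification syntax differs):

spark.dynamicAllocation.enabled=false
spark.executor.instances=150
spark.executor.cores=5
spark.executor.memory=35g
spark.default.parallelism=800
spark.sql.shuffle.partitions=1600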
The above configurations have been working like a charm for me. You can monitor the resource utilization in the Spark UI.
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
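The corresponding entry in capacity-scheduler.xml would look like this:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>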
You should be increasing the executor memory and the number of executors. If the data is huge, try increasing the driver memory.
My suggestion is to not use dynamic resource allocation: let the job run and see whether it still hangs. (Note that a Spark job can consume the entire cluster's resources and make other applications starve, so try this approach when no other jobs are running.) If it doesn't hang, then you should play with the resource allocation: start hardcoding the resources and keep increasing them until you find the best resource allocation you can possibly use.
The links below can help you understand resource allocation and optimization of resources.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html
I configured my test this way (in Windows 7):
1 virtual machine is the master; it starts all the VM slaves with the command for distributed testing (from the command line) and shows some graphs in the JMeter GUI (for example jp@gc Active Threads Over Time, hits/sec, response time, etc.).
3 virtual machines are slaves, which execute the test.
When the master sends "start" to the 3 slaves, the test works (each slave runs 6 threads), but in the GUI on the master there are only 6 threads in the graph (jp@gc - Active Threads Over Time), while in reality there are 18 (6 threads per slave, with 3 slaves).
So my question is: how can I see the total data for all slaves?
jp@gc - Active Threads Over Time = should show 18 threads (threads of slave1 + slave2 + slave3)
jp@gc - Hits per Second = hits of slave 1 + hits of slave 2 + hits of slave 3
and so on...
You need to add the __machineName or __machineIP function so the listeners can distinguish results coming from different nodes.
Also be aware of the mode property, which is configured to send results from slave machines every 100 results or every minute (whichever comes first), so you might want to amend it, i.e. add a mode=Standard line to the user.properties file on each slave node.
# Remote batching support
# Since JMeter 2.9, default is MODE_STRIPPED_BATCH, which returns samples in
# batch mode (every 100 samples or every minute by default)
# Note also that MODE_STRIPPED_BATCH strips response data from SampleResult, so if you need it change to
# another mode
# Hold retains samples until end of test (may need lots of memory)
# Batch returns samples in batches
# Statistical returns sample summary statistics
# hold_samples was originally defined as a separate property,
# but can now also be defined using mode=Hold
# mode can also be the class name of an implementation of org.apache.jmeter.samplers.SampleSender
#mode=Standard
#mode=Batch
#mode=Hold
#mode=Statistical
See Apache JMeter Properties Customization Guide for more information on working with JMeter properties.
Be aware that sending results under severe load may cause network I/O overhead, so it might be a good idea to consider the Backend Listener instead.
Add the machine name function to the Thread Group name area.
I'm using the Matpower MATLAB toolbox with parallel computing, and I am building a computer cluster to run the program shown below:
matlabpool open job1 5   % matlabpool means computer cluster
spmd   % the statement from the Parallel Computing Toolbox
    % Run all the statements in parallel
    % first part of code
    if labindex == 1
        runopf('casea');
    end
    % second part of code
    if labindex == 2
        runopf('caseb');
    end
end
matlabpool close;
When labindex is 1, the first part of the code runs on "computer1" in the cluster, and likewise when labindex is 2, the second part of the code runs on "computer2". My question is: does the main code shown above run in sequence or in parallel?
By which I mean, does the second part of the code have to wait until the first part has executed, or can the two parts be executed in parallel on two different computers in the cluster?
The code between spmd and corresponding end is sent to all workers (5 in your case) and they execute these instructions in parallel. Then, in your code you instructed worker #1 to execute runopf('casea'); and worker #2 runopf('caseb');. Workers #3 to #5 will effectively do nothing.
Technically, worker #2 will execute runopf('caseb'); a little later. The delay appears because worker #2 will also check the first if statement (but will not execute the code in it).
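A small sketch that makes the concurrency visible; it assumes a pool is already open and uses pause as a stand-in for the actual Matpower calls:

spmd
    t = tic;
    if labindex == 1
        pause(5);        % stand-in for runopf('casea')
    end
    if labindex == 2
        pause(1);        % stand-in for runopf('caseb')
    end
    elapsed = toc(t);    % worker 2 reports about 1 s, not 6 s
end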
All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors due to one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt Task Machine State Error Logs
attempt_201110302152_0002_m_000037_0 task_201110302152_0002_m_000037 worker2 FAILED
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
-------
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description of mapred.task.timeout, which defaults to 600 s, says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out if more than 600s is actually required for the map task to complete processing the input data or if there is a bug in the code which needs to be debugged.
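If the mapper genuinely needs long stretches of time per record, Hadoop Streaming also lets the task keep itself alive by writing reporter lines to stderr; a sketch (process_record is a placeholder for the real per-line work):

#!/usr/bin/env python
import sys

def process_record(line):
    pass  # expensive per-record work goes here

for i, line in enumerate(sys.stdin):
    process_record(line)
    # a real mapper would also emit key<TAB>value pairs to stdout here
    if i % 100 == 0:
        # Hadoop Streaming treats "reporter:status:..." on stderr as a status
        # update, which resets the task's no-progress timer.
        sys.stderr.write("reporter:status:processed %d records\n" % i)

Alternatively, the timeout itself can be raised when submitting the job, e.g. with -D mapred.task.timeout=1200000 (milliseconds).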
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.