When running opreport after profiling a binary I get something like what is seen below:
How do I read this exactly?
For example, for the second-to-last item, can we say that it was executed 66389 times and occupied ~6% of the total runtime? And additionally, what are the cumulative columns for?
I think this is a difficult problem.
In general I know that I need both a total count and a current count to compute a progress percentage.
But in this case, I cannot get the total count.
For example, there are two jobs, A and B.
The total amount of work for each job is set randomly.
Also, I cannot know a job's total work count before the job has ended.
One method I have is to assign a fixed share to each job, e.g. when A is done, set the progress to 50%.
But in a situation where A's count is 10 and B's count is 1000, this gives a strange result.
Although the total count is 1010, the bar already shows 50% when only 10 units of work are done.
That feels wrong.
So I want to show users a more natural progress rate, but I don't have the total work count.
Is there any useful alternative to the generic percentage calculation?
If you want to know how much total progress you have made without knowing how much total progress there could be, that is logically impossible.
However, you could:
estimate it
keep historical data
assume the maximum and just surprise the user when it's faster
To instead show the rate of progress:
record the time at the start of your process and subtract it from the current time when you check again
divide the completed jobs by that elapsed time to get jobs/second
Roughly
rate = jobs_completed / (time_now - time_start)
You can also do this over some window, but then you need to record both the time and the number of jobs completed at the start of the window, and subtract both off to get just the jobs completed within that window:
rate_windowed = (jobs_completed - jobs_previous) / (time_now - time_previous)
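A minimal sketch of both calculations in Python (the variable names are just placeholders, not any particular library's API):

import time

def overall_rate(jobs_completed, time_start):
    # Average jobs per second since the process started.
    return jobs_completed / (time.monotonic() - time_start)

def windowed_rate(jobs_completed, jobs_previous, time_now, time_previous):
    # Jobs per second over just the most recent window.
    return (jobs_completed - jobs_previous) / (time_now - time_previous)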
I called the take() method of an RDD[LabeledPoint] from spark-shell, which seemed to be a laborious job for Spark.
The spark-shell shows a progress-bar:
The progress bar fills again and again, and I don't know how to produce a reasonable estimate of the time needed (or the total progress) from those numbers above.
Does anyone know what those numbers mean?
Thanks in advance.
The numbers show the Spark stage that is running, and the number of completed, in-progress, and total tasks in the stage. (See "What do the numbers on the progress bar mean in spark-shell?" for more on the progress bar.)
Spark stages run tasks in parallel. In your case 5 tasks are running in parallel at the moment. If each task takes roughly the same time, this should give you an idea of how much longer you have to wait for this stage to finish.
But RDD.take can take more than one stage. take(1) will first get the first element of the first partition. If the first partition is empty, it will take the first elements from the second, third, fourth, and fifth partitions. The number of partitions it looks at in each stage is 4× the number of partitions already checked. So if you have a whole lot of empty partitions, take(1) can take many iterations. This can be the case for example if you have a large amount of data, then do filter(_.name == "John").take(1).
If you know your result will be small, you can save time by using collect instead of take(1). This will always gather all the data in a single stage. The main advantage is that in this case all the partitions will be processed in parallel, instead of the somewhat sequential manner of take.
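As a rough PySpark sketch of the two approaches (this assumes the `sc` context available in a pyspark shell; the data and the predicate are purely illustrative):

# Illustrative data; in the question this would be the existing RDD[LabeledPoint].
rows = sc.parallelize([{"name": "Alice"}, {"name": "Bob"}, {"name": "John"}], 8)
first_match = rows.filter(lambda r: r["name"] == "John").take(1)    # may run several stages if early partitions are empty
all_matches = rows.filter(lambda r: r["name"] == "John").collect()  # one stage, all partitions processed in parallel
# collect() is only appropriate when the filtered result is small enough to fit in driver memory.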
In the spirit of Dimitre Novatchev's answer at XSLT Performance, I want to create a profile that shows where the time consumed by my XSL transform has gone. Using the Saxon -TP:profile.html option, we created an “Analysis of Stylesheet Execution Time” HTML document.
At the top of that document, we see:
Total time: 1102316.688 milliseconds
This figure (1,102 seconds) corresponds with my measured program execution time.
However, the sum of the “total time (net)” column values is less than 2% of this total. I assume, per http://www.saxonica.com/html/documentation/using-xsl/performanceanalysis.html, that the “total time (net)” column values are reported in milliseconds.
I would normally work a profile from the top down, but in this case, I don’t want to invest effort into optimizing a template that is reported to have contributed less than 0.5% of my total response time.
How can I find out where my time has really gone? Specifically, how can I learn where the unreported 98% of my program's time has been consumed?
I would like to know the real meaning of these two counters: Total time spent by all maps in occupied slots (ms) and Total time spent by all reduces in occupied slots (ms). I just wrote an MR program similar to word count, and I got:
Total time spent by all maps in occupied slots (ms)=15667400
Total time spent by all reduces in occupied slots (ms)=158952
CPU time spent (ms)=51930
real 7m38.886s
Why is this so? The first counter has a very high value, which is not comparable with the other three. Kindly clarify this for me.
Thank You
With Regards
You probably need to give some more context around your input data, but the first two counters show how much time was spent across all map and reduce tasks. This number is larger than everything else because you probably have a multi-node Hadoop cluster and a large input dataset, meaning you have lots of map tasks running in parallel. Say you have 1000 map tasks running in parallel and each takes 10 seconds to complete; in this case the total time across all mappers would be 1000 * 10 = 10,000 seconds. In reality the map phase may only take 10-30 seconds to complete in parallel, but if you were to run the tasks serially on a single-node, single-map-slot cluster, they would take 10,000 seconds to complete.
The CPU time spent counter refers to how much of the total time was pure CPU processing; this is smaller than the others because your job is mostly IO-bound (reading from and writing to disk, or across the network).
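As a rough back-of-the-envelope illustration (plain arithmetic in Python, using the numbers from the question):

slot_ms = 15667400                    # Total time spent by all maps in occupied slots (ms)
wall_ms = (7 * 60 + 38.886) * 1000    # real (wall-clock) time of the whole job, 7m38.886s
print(slot_ms / wall_ms)              # roughly 34, i.e. ~34 map slots were occupied at once on average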
Are there any widgets for predicting when a download (or any other process) will finish based on percent done history?
The trivial version would just do a two-point fit based on the start time, current time, and percent done, but better options are possible.
A GUI widget would be nice, but a class that just returns the value would be just fine.
The theoretical algorithm I would attempt, if I were to write such a widget, would be something like:
Record the amount of data transferred within a one second period (a literal KiB/s)
Remember the last 5 or 10 such periods (to get a recent average KiB/s)
Subtract the transferred size from the total size (to get the "bytes remaining")
???
Widget!
That oughta do it...
(the missing step being: kibibytes remaining divided by average KiB/s)
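A minimal sketch of that algorithm in Python (the class and parameter names are just illustrative, not an existing widget API):

import collections
import time

class EtaEstimator:
    # Estimates time remaining from a short window of recent throughput samples.
    def __init__(self, total_bytes, window=10):
        self.total_bytes = total_bytes
        self.samples = collections.deque(maxlen=window)  # recent (timestamp, bytes_transferred) pairs

    def update(self, bytes_transferred):
        self.samples.append((time.monotonic(), bytes_transferred))

    def eta_seconds(self):
        if len(self.samples) < 2:
            return None                        # not enough data yet
        (t0, b0), (t1, b1) = self.samples[0], self.samples[-1]
        rate = (b1 - b0) / (t1 - t0)           # recent average bytes per second
        if rate <= 0:
            return None
        return (self.total_bytes - b1) / rate  # bytes remaining divided by recent rate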
Most progress bar widgets have an update() method which takes a percentage done.
They then calculate the time remaining based on the time since start and the latest percentage. If you have a process with a long setup time, you might want to add more functionality so that it doesn't include that time, perhaps by having the prediction clock reset when sent a 0% update.
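Roughly, that built-in estimate amounts to something like the following sketch (this is not any specific widget's implementation):

def time_remaining(percent_done, elapsed_seconds):
    # Naive two-point estimate: assume progress continues at its average rate so far.
    if percent_done <= 0:
        return None
    return elapsed_seconds * (100 - percent_done) / percent_done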
It would have roughly the same effect as some of the other proposals, but by collecting percent-done data as a function of time, statistical methods could be used to generate an R^2 best-fit line and project it to 100% done. To make it take the current operation into account, a heavier weight could be placed on the newer data (or the older data could be thinned), and to make it ignore short-term fluctuations, a heavy weight can be put on the first data point.
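A sketch of that idea using a weighted linear fit (assuming NumPy; the particular weighting scheme is just one possibility):

import numpy as np

def projected_finish_time(times, percents):
    # Fit percent-done versus time with a weighted least-squares line
    # and project it forward to 100% done.
    if len(times) < 2:
        return None
    t = np.asarray(times, dtype=float)
    p = np.asarray(percents, dtype=float)
    w = np.linspace(1.0, 3.0, len(t))   # weight newer samples more heavily...
    w[0] = 3.0                          # ...and anchor on the first point to damp fluctuations
    slope, intercept = np.polyfit(t, p, 1, w=w)
    if slope <= 0:
        return None                     # no measurable forward progress; cannot project
    return (100.0 - intercept) / slope  # time at which the fitted line reaches 100%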