Does Spark automatically cache some results?

I run an action two times, and the second time takes very little time to run, so I suspect that Spark automatically caches some results. But I couldn't find any source confirming this.
I'm using Spark 1.4.
import re

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
    .filter(lambda x: x != '') \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)
%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s
%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms

From the documentation:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
The underlying filesystem will also be caching access to the disk.
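If you want the result kept around deliberately rather than relying on shuffle files or OS-level caching, persist it explicitly. A minimal PySpark sketch (the input path is a placeholder, and sc is assumed to be an existing SparkContext, as in the question):

import re
from pyspark import StorageLevel

doc = sc.textFile('input.txt')  # hypothetical path

doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
    .filter(lambda x: x != '') \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)

doc_wc.persist(StorageLevel.MEMORY_ONLY)  # equivalent to .cache() for RDDs

doc_wc.take(5)  # first action computes the full lineage and fills the cache
doc_wc.take(5)  # later actions read the cached partitions directly

Unlike the automatic shuffle-file reuse, an explicit persist lasts as long as you keep the RDD reference (or until you call unpersist()), and you can choose the storage level.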

Related

Creating a Trace File in ChampSim

Whenever I execute a trace file that I created using a Pin tool, I get this on the terminal:
Heartbeat CPU 0 instructions: 10000003 cycles: 2500068 heartbeat IPC: 3.99989 cumulative IPC: 3.99989 (Simulation time: 0 hr 0 min 20 sec)
followed by "reached end of trace". I also tried executing different matrix-multiplication trace files with different matrix sizes, and all of them take the same amount of time to execute (30 minutes). Does anyone know what the problem is and how to fix it?
I tried executing a trace file provided by the 3rd Data Prefetching Championship (DPC-3) and it works just fine: I don't get "reached end of trace" on the terminal, and when I executed different trace files, each took a different amount of time.

Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and it also works fine for 14 days together. But when I query all 15 days, the job keeps running forever (hangs) and the task # does not update.
My settings:
I am using a 51-node r3.4xlarge cluster with dynamic allocation and maximum resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same with 14 days, I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update :
The job works fine with :
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema(schema).json(n)
  df = df.union(tempDF)  // the original read `df = df(tempDF)`, which does not compile
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it works fine now.
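For context, one way to apply that setting: Hadoop input-format options can be forwarded through Spark's spark.hadoop.* configuration prefix. A hedged PySpark sketch (the app name is hypothetical; the value is the 256 GB from the update, expressed in bytes):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("s3-count")  # hypothetical app name
    # forwarded to the Hadoop configuration as
    # mapreduce.input.fileinputformat.split.minsize
    .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
            str(256 * 1024 ** 3))
    .getOrCreate())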
Dynamic allocation and maximizeResourceAllocation are different settings; one is disabled when the other is active. With maximizeResourceAllocation in EMR, one executor per node is launched, and it is allocated all of that node's cores and memory.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I'm not sure it is even required. However, start with the following rule of thumb, and you will get the hang of tuning these configurations.
Cluster memory - minimum of 2x the data you are dealing with.
Now, assuming 51 nodes is what you require, try the below:
r3.4xlarge has 16 vCPUs - so you can put all of them to use, leaving one for the OS and other processes.
Set your number of executors to 150 - this will allocate 3 executors per node.
Set the number of cores per executor to 5 (3 executors per node × 5 cores = 15, leaving one core free).
Set your executor memory to roughly total host memory / 3 = 35G.
Control the parallelism (default partitions): set it to the total number of cores you have, ~800.
Adjust shuffle partitions to twice the number of cores - 1600.
The above configurations have been working like a charm for me (see the sketch below for how they map to Spark properties). You can monitor resource utilization on the Spark UI.
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
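For illustration, a minimal PySpark sketch of how the numbers above map onto Spark settings (the property names are standard Spark configuration; the values are this answer's rule-of-thumb numbers for this cluster, not universal constants):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("tuned-job")  # hypothetical app name
    # 150 executors across 51 nodes ~ 3 executors per node
    .config("spark.executor.instances", "150")
    # 5 cores each: 3 x 5 = 15, leaving one core per node for the OS
    .config("spark.executor.cores", "5")
    # roughly host memory / 3
    .config("spark.executor.memory", "35g")
    # default partitions ~ total cores
    .config("spark.default.parallelism", "800")
    # shuffle partitions ~ 2 x total cores
    .config("spark.sql.shuffle.partitions", "1600")
    .getOrCreate())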
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory.
My suggestion is to not use dynamic resource allocation: let the job run and see if it still hangs (note that a Spark job can consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running). If it doesn't hang, start hardcoding the resources and keep adjusting them until you find the best allocation you can use.
The links below can help you understand resource allocation and how to optimize it:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Speed of NSFileManager copyItemAtPath is wildly different than Finder in some cases

I am using NSFileManager to copy a lot of files from one drive to another.
In some cases I am seeing users say: "The app is unusable. It transfers at 0.33 MB/s on a USB2 connection, versus the 10 minutes it would take me to just drag and drop."
I am running this on a background thread - could that be the issue?
secondaryTask = dispatch_queue_create("com.myorg.myapp.task2", NULL);
dispatch_sync(secondaryTask, ^{
    NSError *error = nil;
    // Note: the original snippet never initialized manager; it needs to be
    // [NSFileManager defaultManager] (or an alloc/init'd instance).
    NSFileManager *manager = [NSFileManager defaultManager];
    [manager copyItemAtPath:sourceFile toPath:filePath error:&error];
});
This seems to be related to OS X throttling my app. Some users actually see this in the log:
5/9/16 15:26:31.000 kernel[0]: process MyApp[937] thread 36146 caught burning CPU! It used more than 50% CPU (Actual recent usage: 91%) over 180 seconds. thread lifetime cpu usage 90.726617 seconds, (49.587139 user, 41.139478 system) ledger info: balance: 90006865992 credit: 90006865992 debit: 0 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 98013987431
So... this is a GCD question... and I've brought it up with Apple directly.

Why is my WordPress website so slow, and why am I having so much downtime?

I have used YSlow and PageSpeed to find the cause, but I can't figure out why my blog http://www.fotokringarnhem.nl sometimes loads blazingly fast (cached files, I guess) and other times takes 10 seconds or longer to load.
I am on a shared server, but I haven't had problems like this with other websites on shared servers.
I'm using CloudFlare to speed up my blog, but to no avail.
Am I missing something?
Pingdom report for the last 30 days (also see http://stats.pingdom.com/hseaskprwiaz):
Response time average: 7,620 ms
Slowest average: 18,307 ms
Fastest average: 4,237 ms
Uptime: 96.24%
Edit 1:
from basicstate.com
diagnostics
+dns
+connect
-request
-response
So I guess it fails on the requests. Any options to narrow it down?
Edit 2:
I used P3 (Plugin Performance Profiler) to determine which plugins caused the most load time. It turns out that User Access Manager caused about 60% of the load time, so I deleted it.
This did something: I now get far fewer timeouts, but it still takes a long time for anything to appear on the screen.
I then used the SQL Monitor plugin and determined that 82 queries are executed per request, taking about 10 seconds!
If you have a static site without millions of users and performance is highly variable, your host is probably to blame. I have tried about 8 different hosts and researched a dozen others; I highly suggest Media Temple (mt) - the best customer service and performance you can get for the money.
Also, check out the speed test tool by WP Engine: http://speed.wpengine.com/ - it gives great insight into why your site is slow. It takes a few hours to generate a report.
Use the P3 plugin,
and use this article to further optimize your website: http://andbreak.com/articles/guide-speed-wordpress/
When all else fails, try switching providers.
Edit
Turns out that deleting all of the automatically saved revisions of pages and posts, as well as the drafts, was the key. This dramatically cut my query time to the server.
Now at lightning speeds!
Here is the report from the last few days:
uptime: 99.21%
overall average: 3,322 ms
There is a useful WordPress plugin to limit the number of autosaved revisions and drafts for posts and pages: Revision Control.
Results weren't instant, by the way; it took a day to take effect.
Basicstate results (clearly showing the improvement when I deleted the revisions and drafts on the 11th or 12th, not sure which):
date uptime dns connect request ttfb ttlb
2012-09-18 98.97 0.031 0.047 0.047 0.353 0.475
2012-09-17 100.00 0.031 0.047 0.047 0.389 0.810
2012-09-16 100.00 0.029 0.045 0.045 0.342 0.499
2012-09-15 93.81 0.029 0.045 0.045 0.739 1.035
2012-09-14 98.97 0.053 0.068 0.068 0.387 0.565
2012-09-13 100.00 0.058 0.074 0.074 0.499 0.853
2012-09-12 95.00 0.030 0.046 0.046 5.994 7.024
2012-09-11 96.97 0.051 0.096 0.096 9.707 9.949
2012-09-10 73.15 0.027 0.043 0.043 6.765 6.952
2012-09-09 43.48 0.027 0.121 0.121 3.652 3.724
2012-09-08 31.82 0.028 0.045 0.045 2.757 2.867
2012-09-07 71.93 0.026 0.042 0.042 5.917 6.091
2012-09-06 60.49 0.027 0.043 0.043 4.590 4.751

Testing with JMeter: how to run N requests per second

I need to test whether our system can perform N requests per second.
Technically, it's 2 requests to one API, 2 requests to another, and 6 requests to a third one.
But the important thing is that they should happen simultaneously - so 10 requests per second.
So, in JMeter I've created three Thread Groups: the first defines a thread count of 1 and a ramp-up time of 0; the second is the same; and the third defines a thread count of 6 and a ramp-up time of 0.
But that doesn't really guarantee they will run every second.
How do I emulate that? And how do I see the results - whether the system was able to keep up or not?
Thanks!
You could use the Constant Throughput Timer.
Quote from JMeter help files below:
18.6.4 Constant Throughput Timer
This timer introduces variable pauses, calculated to keep the total throughput (in terms of samples per minute) as close as possible to a given figure. Of course the throughput will be lower if the server is not capable of handling it, or if other timers or time-consuming test elements prevent it.
N.B. although the Timer is called the Constant Throughput timer, the throughput value does not need to be constant. It can be defined in terms of a variable or function call, and the value can be changed during a test.
For example, I've used it to generate 40 requests per second (the throughput value is specified per minute, so 40/s × 60 = 2400):
<ConstantThroughputTimer guiclass="TestBeanGUI" testclass="ConstantThroughputTimer" testname="Constant Throughput Timer" enabled="true">
<stringProp name="calcMode">all active threads in current thread group</stringProp>
<doubleProp>
<name>throughput</name>
<value>2400.0</value>
<savedValue>0.0</savedValue>
</doubleProp>
</ConstantThroughputTimer>
And that's a summary:
Created the tree successfully using performance/search-performance.jmx
Starting the test # Tue Mar 15 16:28:39 CET 2011 (1300202919244)
Waiting for possible shutdown message on port 4445
Generate Summary Results + 3247 in 80,3s = 40,4/s Avg: 18 Min: 0 Max: 1328 Err: 108 (3,33%)
Generate Summary Results + 7199 in 180,0s = 40,0/s Avg: 15 Min: 1 Max: 2071 Err: 378 (5,25%)
Generate Summary Results = 10446 in 260,3s = 40,1/s Avg: 16 Min: 0 Max: 2071 Err: 486 (4,65%)
Generate Summary Results + 7200 in 180,0s = 40,0/s Avg: 14 Min: 0 Max: 152 Err: 399 (5,54%)
Generate Summary Results = 17646 in 440,4s = 40,1/s Avg: 15 Min: 0 Max: 2071 Err: 885 (5,02%)
Generate Summary Results + 7199 in 180,0s = 40,0/s Avg: 14 Min: 0 Max: 1797 Err: 436 (6,06%)
Generate Summary Results = 24845 in 620,4s = 40,0/s Avg: 15 Min: 0 Max: 2071 Err: 1321 (5,32%)
But note that I ran this test inside my own network.
As with any network test, there are always going to be problems, especially with latency: even if you could send exactly 6 requests per second, they are sent sequentially (that's just how packets get sent) and may not all hit within that second, plus there's processing time.
Generally, when performance metrics specify x per second, it's measured over a period of time. Your API may even have a buffer: you could technically send 6 per second but process only 5 per second, with a buffer of 20, meaning it would be fine for 20 seconds of traffic; by then you'd have sent 120 requests, which take 120/5 = 24 seconds to process. Any more than that would overflow the buffer. So just sending exactly 6 in one second is an insufficient test.
In the thread group, you're right to set the number of threads (users) to 6. Then run it looping forever (tick the box, or put it in a while loop) and add listeners such as Aggregate Report and View Results Tree. The results let you check that the right requests are being sent and answered (assuming you validate the responses), and the Aggregate Report shows how many of each request happen per hour (divide by 3600 for the per-second rate); because of this inaccuracy, it's best to run the test for a good length of time.
The initial load test can now be run; as a more accurate test, you can leave it running for longer (a soak test) to see whether other problems surface: buffer overflows, memory leaks, or other unexpected events.
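To sanity-check a target rate outside JMeter, here is a minimal Python sketch of a constant-rate sender (the endpoint URL, rate, and duration are placeholders; each request fires on its own thread so a slow response does not stall the next one-second window):

import threading
import time
import urllib.request

URL = "http://localhost:8080/api"  # hypothetical endpoint
RATE = 10      # target requests per second
DURATION = 20  # seconds to run

def fire(url):
    try:
        urllib.request.urlopen(url, timeout=5)
    except Exception as exc:
        print("request failed:", exc)

start = time.time()
while time.time() - start < DURATION:
    tick = time.time()
    for _ in range(RATE):
        threading.Thread(target=fire, args=(URL,), daemon=True).start()
    # sleep out the remainder of this one-second window
    time.sleep(max(0.0, 1.0 - (time.time() - tick)))

As noted above, judge the achieved rate over the whole run rather than over any single second.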
Use the Throughput Shaping Timer
I had a similar problem, and here are two solutions I found:
Solution 1:
You can use the Stepping Thread Group (it allows you to increase the thread count in stages over set periods of time) with a Constant Throughput Timer inside it.
The Constant Throughput Timer lets you set the number of samples each thread can send per minute (e.g., if you set it to 1, the thread will only send one request per minute). You can apply the timer to all threads in your Thread Group, or give each thread its own timer with its own settings.
Read more about the Constant Throughput Timer here: https://www.blazemeter.com/blog/how-use-jmeters-throughput-constant-timer
Solution 2:
Use "SetUp Thread Group". You can calculate thread number and rump up time to get Threads per Second desired.
You can use the Schedule Feedback Function; you will also need the Concurrency Thread Group.
The same can also be done from the UI by adding a "Constant Throughput Timer" as suggested above: right-click on the Thread Group, click Timer, and then choose "Constant Throughput Timer".
