Apache Spark Cache not working properly - caching

I am running a very simple program which counts words in S3 files:
JavaRDD<String> rdd = sparkContext.getSc().textFile("s3n://" + S3Plugin.s3Bucket + "/" + "*", 10);
JavaRDD<String> words = rdd.flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator()).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> pairs = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1)).persist(StorageLevel.MEMORY_AND_DISK_SER());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b).persist(StorageLevel.MEMORY_AND_DISK_SER());
//counts.cache();
Map m = counts.collectAsMap();
System.out.println(m);
After running the program multiple times, I can see multiple entries in the Storage tab of the Spark UI.
This means that every time I run the process, it keeps creating a new cache entry instead of reusing the old one.
The time taken to run the script also remains the same on every run.
Also, when I run the program I always see logs like this:
[Stage 12:===================================================> (9 + 1) / 10]
My understanding was that when we cache RDDs, Spark won't perform the operations again and will fetch the data from the cache.
So I need to understand why Spark doesn't use the cached RDD and instead creates a new cache entry when the process is run again.
Does Spark allow cached RDDs to be used across jobs, or are they only available within the current context?

Cached data only persists for the length of your Spark application. If you run the application again, you will not be able to make use of cached results from previous runs of the application.
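If results need to survive between runs, they have to be written out to durable storage and read back by the next run; caching cannot do that. A minimal sketch in Scala (the bucket and path below are placeholders, and sc is assumed to be an existing SparkContext):
// First run: compute the counts and save them so a later run can reuse them.
val counts = sc.textFile("s3n://your-bucket/*")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.saveAsObjectFile("s3n://your-bucket/word-counts")
// A later, separate application run: load the saved result instead of recomputing it.
val saved = sc.objectFile[(String, Int)]("s3n://your-bucket/word-counts")
println(saved.collectAsMap())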

The logs will show the total number of stages, but if you go to the Spark UI at localhost:4040 you can see that some stages or tasks are skipped because of caching, so it is better to monitor jobs there.

Related

Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and it works fine for 14 days as well. But when I query 15 days, the job keeps running forever (hangs) and the task count does not update.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same query for 14 days, I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema( schema ).json(n)
  df = df.union(tempDF) // union is an assumption here; the original snippet read "df = df(tempDF)", which does not compile
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it now works fine.
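For reference, one way this property can be set from code is sketched below (assuming the sqlSession from the snippet above; the value is the 256 GB from the update expressed in bytes):
sqlSession.sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", "274877906944") // 256 GB in bytes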
Dynamic allocation and maximize resource allocation are two different settings; one is disabled when the other is active. With maximize resource allocation in EMR, one executor per node is launched, and it allocates all the cores and memory to that executor.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now assuming 51 nodes is what you require, try below:
r3.4xlarge has 16 vCPUs - so you can put 15 of them to use, leaving one for the OS and other processes.
Set your number of executors to 150 - this will allocate 3 executors per node.
Set number of cores per executor to 5 (3 executors per node)
Set your executor memory to roughly total host memory/3 = 35G
You need to control the parallelism (default partitions); set this to the total number of cores you have, ~800.
Adjust shuffle partitions - make this twice the number of cores: 1600.
The above configuration has been working like a charm for me (a configuration sketch follows below). You can monitor the resource utilization on the Spark UI.
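A minimal sketch of those settings expressed as Spark configuration (the property names are standard Spark-on-YARN settings, the values come from the list above, and dynamic allocation is switched off as suggested further below; the application name is a placeholder):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-query")                                 // placeholder application name
  .config("spark.dynamicAllocation.enabled", "false")  // hardcode resources instead
  .config("spark.executor.instances", "150")           // ~3 executors per node on 51 nodes
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "35g")              // ~host memory / 3
  .config("spark.default.parallelism", "800")          // ~total cores
  .config("spark.sql.shuffle.partitions", "1600")      // ~2x total cores
  .getOrCreate()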
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to not use dynamic resource allocation: let the job run and see whether it still hangs (note that a Spark job can consume the entire cluster's resources and starve other applications, so try this when no other jobs are running). If it doesn't hang, you should play with the resource allocation: start hardcoding the resources and keep increasing them so you can find the best allocation you can possibly use.
The links below can help you understand resource allocation and how to optimize resources.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Apache Spark DAGScheduler Missing Parents for Stage

When running my iterative program on Apache Spark I occasionally get the message:
INFO scheduler.DAGScheduler: Missing parents for Stage 4443: List(Stage 4441, Stage 4442)
I gather it means it needs to compute the parent RDD, but I am not 100% sure. I don't just get one of these; I end up with hundreds if not thousands of them at a time. It completely slows down my program: an iteration then takes 10-15 minutes to complete (they usually take 4-10 seconds).
I cache the main RDD on each iteration, using StorageLevel.MEMORY_AND_DISK_SER, and the next iteration uses this RDD. The lineage of the RDD therefore gets very large, hence the need for caching. However, if I am caching (and spilling to disk), how can a parent be lost?
I quote Imran Rashid from Cloudera:
It's normal for stages to get skipped if they are shuffle map stages that get read multiple times. E.g., here's a little example program I wrote earlier to demonstrate this: "d3" doesn't need to be re-shuffled since each time it's read with the same partitioner. So skipping stages in this way is a good thing:
val partitioner = new org.apache.spark.HashPartitioner(10)
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x}.partitionBy(partitioner)
(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map{ x => (x % 10) -> x}.partitionBy(partitioner)
  println(idx + " ---> " + otherData.join(d3).count())
}
If you run this and look in the UI, you'd see that all jobs except the first one have one stage that is skipped. You will also see this in the log:
15/06/08 10:52:37 INFO DAGScheduler: Parents of final stage: List(Stage 12, Stage 13)
15/06/08 10:52:37 INFO DAGScheduler: Missing parents: List(Stage 13)
Admittedly that is not very clear, but that is sort of indicating to you that the DAGScheduler first created stage 12 as a necessary step, and then later on changed its mind by realizing that everything it needed for stage 12 already existed, so there was nothing to do.
See the following for the email source:
http://apache-spark-developers-list.1001551.n3.nabble.com/

spark map(func).cache slow

When I use cache to store data, I find that Spark runs very slowly. However, when I don't use cache, the speed is very good. My main configuration is as follows:
SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100
-Dspark.default.parallelism=6"
And my test code is:
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache().reduceByKey(_+_)
count.collect()
Any answers or suggestions on how I can resolve this are greatly appreciated.
cache is useless in the context you are using it. In this situation, cache says: save the result of the map, .map(word => (word, 1)), in memory. If you didn't call it, the reducer could be chained onto the end of the map and the map's results discarded after they are used. cache is better used when multiple transformations or actions will be called on the RDD after it is created. For example, if you create a dataset that you want to join to two different datasets, it is helpful to cache it, because if you don't, the whole RDD will be recalculated for the second join. Here is an easily understandable example from Spark's website.
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR")).cache() //errors is cached to prevent recalculation when the two filters are called
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
What cache does internally is keep the RDD's data in memory or on disk (depending on the storage level) so that its ancestors do not have to be recomputed. An RDD keeps its lineage so it can be recomputed on demand; this is the recovery mechanism for RDDs.
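Applied to the original snippet, a rough sketch of where caching would actually pay off (assuming the reduced counts are reused by more than one action; otherwise caching can be dropped entirely):
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
// No cache on the map output: it is consumed exactly once by reduceByKey.
val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Cache the reduced result only if several actions will reuse it.
counts.cache()
counts.collect()
counts.filter { case (_, n) => n > 100 }.count() // a second action reuses the cached data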

MRJob and mapreduce task partitioning over Hadoop

I am trying to perform a MapReduce job using the Python MRJob library and am having some issues getting it to distribute properly across my Hadoop cluster. I believe I am simply missing a basic principle of MapReduce. My cluster is a small test cluster with one master and one slave. The basic idea is that I'm requesting a series of web pages with parameters, doing some analysis on them, and returning some properties of each page.
The input to my map function is simply a list of URLs with parameters such as the following:
http://guelph.backpage.com/automotive/?layout=bla&keyword=towing
http://guelph.backpage.com/whatever/?p=blah
http://semanticreference.com/search.html?go=Search&q=red
http://copiahcounty.wlbt.com/h/events?ename=drupaleventsxmlapi&s=rrr
http://sweetrococo.livejournal.com/34076.html?mode=ffff
Such that the key-value pairs for the initial input are just key:None, val:URL.
The following is my map function:
def mapper(self, key, url):
    '''Yield domain as the key, and (url, query parameter) tuple as the value'''
    parsed_url = urlparse(url)
    domain = parsed_url.scheme + "://" + parsed_url.netloc + "/"
    if self.myclass.check_if_param(parsed_url):
        parsed_url_query = parsed_url.query
        url_q_dic = parse_qs(parsed_url_query)
        for query_param, query_val in url_q_dic.iteritems():
            # yielding a tuple in mrjob will yield a list
            yield domain, (url, query_param)
Pretty simple: I'm just checking that the URL has a parameter, then yielding the URL's domain as the key and a tuple of the URL and the query parameter as the value, which MRJob kindly transforms into a list to pass to the reducer, which is the following:
def reducer(self, domain, url_query_params):
    final_list = []
    for url_query_param in url_query_params:
        url_to_list_props = url_query_param[0]
        param_to_list_props = url_query_param[1]
        # set our target that we will request and do some analysis on
        self.myclass.set_target(url_to_list_props, param_to_list_props)
        # perform a bunch of requests and do analysis on the URL requested
        props_list = self.myclass.get_props()
        for prop in props_list:
            final_list.append(prop)
    # index this stuff to a central db
    MapReduceIndexer(domain, final_list).add_prop_info()
    yield domain, final_list
My problem is that only one reducer task is run. I would expect the number of reducer tasks to be equal to the number of unique keys emitted by the mapper. The end result with the above code is that I have one reducer which runs on the master, while the slave sits idle and does nothing, which is obviously not ideal. I notice in my output that a few mapper tasks are started, but always only one reducer task. Other than that, the job runs smoothly and everything works as expected.
My question is... what the heck am I doing wrong? Am I misunderstanding the reduce step or screwing up my key-value pairs somewhere? Why are there not multiple reducers running on this job?
Update: OK, so from the answer given I increased mapred.reduce.tasks to a higher value (it was at the default, which I now realize is 1). This was indeed why I was getting one reducer. I now see 3 reduce tasks being performed simultaneously. I now have an import error on my slave that needs to be resolved, but at least I am getting somewhere...
The number of reducers is totally unrelated to the form of your input data; it is set by the Hadoop job configuration (mapred.reduce.tasks, which defaults to 1). For MRJob it looks like you need to pass this through its configuration options.
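A hedged sketch of how that might look on the command line, assuming the Hadoop runner (the script and input file names here are placeholders; --jobconf is mrjob's flag for forwarding Hadoop properties):
python my_mrjob_script.py -r hadoop --jobconf mapred.reduce.tasks=4 urls.txt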

Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'

Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There is a series of MapReduce jobs that I perform on Nutch segments to get the final output. But waiting for the whole crawl to finish before running MapReduce makes the solution take much longer, so I now trigger the MapReduce jobs on segments as soon as they are dumped. I run the crawl in a loop ('N = depth' times) by giving depth=1. I am losing some URLs when I crawl with depth 1 in a loop N times versus a single crawl with depth N.
Please find below pseudo code:
Case 1: Nutch crawl on Hadoop giving depth=3.
// Create the list object to store arguments which we are going to pass to NUTCH
List nutchArgsList = new ArrayList();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(3));
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));
Case 2: Crawling in loop 3 times with depth='1'
for(int depthRun=0;depthRun< 3;depthRun++)
{
// Create the list object to store arguments which we are going to pass to NUTCH
List nutchArgsList = new ArrayList();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(1)); //NOTE i have given depth as 1 here
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));
}
I am losing some URLs (db_unfetched) when I crawl in a loop as many times as the depth.
I have tried this on stand-alone Nutch, running with depth 3 versus running 3 times over the same URLs with depth 1. Comparing the crawldbs, the difference is only 12 URLs. But when I do the same on Hadoop using ToolRunner, I get 1000 URLs as db_unfetched.
As far as I understand, Nutch triggers the crawl in a loop as many times as the depth value. Please suggest.
Also, please let me know why the difference is so large when I do this on Hadoop using ToolRunner versus on stand-alone Nutch.
I have found that the behavior of Nutch fetching changes between running stand-alone (straight to the local hard disk) and integrated with a Hadoop cluster. The Generator score filtering appears to be much more aggressive with a Hadoop cluster, so the "-topN" setting needs to be adequately high.
I would suggest running your crawl with a high (at least 1000) "-topN" and not the default value of 5.
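For the stand-alone case, a rough sketch of that on the command line (the urls and crawl directory names are placeholders):
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
In the ToolRunner case from the question, the same "-topN" argument and value would be appended to nutchArgsList alongside "-depth".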
This is similar to my response here.
After doing this, I found that my Nutch crawls on stand-alone and HDFS started to line up better.
