Spark: write the file inside the worker process - caching

I have a Spark job that generates a set of results with statistics. The number of work items is larger than the slave count, so each slave processes more than one work item.
I cache the results after generating the RDD objects so that I can reuse them, since I have multiple write operations: one for the result objects and another for the statistics. Both write operations use saveAsHadoopFile.
Without caching, Spark reruns the job for each write operation, and that takes a long time and redoes the same execution twice (more if I had more writes).
With caching I am hitting the memory limit. Some previously calculated results are lost during caching and I am seeing "CacheManager:58 - Partition rdd_1_0 not found, computing it" messages. Spark eventually goes into an endless loop as it tries to cache more results while losing others.
I am aware of the fact that Spark has different storage levels for caching, and using memory + disk would solve our problem. But I am wondering whether we can write the files directly in the worker, without generating RDD objects for the output. I am not sure if that is possible though. Is it?
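For reference, the memory + disk storage level mentioned above would look roughly like this (a minimal sketch; the RDD name is an illustrative assumption, not code from the question):
import org.apache.spark.storage.StorageLevel
// MEMORY_AND_DISK: partitions that don't fit in memory are spilled to local disk
// instead of being dropped and recomputed later.
resultsRdd.persist(StorageLevel.MEMORY_AND_DISK)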

It turns out that writing files inside a Spark worker process is no different from writing a file in any Java process. The write operation just requires code to serialize and save the files to HDFS; this question has several answers on how to do that.
saveAsHadoopFile is simply a convenient way of doing it.
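A minimal sketch of that idea, assuming an existing resultsRdd and an illustrative output path (neither is from the original question):
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Each partition is written straight to HDFS from inside the worker,
// with no extra RDD materialised for the output.
resultsRdd.foreachPartition { records =>
  // This closure runs on the executor, so plain Java/Hadoop I/O is available here.
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"/tmp/results/part-${UUID.randomUUID()}"))
  try {
    records.foreach(r => out.writeBytes(r.toString + "\n"))
  } finally {
    out.close()
  }
}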

Related

Is it possible to save intermediate Spark DataFrames to disk without significantly affecting performance?

Let's suppose I have the following data pipeline:
Companies and Shuttles are my input datasets read from disk using Spark, the preprocess_* functions perform some heavy data cleaning operations, and create_master_table_node is another function that takes 3 Spark DataFrames as input.
Generally, my understanding is that it's most performant to not write any of the intermediate Spark DataFrames (i.e. Preprocessed Companies, Preprocessed Shuttles) to disk. https://spokeo-engineering.medium.com/whats-the-fastest-way-to-store-intermediate-results-in-spark-54f2080defb6 suggests the same.
However, there are also benefits to persisting intermediate data to disk, such as being able to resume failed workflows or analyze intermediate results post hoc. These benefits are important to me, so I want the intermediate outputs persisted at some point in my workflow, be it just after they're computed or once the entire pipeline finishes or errors out. What's the least performance-impacting way to save Preprocessed Companies, Preprocessed Shuttles, etc. to disk? Is there a way to do this without affecting the rest of the computational graph?
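One option (a sketch under assumed names, not an answer from the original thread): write each intermediate DataFrame to Parquet right after it is computed and have the downstream steps read the on-disk copy, so the intermediate both survives failures and is available for post-hoc analysis. preprocessCompanies, companiesRaw, spark and the S3 path are all illustrative assumptions:
// Persist an intermediate DataFrame and continue the pipeline from the on-disk copy.
val preprocessedCompanies = preprocessCompanies(companiesRaw)
preprocessedCompanies.write.mode("overwrite").parquet("s3://my-bucket/intermediate/preprocessed_companies")

// Downstream stages read the persisted copy instead of recomputing the cleaning step.
val preprocessedCompaniesOnDisk = spark.read.parquet("s3://my-bucket/intermediate/preprocessed_companies")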

Does Apache Spark read and process at the same time, or does it first read the entire file into memory and then start transformations?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (i.e. applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this regard? I have read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try and cache the data just mapped in memory in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does lazy evaluation, its memory use is very different from Hadoop's, as a result of planning computation and caching simultaneously.
In the steps of a word-count job, Hadoop does this:
Map all the words to 1.
Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
Sort the rows of (word, 1) in that shared file (this is the sorting phase)
Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
Spark on the other hand will go the other way around:
It figures that like in Hadoop it is probably most efficient to have all those words summed up via separate Reducer runs, so it decides according to some factors that it wants to split the job into x parts and then merge them into the final result.
So it knows that words will have to be sorted which will require at least part of them in memory at a given time.
After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through the steps backwards: 3, then 2, then 1.
Now the trick relative to Hadoop is that in step 3 it knows which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly key-value pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently by caching data it knows will be needed in later stages of the job. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead into the following stages, simply cannot use memory this efficiently, and hence doesn't spend resources keeping in memory the large chunks that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs from a map phase will be needed in the next step.
The fact that it appears that Spark is keeping the whole dataset in memory hence isn't something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may actually be able to keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would have planned ahead and can immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too. (I acknowledge there are a million ways to also make Hadoop do this, but it's not out of the box; there are also ways of writing your Spark job in unfortunate ways that break these optimisations, but it's not so easy to fool Spark here :)).
Hope this helps understand that the memory use is just a natural consequence of the way Spark works, but not something actively aimed at and also not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this I recommend learning about the DAG scheduler in Spark to see how this is actually done in code.
You'll see that it always follows the pattern of working out where data is, and will be, cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary. It doesn't read all the data into memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
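A minimal word-count sketch (paths are illustrative) showing the laziness and chaining described above: nothing is read until the final action runs, and the data then streams through the chained transformations partition by partition:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs:///input/text")      // lazy: no data is read yet
  .flatMap(_.split("\\s+"))                         // lazy transformation
  .map(word => (word, 1))                           // lazy transformation
  .reduceByKey(_ + _)                               // shuffle boundary, still lazy

counts.saveAsTextFile("hdfs:///output/wordcount")   // action: triggers the whole DAG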

Performance optimization of DataFrame based application

I'm writing an application which produces several files and stores them back to S3.
Most of the transformations operate on DataFrames. The current state of the application is already somewhat complex being translated into 60 jobs, some of them mapped to hundreds of stages. Some of the DataFrames are reused along the way and those are cached.
The problem is performance, which is clearly impacted by dependencies.
I have a few questions, any input on any of them will be highly appreciated.
(1) When I split the application into parts and execute them individually, reading the inputs from the files generated by the previous parts, the total execution time is a fraction (15%) of the execution time of the application run as a whole. This is counterintuitive, as the whole application reuses DataFrames already in memory, caching guarantees that no DataFrame is computed more than once, and various jobs execute in parallel wherever possible.
I also noticed that the total number of stages in the latter case is much higher than in the first one, and I would think they should be comparable. Is there an explanation for this?
(2) If the approach of executing parts of the application individually is the way to go, then how do I enforce the dependencies between the parts to make sure the necessary inputs are ready?
(3) I read a few books, which devote some chapters to the execution model and performance analysis through the Spark Web UI. All of them use RDDs and I need DataFrames. Obviously even for DataFrame based applications Spark Web UI provides a lot of useful information but the analysis is much harder. Is there a good resource I could use to help me out?
(4) Is there a good example demonstrating how to minimize shuffling by appropriate partitioning of the DataFrame? My attempts so far have been ineffective.
Thanks.
1. Splitting the application is not recommended. If you have many stages and are having performance issues, then try checkpointing, which saves an RDD to a reliable storage system (e.g. HDFS, S3) while forgetting the RDD's lineage completely.
// set the checkpoint directory in the program (sc is your SparkContext)
sc.setCheckpointDir("/checkpoint/dir")  // any reliable location, e.g. an HDFS directory
// checkpoint the RDD
rdd.checkpoint()
If you are using DataFrames, then manually checkpoint the data at logical points by introducing Parquet/ORC hops (writing the data to and reading it back from Parquet/ORC files):
//Write to ORC
dataframe.write.format("orc").save("/tmp/src-dir/tempFile.orc")
//where /tmp/src-dir/ is an HDFS directory
//Read ORC
val orcRead = sqlContext.read.format("orc").load("/tmp/src-dir/tempFile.orc")
2. Splitting the program is not recommended, but if you still want to do it, then create separate ".scala" programs and manage the dependencies between them at the Oozie level.
3. In the Spark web UI, refer to the SQL tab, which will give you the execution plan. For a detailed study of DataFrames, run
DF.explain() //which will show you the execution plan
DataFrames in Spark have their execution automatically optimized by a query optimizer. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution. Because the optimizer understands the semantics of operations and structure of the data, it can make intelligent decisions to speed up computation.
Refer Spark guide - http://spark.apache.org/docs/latest/sql-programming-guide.html
4. Sort the data before operations such as join. To reduce shuffling, use the repartition function.
val DF1 = DF.repartition(10)
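A hedged sketch (written against the newer DataFrame API, with made-up data and column names) of repartitioning both sides of a join by the join key so that matching rows land in the same partitions, which can reduce shuffling during the join:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
import spark.implicits._

val orders = Seq((1, 10.0), (2, 20.0), (1, 5.0)).toDF("customer_id", "amount")
val customers = Seq((1, "alice"), (2, "bob")).toDF("customer_id", "name")

// Same partition count and same partitioning column on both sides of the join.
val ordersByKey = orders.repartition(8, col("customer_id"))
val customersByKey = customers.repartition(8, col("customer_id"))

val joined = ordersByKey.join(customersByKey, Seq("customer_id"))
joined.explain()  // inspect the physical plan, as suggested in point 3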
Please post your code if you have any other specific doubt.

Hadoop speculative execution testing

I am working on Hadoop 1.1.2 for my master's thesis.
I am studying a new algorithm for speculative tasks, and so in this first step I am trying to apply some changes to the code.
Sadly, even when using 2 nodes, I cannot trigger speculative execution. I added some logging lines to the class DefaultTaskSelector (this is the class for speculative tasks), but this class, after its initialization, is never called by the FairScheduler class.
I activated the "speculative" option in the config file too (mapred-site...xml), but nothing.
So the question is: how can I cause/force speculative execution?
Regards
Speculative execution typically happens when there are multiple mappers running and one or more of them lag behind the others. A good way to get it to happen:
set up hive
set up a partitioned table
make sure the data is big enough to cause many mappers to run. This means: at least a few dozen HDFS blocks worth of data
enter data into the partitions: give one of the partitions highly skewed data, much more than the other partitions.
run a select * from the table
Now you may see speculative execution run.
If not, feel free to get back here. I can provide further suggestions (e.g. making some moderately complicated queries that would likely induce SE)
EDIT
Hive may be a bit of a stretch for you, but you can apply the "spirit" of the strategy to regular HDFS files as well: write a map/reduce program with a custom partitioner that is intentionally skewed, i.e. it causes a single task to do an outsized proportion of the work.
Remember to have some tens of HDFS blocks (at least) to give the task trackers a decent amount of work to chew on.
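One way to act on that suggestion (a hedged sketch against the new MapReduce API; the class name and key/value types are illustrative): a partitioner that funnels most keys to a single reduce task, so that task does an outsized share of the work and becomes a candidate for speculation:
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Partitioner

// Intentionally skewed partitioner: roughly 90% of keys go to partition 0.
class SkewedPartitioner extends Partitioner[Text, Text] {
  override def getPartition(key: Text, value: Text, numPartitions: Int): Int = {
    val h = key.hashCode() & Integer.MAX_VALUE
    if (numPartitions == 1 || h % 10 != 0) 0
    else 1 + (h % (numPartitions - 1))
  }
}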
You should be able to cause speculative execution using the two methods setMapSpeculativeExecution(boolean) and setReduceSpeculativeExecution(boolean), which you can call on Job, the MapReduce job configuration.
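For reference, a minimal sketch (the job name is illustrative) of enabling both of these on a Job:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Enable speculative execution for both the map and the reduce phase of the job.
val conf = new Configuration()
val job = new Job(conf, "speculative-execution-test")
job.setMapSpeculativeExecution(true)
job.setReduceSpeculativeExecution(true)
// ...set mapper/reducer classes and input/output paths, then submit the job.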

what does " local caching of data" mean in the context of this article?

From the following paragraphs of text (http://developer.yahoo.com/hadoop/tutorial/module2.html), it is mentioned that sequentially readable large files are not suitable for local caching, but I don't understand what "local" means here...
There are two possibilities in my opinion: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory for clients to access quickly. Is there anyone who can explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.
Any real MapReduce job is probably going to process GBs (10s/100s/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (the typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file/block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, especially since multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few of them being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.

Resources