I have the following Spark job, where some RDDs show a cached fraction of more than 100%. How is this possible? What did I miss? Thanks!
I believe this is because you can have the same partition cached in multiple locations. See SPARK-4049 for more details.
EDIT:
I'm wondering if maybe you have speculative execution (see spark.speculation) set? If you have straggling tasks, they will be relaunched, which I believe will duplicate a partition. Also, another useful thing to do might be to call rdd.toDebugString, which will provide lots of info on an RDD including its transformation history and number of cached partitions.
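For reference, a minimal sketch (Java API, fragment style; the input path is a placeholder) of checking the speculation setting and inspecting an RDD's lineage and cached partitions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

SparkConf conf = new SparkConf().setAppName("cache-inspection");
JavaSparkContext sc = new JavaSparkContext(conf);

// spark.speculation defaults to false; if it is true, stragglers get relaunched.
System.out.println(sc.getConf().get("spark.speculation", "false"));

// Cache an RDD, force it to materialise, then inspect it.
JavaRDD<String> rdd = sc.textFile("hdfs:///some/input").persist(StorageLevel.MEMORY_ONLY());
rdd.count();
System.out.println(rdd.toDebugString()); // lineage plus cached-partition info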
Related
For example, I cached a number of RDDs in memory.
Then I leave the application for a few days or more.
And then I try to access cached RDDs.
Will they still be in memory?
Or will Spark clean unused cached RDDs after some period of time?
Please help!
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
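As a small illustration (Java API; sc is an existing JavaSparkContext and the path is a placeholder), caching explicitly and releasing it when done:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Cache the RDD; the first action fills the cache on the executors.
JavaRDD<String> events = sc.textFile("hdfs:///data/events").persist(StorageLevel.MEMORY_ONLY());
events.count();

// ... reuse `events` across later jobs while the application keeps running ...

// Release the cached blocks instead of waiting for LRU eviction.
events.unpersist();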
I'm getting the same error as in Missing an output location for shuffle when joining big dataframes in Spark SQL. The recommendation there is to use MEMORY_AND_DISK and/or to set spark.shuffle.memoryFraction to 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0, and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or Dataframe, right? Also I'm getting lots of other WARN logs and task retries that lead me to think that the job is not stable.
Therefore, my question is:
What are best practices to join huge dataframes in Spark SQL >= 1.6.0?
More specific questions are:
How to tune number of executors and spark.sql.shuffle.partitions to achieve better stability/performance?
How to find the right balance between the level of parallelism (number of executors/cores) and the number of partitions? I've found that increasing the number of executors is not always the solution, as it may generate I/O read timeout exceptions because of network traffic.
Is there any other relevant parameter to be tuned for this purpose?
My understanding is that joining data stored as ORC or Parquet offers better performance than text or Avro for join operations. Is there a significant difference between Parquet and ORC?
Is there an advantage of SQLContext vs HiveContext regarding stability/performance for join operations?
Is there a difference regarding performance/stability when the dataframes involved in the join have previously been registered with registerTempTable() or saved with saveAsTable()?
So far I'm using this answer and this chapter as a starting point, and there are a few more Stack Overflow pages related to this subject. Yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
Those are a lot of questions. Allow me to answer them one by one:
Your number of executors is variable most of the time in a production environment; it depends on the available resources. The number of partitions is important when you are performing shuffles. Assuming that your data is not skewed, you can lower the load per task by increasing the number of partitions.
A task should ideally take a couple of minutes. If a task takes too long, it is possible that your container gets pre-empted and the work is lost; if a task takes only a few milliseconds, the overhead of starting the task becomes dominant.
For the level of parallelism and tuning your executor sizes, I would like to refer you to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
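As a rough illustration of where these knobs live (Spark 1.6 Java API; the values below are arbitrary placeholders, not recommendations):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

// Executor sizing is normally passed on spark-submit, but can also be set here.
SparkConf conf = new SparkConf()
    .setAppName("big-join")
    .set("spark.executor.memory", "8g")
    .set("spark.executor.cores", "4");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// Default is 200 shuffle partitions; raise it so each shuffle task stays small.
sqlContext.setConf("spark.sql.shuffle.partitions", "800");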
ORC and Parquet only encode the data at rest; when doing the actual join, the data is in Spark's in-memory format. Parquet has been gaining popularity since Netflix and Facebook adopted it and put a lot of effort into it. Parquet lets you store the data more efficiently and has some optimisations (such as predicate pushdown) that Spark uses.
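For instance (Spark 1.6 API; sqlContext is assumed, and paths/columns are placeholders), a simple filter on a Parquet source can be pushed down to the scan so that less data even reaches the join:

import org.apache.spark.sql.DataFrame;

// Read from Parquet; simple comparison filters are candidates for predicate pushdown.
DataFrame orders = sqlContext.read().parquet("hdfs:///warehouse/orders");
DataFrame recent = orders.filter("order_date >= '2016-01-01'");
recent.registerTempTable("recent_orders"); // use this in the join afterwards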
You should use the SQLContext (or, from Spark 2.0 on, the SparkSession) instead of the HiveContext, since the HiveContext is deprecated in newer versions. The SQLContext covers the general case and doesn't depend on a Hive installation.
When calling registerTempTable, no data is materialised; only the DataFrame's execution plan is registered under a name in the session, and it gets evaluated when an action is performed (for example saveAsTable). So it doesn't affect the execution of the join. When performing a saveAsTable, the data does get written to the distributed file system.
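To make the contrast concrete (Spark 1.6 API; customers and orders are placeholder DataFrames, and saveAsTable assumes Hive support is available):

import org.apache.spark.sql.DataFrame;

// registerTempTable only records the plan under a name; nothing is written yet.
customers.registerTempTable("customers");
orders.registerTempTable("orders");
DataFrame joined = sqlContext.sql(
    "SELECT c.id, o.total FROM customers c JOIN orders o ON c.id = o.customer_id");

// saveAsTable is an action: the join actually runs here and the result is
// written out and registered as a table.
joined.write().saveAsTable("customer_order_totals");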
Hope this helps. I would also suggest watching our talk at the Spark Summit about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. This might give you some insights.
Cheers, Fokko
I have a standard configured HDP 2.2 environment with Hive, HBase and YARN.
I've used Hive (with HBase) to perform a simple count operation on a table that has about 10 million rows, and it resulted in about 10 GB of memory consumption in YARN.
How can I reduce this memory consumption? Why does it need so much memory just to count rows?
A simple count operation involves a MapReduce job behind the scenes, and in your case that job has to process about 10 million rows. Look here for a better explanation. That only covers what happens in the background and the execution time, not your question about memory requirements, but at least it gives you a heads-up on where to look. It also has a few suggestions for speeding things up. Happy coding.
I am working on Hadoop 1.1.2 for my master's thesis.
I am studying a new algorithm for speculative tasks, so as a first step I am trying to apply some changes to the code.
Sadly, even using 2 nodes, I cannot trigger speculative execution. I added some log statements to the DefaultTaskSelector class (the class responsible for speculative tasks), but after initialization this class is never called by the FairScheduler class.
I also activated the speculative execution options in the config file (mapred-site.xml), but nothing happens.
So the question is: how can I cause/force speculative execution?
Regards
Speculative execution typically happens when there are multiple mappers running and one or more of them lag the others. A good way to get it to happen:
set up hive
set up a partitioned table
make sure the data is big enough to cause many mappers to run. This means: at least a few dozen HDFS blocks worth of data
load data into the partitions: make one of the partitions highly skewed, with much more data than the others.
run a select * from the table
Now you may see speculative execution run.
If not, feel free to get back here. I can provide further suggestions (e.g. making some moderately complicated queries that would likely induce SE)
EDIT
Hive may be a bit of a stretch for you, but you can apply the "spirit" of the strategy to regular HDFS files as well. Write a map/reduce program with a custom partitioner that is intentionally skewed, i.e. one that causes a single reducer to do an outsized proportion of the work (a sketch follows below).
Remember to have some tens of HDFS blocks (at least) to give the task trackers a decent amount of work to chew on.
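A minimal sketch of such a deliberately skewed partitioner (class name and key/value types are made up for illustration); set it on the job with job.setPartitionerClass(SkewedPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes roughly 90% of the keys to reducer 0 so that one reduce task lags
// far behind the others and becomes a candidate for speculative execution.
public class SkewedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;
        }
        int bucket = Math.abs(key.hashCode() % 10);
        if (bucket < 9) {
            return 0; // the overloaded partition
        }
        // Spread the remaining ~10% of keys over the other reducers.
        return 1 + Math.abs(key.hashCode() % (numPartitions - 1));
    }
}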
You should be able to cause speculative execution using the two methods setMapSpeculativeExecution(boolean) and setReduceSpeculativeExecution(boolean), which you can call on Job, the MapReduce job configuration.
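For example (a sketch using the new mapreduce API; the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "speculative-demo"); // Job.getInstance(conf, ...) on Hadoop 2.x

// Explicitly allow duplicate attempts of slow map and reduce tasks.
job.setMapSpeculativeExecution(true);
job.setReduceSpeculativeExecution(true);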
HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job spawns a few hundred reducer tasks. This can get very slow on a small cluster.
Is there any workaround possible by using MultipleOutputFormat or something else?
Thanks
Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online, and you can also determine when a region has been load-balanced to another server. I wouldn't be so quick to go to a coarser granularity.
Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.
It's very easy to get network bottlenecked. Make sure you're compressing your HFile & your intermediate MR data.
// Compress the intermediate map output to cut shuffle traffic.
job.getConfiguration().setBoolean("mapred.compress.map.output", true);
job.getConfiguration().setClass("mapred.map.output.compression.codec",
    org.apache.hadoop.io.compress.GzipCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);
// Compress the generated HFiles as well.
job.getConfiguration().set("hfile.compression",
    Compression.Algorithm.LZO.getName());
Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable Put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).
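A rough sketch of what that wiring could look like (MyImportMapper is a hypothetical mapper that emits ImmutableBytesWritable/Put pairs, and the table name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "put-based-import");
job.setJarByClass(MyImportMapper.class);
job.setMapperClass(MyImportMapper.class); // emits (ImmutableBytesWritable, Put)

// Wires up TableOutputFormat for the target table; passing null for the reducer
// and using zero reduce tasks makes this a map-only job that writes Puts directly.
TableMapReduceUtil.initTableReducerJob("my_table", null, job);
job.setNumReduceTasks(0);

job.waitForCompletion(true);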
When we use HFileOutputFormat, it overrides the number of reducers, whatever you set.
The number of reducers is equal to the number of regions in that HBase table.
So decrease the number of regions if you want to control the number of reducers.
You will find sample code here:
Hope this will be useful :)