Spark RDD Lineage and Storage

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()
warningCount = warningsRDD.count()
In the code above, none of the transformations are evaluated until the second-to-last line of code is executed, where you count the number of objects in badLinesRDD. So when badLinesRDD.count() is run, it will compute the first four RDDs up to the union and return the result. But when warningsRDD.count() is run, it will only compute the transformations in the first three lines and return a result, correct?
Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last transformation, the union, stored? Do they get stored in memory on each of the DataNodes where the filter transformations were run in parallel?

Unless task output is persisted explicitly (for example with cache or persist) or implicitly (as shuffle files) and there is enough free space, every action will execute the complete lineage.
So when you call warningsRDD.count() it will load the file (sc.textFile("log.txt")) and apply the filter (inputRDD.filter(lambda x: "warning" in x)).
Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last transformation, the union, stored?
Assuming the data is not persisted: nowhere. All task outputs are discarded after the data is passed to the next stage or written out. If the data is persisted, it depends on the storage settings (disk, on-heap memory, off-heap memory, DFS).
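If an RDD will be reused by more than one action, persisting it explicitly avoids re-running the lineage. Below is a minimal PySpark sketch of the example above; the app name and the storage level are arbitrary choices, not something prescribed by Spark.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lineage-demo")

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)

# Persist the RDD that more than one action will touch, so the file is
# read and filtered only once instead of once per action.
warningsRDD.persist(StorageLevel.MEMORY_ONLY)
badLinesRDD = errorsRDD.union(warningsRDD)

badLinesCount = badLinesRDD.count()   # runs the full lineage and fills the cache for warningsRDD
warningCount = warningsRDD.count()    # served from the persisted partitions, no re-read of log.txt

Whether the cached partitions end up on the executor heap, off-heap, or on disk is controlled by the storage level passed to persist.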

Related

Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService

I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 executes the first line and gets null, but then node 2 performs both the get and the putIfAbsent. Now when node 1 calls putIfAbsent it gets false, because node 2 just populated the cache, so node 1 returns the null value from the original get... both of these look like non-duplicates to DetectDuplicate.
In the DistributedMapCacheServer, it locks the entire cache per operation so it can provide an atomic getAndPutIfAbsent.
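Here is a conceptual sketch in plain Python (this is not the NiFi or HBase API, just an illustration) of why two separate round trips can race while a single locked operation cannot:

import threading

cache = {}
lock = threading.Lock()

def put_if_absent(key, value):
    # One atomic server-side operation: only one caller can ever "win".
    with lock:
        if key in cache:
            return False
        cache[key] = value
        return True

def get_and_put_if_absent_racy(key, value):
    # Mirrors the HBase client shim: two independent calls. Another node can
    # slip in between them, so two callers can both end up returning None
    # and both look like non-duplicates.
    got = cache.get(key)                      # round trip 1
    was_absent = put_if_absent(key, value)    # round trip 2
    return got if not was_absent else None

def get_and_put_if_absent_atomic(key, value):
    # Mirrors the DistributedMapCacheServer: the whole check-and-set happens
    # under one lock, so exactly one caller ever observes "absent".
    with lock:
        existing = cache.get(key)
        if existing is None:
            cache[key] = value
        return existing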

Minimizing shuffle when mapper output is mostly sorted

I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning=true. However, that will not directly help in your case because it requires that the keys not be changed.
/**
 * Return a new RDD by applying a function to each partition of this RDD.
 *
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step: first construct a dedicated partition for all migrating keys, have your mapper output a special key value for keys that need to migrate, and then postprocess the results to append them to the standard partitions (a rough sketch follows below). That is extra hassle, so you would need to weigh the cost of an extra step and the added pipeline complexity against the shuffle savings.
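A rough PySpark sketch of that idea, purely illustrative (the key format, the toy transform, the range partitioner, and the paths are all assumptions): the mapper tags keys that left their original range, a custom partition function routes only those keys to one extra partition, and a cheap follow-up step appends them to the standard partitions.

from pyspark import SparkContext

sc = SparkContext(appName="mostly-sorted-shuffle")

NUM_RANGE_PARTITIONS = 8
MIGRATING_PARTITION = NUM_RANGE_PARTITIONS   # one extra partition for strays

def transform(line):
    # Toy stand-in for the real mapper: rewrite the key and flag it as
    # "migrating" if it no longer belongs to its original key range.
    old_key, _, value = line.partition(" ")
    new_key = old_key                 # real logic would rewrite the key here
    migrating = False                 # real logic would compare against range bounds
    return ((new_key, migrating), value)

def route(key):
    # Keys flagged as migrating go to the extra partition; everything else
    # stays in the range partition that matches the original sort order.
    k, migrating = key
    if migrating:
        return MIGRATING_PARTITION
    bucket = (ord(k[0].lower()) - ord("a")) * NUM_RANGE_PARTITIONS // 26
    return max(0, min(NUM_RANGE_PARTITIONS - 1, bucket))

partitioned = (sc.textFile("hdfs:///input/sorted")        # illustrative path
                 .map(transform)
                 .partitionBy(NUM_RANGE_PARTITIONS + 1, route))

# Post-process: the small migrating partition is appended to the standard
# partitions afterwards (e.g. a second, much cheaper job or a local merge).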

spark map(func).cache slow

When I use cache to store data, I found that Spark runs very slowly. However, when I don't use the cache method, the speed is very good. My main configuration is as follows:
SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100
-Dspark.default.parallelism=6"
And my test code is:
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache().reduceByKey(_+_)
count.collect()
Any answers or suggestions on how I can resolve this are greatly appreciated.
cache is useless in the context you are using it in. In this situation cache is saying: save the result of the map, .map(word => (word, 1)), in memory. Whereas if you didn't call it, the reducer could be chained to the end of the map and the map's results discarded after they are used. cache is better used in a situation where multiple transformations/actions will be called on the RDD after it is created. For example, if you create a data set you want to join to two different datasets, it is helpful to cache it, because if you don't, the whole RDD will be recalculated for the second join. Here is an easily understandable example from Spark's website.
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR")).cache() //errors is cached to prevent recalculation when the two filters are called
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
What cache does internally is keep the RDD in memory or save it to disk (depending on the storage level) so that its ancestors do not have to be recomputed. The reason an RDD must remember its ancestors is so it can be recalculated on demand; this is the recovery mechanism of RDDs.
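As a hedged illustration of the join scenario mentioned above (the dataset names and paths are made up), here is a PySpark sketch where caching the shared RDD prevents it from being rebuilt for the second join:

from pyspark import SparkContext

sc = SparkContext(appName="cache-for-reuse")

# An RDD that is expensive to build and is reused by two separate joins.
users = (sc.textFile("hdfs:///data/users.csv")            # illustrative path
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], f[1]))                    # (user_id, name)
           .cache())                                       # reused twice below

orders = sc.textFile("hdfs:///data/orders.csv") \
           .map(lambda line: (line.split(",")[0], line))   # (user_id, order line)

clicks = sc.textFile("hdfs:///data/clicks.csv") \
           .map(lambda line: (line.split(",")[0], line))   # (user_id, click line)

# The first action materializes `users` and keeps its partitions in memory...
orders_with_names = users.join(orders).count()
# ...so the second join reads the cached partitions instead of re-parsing users.csv.
clicks_with_names = users.join(clicks).count()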

How to pass initial value to a reduce phase in riak?

I am trying to write a Riak map reduce using riak-ruby-client. The JavaScript reduce function looks like this:
arr.reduce(callback,[initialValue]);
I am doing something like this:
map_reduce = Riak::MapReduce.new(Ripple.client)
map_reduce.add(bucket) # I have passed a valid bucket
callback = "function(previous, current){return previous + current;}"
results = map_reduce.map(map_func).reduce(callback, 1, :keep=>true).run # 1 is the initial value, as in the JavaScript reduce function
But Riak does not treat 1 as the initial value here. Can someone tell me how to pass an initial value to the reduce phase?
A reduce phase function takes two arguments. The first one is a list of inputs, containing the output from previous reduce phase iteration(s) as well as a batch of output from the preceding map/input phase. The second argument is a configuration parameter that is passed in every time the reduce phase function executes. This is described in greater detail in the Riak MapReduce documentation.
As reduce phase functions need to be commutative, associative, and idempotent, it is not possible to reliably identify which is the first iteration and it is therefore not possible to set an initial value.
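To see why, here is a conceptual sketch in plain Python (not the Riak API) of how a reduce phase is re-applied to its own earlier output, which is what makes a per-invocation initial value unsafe:

# Riak may run the reduce function several times; each call sees a mix of
# fresh map output and results from earlier reduce calls ("re-reduce").
def reduce_phase(inputs):
    return [sum(inputs)]

batch1 = reduce_phase([1, 2, 3])           # -> [6]
batch2 = reduce_phase(batch1 + [4, 5])     # -> [15], previous result fed back in

# If an initial value of 1 were injected on every invocation, it would be
# added once per re-reduce rather than once overall, so the final result
# would depend on how many times the phase happened to run.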

On what basis mapreduce framework decides whether to launch a combiner or not

As per definition "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer."
I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.
Simply put: the number of spills to disk. Sorting happens once the MapOutputBuffer fills up, and the combining takes place at the same time.
You can tune the number of spills to disk with the parameters io.sort.mb, io.sort.spill.percent, io.sort.record.percent - those are also explained in the documentation (books and online resources).
Example for specific numbers of combiner runs:
0 -> no combiner was defined
1 -> a combiner was defined and the MapOutputBuffer filled up once
>1 -> a combiner was defined and the MapOutputBuffer filled up more than once
Note that even if the MapOutputBuffer never fills up completely, this buffer must be flushed at the end of the map stage and thus triggers the combiner to run at least once (if defined).
First of all, Thomas Jungblut's answer is great and it got my upvote. The only thing I want to add is that the Combiner will always be run at least once per Mapper if it is defined, unless the mapper output is empty or is a single pair. So it is possible for the combiner not to be executed for a mapper, but highly unlikely.
Source code that contains the logic for invoking the combiner based on this condition, lines 1950-1955:
https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java
if (combinerRunner == null || numSpills < minSpillsForCombine) {
  Merger.writeFile(kvIter, writer, reporter, job);
} else {
  combineCollector.setWriter(writer);
  combinerRunner.combine(kvIter, combineCollector);
}
So the combiner runs during the merge of spill files only if:
a combiner is defined, and
the number of spills is at least minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.
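For reference, here is a small sketch of defining a combiner in Python using the third-party mrjob library (an assumption for illustration, not something from the question). The point it demonstrates is that the framework decides how often the combiner runs, so the job must be correct regardless of the invocation count.

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # May be invoked 0, 1, or many times per key, at the framework's
        # discretion, so its output must be consumable by the reducer
        # (and by the combiner itself) again.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()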
