What happens internally when we join two DStreams grouped by key? - spark-streaming

I'm new to Spark (Spark Streaming in Python) and, if I have understood correctly, a DStream is a sequence of RDDs.
Imagine that we have in our code:
ssc = StreamingContext(sc, 5)
So every 5 seconds a new RDD is generated as part of the DStream.
Imagine I have two DStreams, DS1 and DS2 (each with 5-second batches). My code is:
DGS1 = DS1.groupByKey()
DGS2 = DS2.groupByKey()
FinalStream = DGS1.join(DGS2)
What happens internally, at the RDD level, when I call groupByKey and join?
Thank you!

When you use groupByKey and join, you're causing a shuffle. To illustrate:
Assume you have a stream of incoming RDDs (called a DStream) whose elements are tuples of (String, Int). What you want is to group them by key (which is a word in this example). But all the keys aren't locally available in the same executor; they are potentially spread across many workers which have previously done work on the RDD in question.
What Spark has to do now is say "Hey guys, I need all pairs whose key equals X to go to worker 1, all pairs whose key equals Y to go to worker 2, and so on", so that all values of a given key end up on a single worker node, which can then continue to do more work on each RDD, now of type (String, Iterable[Int]) as a result of the grouping.
A join is similar in its behavior to a groupByKey, as it has to bring all matching keys together in order to pair up the values from the two streams of RDDs.
Behind the scenes, Spark has to do a couple of things for this to work:
Repartitioning of the data: since all keys may not be available on a single worker.
Data serialization/deserialization and compression: since Spark potentially has to transfer data across nodes, it has to be serialized and later deserialized.
Disk IO: because of shuffle spill, since a single worker may not be able to hold all the data in memory.
For more, see this introduction to shuffling.
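To make the above concrete, here is a minimal, non-streaming PySpark sketch (the sample data and variable names are made up for illustration); every 5-second batch RDD of DS1 and DS2 goes through the same steps:

from pyspark import SparkContext

sc = SparkContext("local[2]", "shuffle-demo")

# Two keyed RDDs, standing in for one 5-second batch of DS1 and DS2.
ds1_batch = sc.parallelize([("spark", 1), ("hadoop", 2), ("spark", 3)])
ds2_batch = sc.parallelize([("spark", 10), ("hadoop", 20)])

# groupByKey shuffles so that all values of a key land in the same partition,
# e.g. ("spark", [1, 3]), ("hadoop", [2])
grouped = ds1_batch.groupByKey().mapValues(list)

# join also shuffles both sides by key, then pairs up the values per key,
# e.g. ("spark", (1, 10)), ("spark", (3, 10)), ("hadoop", (2, 20))
joined = ds1_batch.join(ds2_batch)

print(grouped.collect())
print(joined.collect())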

Related

Input Sampler in Hadoop

My understanding of InputSampler is that it gets data from the record reader, samples keys, and then creates a partition file in HDFS.
I have a few queries about this sampler:
1) Is this sampling task a map task?
2) My data is on HDFS (distributed across the nodes of my cluster). Will this sampler run on the nodes which have the data to be sampled?
3) Will this consume my map slots?
4) Will the sampler run simultaneously with the map tasks of my MR job? I want to know whether it will affect the time consumed by the mappers by reducing the number of available slots.
I found that the InputSampler makes a seriously flawed assumption and is therefore not very helpful.
The idea is that it samples key values from the mapper input and then uses the resulting statistics to evenly partition the mapper output. The assumption is that the key type and distribution are the same for the mapper input and output. In my experience the mapper almost never sends the same key and value types to the reducer as it reads in, so the InputSampler is useless.
In the few cases where I had to sample in order to partition effectively, I ended up doing the sampling as part of the mapper (since only then did I know what keys were being produced) and writing the results out in the mapper's close() method to a directory (one set of stats per mapper). My partitioner then had to perform lazy initialization on its first call: read the mapper-written files, assimilate the stats into some useful structure, and partition subsequent keys accordingly.
Your only other real option is to guess at development time how the key values are distributed and hard-code that assumption into your partitioner.
Not very clean, but it was the best I could figure out.
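As an illustration only, here is a rough Python sketch of that lazy-initialization idea (this is not Hadoop's Java Partitioner API; the stats-file path and JSON format are invented for the example):

import bisect
import json

class LazyRangePartitioner:
    # Sketch of the approach above: load the mapper-written stats on the first
    # call, build cut points, then partition subsequent keys by range.
    def __init__(self, stats_path, num_partitions):
        self.stats_path = stats_path
        self.num_partitions = num_partitions
        self.cut_points = None

    def get_partition(self, key):
        if self.cut_points is None:                # lazy initialization on first call
            with open(self.stats_path) as f:
                sampled = sorted(json.load(f))     # keys the mappers wrote out in close()
            step = max(1, len(sampled) // self.num_partitions)
            self.cut_points = sampled[step::step][: self.num_partitions - 1]
        return bisect.bisect_right(self.cut_points, key)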
This question was asked a long time ago, and many questions were left unanswered.
The only (and most upvoted) answer, by Chris, does not really answer the questions, but it gives an interesting point of view, though one that is a bit too pessimistic and misleading in my opinion, so I'll discuss it here as well.
Answers to the original questions
The sampling task is done in the call to InputSampler.writePartitionFile(job, sampler). The call to this method is blocking, and the sampling is done in the same thread while it runs.
That's why you don't need to call job.waitForCompletion(). It's not a MapReduce job; it simply runs in your client's process. Besides, a MapReduce job needs at least 20 seconds just to start, whereas sampling a small file only takes a couple of seconds.
Thus, the answer to all of your questions is simply "No".
More details from reading the code
If you look at the code of writePartitionFile(), you will find that it calls sampler.getSample(), which calls inputformat.getSplits() to get a list of all the input splits to be sampled.
These input splits are then read sequentially to extract the samples. Each input split is read by a new record reader created within the same method. This means that your client is doing the reading and the sampling.
Your other nodes are not running any "map" or other tasks; they are simply serving HDFS block data to your client for the input splits it needs to sample.
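To picture this, here is a small plain-Python simulation of the steps described above (not Hadoop's actual code; the split and record structures are invented for the example):

import random

def write_partition_file(input_splits, num_reduces, sample_rate):
    # Mimics what InputSampler.writePartitionFile does: the client reads the splits
    # sequentially, samples keys, and derives (num_reduces - 1) sorted cut points.
    samples = []
    for split in input_splits:                 # one "record reader" per split, all client-side
        for key, _value in split:
            if random.random() < sample_rate:
                samples.append(key)
    samples.sort()
    step = len(samples) / num_reduces
    return [samples[int(step * i)] for i in range(1, num_reduces)]

splits = [[("apple", 1), ("kiwi", 2), ("pear", 3)],
          [("banana", 1), ("mango", 2), ("plum", 3)]]
print(write_partition_file(splits, num_reduces=3, sample_rate=1.0))   # ['kiwi', 'pear']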
Using different key types between map input and output
Now, to discuss the answer given by Chris. I agree that the InputSampler and TotalOrderPartitioner are probably flawed in some ways, since they are really not easy to understand and to use... But they do not force the key types to be the same between map input and output.
The InputSampler uses the job's InputFormat (and its RecordReader) keys to create the partition file containing all sampled keys. This file is then used by the TotalOrderPartitioner during the partitioning phase at the end of the Mapper's process to create partitions.
The easiest solution is to create a custom RecordReader, for the InputSampler only, which performs the same key transformation as your Mapper.
To illustrate this, let's say your dataset contains pairs of (char, int), and that your mapper transforms them into (int, int) by taking the character's ASCII value. For example, 'a' becomes 97.
If you perform total-order partitioning of this job with the same InputFormat, your InputSampler will sample letters such as 'a', 'b', 'c'. Then, during the partitioning phase, your mapper output keys will be integer values like 102 or 107, which aren't comparable to the 'a', 'g' or 't' stored in the partition file for distributing keys across partitions. This is inconsistent, and this is why it looks as if the input and output key types are assumed to be the same when the same InputFormat is used for sampling and for your MapReduce job.
So the solution is to write a custom InputFormat with its own RecordReader, used only for the client-side sampling, which reads your input file and applies the same char-to-int transformation before returning each record. This way the InputSampler writes the integer ASCII values from the custom record reader to the partition file, which preserves the same distribution and makes it usable with your mapper's output keys.
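A tiny Python sketch of the idea (a simulation of the behavior, not the actual TotalOrderPartitioner): sample the keys after applying the same transformation as the mapper, so the cut points live in the mapper's output key space:

from bisect import bisect_right

def map_key(ch):
    return ord(ch)          # the same transformation the mapper applies: 'a' -> 97

# Cut points sampled through the custom record reader, i.e. already transformed to ints.
cut_points = sorted(map_key(c) for c in ("g", "p"))     # [103, 112]

def total_order_partition(output_key, cut_points):
    # Binary-search the mapper output key against the sorted cut points,
    # which is roughly what the partition file is used for.
    return bisect_right(cut_points, output_key)

for ch in "adkqz":
    print(ch, map_key(ch), "-> partition", total_order_partition(map_key(ch), cut_points))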
It's not so easy to grasp in a few lines of text, but anybody interested in fully understanding how the InputSampler and TotalOrderPartitioner work should check out this page: http://blog.ditullio.fr/2016/01/04/hadoop-basics-total-order-sorting-mapreduce/
It explains in detail how to use them in different cases.

How is the file with intermediate map values partitioned on a map worker in MapReduce?

I'm trying to understand the MapReduce model, and I need advice because I'm not sure how the file with the intermediate results of the map function is sorted and partitioned. Most of my knowledge about MapReduce comes from the MapReduce papers of Jeffrey Dean & Sanjay Ghemawat and from Hadoop: The Definitive Guide.
The file with the intermediate results of the map function is composed of small sorted and partitioned files. These small files are divided into partitions corresponding to the reduce workers, and then the small files are merged into one file. I need to know how the partitioning of the small files is done. My first thought was that every partition covers some range of keys.
For example: if we've got integer keys in the range <1;100> and the file is divided into three partitions, then the first partition could consist of values with keys in the range <1;33>, the second partition of keys in the range <34;66>, and the third partition <67;100>. The same partitioning would be used in the merged file too.
But I'm not sure about it. Every partition is sent to the corresponding reduce worker. In our example, if we have two reduce workers, the partitions with the first two key ranges (<1;33> and <34;66>) could be sent to the first worker and the last partition to the second worker. But if I'm wrong and the files are divided in another way (i.e. the partitions don't each have their own range of possible keys), then different reduce workers could produce results for the same keys. In that case I would somehow have to merge the results of those reduce workers, right? Could I send these results to the master node and merge them there?
In short: I need an explanation of how the files in the map phase are divided (in case my description is wrong), and of how and where I can process the results of the reduce workers.
I hope I described my problem enough to understand. I can explain it more, of course.
Thanks a lot for your answers.
There is a Partitioner class that does this. Each key/value pair in the intermediate file is passed to the partitioner along with the total number of reducers (partitions) and the partitioner returns the partition number that should handle that specific key/value pair.
There is a default partitioner that does an OK job of partitioning, but if you want better control or if you have a specially formatted (e.g. complex) key then you can and should write your own partitioner.
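For illustration, here is a plain-Python sketch of both schemes (a conceptual simulation, not Hadoop's Java API; the default scheme mirrors the idea behind Hadoop's hash partitioner, the second the range scheme described in the question):

def default_partition(key, num_reducers):
    # Hash-based default: mask to a non-negative value, then modulo the reducer count.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def range_partition(key, upper_bounds):
    # Range scheme from the question: <1;33> -> 0, <34;66> -> 1, <67;100> -> 2
    for i, upper in enumerate(upper_bounds):
        if key <= upper:
            return i
    return len(upper_bounds)

print(default_partition("apple", 3))      # some partition in 0..2
print(range_partition(42, [33, 66]))      # 1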

Can you know how many input values a reducer has in Hadoop without iterating over them?

I am writing a Reducer in Hadoop and I am using its input values to build a byte array which encodes a list of elements. The size of the buffer in which I write my data depends on the number of values the reducer receives. It would be efficient to allocate its size in memory in advance, but I don't know how many values there are without iterating over them with a "foreach" statement.
Hadoop output is an HBase table.
UPDATE:
After processing my data with the mapper, the reducer keys have a power-law distribution. This means that only a few keys have many values (at most 9000), but most of them have just a few. I noticed that by allocating a buffer of 4096 bytes, 97.73% of the values fit in it. For the rest I can reallocate a buffer with double the capacity until all the values fit. For my test case this can be accomplished by reallocating memory 6 times in the worst case, when there are 9000 values for a key.
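A quick Python sketch of that doubling strategy (the 150 000-byte figure below is hypothetical, only to show the growth from the 4096-byte start):

def grow_buffer(buf, needed):
    # Double the capacity until the encoded values fit, copying the old contents over.
    size = len(buf)
    while size < needed:
        size *= 2
    if size == len(buf):
        return buf
    new_buf = bytearray(size)
    new_buf[:len(buf)] = buf
    return new_buf

buf = bytearray(4096)
buf = grow_buffer(buf, 150_000)   # hypothetical size needed by one heavy key
print(len(buf))                   # 262144, i.e. 6 doublings from 4096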
I assume you're going to go through them with for-each anyway, after you've allocated your byte array, but you don't want to have to buffer all the records in memory (since you can only loop through the iterator you get back from your value collection once). Therefore, you could:
Run a counting reducer that outputs every input record and also outputs the count to a record that is of the same value class as the map output, and then run a "reduce-only" job on that result using a custom sort that puts the count first (recommended)
Override the built-in sorting you get with Hadoop to count while sorting and inject that count record as the first record of its output (it's not totally clear to me how you would accomplish the override, but anything's possible)
If the values are unique, you might be able to have a stateful sort comparator that retains a hash of the values with which it gets called (this seems awfully hacky and error prone, but I bet you could get it to work if the mechanics of secondary sort are confined to one class loader in one JVM)
Design your reducer to use a more flexible data structure than a byte array, and convert the result to a byte array before outputting if necessary (highly recommended)
You can use the following paradigm:
Map: each mapper keeps a map from keys to integers, where M[k] is the number of values sent out with a certain key k. At the end of its input, the mapper also sends out the key-value pairs (k, M[k]).
Sort: Use secondary sort so that the pairs (k, M[k]) come before the pairs (k, your values).
Reduce: Say we're looking at key k. Then the reducer first aggregates the counts M[k] coming from the different mappers to obtain a number n. This is the number you're looking for. Now you can create your data structure and do your computation.
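Here is a small Python simulation of that paradigm (the tags and sample records are invented; in real Hadoop the ordering would come from a secondary sort on a composite key):

from collections import defaultdict

def mapper(records):
    # Emit each value tagged with 1, plus one tagged-0 count record per key at the end.
    # The tag is what the secondary sort orders on, so counts arrive first at the reducer.
    counts = defaultdict(int)
    for k, v in records:
        counts[k] += 1
        yield k, (1, v)
    for k, n in counts.items():
        yield k, (0, n)

def reducer(key, sorted_values):
    # Count records sort first, so n is known before the first data record is seen.
    n, buf, i = 0, None, 0
    for tag, payload in sorted_values:
        if tag == 0:
            n += payload                 # sum the per-mapper counts
        else:
            if buf is None:
                buf = [None] * n         # allocate once, with the exact size
            buf[i] = payload
            i += 1
    return key, n, buf

# Simulate the shuffle and the secondary sort for one mapper's output.
shuffled = defaultdict(list)
for k, tagged in mapper([("a", 10), ("b", 20), ("a", 30)]):
    shuffled[k].append(tagged)
for k in shuffled:
    print(reducer(k, sorted(shuffled[k])))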

Join vs COGROUP in PIG

Are there any advantages (with respect to performance / number of MapReduce jobs) to using COGROUP instead of JOIN in Pig?
http://developer.yahoo.com/hadoop/tutorial/module6.html talks about the difference in the type of output they produce. But, ignoring the "output schema", is there any significant difference in performance?
There are no major performance differences. The reason I say this is that they both end up as a single MapReduce job that sends the same data forward to the reducers. Both need to send all of the records forward with the foreign key as the key. If anything, COGROUP might be a bit faster because it does not compute the cartesian product across the matches and instead keeps them in separate bags.
If one of your data sets is small, you can use a join option called "replicated join". This distributes the second data set across all map tasks and loads it into main memory. That way it can do the entire join in the mapper and doesn't need a reducer. In my experience this is very much worth it, because the bottleneck in joins and cogroups is shuffling the entire data set to the reducers. You can't do this with COGROUP, to my knowledge.
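Here is a conceptual Python sketch of what a replicated (map-side) join buys you (illustrative only, not Pig itself; the sample relations are invented):

def map_side_join(big_records, small_relation):
    # The small relation is loaded into memory on every map task,
    # so records can be joined as they stream by, with no shuffle and no reducer.
    small_by_key = {}
    for k, v in small_relation:
        small_by_key.setdefault(k, []).append(v)
    for k, v in big_records:
        for w in small_by_key.get(k, []):
            yield k, (v, w)

big = [("u1", "click"), ("u2", "view"), ("u1", "buy")]
small = [("u1", "France"), ("u2", "Peru")]
print(list(map_side_join(big, small)))
# [('u1', ('click', 'France')), ('u2', ('view', 'Peru')), ('u1', ('buy', 'France'))]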

How to ensure that MapReduce tasks are independent of each other?

I'm curious: how do MapReduce, Hadoop, etc. break a chunk of data into independently operated-on tasks? I'm having a hard time imagining how that can work, considering it is common to have data that is quite interrelated, with state conditions between tasks, etc.
If the data IS related, it is your job to ensure that the information is passed along. MapReduce breaks up the data and processes it regardless of any (not implemented) relations:
Map just reads data in blocks from the input files and passes it to the map function one "record" at a time. The default record is a line (but this can be modified).
You can annotate the data in the map with its origin, but what you basically do in the map is categorize the data. You emit a new key and new values, and MapReduce groups by the new key. So if there are relations between different records, choose the same (or a similar *1) key when emitting them, so that they are grouped together.
For the reduce, the data is partitioned/sorted (this is where the grouping takes place), and afterwards the reduce function receives all the data of one group: one key and all its associated values. Now you can aggregate over the values. That's it.
So you have an overall group-by implemented by MapReduce. Everything else is your responsibility. You want a cross product of two sources? Implement it, for example, by introducing artificial keys and multi-emitting (fragment-and-replicate join). Your imagination is the limit. And you can always pass the data through another job.
*1: similar, because you can influence the choice of grouping later on. Normally the grouping is done by the identity function, but you can change this.
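A minimal Python sketch of that "categorize, then let the framework group" idea (the records and the choice of user id as the relating key are invented for the example):

from collections import defaultdict

def mapper(record):
    # Categorize: emit a new key so that related records land in the same group.
    source, user_id, payload = record
    yield user_id, (source, payload)

def group_by_key(mapped_pairs):
    # What the partition/sort phase gives you: all values of one key, together.
    groups = defaultdict(list)
    for k, v in mapped_pairs:
        groups[k].append(v)
    return groups

records = [("orders", "u1", 30), ("orders", "u2", 5), ("profiles", "u1", "alice")]
pairs = (kv for r in records for kv in mapper(r))
for key, values in group_by_key(pairs).items():
    print(key, values)            # a reduce function would aggregate over `values` here
# u1 [('orders', 30), ('profiles', 'alice')]
# u2 [('orders', 5)]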

Resources