mapWithState updates the state for each key that is received. How would this work when executors are running in parallel? Is there a single state for a key or there is a state for each partition in Dstream rdd?
Before updatkeByKey, spark shuffles the records using hash partition, so the records always end up in the same partition.
Related
I am trying to explore Spark streaming from Kafka as the source. As per this link, createDirectStream has 1:1 parallelism between kafka partitions and Spark. So this would mean that, if there is a Kafka topic with 3 partitions then 3 spark executors would run parallel, each reading a partition.
Questions
Suppose i have a window operation after the data is read. Does the window operation apply window across partitions or within one
partition i.e. lets say my batch interval is 10s and window interval
is 50s. Does window accumulate data for 50s of data across partitions
(if each partition has 10 records each for 50s, does window hold 30
records) or 50s of data per partition in parallel (if each partition
has 10 records each for 50s, does window hold 10 records)?
pseudo code:
rdd = createDirectStream(...)
rdd.window()
rdd.saveAsTextFile() //Does this write 30 records in 1 file or 3 files
with 10 records per file?
Suppose i have this...
Pseudo code:
rdd = createDirectStream()
rdd.action1()
rdd.window()
rdd.action2()
Lets say, i have 3 kafka partitions and 3 executors (each reading a
topic). This spins 2 jobs as there are 2 actions. Each spark executor
would have partition of the RDD and action1 is applied in parallel.
Now for action2, would the same set of executors be used (otherwise,
the data has to be read from Kafka again - not good)?
Q) if there is a Kafka topic with 3 partitions then 3 spark executors would run parallel, each reading a partition.
In more specific terms, there will be 3 tasks submitted to the Spark cluster, one for each partition. Where these tasks execute depend on your cluster topology and locality settings but in general you can consider that these 3 tasks will run in parallel.
Q) Suppose I have a window operation after the data is read. Does the window operation apply window across partitions or within one partition?
The fundamental model of Spark and by transitivity of Spark Streaming is that operations are declared on an abstraction (RDD/Datasets for Spark, DStream for Spark Streaming) and at the execution level, those operations will be applied in a distributed fashion, using the native partitioning of the data.
((I'm not sure about the distinction the question makes between "across partitions or within one partition". The window will be preserved per partition. The operation(s) will be applied according to their own semantics. For example, a map operation will be applied per partition, while a count operation will be first applied to each partition and then consolidated to one result.))
Regarding the pseudo code:
val dstream = createDirectStream(..., Seconds(30))
dstream.window(Seconds(600)) // this does nothing as the new dstream is not referenced any further
val windowDstream = dstream.window(timePeriod) // this creates a new Windowed DStream based on the base DStream
dstream.saveAsTextFiles() // this writes using the original streaming interval (30 seconds). It will write 1 logical file in the distributed file system with 3 partitions
windowDstream.saveAsTextFiles() // this writes using the windowed interval (600 seconds). It will write 1 logical file in the distributed file system with 3 partitions.
Given this code (note naming changes!):
val dstream = createDirectStream(...)
dstream.action1()
val windowDStream = dstream.window(...)
windowDStream.action2()
for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again - not good)?
In the case of Direct Stream model, the RDDs at each interval do not contain any data, only offsets (offset-start, offset-end). It's only when an action is applied that the data is read.
A windowed dstream over a direct producer is, therefore, just a series of offsets: Window (1-3) = (offset1-start, offset1-end), (offset2-start, offset2-end), (offset3-start, offset3-end). When an action is applied to that window, these offsets will be fetched from Kafka and the operation will be applied. This is not "bad" as implied in the question. This prevents us from having to store intermediate data for long periods of time and lets us preserve operation semantics on the data.
So, yes, the data will be read again, and that's a good thing.
When reading about MapReduce, I read the below interesting lines:
"But how do the Reducer’s know which nodes to query to get their
partitions? This happens through the Application Master. As each
Mapper instance completes, it notifies the Application Master about
the partitions it produced during its run. Each Reducer
periodically queries the Application Master for Mapper hosts until it
has received a final list of nodes hosting its partitions."
I have a doubt here. When they say Each Reducer what does it mean exactly? Will the reducers be allocated before the starting of the map phase and also how are the reducer nodes chosen?
Reducers can start before the maps are done with the processing of the data. Once they start they can pull the data from the mapper machines, but they will start the processing only after all the mappers are done processing of the data.
mapred.reduce.slowstart.completed.maps is the property to configure this behaviour. More information on the property here.
I have been using Spark and I am curious about how exactly RDDs work. I understand that an RDD is a pointer to the data. If I am trying to create an RDD for a HDFS file, I understand that the RDD will be a pointer to the actual data on the HDFS file.
What I do not understand is where the data gets stored in-memory. When a task is sent to the worker node, does the data for a specific partition get stored in-memory on that worker node? If so, what happens when an RDD partition is stored in memory on worker node1, but worker node2 has to compute a task for the same partition of the RDD? Does worker node2 communicate with worker node1 to get the data for the partition and store it in its own memory?
In principle, tasks are divided across executors, each one representing its own separate chunk of data (for instance, from HDFS files or folders). The data for a task is loaded in local memory for that executor. Multiple transformations can be chained on the same task.
If, however, a transformation needs to pull data from more than one executor, a new set of tasks will be created, and the results from the previous tasks will be shuffled and re-distributed across executors. For instance, many of the *byKey transformation will shuffle the entire data around, through HDFS, so that executors can perform the second set of tasks. The number of times and valume of shuffled data is critical in Spark's performance.
What is difference between partition and replica of a topic in kafka cluster.
I mean both store the copies of messages in a topic. Then what is the real diffrence?
When you add the message to the topic, you call send(KeyedMessage message) method of the producer API. This means that your message contains key and value. When you create a topic, you specify the number of partitions you want it to have. When you call "send" method for this topic, the data would be sent to only ONE specific partition based on the hash value of your key (by default). Each partition may have a replica, which means that both partitions and its replicas store the same data. The limitation is that both your producer and consumer work only with the main replica and its copies are used only for redundancy.
Refer to the documentation: http://kafka.apache.org/documentation.html#producerapi
And a basic training: http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign
Topics are partitioned across multiple nodes so a topic can grow beyond the limits of a node. Partitions are replicated for fault tolerance. Replication and leader takeover is one of the biggest difference between Kafka and other brokers/Flume. From the Apache Kafka site:
Each partition has one server which acts as the "leader" and zero or
more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader for
some of its partitions and a follower for others so load is well
balanced within the cluster.
partition: each topic can be splitted up into partitions for load balancing (you could write into different partitions at the same time) & scalability (the topic can scale up without the instance limitations); within the same partition the records are ordered;
replica: for fault-tolerant durability mainly;
Quotes:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
There is a quite intuitive tutorial to explain some fundamental concepts in Kafka: https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm
Furthermore, there is a workflow to get you through the confusing jumgle: https://www.tutorialspoint.com/apache_kafka/apache_kafka_workflow.htm
Partitions
A topic consists of a bunch of buckets. Each such bucket is called a partition.
When you want to publish an item, Kafka takes its hash, and appends it into the appropriate bucket.
Replication Factor
This is the number of copies of topic-data you want replicated across the network.
In simple terms, partition is used for scalability and replication is for availability.
Kafka topics are divided into a number of partitions. Any record written to a particular topic goes to particular partition. Each record is assigned and identified by an unique offset. Replication is implemented at partition level. The redundant unit of topic partition is called replica. The logic that decides partition for a message is configurable. Partition helps in reading/writing data in parallel by splitting in different partitions spread over multiple brokers. Each replica has one server acting as leader and others as followers. Leader handles the read/write while followers replicate the data. In case leader fails, any one of the followers is elected as the leader.
Hope this explains!
Further Reading
Partitions store different data of the same type and
Yes, you can store the same message in different topic partitions but your consumers need to handle duplicated messages.
Replicas are a copy of these partitions in other servers.
Your number of replicas will be defined by the number of kafka brokers (servers) of your cluster
Example:
Let's suppose you have a Kafka cluster of 3 brokers and inside you have a topic with name AIRPORT_ARRIVALS that receives messages of Flight information and it has 3 partitions; partition 1 for flight arrivals from airline A, partition 2 from airline B, and partition 3 from airline C. All these messages will be initially written in one broker (leader) and a copy of each message will be stored/replicated to the other 2 Kafka broker (followers). Disclaimer; this example is only for an easier explanation and not an ideal way to define a message key because you could ending up with unbalanced load over specific partitions.
Partitions are the way that Kafka provides redundancy.
Kafka keeps more than one copy of the same partition across multiple brokers.
This redundant copy is called a Replica. If a broker fails, Kafka can still serve consumers with the replicas of partitions that failed broker owned
I have a confusion about the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I would get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look on it i hope this will helpful