I am trying to explore Spark Streaming with Kafka as the source. As per this link, createDirectStream has 1:1 parallelism between Kafka partitions and Spark partitions. So this would mean that if there is a Kafka topic with 3 partitions, then 3 Spark executors would run in parallel, each reading a partition.
Questions
Suppose I have a window operation after the data is read. Does the window operation apply the window across partitions or within one partition? I.e., let's say my batch interval is 10s and the window interval is 50s. Does the window accumulate 50s of data across partitions (if each partition has 10 records for those 50s, does the window hold 30 records), or 50s of data per partition in parallel (if each partition has 10 records for those 50s, does the window hold 10 records)?
pseudo code:
rdd = createDirectStream(...)
rdd.window()
rdd.saveAsTextFile() // Does this write 30 records in 1 file, or 3 files with 10 records per file?
Suppose I have this...
Pseudo code:
rdd = createDirectStream()
rdd.action1()
rdd.window()
rdd.action2()
Let's say I have 3 Kafka partitions and 3 executors (each reading one partition of the topic). This spins up 2 jobs, as there are 2 actions. Each Spark executor would hold a partition of the RDD, and action1 is applied in parallel. Now for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again, which is not good)?
Q) If there is a Kafka topic with 3 partitions then 3 Spark executors would run in parallel, each reading a partition.
In more specific terms, there will be 3 tasks submitted to the Spark cluster, one for each partition. Where these tasks execute depends on your cluster topology and locality settings, but in general you can consider that these 3 tasks will run in parallel.
Q) Suppose I have a window operation after the data is read. Does the window operation apply window across partitions or within one partition?
The fundamental model of Spark, and by transitivity of Spark Streaming, is that operations are declared on an abstraction (RDD/Datasets for Spark, DStream for Spark Streaming), and at the execution level those operations are applied in a distributed fashion, using the native partitioning of the data.
(I'm not sure about the distinction the question makes between "across partitions or within one partition". The window is preserved per partition. The operations are applied according to their own semantics. For example, a map operation is applied per partition, while a count operation is first applied to each partition and then consolidated into one result.)
Regarding the pseudo code:
val dstream = createDirectStream(..., Seconds(30))
dstream.window(Seconds(600)) // this does nothing as the new dstream is not referenced any further
val windowDstream = dstream.window(timePeriod) // this creates a new Windowed DStream based on the base DStream
dstream.saveAsTextFiles() // this writes using the original streaming interval (30 seconds). It will write 1 logical file in the distributed file system with 3 partitions
windowDstream.saveAsTextFiles() // this writes using the windowed interval (600 seconds). It will write 1 logical file in the distributed file system with 3 partitions.
Given this code (note naming changes!):
val dstream = createDirectStream(...)
dstream.action1()
val windowDStream = dstream.window(...)
windowDStream.action2()
for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again - not good)?
In the Direct Stream model, the RDDs at each interval do not contain any data, only offsets (offset-start, offset-end). It's only when an action is applied that the data is read.
A windowed DStream over a direct stream is, therefore, just a series of offsets: Window(1-3) = (offset1-start, offset1-end), (offset2-start, offset2-end), (offset3-start, offset3-end). When an action is applied to that window, these offsets are fetched from Kafka and the operation is applied. This is not "bad" as implied in the question: it prevents us from having to store intermediate data for long periods of time and lets us preserve the operation semantics on the data.
So, yes, the data will be read again, and that's a good thing.
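To make that concrete, here is a minimal Scala sketch assuming the spark-streaming-kafka 0.8 direct API, a 10s batch interval and a 50s window; the broker address, topic name and application name are hypothetical placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWindowSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-window"), Seconds(10))

    // Each batch RDD of the direct stream holds only Kafka offset ranges, not data.
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092"), // hypothetical broker
      Set("my-topic")                                // hypothetical topic
    )

    // action1: runs on every 10s batch; the batch's offsets are fetched from Kafka here.
    dstream.foreachRDD(rdd => println(s"batch count: ${rdd.count()}"))

    // The 50s window is just the union of the offset ranges of the last 5 batches.
    val windowDStream = dstream.window(Seconds(50))

    // action2: re-fetches the offsets covered by the window when it runs.
    windowDStream.foreachRDD(rdd => println(s"window count: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}

Both actions trigger their own fetch of the relevant offset ranges when they run; nothing is kept between them unless the DStream is explicitly persisted.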
Related
The following are the phases of my job:
Phase 1 - Do some computation and persist the temp data into files. There will be multiple temp dataframes persisted and read in the flow.
Phase 2 - Read the temp data, do some other computation, and store it in the final data file.
NOTE: I am persisting multiple temp files because I cannot hold them in memory, since the data is huge (84 million rows, ~2 million distinct primary-key-like values).
I use coalesce(n) or repartition(n), where n is a large number, e.g. 200. This leads to 200 files being created in the output for each of the temp datasets I'm persisting. I know coalesce/repartition is costly for write performance, but I do get better parallelism with n=200 than with n=50. This is all with respect to writes.
Now, this temp data is going to be read by the next processes, so will n=200 be better or n=50?
Also, I am aware that the parent partition number (n) will be the base for the next write operation, and so on.
Qs:
When should I use coalesce (no shuffle) and when repartition (shuffle)?
What partition value should be used, and why?
What strategy should I follow for getting a better performance?
1) Use coalesce when the size of the output files is unlikely to be skewed (e.g. one file at 2GB, the rest at 0GB). Repartitioning is most useful when you want to balance the work among executors so each partition is similarly sized (see the sketch after this list).
2) Set your output partitions based on how you weigh write time versus read time. For example, use a large number of partitions (smaller files) when the data is written and read only once (i.e. intermediate output), but set the partition count lower (larger files) when writing once and reading many times (WORM, e.g. Parquet used for analytics). The more partitions, the more concurrent tasks can run at once.
3) try the different approaches if you can and measure the writing and reading times; determine the tradeoffs that best suit your use case.
It's a lot like compression algorithms, where some compress fast (e.g. LZO), others store with a minimal footprint (e.g. BZip2), and others decompress fast (e.g. Snappy).
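As a minimal sketch of the coalesce/repartition trade-off, assuming Spark 2.x with a SparkSession and hypothetical input and output paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
val df = spark.read.parquet("/data/phase1_input") // hypothetical input path

// repartition(n): full shuffle, but partitions (and therefore output files) come out evenly sized.
df.repartition(200).write.mode("overwrite").parquet("/data/tmp_wide")   // hypothetical path

// coalesce(n): no shuffle, it only merges existing partitions; cheaper to write,
// but it can only reduce the partition count and may leave skewed file sizes.
df.coalesce(50).write.mode("overwrite").parquet("/data/tmp_narrow")     // hypothetical path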
I just started with Camus.
I am planning to run Camus every hour. We get around ~80,000,000 messages every hour and the average message size is 4KB (we have a single topic in Kafka).
I first tried with 10 mappers; it took ~2 hours to copy one hour's data and it created 10 files of ~7GB each.
Then I tried 300 mappers; it brought the time down to ~1 hour, but it created 11 files. Later, I tried 150 mappers and it took ~30 minutes.
So, how do I choose the number of mappers? Also, I want to create more files in Hadoop, as one file is growing to 7GB. What configuration do I have to check?
It should ideally be equal to or less than the number of Kafka partitions in your topic.
That means, for better throughput, your topic should have more partitions and the same number of Camus mappers.
I found the best answer in this article:
The number of maps is usually driven by the number of DFS blocks in the input files. It causes people to adjust their DFS block size to adjust the number of maps.
The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks.
It is best if the maps take at least a minute to execute.
It all depends on the CPU power you have, the type of application (IO-bound, i.e. heavy read/write, or CPU-bound, i.e. heavy processing), and the number of nodes in your Hadoop cluster.
Apart from setting the number of mappers and reducers at the global level, override those values at the job level depending on the data-processing needs of the job.
And one more thing: if you think a Combiner reduces the IO transfer between Mapper and Reducer, use it effectively in combination with a Partitioner.
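As a hedged sketch of overriding these values per job with the plain MapReduce API (not Camus-specific; the job name, split size and reducer count below are assumed values, not recommendations):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val conf = new Configuration()
val job = Job.getInstance(conf, "hourly-ingest") // hypothetical job name

// The mapper count is driven by the number of input splits, so capping the
// split size raises the number of maps for the same input volume.
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024) // ~256 MB per split (assumed)

// The reducer count can be set explicitly, per job.
job.setNumReduceTasks(50) // assumed value, tune to the job's data volume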
So I am using MultipleOutputs from the package org.apache.hadoop.mapreduce.lib.output.
I have a reducer that is doing a join of 2 data sources and emitting 3 different outputs.
55 reduce tasks were invoked, and on average each of them took about 6 minutes to emit data. There were outliers that took about 11 minutes.
So I observed that if I comment out the pieces where the actual output happens, i.e. the calls to mos.write() (multiple outputs), then the average time drops to seconds and the whole job completes in about 2 minutes.
I do have a lot of data to emit (approximately 40-50 GB).
What can I do to speed things up, both with and without considering compression?
Details: I am using TextOutputFormat and writing to an HDFS path/URI.
Further clarifications:
I have small input data to my reducers; however, the reducers are doing a reduce-side join and hence emit a large amount of data. Since an outlier reducer takes about 11 minutes, reducing the number of reducers would increase this time and hence the overall time of my job, so it won't solve my problem.
Input to the reducer comes from 2 mappers.
Mapper 1 -> Emits about 10,000 records. (Key Id)
Mapper 2 -> Emits about 15M records. (Key Id, Key Id2, Key Id3)
In the reducer I get everything belonging to Key Id, sorted by Key Id, Key Id2 and Key Id3.
So I know I have an iterator which gives me Mapper1's output and then Mapper2's output.
Here I store Mapper1's output in an ArrayList and start streaming Mapper2's output.
For every Mapper2 record, I do a mos.write(....)
I conditionally store a part of this record in memory (in a HashSet).
Every time Key Id2 changes, I do an extra mos.write(...)
In the close method of my reducer, I emit anything I stored in the conditional step. So, a third mos.write(...)
I have gone through the article http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/. Regarding the tips mentioned there:
Tip 1: Configuring my cluster correctly is beyond my control.
Tip 2: Use LZO compression, or compression in general. This is something I am trying alongside.
Tip 3: Tune the number of mappers and reducers. My mappers finish really fast (on the order of seconds), probably because they are almost identity mappers. Reducers take some time, as mentioned above (this is the time I'm trying to reduce), so increasing the number of reducers will probably help me, but then there will be resource contention and some reducers will have to wait. This is more of a trial-and-error exercise for me.
Tip 4: Write a combiner. Does not apply to my case (reduce-side joins).
Tip 5: Use the apt Writable. I need to use Text as of now. All 3 of these outputs go into directories that have a Hive schema sitting on top of them. Later, when I figure out how to emit Parquet files from multiple outputs, I might change this and the tables' storage format.
Tip 6: Reuse Writables. Okay, this is something I have not considered so far, but I still believe that it's the disk writes that are taking the time, not the processing or the Java heap. Anyway, I'll give it a shot.
Tip 7: Use poor man's profiling. I have kind of already done that and figured out that it's actually the mos.write steps that are taking most of the time.
DO
1. Reduce the number of reducers.
The optimized reducer count (for a general ETL operation) is around 1GB of data per reducer.
Here your input data (in GB) is itself less than the number of reducers.
2. Code optimization can be done. Share the code, or else refer to http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ to optimize it.
3. If this does not help, then understand your data. The data might be skewed. If you don't know whether it is skewed, a skewed join in Pig will help.
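For the code-optimization point, here is a hedged Scala sketch of what Tip 6 (reusing Writables) looks like inside a MultipleOutputs reducer; the reducer class, named output and join logic are hypothetical placeholders, and Tip 2 (compression) is noted in a comment.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

// In the driver, output compression (Tip 2) can be enabled with, e.g.:
//   FileOutputFormat.setCompressOutput(job, true)
//   FileOutputFormat.setOutputCompressorClass(job, classOf[SnappyCodec])

class JoinReducer extends Reducer[Text, Text, Text, Text] {
  private var mos: MultipleOutputs[Text, Text] = _
  private val outValue = new Text() // reused for every record (Tip 6)

  override def setup(context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    mos = new MultipleOutputs[Text, Text](context)
  }

  override def reduce(key: Text, values: java.lang.Iterable[Text],
                      context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    val it = values.iterator()
    while (it.hasNext) {
      outValue.set(it.next().toString)   // no new Text allocated per record
      mos.write("joined", key, outValue) // "joined" is a hypothetical named output
    }
  }

  override def cleanup(context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    mos.close()
  }
}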
My use case is as mentioned below.
Read input data from the local file system using sparkContext.textFile(input path).
Partition the input data (80 million records) using RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer function. Without using coalesce() or repartition() on the input data, Spark executes really slowly and fails with an out-of-memory exception.
The issue I am facing here is deciding the number of partitions to apply to the input data. The input data size varies every time, and hard-coding a particular value is not an option. Spark performs really well only when a certain optimum number of partitions is applied to the input data, which I have to find through many iterations (trial and error). That is not an option in a production environment.
My question: is there a rule of thumb to decide the number of partitions required, depending on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated.
I am using Spark 1.0 on YARN.
Thanks,
AG
Two notes from Tuning Spark in the Spark official documentation:
1- In general, we recommend 2-3 tasks per CPU core in your cluster.
2- Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
These are two rules of thumb that help you estimate the number and size of partitions. So, it's better to have small tasks (that can be completed in hundreds of ms).
Determining the number of partitions is a bit tricky. Spark by default will try to infer a sensible number of partitions. Note: if you are using the textFile method with compressed text, then Spark will disable splitting and you will need to repartition (it sounds like this might be what's happening?). With non-compressed data, when loading with sc.textFile you can also specify a minimum number of partitions (e.g. sc.textFile(path, minPartitions)).
The coalesce function is only used to reduce the number of partitions, so you should consider using the repartition() function instead.
As far as choosing a "good" number, you generally want at least as many partitions as executors for parallelism. There already exists some logic to try to determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism.
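Putting the two rules of thumb together, a minimal sketch, assuming an existing SparkContext named sc and hypothetical cluster numbers and paths:

// Hypothetical cluster resources; substitute your own.
val executors = 10
val coresPerExecutor = 4
val tasksPerCore = 3 // the "2-3 tasks per CPU core" guideline
val numPartitions = executors * coresPerExecutor * tasksPerCore // 120 here

// Ask for that many partitions up front (this is a minimum hint)...
val rdd = sc.textFile("/path/to/input", numPartitions) // hypothetical path

// ...or rebalance afterwards if the input arrived with too few partitions
// (for example because the file was compressed and therefore unsplittable).
val rebalanced = rdd.repartition(numPartitions)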
I assume you know the size of the cluster going in; then you can essentially try to partition the data into some multiple of that, and use a RangePartitioner to split the data roughly equally. Dynamic partitions are created based on the number of blocks on the filesystem, and the overhead of scheduling that many tasks mostly kills the performance.
import org.apache.spark.RangePartitioner

val file = sc.textFile("<my local path>")
// Pair each line with a dummy value so the RDD can be partitioned by key.
val partitionedFile = file.map(x => (x, 1))
// Range-partition into 3 roughly equal partitions.
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
Is it possible to achieve distributed reads from an HDFS cluster using an HDFS client on one machine?
I have carried out an experiment with a cluster consisting of 3 data nodes (DN1, DN2, DN3). I then ran 10 simultaneous reads of 10 independent files from a client program located on DN1, and it appeared to be reading data only from DN1. The other data nodes (DN2, DN3) showed zero activity (judging from the debug logs).
I have checked that all the files' blocks are replicated across all 3 datanodes, so if I shut down DN1 then data is read from DN2 (DN2 only).
Increasing the amount of data read did not help (I tried from 2GB to 30GB).
Since I need to read multiple large files and extract only a small amount of data from them (a few KB), I would like to avoid using map/reduce, since it requires setting up more services and also requires writing the output of each split task back to HDFS. Rather, it would be nice to have the result streamed directly back to my client program from the data nodes.
I am using SequenceFile for reading/writing data, in this fashion (JDK 7):
// Run in a thread pool on multiple files simultaneously
List<String> result = new ArrayList<>();
LongWritable key = new LongWritable();
Text value = new Text();
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(filePath))) {
    // Scan the file for records whose key matches the id we are looking for.
    while (reader.next(key)) {
        if (key.get() == ID_I_AM_LOOKING_FOR) {
            reader.getCurrentValue(value);
            result.add(value.toString());
        }
    }
}
return result; // results from multiple workers are merged later
Any help appreciated. Thanks!
I'm afraid the behavior you see is by design. From the Hadoop documentation:
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries
to satisfy a read request from a replica that is closest to the
reader. If there exists a replica on the same rack as the reader node,
then that replica is preferred to satisfy the read request. If the
HDFS cluster spans multiple data centers, then a replica that is
resident in the local data center is preferred over any remote
replica.
It can be further confirmed by the corresponding Hadoop source code:
LocatedBlocks getBlockLocations(...) {
  LocatedBlocks blocks = getBlockLocations(src, offset, length, true, true);
  if (blocks != null) {
    // sort the blocks
    DatanodeDescriptor client = host2DataNodeMap.getDatanodeByHost(clientMachine);
    for (LocatedBlock b : blocks.getLocatedBlocks()) {
      clusterMap.pseudoSortByDistance(client, b.getLocations());
      // Move decommissioned datanodes to the bottom
      Arrays.sort(b.getLocations(), DFSUtil.DECOM_COMPARATOR);
    }
  }
  return blocks;
}
That is, all available replicas are tried one after another if the previous one fails, but the nearest one always comes first.
On the other hand, if you access HDFS files through HDFS Proxy, it does pick datanodes randomly. But I don't think that's what you want.
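If you want to verify where the replicas of a given file actually live, here is a hedged Scala sketch using the FileSystem API; the path is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/data/myfile.seq")) // hypothetical path
// For each block of the file, list the datanodes that hold a replica.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { block =>
  println(s"offset=${block.getOffset} hosts=${block.getHosts.mkString(", ")}")
}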
In addition to what Edwardw said, note that your current cluster is very small (just 3 nodes), and in this case you see the files on all the nodes. This happens because the default replication factor of Hadoop is also 3. In a larger cluster your files would not be available on every node, so accessing multiple files would be likely to go to different nodes and spread the load.
If you work with smaller datasets you may want to look at HBase, which lets you work with smaller chunks and spread the load between nodes (by splitting regions).
I would say your case sounds like a good fit for MR. If we put aside the particular MR computational paradigm, we can say that Hadoop is built to bring code to the data, instead of the opposite. Moving code to the data is essential for scalable data processing.
On the other hand, setting up MapReduce is easier than HDFS, since it stores no state between jobs.
At the same time, the MR framework will take care of parallel processing for you, something that takes time to do properly yourself.
Another point: if the result of the data processing is so small, there will be no significant performance impact from combining the results together in a reducer.
In other words, I would suggest reconsidering the use of MapReduce.