I'm trying to use SSDs to improve Hive performance.
SSDs provide high-speed random access, and I want to take advantage of that by changing how the MapReduce code that Hive generates is executed.
My idea is to simplify or even eliminate the shuffle step.
Is this possible? If so, where would I make the change?
P.S. Please also explain what happens while Hive is running and where its temporary files are stored.
I do not know English well; I'm sorry.
Thank you.
In theory you can write your own partitioner and send the data to a reducer that runs on the same node where the mapper ran.
If you do that, though, you will never get the output merged into one result; each node only sees its own mapper's data, so the output stays split. Avoiding the shuffle is therefore not a good idea.
If you have a fast disk such as an SSD, you can increase the block size.
Usually the block size is chosen so that the seek time is no more than about 1% of the time needed to transfer the whole block; for example, with a 10 ms seek time and a 100 MB/s transfer rate, that works out to a block of roughly 100 MB.
A larger block size also reduces the number of mappers used, since there are fewer splits. Fewer mappers, in turn, means less shuffling.
Using a compressed file format for the intermediate files also speeds up the job.
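For reference, here is a minimal sketch of turning on intermediate (map-output) compression in the job configuration; the class and job names are placeholders, and Snappy is assumed to be installed on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress map output before it is spilled and shuffled.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "compressed-shuffle-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}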
The question is in the title: when is it good to use compression? By "good" I mean faster processing.
My pipeline consists of multiple MR jobs, and intermediate results are stored in sequence files.
The data is numeric time series. Also, the output of one job is often the same size as its input, so the amount of data transferred and stored can be large.
I would like to know whether I can expect a speedup from compression, or whether compressing and decompressing the data will take more time than it saves.
It is almost always a good idea to enable compression of intermediate data with a fast codec (read: Snappy). You won't be penalized much even if your data turns out to be incompressible.
Compression won't hurt your job as long as you are aware of what you are trying to achieve; just make sure your compressed data is splittable. I found the bzip2 format a good compromise between compression ratio and CPU usage, but it is better to do in-house testing with different formats on your own data set.
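Of the standard codecs, bzip2 is the one whose compressed plain-text files remain splittable on their own, which is why it is often chosen despite the higher CPU cost. A rough sketch of enabling it for a job's final output (class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Bzip2OutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bzip2-output-example");
        // bzip2-compressed text output remains splittable for downstream jobs.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // ... mapper, reducer, input and output paths as usual ...
    }
}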
Compression gives two major benefits:
1) It uses less disk space while the MapReduce job is running (both the intermediate output and the final output can be compressed).
2) It improves job performance, since compressed data is what gets sent between cluster nodes during the shuffle phase.
Hope that helps.
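Since the intermediate results in the question above are stored as sequence files, one concrete option is block-compressed SequenceFile output, which stays splittable between jobs. A rough sketch (class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "seqfile-output-example");
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Block compression groups many records per compression block,
        // so the file remains splittable even with a non-splittable codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        // ... mapper, reducer, input and output paths as usual ...
    }
}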
I was reading Hadoop: The Definitive Guide. It says that MapReduce is good for updating larger portions of a database, and that it uses sort and merge to rebuild the database, which is limited by transfer time.
It also says that an RDBMS is good for updating only smaller portions of a big database, because it uses a B-Tree, which is limited by seek time.
Can anyone elaborate on what both of these claims really mean?
I am not really sure what the book means, but you would usually run a MapReduce job to rebuild the entire database (or anything else) as long as you still have the raw data.
The really good thing about Hadoop is that it is distributed, so performance is not really a problem: you can just add more machines.
Let's take an example: you need to rebuild a complex table with 1 billion rows. With an RDBMS you can only scale vertically, so you depend more on the power of the CPU and on how fast the algorithm is. You will be doing it with SQL commands: select some data, process it, and so on. You will most likely be limited by seek time.
With Hadoop MapReduce you can just add more machines, so performance is not the problem. Say you use 10,000 mappers; the job is divided into 10,000 mapper containers, and because of Hadoop's data locality these containers usually already have the data stored on their local hard drives. The output of each mapper is a key-value structured format on its local hard drive, sorted by key by the mapper.
Now the data needs to be combined, so all of it is sent to the reducers. This happens over the network and is usually the slowest part when you have big data. Each reducer receives its share of the data and merge-sorts it for further processing. In the end you have a file that could simply be loaded into your database.
The transfer from mappers to reducers is usually what takes the longest when you have a lot of data, and the network is usually your bottleneck. Maybe this is what the book means by depending on transfer time.
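Purely as an illustration of that key-value flow (not something from the book), here is a toy word-count-style mapper and reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Toy illustration: each mapper writes (word, 1) pairs, which the framework
// sorts by key on the mapper's local disk; all values for a given word are
// then shuffled to one reducer, which merges and sums them.
public class WordCountSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }
}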
I've tried combining the files I upload into HDFS into a single file, so HDFS now holds fewer files than before but the same total amount of data. In this situation I get a faster MapReduce time, I think because the job launches fewer containers (map or reduce tasks).
So I want to ask: how can I set the block size properly to get faster MapReduce? Should I set it larger than the default (to minimize the number of containers)?
Thanks a lot.
Do you know why Hadoop has such strong and fast compute capacity? Because it divides a single big job into many small jobs. That is the spirit of Hadoop.
There are also many mechanisms coordinating the workflow, so adjusting the block size alone may not get you to your goal.
You can set the parameter dfs.block.size (in bytes) to adjust the block size.
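As a sketch, the property can also be set programmatically. Note that dfs.block.size is the older property name (newer releases call it dfs.blocksize), the value is in bytes, and it only affects files written after the change; the path below is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 256 MB blocks instead of the default (value is in bytes).
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);

        // Only files created with this configuration get the new block size;
        // existing files keep the block size they were written with.
        FileSystem fs = FileSystem.get(conf);
        fs.create(new Path("/user/example/bigfile.dat")).close();  // placeholder path
    }
}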
I'm using a custom output format that outputs a new sequence file per mapper per key, so you end up with something like this:
Input
Key1 Value
Key2 Value
Key1 Value
Files
/path/to/output/Key1/part-00000
/path/to/output/Key2/part-00000
I've noticed a huge performance hit: it usually takes around 10 minutes simply to map the input data, but after two hours the mappers were not even halfway done, although they were still outputting rows. I expect the number of unique keys to be around half the number of input rows, roughly 200,000.
Has anyone done anything like this, or can you suggest anything that might help the performance? I'd like to keep this key-splitting process within Hadoop if possible.
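For reference, the kind of per-key splitting I mean is conceptually similar to the MultipleOutputs sketch below (this is not my actual code, and the Text key/value types are just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch only: route each key's records under its own directory with
// MultipleOutputs instead of a custom OutputFormat.
public class PerKeyReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Produces e.g. <outputdir>/Key1/part-r-00000
            out.write(key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        out.close();
    }
}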
Thanks!
I believe you should revisit your design. I don't believe HDFS scales well beyond roughly 10M files. I suggest reading more about Hadoop, HDFS and MapReduce. A good place to start is http://www.cloudera.com/blog/2009/02/the-small-files-problem/.
Good luck!
EDIT 8/26: Based on @David Gruzman's comment, I looked deeper into the issue. Indeed, the penalty for storing a large number of small files applies only to the NameNode; there is no additional space penalty on the data nodes. I have removed the incorrect part of my answer.
It sounds like writing the output to a key-value store might help a lot.
For example, HBase might suit your needs, since it is optimized for a large number of writes and you would reuse part of your Hadoop infrastructure.
There is an existing output format for writing directly to HBase: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
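Roughly, the job wiring for that output format looks like the sketch below; the table name is a placeholder, and the map/reduce code would need to emit HBase Put objects:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HBaseSinkJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "my_output_table");  // placeholder table name

        Job job = Job.getInstance(conf, "write-to-hbase-example");
        job.setOutputFormatClass(TableOutputFormat.class);
        // ... mapper/reducer that write org.apache.hadoop.hbase.client.Put values ...
    }
}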
When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on the mapper output for each spill, but it seems like it would also be beneficial to run it during the intermediate steps of the merge sort. I'm assuming here that, at some stages of the sort, mapper output for equivalent keys is held in memory.
If this doesn't currently happen, is there a particular reason, or is it just something that hasn't been implemented?
Thanks in advance!
Combiners are there to save network bandwidth.
The map output gets sorted directly:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
This happens right after the real map work is done. While iterating through the buffer, it checks whether a combiner has been set; if so, it combines the records, otherwise it spills directly to disk.
The important parts are in the MapTask class, if you'd like to see it for yourself:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
// some fields
for (int i = 0; i < partitions; ++i) {
// check if configured
if (combinerRunner == null) {
// spill directly
} else {
combinerRunner.combine(kvIter, combineCollector);
}
}
This is the right stage to save disk space and network bandwidth, because it is very likely that the output will have to be transferred.
During the merge/shuffle/sort phase it is not beneficial, because you would have to crunch a larger amount of data compared with running the combiner at map-finish time.
Note that the sort phase shown in the web interface is misleading; it is just pure merging.
There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is from Tom White's "Hadoop: The Definitive Guide" - https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )
The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing those sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So to the original question posted, yes, the Combiner is already being applied at this early step.
The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
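For completeness, wiring a combiner into a job is a single call. Here is a minimal sketch that reuses Hadoop's stock IntSumReducer, which is safe as a combiner because integer addition is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-example");
        // The combiner runs on map-side spills and merges, as described above.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... mapper, input and output paths as usual ...
    }
}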
The combiner only runs in the way you already understand.
I suspect the reason it works only this way is that it reduces the amount of data being sent to the reducers, which is a huge gain in many situations. Meanwhile, on the reducer side, the data is already there, and whether you combine it during the sort/merge or in your reduce logic doesn't really matter computationally (it's either done now or later).
So I guess my point is: you may get gains by combining during the merge, as you suggest, but it won't be as much as with the map-side combiner.
I haven't gone through the code, but according to Hadoop: The Definitive Guide by Tom White (3rd edition), if a combiner is specified it will also run during the merge phase on the reducer side. Here is the relevant excerpt:
" The map outputs are copied to the reduce task JVM’s memory if they are small enough
(the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which
specifies the proportion of the heap to use for this purpose); otherwise, they are copied
to disk. When the in-memory buffer reaches a threshold size (controlled by
mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs
(mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
"