Ideas to improve the performance of Java MapReduce - hadoop

I am currently working on a Java MapReduce job. We read each line in the Java Mapper class and then validate it against a database. The problem is that the database holds around 5 million records.
The input file to the Mapper may also contain around 1 million records.
So for each line we scan 8 million records.
This process takes a very long time.
Can anybody suggest a better way to improve the performance?
I have considered running multiple maps and parallel execution (though Hadoop MapReduce already does this), but looking at the current run time, I think it should not take this long.
Maybe I am missing some configuration for the MapReduce job.
Thanks in advance for your help.

I would suggest not validating rows in Java code, but filtering out unwanted rows with a more restrictive SQL WHERE clause instead. It should give you a few percent in performance, depending on how many rows the filter removes.
I would also suggest looking into Apache Spark, which is a much faster layer on top of Hadoop.
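As a complement to the suggestions above, a common alternative is to fetch the reference keys from the database once per mapper and check each input line against an in-memory set, rather than scanning the database for every line. The sketch below is plain Java with no Hadoop dependency; the class name, the comma-separated record layout, and the rule "a line is valid if its first field is a known key" are all assumptions made for illustration, and it only applies if the key set fits in the mapper's heap.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LookupValidation {
    // Hypothetical validation rule: a line is valid if its first
    // comma-separated field appears among the reference keys.
    static List<String> validLines(List<String> inputLines, Set<String> referenceKeys) {
        List<String> valid = new ArrayList<>();
        for (String line : inputLines) {
            String key = line.split(",", 2)[0].trim();
            // Hash lookup per line instead of a scan over all reference records.
            if (referenceKeys.contains(key)) {
                valid.add(line);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Stand-in for keys loaded from the database once,
        // for example in a Mapper's setup() method (assumption).
        Set<String> referenceKeys = new HashSet<>(List.of("1001", "1002", "1003"));
        List<String> input = List.of("1001,alice", "9999,bob", "1003,carol");
        System.out.println(validLines(input, referenceKeys));
    }
}
```

In a real job the set would be built once per map task, so the database is read once per task rather than once per input line.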

Related

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to the HBase and Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and the MapReduce operation on it is correct.
Some background:
1. We have an HBase table that is about 1TB and exceeds 100 million records.
2. It has 3 region servers and each region server contains about 80 regions, making 240 regions in total.
3. The records in the table should be fairly uniformly distributed across the regions, from what I know.
And what I'm trying to achieve is that I could filter out rows based on some column values, and export those rows to HDFS filesystem or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiments, and could get the result I want. But I'm not sure I understand the behavior of HBase TableMapper and Scan, which is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I will take the official RowCounter implementation as an example, and I would like to confirm that my understanding is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional argument), it is actually a full table scan. HBase iterates over all records in the table, and emits the row to the map method in RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions the table has. For example, if we have only 1 region in our HBase table, there will be only 1 map task, which is effectively a single thread and does not use any of the parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks per region? For example, when we run RowCounter on a table that has only 1 region, could it still have 10 or 20 tasks counting the rows in parallel?
Since TableMapper also depends on Scan operation, I would also like to confirm my understanding about the Scan operation and performance.
If I use setStartRow / setEndRow to limit the scope of my dataset, then because the rowkey is indexed, it does not hurt our performance, since it does not trigger a full table scan.
In our case, we might need to filter our data based on modified time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over simply filtering in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export. We have the following two approaches, but are not sure which one is better.
1. Using the MapReduce approach described above. But we are not sure whether the parallelism will be bound by how many regions the table has, i.e. the concurrency never exceeds the region count, and we cannot improve performance unless we increase the number of regions.
2. We maintain a rowkey list in a separate place (possibly on HDFS), use Spark to read the file, and then fetch each record with a simple Get operation. All the concurrency occurs on the Spark / Hadoop side.
I would like some suggestions from this community about which solution is better. It would be really helpful. Thanks.
It seems like you have a very small cluster. Scalability also depends on the number of region servers (RS). So merely increasing the number of regions in the table without increasing the number of region servers won't really help you speed up the job. I think 80 regions per RS for that table is decent enough.
I am assuming you are going to use TableInputFormat. It works by running 1 mapper per region and performs server-side filtering based on the Scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not just proportional to the number of regions. There are many other factors, such as the number of RS, the RAM and disk on each RS, and how uniformly the data is distributed.
In general, I would go with approach 1, since you just need to prepare a Scan object and HBase will take care of the rest.
Approach 2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
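To illustrate why limiting a scan by row key avoids touching the whole table, here is a rough analogy in plain Java. HBase stores rows sorted by key, and a Scan treats the start row as inclusive and the stop row as exclusive by default; a sorted map with a sub-range view behaves similarly. This is only a model, not HBase code: the class name, the `row-00N` keys, and the values are made up, and Java's String ordering matches HBase's byte ordering only for simple ASCII keys like these.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class RowKeyRangeSketch {
    // Returns rows with startRow <= key < stopRow, mirroring the
    // inclusive-start / exclusive-stop convention of a row-key range scan.
    static NavigableMap<String, String> rangeScan(
            NavigableMap<String, String> table, String startRow, String stopRow) {
        // subMap is a view located via the sorted structure; rows outside
        // the range are never visited.
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        // Hypothetical table contents, kept sorted by row key.
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("row-00" + i, "value-" + i);
        }
        // Only row-002 and row-003 fall inside the range.
        System.out.println(rangeScan(table, "row-002", "row-004"));
    }
}
```

A filter on a non-key attribute has no such sorted structure to exploit in this model, which is the intuition behind the question about time ranges.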

What can I expect about hive and hadoop in performance?

I'm currently trying to implement a solution with Hadoop using Hive on CDH 5.0 with YARN. My architecture is:
1 NameNode
3 DataNodes
I'm querying ~123 million rows with 21 columns.
My nodes are virtualized, each with 2 vCPUs @2.27 and 8 GB RAM.
I ran some queries and got results, and then ran the same queries on a basic MySQL instance with the same dataset in order to compare.
MySQL is actually much faster than Hive, and I'm trying to understand why. I know my hosts contribute to the bad performance. My main question is: is my cluster sized correctly?
Do I need to add some DataNodes for this amount of data (which is not very enormous in my opinion)?
And if someone has run similar queries on approximately the same architecture, you are welcome to share your results.
Thanks!
"I'm querying ~123 millions rows with 21 columns [...] which is not very enormous in my opinion"
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small datasets like the one you're using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want realtime performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC and a number of other optimizations that greatly improve response time.
Impala (part of CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
Edit:
"I'm saying that with 2 datanodes i get the same performance than with 3"
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...) it incurs all the cost that comes with MapReduce. This cost is more or less constant regardless of the size of data and number of datanodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
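The reasoning above can be written as a simple cost model: total time is a fixed startup cost plus a per-row cost, so the share of time spent on startup shrinks as the row count grows. The sketch below is only an illustration; the 30-second overhead and the per-row cost are invented numbers, not measurements of Hive or MapReduce.

```java
class FixedOverheadModel {
    // Fraction of total job time spent on the fixed startup cost,
    // assuming total = overhead + rows * secondsPerRow.
    static double overheadShare(double overheadSeconds, long rows, double secondsPerRow) {
        double processingSeconds = rows * secondsPerRow;
        return overheadSeconds / (overheadSeconds + processingSeconds);
    }

    public static void main(String[] args) {
        double overheadSeconds = 30.0;   // hypothetical job startup cost
        double secondsPerRow = 0.000001; // hypothetical per-row processing cost
        long[] rowCounts = {100L, 1_000_000L, 1_000_000_000L};
        for (long rows : rowCounts) {
            double share = overheadShare(overheadSeconds, rows, secondsPerRow);
            System.out.printf("%d rows: %.1f%% of time is overhead%n", rows, share * 100.0);
        }
    }
}
```

Under these assumed numbers the overhead dominates for small inputs and becomes a minor part of the total only at very large row counts, which matches the argument in the answer.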

Performance with a large number of multiple output files in Hadoop

I'm using a custom output format that writes a new sequence file per mapper per key, so you end up with something like this:
Input
Key1 Value
Key2 Value
Key1 Value
Files
/path/to/output/Key1/part-00000
/path/to/output/Key2/part-00000
I've noticed a huge performance hit. It usually takes around 10 minutes to simply map the input data, but after two hours the mappers weren't even halfway complete, though they were outputting rows. I expect the number of unique keys to be around half the number of input rows, around 200,000.
Has anyone ever done anything like this, or could anyone suggest anything that might help the performance? I'd like to keep this key-splitting process within Hadoop if possible.
Thanks!
I believe you should revisit your design. I don't believe HDFS scales well beyond 10M files. I suggest reading more on Hadoop, HDFS and Map/Reduce. A good place to start would be http://www.cloudera.com/blog/2009/02/the-small-files-problem/.
Good luck!
EDIT 8/26: Based on @David Gruzman's comment, I looked deeper into the issue. Indeed, the penalty for storing a large number of small files applies only to the NameNode. There is no additional space penalty on the data nodes. I removed the incorrect part of my answer.
It sounds like writing the output to a key-value store might help a lot.
For example, HBase might suit your needs, since it is optimized for a large number of writes and you would reuse part of your Hadoop infrastructure.
There is an existing output format for writing directly to HBase: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
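A quick back-of-the-envelope calculation shows how fast "one file per mapper per key" adds up. The sketch takes the question's figure of roughly 200,000 unique keys; the mapper count of 50 is purely hypothetical, and the worst case assumes every mapper sees every key.

```java
class SmallFilesEstimate {
    // Worst case for one output file per mapper per key:
    // every mapper encounters every key at least once.
    static long worstCaseFiles(long uniqueKeys, long mappers) {
        return Math.multiplyExact(uniqueKeys, mappers);
    }

    public static void main(String[] args) {
        long uniqueKeys = 200_000L; // figure taken from the question
        long mappers = 50L;         // hypothetical number of map tasks
        // prints 10000000
        System.out.println(worstCaseFiles(uniqueKeys, mappers));
    }
}
```

Even with a modest number of mappers, the worst case reaches the 10M-file range mentioned above, and every one of those files is an entry the NameNode has to track.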

Hadoop/Hbase performance improvement of bulk loading

I am loading 10 million records into an HBase table with the importtsv tool from a multi-node Hadoop cluster. Right now this task takes 5 minutes, and I was wondering how I could improve its performance. The importtsv tool does not seem to use reducers at all. Could I force it to use reducers, and would that improve performance? Any other suggestions for improving the performance would be appreciated. Thank you.
Try ImportTsv with HFileOutputFormat, followed by the CompleteBulkLoad tool.
When it comes to performance, there is no easy answer. If the 5 minutes is already bounded by the speed of the network or of the hard disk, you have to move the source data somewhere else or change the hardware.
I don't know importtsv. I would suggest trying a multi-way load. Take a look at Sqoop.
You can get the best HBase bulk load performance by using HFileOutputFormat and CompleteBulkLoad.
Check here.

Query related to Hadoop's map-reduce

Scenario:
I have one subset of a database and one data warehouse. I have brought both of them onto HDFS.
I want to analyse the result based on the subset and the data warehouse.
(In short, for each record in the subset I have to scan each and every record in the data warehouse.)
Question:
I want to do this task using the MapReduce algorithm. I don't understand how to take both files as input in the mapper, or how to handle both files in the map phase of MapReduce.
Please suggest some ideas so that I am able to do this.
Check Section 3.5 (Relational Joins) in Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class is used to have multiple mappers process different files in a single job.
FYI, you could use Apache Sqoop to import the database into HDFS.
Some time ago I wrote a Hadoop MapReduce job for one of my classes. I was scanning several IMD databases and producing merged information about actors (basically the name, the biography and the films an actor appeared in were in different databases). I think you can use the same approach I used for my homework:
I wrote a separate MapReduce job that converted every database file into the same format, placing a two-letter prefix in front of every row it produced, so that I could tell 'BI' (biography), 'MV' (movies) and so on apart. Then I used all these produced files as input for my last MapReduce job, which processed them and grouped them in the desired way.
I am not even sure you need this much work if you really are going to scan every line of the data warehouse. In that case you may be able to do the scan in either the map or the reduce phase (depending on what additional processing you want to do). My suggestion assumes that you actually need to filter the data warehouse based on the subset. If so, my suggestion might work for you.
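The tagging approach described above can be sketched in plain Java, without Hadoop, to show the data flow: a map-like step prefixes each value with a two-letter source tag, and a reduce-like step groups the tagged values by key so that records from different sources end up together. The class name, the tab-separated `key<TAB>value` layout and the sample records are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TaggedJoinSketch {
    // "Map" step: keep the key, prefix the value with the source tag.
    // Records are hypothetical "key<TAB>value" lines.
    static String tag(String sourceTag, String record) {
        String[] parts = record.split("\t", 2);
        return parts[0] + "\t" + sourceTag + ":" + parts[1];
    }

    // "Shuffle and reduce" step: collect all tagged values that share a key.
    static Map<String, List<String>> groupByKey(List<String> taggedRecords) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String record : taggedRecords) {
            String[] parts = record.split("\t", 2);
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> tagged = new ArrayList<>();
        tagged.add(tag("BI", "actor1\tborn 1950")); // biography source
        tagged.add(tag("MV", "actor1\tFilm A"));    // movies source
        tagged.add(tag("MV", "actor2\tFilm B"));
        System.out.println(groupByKey(tagged));
    }
}
```

In a real job, each input would get its own mapper doing the tagging, and the framework's shuffle would do the grouping before the reducer sees the values for a key.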
