I am loading 10 million records into an HBase table through the importtsv tool from a Hadoop multi-node cluster. Right now this task takes 5 minutes, and I was wondering how I could improve the performance. The importtsv tool does not seem to use reducers at all. Is there any way to force it to use reducers, and would that improve performance? Any other suggestions for improving the performance would be appreciated. Thank you.
Try ImportTsv with HFileOutputFormat and the completebulkload tool.
When it comes to performance, there is no easy answer. If the 5 minutes is already bounded by the speed of the network or the hard disks, you have to move the source data somewhere else or change the hardware.
I am not familiar with importtsv. I would suggest you try a multi-way load. Take a look at Sqoop.
You can get the best HBase bulk load performance by using HFileOutputFormat and CompleteBulkLoad.
Check here.
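For reference, a rough driver sketch of that approach using the older HBase Java client API (HTable plus HFileOutputFormat.configureIncrementalLoad); the table name and staging path are placeholders, and the mapper that parses your TSV lines into Puts is omitted. The same thing can be driven from the command line with ImportTsv's -Dimporttsv.bulk.output option followed by completebulkload. Note that configureIncrementalLoad is also what wires in the reduce phase (a TotalOrderPartitioner keyed on the table's region boundaries), which is effectively the "force reducers" behaviour the question asks about.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "tsv-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        // Your mapper goes here: it should emit ImmutableBytesWritable row keys
        // and Put objects built from each parsed TSV line.

        HTable table = new HTable(conf, "my_table");              // placeholder table name
        // Sets the output format to HFileOutputFormat and plugs in a sort reducer plus
        // a TotalOrderPartitioner aligned with the table's region boundaries -- this is
        // the reduce phase that a plain (non-bulk) ImportTsv run does not have.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk_staging"));  // placeholder dir

        if (job.waitForCompletion(true)) {
            // Same effect as the completebulkload tool: moves the finished HFiles into
            // the regions, which is much cheaper than millions of individual Puts.
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/bulk_staging"), table);
        }
        table.close();
    }
}
```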
Related
I am currently working on Java MapReduce. We have functionality where we read each line in a Java Mapper class and then do some validation against a DB. The issue is that the DB has around 5 million records.
The input file to the Mapper may also contain around 1 million records.
So it's like for each line we scan 8 million records.
This process is taking a very long time.
Can anybody suggest a better way to improve the performance?
I am running multiple maps in parallel (though Hadoop MapReduce does this itself), but looking at the current time I think it should not take this long.
Maybe I am missing some configuration for the Java MapReduce job, etc.
Thanks in advance for the help.
I would suggest not validating rows in Java code, but filtering unwanted rows with a more restrictive SQL WHERE clause instead. It should give you a couple of percent in performance, depending on how many rows the filter removes.
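A rough sketch of that idea, combined with loading the reference data once per mapper rather than querying the DB for every input line. The JDBC URL, table and column names, and the WHERE clause are all made up for illustration, and it assumes the filtered reference set fits in the mapper's heap.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Set<String> validKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // One query per mapper instead of one query per input line. Push as much
        // filtering as possible into the WHERE clause so only relevant keys come back.
        String sql = "SELECT ref_key FROM reference_table WHERE status = 'ACTIVE'";
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/refdb", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                validKeys.add(rs.getString(1));
            }
        } catch (SQLException e) {
            throw new IOException("Could not load reference keys", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String recordKey = value.toString().split(",")[0];   // assumes the key is the first field
        if (validKeys.contains(recordKey)) {
            context.write(value, NullWritable.get());        // keep only validated records
        }
    }
}
```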
I would also suggest looking into Apache Spark, which is a much faster overlay on Hadoop.
I know Spark does in-memory computation and is much faster than MapReduce.
I was wondering how well Spark works for, say, fewer than 10,000 records?
I have a huge number of files (each with around 10,000 records, say 100 columns per file) coming into my Hadoop data platform, and I need to perform some data quality checks before I load them into HBase.
I do the data quality checks in Hive, which uses MapReduce at the back end. For each file it takes about 8 minutes, and that's pretty bad for me.
Will Spark give me better performance, say 2-3 minutes?
I know I have to do some benchmarking, but I was trying to understand the basics here before I really get going with Spark.
As I recollect, creating RDDs for the first time is an overhead, and since I have to create a new RDD for each incoming file, that is going to cost me a bit.
I am confused about which would be the best approach for me: Spark, Drill, Storm, or MapReduce itself?
I have been exploring the performance of Drill vs Spark vs Hive over millions of records. Drill and Spark are both around 5-10 times faster in my case (I did not run a performance test on a cluster with significant RAM; I only tested on a single node). The reason for the fast computation is that both of them perform in-memory computation.
The performance of Drill and Spark is almost comparable in my case, so I can't say which one is better. You need to try this at your end.
Testing with Drill will not take much time: download the latest Drill, install it on your MapR Hadoop cluster, add the Hive storage plugin, and run the query.
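If you do benchmark the Spark route from the question, a minimal per-file quality-check sketch with the Java API might look like the following (Spark 1.x JavaSparkContext; the HDFS path and the 100-column expectation are placeholders taken from the question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileQualityCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("file-quality-check");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One small RDD per incoming file; the path is a placeholder.
        JavaRDD<String> lines = sc.textFile("hdfs:///incoming/file-0001.csv");

        // Quality check from the question: every record should have 100 columns.
        long badRows = lines.filter(line -> line.split(",", -1).length != 100).count();
        System.out.println("Rows failing the column-count check: " + badRows);

        sc.stop();
    }
}
```

Keep in mind that for a ~10,000-record file most of the wall-clock time is job submission and executor start-up rather than the check itself, so the gain over Hive comes mainly from avoiding a fresh MapReduce job per file, for example by reusing one long-running SparkContext for many files.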
I have a standard HDP 2.2 environment configured with Hive, HBase and YARN.
I've used Hive (with HBase) to perform a simple count operation on a table that has about 10 million rows, and it resulted in about 10 GB of memory consumption in YARN.
How can I reduce this memory consumption? Why does it need so much memory just to count rows?
A simple count operation involves a MapReduce job at the back end, and in your case that job has to go over all 10 million rows. Look here for a better explanation. This only covers what happens in the background and the execution time, not your question about memory requirements, but at least it will give you a heads-up on where to look; it also has a few suggestions for speeding things up. Happy coding.
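On the memory side: what YARN reports is normally the sum of the containers the job allocates (one per map/reduce task plus the ApplicationMaster), not memory that counting intrinsically needs, so the usual lever is the per-container request size. A hedged illustration using the standard Hadoop 2.x properties (the 1024 MB values are placeholders, not a recommendation):

```java
import org.apache.hadoop.conf.Configuration;

public class LowMemoryJobConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // YARN container size requested for each map / reduce task, in MB (placeholder values).
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 1024);
        // JVM heap inside those containers; keep some headroom below the container size.
        conf.set("mapreduce.map.java.opts", "-Xmx820m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx820m");
        return conf;
    }
}
```

In Hive the same properties can be set per session with SET before running the count, and the total YARN footprint is then roughly the number of concurrent tasks (plus one ApplicationMaster) times the container size.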
Hadoop is not designed to do updates. I tried with Hive, but it has to do an INSERT OVERWRITE, which is a costly operation. We can also work around it with MapReduce, which again is a costly operation.
Is there any other tool or way to do frequent updates on Hadoop, or can I use Spark for the same? Please help; I am not getting enough information about this even after googling a hundred times. Thanks in advance.
If you need real-time updates on Hadoop, HBase is the solution you might want to take a look at. Hive is not meant for random/frequent updates; it is more of a data-crunching tool, not a replacement for an RDBMS.
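For completeness, an "update" in HBase is just a Put on an existing row key; the new version is served on the next read without rewriting the data set. A minimal sketch with the HBase 1.x client API (table, column family, qualifier and values are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UpdateExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) {   // hypothetical table
            // Writing to the same row key overwrites (versions) the cell in place,
            // so there is no INSERT OVERWRITE-style rewrite of the whole data set.
            Put put = new Put(Bytes.toBytes("order-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("SHIPPED"));
            table.put(put);
        }
    }
}
```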
I'm actually trying to implement a solution with Hadoop using Hive on CDH 5.0 with YARN. My architecture is:
1 NameNode
3 DataNodes
I'm querying ~123 million rows with 21 columns.
My nodes are virtualized with 2 vCPUs @ 2.27 GHz and 8 GB of RAM each.
So I tried some queries and got some results, and after that I tried the same queries against a basic MySQL instance with the same dataset in order to compare the results.
And actually MySQL is much faster than Hive, so I'm trying to understand why. I know I have some bad performance because of my hosts. My main question is: is my cluster sized correctly?
Do I need to add more DataNodes for this amount of data (which is not very enormous in my opinion)?
And if anyone has run queries on approximately the same architecture, you are welcome to share your results.
Thanks!
I'm querying ~123 million rows with 21 columns [...] which is not very enormous in my opinion
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small data sets like the one you're using. It's like using a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want real-time performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC and a number of other optimizations that greatly improve response time (see the sketch after this list)
Impala (part of the CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
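As a rough illustration of the first option, here is a sketch that switches the execution engine to Tez and converts a table to ORC through the HiveServer2 JDBC driver. The host, credentials and table names are placeholders, and it assumes Tez is actually installed and available on the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTezOrcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 endpoint; host, port, database and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://namenode:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Run this session's queries on Tez instead of classic MapReduce.
            stmt.execute("SET hive.execution.engine=tez");
            // Copy the data into an ORC table so Hive can skip columns and stripes it does not need.
            stmt.execute("CREATE TABLE big_table_orc STORED AS ORC AS SELECT * FROM big_table");
            // Subsequent queries go against the ORC copy.
            stmt.execute("SELECT COUNT(*) FROM big_table_orc");
        }
    }
}
```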
Edit:
I'm saying that with 2 DataNodes I get the same performance as with 3
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...), it incurs all the overhead that comes with MapReduce. This overhead is more or less constant regardless of the size of the data and the number of DataNodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.