Hive query is taking a long time to generate the dataset - hadoop

I am trying to run a Hive query on a huge amount of data (almost half a petabyte), and the query runs MapReduce internally. It takes a very long time to generate the data set (for the MapReduce jobs to complete). What optimization mechanisms for Hive and Hadoop can I use to make these queries faster? One more important question: does the amount of disk space available to MapReduce, or in the /tmp directory, matter for faster MapReduce?

There is not too much you can do, but I can give a few directions on what can usually be done with Hive:
Write queries that cause less shuffling. For example, try to get map-side joins where possible. You can also structure some operations so that they become map-only queries.
Tune the number of reducers. Sometimes Hive picks far fewer reducers than needed, so you can set the number manually to better utilize your cluster.
If you run a number of queries to do your transformation, you can set a low replication factor for the temporary data in HDFS.
More help can be provided if we know what you are doing.
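To make the reducer and replication suggestions concrete, here is a rough Python sketch. It approximates how Hive sizes the reduce phase (input size divided by a bytes-per-reducer setting, capped at a maximum) and builds the SET statements you might run before the query. The default values, the 4000-reducer override, and the replication factor of 2 are assumptions for illustration only; check the defaults for your Hive version.

```python
# Assumed defaults; verify them against your Hive version before relying on them.
BYTES_PER_REDUCER = 256 * 1024**2   # hive.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 1009                 # hive.exec.reducers.max


def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Approximate Hive's estimate: ceil(input / bytes per reducer), capped."""
    wanted = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(max_reducers, wanted))


def session_settings(reducers, temp_replication=2):
    """Build SET statements to run before the query (values are illustrative)."""
    return [
        "SET hive.auto.convert.join=true;",          # allow map-side joins for small tables
        f"SET mapreduce.job.reduces={reducers};",    # override the estimated reducer count
        f"SET dfs.replication={temp_replication};",  # lower replication for temporary output
    ]


half_petabyte = 512 * 1024**4
print(estimate_reducers(half_petabyte))  # hits the cap, far below what the data size suggests
for line in session_settings(reducers=4000):
    print(line)
```

With half a petabyte of input the estimate runs into the cap, which is one way Hive can end up with fewer reducers than the cluster could use.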

Related

How do we optimise a Spark job if the base table has 130 billion records

We are joining multiple tables and doing complex transformations and enrichments.
The base table has around 130 billion records. How can we optimise the Spark job when Spark filters all the records, keeps them in memory, and does the enrichments with other tables through left outer joins? Currently the Spark job runs for more than 7 hours. Can you suggest some techniques?
Here is what you can try:
Partition the base tables you run your query on. Create partitions on a specific column, such as Department or Date, that you use when joining. If the underlying table is a Hive table, you can also try bucketing.
Try optimised joins that suit your requirements, such as sort-merge join or hash join.
File format: use a columnar format such as Parquet, which Spark reads efficiently for analytical queries. ORC is also columnar, so measure both on your workload rather than assuming one is faster.
If your query has multiple steps and some steps are reused, try caching; Spark supports memory and disk caching.
Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how data is distributed. Try various configurations and see what works best for you.
Spark might perform poorly if there is large skew in the data. If that is the case, you might need further optimisation to handle it.
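One commonly used way to handle skew is key salting. This is a technique I am naming here, not necessarily what the answer above had in mind. The plain-Python sketch below simulates it without Spark: a hot join key gets a random suffix so its rows spread across several partitions. The `partition_of` function is a hypothetical stand-in for a hash partitioner, and the key counts are made up. In a real join the other table has to be replicated once per salt value so the salted keys still match.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Hypothetical stand-in for a hash partitioner (deterministic CRC32)."""
    return zlib.crc32(repr(key).encode()) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, hot_keys, salt_buckets, rng):
    """Pair hot keys with a random salt; other keys get a fixed salt of 0."""
    return [(k, rng.randrange(salt_buckets)) if k in hot_keys else (k, 0)
            for k in keys]


rng = random.Random(42)
keys = ["US"] * 9000 + ["FR"] * 500 + ["JP"] * 500   # made-up, heavily skewed keys
plain = partition_sizes(keys, 8)
salted = partition_sizes(salt_keys(keys, {"US"}, 16, rng), 8)
print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
```

The largest partition is what determines how long the slowest task runs, so that is the number to compare.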
Apart from the techniques mentioned above, you can try the options below to optimize your job.
1. You can partition your data after inspecting its fields. The columns most commonly used for partitioning are date columns, region ID, country code, etc. Once the data is partitioned, you can explain your dataframe with df.explain() and check whether it is using PartitioningAwareFileIndex.
2. Try tuning the Spark settings and cluster configuration to scale with the input data volume.
Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains by changing this parameter.
Use an appropriate number of executors, cores, and executor memory based on compute needs.
Try analyzing the Spark history to identify the stages that consume significant time. This would be a good point to start debugging your job.
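To see why spark.sql.files.maxPartitionBytes matters, here is a back-of-the-envelope sketch of how many scan partitions (and therefore tasks) a file read produces. It is a simplification: Spark's real split sizing also takes the file open cost, the default parallelism, and unsplittable files into account. The 10 TB input size is a made-up figure.

```python
MB = 1024**2
TB = 1024**4


def estimated_scan_partitions(total_bytes, max_partition_bytes):
    """Simplified estimate: ceil(total input size / maximum bytes per partition)."""
    return -(-total_bytes // max_partition_bytes)  # integer ceiling division


input_bytes = 10 * TB  # hypothetical input size
for size_mb in (128, 256, 512):
    tasks = estimated_scan_partitions(input_bytes, size_mb * MB)
    print(f"maxPartitionBytes={size_mb} MB -> about {tasks} scan tasks")
```

Fewer, larger partitions mean less per-task scheduling overhead, at the cost of more memory per task, so the right value depends on your executors.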

Trying to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to the HBase and Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and the MapReduce operation on it is correct.
Some background:
1. We have an HBase table that is about 1 TB and exceeds 100 million records.
2. It has 3 region servers, and each region server contains about 80 regions, making 240 regions in total.
3. The records in the table should be pretty uniformly distributed across the regions, from what I know.
And what I'm trying to achieve is that I could filter out rows based on some column values, and export those rows to HDFS filesystem or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiments, and could get the result I want. But I'm not sure I understand the behavior of the HBase TableMapper and Scan, and that is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I would take the official RowCounter implementation as an example, and I would like to confirm my knowledge is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional argument), it is actually a full table scan. HBase iterates over all records in the table and emits each row to the map method in RowCounterMapper. Is this correct?
The TableMapper divides the work based on how many regions the table has. For example, if we have only 1 region in our HBase table, there will be only 1 map task, which is effectively a single thread and does not use any of the parallel processing of our Hadoop cluster. Is this correct?
If the above is correct, is it possible to configure HBase to spawn multiple tasks for a region? For example, when we run RowCounter on a table that has only 1 region, could it still have 10 or 20 tasks counting rows in parallel?
Since TableMapper also depends on Scan operation, I would also like to confirm my understanding about the Scan operation and performance.
If I use setStartRow / setStopRow to limit the scope of my dataset, then because the rowkey is indexed, it does not hurt our performance, since it does not trigger a full table scan.
In our case, we might need to filter our data based on modified time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over simply filtering in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export. We have the following two approaches, but are not sure which one is better.
Using the MapReduce approach described above. But we are not sure if the parallelism will be bound by how many regions the table has, i.e., the concurrency never exceeds the region count, and we could not improve performance unless we increase the number of regions.
We maintain a rowkey list in a separate place (possibly on HDFS), use Spark to read the file, and then fetch each record with a simple Get operation. All the concurrency occurs on the Spark / Hadoop side.
I would like some suggestions from this community about which solution is better. It would be really helpful. Thanks.
It seems you have a very small cluster. Scalability also depends on the number of region servers (RS). So merely increasing the number of regions in the table without increasing the number of region servers won't really speed up the job. I think 80 regions per RS for that table is decent enough.
I am assuming you are going to use TableInputFormat. It works by running one mapper per region and performs server-side filtering based on the Scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not just proportional to the number of regions. Many other factors matter, such as the number of RSs, the RAM and disk on each RS, and how uniformly the data is distributed.
In general, I would go with #1, since you just need to prepare a Scan object and HBase will take care of the rest.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
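As a rough mental model of approach #1, here is a small plain-Python simulation. It does not use the HBase client API, and the table, rowkey format, and helper names are all made up. It shows two things: a start/stop row narrows the work because rows are stored sorted by rowkey (the stop row is exclusive), and a filter on a column value such as "type" still has to look at every row inside the scanned range.

```python
import bisect
from collections import defaultdict

# Toy table: rows kept sorted by rowkey, as HBase stores them within a region.
rows = [(f"row{i:05d}", {"type": str(i % 3 + 1)}) for i in range(1000)]
rowkeys = [key for key, _ in rows]


def range_scan(start_row, stop_row):
    """Rows with start_row <= rowkey < stop_row, located by binary search."""
    lo = bisect.bisect_left(rowkeys, start_row)
    hi = bisect.bisect_left(rowkeys, stop_row)
    return rows[lo:hi]


def group_by_type(scanned_rows):
    """Mimic a mapper that routes each row to an output keyed by its 'type' column."""
    outputs = defaultdict(list)
    for key, columns in scanned_rows:   # every scanned row is inspected
        outputs[columns["type"]].append(key)
    return dict(outputs)


subset = range_scan("row00100", "row00400")
by_type = group_by_type(subset)
print(len(subset), {t: len(keys) for t, keys in sorted(by_type.items())})
```

In the real job each mapper would do the grouping step for its own region and write to one output per type.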

Hadoop Performance When retrieving Data Only

We know that the performance of Hadoop may be increased by adding more data nodes. My question is: if we only want to retrieve the data, without processing or analyzing it, will adding more data nodes be useful? Or will it not increase performance at all, because we only have retrieve operations without any computations or MapReduce jobs?
I will try to answer in parts:
If you only retrieve information from a Hadoop cluster or HDFS, it is similar to the cat command in Linux: you are only reading data, not processing it.
If you want calculations such as SUM, AVG, or any other aggregate function on top of your data, the concept of REDUCE comes in, and hence MapReduce comes into the picture.
So Hadoop is worthwhile when your data is huge and you also do calculations. I think there is no performance benefit when reading a small amount of data in HDFS compared with reading a large amount of data in HDFS (think of storing your data in an RDBMS and running only select * statements on a daily basis), but when your data grows exponentially and you want to do calculations, your RDBMS query would take time to execute.
For MapReduce to work efficiently on huge data sets, you need a good number of nodes and enough computing power, depending on your use case.

What performance can I expect from Hive and Hadoop?

I'm trying to implement a solution with Hadoop using Hive on CDH 5.0 with YARN. My architecture is:
1 NameNode
3 DataNodes
I'm querying ~123 million rows with 21 columns
My nodes are virtualized, with 2 vCPUs at 2.27 GHz and 8 GB RAM
So I tried some queries and got some results, and then I tried the same queries in a basic MySQL with the same dataset in order to compare the results.
And MySQL is actually much faster than Hive. So I'm trying to understand why. I know I have some bad performance because of my hosts. My main question is: is my cluster well sized?
Do I need to add some DataNodes for this amount of data (which is not very enormous in my opinion)?
And if someone has tried queries with approximately the same architecture, you are welcome to share your results with me.
Thanks !
"I'm querying ~123 million rows with 21 columns [...] which is not very enormous in my opinion"
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small data sets like the one you're using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want real-time performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC, and a number of other optimizations that greatly improve response time.
Impala (part of CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
Edit:
"I'm saying that with 2 datanodes I get the same performance as with 3"
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...), it incurs all the costs that come with MapReduce. This cost is more or less constant regardless of the size of the data and the number of datanodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
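The fixed-overhead argument can be written as a one-line model: total time = startup cost + rows × cost per row. The sketch below uses made-up numbers (30 seconds of startup, one microsecond per row) purely to show the shape of the curve; they are not measurements from any cluster.

```python
def overhead_fraction(rows, startup_seconds=30.0, seconds_per_row=1e-6):
    """Share of total job time spent on fixed startup cost (hypothetical numbers)."""
    processing = rows * seconds_per_row
    return startup_seconds / (startup_seconds + processing)


for rows in (100, 123_000_000, 100_000_000_000):
    share = overhead_fraction(rows)
    print(f"{rows:>15,d} rows: {share:.1%} of the time is fixed overhead")
```

Adding datanodes only shrinks the per-row part, which is why a small dataset sees almost no change.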

How is sorting (ORDER BY) implemented in Hive?

We know that Hive doesn't do sampling before a sorting job starts. It just leverages the sorting mechanism of MapReduce, performs a merge sort on the reduce side, and uses only one reducer. Since the reducer collects all the data output by the mappers in this scenario, suppose the machine running the reducer has only 100 GB of disk: what happens if the data is too big to fit on the disk?
The parallel sorting mechanism of Hive is still under development, see here.
A well-designed data warehouse or database application will avoid such global sorting. If needed, try using Pig or TeraSort (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html)
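For reference, the idea behind TeraSort-style sorting is to sample the keys, pick range boundaries from the sample, and send each key range to its own reducer, so no single reducer has to hold everything. The plain-Python sketch below imitates that with in-memory lists; the function names, data sizes, and reducer count are made up, and it is not Hive's or Hadoop's actual code.

```python
import bisect
import random


def choose_split_points(keys, num_reducers, sample_size, rng):
    """Sample the input and pick num_reducers - 1 boundaries from the sorted sample."""
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def range_partition(keys, split_points):
    """Send each key to the reducer whose key range contains it."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        buckets[bisect.bisect_right(split_points, key)].append(key)
    return buckets


rng = random.Random(7)
data = [rng.randrange(1_000_000) for _ in range(50_000)]
buckets = range_partition(data, choose_split_points(data, 4, 1_000, rng))

# Each "reducer" sorts only its own range; concatenating the ranges in order
# yields a globally sorted result.
globally_sorted = [k for bucket in buckets for k in sorted(bucket)]
print([len(b) for b in buckets], globally_sorted == sorted(data))
```

Because the boundaries come from a sample, the buckets are only roughly equal in size, but each one is far smaller than the whole input.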
