Hive performance - hadoop

I work with Hive and I am new to it. I am facing some performance issues with a Hive query.
The number of mappers allocated to my job is very low even though hundreds of mappers are available. I have tried setting mapred.map.tasks=200, but the job only uses 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, why was the parameter mapred.map.tasks introduced?
Is there any resource that explains how Hive queries map to MapReduce jobs, i.e. where the different parts of the query are executed?

For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; the actual number of mappers is driven by the input splits, so setting it usually changes little on its own.
To see how Hive queries are executed, simply preface your query with explain. For example: explain select foo from bar;. If you need even more information, there's also explain extended.

I see this question was asked a long time ago; I'll try to answer it, even though some of the suggestions here were not available when the question was asked.
To optimize Hive performance:
Tune the number of mappers and reducers used by your Hive query; this can be done by tuning the input size per mapper (mapreduce.input.fileinputformat.split.maxsize) and the input size per reducer (hive.exec.reducers.bytes.per.reducer).
Bear in mind that "the more the better" is not always true, so you need to tune those numbers to your needs.
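For example (a sketch; the byte values are illustrative and depend on your data volumes):
SET mapreduce.input.fileinputformat.split.maxsize=67108864;   -- 64 MB per mapper => more, smaller splits
SET hive.exec.reducers.bytes.per.reducer=268435456;           -- 256 MB of input per reducer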
Optimize the joins: convert joins to map-joins if one of the tables is small enough (if possible).
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
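A sketch of the usual knobs for automatic map-join conversion (the threshold value here is illustrative; it is the size below which a table is broadcast to the mappers):
SET hive.auto.convert.join=true;                 -- let Hive rewrite eligible joins as map-joins
SET hive.mapjoin.smalltable.filesize=25000000;   -- ~25 MB small-table threshold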
Partition your table on columns that are often used in WHERE conditions.
For example, if you frequently run SELECT * FROM myTable WHERE someColumn = 'someValue', it is recommended to partition the table on the column someColumn.
This lets your query read only the partition directory someColumn=someValue instead of scanning all of the table's files.
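A minimal sketch of that layout (table and partition column names come from the example above; the data column is hypothetical):
CREATE TABLE myTable (
  someOtherColumn STRING               -- hypothetical data column
)
PARTITIONED BY (someColumn STRING);

-- only the someColumn='someValue' partition directory is read
SELECT * FROM myTable WHERE someColumn = 'someValue';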
Compressing the intermediate results may improve performance in some cases (depending on your hardware configuration, network and CPU / memory). This can be done by setting the property hive.intermediate.compression.codec.
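For example (a sketch pairing the codec property with the switch that turns intermediate compression on):
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;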
Choose the right compression codec for the output as well, for example Snappy (as here):
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Not available at the time the question was asked:
Use an optimized file format to store your table; instead of TextFile or SequenceFile, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
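A sketch of converting an existing table to ORC (the table names are illustrative):
CREATE TABLE myTable_orc STORED AS ORC
AS SELECT * FROM myTable;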
Use another engine to execute your queries: instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
For further optimization you could refer here

You can decrease 'mapreduce.input.fileinputformat.split.maxsize' to increase the number of mappers (more splits).

Related

How to find optimum Spark-Athena file size

I have a Spark job that writes to an S3 bucket, with an Athena table on top of this location.
The table is partitioned. Spark was writing a single 1 GB file per partition. We experimented with the maxRecordsPerFile option, writing only 500 MB of data per file; in that case we ended up with 2 files of 500 MB each.
This saved 15 minutes of runtime on EMR.
However, there was a problem with Athena: query CPU time started getting worse with the new file size.
I compared the same query over the same data before and after the change, and this is what I found:
Partition columns = source_system, execution_date, year_month_day
Query we tried:
select *
from dw.table
where source_system = 'SS1'
and year_month_day = '2022-09-14'
and product_vendor = 'PV1'
and execution_date = '2022-09-14'
and product_vendor_commission_amount is null
and order_confirmed_date is not null
and filter = 1
order by product_id
limit 100;
Execution time:
Before: 6.79s
After: 11.102s
Explain analyze showed that the new structure had to scan more data.
Before: CPU: 13.38s, Input: 2619584 rows (75.06MB), Data Scanned: 355.04MB; per task: std.dev.: 77434.54, Output: 18 rows (67.88kB)
After: CPU: 20.23s, Input: 2619586 rows (74.87MB), Data Scanned: 631.62MB; per task: std.dev.: 193849.09, Output: 18 rows (67.76kB)
Can you please explain why this takes double the time? What are the things to look out for? Is there a sweet spot for file size that would be optimal for the Spark and Athena combination?
One hypothesis is that pushdown filters are more effective with the single file strategy.
From AWS Big Data Blog's post titled Top 10 Performance Tuning Tips for Amazon Athena:
Parquet and ORC file formats both support predicate pushdown (also
called predicate filtering). Both formats have blocks of data that
represent column values. Each block holds statistics for the block,
such as max/min values. When a query is being run, these statistics
determine whether the block should be read or skipped depending on the
filter value used in the query. This helps reduce data scanned and
improves the query runtime. To use this capability, add more filters
in the query (for example, using a WHERE clause).
One way to optimize the number of blocks to be skipped is to identify
and sort by a commonly filtered column before writing your ORC or
Parquet files. This ensures that the range between the min and max of
values within the block are as small as possible within each block.
This gives it a better chance to be pruned and also reduces data
scanned further.
To test it, I would suggest another experiment if possible: change the Spark job to sort the data before persisting it into the two files, using the following order:
source_system, execution_date, year_month_day, product_vendor, product_vendor_commission_amount, order_confirmed_date, filter and product_id. Then check the query statistics.
At least the dataset would then be optimised for the presented use case; otherwise, change the order according to the heaviest queries.
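A minimal sketch of that experiment as Spark SQL (the source relation is hypothetical; SORT BY sorts rows within each task/output file, which is what tightens the Parquet row-group min/max statistics):
SELECT *
FROM dw_table_staging            -- hypothetical input of the Spark write job
SORT BY source_system, execution_date, year_month_day,
        product_vendor, product_vendor_commission_amount,
        order_confirmed_date, filter, product_id;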
The post also comments on optimal file sizes and gives a general rule of thumb. From my experience, Spark works well with file sizes between 128 MB and 2 GB. That range should also be fine for other query engines, such as the Presto engine used by Athena.
My suggestion would be to break year_month_day/execution_date (as they are used in most of the queries) into year, month and day partitions, which would reduce the amount of data scanned and allow more efficient filtering.
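A hedged sketch of that layout as Athena DDL (column names are taken from the query above, while the types and the S3 location are placeholders):
CREATE EXTERNAL TABLE dw.table_ymd (
  product_vendor                    string,
  product_vendor_commission_amount  double,
  order_confirmed_date              date,
  product_id                        bigint
)
PARTITIONED BY (source_system string, year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://example-bucket/path/';   -- placeholder location
Queries would then filter on year, month and day separately, so Athena can prune at each level.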

Reading a hadoop.hive.ql.io.HiveSequenceFileOutputFormat Hive table in Spark

I have a Hive table in Hadoop whose output format is
hadoop.hive.ql.io.HiveSequenceFileOutputFormat
I am reading this table using Spark SQL:
spark.sql('select * from testtable where y = 2021 and month = 12 and day = 12')
The Spark job runs very slowly. I have tried adjusting the number of executors and the memory per executor, but nothing seems to improve the performance. I read on a blog that SequenceFiles are not the best format for Hive tables.
Is there a better way of reading this table ?
Thanks in advance for any help.
You should consider partitioning your table by date if you will continue to access it regularly by date. (Lookups on the partition will be very fast, at the cost of queries that don't use partitions.)
You should also look into the "small files problem" in Hadoop; you can get a nice speedup by making files larger.
You should look at using Parquet or ORC. They compress nicely and often boost performance.
You should also look at computing table stats on the Hive table; this also helps the optimizer and improves performance.
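A sketch of the last two suggestions (the table name comes from the question; the ORC table name is illustrative):
-- one-off conversion to a columnar format
CREATE TABLE testtable_orc STORED AS ORC
AS SELECT * FROM testtable;

-- compute table and column statistics so the optimizer has something to work with
ANALYZE TABLE testtable_orc COMPUTE STATISTICS;
ANALYZE TABLE testtable_orc COMPUTE STATISTICS FOR COLUMNS;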

How to adjust the number of mappers and reducers with Tez to speed up a Hive process

I ran a process (word labeling of sentences) on large data (about 150 GB) using Tez, but the problem is that it took very long (a week or more), so
I tried to specify the number of mappers.
Although I set mapred.map.tasks=2000,
the number of mappers stays at about 150,
so I can't do what I want to do.
I specify the map value in the Oozie workflow file and use Tez.
How can I specify the number of mappers?
Ultimately I want to speed up the process; it is OK not to use Tez.
In addition, I would like to count the labeled sentences with a reducer, and that takes a lot of time too.
I would also like to know how to adjust the memory size used by each mapper and reducer process.
To manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used; for example,
set tez.grouping.split-count=4 will create 4 mappers.
https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
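A sketch of the relevant session settings (the values are illustrative; tez.grouping.split-count is only a hint, while the min/max grouping sizes bound how Tez groups the input splits):
SET hive.execution.engine=tez;
SET tez.grouping.split-count=2000;     -- hint: desired number of mappers
SET tez.grouping.min-size=16777216;    -- 16 MB lower bound per grouped split
SET tez.grouping.max-size=134217728;   -- 128 MB upper bound per grouped split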
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try and process data STORED AS TEXT in Hive. Convert it to ORC or Parquet first.
If Tez isn't working out for you, you can always try Spark. Also, labelling sentences is probably a Spark MLlib workflow you can find somewhere.

Increase efficiency of Sqoop export from HDFS

I am trying to export data using Sqoop from files stored in HDFS to Vertica. For around 10k rows the files get loaded within a few minutes, but when I try to run tens of millions (crores) of rows, only about 0.5% loads within 15 minutes or so. I have tried increasing the number of mappers, but it does nothing to improve efficiency. Even setting the chunk size to increase the number of mappers does not increase the number.
Please help.
Thanks!
As you are using batch export, try increasing the records per statement and records per transaction using the following properties:
sqoop.export.records.per.statement: aggregates multiple rows into a single INSERT statement.
sqoop.export.records.per.transaction: how many INSERT statements will be issued per transaction.
I hope this solves the issue.
Most MPP/RDBMS systems have Sqoop connectors that exploit parallelism and increase the efficiency of data transfer between HDFS and the database. It seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a Sqoop bug (https://issues.apache.org/jira/browse/SQOOP-2920): if the number of columns is very high (in the hundreds), Sqoop starts choking (very high CPU usage). When the number of fields is small, it's usually the other way around: Sqoop is idle while the RDBMS can't keep up.

Is there an alternative to HBaseStorage in Pig?

I am using HBaseStorage with the -caching option in a Pig script as follows:
HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount countDetails:ineffCount countDetails:totalCount', '-caching 1000');
I can see this is reflected in my job.xml,
but I see no difference in runtime. I am processing 10 million records and storing around 160 MB of data into HBase.
When I store the result in HDFS the job takes 3 minutes; the same job takes 30 minutes to store into HBase.
I even tried setting
SET hbase.client.scanner.caching 1000;
Please let me know how I can reduce the time.
Is there any alternative to HBaseStorage?
http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/
The above blog says that I have to set hbase.client.scanner.caching in a bootstrap script,
but I don't know how to do that.
Will it be enough if I set it in the HBase conf?
Please help me out.
hbase.client.scanner.caching is the number of rows that will be fetched when calling next on a scanner if it is not served from (local, client) memory.
Higher caching values enable faster scanners but eat up more memory, and some calls of next may take longer when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout,
i.e. hbase.regionserver.lease.period. This property is 1 minute by default; clients must
report in within this period or they are considered dead.
In my experience HBase doesn't perform very well with Pig. If you don't have a requirement for random look-ups then use only HDFS; otherwise an HBase MapReduce job would be the better option. Also, in a Hadoop MapReduce job you can connect to HBase directly (this option gave me the best performance).
