Fail to Increase Hive Mapper Tasks?

I have a managed Hive table, which contains only one 150MB file. I then run "select count(*) from tbl" on it, and it uses 2 mappers. I want to set it to a bigger number.
First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19 mappers, but it only uses 3. Somehow it still splits the input by 64MB. I also tried 'set dfs.block.size=8388608;', which doesn't work either.
Then I tried a vanilla map-reduce job doing the same thing: it initially uses 3 mappers, and when I set mapred.max.split.size, it uses 19. So the problem lies in Hive, I suppose.
I read some of the Hive source code, like CombineHiveInputFormat, ExecDriver, etc., but couldn't find a clue.
What other settings can I use?

I combined @javadba's answer with what I received from the Hive mailing list; here's the solution:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks = 20;
select count(*) from dw_stage.st_dw_marketing_touch_pi_metrics_basic;
From the mailing list:
It seems that Hive is using the old Hadoop MapReduce API, so mapred.max.split.size won't work.
I will dig into the source code later.

Try adding the following:
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
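For reference, a hedged sketch putting both answers together against the question's table (tbl, the value 20, and the 8388608-byte split size all come from the question; whether every one of these settings is needed depends on the Hadoop/Hive version):
set hive.merge.mapfiles=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=20;
set mapred.max.split.size=8388608;
select count(*) from tbl;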

Related

How does the CONCATENATE in ALTER TABLE command in Hive work?

I am trying to understand how exactly ALTER TABLE CONCATENATE in Hive works.
I saw this link How does Hive 'alter table <table name> concatenate' work? but all I got from it is that for ORC files the merge happens at the stripe level.
I am looking for a detailed explanation of how CONCATENATE works. For example, I initially had 500 small ORC files in HDFS. I ran the Hive ALTER TABLE CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12). So I wanted to understand:
How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?
Are there any known issues with using CONCATENATE? We are planning to run the concatenation once a day in the maintenance window.
Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.
Any help is appreciated; thanks in advance.
Concatenated file size can be controlled with the following two settings:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
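As an illustration, a hedged sketch of a concatenation run with these settings applied first (the table and partition names are hypothetical; 268435456 bytes is 256MB):
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
alter table my_orc_table partition (dt='2020-01-01') concatenate;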
As commented by @leftjoin, it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread, but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.

How to set Hive reducers when the number of reducers is always 0

I am trying to load data into Hive RCFile and ORC tables, but the number of reducers is always 0. I tried to set the reducers in Hive with set mapred.reducer.tasks=1, but it does not work. I found on the internet that the default size per reducer is 1G, so I tried loading 3G of data so there would be at least 2 reducers. What do I have to do to get a reduce stage to run?
I would need more information about the query to know for sure, but my guess is that the query you are running is a map-only job, thus not requiring any reducers. You can add a DISTRIBUTE BY clause to force Hadoop to use reducers. For example,
SELECT txn_id FROM table;
will be a map only job. You can force Hive to add a reduce step by adding this clause.
SELECT txn_id FROM table
DISTRIBUTE BY txn_id;
Try
set mapred.reduce.tasks=99;
set hive.exec.reducers.max=99;
However it is likely that your tasks do not require a reducer.
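For comparison, a query that aggregates does require a reduce stage, so it is a quick way to confirm that reducers run at all (a minimal sketch reusing the table from the example above):
SELECT txn_id, count(*)
FROM table
GROUP BY txn_id;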

Why are my avro output files so small and so numerous in my pig job?

I'm running a Pig script that does a series of joins and writes using AvroStorage().
All is running well, and I am getting the data that I want... but it is being written to 845 Avro files (~30kb each). This does not seem right at all... but I cannot find any setting that I may have changed to go from my previous output of 1 large Avro file to 845 small ones (except adding another data source).
Would that change anything? And how can I get it back to one or two files?
Thanks!
A possibility is to change your block size. If you want to go back to fewer files, you can also try Parquet: transform your .avro files through a Pig script and store them as .parquet files; this will reduce your 845 files to fewer, larger ones.
But it isn't necessary to get back to fewer files except for a performance advantage.
The number of files written by an MR job is defined by the number of reducers that ran. You can use PARALLEL in the Pig script to control the number of reducers.
If you are sure that the final data is small enough (comparable to your block size), you can add PARALLEL 1 to your JOIN statement to make sure that the JOIN is translated into 1 reducer and thus writes output to only 1 file.
I solved that using SET pig.maxCombinedSplitSize 134217728;
With SET default_parallel 10; alone, it may still output many small files depending on the Pig job.

Hive performance

I work on Hive and I am new to it. I am facing some performance issues with Hive queries.
The number of mappers allocated to my job is very low even though there are hundreds of mappers available. I have tried setting mapred.map.tasks=200, but it takes only 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the mappers? If not, then why was the parameter (mapred.map.tasks) introduced?
Is there any resource where I can learn how Hive queries correlate to map-reduce jobs, i.e. where the different parts of the query are executed?
For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; it doesn't really control anything usually.
To see how Hive queries are executed, simply preface your query with explain. For example: explain select foo from bar;. If you need even more information, there's also explain extended.
I see this question was asked a long time ago; I'll try to answer it even though some of the suggestions here were not available at the time the question was asked.
To optimize Hive performance:
Tune the number of mappers and reducers used by your Hive query; this can be done by tuning the input size for each mapper (mapreduce.input.fileinputformat.split.maxsize) and the input size for each reducer (hive.exec.reducers.bytes.per.reducer).
Bear in mind that "the more the better" is not always true, so you need to tune those numbers to your needs.
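A minimal sketch of that tuning in a Hive session (the byte values, roughly 64MB per mapper and 256MB per reducer, are assumptions for illustration only):
set mapreduce.input.fileinputformat.split.maxsize=67108864;
set hive.exec.reducers.bytes.per.reducer=268435456;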
Optimize the joins: convert joins to map-joins if one of the tables is small enough (when possible).
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
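A hedged sketch of turning on automatic map-join conversion (the size threshold here is an assumption, matching Hive's usual 25MB default; tables below it are loaded into memory on the mappers):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;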
Partition your table on columns that are often used in conditions (WHERE).
For example, if you frequently run SELECT * from myTable WHERE someColumn = 'someValue', it is recommended to partition your table on the column 'someColumn'.
This lets your query scan just the partition files under someColumn=somePartition instead of scanning the whole table's files.
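A minimal sketch of such a partitioned table, reusing someColumn from the example (the table and other column names are hypothetical):
create table myTable_partitioned (
  id      bigint,
  payload string
)
partitioned by (someColumn string);
-- a filter on the partition column reads only the matching partition directory
select * from myTable_partitioned where someColumn = 'someValue';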
Compressing the intermediate results may enhance the performance in some cases (depending on your hardware configuration, network and CPU / memory). This could be done by setting the property: hive.intermediate.compression.codec
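For instance, a hedged sketch (pairing the codec property with hive.exec.compress.intermediate, and assuming Snappy as the codec):
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;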
Choose the right compression codec, for example Snappy (as shown here):
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Not available at the time of the question:
Use an optimized file format to store your table; instead of Text File or Sequence File, you could use ORC (Hive 0.11+) for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
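For example, a minimal sketch of an ORC-backed table (names are hypothetical):
create table my_table_orc (
  id   bigint,
  name string
)
stored as orc;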
Use another engine to execute your queries; instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
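The engine can also be switched per session instead of in the configuration file (a hedged sketch; it assumes Tez is installed on the cluster):
set hive.execution.engine=tez;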
For further optimization you could refer here
You can decrease 'mapreduce.input.fileinputformat.split.maxsize' to increase the number of mappers (more splits).
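For example (the 32MB value is only an illustration; a smaller maximum split size produces more splits and therefore more mappers):
set mapreduce.input.fileinputformat.split.maxsize=33554432;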

What's wrong with my Hive UDF? How to set the number of maps in Hive?

I use Hadoop/Hive to analyse Apache logs and compute statistics of access features. I wrote a UDF named GetCity to convert remote_ip to a city name, but when I run "select GetCity(remote_ip) from log_pre;", it's very slow, and it even fails when the data is too large (more than 1000 items).
I tried to set mapred.reduce.tasks=10, but the jobtracker shows that the total number of maps is still 1. How can I get more maps for the select?
When performing a query like this, the "GetCity(remote_ip)" call always happens in the mapper. In fact, I doubt there is anything going on in the reducer here except for maybe file concatenation. You can control the number of map tasks from Hive by calling:
SET mapred.map.tasks=10;
Hope this helps,
synctree
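Putting that together with the original query from the question, a minimal sketch (if the input is a single small file, the split-size and input-format settings discussed in the first question above may also be needed before the mapper count actually changes):
SET mapred.map.tasks=10;
select GetCity(remote_ip) from log_pre;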
