Optimizing the Hive query: Apache Hive - Hadoop

The following Hive query finds the lead and lag on a single column. The query spawns 1 mapper and 50 reducers. How can I optimize the query to spawn fewer reducers?
Table description:
# col_name data_type comment
a int
Data in table:
select * from foo;
OK
foo.a
1
2
3
4
5
6
3
4
6
78
9
7
NULL
select lag(a,1) over (order by a) as prev, lead(a,1) over (order by a) as next from foo;
Query ID = phodisvc_20170403015502_de129135-eb19-4c4d-8161-c3f217a45928
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 50
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489146839620_136214
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 50
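To make the shape of this computation concrete, here is a minimal Python sketch (not Hive code) of what lag(a,1) and lead(a,1) return over a single global ordering of the sample data. The function name lag_lead is hypothetical, and placing NULL first in ascending order is an assumption of the sketch. Each row only looks at its immediate neighbours in one sorted sequence, and the window has no PARTITION BY, so as far as I understand there is nothing to spread across many reducers. That is why the "set mapreduce.job.reduces=<number>" hint printed in the log looks like the natural knob to try.

```python
def lag_lead(values):
    """Return (value, lag, lead) for every row of one globally ordered sequence."""
    # Assumption of this sketch: NULL (None) sorts before every number.
    ordered = sorted(values, key=lambda v: (v is not None, v if v is not None else 0))
    rows = []
    for i, value in enumerate(ordered):
        # lag(a, 1): value of the preceding row, NULL for the first row
        lag = ordered[i - 1] if i > 0 else None
        # lead(a, 1): value of the following row, NULL for the last row
        lead = ordered[i + 1] if i + 1 < len(ordered) else None
        rows.append((value, lag, lead))
    return rows


# Sample data from the question (table foo, column a)
data = [1, 2, 3, 4, 5, 6, 3, 4, 6, 78, 9, 7, None]

for value, lag, lead in lag_lead(data):
    print(value, lag, lead)
```

Note that lag is the previous value and lead is the next value in the ordering.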

Related

select count(*) failing on hive / hadoop on windows

hive> select count(*) from test_db.cust;
Query ID = EMMAdmin_20200106222630_32064e30-7ae6-4e0a-bf1b-b0979e297102
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Kill Command = C:\hadoop-2.10.0\bin\mapred job -kill job_1578366054556_0003
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2020-01-06 22:26:40,010 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1578366054556_0003 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
hive>
Error on the Hadoop admin page:
Application application_1578366054556_0001 failed 2 times due to AM Container for appattempt_1578366054556_0001_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2020-01-06 22:08:40.708]The command line has a length of 12995 exceeds maximum allowed length of 8191. Command starts with: "#set HADOOP_CLASSPATH=%PWD%;job.jar/*;job.jar/classes/;job.jar/lib/*;%PWD%/*;C:\apache-hive-3.0.0-bi "
Normal select queries work fine:
hive> select country from test_db.cust where first_name like 'S%';
OK
USA
USA
USA
USA
USA
USA
USA
USA
Time taken: 0.229 seconds, Fetched: 8 row(s)
hive>
The problem is from commander.
You could use the following syntax to write the result to a file:
hive -e "your query" > ~/sample_output.txt

Why mapred.min.split.size doesn't change the number of mappers for my query

I was doing a deep dive with a simple use case to see how we can control the number of mappers launched in Hive.
This is what I did:
Step 1: Found out the block size of the system.
hdfs getconf -confKey dfs.blocksize
Output: 134217728 => 128 megabytes (MB)
Step 2: Placed a file of 392781672 bytes (392 MB) in HDFS and created a table on top of it.
Step 3: Ran a simple count (select count(1) from table), which triggered:
Mappers: 3
Reducers: 1
which is as expected.
Step 4: Changed the settings:
set mapred.min.split.size = 392503151
set mapred.max.split.size = 392503000
Step 5: Ran select count(1) from table again, and it still triggers 3 mappers and 1 reducer.
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
Question: I would expect this to run only 1 mapper, since the file size and my min/max split sizes are now roughly the same. Why does it not follow this principle here?
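As a sanity check on the expectation, the arithmetic can be sketched in Python. The sketch assumes the split-size rule of Hadoop's new-API FileInputFormat, max(minSize, min(maxSize, blockSize)), with its 10% slop on the last split, and a splittable file. It does not model Hive's default CombineHiveInputFormat, which combines splits by its own rules. The function names are hypothetical.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the split size


def compute_split_size(block_size, min_size, max_size):
    """Split-size rule assumed here: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Count the splits of one splittable file under the rule above."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail becomes the last split
    return splits


FILE_SIZE = 392781672   # bytes, from Step 2
BLOCK_SIZE = 134217728  # bytes, from Step 1

# Defaults: minimum of 1 byte, effectively unbounded maximum
default_size = compute_split_size(BLOCK_SIZE, 1, 2**63 - 1)
print(count_splits(FILE_SIZE, default_size))

# Values from Step 4 (note that the minimum is larger than the maximum)
tuned_size = compute_split_size(BLOCK_SIZE, 392503151, 392503000)
print(count_splits(FILE_SIZE, tuned_size))
```

Under these assumptions the defaults give 3 splits and the Step 4 values give 1, so the expectation matches the plain FileInputFormat rule. The observed 3 mappers therefore suggest that the settings are not being applied by the input format actually in use, which this sketch cannot confirm.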

Hive partition query is scanning all partitions

When I write the Hive query as below:
select count(*)
from order
where order_month >= '2016-11';
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
I get only 5 mappers, which means only the required partitions (2016-11 and 2016-12) are read.
When I write the same query using functions:
select count(*)
from order
where order_month >= concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)));
Note:
concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)))
= '2016-11'
Hadoop job information for Stage-1: number of mappers: 216; number of reducers: 1
This time it reads all partitions (i.e. 2004-10 to 2016-12).
How can I modify the query to read only the required partitions?
The unix_timestamp() function is non-deterministic and prevents proper optimization of queries; it has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP and CURRENT_DATE.
Use current_date instead; there is also no need to calculate the year and month separately:
where order_month >= substr(date_sub(current_date, 10),1,7)
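Apart from partition pruning, the substr form also keeps the month zero-padded, which matters because order_month is compared as a string. The Python sketch below mimics both expressions for illustration only; the function names are hypothetical, and it assumes that month() yields an integer, so concat produces '2017-1' rather than '2017-01' for January.

```python
from datetime import date, timedelta


def bound_concat(today, days=10):
    """Mimics concat(year(d), '-', month(d)): the month is not zero-padded."""
    d = today - timedelta(days=days)
    return f"{d.year}-{d.month}"


def bound_substr(today, days=10):
    """Mimics substr(date_sub(current_date, 10), 1, 7): first 7 chars of yyyy-MM-dd."""
    d = today - timedelta(days=days)
    return d.isoformat()[:7]


# Both forms agree for a two-digit month
print(bound_concat(date(2016, 11, 20)), bound_substr(date(2016, 11, 20)))

# They differ for a single-digit month
print(bound_concat(date(2017, 1, 15)), bound_substr(date(2017, 1, 15)))

# String comparison against a partition value such as '2017-02'
print("2017-02" >= bound_concat(date(2017, 1, 15)))
print("2017-02" >= bound_substr(date(2017, 1, 15)))
```

With the unpadded bound, a partition such as '2017-02' compares as smaller than '2017-1' and would be filtered out, so the concat form only happens to work for months 10 to 12.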

DSE 4.0.1: hive count different than cassandra count

We're running DataStax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying Hive for the COUNT(1).
The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced something similar?
It's probably the consistency setting on the Hive table as per this document.
Change the Hive query to "select count(*) from pageviews_v1;"
The issue appears to be with CLUSTERING ORDER BY. Removing that resolves the COUNT misreporting from Hive.

DataStax Hive: select count(*) on a table with only 3 records takes 1 hour to count. Why?

hive> select * from example;
OK
1 hello yang
2 hello bear
3 aaa
Time taken: 51.273 seconds -> that's ok !
hive> select count(key) from example;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201309170341_0001, Tracking URL = ...
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=10.10.5.153:8012 -kill job_201309170341_0001
Hadoop job information for Stage-1: number of mappers: 1537; number of reducers: 1
After waiting for 1 hour, I get the count: 3!
Why does it take so much time? And why is the number of mappers so large (1537)?
Have you enabled vnodes? It looks like you have. We are working on supporting Hadoop on vnodes, but until that is done, it is recommended to disable vnodes for a Hadoop data center/cluster.
