Hive partition query is scanning all partitions - hadoop

When I write the Hive query like below:
select count(*)
from order
where order_month >= '2016-11';
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
I get only 5 mappers, which means it reads only the required partitions (2016-11 and 2016-12).
When I write the same query using functions:
select count(*)
from order
where order_month >= concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)));
Note:
concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)))
= '2016-11'
Hadoop job information for Stage-1: number of mappers: 216; number of reducers: 1
This time it reads all partitions (i.e. 2004-10 to 2016-12).
How can I modify the query to read only the required partitions?

The unix_timestamp() function is non-deterministic and prevents proper optimization of queries; it has been deprecated since Hive 2.0 in favour of CURRENT_TIMESTAMP and CURRENT_DATE.
Use current_date; there is also no need to calculate year and month separately:
where order_month >= substr(date_sub(current_date, 10),1,7)
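Putting it together, the whole query becomes (a sketch against the same order table, keeping the 10-day offset from the original predicate):

select count(*)
from order
where order_month >= substr(date_sub(current_date, 10), 1, 7);

Because current_date is a constant for the lifetime of the query, the filter can be evaluated at planning time and the partition pruner again selects only the required partitions.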

Related

Presto for loop

I am new to Presto and I would like to know if there is any way to have a for loop. I have a query that aggregates some data date by date, and when I run it, it throws an error: exceeded max memory size of 30GB.
I can use other suggestions if looping is not an option.
The query I am using:
select dt as DATE_KPI, brand,
       count(distinct concat(cast(post_visid_high as varchar),
                             cast(post_visid_low as varchar))) as kpi_value
from hive.adobe.tbl
where dt >= date '2017-05-15' and dt <= date '2017-06-13'
group by 1, 2
Assuming you are using Hive, you can write the source data to a table bucketed on brand, and then process groups of buckets with WHERE "$bucket" % 32 = <N>, as sketched below.
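A sketch of that approach with the Presto Hive connector (the table name tbl_bucketed and the bucket count of 256 are illustrative; with 256 buckets, "$bucket" % 32 = <N> processes 8 buckets per run):

CREATE TABLE hive.adobe.tbl_bucketed
WITH (bucketed_by = ARRAY['brand'], bucket_count = 256)
AS
SELECT * FROM hive.adobe.tbl;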
Otherwise, you can fragment the query into n queries and then process 1/n of the brands in each query, using WHERE abs(from_big_endian_64(xxhash64(to_utf8(brand)))) % 32 = <N> to bucketize the brands.
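For example, one fragment of the original query (here N = 0; run it once for each N from 0 to 31, and since every brand hashes into exactly one fragment, the per-fragment results can simply be concatenated):

select dt as DATE_KPI, brand,
       count(distinct concat(cast(post_visid_high as varchar),
                             cast(post_visid_low as varchar))) as kpi_value
from hive.adobe.tbl
where dt >= date '2017-05-15' and dt <= date '2017-06-13'
  and abs(from_big_endian_64(xxhash64(to_utf8(brand)))) % 32 = 0
group by 1, 2;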

Optimizing the Hive query: Apache Hive

The following Hive query finds the lead and lag on a single column. The query spawns 1 mapper and 50 reducers. How can I optimize the query to spawn fewer reducers?
Table description:
col_name    data_type    comment
a           int
Data in table:
select * from foo;
OK
foo.a: 1 2 3 4 5 6 3 4 6 78 9 7 NULL
select lag(a,1) over (order by a) as prev, lead(a,1) over (order by a) as next from foo;
Query ID = phodisvc_20170403015502_de129135-eb19-4c4d-8161-c3f217a45928
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 50
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489146839620_136214
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 50
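The log itself points at the knob: the 50 reducers come from the jobconf default, not from the data size. A minimal sketch of pinning the reducer count before the query (for a table this small one reducer is plenty, and the global ORDER BY in the window funnels all rows through a single reducer anyway):

set mapreduce.job.reduces=1;
select lag(a,1) over (order by a) as prev,
       lead(a,1) over (order by a) as next
from foo;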

Hive Count(DISTINCT column) versus SELECT COUNT(*) from (SELECT DISTINCT column)

There have been discussions and claims that Query 2 is faster than Query 1.
Query 1
SELECT COUNT(DISTINCT A) FROM TAB_X;
Query 2
SELECT COUNT(*) FROM (SELECT DISTINCT A FROM TAB_X) t;
I fail to understand exactly why this is so.
This is my understanding of how these queries would be converted to MapReduce behind the scenes.
Query 1
- Only one stage.
- The mappers emit column A as the key and 1 as the value. (Is this correct? How is DISTINCT achieved?)
- There would be only one reducer, which would just increment a counter for every key and the list of values it gets. However, I am not sure how that single reducer knows when to emit the final count (how does it know when to emit eventually?).
Query 2
- Two stages.
- Stage 1
  - The mappers emit column A as the key and 1 as the value.
  - There will be a lot of reducers, each of which can aggregate the results for its keys and emit the distinct values of column A.
- Stage 2
  - The mappers read the distinct values and emit the same key for all of them, with 1 as the value.
  - The reducer just sums these counts and emits the final result.
Can you please help answer my questions inline for Query 1 and confirm my understanding of Query 2?
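One way to check the claim concretely is to compare the plans Hive generates for the two forms (EXPLAIN is standard HiveQL; TAB_X and A are the names from the question):

EXPLAIN SELECT COUNT(DISTINCT A) FROM TAB_X;
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT A FROM TAB_X) t;

In the classic MapReduce picture sketched above, the first form ships every mapper's output to a single reducer that must both deduplicate and count, while the second form spreads the DISTINCT work across many reducers and leaves only the cheap final sum to a single reducer.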

DSE 4.0.1: Hive count different from Cassandra count

We're running DataStax Enterprise 4.0.1 and are running into a very strange issue when inserting rows into Cassandra and then querying Hive for the COUNT(1).
The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
    website text,
    date text,
    created timestamp,
    browser_id text,
    ip text,
    referer text,
    user_agent text,
    PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
    bloom_filter_fp_chance=0.001000 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=864000 AND
    index_interval=128 AND
    read_repair_chance=1.000000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    default_time_to_live=0 AND
    speculative_retry='NONE' AND
    memtable_flush_period_in_ms=0 AND
    compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
    compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
    website string COMMENT 'from deserializer',
    date string COMMENT 'from deserializer',
    created timestamp COMMENT 'from deserializer',
    browser_id string COMMENT 'from deserializer',
    ip string COMMENT 'from deserializer',
    referer string COMMENT 'from deserializer',
    user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
    'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
    'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
    'serialization.format'='1',
    'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
    'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
    'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
    'cassandra.ks.name'='pageviews',
    'cassandra.cf.name'='pageviews_v1',
    'auto_created'='true')
Has anyone else experienced something similar?
It's probably the consistency setting on the Hive table, as per this document.
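If that is the cause, the fix would be to raise the read consistency on the external table. A hedged sketch follows; the property name cassandra.consistency.level is an assumption based on the DSE documentation of that era, so verify it against your release:

ALTER TABLE pageviews_v1 SET TBLPROPERTIES ('cassandra.consistency.level' = 'QUORUM');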
Change the Hive query to "select count(*) from pageviews_v1;".
The issue appears to be with CLUSTERING ORDER BY. Removing that resolves the COUNT misreporting from Hive.

DataStax Hive: select count(*) on a table with only 3 records takes 1 hour to count(), why?

hive> select * from example;
OK
1 hello yang
2 hello bear
3 aaa
Time taken: 51.273 seconds -> that's OK!
hive> select count(key) from example;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201309170341_0001, Tracking URL = ...
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=10.10.5.153:8012 -kill job_201309170341_0001
Hadoop job information for Stage-1: number of mappers: 1537; number of reducers: 1
Then after waiting for 1 hour, I get the count: 3!
Why does it take so much time, and why are there so many mappers (1537)?
Did you enable vnodes? It looks like you did. We are working on supporting Hadoop on vnodes, but until that is done, it is recommended to disable them for a Hadoop data center/cluster.
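A minimal sketch of disabling vnodes in cassandra.yaml on the Hadoop/analytics nodes (the token value is a placeholder; each node needs its own distinct initial_token, and moving an existing node off vnodes requires a careful migration):

# cassandra.yaml on each Hadoop/analytics node
# num_tokens: 256          <- comment out to stop using vnodes
initial_token: 0           # placeholder; compute one distinct token per node

With one token per node, Hive gets roughly one input split per node's token range instead of one per vnode, which is consistent with the 1537 mappers seen above collapsing to a handful.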
