HIVE returning wrong date - hadoop

I'm getting some odd results from HIVE when working with dates.
For starters, I'm using Hive 1.2.1000.2.4.0.0-169
I have a table defined (snipped) of the sort:
hive> DESCRIBE proto_hourly;
OK
elem string
protocol string
count bigint
date_val date
hour_id tinyint
# Partition Information
# col_name data_type comment
date_val date
hour_id tinyint
Time taken: 0.336 seconds, Fetched: xx row(s)
hive>
OK, so I have data loaded for the current year. I started noticing some "weirdness" in queries with specific dates. As a pointed example, here's a pretty simple query where I'm just asking for '2016-06-01' but I get back '2016-05-31'. Why?
hive> SET i="2016-06-01";
hive> with uniq_dates AS (
> SELECT DISTINCT date_val as date_val
> FROM proto_hourly
> WHERE date_val = date(${hiveconf:i}) )
> select * from uniq_dates;
Query ID = hive_20160616154318_a75b3343-a2fe-41a5-b02a-d9cda8695c91
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1465936275203_0023)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.63 s
--------------------------------------------------------------------------------
OK
2016-05-31
Time taken: 6.738 seconds, Fetched: 1 row(s)
hive>

Testing this a bit more, I found that there was one server configured in a different timezone in the cluster. Two of the three nodes were UTC, but one node was still in America/Denver.
I believe what was happening was the Map/Reduce jobs were executing on the server in the different timezone thus giving me the weird data offset issue.
Midnight on 2016-06-01 UTC is indeed still 2016-05-31 in America/Denver.
Silent TZ conversion...
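To make the offset concrete, here is a minimal Python sketch (not Hive code) of the same arithmetic. It assumes the stored date is read as midnight UTC and that Denver is on MDT (UTC-6) in June; the helper name date_as_seen_in is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: America/Denver observes MDT (UTC-6) in June.
MDT = timezone(timedelta(hours=-6), "MDT")

def date_as_seen_in(date_str, tz):
    """Treat date_str as midnight UTC and return the calendar date in tz."""
    midnight_utc = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=UTC)
    return midnight_utc.astimezone(tz).date().isoformat()

print(date_as_seen_in("2016-06-01", UTC))  # 2016-06-01
print(date_as_seen_in("2016-06-01", MDT))  # 2016-05-31
```

If a node in another timezone does this kind of conversion, the date comes back one day early, which matches what I saw.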

Related

How to insert into table select function query with Hive

When I insert into a table from a SELECT that includes a function result, the function's value does not appear in the table. What should I do?
Function query result
hive> SELECT start_num,geoip(start_ip,'COUNTRY_CODE','/usr/local/hive/lib/GeoLite2-Country.mmdb') from geoip limit 3;
OK
17/05/24 18:02:15 INFO mapred.FileInputFormat: Total input files to process : 1
16778240 AU
16779264 CN
16781312 JP
Time taken: 0.129 seconds, Fetched: 3 row(s)
When I insert the results of the function query:
Query insert into table iptest2 SELECT start_num,geoip(start_ip,'COUNTRY_CODE','/usr/local/hive/lib/GeoLite2-Country.mmdb') from geoip limit3;
17/05/24 18:05:41 INFO mapred.FileInputFormat: Total input files to process : 2
16778240
16779264
16781312
Time taken: 0.115 seconds, Fetched: 3 row(s)
iptest2 table desc
hive> desc iptest2;
OK
17/05/25 09:26:28 INFO mapred.FileInputFormat: Total input files to process : 1
code string
ccode string
Time taken: 0.066 seconds, Fetched: 2 row(s)
GEOIP function UDF (Use the UDF function from the link below)
https://github.com/Spuul/hive-udfs/blob/master/src/main/java/com/spuul/hive/GeoIP2.java

how to achieve HiveQL error handling

I have multiple queries in a hql file (say 10, every query ending with ;) which I am running from a shell script.
When a query in between fails (say query #5), the queries after 5 do not execute, and the hive job is completed.
How can I do error handling to make sure that queries from 6 to 10 run even though query 5 fails?
Demo
myscript.sql
select 1;
select assert_true(false);
select 2;
Option 1
hive --hiveconf hive.cli.errors.ignore=true -f myscript.sql
OK
1
Time taken: 3.742 seconds, Fetched: 1 row(s)
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
Time taken: 0.264 seconds
OK
2
Time taken: 0.284 seconds, Fetched: 1 row(s)
Option 2
hive<myscript.sql
hive> select 1;
OK
1
Time taken: 3.181 seconds, Fetched: 1 row(s)
hive> select assert_true(false);
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
Time taken: 0.335 seconds
hive> select 2;
OK
2
Time taken: 0.225 seconds, Fetched: 1 row(s)
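If you need to know which queries failed and not just keep going, another approach is to drive the statements one at a time from a wrapper. Below is a rough Python sketch under several assumptions: the script is split naively on ";" (so no semicolons inside string literals), each statement runs in its own hive -e session (so SET values do not carry over between statements), and a non-zero exit code means the statement failed. The names split_statements, run_with_hive and run_all are made up for this example.

```python
import subprocess

def split_statements(script_text):
    """Naively split on ';'. Assumes no semicolons inside string literals."""
    return [s.strip() for s in script_text.split(";") if s.strip()]

def run_with_hive(statement):
    """Run one statement via `hive -e`; treat exit code 0 as success."""
    return subprocess.run(["hive", "-e", statement]).returncode == 0

def run_all(script_text, runner=run_with_hive):
    """Run every statement, continue past failures, return the failed ones."""
    failed = []
    for number, statement in enumerate(split_statements(script_text), start=1):
        if not runner(statement):
            failed.append((number, statement))
    return failed
```

Calling run_all on the contents of myscript.sql should return a list of (number, statement) pairs for the failed statements, so the shell script can log or retry them. Starting a new Hive session per statement is slow, so this only makes sense for a small number of queries.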

Hadoop Still Treats Commas as Delimiters after Explicitly Declaring a Different Character

I'm currently importing data into a hive table. When we created the table we used
CREATE EXTERNAL TABLE Customers
(
Code string,
Company string,
FirstName string,
LastName string,
DateOfBirth string,
PhoneNo string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
as there are commas in our data. However, we've now discovered that the commas are still being treated as field delimiters, as well as the | we're using to separate the fields. Is there any way to work around this? Do we have to escape every single comma in our data or is there an easier way to get it set up?
Example data
1|2|3|4
a|b|c|d
John|Joe|Bob, Jr|Alex
Which when put in the table appears as
1 2 3 4
a b c d
John Joe Bob Jr
With Jr occupying its own column and bumping Alex from the table.
It is working fine for me using your data. Hive version is 0.13
hive> create external table foo(
> first string,
> second string,
> third string,
> forth string)
> row format delimited fields terminated by '|' lines terminated by '\n';
OK
Time taken: 3.222 seconds
hive> load data inpath '/user/xilan/data.txt' overwrite into table foo;
hive> select third from foo;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1422157058628_0001, Tracking URL = http://host:8088/proxy/application_1422157058628_0001/
Kill Command = /scratch/xilan/hadoop/bin/hadoop job -kill job_1422157058628_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-27 07:05:41,901 Stage-1 map = 0%, reduce = 0%
2015-03-27 07:05:50,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1422157058628_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 245 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
3
c
Bob, Jr
Time taken: 18.853 seconds, Fetched: 3 row(s)
hive>
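As a quick sanity check outside Hive, this small Python sketch splits the sample rows on '|' only. It just mirrors what FIELDS TERMINATED BY '|' is expected to do and is not how Hive implements it; split_fields is a made-up helper name.

```python
def split_fields(line, delimiter="|"):
    """Split one text row on the declared delimiter only."""
    return line.rstrip("\n").split(delimiter)

sample_rows = ["1|2|3|4", "a|b|c|d", "John|Joe|Bob, Jr|Alex"]

for row in sample_rows:
    fields = split_fields(row)
    # The third field keeps its comma, as in the `select third` output above.
    print(len(fields), fields[2])
```

Every row comes out with four fields and "Bob, Jr" stays in one piece. If your table shows something different, it is worth checking whether the table was really created with the '|' delimiter or whether something else touched the file before the load.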

DSE 4.0.1: hive count different than cassandra count

We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying hive for the COUNT(1).
The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced anything similar?
It's probably the consistency setting on the HIVE table as per this document.
Change the hive query to "select count(*) from pageviews_v1 ;"
The issue appears to be with CLUSTERING ORDER BY. Removing that resolves the COUNT misreporting from Hive.

DataStax Hive: select count(*) on a table with only 3 records takes 1 hour. Why?

hive> select * from example;
OK
1 hello yang
2 hello bear
3 aaa
Time taken: 51.273 seconds -> that's ok !
hive> select count(key) from example;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201309170341_0001, Tracking URL = ...
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=10.10.5.153:8012 -kill job_201309170341_0001
Hadoop job information for Stage-1: number of mappers: 1537; number of reducers: 1
After waiting for 1 hour, I get the count: 3!
Why does it take so much time? And why is the number of mappers so large (1537)?
Do you have vnodes enabled? It looks like you do. We are working on supporting Hadoop on vnodes, but until that's done, it's recommended to disable vnodes for a Hadoop data center/cluster.

Resources