How to insert into a table from a SELECT that calls a UDF in Hive

When I run an INSERT INTO TABLE ... SELECT where the SELECT includes a function (UDF) result column, the function's value does not appear in the target table. What should I do?
Function query result
hive> SELECT start_num,geoip(start_ip,'COUNTRY_CODE','/usr/local/hive/lib/GeoLite2-Country.mmdb') from geoip limit 3;
OK
17/05/24 18:02:15 INFO mapred.FileInputFormat: Total input files to process : 1
16778240 AU
16779264 CN
16781312 JP
Time taken: 0.129 seconds, Fetched: 3 row(s)
When you insert the same function query's results into a table, the country-code column goes missing:
Query: insert into table iptest2 SELECT start_num,geoip(start_ip,'COUNTRY_CODE','/usr/local/hive/lib/GeoLite2-Country.mmdb') from geoip limit 3;
17/05/24 18:05:41 INFO mapred.FileInputFormat: Total input files to process : 2
16778240
16779264
16781312
Time taken: 0.115 seconds, Fetched: 3 row(s)
iptest2 table desc
hive> desc iptest2;
OK
17/05/25 09:26:28 INFO mapred.FileInputFormat: Total input files to process : 1
code string
ccode string
Time taken: 0.066 seconds, Fetched: 2 row(s)
GeoIP UDF (the geoip function used above comes from the link below):
https://github.com/Spuul/hive-udfs/blob/master/src/main/java/com/spuul/hive/GeoIP2.java
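One thing worth checking (an assumption, not stated in the question): a plain SELECT ... LIMIT can run as a local fetch task, while INSERT ... SELECT launches a distributed job, so the UDF jar and the GeoLite2 .mmdb file must exist at the same path on every worker node. A minimal sketch of registering the UDF before the insert, with a hypothetical jar path (the class name comes from the linked source file):
-- Hypothetical jar path; adjust to your build of the linked UDF
ADD JAR /usr/local/hive/lib/hive-udfs.jar;
CREATE TEMPORARY FUNCTION geoip AS 'com.spuul.hive.GeoIP2';
-- iptest2 has two string columns (code, ccode), so cast the numeric column explicitly
INSERT INTO TABLE iptest2
SELECT CAST(start_num AS STRING),
       geoip(start_ip, 'COUNTRY_CODE', '/usr/local/hive/lib/GeoLite2-Country.mmdb')
FROM geoip
LIMIT 3;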

Related

How to achieve HiveQL error handling

I have multiple queries in an hql file (say 10, each query ending with ;) which I am running from a shell script.
When a query in the middle fails (say query #5), the queries after it do not execute, and the hive job completes.
How can I do error handling to make sure that queries 6 to 10 run even though query 5 fails?
Demo
myscript.sql
select 1;
select assert_true(false);
select 2;
Option 1
hive --hiveconf hive.cli.errors.ignore=true -f myscript.sql
OK
1
Time taken: 3.742 seconds, Fetched: 1 row(s)
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
Time taken: 0.264 seconds
OK
2
Time taken: 0.284 seconds, Fetched: 1 row(s)
Option 2
hive < myscript.sql
hive> select 1;
OK
1
Time taken: 3.181 seconds, Fetched: 1 row(s)
hive> select assert_true(false);
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
Time taken: 0.335 seconds
hive> select 2;
OK
2
Time taken: 0.225 seconds, Fetched: 1 row(s)
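A third pattern (not from the answers above, just a common shell idiom) is to run each statement as its own hive invocation and check the exit code, so one failure cannot stop the rest; a minimal sketch using the demo's queries:
hive -e "select 1;" || echo "query 1 failed" >&2
hive -e "select assert_true(false);" || echo "query 2 failed" >&2
hive -e "select 2;" || echo "query 3 failed" >&2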

HIVE returning wrong date

I'm getting some odd results from HIVE when working with dates.
For starters, I'm using Hive 1.2.1000.2.4.0.0-169
I have a table defined (snipped) along these lines:
hive> DESCRIBE proto_hourly;
OK
elem string
protocol string
count bigint
date_val date
hour_id tinyint
# Partition Information
# col_name data_type comment
date_val date
hour_id tinyint
Time taken: 0.336 seconds, Fetched: xx row(s)
hive>
Ok, so I have data loaded for the current year. I started noticing some "weirdness" in queries with specific dates. For a pointed example, here's a pretty simple query where I'm just asking for '2016-06-01' but I get back '2016-05-31'. Why?
hive> SET i="2016-06-01";
hive> with uniq_dates AS (
> SELECT DISTINCT date_val as date_val
> FROM proto_hourly
> WHERE date_val = date(${hiveconf:i}) )
> select * from uniq_dates;
Query ID = hive_20160616154318_a75b3343-a2fe-41a5-b02a-d9cda8695c91
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1465936275203_0023)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.63 s
--------------------------------------------------------------------------------
OK
2016-05-31
Time taken: 6.738 seconds, Fetched: 1 row(s)
hive>
Testing this a bit more, I found that there was one server configured in a different timezone in the cluster. Two of the three nodes were UTC, but one node was still in America/Denver.
I believe what was happening was that the Map/Reduce tasks were executing on the server in the different timezone, thus giving me the weird date offset.
Date 2016-06-01 UTC does indeed equal Date 2016-05-31 America/Denver
Silent TZ conversion...
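To confirm this kind of problem, compare the system time zone across the nodes, and reproduce the conversion in Hive itself; a minimal sketch with illustrative host names (assumes SSH access):
# Print each node's time zone abbreviation
for h in node1 node2 node3; do echo -n "$h: "; ssh "$h" date +%Z; done
-- Denver is UTC-6 in June, so midnight UTC becomes the previous evening
hive> select from_utc_timestamp('2016-06-01 00:00:00', 'America/Denver');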

Hive Current date function

I want to get the current date in beeline.
I tried to use this:
FROM_UNIXTIME(UNIX_TIMESTAMP())
it outputs this:
16-03-21
What I was looking to get is:
2016-03-21 09:34
How do I do it? I see the beeline documentation here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
But it didn't work for me.
You can get it by passing the expected format as the second parameter of the from_unixtime function.
Example:
select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm');
Result:
2016-03-21 16:03
Try this:
Select to_date(from_unixtime(unix_timestamp())) from my table ...
Results in '2016-03-21'
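On Hive 1.2 and later, the built-in current_date and current_timestamp constants are another option (availability depends on the Hive version behind beeline):
select current_timestamp; -- full date and time, e.g. 2016-03-21 09:34:12.345
select current_date;      -- date only, e.g. 2016-03-21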
There are many date functions you can use in Hive (taken from http://atiblog.com/date-function-hive/):
1) from_unixtime:
This function converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a STRING that represents the TIMESTAMP of that moment in the current system time zone, in the format '1970-01-01 00:00:00'. The following example returns the current date including the time.
hive> SELECT FROM_UNIXTIME(UNIX_TIMESTAMP());
OK
2015-05-18 05:43:37
Time taken: 0.153 seconds, Fetched: 1 row(s)
2) from_utc_timestamp:
This function assumes that the string in the first expression is UTC and converts it to the time zone given in the second expression. This function and the to_utc_timestamp function do time zone conversions. In the following example, the first argument is a string.
hive> SELECT from_utc_timestamp('1970-01-01 07:00:00', 'JST');
OK
1970-01-01 16:00:00
Time taken: 0.148 seconds, Fetched: 1 row(s)
3) to_utc_timestamp:
This function assumes that the string in the first expression is in the time zone specified in the second expression, and converts the value to UTC. This function and the from_utc_timestamp function do time zone conversions.
hive> SELECT to_utc_timestamp('1970-01-01 00:00:00', 'America/Denver');
OK
1970-01-01 07:00:00
Time taken: 0.153 seconds, Fetched: 1 row(s)
4) unix_timestamp:
This function converts the date to the specified date format and returns the number of seconds between the specified date and the Unix epoch. If it fails, it returns 0. The following example returns the value 1237487400.
hive> SELECT unix_timestamp('2009-03-20', 'yyyy-MM-dd');
OK
1237487400
Time taken: 0.156 seconds, Fetched: 1 row(s)
5) unix_timestamp():
This function returns the current number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) using the default time zone.
hive> select UNIX_TIMESTAMP();
6) unix_timestamp(string date):
This function converts a date in the format 'yyyy-MM-dd HH:mm:ss' into a Unix timestamp, i.e. the number of seconds between the specified date and the Unix epoch. If it fails, it returns 0. (The results shown here assume a default time zone of UTC+05:30.)
hive> select UNIX_TIMESTAMP('2000-01-01 00:00:00');
OK
946665000
Time taken: 0.147 seconds, Fetched: 1 row(s)
7) unix_timestamp(string date, string pattern):
This function parses the date using the specified pattern and returns the number of seconds between that date and the Unix epoch. If it fails, it returns 0.
hive> select UNIX_TIMESTAMP('2000-01-01 10:20:30', 'yyyy-MM-dd');
OK
946665000
Time taken: 0.148 seconds, Fetched: 1 row(s)
8) from_unixtime(bigint number_of_seconds [, string format]):
This function converts the specified number of seconds since the Unix epoch to a date string, by default in the format 'yyyy-MM-dd HH:mm:ss'.
hive> SELECT FROM_UNIXTIME(UNIX_TIMESTAMP());
9) To_Date(string timestamp):
The TO_DATE function returns the date part of a timestamp string.
hive> select TO_DATE('2000-01-01 10:20:30');
OK
2000-01-01
10) WEEKOFYEAR(string date):
The WEEKOFYEAR function returns the week number of the date.
hive> SELECT WEEKOFYEAR('2000-03-01 10:20:30');
OK
9
11) DATEDIFF(string date1, string date2):
The DATEDIFF function returns the number of days between the two given dates.
hive> SELECT DATEDIFF('2000-03-01', '2000-01-10');
OK
51
Time taken: 0.156 seconds, Fetched: 1 row(s)
12) DATE_ADD(string date, int days):
The DATE_ADD function adds the given number of days to the specified date.
hive> SELECT DATE_ADD('2000-03-01', 5);
OK
2000-03-06
13) DATE_SUB(string date, int days):
The DATE_SUB function subtracts the given number of days from the specified date.
hive> SELECT DATE_SUB('2000-03-01', 5);
OK
2000-02-25
14) Date conversions: convert MMddyyyy format to a Unix timestamp.
Note: the month must be written as uppercase MM in the MMddyyyy pattern (lowercase mm means minutes).
select cast(substring(from_unixtime(unix_timestamp(dt, 'MMddyyyy')),1,10) as date) from sample;

Hadoop Still Treats Commas as Delimiters after Explicitly Declaring a Different Character

I'm currently importing data into a hive table. When we created the table we used
CREATE EXTERNAL TABLE Customers
(
Code string,
Company string,
FirstName string,
LastName string,
DateOfBirth string,
PhoneNo string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
as there are commas in our data. However, we've now discovered that the commas are still being treated as field delimiters, as well as the | we're using to separate the fields. Is there any way to work around this? Do we have to escape every single comma in our data or is there an easier way to get it set up?
Example data
1|2|3|4
a|b|c|d
John|Joe|Bob, Jr|Alex
Which when put in the table appears as
1 2 3 4
a b c d
John Joe Bob Jr
With Jr occupying its own column and bumping Alex from the table.
It is working fine for me using your data. Hive version is 0.13
hive> create external table foo(
> first string,
> second string,
> third string,
> forth string)
> row format delimited fields terminated by '|' lines terminated by '\n';
OK
Time taken: 3.222 seconds
hive> load data inpath '/user/xilan/data.txt' overwrite into table foo;
hive> select third from foo;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1422157058628_0001, Tracking URL = http://host:8088/proxy/application_1422157058628_0001/
Kill Command = /scratch/xilan/hadoop/bin/hadoop job -kill job_1422157058628_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-27 07:05:41,901 Stage-1 map = 0%, reduce = 0%
2015-03-27 07:05:50,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1422157058628_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 245 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
3
c
Bob, Jr
Time taken: 18.853 seconds, Fetched: 3 row(s)
hive>
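If the same data still splits on commas on your side, it is worth confirming what delimiter Hive actually recorded for the table; a quick check (added here, not part of the original answer):
hive> SHOW CREATE TABLE Customers;
-- the output should show '|' as the field delimiter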

DSE 4.0.1: hive count different than cassandra count

We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying hive for the COUNT(1).
The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced something similar?
It's probably the consistency setting on the HIVE table as per this document.
Change the hive query to "select count(*) from pageviews_v1 ;"
The issue appears to be with CLUSTERING ORDER BY. Removing that resolves the COUNT misreporting from Hive.
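If the consistency setting is the cause, DSE exposes it as a table property; a hypothetical sketch (the cassandra.consistency.level property name is taken from DataStax's DSE Hive documentation; verify it against your DSE version):
hive> ALTER TABLE pageviews_v1 SET TBLPROPERTIES ('cassandra.consistency.level' = 'QUORUM');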
