sum() function gives wrong answer in hiveql - hadoop

I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using the SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the CSV file again. I tried different data types like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely, because select count(*) from cause; returns the correct answer of 1016.
P.S.: I am new to Hadoop, Hive and big data in general.
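One plausible explanation worth checking, since count(*) is right while SUM() is low (hedged; the linked file can't be inspected here): ROW FORMAT DELIMITED with FIELDS TERMINATED BY ',' splits on every comma and knows nothing about CSV quoting, so a quoted value such as "1,407" gets split at its embedded comma and the value column ends up truncated or NULL for exactly those rows. OpenCSVSerde understands quoting; a minimal sketch (the table name cause_csv is illustrative):
create table cause_csv (year STRING, sex STRING, cause STRING, value STRING)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile
tblproperties ("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause_csv;
-- OpenCSVSerde exposes every column as STRING, so strip any
-- thousands separator and cast before aggregating:
select sum(cast(regexp_replace(value, ',', '') as int)) from cause_csv;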

Related

Load string data that does not have quotes to Hive

I'm trying to load some test data into a simple Hive table. The data is comma-separated, but the individual elements are not enclosed in double quotes, and I'm getting an error because of this. How do I tell Hive not to expect varchar fields to be enclosed in quotes? Manually adding quotes to the varchar fields is not an option, since the input file I'm trying to use has thousands of records. Sample query and data below.
create table mydatabase.flights(FlightDate varchar(10),Airline int,FlightNum int,Origin varchar(4),Destination varchar(4),Departure varchar(4),DepDelay double,Arrival varchar(4),ArrivalDelay double,Airtime double,Distance double) row format delimited;
insert into mydatabase.flights(FlightDate,Airline,FlightNum,Origin,Destination,Departure,DepDelay,Arrival,ArrivalDelay,Airtime,Distance)
values(2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00);
The insert query above gives me an error message. It works fine if I enclose the varchar fields in quotes.
Error while compiling statement: FAILED: ParseException line 11:11 mismatched input '-' expecting ) near '2014' in value row constructor
I'm loading the data using the following query:
load data inpath '/user/alpsusa/hive/flights.csv' overwrite into table mydatabase.flights;
After the load, I see only the first field populated; the rest are all NULL.
Sample data
2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00
2014-04-01,19805,2,LAX,JFK,0944,14.00,1736,-29.00,269.00,2475.00
2014-04-01,19805,3,JFK,LAX,1224,-6.00,1614,39.00,371.00,2475.00
2014-04-01,19805,4,LAX,JFK,1240,25.00,2028,-27.00,264.00,2475.00
2014-04-01,19805,5,DFW,HNL,1300,-5.00,1650,15.00,510.00,3784.00
The output of DESCRIBE FORMATTED was posted as a screenshot.
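Going just by the DDL above (a guess, not a confirmed diagnosis): row format delimited without a fields terminated by clause leaves Hive on its default \001 (Ctrl-A) field delimiter, so each comma-separated line lands whole in the first column and every other column reads as NULL, which matches the symptom described. Quoting is not the problem for LOAD DATA; unquoted CSV loads fine once the delimiter is declared. A sketch with the comma spelled out:
create table mydatabase.flights (
  FlightDate varchar(10), Airline int, FlightNum int,
  Origin varchar(4), Destination varchar(4), Departure varchar(4),
  DepDelay double, Arrival varchar(4), ArrivalDelay double,
  Airtime double, Distance double)
row format delimited
fields terminated by ','
stored as textfile;
load data inpath '/user/alpsusa/hive/flights.csv' overwrite into table mydatabase.flights;
The INSERT ... VALUES error is a separate matter: in a VALUES list, string literals do have to be quoted ('JFK', not JFK); it is only inside a data file read via LOAD DATA that quotes are unnecessary.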

Using SQL reserved words in Hive when creating external temporary table

I need to create an external Hive table over an HDFS location where one column in the files has a reserved name (end).
When running the script I get the error:
"cannot recognize input near 'end' 'STRUCT' '<' in column specification"
I found 2 solutions.
The first one is to set hive.support.sql11.reserved.keywords=false, but this option has been removed.
https://issues.apache.org/jira/browse/HIVE-14872
The second solution is to use quoted identifiers (wrapping the column name in backticks).
But in this case I get the error:
"org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('c' (code 99)): was expecting comma to separate OBJECT entries"
This is my code for table creation:
CREATE TEMPORARY EXTERNAL TABLE ${tmp_db}.${tmp_table}
(
id STRING,
email STRUCT<string:STRING>,
start STRUCT<long:BIGINT>,
end STRUCT<long:BIGINT>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '${input_dir}';
It's not possible to rename the column.
Does anybody know the solution for this problem? Or maybe any ideas?
Thanks a lot in advance!
Can you try the below?
hive> set hive.support.quoted.identifiers=column;
hive> create temporary table sp_char ( `#` int, `end` string);
OK
Time taken: 0.123 seconds
OK
Time taken: 0.362 seconds
hive>
When you set the Hive property hive.support.quoted.identifiers=column, everything within backticks is treated as a literal column identifier.
The other value for this property is none; when it is set to none, a backticked name is interpreted as a regular expression for selecting columns.
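Applied to the table in the question, the same approach looks like the sketch below. (The JsonSerDe exception quoted above looks like a Jackson parse failure on the input data itself, i.e. malformed JSON in a file, which backticks will not fix; that part is an assumption.)
set hive.support.quoted.identifiers=column;
CREATE TEMPORARY EXTERNAL TABLE ${tmp_db}.${tmp_table}
(
  id STRING,
  email STRUCT<string:STRING>,
  `start` STRUCT<long:BIGINT>,
  `end` STRUCT<long:BIGINT>   -- reserved word, now safe in backticks
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '${input_dir}';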
Hope this helps

Hive Query dumping issue

I am facing difficulties in getting a dump (a text file delimited by ^) for a query in Hive for my project: sentiment analysis of the stock market using Twitter.
The query whose output I want to dump to HDFS or the local file system is given below:
hive> select t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
The query produces the correct output, but when I try dumping it to HDFS it gives me an error.
I went through various other solutions for dumping given on Stack Overflow, but was not able to find one that suits my case.
Thanks for your help.
INSERT OVERWRITE DIRECTORY '/path/to/dir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
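If the dump should land on the local file system rather than HDFS, the LOCAL keyword variant works the same way (the output path below is illustrative):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tweet_stock_dump'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname, t.datecol, t.tweet, st.diff FROM tweet t LEFT OUTER JOIN stock st
ON (t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
One caveat: ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY is only honored from Hive 0.11 onwards; older versions write \001-delimited output regardless.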

timestamp not working in hive

I have a table with one column whose data type is timestamp. Whenever I try to run queries on the table, even a simple select statement, I get errors.
An example of a value in my column:
'2014-01-01 05:05:20.664592-08'
The statement I am trying:
'select * from mytable limit 10;'
The error I am getting is:
'Failed with exception java.io.IOException:java.lang.NumberFormatException: For input string: "051-08000"'
Date functions in Hive like TO_DATE are also not working. If I change the data type to string, I am able to extract the date part using substring, but I need to work with timestamps.
Has anyone faced this error before? Please let me know.
Hive is having trouble understanding '2014-01-01 05:05:20.664592-08' as a timestamp because of the trailing "-08" time-zone offset (Hive timestamps do not carry a zone). You should change the data type to string, cut off the offending portion with a string function, then cast back to timestamp:
select cast(substring(time_stamp_field,1,23) as timestamp) from mytable
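If the trailing "-08" is a UTC offset that actually matters, one option (a sketch, assuming every row carries the same -08 offset as the sample) is to normalize to UTC after the cast with Hive's to_utc_timestamp:
-- 'GMT-8' mirrors the '-08' offset in the sample row
select to_utc_timestamp(
         cast(substring(time_stamp_field, 1, 23) as timestamp),
         'GMT-8')
from mytable;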

How to give a function as a input for s3 location in hive script

I am trying to achieve this:
location/11.11
location/12.11
location/13.11
In order to do that, I have tried many things and couldn't make it happen.
Now I have a Hive UDF that returns the S3 location for the table, but I am facing this error:
ParseException line 1:0 cannot recognize input near 'LOCATION'
'datenow' '(' LOCATION datenow(); NoViableAltException(143#[])
This is my Hive script; I have two external tables.
CREATE TEMPORARY FUNCTION datenow AS 'LocationUrlGenerator';
CREATE EXTERNAL TABLE IF NOT EXISTS s3( file Array<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001' LINES TERMINATED BY '\n';
LOCATION datenow();
LOCATION accepts a string literal, not a UDF. The Language Manual is a bit unclear because it only specifies [LOCATION hdfs_path] and leaves hdfs_path undefined, but it can only be a URL location path, i.e. a string. In general, UDFs are not acceptable in a DDL context.
Build the script with any text tool of your choice and run that script.
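One way to stay within Hive tooling is variable substitution: compute the location in the shell and pass it in with -hivevar (a sketch; the variable name loc and the bucket path are illustrative):
-- invoked as: hive -hivevar loc="s3://my-bucket/location/$(date +%d.%m)" -f create_s3_table.hql
CREATE EXTERNAL TABLE IF NOT EXISTS s3 (file Array<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
LOCATION '${hivevar:loc}';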
I managed it like this:
INSERT INTO TABLE S3
PARTITION(time)
SELECT func(json),from_unixtime(unix_timestamp(),'yyyy-MM-dd') AS time FROM tracksTable;
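For the record, a dynamic-partition INSERT like the one above normally needs dynamic partitioning enabled first (standard Hive settings; whether the asker's cluster already had them on is an assumption):
set hive.exec.dynamic.partition=true;
-- nonstrict: allow the partition column (time) to be fully dynamic
set hive.exec.dynamic.partition.mode=nonstrict;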
