I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using the SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the csv file again. I tried using different datatypes like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely because select count(*) from cause; returns a correct answer of 1016.
P.S.: I am new to Hadoop, Hive and big data in general.
I'm trying to load some test data to a simple Hive table. The data is comma separated, but the individual elements are not enclosed in double quotes. I'm getting an error due to this. How do I tell Hive not to expect varchar fields to be enclosed in quotes. Manually adding quotes to varchar fields is not an option since the input file I'm trying to use has thousands of records. Sample query and data below.
create table mydatabase.flights(FlightDate varchar(10),Airline int,FlightNum int,Origin varchar(4),Destination varchar(4),Departure varchar(4),DepDelay double,Arrival varchar(4),ArrivalDelay double,Airtime double,Distance double) row format delimited;
insert into mydatabase.flights(FlightDate,Airline,FlightNum,Origin,Destination,Departure,DepDelay,Arrival,ArrivalDelay,Airtime,Distance)
values(2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00);
The insert query above gives me an error message. It works fine if I enclose the varchar fields in quotes.
Error while compiling statement: FAILED: ParseException line 11:11 mismatched input '-' expecting ) near '2014' in value row constructor
I'm loading data using the following query
load data inpath '/user/alpsusa/hive/flights.csv' overwrite into table mydatabase.flights;
After load, I see only the first field being loaded. Rest all are NULL.
Sample data
2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00
2014-04-01,19805,2,LAX,JFK,0944,14.00,1736,-29.00,269.00,2475.00
2014-04-01,19805,3,JFK,LAX,1224,-6.00,1614,39.00,371.00,2475.00
2014-04-01,19805,4,LAX,JFK,1240,25.00,2028,-27.00,264.00,2475.00
2014-04-01,19805,5,DFW,HNL,1300,-5.00,1650,15.00,510.00,3784.00
Below is the output of DESCRIBE FORMATTED
I'm reading a "|" delimited file(input.txt) using while loop and comparing columns.
I want if values of all columns is not correct then above code should throw an error otherwise it should pass the result.
I am trying to move data from a file into a hive table. The data in the file looks something like this:-
StringA StringB StringC StringD StringE
where each string is separated by a space. The problem is that i want separate columns for StringA, StringB and StringC and one column for StringD onwards i.e. StringD and String E should be part of the same column. If i use
ROW DELIMITED BY FIELDS TERMINATED BY ' ', Hive would produce separate columns for StringD and StringE. (StringD and StringE contain space within themselves whereas other strings do not contain spaces within themselves)
Is there any special syntax in hive to achieve this or do i need to pre-process my data file in some way?
Use regular expresion
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
you can define when use space as delimiter and when part of data
I am trying to do achieve this;
location/11.11
location/12.11
location/13.11
In order to do that , i have tried many things and couldn't make it happen.
Now i have an Udf hive function which returns me the location of s3 table, but i am facing with an error ;
ParseException line 1:0 cannot recognize input near 'LOCATION'
'datenow' '(' LOCATION datenow(); NoViableAltException(143#[])
This is my hive script , i have two external tables.
CREATE TEMPORARY FUNCTION datenow AS 'LocationUrlGenerator';
CREATE EXTERNAL TABLE IF NOT EXISTS s3( file Array<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001' LINES TERMINATED BY '\n';
LOCATION datenow();
LOCATION accepts a string, not an UDF. The Language Manual si a bit unclear because it only specifies [LOCATION hdfs_path] and leaves hdfs_path undefined, but it can only be an URL location path, a string. In general UDFs are not acceptable in DDL context.
Build a script with any text tool of choice and run that script.
I managed it like that ,
INSERT INTO TABLE S3
PARTITION(time)
SELECT func(json),from_unixtime(unix_timestamp(),'yyyy-MM-dd') AS time FROM tracksTable;