Load string data that does not have quotes to Hive - hadoop

I'm trying to load some test data into a simple Hive table. The data is comma-separated, but the individual elements are not enclosed in double quotes, and I'm getting an error because of this. How do I tell Hive not to expect varchar fields to be enclosed in quotes? Manually adding quotes to the varchar fields is not an option, since the input file I'm trying to use has thousands of records. Sample query and data below.
create table mydatabase.flights(FlightDate varchar(10),Airline int,FlightNum int,Origin varchar(4),Destination varchar(4),Departure varchar(4),DepDelay double,Arrival varchar(4),ArrivalDelay double,Airtime double,Distance double) row format delimited;
insert into mydatabase.flights(FlightDate,Airline,FlightNum,Origin,Destination,Departure,DepDelay,Arrival,ArrivalDelay,Airtime,Distance)
values(2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00);
The insert query above gives me an error message. It works fine if I enclose the varchar fields in quotes.
Error while compiling statement: FAILED: ParseException line 11:11 mismatched input '-' expecting ) near '2014' in value row constructor
I'm loading data using the following query
load data inpath '/user/alpsusa/hive/flights.csv' overwrite into table mydatabase.flights;
After the load, I see only the first field being loaded. All the rest are NULL.
Sample data
2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00
2014-04-01,19805,2,LAX,JFK,0944,14.00,1736,-29.00,269.00,2475.00
2014-04-01,19805,3,JFK,LAX,1224,-6.00,1614,39.00,371.00,2475.00
2014-04-01,19805,4,LAX,JFK,1240,25.00,2028,-27.00,264.00,2475.00
2014-04-01,19805,5,DFW,HNL,1300,-5.00,1650,15.00,510.00,3784.00
Below is the output of DESCRIBE FORMATTED
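A likely cause, offered as an assumption since the DESCRIBE FORMATTED output is not shown: the DDL says only `row format delimited`, with no `fields terminated by` clause, so Hive falls back to its default field delimiter, Ctrl-A (`\001`), instead of a comma. Each whole line would then land in FlightDate (cut to 10 characters by varchar(10)) and every other column would be NULL. Quotes are not needed in the data file. The parse error on the insert is a separate matter, because string literals in a VALUES clause always have to be quoted. A minimal, untested sketch that reuses the names from the question:

```sql
-- Sketch only: same columns as in the question, plus an explicit comma delimiter.
CREATE TABLE mydatabase.flights (
  FlightDate   VARCHAR(10),
  Airline      INT,
  FlightNum    INT,
  Origin       VARCHAR(4),
  Destination  VARCHAR(4),
  Departure    VARCHAR(4),
  DepDelay     DOUBLE,
  Arrival      VARCHAR(4),
  ArrivalDelay DOUBLE,
  Airtime      DOUBLE,
  Distance     DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Alternative for a text table that already exists and already holds the file:
-- change the delimiter in place instead of recreating the table.
ALTER TABLE mydatabase.flights SET SERDEPROPERTIES ('field.delim' = ',');
```

Keep in mind that LOAD DATA INPATH moves the file into the table's directory, so recreating the table may require putting the file back at the source path first.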

Related

sum() function gives wrong answer in hiveql

I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using the SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the csv file again. I tried using different datatypes like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely because select count(*) from cause; returns a correct answer of 1016.
P.S.: I am new to Hadoop, Hive and big data in general.

Select statement in hive return some columns with null value

I have seen this type of question asked many times, but those solutions did not work for me. I created an external Hive table, since my data is the output of a map-only job. Then I used the load command with the path to the specific file, and it reported OK. But when I run select * from table, some columns come back with null values. Every command I executed is shown in the error picture.
My delimiter in the file is ||, so I specified the same in the create table command too.
Here is a picture of my input file, and here is the error picture.
I have also tried a normal table instead of an external table; that showed the same error too. I also tried specifying the delimiter as //|| and as \|\|, but neither worked.
The problem you are facing comes from using multiple characters as the FIELD delimiter.
According to the documentation, the FIELD delimiter must be a single CHAR:
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
You need to change your data so that it has a single-character field delimiter.
If you cannot do that, the other approach is to use a staging table with a single field. Load your data into that table, then split the staging column on the || delimiter and insert the pieces into your actual target table. Make sure the field count is consistent across the data; otherwise your final output will be off.
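The staging-table approach could look roughly like this. It is an untested sketch: the table names `stage_raw` and `target_tbl`, the three STRING columns, and the input path are all hypothetical placeholders, and typed target columns would likely need explicit casts.

```sql
-- Staging table: one STRING column. With Hive's default field delimiter
-- (Ctrl-A), a line that contains no Ctrl-A lands in this column whole.
CREATE TABLE stage_raw (line STRING);

-- Hypothetical target table with three fields.
CREATE TABLE target_tbl (col1 STRING, col2 STRING, col3 STRING);

-- Placeholder path; substitute the real location of the ||-delimited file.
LOAD DATA INPATH '/path/to/input' INTO TABLE stage_raw;

-- split() takes a regular expression, so each | is escaped.
-- Rows without exactly three fields are skipped rather than misaligned.
INSERT INTO TABLE target_tbl
SELECT parts[0], parts[1], parts[2]
FROM (
  SELECT split(line, '\\|\\|') AS parts
  FROM stage_raw
) s
WHERE size(parts) = 3;
```

The `size(parts) = 3` filter is one way to enforce the consistent field count mentioned above. Counting the rows it rejects shows how much of the input is malformed.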
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

SQL Loader incompatible length

This is my control file
FIELDS (
dummy1 filler terminated by "cid=",
address enclosed by "<address>" and "</address>"
...
The address column in the table is varchar(10).
If the address in the file is over 10 characters then SQL*Loader cannot load it.
How can I load the address truncated to 10 characters?
The documentation has a section on applying SQL operators to fields.
A wide variety of SQL operators can be applied to field data with the SQL string. This string can contain any combination of SQL expressions that are recognized by the Oracle database as valid for the VALUES clause of an INSERT statement. In general, any SQL function that returns a single value that is compatible with the target column's datatype can be used.
In this case you can use the substr() function on the value from the file:
...
dummy filler terminated by "cid=",
address enclosed by "<address>" and "</address>" "substr(:address, 1, 10)"
...
The quoted "substr(:address, 1, 10)" passes the initial value from the file through the function before inserting the resulting 10 character (maximum) value, however long the original value in the file was. Note the colon before the name in that function call.
If your file is XML then you might be better off loading it as an external table and then using the built-in XML query tools to extract the data you want, rather than trying to parse it through delimited field definitions.

Hive histogram_numeric function outputs invalid character

I am using the histogram_numeric function of Hive and I want to write the output of my select query to a file.
However, I get an invalid character in the file and I cannot use it for plotting the data.
Here is my code:
INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/queries/histograms/q1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select explode(histogram_numeric(operationTime,30)) from transaction;
And as a result I get :
3.1968591661070107"someInvalidCharacter"196572.0
14.41629947203365"someInvalidCharacter"725191.0
27.84241052482667"someInvalidCharacter"27069.0
But I expect "," instead of "someInvalidCharacter".
What can be the problem?
Per the Hive LanguageManual, histogram_numeric creates an array of structs. Try using inline to "explode" your output instead of using explode.
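Applied to the query from the question, that suggestion might look like the sketch below. It is untested and changes only the function name. inline() expands each struct into its own columns (x and y for histogram_numeric), so the configured field delimiter should separate them.

```sql
-- Sketch: same statement as in the question, with inline() replacing explode().
-- Each histogram bin becomes one row with two columns, x (bin centre) and y (height).
INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/queries/histograms/q1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT inline(histogram_numeric(operationTime, 30)) FROM transaction;
```

With explode(), each row holds a single struct column, and the struct's fields are written with the collection-item delimiter rather than the field delimiter. That is a plausible source of the unexpected character.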

How to give a function as a input for s3 location in hive script

I am trying to achieve this:
location/11.11
location/12.11
location/13.11
To do that, I have tried many things and couldn't make it work.
Now I have a Hive UDF which returns the location of the S3 table, but I am getting an error:
ParseException line 1:0 cannot recognize input near 'LOCATION'
'datenow' '(' LOCATION datenow(); NoViableAltException(143#[])
This is my Hive script; I have two external tables.
CREATE TEMPORARY FUNCTION datenow AS 'LocationUrlGenerator';
CREATE EXTERNAL TABLE IF NOT EXISTS s3( file Array<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001' LINES TERMINATED BY '\n';
LOCATION datenow();
LOCATION accepts a string, not a UDF. The Language Manual is a bit unclear because it only specifies [LOCATION hdfs_path] and leaves hdfs_path undefined, but it can only be a location path given as a string. In general, UDFs are not acceptable in a DDL context.
Build a script with any text tool of choice and run that script.
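A variation on that idea, sketched here with Hive's variable substitution instead of an external text tool: the location is supplied on the command line and expanded into a plain string before the statement is parsed. The file name `create_s3.hql` and the bucket path are hypothetical. The sketch also drops the semicolon after '\n', which would otherwise end the statement before LOCATION.

```sql
-- create_s3.hql (hypothetical file name). Run with, for example:
--   hive --hivevar loc='s3://my-bucket/location/11.11' -f create_s3.hql
-- where the bucket and path are placeholders computed by the calling shell.
CREATE EXTERNAL TABLE IF NOT EXISTS s3 (file ARRAY<STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\001'
  LINES TERMINATED BY '\n'
LOCATION '${hivevar:loc}';
```

The date-dependent part of the path is then computed outside Hive, by whatever wrapper invokes the script, and Hive only ever sees a string literal after LOCATION.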
I managed it like this:
INSERT INTO TABLE S3
PARTITION(time)
SELECT func(json),from_unixtime(unix_timestamp(),'yyyy-MM-dd') AS time FROM tracksTable;
