I'm reading a "|"-delimited file (input.txt) in a while loop and comparing columns.
I want the code to throw an error if any column value is incorrect; otherwise it should pass the result.
I'm trying to load some test data into a simple Hive table. The data is comma separated, but the individual elements are not enclosed in double quotes, and I'm getting an error because of this. How do I tell Hive not to expect varchar fields to be enclosed in quotes? Manually adding quotes to the varchar fields is not an option, since the input file I'm trying to use has thousands of records. Sample query and data below.
create table mydatabase.flights(
  FlightDate varchar(10),
  Airline int,
  FlightNum int,
  Origin varchar(4),
  Destination varchar(4),
  Departure varchar(4),
  DepDelay double,
  Arrival varchar(4),
  ArrivalDelay double,
  Airtime double,
  Distance double
) row format delimited;
insert into mydatabase.flights(FlightDate,Airline,FlightNum,Origin,Destination,Departure,DepDelay,Arrival,ArrivalDelay,Airtime,Distance)
values(2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00);
The insert query above gives me an error message. It works fine if I enclose the varchar fields in quotes.
Error while compiling statement: FAILED: ParseException line 11:11 mismatched input '-' expecting ) near '2014' in value row constructor
I'm loading data using the following query
load data inpath '/user/alpsusa/hive/flights.csv' overwrite into table mydatabase.flights;
After the load, I see only the first field being loaded; the rest are all NULL.
Sample data
2014-04-01,19805,1,JFK,LAX,0854,-6.00,1217,2.00,355.00,2475.00
2014-04-01,19805,2,LAX,JFK,0944,14.00,1736,-29.00,269.00,2475.00
2014-04-01,19805,3,JFK,LAX,1224,-6.00,1614,39.00,371.00,2475.00
2014-04-01,19805,4,LAX,JFK,1240,25.00,2028,-27.00,264.00,2475.00
2014-04-01,19805,5,DFW,HNL,1300,-5.00,1650,15.00,510.00,3784.00
Below is the output of DESCRIBE FORMATTED
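For context: row format delimited without a FIELDS TERMINATED BY clause falls back to Hive's default Ctrl-A field delimiter, so each comma-separated line is read as a single field — FlightDate picks up the first 10 characters ("2014-04-01") and every other column stays NULL, which matches the symptom above. A sketch of the corrected DDL, assuming the same schema:

create table mydatabase.flights(
  -- ...same columns as above...
)
row format delimited
fields terminated by ','
stored as textfile;

The insert ... values parse error is a separate issue: inside a query, string literals always need quotes (unquoted 2014-04-01 is parsed as an arithmetic expression), so only the file-based load avoids manual quoting.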
I am a newbie in Hadoop and I have to add data into a table in Hive.
I have data from the FIX 4.4 protocol, something like this...
8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>
Firstly, what I want is: in 8=FIX.4.4, 8 should be the column name and FIX.4.4 the value of that column; in 9=66, 9 should be the column name and 66 the value; and so on. There are many rows like this in the raw file.
Secondly, the same thing should happen for the next row, and that data should be appended to the next row of the table in Hive.
I am not able to figure out what to do next.
Any help would be appreciated.
I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can just split on the <SHO> tag and =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python.
The code below shows you how to write one of your input lines to a TSV file.
This can easily be extended to support multiple lines, or to append lines to the file once it has been created.
import csv

line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"

# Split on the <SHO> tag; drop the last element since the line ends with <SHO>
parts = line.split('<SHO>')[:-1]

# Turn each 'col=val' string into a (col, val) tuple
list_of_pairs = [tuple(p.split('=', 1)) for p in parts]
d = dict(list_of_pairs)

with open('test.tsv', 'w', newline='') as f:
    cw = csv.writer(f, delimiter='\t')
    cw.writerow(d.keys())    # Comment this out if you don't want a header
    cw.writerow(d.values())
What this code does is first split the input line on <SHO>, producing a list of col=val strings. It then creates a list of tuples where each tuple is (col, val).
From those it creates a dictionary, which is not strictly necessary but might help you if you want to extend the code to more lines.
Finally it writes a tab-separated file test.tsv containing a header and the values on the next line.
This means now you have a file which Hive can understand.
I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.
CREATE TABLE IF NOT EXISTS [database].[table] (
  [Col1] INT,
  [Col2] INT,
  [Col3] STRING,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');

LOAD DATA INPATH '[HDFS path]'
OVERWRITE INTO TABLE [database].[table];
Hope this gives you a better idea on how to proceed.
Copy the file to HDFS and create an external table with a single column (c8), then use the select statement below to extract each column.
create external table tablename(
c8 string )
STORED AS TEXTFILE
location 'HDFS path';
select regexp_extract(c8,'8=(.*?)<SHO>',1) as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1) as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1) as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1) as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1) as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1) as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1) as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1) as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1) as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename
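If you want to keep the parsed columns around rather than re-running the regexes each time, the same select can feed a create-table-as-select — a minimal sketch, with parsed_fix as a made-up target name:

create table parsed_fix as
select regexp_extract(c8,'8=(.*?)<SHO>',1)  as c8,
       regexp_extract(c8,'9=(.*?)<SHO>',1)  as c9,
       -- ...remaining regexp_extract columns as above...
       regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename;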
I am trying to move data from a file into a hive table. The data in the file looks something like this:-
StringA StringB StringC StringD StringE
where each string is separated by a space. The problem is that I want separate columns for StringA, StringB and StringC, and one column for StringD onwards, i.e. StringD and StringE should be part of the same column. If I use
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ', Hive produces separate columns for StringD and StringE. (StringD and StringE contain spaces within themselves, whereas the other strings do not.)
Is there any special syntax in Hive to achieve this, or do I need to pre-process my data file in some way?
Use a regular expression (RegexSerDe):
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
With it you can define when a space acts as a delimiter and when it is part of the data; see the sketch below.
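A minimal sketch along those lines — the table and column names are made up, and the regex assumes the first three fields never contain spaces while everything after the third space belongs to the last column:

CREATE TABLE strings_table (
  col_a STRING,
  col_b STRING,
  col_c STRING,
  col_rest STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+) (.*)"
)
STORED AS TEXTFILE;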
I am using the histogram_numeric function of Hive and I want to output my select query to a file.
However, I get an invalid character in the file and I cannot use it for plotting the data.
Here is my code:
INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/queries/histograms/q1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select explode(histogram_numeric(operationTime,30)) from transaction;
And as a result I get:
3.1968591661070107"someInvalidCharacter"196572.0
14.41629947203365"someInvalidCharacter"725191.0
27.84241052482667"someInvalidCharacter"27069.0
But I expect "," instead of "someInvalidCharacter".
What can be the problem?
Per the Hive LanguageManual, histogram_numeric returns an array of structs. With explode, each output row is a single struct column, and Hive writes the struct's x and y fields with its default collection separator (a control character) instead of your ',' field terminator — that is the invalid character you are seeing. Try using inline to "explode" your output instead of explode, so x and y become separate top-level columns; a sketch follows.
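A sketch of the suggested rewrite, reusing the directory, table, and column names from the question:

INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/queries/histograms/q1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
-- inline() expands array<struct<x,y>> into rows with separate x and y columns,
-- so the ',' field terminator now applies between them
select inline(histogram_numeric(operationTime, 30)) from transaction;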
I have a large input file with pipe-delimited values, 20 values per row. A record ends when a newline follows the 19th pipe.
But my input file has \n not only after the 19th pipe but also inside the other values. A sample line looks like this...
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide the lines into key-value pairs based on this condition.
Another related question I have: input splitting happens on the client side, and if I have to split the input file conditionally on the client side (one machine), won't that be very slow for a large file? Please help.
In Hive, NULL column values are represented as "\N"; that is Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string NULL).
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
  COALESCE(my_column, '') AS my_column
FROM
  my_table;