Share data between hive and hadoop streaming-api output - hadoop

I've several hadoop streaming api programs and produce output with this outputformat:
"org.apache.hadoop.mapred.SequenceFileOutputFormat"
And the streaming api program can read the file with input format "org.apache.hadoop.mapred.SequenceFileAsTextInputFormat".
For the data in the output file looks like this.
val1-1,val1-2,val1-3
val2-1,val2-2,val2-3
val3-1,val3-2,val3-3
Now I want to read the output with hive. I created a table with this script:
CREATE EXTERNAL
TABLE IF NOT EXISTS table1
(
col1 int,
col2 string,
col3 int
)
PARTITIONED BY (year STRING,month STRING,day STRING,hour STRING)
ROW FORMAT DELIMITED
FIELDs TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/hive/table1';
When I query data with query
select * from table1
The result will be
val1-2,val1-3
val2-2,val2-3
val3-2,val3-3
It seems the first column has been ignored. I think hive just use values as output not keys. Any ideas?

You are correct. One of the limitations of Hive right now is that ignores the keys from the Sequence file format. By right now, I am referring to Hive 0.7 but I believe it's a limitation of Hive 0.8 and Hive 0.9 as well.
To circumvent this, you might have to create a new input format for which the key is null and the value is the combination of your present key and value. Sorry, I know this was not the answer you were looking for!

It should be fields terminated by ','
instead of fields terminated by '\t', I think.

Related

How can I create a partitioned table that is semicolumn separated and has commas as decimal points?

I'm having problems whith this type of table:
manager; sales
charles; 100,1
ferdand; 212,6
aldalbert; 23,4
chuck; 41,6
I'm using the code bellow to create and define the partitioned table:
CREATE TABLE db.table
(
manager string,
sales string
)
partitioned by (file_type string)
row format delimited fields terminated by ';'
lines terminated by '\n'
tblproperties ("skip.header.line.count"="1");
Afterwards, I'm using a regex command to replace the commas by dots and then convert the sales field to a number datatype.
I wonder if there is a better solution than that.
Other than using Spark or Pig to clean the data as well as load the Hive table, then no, you'll need to replace and cast the sales column within HiveQL to get the format you want

How to create Hive table for special formated data

I have text files that i want to load into Hive table.
Format of the data is like below
Id|^|SegmId|^|geographyId|^|Sequence|^|Subtracted|^|FFAction|!|
4295875876|^|3|^|110170|^|1|^|False|^|I|!|
4295876137|^|2|^|110170|^|1|^|False|^|I|!|
4295876137|^|8|^|100219|^|1|^|False|^|I|!|
I want to create a table in Hive for this kind of data.
Can you please suggest how to create table for this?
This is what I have tried but getting null (also please suggest us the data type for the columns):
create table if not exists GeographicSegment
(
Id int,
SegId int,
geographyId int,
Sequence int,
Subtracted String,
FFAction String
) row format delimited fields terminated by '|!|' LINES TERMINATED BY '\n' ;
This has worked for me
row format SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES ("field.delim"="|^|") tblproperties
It seems that your fields are terminated by '|^|' and your lines are terminated by '|!|\n'
Hive does not support multiple character as delimiter,
you can find the way to handle it here,
Solution
Regarding the data type what you are doing is correct except the first column ID. The value present is more than the range of INT. it can be BIGINT.

hive: external partitioned table without location

Is it possible to create external partitioned table without location? I want to add all the locations later, together with partitions.
i tried:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
PARTITIONED BY day;
but i got ParseException: missing EOF at 'PARTITIONED' near 'TEXTFILE'
I don't think so, as said in alter location.
But anyway, i think your query as some errors and the correct script would be :
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
;
I think the issue is that you have not specified data type for your partition column "day". And you can create a HIVE external table without location and can use ALTER table options later to change the location.

Impala minimum DDL

I know that we can create an Impala table like
CREATE EXTERNAL TABLE SCHEMA.TableName LIKE PARQUET
'/rootDir/SecondLevelDir/RawFileThatKnowsDataTypes.parquet'
But I am not sure if Impala can create a table from a file (preferably a text file) that has no known formatting. So in other words if I just dump a random file into hadoop with a put command, can I wrap an Impala DDL around it and have a table created. Can anyone tell me?
If you file is newline separated I believe it should work if you provide the column delimiter with the ROW FORMAT clause, since textfile is the default format. Just get rid of your LIKE clause, and choose names and datatypes for your columns something like this:
CREATE EXTERNAL TABLE SCHEMA.TableName (col1 STRING, col2 INT, col3 FLOAT)
'/rootDir/SecondLevelDir/RawFile'
row format delimited fields terminated by ",";

How do I ignore brackets when loading exteral table in HIVE

I'm trying to load an extract of a pig script as an external table in HIVE. Pig enclosed each row between brackets () (tuples?) like this:
(1,2,3,a)
(2,4,5,b)
(4,2,6,c)
and I can't find a way to tell HIVE to ignore those brackets which results in null values for the first column as it is actually an integer.
Any thoughts on how to proceed?
I know I can use a FLATTEN command in PIG but I would also like to learn how to deal with these files directly from HIVE.
There is no way to do this in one step. You'd have to have another step, be it the use of flatten in Pig or an extra Hive INSERT INTO.
In Hive you could use split(string field, string pattern) several times to read from your external table and create the columns you want and then load that into a new table. However I'd always lean towards having Pig output into the format you want, unless something else is reading from this file that expects the data in that format. It will save an expensive re-read of all your data.
As Ben said there is no way to do in one step.. but you can do it by creating one more temp table in hive.
Not sure if I am making it more complicated with one more table.. but it worked for me.
create external table A_TEMP (first string,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Adata';
Place your data under 'Adata' folder
create external table A (first int,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Afinaldata';
Now lets insert data
insert into table A
select cast(substr(first, 2, length(first) - 2) as int),second,third,substr(fourth, 1,length(fourth) - 1 ) from A_TEMP;
I know type casting will hit performance.. but for the given scenario this is the best I could come up with.

Resources