How do I ignore brackets when loading an external table in Hive - hadoop

I'm trying to load the output of a Pig script as an external table in Hive. Pig encloses each row in brackets () (tuples?), like this:
(1,2,3,a)
(2,4,5,b)
(4,2,6,c)
and I can't find a way to tell Hive to ignore those brackets, which results in null values for the first column, as it is actually an integer.
Any thoughts on how to proceed?
I know I can use a FLATTEN command in Pig, but I would also like to learn how to deal with these files directly from Hive.

There is no way to do this in one step. You'd have to add another step, be it the use of FLATTEN in Pig or an extra Hive INSERT INTO.
In Hive you could use split(string field, string pattern) several times to read from your external table, create the columns you want, and then load those into a new table. However, I'd always lean towards having Pig output the format you want, unless something else reads this file and expects the data in its current format; it will save an expensive re-read of all your data.
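For example, a minimal sketch of that extra INSERT step, assuming the rows were first loaded into a comma-delimited staging table a_temp(first string, second int, third int, fourth string); the table and column names here are hypothetical:
create table a_clean (first int, second int, third int, fourth string);

insert into table a_clean
select cast(split(first, '\\(')[1] as int),  -- split('(1', '\\(') yields ['', '1']
       second,
       third,
       split(fourth, '\\)')[0]               -- split('a)', '\\)') yields ['a']
from a_temp;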

As Ben said, there is no way to do it in one step, but you can do it by creating one more temp table in Hive.
Not sure if I am making it more complicated with one more table, but it worked for me.
create external table A_TEMP (first string,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Adata';
Place your data under the 'Adata' folder.
create external table A (first int,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Afinaldata';
Now let's insert the data:
insert into table A
select cast(substr(first, 2) as int), second, third, substr(fourth, 1, length(fourth) - 1) from A_TEMP;
I know the type casting will hit performance, but for the given scenario this is the best I could come up with.

Related

How to handle newline characters in Hive?

I am exporting a table from Teradata to Hive. The table in Teradata has an address field which contains newline characters (\n). Initially I export the table to a mounted filesystem path from Teradata, and then I load the table into Hive. Record counts mismatch between the Teradata table and the Hive table, since the newline characters end up splitting records in Hive.
NOTE: I don't want to handle this through Sqoop to bring the data; I want to handle the newline characters while loading into Hive from a local path.
I got this to work by creating an external table with the following options:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;
Then I created a partition pointing to the directory that contains the data files (my table uses partitions), e.g.:
ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';
NOTE: Be sure that when creating your data file you use '\' as the escape character.
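Put together, a minimal sketch of the complete DDL (the column names here are hypothetical; the partition column matches the ALTER TABLE above):
CREATE EXTERNAL TABLE STG_HOLD_CR_LINE_FEED (
  cust_id STRING,
  address STRING
)
PARTITIONED BY (part_key STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;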
The LOAD DATA command in Hive only copies the data directly into the HDFS table location.
The only reason Hive would split on a newline is if you defined the table as stored as TEXTFILE, which by default uses newlines as record separators, not field separators.
To redefine the table you need something like
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY 'x'
LINES TERMINATED BY 'y'
where x is an escape character for fields containing newlines and y is the record delimiter.
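For instance, a concrete sketch with ',' as the field delimiter and '\' as the escape character (the table and column names are assumptions, and as far as I know Hive only accepts '\n' for LINES TERMINATED BY):
CREATE TABLE addresses (id INT, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;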

How to preprocess the data and load it into Hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files with structured content, and the first 4 lines of each file are a description. I'm a little confused about how to process this type of data.
1.a Can I process the data directly, given these are tar files? If yes, how do I remove the data in the first four lines? Or do I need to untar the files and remove the first 4 lines manually?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I process the data directly, given these are tar files?
Yes, see the below solution.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties("skip.header.line.count"="1"), set when creating a table, that tells Hive the number of leading rows to ignore. To ignore the first four lines, use tblproperties("skip.header.line.count"="4").
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence;
Reference: Compressed Data Storage
Follow the steps below to achieve your goal:
Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the file locally.
Create the metadata (i.e. the table) in Hive based on the description.
E.g., if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file and use the corresponding field separator in the create table query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive:
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
Since the data is in a structured format, you can load it into the Hive table using the command below:
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
The local data will now be copied to HDFS and loaded into the Hive table.
Finally, you can query the hive table using SELECT * FROM TABLENAME;

In Hive, how do I load only part of the raw data to a table?

I've got a typical CREATE TABLE statement as follows:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/20140811/';
Where /20140811/ is a directory with gigabytes worth of data inside.
Loading the data is not a problem. Querying anything on it, however, chokes Hive up and simply gives me a number of MapReduce errors.
So instead, I'd like to ask if there's a way to load only part of the data in /20140811/. I know I can select a few files from inside the folder, dump them into another folder, and use that, but it seems tedious, especially when I've got 20 or so of these /20140811/-style directories.
Is there something like this:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/Half_of_20140811/';
I'm also open to non-Hive answers. Perhaps there's a way in s3cmd to quickly take a certain amount of the data inside /20140811/ and dump it into /20140811_halved/ or something.
Thanks.
I would suggest the following as a workaround:
Create a temp table with the same structure (using LIKE).
insert into NEW_TABLE select * from OLD_TABLE limit 1000;
You can add as many filter conditions as you like to filter out data before loading, as in the sketch below.
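A minimal sketch of that workaround against the table from the question (the new table name and the WHERE condition are hypothetical):
CREATE TABLE temp_url_sample LIKE temp_url;

INSERT INTO TABLE temp_url_sample
SELECT * FROM temp_url
WHERE tier1 IS NOT NULL  -- hypothetical filter; replace with whatever narrows the data
LIMIT 1000;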
Hope this helps you.
Since you say you have "20 or so of these /20140811/ directories", why don't you try creating an external table with partitions on those directories and running your queries on a single partition?
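A minimal sketch of that approach, reusing the columns from the question (the partition column dt and the table name are assumptions):
CREATE EXTERNAL TABLE temp_url_partitioned (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
PARTITIONED BY (dt STRING)
row format delimited fields terminated by '\t' lines terminated by '\n';

ALTER TABLE temp_url_partitioned ADD PARTITION (dt='20140811')
LOCATION 's3://mybucket/input/project_blah/20140811/';

-- queries now read only that day's directory
SELECT * FROM temp_url_partitioned WHERE dt = '20140811' LIMIT 10;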

Simple Hive query is empty

I have a CSV log file. After loading it into Hive using this statement:
CREATE EXTERNAL TABLE iprange(id STRING, ip STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' STORED AS TEXTFILE LOCATION '/user/hadoop/expandediprange/';
I want to perform a simple query like:
select * from iprange where ip="0.0.0.2";
But I get an empty result.
I'm running Hive on HDFS; should I use HBase?
My conclusion is that it has something to do with the table size: the log file is 160 MB, and the generated table in Hive has 8 million rows. If I create a smaller file myself and load it into Hive, it works.
Any idea of what is wrong?
Edit: I forgot to say that it's running on Amazon Elastic MapReduce using a small instance.
I found the problem. It was not really a Hive issue. I'm using the output of a Hadoop job as input, and in that job I was writing the output into the key, leaving the value as an empty string:
context.write(new Text(id + "," + ip), new Text(""));
The problem is that Hadoop inserts a tab character by default between the key and the value, and since the field is a string it took the tab as well, so I had a trailing tab in every line. I discovered this using Pig, as it wraps its output in parentheses ().
The solution for me was to set the separator to another character. As I have only two fields, I write one in the key and the other in the value, and set the separator to ',':
conf.set("mapred.textoutputformat.separator", ",");
Maybe it's also possible to trim these things in Hive.
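For what it's worth, a hedged sketch of trimming at query time instead, using Hive's regexp_replace UDF to strip a trailing tab from the ip column of the iprange table above:
select * from iprange where regexp_replace(ip, '\\t$', '') = "0.0.0.2";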

Share data between hive and hadoop streaming-api output

I have several Hadoop streaming API programs that produce output with this output format:
"org.apache.hadoop.mapred.SequenceFileOutputFormat"
The streaming API programs can read that file with the input format "org.apache.hadoop.mapred.SequenceFileAsTextInputFormat".
The data in the output file looks like this:
val1-1,val1-2,val1-3
val2-1,val2-2,val2-3
val3-1,val3-2,val3-3
Now I want to read the output with Hive. I created a table with this script:
CREATE EXTERNAL
TABLE IF NOT EXISTS table1
(
col1 int,
col2 string,
col3 int
)
PARTITIONED BY (year STRING,month STRING,day STRING,hour STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/hive/table1';
When I query the data with
select * from table1
the result is:
val1-2,val1-3
val2-2,val2-3
val3-2,val3-3
It seems the first column has been ignored. I think Hive just uses the values as output, not the keys. Any ideas?
You are correct. One of the limitations of Hive right now is that it ignores the keys from the sequence file format. By "right now" I am referring to Hive 0.7, but I believe it's a limitation of Hive 0.8 and Hive 0.9 as well.
To circumvent this, you might have to create a new input format for which the key is null and the value is the combination of your present key and value. Sorry, I know this was not the answer you were looking for!
It should be FIELDS TERMINATED BY ',' instead of FIELDS TERMINATED BY '\t', I think.
