Delete row in hive external table - hadoop

I loaded a text file into a Hive external table. The text file uses / as the column delimiter, and one of the columns additionally contains newline characters. Because of that, the data stored in the external table is misaligned. In my case the unique key is row_id, which contains values like 1_234; row_id is numeric. But because of the newline characters in the text file, some rows end up with text in row_id.
Is there any way to delete those rows in Hive, or how can I remove the newline characters from the text file in HDFS?

You will have to write a Hadoop job (streaming is an option) to clean your data before loading it into Hive.
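If it is enough to simply drop the misaligned rows after loading (rather than repairing them), one Hive-side workaround is to copy only the rows whose row_id still looks like a valid key into a clean table. This is only a sketch; the table names and the pattern are assumptions:
-- hypothetical names; adjust the pattern to whatever a valid row_id looks like
CREATE TABLE my_table_clean LIKE my_table_ext;
INSERT OVERWRITE TABLE my_table_clean
SELECT * FROM my_table_ext
WHERE row_id RLIKE '^[0-9_]+$';
Note that this discards the broken rows entirely; the data that was split across them is still lost, which is why cleaning the file first is the more complete fix.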

Related

How to handle new line characters in hive?

I am exporting a table from Teradata to Hive. The Teradata table has an address field which contains newline characters (\n). Initially I export the table to a mounted filesystem path from Teradata and then I load the table into Hive. Record counts mismatch between the Teradata table and the Hive table, since the newline characters are present in Hive.
NOTE: I don't want to handle this through Sqoop; I want to handle the newline characters while loading into Hive from the local path.
I got this to work by creating an external table with the following options:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;
Then I added a partition pointing to the directory that contains the data files (my table uses partitions), i.e.
ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';
NOTE: Be sure that when creating your data file you use '\' as the escape character.
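Put together, the DDL could look like the following; the column list here is an assumption for illustration, but the delimiter, escaping, and partition match the fragments above:
CREATE EXTERNAL TABLE STG_HOLD_CR_LINE_FEED (
  addr_id STRING,
  address STRING
)
PARTITIONED BY (part_key STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;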
The LOAD DATA command in Hive only copies the data directly into the HDFS table location.
The only reason Hive would split on a newline is if the table is stored as TEXTFILE, which by default uses newlines as record separators, not field separators.
To redefine the table you need something like
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY 'x'
LINES TERMINATED BY 'y'
where x is the escape character for fields that contain newlines and y is the record delimiter. (Note that Hive's text format has historically only accepted '\n' as the line terminator, so escaping the embedded newlines is usually the practical fix.)
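Concretely, and only as an illustration (the table and column names are placeholders), that could look like:
CREATE TABLE addresses_text (id STRING, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;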

Replace specific junk characters from column in hive

I have an issue where one of the columns loaded into a Hive table contains a junk character sequence ("~) appended to the actual value (ABC). So the value visible in this column is ABC"~.
This column can hold either ABC (or any such string) or NULL. The table is huge and UPDATE is not an option here.
My idea is to create a temp table where this column contains either the string (ABC) or NULL, i.e. to remove the junk characters ("~) completely while copying the data from the original table to the temp table.
Any help on how I can remove this junk? I tried using the regexp functions, but with no success. Any suggestions?
I was not using regexp properly; my fault.
The data initially loaded into the table had the extra characters attached to the column's values. For example: if the column's actual value was Adf452, then the cell contained Adf452"~.
So I loaded the data to a temp table like this:
insert overwrite table tempTable select colA, colB, regexp_replace(colC,"\"~",""), partitionedCol from origTable;
This simply loaded the data in tempTable without those junk characters.
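For anyone debugging a similar pattern, it can help to preview the replacement with a plain SELECT before running the INSERT OVERWRITE (origTable and colC here are just the names used above):
-- sanity-check the regexp before overwriting anything
select colC, regexp_replace(colC, '"~', '') as cleaned
from origTable
limit 10;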

Table count is more than File record count in Hive

I'm using a file exported from SQL Server as the input to my Hive table (which has 40 columns). There are around 6 million rows in the data file, but when I load that file into the Hive table, the record count is higher than the row count in the file. The table has 15 records more than the input text file.
I suspect the presence of newline characters (\n) in the data, but due to the huge volume of data I'm unable to manually check and remove these characters from the data file.
Is there any way I can make the table count exactly equal to the file count? Can I make my load query treat those newline characters as data instead of record delimiters? Or is there some other issue?
If you are sqooping the input into HDFS/Hive, then you can use Sqoop's --hive-drop-import-delims or --hive-delims-replacement options.
Hive will have problems using Sqoop-imported data if your database's rows contain string fields that have Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them.
You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data.
These options should only be used if you use Hive's default delimiters and should not be used if different delimiters are specified.
Sqoop User Guide
Alternatively, if you are copying files onto HDFS using some other method, then just run a replace script/command over the files.
It was as simple as running a simple Unix command to clean the source data:
sed -i 's/\r//g' <source-file>
After applying this command to the dataset to remove carriage returns, I was able to load the Hive table with the expected record count.

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to data visualization.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files of structured data, and the first 4 lines of each file contain a description. I'm a little confused about how to process this type of data.
1.a Can I directly process the data as these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the files and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties("skip.header.line.count"="1"), that you can set when creating a table to tell Hive the number of header rows to skip. To ignore the first four lines: tblproperties("skip.header.line.count"="4").
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e. the .tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the file locally.
Create the metadata (i.e. the table) in Hive based on the description.
E.g.: if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make a note of the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be copied to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;

How do I partition in hive by a specific column?

I have 3 columns: user, datetime, and data
My data is space-delimited and each row is delimited by a newline.
Right now I'm using the RegexSerDe to read in my input; however, I want to partition by user. If I do that, user can no longer be a column, correct? If so, how do I load my data into my tables?
In Hive each partition corresponds to a folder in HDFS. You can reload the data from your unpartitioned Hive table into a new partitioned Hive table using a create-table-as-select (CTAS) statement. See https://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-CreateTable for more details.
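In practice (and only as a sketch, with made-up table names), the repartitioning can also be done with a separate CREATE TABLE plus a dynamic-partition INSERT rather than a single CTAS:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE TABLE events_by_user (`datetime` STRING, data STRING)
PARTITIONED BY (`user` STRING);
-- the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE events_by_user PARTITION (`user`)
SELECT `datetime`, data, `user`
FROM events_raw;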
You can organize the data in HDFS in sub-directories under the table's directory; each directory name has to be in the format PART_NAME=PART_VALUE.
If your data is split into files where each file contains only one "user", just create directories corresponding to the usernames (e.g. USERNAME=XYZ) and put all the files for that username in its directory.
Next you can create an external table with partitions (see the sketch below).
The only problem is that you'll still have to define the column "user" that's in your data (but you can just ignore it) and query by the partition column (USERNAME), which provides the needed partition pruning.
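A minimal sketch of that external table, assuming the directory layout above and a plain space delimiter (the table name, columns, and location are placeholders):
CREATE EXTERNAL TABLE user_events (`user` STRING, `datetime` STRING, data STRING)
PARTITIONED BY (USERNAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/user_events';
-- register each username directory as a partition
ALTER TABLE user_events ADD PARTITION (USERNAME='XYZ')
LOCATION '/data/user_events/USERNAME=XYZ';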
