Apache Pig store delimiters - hadoop

I'm using Pig Latin to store values from an alias into HDFS. The alias contains a semicolon in one of its fields.
dump A;
(Richard & John, 1993)
(Albert, 1994)
Here is how the data appears in the Hive table over HDFS; the semicolon pushes "John" into the next column:
| Name | Year |
|--------------|------|
| Richard &amp | John |
| Albert | 1994 |
Trying to use STORE like this also does not work as expected:
STORE A INTO '/user/hive/warehouse/test.db/names' using PigStorage('\t');
Even when I tell PigStorage to use tab as the delimiter, the semicolon still breaks the table data. How can I fix it?

I created a file locally, say a.txt, and copied your data into it:
(Richard & John, 1993)
(Albert, 1994)
Your data is not tab-delimited, which is why it gets split at the semicolon instead. To solve this I just wrote a query like this:
data = load '/home/hduser/Desktop/a.txt' using PigStorage(',');
dump data;
and my output is this:
((Richard & John, 1993))
((Albert, 1994))
I split on , because that appears to be the delimiter your data uses.
Note: I ran this on my local file system. To run it locally, start Pig with pig -x local and give the relevant path.

It turned out the problem was in the CREATE TABLE statement:
create table test.names
(
name varchar(40),
year varchar(40)
)
row format delimited fields terminated by '\073'
lines terminated by '\n';
The delimiter I used was \073 (semicolon), so changing the PigStorage delimiter had no effect.
I'm now using \072 (colon) and it is working. I think any other delimiter would work, as long as it is not a character that can appear in the input data.
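As a rough sketch of the fix (reusing the table layout and paths from above; the colon below is just the character that octal \072 denotes), recreate the Hive table with the new delimiter and store from Pig with the matching character:

create table test.names
(
name varchar(40),
year varchar(40)
)
row format delimited fields terminated by '\072'
lines terminated by '\n';

STORE A INTO '/user/hive/warehouse/test.db/names' USING PigStorage(':');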

Related

Hadoop Hive: Generate Table Name and Attribute Name using Bash script

In our environment we do not have access to the Hive metastore to query it directly.
I need to generate table name / column name pairs for a set of tables dynamically.
I was trying to achieve this by running "describe extended $tablename" into a file for all tables and picking the table name and column name pairs out of that file.
Is there an easier way to do this?
The desired output is like:
table1|col1
table1|col2
table1|col3
table2|col1
table2|col2
table3|col1
This script prints the columns in the desired format for a single table. awk parses the output of the describe command, takes only the column name, concatenates it with "|" and the table_name variable, and prints each pair on its own line.
#!/bin/bash
#Set table name here
TABLE_NAME=your_schema.your_table
TABLE_COLUMNS=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};" | awk -v table_name="${TABLE_NAME}" -F " " 'f&&!NF{exit}{f=1}f{printf c table_name "|" toupper($1)}{c="\n"}')
You can easily modify it to generate output for all tables, for example by looping over the output of the show tables command, as sketched below.
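A minimal sketch of that extension, assuming a schema named your_schema (a placeholder) and reusing the awk pipeline from above:

#!/bin/bash
# Placeholder schema name; replace with your own
SCHEMA=your_schema
# List the tables in the schema, describe each one, and print table|COLUMN pairs
for T in $(hive -S -e "show tables in ${SCHEMA};"); do
  hive -S -e "set hive.cli.print.header=false; describe ${SCHEMA}.${T};" \
    | awk -v table_name="${SCHEMA}.${T}" -F " " 'f&&!NF{exit}{f=1}f{print table_name "|" toupper($1)}'
done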
The easier way is to access the metastore database directly.
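For completeness (the original poster cannot do this), with read access to a MySQL-backed metastore using the standard schema, a query along these lines returns the same pairs:

SELECT CONCAT(t.TBL_NAME, '|', c.COLUMN_NAME)
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE d.NAME = 'your_schema'
ORDER BY t.TBL_NAME, c.INTEGER_IDX;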

String and non string data getting converted to 'null' for empty fields while exporting into Oracle table through hive

I am new to Hadoop and I have a scenario where I have to export a dataset/file from HDFS to an Oracle table using sqoop export. The file has literal 'null' values in it, so the same is getting exported into the table as well. How can I replace 'null' with blank in the database while exporting?
You can create a TSV file from Hive/Beeline, and in that process you can turn nulls into blanks with --nullemptystring=true.
Example: beeline -u ${hiveConnectionString} --outputformat=csv2 --showHeader=false --silent=true --nullemptystring=true --incremental=true -e 'set hive.support.quoted.identifiers=none; select * from someSchema.someTable where whatever > something' > /your/local/or/edge-node/path/exportingfile.tsv
You can then use the created file with sqoop export to export to the Oracle table.
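As an additional sketch (not part of the original answer), sqoop export itself can interpret the literal string null as SQL NULL via --input-null-string / --input-null-non-string, which for Oracle varchar columns ends up the same as blank (see the note on Oracle below); the connection details and names here are placeholders:

sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username your_user -P \
  --table YOUR_TABLE \
  --export-dir /user/you/export_dir \
  --input-fields-terminated-by '\t' \
  --input-null-string 'null' \
  --input-null-non-string 'null'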
You can also replace the nulls with blanks in the file with Unix sed.
Ex: sed -i 's/null//g' /your/local/or/edge-node/path/exportingfile.tsv
In Oracle, empty strings and nulls are treated the same for varchars; that is why Oracle internally converts empty strings into nulls for varchar. When '' is assigned to a char(1) it becomes ' ' (char types are blank-padded strings). See what Tom Kyte says about this: https://asktom.oracle.com/pls/asktom/f?p=100:11:0%3a%3a%3a%3aP11_QUESTION_ID:5984520277372
See this manual: https://www.techonthenet.com/oracle/questions/empty_null.php

How to handle a delimiter in Hive

How do we handle data in Hive when \t appears in a value and the field delimiter is also \t? Suppose, for example, there is a column Street of type string with the value XXX\tYYY, and the table was created with \t as the field delimiter. How will the delimiter work? Will the \t inside the value also be treated as a delimiter?
If your columns with \t values are enclosed by a quote character such as ", then you could use csv-serde to parse the data, like this:
Here is a sample dataset that I have loaded:
R1Col1 R1Col2 "R1Col3 MoreData" R1Col4
R2Col2 R2Col2 "R2Col3 MoreData" R2Col4
Register the jar from hive console
hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
Create a table with the specified serde and custom properties
hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'
> with serdeproperties(
> "separatorChar" = "\t",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile;
Load your dataset into the table:
hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
Do a select * from test_table to check the results
You could download the csv-serde from here.
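For reference (my reading of the serde's behaviour, not output quoted from the original answer), the select over the sample rows should come back as four columns per row, with the quoted field kept together and its quotes stripped, roughly:

R1Col1  R1Col2  R1Col3 MoreData  R1Col4
R2Col2  R2Col2  R2Col3 MoreData  R2Col4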
It will treat it as a delimiter, yes, same as if you had a semicolon ; in the value and told it to split on semicolon - when the text is scanned, it will see the character and interpret it as the edge of the field.
To get around this, I have either used sed to find-and-replace characters before loading into Hive, or created the Hive table with a different delimiter, or left it at the default ^A (\001) and then, when extracting, used sed on the output to replace the \001 with commas, tabs, or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.
Is there a reason you chose to make the table with \t as the delimiter, instead of the default Hive field delimiter of ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good character for delimiting.
We have faced the same issue in our data loads into Hadoop clusters. What we did: we escaped the delimiter as \\t whenever it appeared inside a data field, and added the below to the table definition.
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n'
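As a sketch of a full table definition this clause would live in (the table and column names are made up for illustration):

CREATE TABLE addresses (
  id INT,
  street STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;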

Loading data using Hive Sed command

I have my data in this format:
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in "" and delimited by ;. The book name may also contain ';' in the middle.
Can you tell me how to load this data from the file into a Hive table?
The query below, which I am using now, obviously does not work:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible I want the userid and year fields to be stored as int.
Also, I don't want to use a RegexSerDe-based approach.
How can I use the Unix sed command to clean the data and get my output?
I have tried to learn about sed and found the replace option, so I can remove the " double quotes. But how can I handle the extra semicolon that comes in the middle of the data?
I think you can preprocess with sed and then use the MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' file
This sed matches the quote pairs so it does not touch what is between quotes: it first puts a placeholder on the semicolons that are outside quoted text, then replaces the remaining ;'s (the ones inside the book title) with a space, and finally puts back the semicolons that are outside quotes.
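For example (my own quick check, with the sample data saved in a hypothetical books.txt), the command keeps the field-separating semicolons and only blanks out the one inside the title:

$ sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' books.txt
"123";"mybook1";"2002";"publisher1";
"456";"mybook2 the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";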
See here for more how to load data using Hive including an example of MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
https://svn.apache.org/repos/asf/hive/trunk/serde/README.txt
create external table books (isbn int, title string, year int, publisher string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = '\;', 'quoteChar' = '\"')
location 'S3 path/HDFS path for the file';

How to find the number of columns within a row key in hbase

How do I find the number of columns within a row key in HBase (since a row can have many columns)?
I don't think there's a direct way to do that as each row can have a different number of columns and they may be spread over several files.
If you don't want to bring the whole row to the client to perform the count there, you can write an endpoint coprocessor (HBase's version of a stored procedure, if you like) to perform the calculation on the region server side and return only the result. You can read a little about coprocessors here.
There is a simple way:
Use the hbase shell to scan through the table and write the output to an intermediate text file. Because the hbase shell output puts each column of a row on a new line, we can just count the lines in the text file (minus the first 6 lines, which are hbase shell standard output, and the last 2 lines).
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell > row.txt
wc -l row.txt
Make sure to select the appropriate row keys, as the borders are not inclusive.
If you are only interested in specific columns (or families), apply filters in the hbase shell command above (e.g. FamilyFilter, ColumnRangeFilter, ...), as sketched below.
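A minimal sketch of restricting the scan (the column family name cf1 is a placeholder; COLUMNS and FILTER are standard scan options in the hbase shell):

# count only the columns in one column family
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow', COLUMNS=>['cf1']}" | hbase shell > row.txt
wc -l row.txt
# or use a filter expression instead
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow', FILTER=>\"FamilyFilter(=, 'binary:cf1')\"}" | hbase shell > row.txt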
Thanks to user3375803; actually you don't have to use an external txt file. Because I cannot comment on your answer, I leave my answer below:
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell | wc -l | awk '{print $1-8}'
