Postgres: After converting from bytea to varchar '\r' remains - windows

I have a table that stores XML files as binary data. The XML contains "\r\n" line endings, which appear as "\015\012" in the bytea column. I need to change the column type from bytea to varchar.
I run:
ALTER TABLE my_table ALTER COLUMN xml_data TYPE VARCHAR;
UPDATE my_table SET xml_data = convert_from(xml_data::bytea, 'UTF8');
This works on Linux. On Windows, however, '\015' is converted to the two literal characters "\r" (a backslash followed by r), so the result looks like this:
<field>...</field>\r
<field>...</field>
Is there a proper way to convert binary data to UTF-8?

You'll have to strip the carriage returns in a separate step.
If you are ok with getting rid of them wholesale, I suggest something like:
ALTER TABLE my_table
ALTER xml_data TYPE text
USING replace(
convert_from(xml_data, 'UTF8'),
E'\r',
''
);
Is there a good reason for using data type varchar (or text, which is the same) rather than xml?
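As a rough illustration of the two steps in the answer above (decode the bytes as UTF-8, then drop the carriage returns), here is a minimal Python sketch. The helper name bytea_to_text and the sample bytes are made up for this example; it only mirrors the logic and is not part of PostgreSQL.

```python
def bytea_to_text(raw: bytes) -> str:
    """Decode UTF-8 bytes and remove every carriage return (hypothetical helper)."""
    return raw.decode("utf-8").replace("\r", "")


# Sample bytes with Windows line endings, invented for illustration.
sample = b"<field>a</field>\r\n<field>b</field>\r\n"
print(repr(bytea_to_text(sample)))
```

Like the SQL version, this removes every carriage return, including any that are intentionally part of the content.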

Related

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table.
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
However, the quotes around each field become part of the field value, as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT; currently it becomes NULL when I declare the third column as INT when creating the table.
You will have to use Csv-Serde for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
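To see what the separator and quote settings above mean for the sample rows, here is a small Python sketch that uses the standard csv module with the same two characters. It only illustrates the parsing rule; the function name parse_rows and the SAMPLE constant are invented for this example, and the integer conversion is done explicitly in Python rather than by Hive.

```python
import csv
import io

# Two rows copied from the question's sample data.
SAMPLE = '"153662";"0002241447";"0"\n"153662";"000647036X";"0"\n'


def parse_rows(text):
    """Split on ';', strip the surrounding double quotes, cast the third field to int."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    return [(a, b, int(c)) for a, b, c in reader]


for row in parse_rows(SAMPLE):
    print(row)
```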
There are multiple ways to achieve this:
Use the CSV SerDe.
Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\"".
Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"',''), REGEXP_REPLACE(b,'"',''), CAST ( REGEXP_REPLACE(c,'"','') AS BIGINT) FROM source;
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

loading data in table using SQL Loader

I'm loading data into my table with SQL Loader.
The load succeeds, but I'm getting the same garbage value in one particular column for every row.
After inserting:
the column TERM_AGREEMENT has the value '806158336' for every record.
My CSV file contains at most 3 digits in that column, but I'm forced to define the column as Number(10).
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
**TERM_AGREEMENT INTEGER**
)
create table LOAN_BALANCE_MASTER_INT
(
ACCOUNT_NO NUMBER(30),
CUSTOMER_NAME VARCHAR2(70),
LIMIT NUMBER(30),
PRODUCT_DESC VARCHAR2(30),
SUBPRODUCT_CODE NUMBER,
ARREARS_INT NUMBER(20,2),
IRREGULARITY NUMBER(20,2),
PRINCIPLE_IRREGULARITY NUMBER(20,2),
**TERM_AGREEMENT NUMBER(10)**
)
INTEGER is a binary datatype. A CSV file stores numbers as plain text, so you should use INTEGER EXTERNAL instead. The EXTERNAL clause specifies character data that represents a number.
Edit:
The issue seems to be the record terminator of the file. You should be able to solve it by editing the INFILE line this way:
INFILE '/ipoapplication/utl_file/LBR_HE_Mar16.csv' "STR X'5E204D'"
Here '5E204D' is the hexadecimal for '^ M'. Note that this is the literal three-character string (caret, space, M); an actual carriage return followed by a line feed would be X'0D0A'. To get the hexadecimal value you can use the following query:
SELECT utl_raw.cast_to_raw ('^ M') AS hexadecimal FROM dual;
Hope this helps.
I actually solved this issue on my own.
First, thanks to @Gary_W and @Alessandro for their input. I really appreciate your help, and I learned some new things in the process.
Here's the new fragment that worked and gave me the correct data for the last column:
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
**TERM_AGREEMENT INTEGER TERMINATED BY WHITESPACE**
)
'Terminated by whitespace': I went through some SQL Loader threads and used 'terminated by whitespace' on the last column of my ctl file. It worked, and this time I didn't even have to use 'INTEGER', 'EXTERNAL', or EXPRESSION '..' for the conversion.
Just one thing: can you tell me what could have been causing the issue? What was in that column of my CSV file, and why did adding this clause solve it?
Thanks.
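On the closing question, one possible explanation (an inference from the reported value, not something confirmed in this thread) is that the file has Windows line endings and the binary INTEGER type read the raw bytes of the last field instead of its text. The Python sketch below assumes a 4-byte big-endian integer and a field consisting of the character 0 followed by a carriage return and two zero bytes. Under those assumptions the bytes decode to exactly the value reported in the question.

```python
# Assumed raw bytes of the last field: "0", a carriage return, then zero padding.
raw_field = b"0\r\x00\x00"

# Reading the bytes as a 4-byte big-endian binary integer (assumed layout).
as_binary = int.from_bytes(raw_field, byteorder="big")
print(as_binary)  # 806158336

# Reading the same field as text and stripping padding and whitespace.
as_text = int(raw_field.rstrip(b"\x00").decode("ascii").strip())
print(as_text)  # 0
```

If that reading is right, it would also fit with TERMINATED BY WHITESPACE helping, since a trailing carriage return would then end the field instead of being loaded as part of it.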

How to trim the new line character of column data in ctl file of SQL Loader

My table data contains a newline character after being loaded through a SQL Loader ctl file. One column, 'IPADDRESS', is loaded with the newline character included:
My ctl file :
load data
INFILE 'abc.txt'
INTO TABLE TABLENAME
APPEND
FIELDS TERMINATED BY '\|'
(MAKE,
CUST_ID "UPPER(:CUST_ID)",
IPADDRESS "REGEXP_REPLACE(:IPADDRESS, '\\.\\D+', '', 1, 0)"
)
Example of the data stored in the table:
Make CUST_ID IPADDRESS
------------------------------
C MPG-VG-ALG01 "9.7.69.37
"
C MPG-VG-ALG03 "9.7.69.39
"
Sample input file data :
C|mpg-vg-alg01.gdl.mex.ibm.com|9.7.69.37
C|mpg-vg-alg03.gdl.mex.ibm.com|9.7.69.39
C|mpg-vg-alg04.gdl.mex.ibm.com|9.7.69.23
The answer to my question is: column_name "REPLACE(:column_name,CHR(13),'')"
Yes, the REPLACE() function is one option, but a few additions are needed:
specify CHAR(data_length) for any string column, even if it is of type VARCHAR2
handle CHR(10) (line feed) as well as CHR(13) (carriage return)
nest TRIM() inside REPLACE() to guard against stray surrounding spaces
omit the third argument of REPLACE(), which is redundant here
such as
column_name CHAR(4000) "REPLACE(TRIM(:column_name),CHR(13)||CHR(10))"
moreover
column_name CHAR(4000) "TRANSLATE(TRIM(:column_name),CHR(13)||CHR(10),' ')"
might be used as an alternative.
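The two expressions above behave differently, which a small Python emulation can make visible. This is only a sketch of the string semantics (REPLACE with a two-character search string removes that exact pair; TRANSLATE maps the first character to a space and drops the second); the function names are invented and nothing here calls Oracle.

```python
def oracle_like_replace(value: str) -> str:
    """Emulates REPLACE(TRIM(x), CHR(13)||CHR(10)): trim spaces, remove CR LF pairs."""
    return value.strip(" ").replace("\r\n", "")


def oracle_like_translate(value: str) -> str:
    """Emulates TRANSLATE(TRIM(x), CHR(13)||CHR(10), ' '): CR becomes a space, LF is dropped."""
    table = str.maketrans({"\r": " ", "\n": None})
    return value.strip(" ").translate(table)


for field in ("9.7.69.37\r\n", "9.7.69.37\r"):
    print(repr(oracle_like_replace(field)), repr(oracle_like_translate(field)))
```

In this emulation a field that ends in a lone carriage return is left unchanged by the pair-based REPLACE, while the single-character form REPLACE(:column_name,CHR(13),'') from the first answer would remove it.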

How to specify custom string for NULL values in Hive table stored as text?

When storing Hive table in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, also for numbers NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN'. I know this substitution can be done with streaming, but is there any way to do it cleanly in Hive instead of writing extra code?
Other info:
I'm using Hive 0.8 if that matters...
Use this property while creating the table
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
Sorry, I did not read your question carefully.
If you want an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;
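To make the idea of a custom null marker concrete, here is a toy Python serializer for delimited text rows. It is not Hive's SerDe; the function name serialize_row and its parameters are hypothetical, and it only shows the difference between writing '\N' and writing an empty string for missing values, which is what the "serialization.null.format" property in the first answer is meant to control.

```python
def serialize_row(values, delimiter="\t", null_marker="\\N"):
    """Join values into one text line, writing null_marker for None (toy example)."""
    return delimiter.join(null_marker if v is None else str(v) for v in values)


row = [1, None, "x"]
print(repr(serialize_row(row)))                  # null written as \N
print(repr(serialize_row(row, null_marker="")))  # null written as an empty string
```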

getting null values while loading the data from flat files into hive tables

I am getting null values while loading data from flat files into Hive tables.
My table structure is like this:
hive> create table test_hive (id int,value string);
and my flat file is like this:
input.txt
1 a
2 b
3 c
4 d
5 e
6 F
7 G
8 j
when I am running the below commands I am getting null values:
hive> LOAD DATA LOCAL INPATH '/home/hduser/input.txt' OVERWRITE INTO TABLE test_hive;
hive> select * from test_hive;
OK
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
screen shot:
hive> create table test_hive (id int,value string);
OK
Time taken: 4.97 seconds
hive> show tables;
OK
test_hive
Time taken: 0.124 seconds
hive> LOAD DATA LOCAL INPATH '/home/hduser/input2.txt' OVERWRITE INTO TABLE test_hive;
Copying data from file:/home/hduser/input2.txt
Copying file: file:/home/hduser/input2.txt
Loading data to table default.test_hive
Deleted hdfs://hydhtc227141d:54310/app/hive/warehouse/test_hive
OK
Time taken: 0.572 seconds
hive> select * from test_hive;
OK
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
NULL NULL
Time taken: 0.182 seconds
The default field terminator in Hive is ^A. You need to explicitly mention in your create table statement that you are using a different field separator.
Similar to what Lorand Bending pointed out in the comments, use:
CREATE TABLE test_hive(id INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
You don't need to specify a location since you are creating a managed table (and not an external table).
The problem is that the fields in your data are separated by ' ', but you did not specify a field delimiter when creating the table. If no field delimiter is specified when a Hive table is created, Hive uses ^A as the delimiter by default.
To resolve the problem, recreate the table with the syntax below and it should work.
CREATE TABLE test_hive(id INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
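The explanation above can be mimicked in a few lines of Python. This is a simplified model of delimited-text parsing for a two-column (int, string) table, not Hive's actual code; the function name parse_row is invented. It shows why a line such as "1 a" yields two NULLs when split on ^A ('\x01') and the expected values when split on a space.

```python
def parse_row(line, delimiter):
    """Split a line into (id, value); unparsable or missing fields become None."""
    fields = line.split(delimiter)
    fields += [None] * (2 - len(fields))  # pad when the line has too few fields
    raw_id, value = fields[0], fields[1]
    try:
        row_id = int(raw_id)
    except (TypeError, ValueError):
        row_id = None  # "1 a" is not a valid integer
    return row_id, value


print(parse_row("1 a", "\x01"))  # whole line lands in the first column
print(parse_row("1 a", " "))     # fields split as intended
```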
The solution is quite simple: the table wasn't created the right way.
The simple solution for this problem, and for similar ones, is knowing how to load the data.
CREATE TABLE [IF NOT EXISTS] mytableName(id int,value string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
Now let me explain the code:
First line
Creates your table. The [IF NOT EXISTS] clause is optional; it tells Hive not to overwrite the table if it already exists. It is more of a safety measure.
Second line
Specifies a delimiter at the table level for structured fields.
Third line
You can use any single character, but the default is '\001'.
'\t' is for a tab, which would apply in your case
'|' is for fields that sit beside each other and are separated by |
' ' is for a single space, and so on
Fourth line
Specifies the type of file in which the data is to be stored. The file can be a TEXTFILE, SEQUENCEFILE, RCFILE, or BINARY SEQUENCEFILE. Alternatively, how the data is stored can be specified as Java input and output classes.
When loading locally:
LOAD DATA LOCAL INPATH '/your/data/path.csv' [OVERWRITE] INTO TABLE myTableName;
Always check your data with a simple SELECT * statement.
Hope it helps.
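Hive's default record and field delimiters:
\n (record delimiter)
^A (field delimiter)
^B (collection item delimiter)
^C (map key delimiter)
In Vim, pressing Ctrl-V followed by Ctrl-A inserts a ^A character.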
Are the elements separated by a space or a tab? Assuming a tab, follow these steps. If they are separated by a space, use ' ' instead of '\t'.
hive> CREATE TABLE test_hive(id INT, value STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Then enter:
hive> LOAD DATA LOCAL INPATH '/home/hduser/input.txt' OVERWRITE INTO TABLE test_hive;
hive> select * from test_hive;
Now you will get exactly the output you expect.
Please check the date column in your dataset; it should follow the date format yyyy-mm-dd.
If the string is in the form 'yyyy-mm-dd', then a date value corresponding to that year/month/day is returned. If the string value does not match this format, then NULL is returned.
Hive Official documentation

Resources