Unicode field separator to create table in databricks - azure-databricks

We are getting a \u318a (ㆊ) separated CSV file. We want to create an unmanaged table in Databricks. Here is the table creation script:
create table IF NOT EXISTS db_test_raw.t_data_otc_poc
(`caseidt` String,
`worktype` String,
`doctyp` String,
`brand` String,
`reqemailid` String,
`subprocess` String,
`accountname` String,
`location` String,
`lineitems` String,
`emailsubject` String,
`createddate` string,
`process` String,
`archivalbatchid` String,
`createddt` String,
`customername` String,
`invoicetype` String,
`month` String,
`payernumber` String,
`sapaccountnumber` String,
SOURCE_BUSINESS_DATE Date)
USING CSV
OPTIONS (header 'true', encoding 'UTF-8', quote '"', escape '"', delimiter '\u318a',
path 'abfss://xxxx#yyyyy.dfs.core.windows.net/Raw/OPERATIONS/BUSINESSSERVICES/xxx/xx_DATA_OTC')
PARTITIONED BY (SOURCE_BUSINESS_DATE)
The table was created successfully in Databricks.
While checking (DESCRIBE TABLE EXTENDED db_test_raw.t_data_otc_poc), we found the storage properties as
[encoding=UTF-8, quote=", escape=", header=true, delimiter=?]. The delimiter got changed.
Can you please let us know what went wrong here?
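One way to narrow this down (a minimal sketch, untested; the table name db_test_raw.t_data_otc_poc_check is hypothetical) is to create the same table with the separator given as the literal character rather than the escape sequence, and compare what DESCRIBE reports:

-- Same definition as above, but with the delimiter passed as the literal character.
-- If DESCRIBE TABLE EXTENDED then shows the character correctly, the issue is in how
-- the '\u318a' escape sequence was interpreted (or merely how it is displayed).
CREATE TABLE IF NOT EXISTS db_test_raw.t_data_otc_poc_check
(`caseidt` String, `worktype` String, SOURCE_BUSINESS_DATE Date) -- remaining columns as above
USING CSV OPTIONS (header 'true', encoding 'UTF-8', quote '"', escape '"', delimiter 'ㆊ',
path 'abfss://xxxx#yyyyy.dfs.core.windows.net/Raw/OPERATIONS/BUSINESSSERVICES/xxx/xx_DATA_OTC')
PARTITIONED BY (SOURCE_BUSINESS_DATE);
DESCRIBE TABLE EXTENDED db_test_raw.t_data_otc_poc_check;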

Related

Skip first character from CSV file in Oracle SQL*Loader control file

How do I skip the first character?
Here is the CSV file that I want to load:
H
B"01","Mosco"
B"02","Delhi"
T
Here is the control file:
LOAD DATA
INFILE 'capital.csv'
APPEND
INTO TABLE CAPITALS
WHEN (01)='B'
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
(
ID,
CAPITAL
)
When I run this, the 'B' gets loaded as part of the data.
The table should look like this (screenshot): https://i.stack.imgur.com/2U3Vo.png
How do I skip the 'B'?
Disregard the first character. Can you have the source put a comma after the record type indicator?
If so, do this to ignore it:
(
RECORD_IND FILLER,
ID,
CAPITAL
)
If not, this should take care of it in your situation:
ID "SUBSTR(:ID, 2)",

Loading data from txt into an external Hive table and rows with special characters are being inserted

CREATE EXTERNAL TABLE EUROPEANSITES(
SITEID STRING,
SITES_IN_COUNTRY STRING,
EMP_INCO_INCNTRY STRING,
PC_IN_COUNTRY STRING,
PREFERRED_WAN_PROVIDER STRING,
REG_CODE STRING,
NAF_CODE_REV2 STRING,
NUTS2_CODE STRING,
NUTS2_DESC STRING,
NUTS3_CODE STRING,
NUTS3_DESC STRING,
NUTS4_CODE STRING,
NUTS4_DESC STRING,
TURNOVER_CODE STRING,
TURNOVER_LOCAL STRING,
TURNOVER_EUROS STRING,
VAT_CODE STRING,
NACE1_CODE STRING,
NACE1_DESC STRING,
NACE2_CODE STRING,
NACE2_DESC STRING,
NACE3_CODE STRING,
NACE3_DESC STRING,
NACE4_CODE STRING,
NACE4_DESC STRING,
ENT_NACE3_CODE STRING,
ENT_NACE3_DESC STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1')
;
This is the script I am using to load a '\t'-delimited text file, but when I load the data and query the table I see alternate rows with special characters.
When I verified the file I did not see any special characters.
The data in the table looks like the screenshot linked in the original question.
I tried cleaning up the file and it worked. I used this command:
sed $'s/[^[:print:]\t]//g' filename.csv > tgt_filename.csv
Then I used the OpenCSVSerde to avoid the "".
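For reference, assuming "openserde" refers to Hive's OpenCSVSerde, the table definition might look roughly like this (a sketch, untested; only the first and last columns are shown, the rest follow the original DDL):
CREATE EXTERNAL TABLE EUROPEANSITES(
SITEID STRING,
SITES_IN_COUNTRY STRING,
-- remaining columns exactly as in the original CREATE TABLE
ENT_NACE3_DESC STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\t',
'quoteChar' = '"'
)
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');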

Hive table delimited by comma and multiple spaces

I have a similar question to this one:
Hive table source delimited by multiple spaces
My data looks like this:
AL, 01, 2016010700, , BEST, 0, 266N, 753W
AL, 01, 2016010706, , BEST, 0, 276N, 747W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
That means my column delimiter is "a comma plus a variable number of spaces".
I tried to simply modify field.delim by adding this comma to the regex, but it doesn't work.
The result is that all data gets put into the first column (basin) and all other columns are NULL.
CREATE EXTERNAL TABLE IF NOT EXISTS default.myTable1
(
basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"=",\\s+")
LOCATION '/data';
I am running HDP 2.5 (Hive 1.2.1).
Thanks for any help and suggestions.
We have two approaches to solve your problem.
First, create a table 'rawTbl' using the below option:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
and use trim() to remove the spaces:
Insert into baseTbl select trim(basin), trim(cy),...., from rawTbl
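A sketch of that first approach (untested; rawTbl and baseTbl are the names from this answer, and baseTbl is assumed to already exist with the columns from the question):
-- Staging table: split on the comma only; the padding spaces stay in the values.
CREATE EXTERNAL TABLE IF NOT EXISTS default.rawTbl (
basin string, cy string, yyyymmddhh string, technum_min string,
tech string, tau string, lat_n_s string, lon_e_w string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data';

-- Target load: trim the leading spaces from every column.
INSERT INTO TABLE baseTbl
SELECT trim(basin), trim(cy), cast(trim(yyyymmddhh) as int), trim(technum_min),
trim(tech), trim(tau), trim(lat_n_s), trim(lon_e_w)
FROM rawTbl;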
Or you can use a regex.
I have updated the answer with a regex that splits the text input file into the requested fields. The regex contains 7 capture groups, capturing the requested fields on each line.
CREATE EXTERNAL TABLE tableex(basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '^([A-Za-z]{2}),\s+(\d{2}),\s(\d{10}),\s+,\s([A-Z]{4}),\s+(\d{1}),\s+(\d{3}[A-Z]{1}),\s+(\d+[A-Z]{1})'
)
LOCATION '/data';
How about this:
(\S+),\s+(\S+),\s(\S+),\s+,\s(\S+)\s+(\S+),\s+(\S+),\s+(\S*)

Multiple rows in single field not getting loaded | SQL Loader | Oracle

I need to load from a CSV file into an Oracle table.
The problem I am facing is that the DESCRIPTION field has multiple lines in it.
The solution I am using for it is the enclosure string " (double quotes).
I am using KSH to call sqlldr.
I am getting the following two problems:
The row whose DESCRIPTION has multiple lines is not getting loaded: the record terminates right there, and the values of the further fields/columns are not visible to the loader. ERROR: second enclosure string not present (obviously the closing " is not found).
The second line (and the lines beyond that) of the DESCRIPTION field is being treated as a new row in itself and is getting loaded as such. It is garbage data.
CONTROL File:
OPTIONS(SKIP=1)
LOAD DATA
BADFILE '/home/fimsctl/datafiles/inbound/core_po/logs/core_po_data.bad'
DISCARDFILE '/home/fimsctl/datafiles/inbound/core_po/logs/core_po_data.dsc'
APPEND INTO TABLE FIMS_OWNER.FINANCE_PO_INBOUND_T
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
PO_NUM,
CREATED_DATE "to_Date(:CREATED_DATE,'mm/dd/yyyy hh24:mi:ss')",
PO_TYPE,
PO_STATUS,
NOTREQ1 FILLER,
NOTREQ2 FILLER,
PO_VALUE,
LINE_ITEM_NUMBER,
QUANTITY,
LINE_ITEM_DESCRIPTION,
RATE_VALUE,
CURRENCY_CODE,
UOM_ID,
PO_REQUESTER_WWID,
QUANTITY_ORDERED,
QUANTITY_RECEIVED,
QUANTITY_BILLED terminated by whitespace
)
CSV File Data:
COL1,8/4/2014 5:52,COL3,COL4,COL5,,,COL8,COL9,"Description Data",COL11,COL12,COL13,COL14,COL15,COL16,COL17
COL1,8/4/2014 8:07,COL3,COL4,COL5,,,COL8,COL9,,"GE MAKE 1X250 WATT HPSV NON INTEGRAL IP-65 **[NEWLINE HERE]**
DIE-CAST ALUMINIUM FIXTURE COMPLETE SET **[NEWLINE HERE]**
WITH SEPRATE CONTROL GEAR BOX WITH CHOKE, **[NEWLINE HERE]**
IGNITOR, CAPACITOR & LAMP-T",COL11,COL12,COL13,COL14,COL15,COL16,COL17
COL1,8/4/2014 8:13,COL3,COL4,COL5,,,COL8,COL9,"Description Data",COL11,COL12,COL13,COL14,COL15,COL16,COL17

Ruby Code Comment Within Quotes

I have a multi-line SQL command string in my Ruby script. I am adding some extra lines to the SQL command string, and want to supplement it with some in-line comments.
mysql.query("CREATE TABLE If NOT EXISTS #{table}(
application varchar(255),
eventType varchar(255),
eventTs datetime,
eventDayWeek int,
newColumnHere int, #Hello, I would like to be a comment
eventHourDay int,
....)")
How does one add code comments within a set of quotes?
MySQL does support comment syntax so your code should work as is. However, I would prefer to use a "heredoc":
mysql.query <<END
CREATE TABLE If NOT EXISTS #{table}(
application varchar(255),
eventType varchar(255),
eventTs datetime,
eventDayWeek int,
newColumnHere int, #Hello, I would like to be a comment
eventHourDay int,
....)
END
You could just break the string in two, or alternatively include an SQL comment.
For the first option:
"CREATE TABLE ...
newColumnHere int, " +
# comment in ruby here
"eventHourDay int, ...
Or the second option:
newColumnHere int, -- SQL comments from double dash to end of line
eventHourDay int,
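Either way, with the second option the statement MySQL ultimately receives carries the comment inside it, roughly like this (a sketch; my_table stands in for the interpolated #{table}):
CREATE TABLE IF NOT EXISTS my_table(
application varchar(255),
eventType varchar(255),
newColumnHere int, -- SQL comments run from the double dash to the end of the line
eventHourDay int
);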
