Hive table delimited by comma and multiple spaces - hadoop

I have a question similar to this one:
Hive table source delimited by multiple spaces
My data looks like this:
AL, 01, 2016010700, , BEST, 0, 266N, 753W
AL, 01, 2016010706, , BEST, 0, 276N, 747W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
That means my column delimiter is "a comma plus a variable number of spaces".
I tried simply modifying field.delim by adding the comma to the regex, but it doesn't work.
The result is that all data ends up in the first column (basin), and all other columns are NULL.
CREATE EXTERNAL TABLE IF NOT EXISTS default.myTable1
(
basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"=",\\s+")
LOCATION '/data';
I am running HDP 2.5 (Hive 1.2.1).
Thanks for any help and suggestions.

There are two approaches to solve your problem.
First, create a table 'rawTbl' using the option below
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
and use trim() to remove the spaces:
INSERT INTO baseTbl SELECT trim(basin), trim(cy), ... FROM rawTbl
Or you can use a regex.
I have updated the answer with a regex that splits each line of the input file into the requested fields. The regex contains 7 capture groups, one for each non-empty field on a line.
CREATE EXTERNAL TABLE tableex(basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '^([A-Za-z]{2}),\s+(\d{2}),\s(\d{10}),\s+,\s([A-Z]{4}),\s+(\d{1}),\s+(\d{3}[A-Z]{1}),\s+(\d+[A-Z]{1})'
)
LOCATION '/data';
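Both approaches can be checked against a sample row outside Hive. The Python sketch below is only an illustration and not part of the Hive setup; the function names are made up for this example. It suggests that splitting on commas and trimming yields eight fields (including the empty technum_min), while the regex above captures seven groups.

```python
import re

SAMPLE_ROW = "AL, 01, 2016010700, , BEST, 0, 266N, 753W"

# Same pattern as the input.regex above, written as a Python raw string.
INPUT_REGEX = re.compile(
    r"^([A-Za-z]{2}),\s+(\d{2}),\s(\d{10}),\s+,\s([A-Z]{4}),"
    r"\s+(\d{1}),\s+(\d{3}[A-Z]{1}),\s+(\d+[A-Z]{1})"
)


def split_and_trim(row):
    """Approach 1: split on commas, then trim each field (like trim() in Hive)."""
    return [field.strip() for field in row.split(",")]


def regex_groups(row):
    """Approach 2: return the regex capture groups, or None if the row does not match."""
    match = INPUT_REGEX.match(row)
    return list(match.groups()) if match else None


if __name__ == "__main__":
    print(split_and_trim(SAMPLE_ROW))
    print(regex_groups(SAMPLE_ROW))
```

Trying a few real rows this way before creating the table is a cheap way to spot lines that the pattern would not match.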

How about this?
(\S+),\s+(\S+),\s(\S+),\s+,\s(\S+),\s+(\S+),\s+(\S+),\s+(\S*)

Related

Power Query - How to extract after delimiter

I have data in a column that needs to be split into two columns. For example:
1,000,1111,000 should become 1,000,111 and 1,000
1,1111,100 should become 1,111 and 1,100
and so on.
I need to separate these values. I assume the condition should be: "If there are four digits after a comma, split at that point into two columns."
It's not immediately obvious how I should do this. Any thoughts?
EDIT: Essentially, the criterion is: if the 4th character after any comma is not another comma, move everything from that 4th character onward into another column.
This query splits the text string into a list, using commas as delimiters; finds the list entry that is longer than 3 digits; inserts a semicolon after the 3rd digit of that entry; recombines the list into a comma-separated text string; and then splits that string into two columns, using the semicolon as the delimiter.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Custom1 = Table.TransformColumns(
        Source,
        {"Column1", each Text.Combine(
            List.Transform(Text.Split(_, ","), each if Text.Length(_) > 3 then Text.Insert(_, 3, ";") else _),
            ",")}),
    #"Split Column by Delimiter" = Table.SplitColumn(Custom1, "Column1", Splitter.SplitTextByDelimiter(";", QuoteStyle.Csv), {"Column1.1", "Column1.2"})
in
    #"Split Column by Delimiter"
I developed and tested this against a table named Table1, with the text in Column1.
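The same steps can be traced outside Power Query. This is a rough Python sketch of the logic, not a replacement for the query; the function name is invented for the example, and it assumes at most one entry is longer than 3 characters, as in the sample values.

```python
def split_after_three_digits(text):
    """Split a value like '1,000,1111,000' into ('1,000,111', '1,000')."""
    parts = text.split(",")
    # Insert a semicolon after the 3rd character of any entry longer than 3.
    marked = [p[:3] + ";" + p[3:] if len(p) > 3 else p for p in parts]
    combined = ",".join(marked)
    # Split on the semicolon; the second part is empty if nothing was marked.
    left, _, right = combined.partition(";")
    return left, right


if __name__ == "__main__":
    print(split_after_three_digits("1,000,1111,000"))
    print(split_after_three_digits("1,1111,100"))
```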

Skip first character from CSV file in Oracle sql loader control file

How do I skip the first character?
Here is the CSV file that I want to load
H
B"01","Mosco"
B"02","Delhi"
T
Here is the control file
LOAD DATA
INFILE 'capital.csv'
APPEND
INTO TABLE CAPITALS
WHEN (01)='B'
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
(
ID,
CAPITAL
)
When I run this, the 'B' is loaded along with the ID.
The table should look like
[![Table view][1]][1]
How do I skip the 'B'?
[1]: https://i.stack.imgur.com/2U3Vo.png
Disregard the first character. Can you have the source put a comma after the record type indicator?
If so, do this to ignore it:
(
RECORD_IND FILLER,
ID,
CAPITAL
)
If not, this should take care of it in your situation:
ID "SUBSTR(:ID, 2)",
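To make the intended result concrete, here is a small Python sketch of the same idea: keep only the 'B' records and drop the record-type character before parsing the fields. It is not SQL*Loader, just an illustration of what the load should produce; the function name is hypothetical.

```python
import csv


def load_detail_rows(lines, record_type="B"):
    """Keep lines starting with record_type, drop that character, parse as CSV."""
    detail = [line[1:] for line in lines if line.startswith(record_type)]
    # csv.reader handles the optional double-quote enclosure.
    return [tuple(row) for row in csv.reader(detail)]


if __name__ == "__main__":
    sample = ["H", 'B"01","Mosco"', 'B"02","Delhi"', "T"]
    for row in load_detail_rows(sample):
        print(row)
```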

How to Convert a string without any delimiter to a comma delimited string?

I have a file, details.txt, in which the data is stored in this format:
"571955NandhithaF1975-12-222011-12-06Mumbai"
The columns are a six-digit unique ID, name, gender (M/F), date of birth, joining date, and location.
I have to separate this into six columns using a comma delimiter.
Please help me with this problem.
Pass each line to a function that applies the regex logic below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitRecord {
    public static void main(String[] args) {
        String expression = "571955NandhithaF1975-12-222011-12-06Mumbai";
        // Groups: 6-digit id, name, gender (M or F), dob, joining date, location
        Pattern pattern = Pattern
            .compile("([0-9]{6})([a-zA-Z]+)([MF])([0-9]{4}-[0-9]{2}-[0-9]{2})([0-9]{4}-[0-9]{2}-[0-9]{2})([a-zA-Z0-9]+)");
        Matcher matcher = pattern.matcher(expression);
        if (matcher.find()) {
            //System.out.println(matcher.group());
            System.out.println(matcher.group(1));
            System.out.println(matcher.group(2));
            System.out.println(matcher.group(3));
            System.out.println(matcher.group(4));
            System.out.println(matcher.group(5));
            System.out.println(matcher.group(6));
        }
    }
}
output:
571955
Nandhitha
F
1975-12-22
2011-12-06
Mumbai
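Since the question asks for a comma-delimited line rather than one field per line, the groups still have to be joined. A minimal Python sketch of that last step, using the same pattern; the function name is made up, and it assumes every record follows the layout shown in the question:

```python
import re

# id (6 digits), name, gender (M or F), dob, joining date, location
RECORD = re.compile(
    r"([0-9]{6})([a-zA-Z]+)([MF])"
    r"([0-9]{4}-[0-9]{2}-[0-9]{2})([0-9]{4}-[0-9]{2}-[0-9]{2})"
    r"([a-zA-Z0-9]+)"
)


def to_csv_line(record):
    """Return the six fields joined by commas, or None if the record does not match."""
    match = RECORD.fullmatch(record.strip().strip('"'))
    return ",".join(match.groups()) if match else None


if __name__ == "__main__":
    print(to_csv_line("571955NandhithaF1975-12-222011-12-06Mumbai"))
```

Note that a name ending in a capital M or F would be ambiguous with the gender field; the pattern resolves this by taking the last M or F before the first date.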
571955NandhithaF1975-12-222011-12-06Mumbai
To split this type of data, you can use Java String functions in the map method of the mapper class.
Use substring(beginIndex, endIndex) to get the ID from the string, for example String id = line.substring(0, 6), where line holds the record. This returns the 6-character ID (the ID is fixed-length, and the end index is exclusive, so it must be 6).
Use substring(beginIndex), for example line.substring(6), to get the remaining string.
From there on, you need a regular expression in Java, along with split(regex), to get the name, gender, date of birth, joining date, and location, so some work in Java is still required.
See the Java String documentation for these functions.
I hope this post helps. Suggestions or modifications are welcome :)

NULL when casting string values to decimal in Hive

I'm using Hive 0.13, and a STRING column of my table contains values like 1.250,99.
I want to cast these values to decimal, so I must replace "." with "" and "," with ".", giving 1250.99.
This is my HQL expression:
cast(regexp_replace(regexp_replace(price, '\\.',''), ',','.') as decimal(18,6))
But it returns NULL, I suppose because the conversion does not succeed. What is the problem?
If I don't do the conversion, it returns the expected string.
UPDATE
My problem was that there was white space in the column, so the value could not be converted to decimal. I now apply trim() before doing the conversion.
Try this:
select cast(regexp_replace(regexp_replace('1.234,56','\\.',''),'\,','\.') as decimal(10,2));
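The order of operations matters here: trim first, then drop the thousands separator, then turn the decimal comma into a point. As an illustration only (this is Python, not HQL, and the function name is invented), the same sequence looks like this:

```python
from decimal import Decimal


def parse_price(raw):
    """Convert a string like ' 1.250,99 ' to Decimal('1250.99')."""
    cleaned = raw.strip()              # same role as trim() in Hive
    cleaned = cleaned.replace(".", "")  # remove thousands separators
    cleaned = cleaned.replace(",", ".")  # decimal comma -> decimal point
    return Decimal(cleaned)


if __name__ == "__main__":
    print(parse_price(" 1.250,99 "))
    print(parse_price("1.234,56"))
```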

Sql loader - second enclosure string not present

I am loading .csv file data into an Oracle table through SQL*Loader. One of the fields has a new line character (CRLF) in its data, so I am getting the error below:
second enclosure string not present
This is my control file
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(:Comments, 0, 1000)"
)
The COMMENTS field has a new line character in one of its records. Can anyone suggest a solution for this?
Thanks
If your last field is always present (though trailing nullcols suggests it isn't) and you have some control over the formatting, you can use the CONTINUEIF directive to treat the second line as part of the same logical record.
If the comments field is always present and enclosed in double-quotes then you can do:
...
truncate
continueif last != x'22'
into table ...
Which would handle data records like:
S;Y;"Test 1"
F;N;"Test 2"
P;Y;"Test with
new line"
P;N;""
Or if you always have a delimiter after the comments field, whether it is populated or not:
...
truncate
continueif last != ';'
into table ...
Which would handle:
S;Y;Test 1;
F;N;"Test 2";
P;Y;Test with
new line;
P;N;;
Both ways will load the data as:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test withnew line
P N
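To see why the new line disappears, it can help to model how physical lines are glued into logical records. The Python sketch below only imitates the idea behind continueif last != ';' and is not how SQL*Loader is implemented; the function name is made up.

```python
def assemble_records(lines, terminator=";"):
    """Join physical lines until one ends with the terminator."""
    records = []
    pending = ""
    for line in lines:
        text = line.rstrip("\r\n")
        pending += text  # lines are concatenated with nothing in between
        if text.rstrip().endswith(terminator):
            records.append(pending)
            pending = ""
    if pending:  # keep an unterminated final record instead of dropping it
        records.append(pending)
    return records


if __name__ == "__main__":
    data = ["S;Y;Test 1;", 'F;N;"Test 2";', "P;Y;Test with", "new line;", "P;N;;"]
    for record in assemble_records(data):
        print(record)
```

The two halves of the split comment are concatenated directly, which matches the "Test withnew line" value in the output above.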
But this loses the new line from the data. To keep that you need the terminating field delimiter to be present, and instead of CONTINUEIF you can change the record separator using the stream record format:
...
infile 'C:\Users\lab.csv' "str ';\n'"
truncate
into table ...
The "str ';\n'" defines the terminator as the combination of the field terminator and a new line character. Your split comment only has that combination on the final line. With the same data file as the previous version, this gives:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test with
new line
P N
4 rows selected.
Since you're on Windows you might have to include \r in the format as well, e.g. "str ';\r\n'", but I'm not able to check that.
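The effect of the stream record format can be pictured as splitting the whole file on the terminator string instead of on every line break. A rough Python sketch of that idea (again just a model with an invented function name, not SQL*Loader itself):

```python
def split_stream_records(data, terminator=";\n"):
    """Split file contents on the record terminator, keeping embedded new lines."""
    parts = data.split(terminator)
    # A file that ends with the terminator leaves one empty trailing piece.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    content = 'S;Y;Test 1;\nF;N;"Test 2";\nP;Y;Test with\nnew line;\nP;N;;\n'
    for record in split_stream_records(content):
        print(repr(record))
```

With terminator=";\r\n" the same function models the Windows variant mentioned above.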
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(REPLACE(REPLACE(:Comments,CHR(13)),CHR(10)), 0, 1000)"
)
Note: CHR(13) is the ASCII character for "carriage return" and CHR(10) is the ASCII character for "line feed" (new line). Calling the Oracle SQL REPLACE function without a replacement value removes any carriage return or new line character embedded in your data, which is probably the case here because the comment field is the last field in your CSV file.
You can use replace(replace(column_name, chr(10)), chr(13)) to remove newline characters, or regexp_replace(column_name, '\s+') to remove all whitespace (including ordinary spaces), during loading.
I found that the best way to load .csv files with fields containing newlines and commas is to run the macro below over the .csv file first, and then load it using SQL*Loader.
Sub remove()
    Dim row As Long
    Dim oneCell As Excel.Range
    Dim oxcel As Excel.Application
    Dim wbk As Excel.Workbook
    Set oxcel = New Excel.Application
    ' Third argument True opens the file read-only
    Set wbk = oxcel.Workbooks.Open("filename.csv", 0, True)
    row = 0
    With wbk.ActiveSheet
        Do
            row = row + 1
            'Assume first column is PK and so checking for empty pk to find the number of rows
        Loop Until IsEmpty(.Cells(row, 1)) Or IsNull(.Cells(row, 1))
        ' Column 24 holds the text field to clean
        For Each oneCell In .Range(.Cells(1, 24), .Cells(row - 1, 24))
            ' Turn line feeds into carriage returns, then carriage returns and commas into "-"
            oneCell.Value = Replace(Replace(Replace(CStr(oneCell.Value), vbLf, vbCr), vbCr, "-"), ",", "-")
        Next oneCell
    End With
End Sub
It runs perfectly for me.
