Loading to hive new line character from CSV File - hadoop

We are having a file, which is of the following type:
1- Sam, Joshua , "52 DD dr,
Lake Hiawatha" , New Jersey, 07034
2- Ruchi,kumari,SNN Raj serenity,Bengaluru, 560068
The line 1 is split into 2 rows in the External table with the rest of the columns being null in 1st row and 2nd row is having rest of the data.
Need assistance on what is the best way to load in a single column to overcome this issue. Went through a couple of solutions in the web , but was not clear.
Tried the following options:
1) Used the Regex Serde
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
but it did not work
2) CSVInputFormat from github
https://github.com/mvallebr/CSVInputFormat
But not able to use it.

I tried the following option and it worked for me ,
1) Regex tester - for this new line scenario the regex is very complicated , and it is not working.
2) Use CVS parser provided by https://github.com/mvallebr/CSVInputFormat and also had a chat with him on how to use it. Tried multiple options but not working.
3) The quick simple fix is to try the legacy method to replace the new lines in the File using shell or Perl command and it worked smoothly. Seems that to be a more feasible and easy option.

Related

How to load nested collections in hive with more than 3 levels

I'm struggling to load data into Hive, defined like this:
CREATE TABLE complexstructure (
id STRING,
date DATE,
day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';
The day_data field contains a complex structure difficult to load with load data inpath...
I've tried with '\004', ^D... a lot of options, but the data inside the map doesn't get loaded.
Here is my last try:
id_3054,2012-09- 22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0`
Before posting here, I've tried (many many) options, and this example doesn't work:
HIVE nested ARRAY in MAP data type
I'm using the image from HDP 2.2
Any help would be much appreciated
Thanks
Carlos
So finally I found a nice way to generate the file from java. The trick is that Hive uses the first 8 ASCII characters as separators, but you can only override the first three. From the fourth on, you need to generate thee actual ASCII charaters.
After many tests, I ended up editing my file with an HEX editor, and inserting the right value worked, but how can I do that in Java? Can't be more simple: just cast an int into char, and that will generate the corresponding ASCII character:
ASCII 4 -> ((char)4)
ASCII 5 -> ((char)5)
...
And so on.
Hope this helps!!
Carlos
You could store Hive table in Parquet or ORC format which support nested structures natively and more efficiently.

how to replace characters in hive?

I have a string column description in a hive table which may contain tab characters '\t', these characters are however messing some views when connecting hive to an external application.
is there a simple way to get rid of all tab characters in that column?. I could run a simple python program to do it, but I want to find a better solution for this.
regexp_replace UDF performs my task. Below is the definition and usage from apache Wiki.
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
This returns the string resulting from replacing all substrings in INITIAL_STRING
that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT,
e.g.: regexp_replace("foobar", "oo|ar", "") returns fb
Custom SerDe might be a way to do it. Or you could use some kind of mediation process with regex_replace:
create table tableB as
select
columnA
regexp_replace(description, '\\t', '') as description
from tableA
;
select translate(description,'\\t','') from myTable;
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)
Char/varchar support added as of Hive 0.14.0
You can also use translate(). If the third argument is too short, the corresponding characters from the second argument are deleted. Unlike regexp_replace() you don't need to worry about special characters.
Source code.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
There is no OOTB feature at this moment which allows this. One way to achieve that could be to write a custom InputFormat and/or SerDe that will do this for you. You might this JIRA useful : https://issues.apache.org/jira/browse/HIVE-3751. (not related directly to your problem though).

How can I do a double delimiter in Hive?

let's say I have some sample rows of data
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with those individual elements seperated by pipes so I can do group bys on them to show for example how many urls had a 2nd position of "4" which would be 2 rows in this case. So in this case I have the "url" field that has additional data I'd like to parse out and use in my queries.
The question is, what is the best way to do that in hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacol1')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break those out of each sub-delimiter into an array:
split(parse_url(url, 'QUERY', 'datacol1'), '\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
This looks very similar to something I've done a couple weeks ago, I think the best approach in your case would be to apply a pre-processing step (possibly with hadoop streaming), and change the prototype of your table to be:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that you can easily manipulate your data in Hive using lateral views and the builtin explode. The following code should help you get the counts of URLs per col.
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col

Reading files in PIG where delemeter comes in data

I want to read a CSV file using PIG what should i Do?. I used load n pigstorage(',') but it fails to read CSV file properly because where it encounters comma (,) in data it splits it.How should i give delimeter now if i have comma in data also?
It's generally impossible to distinguish comma in data from comma as a delimiter.
You will need to escape that comma that is in your 'data' and custom load function (for Pig) that can recognize escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions
Have you had a look at the CSVLoader loader in the PiggyBank if you want to read a CSV file? (of course the file format needs to be valid)
First make sure you have a valid CSV file. In the case you haven't try to change the source file through Excel (if the file is small) or other tool and export a new CSV with a good delimiter for your data (Ex: \t tab, ; , etc). Even better can be do another extract with a "good" delimiter.
Example of your load can be then something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your DUMP:
STORE TABLE INTO 'clean.csv' using PigStorage(','); <- delimiter that suits you best

using PIG to load a file

I am very new to PIG and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma separated words. However PIG is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that PIG will interpret it properly. Please help!
Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is "Money, coins, loans, debt" are getting stuck in your first column, word1. When you are printing it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line, then the others are null.
To fix this, you should specify PigStorage to load by comma by doing:
A = LOAD '...' USING PigStorage(',') AS (...);

Resources