Escaping separator in data while bulk loading using importtsv tool and ingesting numeric values - hadoop

I am using the importtsv tool to ingest data into HBase 1.1.5, and I have some doubts.
First, does it ingest non-string/numeric values? I was referring to an article detailing importtsv in the Cloudera distribution. It says "it interprets everything as strings", and I was wondering what exactly that means.
I am using a simple word-count example where the first column is a word and the second column is its count.
When my file looks like this:
"access","1"
"about","1"
and I ingest it and then run scan in the hbase shell, I get the following output:
about column=f:count, timestamp=1467716881104, value="1"
access column=f:count, timestamp=1467716881104, value="1"
When the file looks like this instead (the double quotes around the count removed):
"access",1
"about",1
and I ingest it and run scan again, the output no longer has double quotes around the count:
about column=f:count, timestamp=1467716881104, value=1
access column=f:count, timestamp=1467716881104, value=1
So, as you can see, there are no double quotes in the count's value.
Q1. Does that mean it is stored as an integer and not as a string? The Cloudera article suggests that a custom MR job needs to be written to ingest non-string values, but I cannot see what that means if the above is already ingesting integer values.
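To make my doubt concrete, here is how I understand the difference at the byte level (a plain Python sketch, not the HBase API): importtsv writes the parsed text as bytes, whereas a custom MR job could encode a fixed-width integer the way Java's Bytes.toBytes(int) does.
import struct

# What importtsv stores for the field 1 (no quotes): the UTF-8 bytes of the text.
as_text = "1".encode("utf-8")
# What a custom job could store instead: a big-endian 4-byte integer.
as_int = struct.pack(">i", 1)

print(as_text, len(as_text))   # b'1' 1
print(as_int, len(as_int))     # b'\x00\x00\x00\x01' 4
If importtsv writes the raw text bytes in both cases, then dropping the quotes only changes which characters end up in the cell and the value is still effectively a string; is that the right reading?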
My other doubt is whether I can escape the column separator when it appears inside a column value. For example, in importtsv we can specify the separator as follows:
-Dimporttsv.separator=,
But what if I have employee data where the first column is the employee name and the second is the address? My file will have rows resembling this:
"mahesh","A6,Hyatt Appartment"
The second comma makes importtsv think there are three columns, and it throws BadTsvLineException("Excessive columns").
So I tried escaping the comma with a backslash (\), and, just out of curiosity, escaping the backslash with another backslash (that is, \\). My file had the following lines:
"able","1\"
"z","1\"
"za","1\\1"
When I ran scan in the hbase shell, it gave the following output:
able column=f:count, timestamp=1467716881104, value="1\x5C"
z column=f:count, timestamp=1467716881104, value="1\x5C"
za column=f:count, timestamp=1467716881104, value="1\x5C\x5C1"
Q2. So it seems that instead of escaping the character that follows it, the backslash itself is stored, shown as \x5C. Is that right? Is there no way to escape the column separator while bulk loading data with importtsv?
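One workaround I could fall back to, since importtsv itself does not seem to offer any escape syntax, is to pre-process the quoted CSV into a tab-separated file before loading; tab is importtsv's default separator, so the embedded commas would then cause no conflict. A rough Python sketch, assuming standard quoted CSV input and no tabs or newlines inside fields (file names are illustrative):
import csv

# Re-write quoted CSV ("mahesh","A6,Hyatt Appartment") as plain tab-separated
# text; the embedded comma is then just data, not a separator.
with open("employees.csv", newline="") as src, open("employees.tsv", "w") as dst:
    for row in csv.reader(src):
        dst.write("\t".join(row) + "\n")
The resulting file could then be loaded without the -Dimporttsv.separator option at all, but I would prefer a way to handle this within importtsv itself.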

Related

data factory special character in column headers

I have a file I am reading into a blob via Data Factory.
It is formatted in Excel. Some of the column headers have special characters and spaces, which isn't good if I want to take it to CSV or Parquet and then SQL.
Is there a way to correct this in the pipeline?
Example
"Activations in last 15 seconds high+Low" "first entry speed (serial T/a)"
Thanks
Normally, Data Flow can handle this for you by adding a Select transformation with a Rule:
Uncheck "Auto mapping".
Click "+ Add mapping"
For the column name, enter "true()" to process all columns.
Enter an appropriate expression to rename the columns. This example uses regular expressions to remove any character that is not a letter.
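The rename expression itself is written in the Data Flow expression language, but the regex idea is easy to demonstrate outside ADF. A small Python illustration (illustrative only, not ADF syntax) applied to the headers from the question:
import re

headers = ["Activations in last 15 seconds high+Low",
           "first entry speed (serial T/a)"]

# Strip every character that is not a letter, as the rename rule does.
cleaned = [re.sub(r"[^A-Za-z]", "", h) for h in headers]
print(cleaned)   # ['ActivationsinlastsecondshighLow', 'firstentryspeedserialTa']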
SPECIAL CASE
There may be an issue with this if the column name contains forward slashes ("/"). I came across this accidentally in my testing:
every one of the columns that was not mapped contained forward slashes. Unfortunately, I cannot explain why this would be the case, as Data Flow is clearly aware of the column name. It can be addressed manually by adding a Fixed rule for EACH offending column, which is obviously less than ideal.
ANOTHER OPTION
The other thing you could try is to pre-process the text file with another Data Flow, using a Source dataset that has no delimiters. This would give you the contents of each row as a single column. If you could get a handle on just the first row, you could remove the special characters there.

how to replace characters in hive?

I have a string column, description, in a Hive table which may contain tab characters ('\t'). These characters are messing up some views when connecting Hive to an external application.
Is there a simple way to get rid of all tab characters in that column? I could run a simple Python program to do it, but I want to find a better solution.
The regexp_replace UDF does the job. Below are the definition and usage from the Apache wiki:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
This returns the string resulting from replacing all substrings in INITIAL_STRING
that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT,
e.g.: regexp_replace("foobar", "oo|ar", "") returns fb
A custom SerDe might be a way to do it. Or you could use some kind of mediation process with regexp_replace:
create table tableB as
select
  columnA,
  regexp_replace(description, '\\t', '') as description
from tableA;
select translate(description, '\t', '') from myTable;
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)
Char/varchar support added as of Hive 0.14.0
You can also use translate(). If the third argument is too short, the corresponding characters from the second argument are deleted. Unlike regexp_replace(), you don't need to worry about regex special characters.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
There is no out-of-the-box feature at the moment which allows this. One way to achieve it would be to write a custom InputFormat and/or SerDe that does this for you. You might find this JIRA useful: https://issues.apache.org/jira/browse/HIVE-3751 (not directly related to your problem, though).

csv parsing issue when data include comma

I am retrieving data from a DB and joining the fields with commas to generate a CSV file.
The problem is that one of the fields is the company name, and the data can include a comma, which leads to a malformed CSV file.
Example: Name, Telephone, Email
AAA, 12345, aaa#mail.com
BBB Co,.Ltd, 43466, bbb#gmail.com
For the record BBB, the generated CSV becomes a problem because the data itself includes a comma.
How should I generate correct CSV for records like this, where the data includes a comma?
Most developers handle this situation by using a different delimiter instead of the comma, but I would suggest you look at an old post on this:
Dealing with commas in a CSV file
Is your question related to Salesforce APEX?
When the CSV is generated, there ought to be an option to enclose the fields in double quotes so that commas can appear inside the field content, for example "Company, Name","1234","etc."
The CSV generator will also "escape" any double quotes inside a field, either by doubling them (the RFC 4180 convention: "Some field with ""double"" quotes") or with a backslash, like this: "Some field with \"double\" quotes","123","etc"
This all means you need a CSV parser that can handle these situations.
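If you control the CSV generation, most CSV libraries already apply these quoting rules for you. A minimal Python sketch using the data from the question (the output file name is illustrative; Python's csv module escapes embedded quotes by doubling them, per RFC 4180):
import csv

rows = [
    ["Name", "Telephone", "Email"],
    ["AAA", "12345", "aaa#mail.com"],
    ["BBB Co,.Ltd", "43466", "bbb#gmail.com"],   # the comma is inside the field
]

with open("companies.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
# The row with the embedded comma is written as: "BBB Co,.Ltd",43466,bbb#gmail.com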
If your question is related to Salesforce APEX then it is quite difficult to build such a CSV parser because of the limitations Salesforce imposes on the number of statements that can run in any given action.

Access 2007 export table to csv decimal

I have some tables in Access that I'm trying to export to CSV so that I can import them into Oracle. I don't use the export via ODBC because I have 70K-500K records in some of these tables and that feature takes way too long; since I have about 25 tables to do, I want to export to CSV (which is much faster) and then load via sqlldr.
Some numeric columns can go out to 16 decimal places and I need them all; however, when I export, they only go out to 2. I've done some googling around this: the regional settings only allow 9 decimal places (Win XP), and formatting the column via a query changes it to text, which I don't want when importing to Oracle (maybe I can use to_number() in the control file?).
Why is this so difficult? Why can't Access just export numeric columns as they are?
In my Access 2007 test case, I'm not seeing quite the same result you described. When I export to CSV, I get all the decimal places.
Here is my sample table with decimal_field as decimal(18, 16).
id  some_text  decimal_field
--  ---------  ------------------
1   a          1.0123456789012345
2   b          2
Unfortunately, those exported decimal_field values are quoted in the CSV:
"id","some_text","decimal_field"
1,"a","1.0123456789012345"
2,"b","2"
The only way I could find to remove the quotes surrounding the decimal_field values also removed the quotes surrounding genuine text values.
If quoted numeric values are unworkable, perhaps you could create a VBA custom CSV export procedure, where you write your values to each file line formatted as you wish.
Regarding "Why is this so difficult?", I suspect decimal data type as the culprit. I don't recall encountering this type of problem with other numeric data types. Unfortunately, that's only my speculation and won't help even if it's correct.
Create a query selecting all the records from your table. Format the troublesome column by using the format function:
Select Format(Fieldname,"0000.00000") AS FormattedField
Save this query and export the query instead of the table.
One disadvantage of this approach is that your numeric field is then treated as text, so you get quotes around the exported numbers. And if you use the option not to enclose text in quotes, then any actual text fields you export in the same query lose their quotes too.
The other (quicker, dirtier, bodge job) method is to export first into Excel and from there to text. This leaves decimal places intact, but obviously it's not very elegant.

using PIG to load a file

I am very new to PIG and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma-separated words. However, Pig is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that Pig will interpret it properly. Please help!
Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is that "Money, coins, loans, debt" is getting stuck in your first column, word1. When you print it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line and the others are null.
To fix this, you should specify PigStorage to load by comma by doing:
A = LOAD '...' USING PigStorage(',') AS (...);
