Talend Open Studio: delimited file with semicolon and header with quotes - etl

I have a file that is delimited by semicolons.
The first row in this file is the header, and the header tokens are in double quotes; an example is below:
"name", "telephone", "age", "address", "y"
When using tFileDelimited and tMap and pulling the fields in, they look like this, with underscores around the field names:
_name_, _telephone_, _age_, _address_, Column05
So it seems that in the field names the double quotes are changed to underscore characters, and for some reason the last field is a single character without the quotes, yet Talend seems to ignore that field name and supplies its own default.
Just wondering if anyone has encountered this kind of behaviour, and whether one should use a regex to remove the double quotes as a preprocessing step.
Any help appreciated.

Be sure to remove the extra blank spaces in the first row, between the header tokens. If you use Metadata to import your file, you should have the right names appearing (just check the option 'heading rows as column names' and set "\"" as the text enclosure).
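If you do go the preprocessing route instead, here is a minimal sketch (the file names and the semicolon delimiter are assumptions based on the question) that strips the quotes and stray spaces from the header row only:

# Hypothetical paths; adjust to your actual files.
with open("input.csv") as src, open("cleaned.csv", "w") as dst:
    header = src.readline()
    # Strip surrounding spaces and double quotes from each header token.
    tokens = [t.strip().strip('"') for t in header.split(";")]
    dst.write(";".join(tokens) + "\n")
    # Copy the data rows through unchanged.
    for line in src:
        dst.write(line)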

Related

How to replace the content of a flow file that exists between [ and ]?

I want to remove the entire content between the brackets of a flow file attribute. Attached is my sample flow file, in which I want to remove the content between [ and ]. May I know the search and replacement values to be used in the ReplaceText processor?
Flow File content
You can put the following regular expression in the 'Search value' field to detect all the content between brackets. This will select the whole text, including the brackets.
\[(.*?)\]
If you put an empty string in the 'Replacement value', it will clear all the content between the brackets (including the brackets themselves). If you would like to keep the brackets in the output, use [] as the 'Replacement value'.
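If you want to sanity-check the pattern outside NiFi, the same substitution can be tried in a couple of lines of Python (purely illustrative, not part of the processor configuration):

import re

text = "keep this [drop this] and this [and this too]"
# Empty replacement removes the brackets and everything inside them.
print(re.sub(r"\[(.*?)\]", "", text))
# Replacing with "[]" keeps the brackets but clears their content.
print(re.sub(r"\[(.*?)\]", "[]", text))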

How to clean a csv file where fields contain the csv separator and delimiter

I'm currently struggling to clean csv files generated automatically, where fields contain the csv separator and the field delimiter, using sed or awk or via a script.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the csv is "well" formatted; the exporting software just doesn't escape or replace "forbidden" chars in the fields.
In the last few days I tried to improve my knowledge of regular expressions and find an expression to clean the files, but I failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside, but I didn't find a way to do it):
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that finds a semicolon; it does not work if the semicolon is the last char of the field, and it only finds one per field:
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes; it seems to pick only the first double quote of the line in online regex testers:
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding a space between each char in the fields, then searching for lone semicolons and double quotes and removing the single space after that, but I don't know if it's even possible, and it seems like a poor solution anyway.
Any standard library should be able to handle it if there is no explicit error in the CSV itself. This is why we have quote-characters and escape characters.
When you create a CSV by yourself, you may forget to handle such cases and end up with an output file in this situation. AWK is not a CSV reader but simply a text-processing utility.
This is what your row should rather look like:
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality or through a csv library for the language you work with.
In Python, this would look like this:
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
But if you can't create the csv again, the only hope is that you expect some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk
But this is rarely true in textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
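As a rough, self-contained sketch of the re-export idea (the sample data and file name are made up for illustration), writing with an explicit escapechar and doublequote=False produces rows in the backslash-escaped form shown above, and the same settings read them back cleanly:

import csv

# Made-up sample row mirroring the problematic one from the question.
rows = [["111111", "text", "", 'text with ; and " sometimes', "user"]]

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";", quotechar='"',
                        escapechar="\\", doublequote=False,
                        quoting=csv.QUOTE_ALL)
    writer.writerows(rows)

# Reading back with the same dialect settings recovers the original fields.
with open("out.csv", newline="") as f:
    reader = csv.reader(f, delimiter=";", quotechar='"',
                        escapechar="\\", doublequote=False)
    for row in reader:
        print(row)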

Freemarker <compress> tag is trimming data inside ${} also

I have code like this :
FTL:
<#compress>
${doc["root/uniqCode"]}
</#compress>
Input is XML Nodemodel
The xml element has data like: ID_234   567_89
When it is processed, the output is: "ID_234 567_89"
The three white spaces between 234 and 567 are trimmed down to one white space, and all the white spaces at the end of the value are lost.
I need the value as it is: "ID_234   567_89 "
When I removed the <#compress> tags, it worked as expected, irrespective of newFactory.setIgnoringElementContentWhitespace(true).
Why does the tag trim the data resulting from ${}?
Please help.
You could simply replace the characters you don't want manually (in the following example tabs, carriage returns and newlines), e.g.
${doc["root/uniqCode"]?replace("[\\t\\r\\n]", "", "rm")}
See ?replace built-in for strings: http://freemarker.org/docs/ref_builtins_string.html#ref_builtin_replace
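To see what that regex-mode replace does to the value above, here is the same pattern tried in plain Python (just an illustration of the regex, not FreeMarker itself); tabs, carriage returns and newlines are removed while the interior and trailing spaces survive:

import re

value = "ID_234   567_89   \n"
# Only tab, CR and LF are stripped; ordinary spaces are left untouched.
print(repr(re.sub(r"[\t\r\n]", "", value)))   # 'ID_234   567_89   '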

How do I specify a row key in hbase shell that has a tab in it?

In our infinite wisdom, we decided our rows would be keyed with a tab in the middle:
item_id <tab> location
For example:
000001 http://www.url.com/page
Using the HBase shell, we cannot perform a get command because the tab character doesn't get written properly in the input line. We tried
get 'tableName', '000001\thttp://www.url.com/page'
without success. What should we do?
I had the same issue for binary values: \x00. This was my separator.
For the shell to accept your binary values, you need to provide them in double quotes (") instead of single quotes (').
put 'MyTable', "MyKey", 'Family:Qualifier', "\x00\x00\x00\x00\x00\x00\x00\x06Hello from shell"
Check how your tab is being encoded; my best bet would be that it is UTF-8 encoded, so from the ASCII table this would be "000001\x09http://www.url.com/page".
On a side note, you should use a null character for your separator; it will help you in scans.
Hope you can change the tab character. :) Yeah, that's a bad idea, since MapReduce jobs use the tab as a delimiter, and it's generally a bad idea to use a tab or space as a delimiter.
You could use a double colon (::) as a delimiter. But wait, what if the URL has a double-colon in the URL? Well, urlencode the URL when you store it to HBase - that way, you have a standard delimiter, and the URL part of the key will not conflict with the delimiter.
In Python:
import urllib  # Python 2; in Python 3 this lives in urllib.parse

DELIMITER = "::"

# Percent-encode the URL so a '::' inside it can never collide with the delimiter.
urlkey = urllib.quote_plus(location)
rowkey = item_id + DELIMITER + urlkey
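Going the other way is symmetric; a small follow-up sketch (illustrative only, reusing the rowkey built above) splits on the delimiter and undoes the encoding:

# Split on the first delimiter only, then undo the percent-encoding.
item_id, urlkey = rowkey.split(DELIMITER, 1)
location = urllib.unquote_plus(urlkey)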

Bash for truncation

I have to make changes to a document where there are two columns separated by a tab (\t) and each record is separated by a newline (\n). The entries in the document look like this:
/something/random/2345.txt
My aim is to remove the entire string and just keep the number, 2345 in this case. I used
sed 's/something/random//g' file.csv
but I do not know how to escape the /, because sed syntax uses / too. Also, not all records have the same words, so I would be looking for a regex of the type
/*/*.*
But each entry has a number as a part of the record and I would like to extract that.
Also there are a few records which do not contain any number, I would like to delete those records along with the corresponding entry in the next column for that record.
The file is in CSV format.
You can escape the forward slash with a backslash, or you can use a different character than forward slash to delimit your expression. Observe:
echo foobar | sed sIfooIcrowI
> crowbar
Of course, you probably shouldn't use an alphabetic character for the delimiter. I'm just using it here to make the point that pretty much any normal character can be substituted for the slash.
You could just remove all non-digit characters from the beginning of each entry in the string:
sed 's/[^0-9]*\(.*\)[\t]*/\1/g'
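If you also need to drop the records that contain no number at all (together with the value in the second column), a short script may be simpler than a sed one-liner; here is a sketch (the input and output file names are hypothetical, and the columns are assumed to be tab-separated as described above) that keeps only the digits from the first column and skips rows without any:

import csv
import re

with open("file.csv") as src, open("out.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        if not row:
            continue
        match = re.search(r"\d+", row[0])
        if match:  # records with no number are dropped entirely
            writer.writerow([match.group(0)] + row[1:])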
