Field Overflow when column values contain space - etl

I have a simple mapping to read data from a DB table into a CSV-delimited file using ODI (11g/12c).
Certain fields have values with words separated by spaces, like "United States of America".
When the data is generated in the CSV, I see that "United" is in one column, "States" is in the next column, and so on. In other words, the data overflows across columns in spite of the delimiter being set to ",".
IKM: SQL to File Append.
How can we resolve this?
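If it helps to narrow this down, here is a minimal diagnostic sketch (plain Python, outside ODI; "target.csv" is a hypothetical path standing in for the IKM's output file). If the field count per row exceeds the number of source columns, the separator in the file datastore definition is the likely culprit:

    import csv

    # Count how many fields the comma delimiter actually yields per row.
    # A value like "United States of America" should stay in ONE cell.
    with open("target.csv", newline="") as f:
        for row in csv.reader(f, delimiter=","):
            print(len(row), row)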

Related

How to handle a string containing a comma with CSVWriter in a NiFi process group

I have a CSV file with two columns, for example column A and column B. Column B contains string values like this: "I am, doing good". When I try to insert this data into a database, only the string "I am" gets inserted. I just want to know what attribute I need to add to the process group so that "I am, doing good" will get inserted into the database.
The attached image shows the attributes in the current process group.
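The standard CSV answer is quoting: a field that contains the delimiter is wrapped in quote characters so readers keep it intact. In NiFi this is governed by the quote settings on the CSVReader/CSVWriter controller services (check the exact property names in your version; that detail is an assumption here). The mechanics are easy to demonstrate with Python's csv module:

    import csv, io

    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields that contain the delimiter.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["row-1", "I am, doing good"])
    print(buf.getvalue())  # row-1,"I am, doing good"

    # Reading it back keeps the embedded comma inside one field:
    print(next(csv.reader(io.StringIO(buf.getvalue()))))
    # ['row-1', 'I am, doing good']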

Data Factory: special characters in column headers

I have a file I am reading into a blob via Data Factory.
It's formatted in Excel. Some of the column headers have special characters and spaces, which isn't good if you want to take it to CSV or Parquet and then SQL.
Is there a way to correct this in the pipeline?
Example:
"Activations in last 15 seconds high+Low", "first entry speed (serial T/a)"
Thanks
Normally, Data Flow can handle this for you by adding a Select transformation with a rule-based mapping:
Uncheck "Auto mapping".
Click "+ Add mapping".
For the matching condition, enter "true()" to process all columns.
Enter an appropriate expression to rename the columns. This example uses a regular expression to remove any character that is not a letter (see the sketch after this list).
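As a concrete sketch of those two boxes (Data Flow expression language; $$ stands for the matched column's name, and the exact regex is an assumption to adapt):

    Matching condition: true()
    Name as:            regexReplace($$, '[^a-zA-Z]', '')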
SPECIAL CASE
There may be an issue with this if the column name contains forward slashes ("/"). I came across this accidentally in my testing: every one of the columns left unmapped contains forward slashes. Unfortunately, I cannot explain why this would be the case, as Data Flow is clearly aware of the column name. It can be addressed manually by adding a Fixed rule for EACH offending column, which is obviously less than ideal.
ANOTHER OPTION
The other thing you could try is to pre-process the text file with another Data Flow, using a Source dataset that has no delimiters. This would give you the contents of each row as a single column. If you could get a handle on just the first row, you could remove the special characters.
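As a rough sketch of that idea (plain Python rather than a Data Flow, assuming a local copy of the file named "input.csv"): sanitize only the header row and pass the data rows through untouched:

    import re

    with open("input.csv", encoding="utf-8") as src, \
         open("clean.csv", "w", encoding="utf-8") as dst:
        header = src.readline()
        # Keep letters only, mirroring the rename rule above.
        cleaned = [re.sub(r"[^A-Za-z]", "", col) for col in header.split(",")]
        dst.write(",".join(cleaned) + "\n")
        for line in src:
            dst.write(line)  # data rows pass through unchanged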

Pentaho Load Plain Text File w/ ASCII separator

I'm trying to use Spoon / Kettle to load a plain text file whose fields are separated by ASCII control characters. I can see all the data when I preview the content of the file in Kettle, but no records load when I try to preview rows on the "Content" tab.
According to my research, Kettle should understand my field separator when typed as "$[value]" which in my case is "$[01]". Here's a description of the file structure:
Each file in the feed is in plain text format, separated into columns and rows. Each record has the same set of fields. The following are the delimiters for
each field and record:
Field Separator (FS): SOH (ASCII character 1)
Record Separator (RS): STX (ASCII character 2) + “\n”
Any record starting with a “#” and ending with the RS should be treated as a comment by the ingester and ignored. The data provider has also generated a column header line at the beginning of the file, listing field data types.
So my input parameters are:
Filetype: Fixed
Separator: $[01]
Enclosure:
Escape:
...
Format: DOS
Encoding: US-ASCII
Length: Characters
I'm unable to read any records, and I'm not sure if this is the correct approach. Would ingesting this data with Java inside of Kettle be a better method?
Any help with this would be much appreciated. Thanks!
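For reference, here is a minimal sketch (plain Python, outside Kettle) of the layout described above, assuming the record terminator is literally STX followed by "\n" and "feed.txt" is a stand-in filename:

    SOH = "\x01"   # field separator (ASCII 1)
    RS = "\x02\n"  # record separator: STX (ASCII 2) + "\n"

    with open("feed.txt", encoding="us-ascii", newline="") as f:
        raw = f.read()

    for record in raw.split(RS):
        if not record or record.startswith("#"):
            continue  # skip comment records and trailing empties
        print(record.split(SOH))  # one list of fields per record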

How does ORC delimit fields?

I know this must be a silly question, but after hours' googling, I cannot get the answer.
It's easy to understand how the delimiters work in a plain-text format such as CSV. But in ORC, since it is stored as binary in HDFS, what would be the delimiter for a field? I was told that there is no delimiter in ORC, but I highly doubt that statement.
Even if it is stored as row groups, each column of a row group can hold multiple data fields, so how is each field distinguished from the next one? How is each row separated from the next row? Is there a delimiter to achieve this?
Thank you for any comments!
No delimiter. It uses stripes (and index strides):
The body of the file is divided into stripes. Each stripe is self-contained and may be read using only its own bytes combined with the file's Footer and Postscript. Each stripe contains only entire rows, so rows never straddle stripe boundaries. Stripes have three sections: a set of indexes for the rows within the stripe, the data itself, and a stripe footer. Both the indexes and the data sections are divided by columns, so that only the data for the required columns needs to be read.
Refer: ORC
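You can also inspect the stripe structure directly with pyarrow (a small sketch, assuming pyarrow is installed and "data.orc" is a local file): there is no delimiter to configure; stripes are located via the file footer and decoded whole:

    import pyarrow.orc as orc

    f = orc.ORCFile("data.orc")
    print(f.nstripes, "stripes,", f.nrows, "rows")

    batch = f.read_stripe(0)  # decode one self-contained stripe
    print(batch.schema)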

Send a Flat file attachment in the workflow in Informatica Developer

In a mapping we use a delimited flat file with 3 columns, separated by commas. But I have a requirement where one of the columns contains 2 commas within its value. How should I process that column in the mapping?
You should have the information quoted with "", so whatever is within the quotes is skipped; this way you can differentiate between a comma that is part of a piece of information and a comma used as a column separator.
We don't know what you have tried, but you could count the number of commas on each line and split accordingly (if possible); see the sketch below.
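A rough sketch of that fallback (plain Python, assuming only the middle of the 3 columns can contain commas): split once from the left and once from the right, so any extra commas stay inside the middle column:

    line = "A1,I am, doing, good,C1"

    first, rest = line.split(",", 1)    # peel off column 1
    middle, last = rest.rsplit(",", 1)  # peel off column 3 from the right
    print([first, middle, last])        # ['A1', 'I am, doing, good', 'C1']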
