Check for a newline character in a CSV file - shell

Currently my script uses the line below to fix newline breaks inside a CSV file:
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' MY_FILE.csv > MY_FILE.csv.tmp
I want to add a pre-check so that the script runs the above command only if the file actually contains such a line break. How do I add a pre-check for this?
My CSV file looks like the sample below and contains 1 million records:
20160711,"M","N1","F","S","A","good data with.....some special character and space (new line)
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
A newline can appear anywhere in the file.

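@sabya Perhaps count the double quotes on each line? If a line has an odd number of them, a quoted field continues on the next line, so there is a line break inside a field:
gawk '{ if (and(1, gsub(/"/, "\""))) { HasReturn = 1; exit } } END { exit HasReturn }' MY_FILE.csv
The exit status is 1 if such a line is found and 0 otherwise.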
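One way to wire such a check into the script is to use awk's exit status as the condition. The following is only a minimal sketch: the function name has_split_record and the two sample files are made up for illustration, it uses n % 2 instead of gawk's and() so that it runs in any POSIX awk, and it assumes that an odd number of double quotes on a line can only be caused by a newline inside a quoted field.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if some line holds an odd number of
# double quotes, which means a quoted field continues on the next line.
has_split_record() {
    awk '{ n = gsub(/"/, "&") }        # gsub returns how many quotes it saw
         n % 2 { found = 1; exit }     # odd count: stop reading early
         END { exit !found }' "$1"
}

# Made-up sample files, one clean and one with a split record.
tmp=$(mktemp -d)
printf '1,"a","b"\n2,"c","d"\n' > "$tmp/clean.csv"
printf '1,"a","broken\nfield"\n2,"c","d"\n' > "$tmp/split.csv"

for f in "$tmp/clean.csv" "$tmp/split.csv"; do
    if has_split_record "$f"; then
        # The gawk repair command from the question would run here.
        echo "${f##*/}: needs repair"
    else
        echo "${f##*/}: clean"
    fi
done
rm -rf "$tmp"
```

Because awk stops at the first offending line, the check only reads the whole file when the file is clean.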

I would respectfully suggest loading the data as given rather than altering it. To maintain data integrity, construct the control file so that it preserves the newlines between the double quotes.
Construct the control file like this, using the "str" clause on the infile option line to set the end-of-record character. It tells sqlldr that hex 0D (carriage return, or ^M) is the record separator, so it will ignore the linefeeds inside the double quotes:
LOAD DATA
infile "test.dat" "str x'0D'"
TRUNCATE
INTO TABLE test
replace
fields terminated by ","
optionally enclosed by '"'
(
cola char,
colb char,
colc char
)
More info in this post: https://stackoverflow.com/a/37216660/2543416

Related

Manipulating CSV file with bash and remove CR LF in specific column

I have a file that's generated by a vendor. This vendor refuses to strip out the CR LF in the middle of some data due to how they concatenate two lines together in the file. The result is manual effort to identify and clean up these instances.
What I'd like to do is for each line in this file, if there is a CR LF in the 6th spot in the record- then remove it and replace it with a space. Here's an example with one in the 6th spot that I need to parse out. The file has 1-2 million lines and only a dozen or so have the CR LF in the 6th spot of the record. There's also a CR LF at the end of each record, so I can't just replace every instance of CR LF in the file.
XXXXXX~XXXXXX~XXXXXX~XXXXXX~~-NEW CUSTOMER HANK BUDREAU
DL:WD-XX-XX5
CONF# 12344564 ~XXXXXX~XXXXXX~XXXXXX~KWH~~000015~16~10132022074500PM~10~0.0798~10~0.0582~10~0.0606~10~0.0666~10~0.8358~10~1.5564~10~1.0986~10~0.6048~10~0.2022~10~0.0372~10~0.045~10~0.0318~10~0.0366~10~0.036~10~0.0294~10~0.0672~
If you know the number of fields in advance (47 here) then you can use something like this:
awk -F '~' -v nFields=47 '
    NF < nFields {
        if (nf) {
            rec = rec "\n" $0
            if (NF)
                nf += NF - 1
        } else {
            rec = $0
            nf = (NF ? NF : 1)
        }
        if ( nf >= nFields ) {
            gsub(/\r\n/," ",rec)
            print rec
            nf = 0
        }
        next
    }
    1
'
Note: the above code doesn't work if the last field is the one that contains the line break.
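Since only a dozen or so lines are affected, it may be worth counting the short lines before rewriting a file of 1-2 million lines. This is a rough sketch, not part of the answer above: the function name count_short_lines is made up, and the sample uses 5 fields instead of 47 to keep it readable.

```shell
#!/bin/sh
# Hypothetical pre-check: print how many lines have fewer than the expected
# number of "~"-separated fields, i.e. lines that are pieces of a split record.
count_short_lines() {   # usage: count_short_lines FILE NFIELDS
    awk -F '~' -v nFields="$2" 'NF < nFields { c++ } END { print c + 0 }' "$1"
}

sample=$(mktemp)
# Made-up 5-field data; the second record is split across two lines.
printf 'a~b~c~d~e\r\nf~g~h\r\ni~j~k\r\n' > "$sample"
count_short_lines "$sample" 5
rm -f "$sample"
```

A result of 0 would mean the file can be used as is. Like the script above, this cannot see a break inside the last field, because both pieces then look like acceptable lines or the first piece already has all its separators.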

Converting CSV file to multiline text file

I have file which looks like following:
C_DocType_ID,SOReference,DocumentNo,ProductValue,Quantity,LineDescription,C_Tax_ID,TaxAmt
1000000,1904093563U,1904093563U,5210-1,1,0,1000000,0
1000000,1904093563U,1904093563U,6511,2,0,1000000,0
1000000,1904093563U,1904093563U,5001,1,0,1000000,0
1000000,1904083291U,1904083291U,5310,4,0,1000000,0
1000000,1904083291U,1904083291U,5311,3,0,1000000,0
1000000,1904083291U,1904083291U,6101,6,0,1000000,0
1000000,1904083291U,1904083291U,6102,1,0,1000000,0
1000000,1904083291U,1904083291U,6106,6,0,1000000,0
I need to convert it to text file which looks like this:
WOH~1.0~~1904093563Utest~~~ORD~~~~
WOL~~~5210-1~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6511~~~~~~~~2~~~~~~~~~~~~~~~~~~~~~
WOL~~~5001~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOH~1.0~~1904083291Utest~~~ORD~~~~~~
WOL~~~5310~~~~~~~~4~~~~~~~~~~~~~~~~~~~~~
WOL~~~5311~~~~~~~~3~~~~~~~~~~~~~~~~~~~~~
WOL~~~6101~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
WOL~~~6102~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6106~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
The output file has header records and line item records. A header record contains the SOReference and some hardcoded fields, and a line item record contains the ProductValue and Quantity associated with that SOReference. The input file has 2 unique SOReferences, which is why the output file contains 2 header records, each followed by its associated line item records.
I need this done from the command line (awk/sed), since I have a series of files like this one that need to be converted to text.
With AWK, please try the following:
awk -F, '
FNR==1 {next}                       # skip the header line
{
    if ($2 != prevcol2) {           # insert newline when SOReference changes
        nl = FNR<=2 ? "" : "\n"     # suppress the newline in the 1st line
        printf("%sWOH~1.0~~%stest~~~ORD~~~~\n", nl, $2)
    }
    printf("WOL~~~%s~~~~~~~~%s~~~~~~~~~~~~~~~~~~~~~\n", $4, $5)
    prevcol2 = $2
}' file.csv

Display column from empty column (fixed width and space delimited) in bash

I have log file (in txt) with the following text
UNIT PHYS STATE LOCATION INFO
TCSM-1098 SE-NH -
ETPE-5-0 1403 SE-OU BCSU-1 ACTV FLTY
ETIP-6 1402 SE-NH -
The columns are delimited by spaces.
How can I get output like the following?
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
Thanks in advance.
This is what I've tried so far
cat file.txt | awk 'BEGIN { FS = "[[:space:]][[:space:]]+" } {print $1,$2,$3,$4}' | sed 's/ /|/g'
It produces output like this
|UNIT|PHYS|STATE|LOCATION|INFO|
|TCSM-1098|SE-NH|-|
|ETPE-5-0|1403|SE-OU|BCSU-1|ACTV|FLTY
|ETIP-6|1402|SE-NH|-|
The columns aren't exactly what I hoped for.
It seems it's not delimited but fixed-width format.
$ perl -ple '
$_ = join "|",
map {s/^\s+|\s+$//g;$_}
unpack ("a11 a5 a6 a22 a30",$_);
' <file.txt
How it works:
-p switch : loop over the input lines (default variable: $_) and print each one
-l switch : chomp the line ending (\n) on input and add it back on output
-e : inline command
unpack function : takes a format and the input line and returns an array
map function : applies the block to each element of the array; here a regex removes leading and trailing spaces
join function : takes a delimiter and an array and returns a string
$_ = : assigns the resulting string to the default variable for output
Perl to the rescue!
perl -wE 'my @lengths;
          $_ = <>;
          push @lengths, length $1 while /(\S+\s*)/g;
          $lengths[-1] = "*";
          my $f;
          say join "|",
              map s/^\s+|\s+$//gr,
              unpack "A" . join("A", @lengths), $_
              while (!$f++ or $_ = <>);' -- infile
The format is not whitespace separated, it's a fixed-width.
The @lengths array will be populated by the widths of the columns taken from the first line of the input. The last column width is replaced with *, as its width can't be deduced from the header.
Then, an unpack template is created from the lengths that's used to parse the file.
$f is just a flag that makes it possible to apply the template to the header line itself.
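The same idea, taking the column widths from the header line, can also be sketched in plain awk. This is an illustration under assumptions rather than a drop-in replacement: the function name fixed_to_pipe is made up, the sample rows are regenerated with printf using widths 11, 5, 6 and 22 (the original spacing of the log is not shown above), and the header is assumed to start in column 1 with no blanks inside a column name.

```shell
#!/bin/sh
# Hypothetical filter: read fixed-width text on stdin, write "|"-separated text.
fixed_to_pipe() {
    awk '
    NR == 1 {
        rest = $0; n = 0
        # A column is a run of non-blanks plus the blanks that follow it.
        while (match(rest, /[^ ]+ */)) {
            w[++n] = RLENGTH
            rest = substr(rest, RSTART + RLENGTH)
        }
    }
    {
        pos = 1; out = ""
        for (i = 1; i <= n; i++) {
            # The last column runs to the end of the line.
            f = (i < n) ? substr($0, pos, w[i]) : substr($0, pos)
            pos += w[i]
            gsub(/^ +| +$/, "", f)            # trim the padding
            out = out (i > 1 ? "|" : "") f    # separator only between fields
        }
        print out
    }'
}

# Sample rows rebuilt with assumed column widths 11, 5, 6 and 22.
printf '%-11s%-5s%-6s%-22s%s\n' \
    UNIT PHYS STATE LOCATION INFO \
    TCSM-1098 '' SE-NH '' - \
    ETPE-5-0 1403 SE-OU BCSU-1 'ACTV FLTY' \
    ETIP-6 1402 SE-NH '' - |
    fixed_to_pipe
```

Putting the separator in front of every field except the first avoids a trailing | on each line.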
With GNU awk for FIELDWIDTHS to handle fixed-width fields:
awk -v FIELDWIDTHS='11 5 6 22 99' -v OFS='|' '{$1=$1; gsub(/ *\| */,"|"); sub(/ +$/,"")}1' file
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
I think it's pretty clear and self-explanatory but let me know if you have any questions.
Manually, in awk:
$ awk 'BEGIN { n = split("11 5 6 23 99", cols) }
  { s = 0
    for (i = 1; i <= n; i++) {   # numeric loop: "for (i in cols)" has no guaranteed order
      field = substr($0, s, cols[i])
      s += cols[i]
      sub(/^ */, "", field)
      sub(/ *$/, "", field)
      printf "%s|", field
    }
    printf "\n" }' file
UNIT|PHYS|STATE|LOCATION|INFO|
TCSM-1098||SE-NH||-|
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY|
ETIP-6|1402|SE-NH||-|
The widths of the columns are set in the BEGIN block, then for each line we take substrings of the line of the required length. s counts the starting position of the current column, the sub() calls remove leading and trailing spaces. The code as such prints a trailing | on each line, but that can be worked around by making the first or last column a special case.
Note that the last field is not like in your output, it's hard to tell where the split between ACTV and FLTY should be. Is that fixed width too, or is the space a separator there?

How to use sed to insert a line before each line in a file with the original line's content surrounding by a string?

I am trying to use sed (GNU sed version 4.2.1) to insert a line before each line in a file with that line's content surrounding by a string.
Input:
truncate table ALPHA;
truncate table BETA;
delete from TABLE_CHARLIE where ID=1;
Expected Result:
SELECT 'truncate table ALPHA;' from dual;
truncate table ALPHA;
SELECT 'truncate table BETA;' from dual;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;' from dual;
delete from TABLE_CHARLIE where ID=1;
I have tried to make use of the ampersand (&) special character, but this does not seem to work. If I put anything after the ampersand on the replacement string, the output is not correct.
Attempt 1:
sed -e "s/\(.*\)/SELECT '&\n&/g" input.txt
output:
SELECT 'truncate table ALPHA;
truncate table ALPHA;
SELECT 'truncate table BETA;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;
delete from TABLE_CHARLIE where ID=1;
With the preceding code, I get the SELECT ' as expected, but once I attempt to add ' from dual; to the right side of string, things get out of whack.
Attempt 2:
sed -e "s/\(.*\)/SELECT '&' from dual;\n&/g" input.txt
output:
' from dual;cate table ALPHA;
truncate table ALPHA;
' from dual;cate table BETA;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;' from dual;
You can take advantage of the hold space to temporarily store the original line.
sed "h;s/.*/SELECT '&' from dual;/;p;g" input.txt
or more readably:
sed "
h
s/.*/SELECT '&' from dual;/
p
g" input.txt
Here's a breakdown of the command.
First, each line of the input is placed in the pattern space.
The h command copies the contents of the pattern space to the hold space.
The s command performs a substitution on the pattern space. The & represents whatever was matched. This command leaves the hold space unaffected.
The p command outputs the contents of the pattern space to standard output.
The g command copies the contents of the hold space to the pattern space.
By default, the contents of the pattern space are written to standard output before reading the next input line.
As Glenn Jackman points out, you can replace p;g with G. This builds up a two-line value in the pattern space that is then printed, rather than print two separate pattern spaces.
sed "h;s/.*/SELECT '&' from dual;/;G" input.txt
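As a quick check, the G variant can be run against lines from the question. This is just a sketch: the function name wrap_with_select is made up, and the substitution is written with no quote in front of SELECT so that the output matches the expected result in the question.

```shell
#!/bin/sh
# Hypothetical filter: emit a SELECT line before each input line.
wrap_with_select() {
    # h: save the line, s: rewrite it, G: append the saved line after a newline
    sed "h;s/.*/SELECT '&' from dual;/;G"
}

printf '%s\n' 'truncate table ALPHA;' 'truncate table BETA;' | wrap_with_select
```

Each input line should come out twice, first wrapped in SELECT '...' from dual; and then unchanged.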
Also, you can add comments to the sed command so that you can understand what the line noise does later :), if this is in a script.
sed "
# The input line is first copied to the pattern space
h                              # Copy the pattern space to the hold space
s/.*/SELECT '&' from dual;/    # Modify the pattern space
p                              # Print the (modified) pattern space
g                              # Copy the hold space to the pattern space
# The output of the pattern space (the original input line) is now printed
" input.txt
If you're looking for an alternative to sed, these work:
awk '{printf "SELECT '\''%s'\'' from dual;\n%s\n", $0, $0}' file
perl -lpe "print qq{SELECT '\$_' from dual;}" file
Your second attempt works on both the 4.2.1 and 4.2.2 versions of sed. I got the same invalid output when I saved your input file with Windows line endings (carriage return and line feed).
Use this command on your input file before running your sed command:
tr -d '\15\32' < winfile.txt > unixfile.txt
Or as you suggest, simply by using the dos2unix utility.
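To find out whether a file has this problem before running sed on it, a check along these lines may help. It is a sketch only: the function name has_carriage_returns and the temporary files are invented for the example, and it strips only carriage returns, whereas the tr command above also deletes Ctrl-Z (octal 32).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if FILE contains at least one carriage return.
has_carriage_returns() {
    cr=$(printf '\r')        # a literal CR; $(...) only strips trailing newlines
    grep -q "$cr" "$1"
}

src=$(mktemp); dst=$(mktemp)
# Made-up input saved with Windows line endings.
printf 'truncate table ALPHA;\r\ntruncate table BETA;\r\n' > "$src"

if has_carriage_returns "$src"; then
    tr -d '\r' < "$src" > "$dst"     # drop the CRs, keep the LFs
else
    cp "$src" "$dst"
fi

sed "s/.*/SELECT '&' from dual;/" "$dst"
rm -f "$src" "$dst"
```

With the carriage returns gone, the text after & lands at the end of the line instead of overwriting its beginning on the terminal.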
Here's how to do it with awk:
awk -v PRE="SELECT '" -v SU="' from dual;" '{print PRE $0 SU; print}' input.txt

Remove special characters in a CSV file on Unix and fix the newlines

Below is my sample data from the CSV file.
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the sample above, one field contains good data along with junk data, and the record is split across several lines.
I want to remove the special characters (they, together with the spaces, are what pushed the rest of the record onto the next line) and merge the split lines back into a single line.
Currently I am using something like the command below, which takes a lot of time:
tr -cd '\11\12\15\40-\176' < MY_FILE.csv | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' > MY_FILE.csv.tmp
A screenshot of the original data in the file is attached.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                        # Set comma as the field separator...
  NF<10 {                        # For any line that has fewer than 10 fields...
    $0=last $0                   # prepend the last "saved" line here,
    last=$0                      # and save the newly joined line for the next round.
  }
  NF<10 {                        # If we still have fewer than 10 fields,
    next                         # move on to the next input line.
  }
  {
    last=""                      # The record is complete: clear the saved text,
    gsub(/[^[:print:]]/,"")      # finally, substitute an empty string
  }                              # for all non-printables,
  1' inputfile                   # and print the current line.
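To see the joining step on its own, here is a small sketch run against shortened sample data. The wrapper name join_split_records and the variable want are made up, the expected field count is passed in instead of being hardcoded as 10, and the removal of non-printable characters is left out so that only the joining logic is shown.

```shell
#!/bin/sh
# Hypothetical wrapper around the joining logic: lines are glued together
# until the record reaches the expected number of comma-separated fields.
join_split_records() {   # usage: join_split_records NFIELDS < input
    awk -F, -v want="$1" '
        NF < want { $0 = last $0; last = $0 }   # prepend what was saved so far
        NF < want { next }                      # still incomplete: read more
        { last = ""; print }                    # complete record: emit it
    '
}

# Shortened version of the sample data: the first record spans three lines.
printf '%s\n' \
    '20160711,"M","N1","F","S","A","good data' \
    'split here' \
    '....","M","072","00126"' \
    '20160711,"M","N1","F","S","A","R","M","072","00126"' |
    join_split_records 10
```

The four input lines should come out as two lines of 10 fields each. The pieces are joined with nothing in between, so a separator has to be added in the first rule if the words on either side of the break must stay apart.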

Resources