Using gawk in a batch file I am having trouble reformatting lines from format A to format B - windows

I have a compiler which produces output like:
>>> Warning <code> "c:\some\file\path\somefile.h" Line <num>(x,y): warning comment
For example:
>>> Warning 100 "c:\some\file\path\somefile.h" Line 10(5,7): you are missing a (
>>> Warning 101 "c:\some\file\path\file with space.h" Line 20(8,12): unexpected char a
I need to get this into the following format (for MSVS2013):
<filename-without-quotes>(<line>,<column>) : <error|warning> <code>: <comment>
e.g. using the first example from above:
c:\some\file\path\somefile.h(10,5): warning 100: you are missing a (
I have had a good go at it and I can just about get the first example working, but the second example screwed me over because I had not figured on filenames with spaces (who does that!!? >.< ). Here is my awk (gawk) code:
gawk -F"[(^), ]" '$2 == "Warning" {gsub("<",""^); gsub("\"",""); start=$4"("$6","$7"^) : "$2" "$3":"; $1=$2=$3=$4=$5=$6=$7=$8=$9=""; print start $0;}' "Filename_with_build_output.txt"
gawk -F"[(^), ]" '$2 == "Error" {gsub("<",""^); gsub("\"",""); start=$4"("$6","$7"^) : "$2" "$3":"; $1=$2=$3=$4=$5=$6=$7=$8=$9=""; print start $0;}' "Filename_with_build_output.txt"
OK, so point 1 is, it's a mess. I will break it down to explain what I am doing. First, note that the input is a file - an error log generated by my build - which I simply pass into awk. Also note the occasional "^" before a round bracket: this is within a batch file IF statement, so I have to escape any ")" - except for one of them... I don't know why! - So the breakdown:
-F"[(^), ]" - This is to split the line by "(" or ")" or "," or " ", which is possibly an issue when we think about files with spaces :(
'$2 == "Warning" {...} - Any line where the 2nd parameter is "Warning". I tried using IGNORECASE=1 but I could not get that to work. Also I could not get an or expression for "Warning" or "Error", so I simply repeat the entire awk line with both!
gsub("<",""^); gsub("\"",""); - this is to remove '<' and '"' (double quotes) because MSVS does not want the filename with quotes around it... and it can't seem to handle "<". Again issues here if I want to get the filename with spaces?
start=$4"("$6","$7"^) : "$2" "$3":"; - this part basically shuffles the various parameters into the correct order with the various format strings inserted.
$1=$2=$3=$4=$5=$6=$7=$8=$9=""; - hmm... here I wanted to print the 10th parameter and everything after it; one trick (I could not get others to work) was to set params 1-9 to "" and then print $0 later.
print start $0; - final part, this just prints the string "start" that I built up earlier followed by everything after the 9th parameter (see previous point).
So, this works for the first example - although it's still a bit rubbish, because I get the following (missing the "(" at the end, since "(" is a split char):
c:\some\file\path\somefile.h(10,5): warning 100: you are missing a
And for the one with filename with a space I get (you can see the filename is all broken and some parameters are in the wrong place):
RCU(Line,20) : warning 101: : unexpected char a
So, multiple issues here:
How can I extract the filename between the quotes, yet still remove the quotes?
How can I get at the individual numbers in Line 10(5,7):? If I split on brackets and commas I can get to them, but then I lose the real brackets/commas in the comment at the end.
Can I more efficiently print the 10th element and all elements after it (instead of $1=$2=...$9="")?
How can I make this into one line such that $2 == "Warning" OR "Error"?
Sorry for the long question - but my awk line is getting very complicated!

IMHO, it is better not to get yourself tied up in regexes and fancy FS values unless they provide real value or are otherwise really needed. Just "cut and paste" as needed. Put the following in a file:
{
    sub(/^>>> /, "")                  # drop the ">>> " prefix
    warn = $1 " " $2; $1 = $2 = ""    # save e.g. "Warning 100", then blank those fields
    sub(/^[[:space:]]+/, "", $0)      # trim the leading whitespace left behind
    fname = $0
    sub(/ Line.*$/, "", fname)        # filename = everything before " Line ..."
    gsub(/"/, "", fname)              # strip the double quotes around it
    msg = $0
    sub(/^.*:/, "", msg)              # message = everything after the last ":"
    print fname ":\t" warn ":\t" msg
}
Then, per @EdMorton's most excellent comments, run it:
awk -f awkscript dat.txt > dat.out
output
c:\some\file\path\somefile.h: Warning 100: you are missing a (
c:\some\file\path\file with space.h: Warning 101: unexpected char a
Note that I have used tab-separated fields. If you want spaces or other chars, just replace the \t chars with " " or whatever you need.
As so many crave the one-liner solution, here it is:
awk '{sub(/^>>> /,"");warn=$1 " " $2; $1=$2="";sub(/^[[:space:]][[:space:]]*/,"",$0);fname=$0;sub(" Line.*$","",fname);gsub("\"","",fname);msg=$0;sub(/^.*:/,"",msg);print fname ":\t" warn ":\t"msg}' dat.txt
IHTH
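Footnote: the pieces the question actually asks for (line, column, and Warning-or-Error in one pass) can be pulled out with gawk's match() and a capture array. This is only a sketch - it assumes GNU awk (the third argument to match() is a gawk extension) and the exact message layout shown above - but putting it in a script file also sidesteps the batch-file "^" escaping entirely:
# msvs.awk - run as: gawk -f msvs.awk Filename_with_build_output.txt
/^>>> (Warning|Error) / {
    # m[1]=Warning|Error, m[2]=code, m[3]=filename, m[4]=line, m[5]=column, m[6]=comment
    if (match($0, /^>>> (Warning|Error) ([0-9]+) "([^"]*)" Line ([0-9]+)\(([0-9]+),[0-9]+\): (.*)/, m))
        printf "%s(%s,%s): %s %s: %s\n", m[3], m[4], m[5], tolower(m[1]), m[2], m[6]
}
Filenames with spaces survive because the filename is captured as "everything between the quotes" rather than split on blanks, and brackets in the trailing comment are untouched.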

Related

Convert multi-line csv to single line using Linux tools

I have a .csv file that contains double-quoted multi-line fields. I need to convert the multi-line cells to a single line. It doesn't show in the sample data, but I do not know which fields might be multi-line, so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't know how much data there is, so performance isn't a consideration.
I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.
The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.
I tried something like this and it kind of works on one row but fails on multiple rows.
sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv
CSV example
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
The output I want is
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
etc.
First my apologies for getting here 7 months late...
I came across a problem similar to yours today, with multiple multi-line fields. I was glad to find your question, but at least in my case there is the added complexity that, with more than one field affected, quotes might open, close and open again on the same line... Anyway, reading a lot and combining answers from different posts, I came up with something like this:
First I count the quotes in a line; to do that, I strip out everything but quotes and then use wc:
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
If you think of a single multi-line field, knowing if the quotes are 1 or 2 is enough. In a more generic scenario like mine I have to know if the number of quotes is odd or even to know if the line completes the record or expects more information.
To check for even or odd you can use the mod operand (%), in general:
even % 2 = 0
odd % 2 = 1
For the first line:
Odd means that the line expects more information on the next line.
Even means the line is complete.
For the subsequent lines, I have to know the status of the previous one. For instance, in your sample text:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
You can say line 1 (John,Doe,"Country) has 1 quote (odd), which means the status of the record is incomplete, or open.
When you go to line 2, there is no quote (even). Nevertheless, this does not mean the record is complete; you have to consider the previous status... So for the lines following the first one it will be:
Odd means that the record status toggles (incomplete to complete).
Even means that the record status remains as on the previous line.
What I did was loop line by line, carrying the status of each line over to the next one:
incomplete=0
cat file.csv | while read line; do
    quotes=`echo $line | tr -cd '"' | wc -c`    # Counts the quotes
    incomplete=$((($quotes+$incomplete)%2))     # Check if odd or even to decide status
    if [ $incomplete -eq 1 ]; then
        echo -n "$line " >> new.csv             # If the line is incomplete, join it with the next
    else
        echo "$line" >> new.csv                 # If the line completes the record, finish it
    fi
done
Once this is executed on a file in your format, it generates a new.csv like this:
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345
I like one-liners as much as everyone else; I wrote that script just for the sake of clarity. You can - arguably - write it in one line like this:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
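For what it's worth, the same parity trick fits in a single awk invocation; a sketch (gsub() here replaces each quote with itself, so it counts the quotes on the line without changing anything):
awk '{ n += gsub(/"/, "&"); printf "%s%s", $0, (n % 2 ? " " : "\n") }' file.csv > new.csv
An odd running total means we are still inside an open field, so the line is joined with a space; an even total closes the record with a newline.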
I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...
Recovering the multi-line fields
Every need is different; in my case I wanted the records on one line to further process the csv and add some bash-extracted data, but I also wanted to keep the csv as it was. To accomplish that, instead of joining the lines with a space, I used a marker - hopefully unique - that I could later search for and replace:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l";done >new.csv
The marker is ~newline~; this is totally arbitrary, of course.
Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:
sed -i 's/ ~newline~ /\n/g' new.csv
References:
Ternary operator: https://stackoverflow.com/a/3953666/6316852
Count char occurrences: https://stackoverflow.com/a/41119233/6316852
Other peculiar cases: https://www.linuxquestions.org/questions/programming-9/complex-bash-string-substitution-of-csv-file-with-multiline-data-937179/
TL;DR
Run this:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
... and collect results in new.csv
I hope it helps!
If Perl is your option, please try the following:
perl -e '
while (<>) {
    $str .= $_;
}
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
    if (($el = $&) =~ /^".*"$/s) {
        $el =~ s/^"//s; $el =~ s/"$//s;
        $el =~ s/""/"/g;
        $el =~ s/\s+(?!$)/ /g;
    }
    push(@ary, $el);
}
foreach (@ary) {
    print /\n$/ ? "$_" : "$_,";
}' sample.csv
sample.csv:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
John,Doe,"Country
City
Street",67890
Result:
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
This might work for you (GNU sed):
sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file
Test each line to see that it contains the correct number of fields (in the example that was 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) by spaces and finally remove the "'s.
N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
Try cat -v file.csv. If the file was made with Excel, you might be in luck: when the newlines inside a field are a simple \n and the newline at the end of each record is \r\n (which will show up as ^M), parsing is simple.
# delete all newlines and replace the ^M with a new newline.
tr -d "\n" < file.csv| tr "\r" "\n"
# Above two steps with one command
tr "\n\r" " \n" < file.csv
When you want a space between the joined lines, you need an additional step.
tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
EDIT: @sjaak commented that this didn't work in his case.
When your broken lines also have ^M, you can still be a lucky (wo-)man.
When your broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join 2 lines when the first line has exactly one double quote.
sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
Explanation:
-z don't use \n as line endings
:a label for repeating the step after successful replacement
(\n|^) Search after a newline or the very first line
([^"]*) Substring without a "
ta Go back to label a and repeat
awk pattern matching works here.
Answer in one line:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
If you'd like to drop the quotes, you could use:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'
but I prefer to keep them.
To explain the code:
/Pattern/ : matches the pattern in the current line.
ORS : the output record separator.
$0 : the whole of the current line.
's/OldPattern/NewPattern/' : substitutes the first OldPattern with NewPattern.
/g : does the previous action for all OldPatterns.
/w : writes the result to NewFile.

awk sed backreference csv file

A question to extend a previous one here. (I prefer asking a new question rather than editing the first one. I may be wrong.)
EDIT: OK, I was wrong, I should have edited my first question. My bad (the SO question is an art, difficult to master).
I have a csv file with a semicolon as the field delimiter. Here is an extract of the csv file:
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I am searching for a solution to swap 2 patterns in each field.
Pattern 1: any number, with an optional decimal mark (.) and optional decimal digits,
e.g.: 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ...
Pattern 2: any string that begins with a left parenthesis, is followed by one or more characters, and ends with a right parenthesis,
e.g.: (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
The solutions provided for the first question are right for a simple file that contains only one column. If I try them on the csv file, I face multiple failures.
So I tried a similar awk solution, which is (I think) more "column-oriented".
I have tried:
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by fixing the field delimiter (;), "my regex swap" would succeed in every field. It was a mistake.
Here is an example of a failure:
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail here when it succeeds on a one-column file? And what is the best tool to face this challenge?
sed with a very long regex?
awk with a very long regex?
a for loop?
other tools?
PS : I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, awk usually comes to the rescue:
awk -vFS=';' -vOFS=';' '{
    for (i = 1; i < NF; i++) {
        split($i, t, "(")
        if (length(t[1]) != 0 && length(t[2]) != 0) {
            $i = "(" t[2] " " t[1]
        }
    }
    print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However, this will fail if fields are quoted, i.e. if the separator ; appears inside the values...
First we set the input and output separators to ;.
We iterate through all the fields in the line: for (i = 1; i < NF; i++).
We split each field on the ( character.
If the first part of the split has nonzero length, and the second part also has nonzero length,
we swap the two parts within the field and add a space (also restoring the removed ( at the beginning).
Then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
    sed 's/;/\n/g' |
    sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
    xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; I substitute a newline.
On each resulting line I swap the string of at least one character before ( with the parenthesized string that follows it.
I then merge the lines back, 7 at a time, using ; as the separator, with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of digits (possibly with a decimal point) followed by a parenthesized group, and rearrange them in the desired fashion, globally throughout each line.
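As a side note, the same anchored regex also explains why the gensub attempt in the question failed: [[:digit:].]* happily matches the empty string, and \(.*\) is greedy across fields. With both tightened, the swap works in gawk too; a sketch (gensub is a gawk extension):
gawk '{ print gensub(/([0-9]+(\.[0-9]+)?)(\([^)]*\))/, "\\3 \\1", "g") }' file
A bare (:) is left alone, because the pattern now requires at least one digit before the parenthesis.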

remove special character in a csv unix and fix the new line

Below is my sample data from the csv:
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
Above, one field contains good data along with junk data, and the line is split across new lines.
I want to remove the special characters (due to these special chars and spaces, the line was moved to the next line) as well as merge the split line back into a single line.
Currently I am using something like the below, which is taking a lot of time:
tr -cd '\11\12\15\40-\176' | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' MY_FILE.csv > MY_FILE.csv.tmp
I attached a screenshot of the original data in the file.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
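One hedged aside: GNU awk specifically can be taught quote armour via FPAT, which defines what a field looks like rather than what separates fields. A sketch, assuming gawk 4.0+ and the field pattern from the gawk manual:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]*\")" } { print NF " fields; 7th = " $7 }' MY_FILE.csv
This copes with embedded commas, though not with the embedded newlines, which still need the joining step below.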
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                      # Set comma as the field separator...
NF<10 {                        # For any line that has fewer than 10 fields...
    $0=last $0                 # insert the last "saved" line here,
    last=$0                    # and save the newly joined line for the next round.
}
NF<10 {                        # If we still have fewer than 10 fields,
    next                       # skip to the next input line.
}
{
    last=""                    # Reset the saved line,
    gsub(/[^[:print:]]/,"")    # and substitute an empty string for all non-printables.
}
1' inputfile                   # And print the current line.

Why do I get weird output in printf in awk for $0?

The input is following
Title: Aoo Boo
Author: First Last
I am trying to output
Aoo Boo, First Last, "
by using awk like this
awk 'BEGIN { FS="[:[:space:]]+" }
/Title/ { sub(/^Title: /,""); t = $0; } # save title
/Author/{ sub(/^Author: /,""); printf "%s,%s,\"\n", t, $0}
' t.txt
But the output is like ,"irst Last. Basically it prints everything from the beginning of the line.
But if I change $0 to $2, the output is as expected, which is Boo,Last,".
Why is it incorrect? What is the right way to do it?
You need to get rid of the Windows line endings in your text file if you want to use Unix utilities.
If you're lucky, you'll find you have the dos2unix program installed, and you'll only need to do this:
dos2unix t.txt
If not, you could do it with tr:
tr -d '\r' < t.txt > new_t.txt
For reference, what is going on is that Windows files have \r\n at the end of every line (actually, a CR control code followed by a NL control code). On Linux, lines end with the \n alone, so the \r becomes part of the data; when you print it out, the terminal interprets it as a "carriage return", which moves the cursor to the beginning of the current line rather than advancing to the next line. Since the value of t ends with a \r, the following text overwrites the value of t.
It works with $2 because you've reassigned FS to include [:space:]; that definition of field separators is more generous than the awk default, since it includes \r and \f, neither of which are default field separators. Consequently, $2 does not contain the \r, but $0 does.
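If you'd rather not modify the input file, stripping the \r inside awk itself is a common alternative; a minimal sketch of the same script with that one extra rule, assuming the CR only ever appears at the end of the line:
awk 'BEGIN { FS="[:[:space:]]+" }
     { sub(/\r$/, "") }                    # drop the Windows CR, if any
     /Title/  { sub(/^Title: /, ""); t = $0 }
     /Author/ { sub(/^Author: /, ""); printf "%s,%s,\"\n", t, $0 }
    ' t.txt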
This assumes there are no colons in titles or names...
awk -F': *' '
$1=="Title" {
    sub(/[^[:print:]]/,"");
    t=$2;
}
$1=="Author" {
    sub(/[^[:print:]]/,"");
    printf("%s, %s\n", t, $2);
}
' inputfile.txt
This works by finding the title and storing it in a variable, then finding the author and using that as a trigger to print everything according to your format. You can alter the format as you see fit.
It may break if there are extra colons on the line, as the colon is being used to split fields. It may also break if your input doesn't match your example.
Perhaps the most important thing in this example is the sub(...) functions, which strip off non-printable characters like the carriage return that rici noticed you have. The regular expression [^[:print:]] matches any character that is not "printable", and the carriage return is one such character. This script will substitute them into oblivion if they're there, but should do no harm if they are not.

Extraneous output in awk

I'm parsing a file using awk.
BEGIN{FS=":"; PPH = 0; NAME=""}
NAME=$1;
PPH=$2;
PAY=PPH*HOURS;
{print NAME " " PAY}
END{print "end" }
This is the basic structure of the program. I'm running it as
awk -f file.awk inputfile.dat
The issue I'm having is that it prints each line six times, followed by what it should print for the print NAME and PAY line. I'm kind of confused about why this is happening, as I only have the two print statements, and it seems to be unrelated to the number of lines in the input file.
The problem is that the assignment statements need to be part of the action, that is, they need to be inside the second set of curly braces.
BEGIN {FS=":"; PPH = 0; NAME=""}
{
NAME=$1;
PPH=$2;
PAY=PPH*HOURS;
print NAME " " PAY
}
END {print "end" }
Remember that everything in awk is a pattern followed by an action within curly braces. If the action is omitted, the default action is to print the line. Since the assignments were not in curly braces, they were interpreted as patterns; each one that evaluated to true caused the line to be printed again.
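A two-line demonstration makes this visible. Here each bare assignment is a pattern, and each one that evaluates to a truthy value fires the default print action (the input is hypothetical):
# Two bare assignments = two patterns; each truthy one triggers the default {print}:
$ echo 'Bob:25' | awk -F: 'NAME=$1; PPH=$2'
Bob:25
Bob:25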
