speed up my awk command? Answer must be awk :) - bash

I have some awk code that is running really slow. The format of my file is tab delimited 5 column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I want to not count it, or the following character.
Then I want to find out how many characters are ">" or "<" or "*" and remove them from the count. I'm guessing using a gsub, and 3 splits is less than ideal, especially since column 5 can occasionally be a very very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?

Well as I see, your gsub pattern will not work, as the / was not closed. Anyway, if I get it correctly and you want the character count of $5 without some characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [ and ], and do not start with ^!

Do you need to use awk, or will this work instead?
cut -f 5 < $file | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Remove all fields starting with a capital letter.
Remove the characters <, >, *, and newlines.
Count the remaining characters.

Here's a guess:
awk '
BEGIN {FS = OFS = "\t"}
{
str = $5
gsub(/\^.|[><*]/, "", str)
l = length(str)
}
'

This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'

Related

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y YYY. Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In that example string it starts at character 29 and ends at character 39 within the string. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for first field, next 10 characters for second field and the rest for third field. OFS is set to empty, otherwise you'll get space added between the fields.
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK following way, for simplicity sake say we have file.txt content
S o m e s t r i n g
and want to change spaces from 5 (inclusive) to 10 (inclusive) position then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set field pattern (FPAT) to any single character and output field seperator (OFS) to empty string, thus every field is populated by single characters and I do not get superfluous space when print-ing. I use for loop to access desired fields and for every one I check if it is space, if it is I assign B here otherwise I assign original value, finally I print whole changed line.
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt)); # Extract the 29th to the 39th characters of the line and read into variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end)) # Rebuild the line incorporating raw and printing the result
}'file
This is certainly a suitable task for perl, and saddens me that my perl has become so rusty that this is the best I can come up with at the moment:
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # couldve taken this route, too
# $i="B"
}1' file # implicit output
With sed :
sed '
H
s/\(.\{236\}\)\(.\{11\}\).*/\2/
s/ /B/g
H
g
s/\n//g
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236/;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in seds pattern space.

Matching pairs using Linux terminal

I have a file named list.txt containing a (supplier,product) pair and I must show the number of products from every supplier and their names using Linux terminal
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
The following pipeline shoud do the job
< your_input_file sort -t: -k1,1r | sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N read one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of eccessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the only awk code I was referring to:
< os awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}

awk sed backreference csv file

A question to extend previous one here. (I prefer asking new question rather editing first one. I may be wrong)
EDIT : ok, I was wrong, I should edit my first question. My bad (SO question is an art, difficult to master)
I have csv file, with semi-column as field delimiter. Here is an extract of csv file :
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I search a solution to swap 2 patterns in each field.
pattern 1 : any digit, with optional decimal mark (.) and optional decimal digit
e.g : 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2 : any string that begin with left parenthesis, follow by one or more character, ending with right parenthesis
e.g : (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
Solutions provided in first question are right for simple file that contain only one column. If I try the solution within csv file, I face multiple failures.
So I try awk similar solution, which is (I think) more "column-oriented".
I have try
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I though by fixing field delimiter (;), "my regex swap" will succes in every field. It was a mistake.
Here is an exemple of failure
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally) : why awk fail when it success with one-column file. what is the best tool to face this challenge ?
sed with very long regex ?
awk with very long regex ?
for loop ?
other tools ?
PS : I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimetered files without any quoted values, usually awk comes to the rescue:
awk -vFS=';' -vOFS=';' '{
for (i = 1; i < NF; i++) {
split($i, t, "(")
if (length(t[1]) != 0 && length(t[2]) != 0) {
$i="("t[2]" "t[1]
}
}
print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However this will fail if fields are quoted, ie. the separator ; comes inside the values...
First we set input and output seapartor as ;
We iterate through all the fields in the line for (i = 1; i < NF; i++)
We split the line over ( character
If the first field splitted over ( is nonzero length and the second field has also nonzero length
We swap the firelds for this fields and add a space (we also remember about the removed ( on the beginning).
And then the line get's printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; i do a newline
For each line i substitute the string with at least on character before ( and a string inside ).
I then merge 7 lines using ; as separator with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for group of numbers (possibly with a decimal point) followed by a pair of parens and rearrange them in the desired fashion, globally through out each line.

remove special character in a csv unix and fix the new line

Below is my sample data in the csv .
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In above in a field I have good data along with junk data and line splited to new line .
I want to remove this special character (due to this special char and space,the line was moved to the next line) as well as merge this split line to a single line.
currently I am using something like below which is taking lots of time :
tr -cd '\11\12\15\40-\176' | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' MY_FILE.csv > MY_FILE.csv.tmp
attached a screenshot of original data in the file .
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, ' # Set comma as the field separator...
NF<10 { # For any lines that have fewer than 10 fields...
$0=last $0 # Insert the last "saved" line here,
last=$0 # and save the newly joined line for the next round.
}
NF<10 { # If we still have fewer than 10 lines,
next # repeat.
}
{
sub(/[^[:print:]]/,"") # finally, substitute an empty string
} # for all non-printables,
1' inputfile # And print the current line.

Trim column length to 6 characters

I have a file with over 30 columns and I want to trim the length of let's say 9th column to 6 characters (using shell).
Not able to get a good solution.
Please help!
awk would be an excellent choice here. For example lets say we need to trim the first column ($1) to six characters, we can write something like
awk '{$1 = gensub(/^(......).*$/, "\\1", $1)}1'
Test
$ echo 'helloworld' | awk '{$1 = gensub(/^(......).*$/, "\\1", $1)}1'
hellow
What it does
gensub does a regular expression substitute.
/^(......).*$/ Matches 6 characters and captures them in \1, if you note you could see 6 dots in the ().
\\1 replace the entire column content with the captured 6 characters.
1 Always evaluates to true, awk takes the default action to print the entire record.

Resources