Sorting the contents within a column line by line in a file using shell script - shell

I am sorting a file by a column using the command -
cat myFile | sort -u -k3
Now I want to sort the data within a column of the file. Can anyone please help and tell me how I can achieve it?
My data looks like this in the file named Student.csv -
Name,Age,Marks,Grades
Sam,21,"34,56,21,67","C,B,D,A"
Josh,25,"90,89,78,45","A,A,B,C"
Output-
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
Will appreciate the help, thanks.

You should export your CSV with a field separator that does not appear inside the field values. Otherwise it becomes hugely cumbersome to deal with this.
Afterwards you can easily sort by specifying the separator and the field.
For example, if you used | as the separator:
Name|Age|Marks|Grades
Sam|21|"34,56,21,67"|"C,B,D,A"
Josh|25|"90,89,78,45"|"A,A,B,C"
Then execute:
cat myFile | sort -u -k3 -t\|
or:
sort -u -k3 -t\| <myFile
Afterwards you could swap the delimiter again, for example to semicolons:
sort -u -k3 -t\| <myFile | sed 's/|/;/g'
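If re-exporting isn't an option, one way to get from the original comma-separated Student.csv to that layout is to replace only the commas that sit outside the double quotes. A rough sketch with awk (it assumes the quoted fields are consistently quoted and contain no embedded double quotes):
awk -F'"' 'BEGIN{OFS=FS} { for (i = 1; i <= NF; i += 2) gsub(/,/, "|", $i) } 1' Student.csv
Splitting on the double quote leaves the quoted Marks and Grades values in the even-numbered fields, so the loop only touches the commas in the unquoted chunks, producing the |-separated layout shown above.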

Did it, but I'm too tired to explain how; brain's hitting a brick wall. There's a lot to unpack there, and it'll take half-a-day to explain. I'll write all the steps in a couple hours after I get a nap in, otherwise there's gonna be 50 typos in that description.
cat Student.csv | head -n1 && cat Student.csv | tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}'
Alternatively:
cat Student.csv | tee >(head -n1) >(tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}') >/dev/null ; sleep 0.1
Output:
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
https://www.tutorialspoint.com/awk/index.htm
Edit -- 'kay, the explanation:
cat concatenates (glues) files together, but when you just give it one arg, then that's what it prints out.
You can do the next part in one or two steps; I'll explain the first method. | (pipe) directs the output to another command. We all know this, or we wouldn't be here right now... but someday someone will come across this post and wonder what it does.
head prints out the first few lines of whatever you give it. Here I specified -n1 (number of lines = one), so it prints out just the header:
Name,Age,Marks,Grades
&& continues to the next command, so long as that initial instruction was a success.
cat Student.csv again, but this time piped into tail, which prints the last few lines of whatever you give it. -n+2 specifies to spit out everything from line number 2 and beyond.
We then pipe those contents into AWK https://en.wikipedia.org/wiki/AWK ...I'm sure you could do it with sed https://en.wikipedia.org/wiki/Sed, and I started with that, but sed tends to be simpler than awk, so you'd need far more chained commands to achieve the same thing. Lisp might be able to do it more concisely, but it sounded like you were asking for shell builtins. Python's also decent with strings, but again, sh.
-F \" designates a literal " as the field separator, so that we can group the contents into 3 categories:
Sam,21, " 34,56,21,67 " , "C,B,D,A"
$1 = Sam,21,
$2 = 34,56,21,67
$3 = ,
$4 = C,B,D,A
You actually get 4, but I'm throwing out that comma in the third position. It's easy enough to put it back in.
We now need to sort those numbers, so split($2,a,",") fills an array, in this case named a, with the contents of $2 split on the , symbol.
a = [ 34, 56, 21, 67 ]
; separates AWK commands; you can mostly ignore those. If there were simply a space, awk would try to concatenate items together, and we don't want that yet.
Next, asort( a ) sorts the contents of the array a in place -- https://www.tutorialspoint.com/awk/awk_string_functions.htm
a = [ 21, 34, 56, 67 ]
Here would be a perfect time for Python's string .join() method https://www.w3schools.com/python/ref_string_join.asp
However, we don't have that available to us, and AWK doesn't seem to have an equivalent as far as I know, so we have to roll our own here. So we construct a string, b, to which each item in a will be appended. Single quotes often won't do on the command line, so you'll see double quotes.
b=""
for( i in a ) b=b a[i] ","
b begins empty. Iterating a for-loop over a's contents, we append each item followed by a comma. Leave the trailing comma for now; it'll get trimmed off in a bit.
21,34,56,67,
Exact same procedure for $4, but we name the array c this time, and the string in which those contents are concatenated with commas, d -- split( $4, c, "," ) ; asort( c ) ; d="" ; for( i in c ) d=d c[i] ",". You can name them anything you like; I just happened to have ABCD staring me in the face from those grade listings, so that's what I went with.
OK, now we have everything we need.
$1 = Sam,21,
b = 21,34,56,67,
d = A,B,C,D,
Let's format a string so they're all together.
printf "%s\"%s\",\"%s\"\n"
This will print $1 in the first %s string position, then a literal double-quote,
b into the second %s string position, next ",",
followed by d in the third %s position,
all wrapped up with a final double-quote and a newline.
However, b and d both have trailing commas, so we trim those off with AWK's substr() command. -- https://www.tutorialspoint.com/awk/awk_string_functions.htm Knowing where to begin is easy enough, but we need to chop those at one-from-the-end.
substr( b, 1, length(b) -1 )
substr( d, 1, length(d) -1 )
It'd be nice if you could just specify -2 and have it count backwards, like you can in Lua, Python, et al... but that doesn't seem to work in AWK, so whatevs. Ya live, ya learn. And there you have it, all your ducks in a row.
Sam,21,"21,34,56,67","A,B,C,D"
This does the job, maybe not elegantly, but it's within the required guidelines. I'm sure there are code-golfing possibilities in there somewhere, but it's solid logic you can follow.
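For readability, roughly the same logic can also be laid out as a standalone script instead of a one-liner. This is only a sketch of the approach described above; it still assumes GNU awk, because asort() is a gawk extension, and it loops over the sorted arrays by numeric index so the result doesn't depend on for-in traversal order:
gawk -F'"' '
NR == 1 { print; next }                              # pass the header through untouched
{
    n = split($2, a, ","); asort(a)                  # sort the Marks values
    b = ""; for (i = 1; i <= n; i++) b = b a[i] ","  # join them back, trailing comma and all
    m = split($4, c, ","); asort(c)                  # sort the Grades values
    d = ""; for (i = 1; i <= m; i++) d = d c[i] ","
    printf "%s\"%s\",\"%s\"\n", $1, substr(b, 1, length(b)-1), substr(d, 1, length(d)-1)
}' Student.csv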

Related

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but the dates are in dd-mm-yyyy and d-m-yyyy format, and sometimes there's a word in front, like word-dd-mm-yyyy. I would like to create a function to sort the values like I would in any other language, so that it ignores the first word, casts the date to a common format and compares that format. In JavaScript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched for this on Stack Overflow and none of the answers helped me, because all of them are oriented toward sorting dates that are already in a correct format, and that's not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs:\/\/organization-dumps/ambient/", "");
    if (! $0) next;
    sub("/$", "");
    d = $0;
    sub(/^[^0-9][^-]*-/, "", d);
    sub(/[^0-9]*$/, "", d);
    split(d, w, "-");
    printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort -n | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction makes some assumptions which seem to be valid for your test data, but it may need additional tweaking if you have real data with corner cases, such as a label before the date that itself contains a number surrounded by dashes.
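To make the idea concrete, after the awk stage each sample line carries its normalized date as a sortable, tab-separated prefix, for example:
2020-05-07	something-7-5-2020.dump
2021-01-08	8-1-2021
sort -n then orders on that prefix, and cut -f2- throws it away again.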
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
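For instance, the three sed invocations from the question could be collapsed into a single script (just your own expressions strung together in one sed call):
gsutil ls gs://organization-dumps/ambient |
sed 's:gs\://organization-dumps/ambient/::; /^$/d; s:/$::'
Each command then runs in turn on every line, which is all the original three-process pipeline was doing.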
Preprocess the input into the format you want for sorting.
Sort
Remove artifacts from step 1
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, it's just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Check that lines contain a date
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date in the array map based on "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the month and day with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'
Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. For each line read, we extract the date. The date is always located before the <dot> character and thus always in field 1 if the dot is the field separator (FS="."). We split the first field on the hyphen and use the total number of fields to pick out the date parts. We convert the date simplistically to a single number (YYYY*10000 + MM*100 + DD, which works because DD < 100 and MM*100 < 10000) and ask awk to sort by that number.
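For example, anoter2-6-5-2020.dump has anoter2-6-5-2020 in field 1, which splits on "-" into t[1]="anoter2", t[2]="6", t[3]="5", t[4]="2020" with n=4, so its array index becomes 2020*10000 + 5*100 + 6 = 20200506; later dates always map to larger numbers, so ascending numeric index order is chronological order.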
It is now possible to combine the full pipeline into a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

Parsing unstructured text file with grep

I am trying to analyze this IDS log file from MIT, found here.
Summarized attack: 41.084031
IDnum Date StartTime Duration Destination Attackname insider? manual? console?success? aDump? oDump iDumpBSM? SysLogs FSListing StealthyNew? Category OS
41.08403103/29/1999 08:18:35 00:04:07 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:19:37 00:01:56 209.154.098.104ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:29:27 00:00:43 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:40:14 00:24:26 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
I am trying to write commands that do two things:
First, parse through the entire file and determine the number of distinct "summarized attacks" that begin with 4x.xxxxx. I have accomplished this with:
grep -o -E "Summarized attack: 4". It returns 80.
Second, for each of the "Summarized Attacks" found by the above command, parse the table and determine the number of IDnum rows, and return the total number of rows (i.e., attacks) across all "Summarized attack" finds. I would imagine that number is somewhere around 200.
However, I am struggling to get the number of individual IDs, i.e., the rows in the IDnum column of this text file.
Since it is a text file with technically no structure, how can I parse this .txt file as if it had a tabular structure to retrieve the total entries in the IDnum column, for each Summarized attack that follows the above grep command's search text?
Desired output would be a count of all IDnums for the Summarized attacks found by the above command. I don't know the count, but I would imagine an integer output, similar to the return of 80 for grep -o -E "Summarized attack: 4". The output would be <int>, where <int> is the number of "attacks", as defined by rows in the IDnum column, across all 80 of the "Summarized attacks" found by the above grep command.
If another command other than grep is better suited, that is OK.
To count matches you can use grep -c:
grep -cE '(^Summarized.attack:.4[0-9]\.[0-9]+$)'
You can use the colon as the delimiter for cut -d
(if you loop over the results, the leading whitespace does not matter)
grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' | cut -d: -f2
Example loop:
file="path/to/master-listfile-condensed.txt"
for var in $(grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' "$file" | cut -d: -f2)
do
printf "Summarized attacks: %s: %s\n" $var \
$(grep -cE "(^.${var}[0-9]+/[0-9]{2}/[0-9]{4})" "$file")
done
^ start of line
$ end of line
. any single character (in this case a single whitespace)
\. single dot (escaped)
[0-9] single digit
+ one or more occurrences
{4} exactly four occurrences
Assuming you have more than one "Summarized attack:" in your input file this may be what you're looking for:
$ cat tst.awk
/^Summarized attack:/ {                # a new summary block starts: report the previous one first
    prt()
    atk = ($3 ~ /^4/ ? $3 : 0)         # remember the attack id, but only if it starts with 4
    cnt = 0
}
atk { cnt++ }                          # count every line while inside a 4x block
END {
    prt()                              # report the final block
    print "TOTAL", tot
}
function prt() {
    if ( atk ) {
        cnt -= 2                       # don't count the "Summarized attack:" line and the IDnum header
        print atk, cnt
    }
    tot += cnt
}
$ awk -f tst.awk file
For your first part, fgrep -c "Summarized attack: 4" (or equivalently grep -Fc "Summarized attack: 4") is sufficient.
If I understand your second part, for each of those blocks, you want to add up the attack rows and print a grand total. You can do that with
gawk '/^Summarized attack: 4/ { on=1; next} /^ 4[0-9.]*/ { if (on) ++ids; next} /^ IDnum/ {next} /^ */ {next} { on=0} END {print ids;}'< master-listfile-condensed.txt
The first statement says, search (/.../) for every line that begins with (^) "Summarized attack: 4", and upon finding it, turn on the "on" flag, and go to the next line. The second statement says, if this is an attack record (i.e. begins with 4 followed by any number of digits and dots), then check the flag; if it is on, count it. Basically, we want the flag to be on when we are in a stanza of target attack records. The next two statements say for every line that starts with " IDnum" or is all whitespace (sometimes blank lines are inserted), go to the next line; this is needed to counteract the next statement, which says that if this is not a line that matches any of the previous statements, turn off the "on" flag. This prevents us from counting attacks outside the target. Finally, END means at the end, print the grand total. I get 757, which is pretty far out of your range. But I think it is correct.
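Spread over several lines, the same flag-based approach reads roughly as follows (a sketch; note that the blank-line pattern is written here as /^ *$/ so that only genuinely empty lines are skipped):
gawk '
/^Summarized attack: 4/ { on = 1; next }   # start of a target block: raise the flag
/^ 4[0-9.]*/ { if (on) ++ids; next }       # attack record: count it while the flag is up
/^ IDnum/ { next }                         # column header: skip, leave the flag alone
/^ *$/ { next }                            # blank line: skip, leave the flag alone
{ on = 0 }                                 # any other line ends the target block
END { print ids }
' < master-listfile-condensed.txt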
But a far easier way, assuming the Summarized timestamp is always repeated in the IDnum at least to the first significant digit, would be to use
grep -Ec '^ 4' master-listfile-condensed.txt
That means count all the lines that begin with space-4. In this case it gives us the correct result.

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier, product) pairs, and I must show the number of products from every supplier and their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
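If you need the exact two-lines-per-supplier layout from the question, the datamash output can be reshaped with one small extra step; a sketch:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ print $1": "$2; gsub(/,/, " ", $3); print $1": "$3 }'
which prints:
dairy: 2
dairy: cheese milk
grocery: 2
grocery: apples pears
stationery: 3
stationery: paper pen rubber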
The following pipeline should do the job:
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with a common supplier
awk counts the items per supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< list.txt awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{                             # for each line
    # we use the word before the : as the key of an associative array
    count[$1] += 1                     # increment the count for the given supplier
    items[$1] = items[$1] " " $2       # concatenate the current item to the previous ones
}
END {                                  # after processing the whole file
    for (supp in items)                # iterate over the suppliers and print the result
        print supp": " count[supp], "\n"supp":" items[supp]
}' list.txt

awk sed backreference csv file

A question to extend the previous one here. (I prefer asking a new question rather than editing the first one. I may be wrong.)
EDIT: OK, I was wrong, I should have edited my first question. My bad (an SO question is an art, difficult to master).
I have a CSV file with a semicolon as the field delimiter. Here is an extract of the CSV file:
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I am looking for a solution to swap 2 patterns in each field.
pattern 1: any number, with an optional decimal mark (.) and optional decimal digits
e.g : 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2: any string that begins with a left parenthesis, followed by one or more characters, and ends with a right parenthesis
e.g : (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
Solutions provided for the first question are right for a simple file that contains only one column. If I try the solution on the CSV file, I face multiple failures.
So I tried a similar awk solution, which is (I think) more "column-oriented".
I have tried
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by fixing the field delimiter (;), "my regex swap" would succeed in every field. It was a mistake.
Here is an example of a failure:
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail here when it succeeds on a one-column file? What is the best tool to face this challenge?
sed with a very long regex?
awk with a very long regex?
a for loop?
other tools?
PS: I know I am not clear. I have 2 problems (the English language, and technical limitations). Sorry.
Your "question" is far too long, cluttered, and contains too many separate questions to wade through, but here's how to get the output you want from the input you provided with any sed (note the [^)]* in the regex: unlike the greedy .* in your gensub attempt, it stops at the first closing parenthesis, so each swap stays within a single field):
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, usually awk comes to the rescue:
awk -vFS=';' -vOFS=';' '{
for (i = 1; i < NF; i++) {
split($i, t, "(")
if (length(t[1]) != 0 && length(t[2]) != 0) {
$i="("t[2]" "t[1]
}
}
print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However, this will fail if fields are quoted, i.e. if the separator ; appears inside the values...
First we set the input and output separator to ;.
We iterate through all the fields in the line: for (i = 1; i < NF; i++).
We split each field on the ( character.
If the part before the ( has nonzero length and the part after it also has nonzero length,
we swap the two parts within that field and add a space in between (also remembering to put back the ( that the split removed).
And then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; I substitute a newline.
On each line I swap the string of at least one character before the ( with the parenthesized part.
I then merge every 7 lines back into one, using ; as the separator, with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of digits (possibly with a decimal part) followed by a parenthesized group and rearrange them in the desired fashion, globally throughout each line.

speed up my awk command? Answer must be awk :)

I have some awk code that is running really slowly. The format of my file is tab-delimited, 5-column ASCII. I am operating on column 5 to get a count of appropriate characters in order to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I don't want to count it, or the character that follows it.
Then I want to find out how many characters are ">" or "<" or "*" and remove them from the count. I'm guessing that using a gsub and 3 splits is less than ideal, especially since column 5 can occasionally be a very, very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Well, as I see it, your gsub pattern will not work, as the / was not closed. Anyway, if I understand correctly and you want the character count of $5 without certain characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [ and ], and do not put ^ first (a leading ^ inside the brackets would negate the class)!
Do you need to use awk, or will this work instead?
cut -f 5 < $file | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Remove all fields starting with a capital letter.
Remove the characters <, >, *, and newlines.
Count the remaining characters.
Here's a guess:
awk '
BEGIN { FS = OFS = "\t" }
{
    str = $5
    gsub(/\^.|[><*]/, "", str)
    l = length(str)
}
'
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
