How can I calculate the number of occurrences that is followed by a specific value? (add if statement) - bash

How can I calculate the number of occurrences that are ONLY followed by a specific value that is after E*? e.g:'EXXXX' ?
file.txt:
E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086
I tried
input:
awk -F, '{map[$2]++} END { for (key in map) { print key, map[key] } }' file.txt
and add:
if [[ $line2 == `E*` ]];then
but not working, have syntax error
Expected Output:
total no of occurrences:
Rv0761: 2
Rv3086:1
At the moment I can only count all occurrences of the second value.

if [[ $line2 == `E*` ]];then
This is definitely not a legal GNU AWK if statement (it looks like shell syntax); consult If Statement to find what is allowed. An if statement is not required in this case, though, as you can do the following. Let file.txt content be
E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086
then
awk 'BEGIN{FS=","}($1~/^E/){map[$2]++} END { for (key in map) { print key, map[key] } }' file.txt
gives output
Rv3086 1
Rv0761 2
Explanation: actions (enclosed in {...}) can be preceded by a pattern, which restricts their execution to lines that match the pattern (in other words, lines for which the condition holds). In the above example the pattern is $1~/^E/, which means the 1st column starts with E.
(tested in gawk 4.2.1)
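Since the question asked about adding an if statement: the same condition can also be written as an awk if inside the action. A minimal sketch, assuming the sample data from the question is fed on stdin and piping through sort only to make the output order predictable (for-in order is unspecified in awk):

```shell
# Sample data copied from the question.
data='E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086'

# Count $2 only when $1 starts with E, using an explicit if in the action.
result=$(printf '%s\n' "$data" |
  awk -F, '{ if ($1 ~ /^E/) map[$2]++ }
       END { for (key in map) print key, map[key] }' | sort)
printf '%s\n' "$result"
```

The pattern form ($1~/^E/){...} and the if form behave the same here; the pattern form is just shorter.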

You are so close. You are only missing the REGEX to identify records beginning with 'E' and then a ":" concatenated on output to produce your desired results (not in sort-order). For example you can do:
awk -F, '/^E/{map[$2]++} END { for (key in map) { print key ":", map[key] } }' file.txt
Example Output
With your data in file.txt you would get:
Rv3086: 1
Rv0761: 2
If you need the output sorted in some way, just pipe the output of the awk command to sort with whatever option you need.
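For instance, a sketch of sorting by count (highest first) and then by name, assuming the same sample data on stdin; the sort options are just one possible choice:

```shell
data='E2dd,Rv0761,Rv1408
2s32,Rv0761,Rv1862,Rv3086
6r87,Rv0761
Rv2fd90c,Rv1408
Esf62,Rv0761
Evsf62,Rv3086'

# -t: splits on the colon, -k2,2nr sorts the count numerically in reverse,
# -k1,1 breaks ties by name.
result=$(printf '%s\n' "$data" |
  awk -F, '/^E/ { map[$2]++ } END { for (key in map) print key ":", map[key] }' |
  sort -t: -k2,2nr -k1,1)
printf '%s\n' "$result"
```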

Related

awk to get first column if the a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines is expected.
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line that has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1)+0 > 15) print $1}' input_file
345
456
With your shown samples please try following GNU awk code. Using match function of GNU awk where I am using regex (^[0-9]+).*ETA=([0-9]+):[0-9]+ which creates 2 capturing groups and saves its values into array arr. Then checking condition if 2nd element of arr is greater than 15 then print 1st value of arr array as per requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would use GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use String functions, index to find where is ETA= then substr to get 2 characters after ETA=, 4 is used as ETA= is 4 characters long and index gives start position, I use +0 to convert to integer then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
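The disclaimer can be relaxed a little by checking that index actually found ETA= before calling substr. A sketch under that assumption, with a made-up third row (789) that has no ETA= at all:

```shell
# Shortened sample rows; the 789 row is hypothetical and has no ETA= tag.
data='123 pro=tegs, ETA=12:00, team=xyz
345 pro=rbs, team=abc,ETA=23:00
789 pro=none, team=abc'

# index returns 0 when the substring is absent, so p doubles as a guard.
result=$(printf '%s\n' "$data" |
  awk '{ p = index($0, "ETA=") }
       p && substr($0, p + 4, 2) + 0 > 15 { print $1 }')
printf '%s\n' "$result"
```

This still assumes the hour is written with two digits wherever ETA= does appear.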
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below) and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
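As one more illustration of that flexibility, here is a sketch that reports rows where a tag is missing instead of silently treating it as 0. The 789 row is invented for the example, and split("", v) is used as a portable way to clear the array:

```shell
# Shortened sample rows; the 789 row is hypothetical and lacks ETA.
data='123 pro=tegs, ETA=12:00, team=xyz
345 pro=rbs, team=abc,ETA=23:00
789 pro=bvy, team=efg'

result=$(printf '%s\n' "$data" | awk '
  BEGIN { FS = "[, =]+" }
  {
    split("", v)                      # clear the tag->value map
    for (i = 2; i < NF; i += 2)
      v[$i] = $(i + 1)
  }
  !("ETA" in v)     { print $1, "no ETA"; next }
  v["ETA"] + 0 > 15 { print $1 }')
printf '%s\n' "$result"
```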
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. Probably separately split the line to obtain the first column, space-separated.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). Adding 0 forces a numeric conversion, which uses the number at the beginning of the field and ignores any non-numeric text after it. So for example, on the first line, we are converting 12:00, team=xyz,user1=tom,dom=dby.com, which effectively checks if 12 is larger than 15 (which is obviously false). Without the +0, awk would compare the field as a string, which happens to give the same answer for this sample but not in general (9:00 would compare greater than 15).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
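A small sketch of how awk decides between string and numeric comparison for a field like 12:00, which does not look like a number. The three input values are made up for the demonstration:

```shell
# $1 > 15 compares as strings because the field is not purely numeric;
# $1 + 0 > 15 converts the leading digits to a number first.
result=$(printf '%s\n' '9:00' '12:00' '23:00' |
  awk '{ print $1, ($1 > 15 ? "str-yes" : "str-no"), ($1 + 0 > 15 ? "num-yes" : "num-no") }')
printf '%s\n' "$result"
```

The 9:00 row is the interesting one: as a string it sorts after 15, as a number it does not.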
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
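To see what match leaves behind, here is a sketch that prints RSTART, RLENGTH and the extracted number for the second sample line:

```shell
# RSTART is the 1-based position of the match, RLENGTH its length;
# skipping 4 characters drops the "ETA=" prefix.
result=$(echo '345 pro=rbs, team=abc,ETA=23:00' |
  awk 'match($0, /ETA=[0-9]+/) { print RSTART, RLENGTH, substr($0, RSTART + 4, RLENGTH - 4) }')
printf '%s\n' "$result"
```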

Formatting output using awk

I've a file with following content:
A 28713.64 27736.1000
B 9835.32
C 38548.96
Now, I need to check if the first column of the last row is 'C'; if so, the value of the third column of the first row should be printed in the third column of the 'C' row.
Expected Output:
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
I tried below, but it's not working:
awk '{if ($1 == "C") ; print $1,$2,$3}' file_name
Any help is most welcome!!!
This works for the given example:
awk 'NR==1{v=$3}$1=="C"{$0=$0 FS v}7' file|column -t
If you want to append the 3rd column value from A row to C row, change NR==1 into $1=="A"
The column -t part is just for making output pretty. :-)
EDIT: As per the OP's comment, the value should be taken from the very first line and appended only if the very last line of Input_file starts with C. If that is the case, try the following.
awk '
FNR==1{
value=$NF
print
next
}
prev{
print prev
}
{
prev=$0
prev_first=$1
}
END{
if(prev_first=="C"){
print prev,value
}
else{
print
}
}' file | column -t
Assuming that your actual Input_file is the same as the shown samples and you want to pick the value from the row whose first column is A:
awk '$1=="A" && FNR==1{value=$NF} $1=="C"{print $0,value;next} 1' Input_file| column -t
Output will be as follows.
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
POSIX dictates that "assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS."
So...
awk 'NR==1{x=$3} $1=="C"{$3=x} 1' input.txt
Note that the output is not formatted well, but that's likely the case with most of the solutions here. You could pipe the output through column, as Ravinder suggested. Or you could control things precisely by printing your data with printf.
awk 'NR==1{x=$3} $1=="C"{$3=x} {printf "%-2s%-26s%s\n",$1,$2,$3}' input.txt
If your lines can be expressed in a printf format, you'll be able to avoid the unpredictability of column -t and save the overhead of a pipe.
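A quick sketch of the POSIX rule quoted above, using an OFS of | so the rebuilt record is easy to see. Assigning to $4 on a two-field line creates an empty $3 in between:

```shell
# Assigning to a field beyond NF raises NF and rebuilds $0 with OFS.
result=$(echo 'C 38548.96' | awk -v OFS='|' '{ $4 = "x"; print NF; print }')
printf '%s\n' "$result"
```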

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example ; otherwise just pipe it to sort), you can use the following awk command :
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
It has the advantage over the other solutions posted as of this edit of not holding the whole data in memory. This comes at the cost of needing sorted input (and sorting is what would have to hold lots of data in memory if the input wasn't sorted already).
Explanation :
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one : the app mentioned in the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one : the info for the previous host is printed, then we reinitialize the variables to the values of the current line of input
the last line prints the infos of the current host when we have reached the end of the input
It probably can be refined (so much redundancy !), but I'll leave that to someone more experienced with awk.
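One possible refinement, as a sketch rather than a definitive version: using next after each handled case removes the repeated NR > 1 tests. The variable names host and apps are shortened for this example, and the input is deliberately unsorted and piped through sort first:

```shell
data='host2,app4
host1,app1
host2,app5
host1,app2'

result=$(printf '%s\n' "$data" | sort -t, -k1,1 | awk -F, '
  NR == 1    { host = $1; apps = $2; next }        # first line: initialise
  host == $1 { apps = apps "," $2; next }          # same host: append app
             { print host ";" apps; host = $1; apps = $2 }  # new host: flush
  END        { if (NR) print host ";" apps }')     # flush the last group
printf '%s\n' "$result"
```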
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates entries in the array a for each unique element in the first column. It appends to that array entry every element from the second column.
When the file is parsed, the content of the array is printed.

Replace special characters in variable in awk shell command

I am currently executing the following command:
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; print H > "/Directory/FILE_"$3"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$3"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
This takes the value from the third position in the CSV file and creates a CSV for each distinct $3 value. Works as desired.
The input file looks as follows:
Name, Amount, ID
"ABC", "100.00", "0000001"
"DEF", "50.00", "0000001"
"GHI", "25.00", "0000002"
Unfortunately I have no control over the value in the source (CSV) sheet, the $3 value, but I would like to eliminate special (non-alphanumeric) characters from it. I tried the following to accomplish this but failed...
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; name=${$3//[^a-zA-Z_0-9]/}; print H > "/Directory/FILE_"$name"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$name"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
Suggestions? I'm hoping to do this in a single command but if anyone has a bash script answer that would work.
This is definitely not a job you should be using getline for, see http://awk.info/?tip/getline
It looks like you just want to reproduce the first line of your input file in every $3-named file. That'd be:
awk -F, '
NR==1 { hdr=$0; next }
$3 != prev { prev=name=$3; gsub(/[^[:alnum:]_]/,"",name); $0 = hdr "\n" $0 }
{ print > ("/Directory/FILE_" name "_DOWNLOAD.csv") }
' /Directory/FILE_ALL_DOWNLOAD.csv
Note that you must always parenthesize expressions on the right side of output redirection (>) as it's ambiguous otherwise and different awks will behave differently if you don't.
Feel free to put it all back onto one line if you prefer.
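To try this without touching /Directory, here is a sketch that writes into a temporary directory instead. The dir and out variables and the seen array are additions for this example; building the file name in a variable first is another way to keep the redirection target unambiguous, and it also copes with input that is not grouped by ID:

```shell
dir=$(mktemp -d)   # scratch directory standing in for /Directory

printf '%s\n' \
  'Name, Amount, ID' \
  '"ABC", "100.00", "0000001"' \
  '"DEF", "50.00", "0000001"' \
  '"GHI", "25.00", "0000002"' |
awk -F, -v dir="$dir" '
  NR == 1 { hdr = $0; next }
  {
    name = $3
    gsub(/[^A-Za-z0-9_]/, "", name)          # strip quotes, spaces, etc.
    out = dir "/FILE_" name "_DOWNLOAD.csv"
  }
  !(out in seen) { seen[out]; print hdr > out }   # header once per file
  { print > out }'

files=$(ls "$dir")
first=$(cat "$dir/FILE_0000001_DOWNLOAD.csv")
rm -rf "$dir"
printf '%s\n' "$files"
```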
If you always expect the number to be in the last field of your CSV and you know that each field is wrapped in quotes, you could use this awk to extract the value 456 from the input you have provided in the comment:
echo " 123.", "Company Name" " 456." | awk -F'[^a-zA-Z0-9]+' 'NF { print $(NF-1) }'
This defines the field separator as any number of non-alphanumeric characters and retrieves the second-last field.
If this is sufficient to reliably retrieve the value, you could construct your filename like this:
file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"
and output to it as you're already doing.
bash variable expansions do not occur in single quotes.
They also cannot be performed on awk variables.
That being said you don't need that to work.
awk has string manipulation functions that can perform the same tasks. In this instance you likely want the gsub function.
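A minimal sketch of that idea, using one line from the sample input. The value is copied into a variable (id, a name chosen for this example) so that gsub does not modify $3 and rebuild the record:

```shell
# gsub edits its third argument in place; working on a copy leaves $0 intact.
result=$(echo '"GHI", "25.00", "0000002"' |
  awk -F, '{ id = $3; gsub(/[^A-Za-z0-9]/, "", id); print id }')
printf '%s\n' "$result"
```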
Would this not work for what you asked ?
awk -F, 'NR==1{x=$0;next}
{n=$3;gsub(/[^[:alnum:]]/,"",n);f="/Directory/FILE_"n"_DOWNLOAD.csv";if(!(f in s)){s[f];print x > f};print > f}' file

Grab nth occurrence in between two patterns using awk or sed

I have an issue where I want to parse through the output from a file and I want to grab the nth occurrence of text in between two patterns preferably using awk or sed
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done, essentially the output would be
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this :
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic :
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big :
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range. However, you can easily modify this to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and prepend a null character to the pattern space. In the example, if the pattern space then contains exactly three null characters, swap the original line back out of the hold space. Now loop, printing the contents of the pattern space until the last pattern is matched. When this last pattern is found, sed will quit. If it's not found, sed will read the next line of input and continue in its loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
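The script above can be wrapped for reuse. A sketch, where nth_block is a hypothetical function name, the input is read from stdin, and the sample input has stray lines (foo, bar) between blocks to show that they are skipped:

```shell
# nth_block N: print the Nth category..done block read from stdin.
nth_block() {
  awk -v tgt="$1" '
    /^category$/ { fnd = 1; rec = "" }      # start collecting a new block
    fnd {
      rec = rec $0 ORS
      if (/^done$/) {
        if (++cnt == tgt) { printf "%s", rec; exit }
        fnd = 0                             # ignore lines until next block
      }
    }'
}

result=$(printf '%s\n' category 1 done foo category 2 done bar category 3 done |
  nth_block 2)
printf '%s\n' "$result"
```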
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3

Resources