Countif-like function in AWK - bash

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
So column 1 would have the range of values and column 2 would have the number of times each value appears in column 1.

Count how many times each value appears in the first column and append the count to the end of each line:
$ cat file
1,2,3
1,2,3
9,7,4
1,5,7
3,2,1
$ awk -F, '{c[$1]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file
1,2,3,3
1,2,3,3
9,7,4,1
1,5,7,3
3,2,1,1
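For readability, here is the same logic spread over multiple lines (a sketch, functionally equivalent to the one-liner above):
awk -F, '
    { count[$1]++; line[NR] = $0 }             # count first-column values and remember each line
    END {
        for (i=1; i<=NR; i++) {
            split(line[i], s, ",")             # recover the first column of the stored line
            print line[i] "," count[s[1]]      # append how often that value occurred
        }
    }
' file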

One more solution using Perl.
perl -F, -lane ' $kv{$F[0]}++;$kl{$.}=$_;END {for(sort keys %kl) { $x=(split(",",$kl{$_}))[0]; print "$kl{$_},$kv{$x}" }} '
Borrowing input from Chris
$ cat kbiles.txt
1,2,3
1,2,3
9,7,4
1,5,7
3,2,1
$ perl -F, -lane ' $kv{$F[0]}++;$kl{$.}=$_;END {for(sort keys %kl) { $x=(split(",",$kl{$_}))[0]; print "$kl{$_},$kv{$x}" }} ' kbiles.txt
1,2,3,3
1,2,3,3
9,7,4,1
1,5,7,3
3,2,1,1
$

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) which contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines should be output:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line for rows which have ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
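A commented version of that loop, as a sketch of what it does (behaviour unchanged):
awk -F"[=, ]" '{
    for (i=1; i<NF; i++)        # walk the fields produced by splitting on =, comma and space
        if ($i == "ETA")        # the field right after "ETA" holds the hour part, e.g. 12:00
            if ($(i+1) > 15)    # compare that value against 15
                print $1        # and print only the first column of matching lines
}' input_file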
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. If the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
    print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index and substr: index finds where ETA= starts, and substr takes the 2 characters after ETA= (the offset of 4 is used because ETA= is 4 characters long and index gives the start position). I add +0 to convert that to an integer and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a field separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line on whitespace to recover the original, space-separated first column.
awk -F 'ETA=' '$2 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any). Because that text is not a pure number, awk compares it as a string against "15"; for these two-digit hour values that gives the same result as comparing 12 or 23 against 15 (so the first line fails the test and the other two pass), although a single-digit hour such as ETA=9:00 would also compare greater than "15" lexically.
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
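If you'd rather force a proper numeric comparison instead of relying on that, adding +0 to the field makes awk use its leading digits as a number. A small variant of the same command (a sketch, not the original answer's exact code):
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt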
Using awk you could match ETA= followed by 1 or more digits, then take the match without the ETA= part, check whether the number is greater than 15, and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should also start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Searching for a string between two characters

I need to find two numbers in lines which look like this:
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for a colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second one and print the result on the screen for each line.
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk: separate the row into multiple columns using ":" and "-" as separators. In each row containing ":", subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899

How can one dynamically create a new csv from selected columns of another csv file?

I dynamically iterate through a csv file and select columns that fit the criteria I need. My CSV is separated by commas.
I save these indexes to an array that looks like
echo "${cols_needed[@]}"
1 3 4 7 8
I then need to write these columns to a new file. I've tried the following cut and awk commands; however, as the array is dynamically created, I can't seem to find the right commands to select them all at once. I have tried cut, awk and paste commands.
awk -v fields=${cols_needed[@]} 'BEGIN{ n = split(fields,f) }
{ for (i=1; i<=n; ++i) printf "%s%s", $f[i], (i<n?OFS:ORS) }' test.csv
This throws an error, as it cannot split the fields unless I hard-code them, split on spaces (and even then, it can only do 2):
fields="1 2"
I have tried to dynamically create -f parameters, but can only do so with one variable in a loop like so
for item in "${cols_needed[@]}";
do
    cat test.csv | cut -f$item
done
which outputs one column at a time.
And I have tried to dynamically create it with commas - input as 1,3,4,7...
cat test.csv | cut -f${cols_needed[@]};
which also does not work!
Any help is appreciated! I understand awk does not work like bash and we cannot pass variables around in the same way. I feel like I'm going around in circles a bit! Thanks in advance.
Your first approach is ok, just:
change -v fields=${cols_needed[@]} to -v fields="${cols_needed[*]}", to pass the array as a single shell word
add FS=OFS="," to BEGIN, after splitting (you want to split on spaces, before FS is changed to ,)
i.e. BEGIN {n = split(fields, f); FS=OFS=","} (the combined command is sketched below)
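Put together, the corrected command would look something like this (a sketch only, using the same test.csv and space-separated column list as above):
awk -v fields="${cols_needed[*]}" '
    BEGIN { n = split(fields, f); FS = OFS = "," }
    { for (i=1; i<=n; ++i) printf "%s%s", $(f[i]), (i<n ? OFS : ORS) }
' test.csv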
Also, if there are no commas embedded in quoted csv fields, you can use cut:
IFS=,; cut -d, -f "${cols_needed[*]}" test.csv
If there are embedded commas, you can use gawk's FPAT, to only split fields on unquoted commas.
Here's an example using that.
# prepend $ to each number
for i in "${cols_needed[@]}"; do
    fields[j++]="\$$i"
done
IFS=,
gawk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, "{print ${fields[*]}}"
Injecting shell code into an awk command is generally not great practice, but it's ok here IMO.
Expanding on my comments re: passing the bash array into awk:
Passing the array in as an awk variable:
$ cols_needed=(1 3 4 7 8)
$ typeset -p cols_needed
declare -a cols_needed=([0]="1" [1]="3" [2]="4" [3]="7" [4]="8")
$ awk -v fields="${cols_needed[*]}" 'BEGIN{n=split(fields,f); for (i=1;i<=n;i++) print i,f[i]}'
1 1
2 3
3 4
4 7
5 8
Passing the array in as a 'file' via process substitution:
$ awk 'FNR==NR{f[++n]=$1;next} END {for (i=1;i<=n;i++) print i,f[i]}' <(printf "%s\n" "${cols_needed[@]}")
1 1
2 3
3 4
4 7
5 8
As for OP's main question of extracting a specific set of columns from a .csv file ...
Borrowing dawg's .csv file:
$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28
Expanding on the suggestion for passing the bash array in as an awk variable:
awk -v fields="${cols_needed[*]}" '
BEGIN { FS=OFS=","
        n=split(fields,f," ")
}
{ pfx=""
  for (i=1;i<=n;i++) {
      printf "%s%s", pfx, $(f[i])
      pfx=OFS
  }
  print ""
}
' file.csv
NOTE: this assumes the OP has provided a valid list of column numbers; if there's some doubt as to the validity of the input (column) numbers then the OP can add some logic to address said doubts (e.g., are they integers? are they positive integers? do they reference a field (in file.csv) that actually exists?, etc); a rough sketch of such checks follows after the output below.
This generates:
1,3,4,7,8
11,13,14,17,18
21,23,24,27,28
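As a rough illustration of that NOTE (a sketch only; the positive-integer regex and the error messages are my own assumptions, not something the OP specified), the BEGIN block can validate the requested column numbers and the main block can check them against NF:
awk -v fields="${cols_needed[*]}" '
BEGIN {
    FS = OFS = ","
    n = split(fields, f, " ")
    for (i=1; i<=n; i++)
        if (f[i] !~ /^[1-9][0-9]*$/) {                     # must be a positive integer
            print "invalid column number: " f[i] > "/dev/stderr"
            exit 1
        }
}
{
    for (i=1; i<=n; i++)
        if (f[i]+0 > NF) {                                 # column missing from this record
            print FILENAME ": line " FNR ": no column " f[i] > "/dev/stderr"
            next
        }
    pfx = ""
    for (i=1; i<=n; i++) { printf "%s%s", pfx, $(f[i]); pfx = OFS }
    print ""
}
' file.csv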
Suppose you have this variable in bash:
$ echo "${cols_needed[@]}"
3 4 7 8
And this CSV file:
$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28
You can select columns of that csv file in awk this way:
awk '
BEGIN{FS=OFS=","}
FNR==NR{split($0, cols," "); next}
{
    s=""
    for (e=1;e<=length(cols); e++)
        s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
    print s
}' <(echo "${cols_needed[@]}") file.csv
Prints:
3,4,7,8
13,14,17,18
23,24,27,28
Or, you can do:
awk -v cw="${cols_needed[*]}" '
BEGIN{FS=OFS=","; split(cw, cols," ")}
{
    s=""
    for (e=1;e<=length(cols); e++)
        s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
    print s
}' file.csv
# same output
BTW, you can do this entirely with cut:
cut -d ',' -f $(IFS=, ; echo "${cols_needed[*]}") file.csv
3,4,7,8
13,14,17,18
23,24,27,28

How to export original unique values using awk

This command works great for collapsing duplicates and giving only unique values:
awk '!x[$0]++' filewithdupes > newfile
However, I want to keep the original unique values.
Example:
If I have this simple set of values in a CSV column:
1
1
2
2
3
The command above outputs this:
1
2
3
But I want:
3
How can I modify this command to keep the original unique value? Or is there a command better suited to what I'm trying to do?
You may use this awk to print records that have only one occurrence:
awk '{x[$0]++} END{for (i in x) if (x[i] == 1) print i}' filewithdupes
3
if your file is already sorted as in the example, the simplest will be
$ uniq -u file
3
otherwise, a double scan algorithm
$ awk 'NR==FNR{a[$1]++; next} a[$1]==1' file{,}
3
Could you please try the following.
awk 'FNR==NR{a[$0]++;next} a[$0]==1' Input_file Input_file
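Spelled out, that is the same two-pass idea (a commented sketch, equivalent to the one-liner above):
awk '
    FNR == NR { seen[$0]++; next }   # first pass: count how often each line occurs
    seen[$0] == 1                    # second pass: print only the lines that occurred exactly once
' Input_file Input_file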

Shell Script add column values

I have a text file which contains lines like the below:
{"userId":"f1fcab","count":"3","type":"Stack"}
{"userId":"fcab","count":"2","type":"Stack"}
{"userId":"abcd","count":"5","type":"Stack"}
I want to get the sum of the count values.
I am using awk to achieve this, like below:
$ awk -F "," '{print $4}' test.txt
How can I get only the integer values using awk and add them all?
My script should give me:
sum=10
You could try the below,
$ awk -F'"' '{sum = sum + $8;}END{print "sum="sum+0}' file
sum=10
-F'"' sets the double quote as the FS value. Awk splits the row into columns according to the value of the FS variable.
sum = sum + $8 calculates the sum of all the values in column no. 8 and stores it in a variable called sum.
Finally, printing the variable sum at the end gives you the desired output.
You can get the value of the count key using double quotes (") as the delimiter, so that the eighth column will be the value to count on:
$ awk -F"\"" 'BEGIN {sum=0} {sum+=$8} END {print sum}' fd
10
Assuming consistent use of double quote characters, you can use:
awk -F\" '{s += $8} END{print "sum=" s+0}' inputFile
This will generate:
sum=10
This works because a quote delimiter splits
{"userId":"f1fcab","count":"3","type":"Stack"}
into these fields:
$1 = {    $2 = userId    $3 = :    $4 = f1fcab    $5 = ,    $6 = count    $7 = :    $8 = 3    ...
awk -F'[:"]' '{sum+=$10} END{print "sum=" sum}' File
Setting ':' and '"' as delimiters, then taking the 10th field, which is the count value; add them up in sum and print it at the end.
Example:
sdlcb@ubuntu:~/AMD_C/SO$ cat File
{"userId":"f1fcab","count":"3","type":"Stack"}
{"userId":"fcab","count":"2","type":"Stack"}
{"userId":"abcd","count":"5","type":"Stack"}
sdlcb@ubuntu:~/AMD_C/SO$ awk -F'[:"]' '{sum+=$10} END{print "sum=" sum}' File
sum=10
