Awk: Stop 'print' adding newline and how to loop/automate - bash

As it stands, I have tab delimited data laid out like this (headers added here for clarity):
EntryID GroupID Result
039848 00100 Description 1
088345 00200 Description 2
748572 00435 Description 3
884938 00200 Description 2
000392 00200 Description 3
008429 00100 Description 4
What I am trying to do is condense my data into groups. I wish to output a table with column A being groupIDs (with no duplication) and column B being a combination of all descriptions associated with that group. An example output would be:
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3
00435 Description 3
I've tried to write an awk command to produce one line at a time, given a Group ID as a parameter:
$ awk -F '\t' '/00100/ { print $2 '\t' $3 }' table.txt > output.txt
This works; however, each hit is printed on a new line, like this:
00100 Description 1
00100 Description 2
etc
I gather that this can be solved by setting ORS to an alternate character, or by using printf rather than print, but when I try either of these
$ awk -F '\t' 'BEGIN {ORS = '\t'} /00100/ { print $2 '\t' $3 }' table.txt > output.txt
or
$ awk -F '\t' '/00100/ { printf $2 '\t' $3 }' table.txt > output.txt
Nothing actually changed in the output.
Once I get that solved, the other problem I have is that I have thousands of groups to repeat this with. I have a list of every group ID present in the data, stored in a different file, and I'd like to automate feeding that to awk for each ID.
I've tried modifying a command I've seen used to feed IDs to grep in a similar fashion, but I haven't had any luck with that either, as it just hangs:
$ for i in `cat groupIDs.txt`; do awk -F '\t' '/$i/ { print $2 '\t' $3 }' table.txt' >> test_results.txt ; done;
Any ideas how I can solve these issues?
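A minimal sketch of a working loop, assuming groupIDs.txt holds one ID per line: inside single quotes the shell never expands $i, so awk searches for the literal pattern /$i/, and the stray quote after table.txt leaves the shell waiting for more input, which is why it hangs. Passing the ID in with -v and matching field 2 exactly avoids both problems:
while IFS= read -r id; do
    awk -F '\t' -v id="$id" '
        $2 == id { printf "%s%s", (sep ? sep : id "\t"), $3; sep = " | " }   # no newline per hit
        END      { if (sep) print "" }                                       # one newline per group
    ' table.txt
done < groupIDs.txt > test_results.txt
This still re-reads table.txt once per group, though; the single-pass answers below scale far better for thousands of groups.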

I'm not much on awk, but you can do this with bash, sort, grep, cut and paste:
#!/bin/bash
groups=$(cut -f2 "$1" | sort -u)
for group in $groups ; do
    echo -n "$group "
    cut -f2- "$1" | grep "^$group" | cut -f2 | paste -d"|" -s -
done
This produces the following output:
00100 Description 1|Description 4
00200 Description 2|Description 2|Description 3
00435 Description 3
Not sure if the output delimiter has to be " | " or if "|" will do.
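If the spaced " | " form is needed: paste -d accepts only single-character delimiters, so one option (a sketch, not part of the original script) is to widen the bars afterwards with sed:
cut -f2- "$1" | grep "^$group" | cut -f2 | paste -sd'|' - | sed 's/|/ | /g'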

You can try this awk command:
$ awk '{i=$2;$1=""; $2="";a[i]=a[i]?a[i]" |"$0:$0}END{for (i in a) print i, a[i]} ' file
00435 Description 3
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3
Or since the file is tab delimited, you can simplify it to
$ awk -F'\t' '{a[$2]=a[$2]?a[$2]" | "$3:$3}END{for (i in a) print i"\t"a[i]} ' file
00435 Description 3
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3
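The ternary in both commands is just a compact "append with a separator unless this is the first entry" idiom; spelled out longhand (equivalent here, since no description is empty), the second command reads:
awk -F'\t' '{
    if ($2 in a)
        a[$2] = a[$2] " | " $3    # group seen before: append another description
    else
        a[$2] = $3                # first description for this group
}
END { for (i in a) print i "\t" a[i] }' file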

$ cat tst.awk
BEGIN {
    FS = OFS = "\t"
    split(tgtS, tmpA, /,/)
    for (i in tmpA)
        tgtA[tmpA[i]]
}
(!tgtS) || ($2 in tgtA) {
    descs[$2] = descs[$2] sep[$2] $3
    sep[$2] = " | "
}
END {
    for (gid in descs)
        print gid, descs[gid]
}
$
$ gawk -f tst.awk file
00435 Description 3
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3
$
$ gawk -v tgtS="00100" -f tst.awk file
00100 Description 1 | Description 4
$
$ gawk -v tgtS="00100,00200" -f tst.awk file
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3

Code:
#!/usr/bin/awk -f
BEGIN {
    FS = OFS = "\t"
    getline                        # consume the header line
}
{
    if ($2 in a) {
        a[$2] = a[$2] " | " $3     # group seen before: append another description
    } else {
        a[$2] = $3                 # first description for this group
        b[i++] = $2                # remember the order groups first appear in
    }
}
END {
    for (j = 0; j < i; ++j) {
        k = b[j]
        print k, a[k]
    }
}
Input:
EntryID GroupID Result
039848 00100 Description 1
088345 00200 Description 2
748572 00435 Description 3
884938 00200 Description 2
000392 00200 Description 3
008429 00100 Description 4
Output:
00100 Description 1 | Description 4
00200 Description 2 | Description 2 | Description 3
00435 Description 3
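A side note (an observation, not from the original answer): because b[] records each group the first time it appears, this version prints groups in input order rather than the arbitrary order the for (i in ...) loops in the other answers produce.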

Related

How to add Column values based on unique value of a different column

I am trying to add values in Column B based on unique values in Column A. How can I do it using awk, or any other way in bash?
Column_A | Column_B
--------------------
A | 1
A | 2
A | 1
B | 3
B | 8
C | 5
C | 8
Result:
Column_A | Column_B
--------------------
A | 6
B | 11
C | 13
Considering that your Input_file is the same as shown and sorted on the first field, could you please try the following.
awk '
BEGIN{
    OFS=" | "
}
FNR==1 || /^-/{
    print
    next
}
prev!=$1 && prev{
    print prev,sum
    prev=sum=""
}
{
    sum+=$NF
    prev=$1
}
END{
    if(prev && sum){
        print prev,sum
    }
}' Input_file
Another awk:
$ awk 'NR<3 {print; next}
{a[$1]+=$NF; line[$1]=$0}
END {for(k in a) {sub(/[0-9]+$/,a[k],line[k]); print line[k]}}' file
Column_A | Column_B
--------------------
A | 4
B | 11
C | 13
Note that A totals to 4, not 6: the question's sample data for A (1+2+1) doesn't add up to its expected result.
One possible solution (assuming the file is in CSV format):
Input :
$ cat csvtest.csv
A,1
A,2
A,3
B,3
B,8
C,5
C,8
$ cat csvtest.csv | awk -F "," '{arr[$1]+=$2} END {for (i in arr) {print i","arr[i]}}'
A,6
B,11
C,13
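As an aside, if GNU datamash happens to be installed (an assumption; none of the answers above require it), the grouped sum over the CSV collapses to a single command, where -s sorts on the group key first:
$ datamash -s -t, -g 1 sum 2 < csvtest.csv
A,6
B,11
C,13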

CONCAT columns within a file

I'd like to concatenate column2 through column4.
Example (first.txt):
|ID|column2|column3|column4|
|1 | a | b | c |
|2 | d | e | f |
To this (mynewfile.txt) :
ID|column2
1 | a b c
2 | d e f
This is my script in Cygwin: $ awk '{print $2" "$3" "$4 }' first.txt > mynewfile.txt
Of course, it is not working out well. How do I improve the script?
You need to set the field separator so that a pipe with optional whitespace around it is the field delimiter.
The pipe at the beginning of the line causes an empty field 1 before the pipe, so the ID is field 2, and columns 2-4 are fields 3-5. So it should be:
awk -F' *\\| *' 'NR == 1 {print "ID|column2"} NR > 1 {printf("%d | %s %s %s\n", $2, $3, $4, $5)}' first.txt > mynewfile.txt
Not especially general GNU sed method:
sed 's/^[|]//;1s/2.*/2/;1!{s/|/ /g2;s/ */ /2g}' first.txt
Output:
ID|column2
1 | a b c
2 | d e f

Replace string in Nth array

I have a .txt file with rows that look like this:
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 0
I want to add to the counts, so I need to replace the last number, but only on a single line.
I get the current value with this command:
$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)
Then I want to replace it with the new value, which is larger by 1.
I do that here:
counts=(("$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"+1))
where idOpen is the id of the row where I need to replace the string.
So I have tried to replace the whole row with this:
counter="$(awk -v i=$idOpen 'BEGIN{FNqR == i}{$7+=1} END{ print $0}' bookmarks)"
N=$idOpen
sed -i "{N}s/.*/${counter}" bookmarks
But it doesn't work!
So is there a way to replace only the last field with the value I got earlier?
As a result I need to get:
id | String1 | String2 | Counts
1 | Abc | Abb | 1 # if idOpen was 1 for 1 time
2 | Cde | Cdf | 2 # if idOpen was 2 for 2 times
The last number should be increased by 1 every time I run these commands.
awk solution:
setting the idOpen variable (for example, 2):
idOpen=2
awk -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file > tmp && mv tmp file
The output(after executing the above command twice):
cat file
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 2
NR>1 - skipping the header line
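If the awk in use is GNU awk 4.1 or newer (an assumption; the answer above needs only a POSIX awk), the tmp && mv step can be replaced with in-place editing:
gawk -i inplace -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file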

Sum of Columns for multiple variables

Using a shell script (bash), I am trying to sum the columns for all the different variables of a list. Suppose I have the following input in a Test.tsv file:
Win Lost
Anna 1 1
Charlotte 3 1
Lauren 5 5
Lauren 6 3
Charlotte 3 2
Charlotte 4 5
Charlotte 2 5
Anna 6 4
Charlotte 2 3
Lauren 3 6
Anna 1 2
Anna 6 2
Lauren 2 1
Lauren 5 5
Lauren 6 6
Charlotte 1 3
Anna 1 4
And I want to sum up how much each of the participants has won and lost. So I want to get this as a result:
Sum Win Sum Lost
Anna 15 13
Charlotte 15 19
Lauren 27 26
What I would usually do is take the sum per person and per column and repeat that process over and over. See below how I would do it for the example mentioned:
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f2 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f3 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
However, I would need to repeat these lines for every participant, which becomes a pain when there are too many variables to sum over.
What would be the way to write this script?
Thanks!
This is pretty straightforward with awk. Using GNU awk:
awk -F '\t' 'BEGIN { OFS = FS } NR > 1 { won[$1] += $2; lost[$1] += $3 } END { PROCINFO["sorted_in"] = "#ind_str_asc"; print "", "Sum Win", "Sum Lost"; for(p in won) print p, won[p], lost[p] }' filename
-F '\t' makes awk split lines at tabs, then:
BEGIN { OFS = FS }    # the output should be separated the same way as the input
NR > 1 {              # from the second line forward (skip the header)
    won[$1] += $2     # tally up totals
    lost[$1] += $3
}
END {                 # when done, print the lot
    # GNU-specific: sorted traversal of player names
    PROCINFO["sorted_in"] = "#ind_str_asc"
    print "", "Sum Win", "Sum Lost"
    for(p in won) print p, won[p], lost[p]
}
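PROCINFO["sorted_in"] is GNU-specific; with a plain POSIX awk the same sorted report can be produced by printing the header separately and piping the totals through sort (a sketch under that assumption):
printf '\tSum Win\tSum Lost\n'
awk -F '\t' 'NR > 1 { won[$1] += $2; lost[$1] += $3 }
    END { for (p in won) print p "\t" won[p] "\t" lost[p] }' Test.tsv | sort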

retrieve and add two numbers of files

In my file I have the following structure:
A | 12 | 10
B | 90 | 112
C | 54 | 34
I have to add column 2 and column 3 and print the result with column 1.
Output:
A | 22
B | 202
C | 88
I can retrieve the two columns but don't know how to add them. What I did is:
cut -d ' | ' -f3,5 myfile.txt
How can I add those columns and display the result?
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
    echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk.
awk '{print $1," | ",($3+$5)}' myfile.txt
will perhaps work; with the default whitespace splitting, the numbers land in $3 and $5.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS =/= OFS in this case due to formatting
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.
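A variant sketch (not from the original answer): printf sidesteps OFS entirely and gives exact control over the spacing, and a regex FS that swallows the blanks around each bar makes the fields come out clean:
awk -F' *\\| *' '{ printf "%s | %d\n", $1, $2 + $3 }' input_filename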
