Grouping the rows of a text file based on 2 columns - shell

I have a text file like this:
1 abc 2
1 rgt 2
1 yhj 2
3 gfk 4
5 kji 6
3 plo 4
3 vbn 4
5 olk 6
I want to group the rows on the basis of the first and third columns, like this:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
so that I can see the values of col2 for each particular pair of (col1, col3).
How can I do this with a shell script?

This should do it:
awk -F " " '{ a[$1" "$3]=a[$1" "$3]$2","; }END{ for (i in a)print i, a[i]; }' file.txt | sed 's/,$//g' | awk -F " " '{ tmp=$3;$3=$2;$2=tmp;print }' |sort
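The same grouping also fits in a single awk invocation if the two key fields are split apart again at the end; a minimal sketch (final ordering left to sort):
awk '{ k = $1 FS $3; val[k] = (k in val) ? val[k] "," $2 : $2 }
END { for (k in val) { split(k, p, FS); print p[1], val[k], p[2] } }' file.txt | sort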

Just using awk:
#!/usr/bin/env awk -f
{
    # build a composite key from fields 1 and 3; \x1C (the ASCII file
    # separator) is assumed not to appear in the data, so the key is unambiguous
    k = $1 "\x1C" $3
    if (k in a2) {
        a2[k] = a2[k] "," $2
    } else {
        a1[k] = $1
        a2[k] = $2
        a3[k] = $3
        b[++i] = k    # remember insertion order
    }
}
END {
    for (j = 1; j <= i; ++j) {
        k = b[j]
        print a1[k], a2[k], a3[k]
    }
}
One line:
awk '{k=$1"\x1C"$3;if(k in a2){a2[k]=a2[k]","$2}else{a1[k]=$1;a2[k]=$2;a3[k]=$3;b[++i]=k}}END{for(j=1;j<=i;++j){k=b[j];print a1[k],a2[k],a3[k]}}' file
Output:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
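A note on running the standalone script: the #!/usr/bin/env awk -f shebang is not portable (Linux passes "awk -f" to env as a single argument), so it is safest to invoke awk explicitly, e.g. with the script saved under a hypothetical name group.awk:
awk -f group.awk file.txt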

Related

Generate table from "awk" command of different files in a bash program?

I need to make a table of numbers, where the numbers are obtained from different files. My code is:
#!/bin/sh
for K in 1.7e-2; do
    dir0=Kn_${K}
    for P in 1.4365 2.904; do
        dir1=P${P}
        for r in 0.30 0.35; do
            dir2=${r}
            awk '/result is =/{print $NF}' ./First/${dir0}/${dir1}/R\=${dir2}/Results.dat
        done
    done
done
exit
I obtain:
1
2
3
4
but I need
1 3
2 4
I read some posts on this topic, but they deal with existing files rather than with data generated on the fly.
Thanks for your help and support.
I obtained the data as:
1
2
3
4
5
6
7
8
With the pipe pr -4ts$'\t' (thanks @karakfa) I obtained:
1 3 5 7
2 4 6 8
then, with a script to transpose:
#!/bin/sh
awk '
{
    for (i = 1; i <= NF; i++) {
        a[NR, i] = $i
    }
}
NF > p { p = NF }
END {
    for (j = 1; j <= p; j++) {
        str = a[1, j]
        for (i = 2; i <= NR; i++) {
            str = str " " a[i, j]
        }
        print str
    }
}' "$1"
I obtained
1 2
3 4
5 6
7 8
I have a problem with how the numbers are interleaved. I need:
1 5
2 6
3 7
4 8
but I don't know what the problem is with the pr command and its options. Thanks for your help.
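For what it's worth, pr fills its columns top to bottom, so asking for two columns instead of four should produce exactly the interleaving wanted here; a quick sketch:
printf '%s\n' 1 2 3 4 5 6 7 8 | pr -2ts$'\t'
1	5
2	6
3	7
4	8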

Calculating the sum of every third column from many files

I have many files with three columns, of the form:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
The first two columns are the same in each file. I want to calculate the sum of the 3rd column across all files, to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files,
awk 'FNR==NR { _a[FNR]=$3 } NR!=FNR { $3 += _a[FNR]; print }' file*
works perfectly (I found this solution via Google). How can I extend it to many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above uses GNU awk for ARGIND. With other awks, just add FNR==1{ARGIND++} at the start.
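Spelled out for a non-GNU awk, that workaround looks like this (a sketch; in gawk itself ARGIND is built in and must not be incremented manually):
awk 'FNR==1{ARGIND++} {sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*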
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a holds the cumulative sum of the 3rd column over all files.
Array b stores the 1st and 2nd column values.
At the end, we print the contents of arrays b and a.
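Note that length(a) on an array is a gawk extension rather than POSIX awk; a portable sketch records the row count while reading the first file instead:
awk 'NR==FNR{b[FNR]=$1 FS $2; n=FNR} {a[FNR]+=$3} END{for(i=1;i<=n;i++) print b[i] FS a[i]}' file*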
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version:
The variable start decides at which column the summing begins; set it to 2 and columns 2, 3, ... of all files are summed instead. Since all the files have the same number of fields and rows, this works well.
awk -v start=3 '
NF {
    for (i = 1; i <= NF; i++)
        a[FNR, i] = i >= start ? a[FNR, i] + $i : $i
}
END {
    for (j = 1; j <= FNR; j++) {
        s = ""
        for (i = 1; i <= NF; i++) {
            s = (s ? s OFS : "")((j, i) in a ? a[j, i] : "")
        }
        print s
    }
}
' f1 f2
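For example, with -v start=2 the same program also accumulates column 2, so for the files above the first two output rows become 1 0 3 and 2 6 10.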

use awk to match rows for each column

How can awk be used to find the values that match row 2 in each column?
I would like to take a tab-delimited file and, for each column, mark a field with "match" if its value below row 2 matches the value in row 2 of that column.
Transforming this tab-delimited file:
header1 | header2 | header3
1 | 1 | B
--------+---------+----------
3 | 1 | A
2 | A | B
1 | B | 1
To this:
header1 | header2 | header3
1 | 1 | B
--------+---------+----------
3 | 1 match | A
2 | A | B match
1 match | B | 1
I would go for something like this:
$ cat file
header1 header2 header3
1 1 B
3 1 A
2 A B
1 B 1
$ awk -v OFS='\t' 'NR == 2 { for (i=1; i<=NF; ++i) a[i] = $i }
NR > 2 { for(i=1;i<=NF;++i) if ($i == a[i]) $i = $i " match" }1' file
header1 header2 header3
1 1 B
3 1 match A
2 A B match
1 match B 1
On the second line, populate the array a with the contents of each field. On subsequent lines, add "match" when they match the corresponding value in the array. The 1 at the end is a common shorthand causing each line to be printed. Setting the output field separator OFS to a tab character preserves the format of the data.
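If the real input is strictly tab-delimited (and might contain empty fields), it is safer to fix the input separator as well; a sketch of the same approach:
awk -F'\t' -v OFS='\t' 'NR == 2 { for (i=1; i<=NF; ++i) a[i] = $i }
    NR > 2 { for (i=1; i<=NF; ++i) if ($i == a[i]) $i = $i " match" } 1' file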
Pedantically, with GNU Awk 4.1.1:
awk -f so.awk so.txt
header1 header2 header3
1 1 B
3 1* A
2 A B*
1* B 1
with so.awk:
{
    if (1 == NR) {
        print $0;
    } else if (2 == NR) {
        for (i = 1; i <= NF; i++) {
            answers[i] = $i;
        }
        print $0;
    } else {
        for (i = 1; i <= NF; i++) {
            field = $i;
            if (answers[i] == $i) {
                field = field "*"    # a match
            }
            printf("%s\t", field);
        }
        printf("%s", RS);
    }
}
and so.txt as a tab delimited data file:
header1 header2 header3
1 1 B
3 1 A
2 A B
1 B 1
This isn't homework, right...?

Calculating sum of gradients with awk

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate the gradient of a column (here B) versus A with a command like:
awk 'NR>1{print $0,($2-b)/($1-a)}{a=$1;b=$2}' file
How can I print the sum of the gradients of columns B to D as a 5th column in the file? The results should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.
awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f
NR == 1 {
    print $0, "sum"
    next
}
{
    sum = 0
    if (NR == 2) {
        for (i = 1; i <= NF; ++i)
            sum += $i
    } else {
        t = $1 - a[1]
        for (i = 2; i <= NF; ++i)
            sum += ($i - a[i]) / t
    }
    print $0, sum
    for (i = 1; i <= NF; ++i)
        a[i] = $i
}
Usage:
awk -f script.awk file
If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
    s = 0
    for (i = 2; i <= NF; ++i) s += ($i - a[i]) / ($1 - a[1])  # accumulate sum
    for (i = 1; i <= NF; ++i) a[i] = $i  # fill array to be used for next iteration
    print $0, s
}' file
You can pack it all onto one line if you want, but remember to separate the statements with semicolons. It's also slightly shorter to use a single for loop with an if, saving the denominator first so that a[1] can safely be overwritten inside the loop:
awk 'NR==1{print $0,"sum";next}{s=0;t=$1-a[1];for(i=1;i<=NF;++i){if(i>1)s+=($i-a[i])/t;a[i]=$i}print $0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Sort values and output the indices of their sorted columns

I've got a file that looks like:
20 30 40
80 70 60
50 30 40
Each column represents a procedure. I want to know how the procedures did for each row. My ideal output would be
3 2 1
1 2 3
1 3 2
i.e. in row 1, the third column had the highest value, followed by the second, then the first (the smallest); this order can be reversed, it doesn't matter.
How would I do this?
I'd do it with some other Unix tools (read, cat, sort, cut, tr, sed, and bash of course):
while read line
do
    cat -n <(echo "$line" | sed 's/ /\n/g') | sort -rn -k2 | cut -f1 | tr '\n' ' '
    echo
done < input.txt
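Each line is broken into one value per line, cat -n prefixes every value with its original column index, sort -rn -k2 orders the pairs by value in descending order, cut -f1 keeps only the indices, and tr joins them back onto a single line.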
The output looks like this:
3 2 1
1 2 3
1 3 2
Another solution using Python:
$ python
Python 2.7.6 (default, Jan 26 2014, 17:25:18)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> with open('file.txt') as f:
... lis=[x.split() for x in f]
...
>>> for each in lis:
... each = [i[0] + 1 for i in sorted(enumerate(each), key=lambda x:x[1], reverse=True)]
... print ' '.join([str(item) for item in each])
...
3 2 1
1 2 3
1 3 2
Using Gnu Awk version 4:
$ awk 'BEGIN{ PROCINFO["sorted_in"] = "#val_num_desc" }
{
    split($0, a, " ")
    for (i in a) printf "%s%s", i, OFS
    print ""
}' file
3 2 1
1 2 3
1 3 2
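Setting PROCINFO["sorted_in"] to "#val_num_desc" makes for (i in a) visit the array indices in descending numeric order of their values, so printing the indices directly yields the ranking (GNU awk 4.0 or later).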
If you have GNU awk then you can do something like:
awk '{
    y = a = x = j = i = 0
    delete tmp
    delete num
    delete ind
    for (i = 1; i <= NF; i++) {
        num[$i, i] = i
    }
    x = asorti(num)
    for (y = 1; y <= x; y++) {
        split(num[y], tmp, SUBSEP)
        ind[++j] = tmp[2]
    }
    for (a = x; a >= 1; a--) {
        printf "%s%s", ind[a], (a == 1 ? "\n" : " ")
    }
}' file
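Here num is indexed by the composite key $i SUBSEP i, asorti sorts those keys into ascending order, and split recovers the column position from each key; the last loop then prints the positions from the largest value down. Note that asorti compares the keys as strings, which is only safe while the values in a row are formatted to the same width.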
$ cat file
20 30 40
0.923913 0.913043 0.880435 0.858696 0.826087 0.902174 0.836957 0.880435
80 70 60
50 30 40
Running the same command on this file yields:
3 2 1
1 2 6 8 3 4 7 5
1 2 3
1 3 2
A solution via Perl:
#!/usr/bin/perl
open(FH, '<', '/home/chidori/input.txt') or die "Can't open file $!\n";
while (my $line = <FH>) {
    chomp($line);
    my @unsorted_array = split(/\s/, $line);
    my $count = scalar @unsorted_array;
    # map each value to its rank: 1 for the largest, $count for the smallest
    my @sorted_array = sort { $a <=> $b } @unsorted_array;
    my %hash = map { $_ => $count-- } @sorted_array;
    foreach my $value (@unsorted_array) {
        print "$hash{$value} ";
    }
    print "\n";
}
