Awk, printing certain columns based on how rows of different files match - bash

I am pretty sure that awk is what I have to use.
I have one file containing the information I need. I want to take two pieces of information from it and use them to obtain two numbers from a second file.
So if the first file has m7 in its fifth column and 3 in its third column, I want to search the second file for a row that has 3 in its first column and m7 in its fourth column. Then I want to print certain columns from these files as listed below.
Given the following two files of input
file1
1 dog 3 8 m7 n15
50 cat 5 8 m15 m22
20 fish 6 3 n12 m7
file2
3 695 842 m7 word
5 847 881 m15 not
8 910 920 n15 important
8 695 842 m22 word
6 312 430 n12 not
I want to produce the output
pre3 695 842 21
pre5 847 881 50
pre6 312 430 20
pre8 910 920 1
pre8 695 842 50
EDIT:
I need to also produce output of the form
pre3 695 842 pre8 910 920 1
pre5 847 881 pre8 695 842 50
pre6 312 430 pre3 695 842 20
The answer below works for the original question, but I'm confused by some of its syntax, so I'm not sure how to adjust it to produce this output.

This command:
awk 'NR==FNR{ar[$5,$3]=$1+ar[$5,$3]; ar[$6,$4]=$1+ar[$6,$4]}
NR>FNR && ar[$4,$1] {print "pre"$1,$2,$3,ar[$4,$1]}' file1 file2
prints pre followed by the second file's first column, then the second file's second and third columns, then the first file's first column, for every line of the second file whose fourth and first columns match the fifth and third (or sixth and fourth) columns of a line in the first file:
pre3 695 842 21
pre5 847 881 50
pre8 910 920 1
pre8 695 842 50
pre6 312 430 20
(for lines with more than one match the values of ar[$4,$1] are summed up)
Note that the output follows the order of the lines in file2; it is not sorted. To sort it, pipe the result through sort:
awk 'NR==FNR{ar[$5,$3]=$1+ar[$5,$3]; ar[$6,$4]=$1+ar[$6,$4]}
NR>FNR && ar[$4,$1]{print "pre"$1,$2,$3,ar[$4,$1]}' file1 file2 | sort
What does the code do?
NR==FNR{...} works on the first input file only
NR>FNR{...} works on the 2nd, 3rd,... input file
ar[$5,$3] creates an array element whose key is the content of the 5th and 3rd columns of the current line / record, joined by awk's subscript separator SUBSEP (by default the character "\034", not the field separator)
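For the second output form asked for in the EDIT, one possible approach is to read file2 first, index each of its rows by the (fourth column, first column) pair, and then look up both pairs for every line of file1. The following is only a minimal sketch and not part of the answer above: the function name print_pairs and the array name m are made up for illustration, and it assumes each pair occurs at most once in file2.

```shell
# print_pairs FILE1 FILE2   (hypothetical helper name)
print_pairs() {
  awk '
    # First file on the command line (file2): remember "pre<col1> col2 col3"
    # under the key (col4, col1).
    NR == FNR { m[$4, $1] = "pre" $1 " " $2 " " $3; next }
    # Second file on the command line (file1): print only when both of its
    # pairs were seen in file2.
    (($5, $3) in m) && (($6, $4) in m) { print m[$5, $3], m[$6, $4], $1 }
  ' "$2" "$1"
}
```

Called as print_pairs file1 file2, it prints one line per matching row of file1, in file1's order.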

You could use the command below:
awk 'NR==FNR {a[$3 FS $5]=1;next } a[$1 FS $4]' f1.txt f2.txt
If you want to print only specific fields from the matching lines in the second file, use:
awk 'NR==FNR {a[$3 FS $5]=1;next } a[$1 FS $4] { print "pre"$1" "$2" "$3}' f1.txt f2.txt

Related

Apply multiple substract commands between two columns in text file in bash

I would like to subtract two pairs of columns in a tab-delimited text file and add the results as two new columns, in bash using awk.
I would like to subtract column 1 (h1) from column 3 (h3) and name the new column "count1".
I would like to subtract column 2 (h2) from column 4 (h4) and name the new column "count2".
I don't want to build a new text file, but edit the old one.
My text file:
h1 h2 h3 h4 h5
343 100 856 216 536
283 96 858 220 539
346 111 858 220 539
283 89 860 220 540
280 89 862 220 541
76 32 860 220 540
352 105 856 220 538
57 16 860 220 540
144 31 858 220 539
222 63 860 220 540
305 81 858 220 539
My command at the moment looks like this:
awk '{$6 = $3 - $1}1' file.txt
awk '{$6 = $4 - $2}1' file.txt
But I don't know how to rename the new added columns and maybe there is a smarter move to run both commands in the same awk command?
Pretty simple in awk. Use NR==1 to modify the first line.
awk -F '\t' -v OFS='\t' '
NR==1 {print $0,"count1","count2"}
NR!=1 {print $0,$3-$1,$4-$2}' file.txt > tmp && mv tmp file.txt

Matching Column numbers from two different txt file

I have two text files which are a different size. The first one below example1.txt has only one column of numbers:
101
102
103
104
111
120
120
125
131
131
131
131
131
131
And the Second text file example2.txt has two columns:
101 3
102 3
103 3
104 4
104 4
111 5
120 1
120 1
125 2
126 2
127 2
128 2
129 2
130 2
130 2
130 2
131 10
131 10
131 10
131 10
131 10
131 10
132 10
The first column in the example1.txt is a subset of column one in example2.txt. The second column numbers in example2.txt are the associated values with the first column.
What I want to do is to get the associated second column of example1.txt following the example2.txt. I have tried but couldn't figure it out yet. Any suggestions or solutions in bash, awk would be appreciated
Therefore the result would be:
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
UPDATE:
I have been trying to do the column matching like :
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' example1.txt example2.txt > output.txt
In both files, the first column is in ascending order, but the same number may not occur the same number of times in each. For example, 104 appears once in example1.txt but twice in example2.txt. The important thing is that each number in example1.txt gets the same associated second-column value it has in example2.txt. See the expected output.
$ awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' f1 f2
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
This solution doesn't make use of the fact that the first column is in ascending order. Perhaps some optimization can be done based on that.
($1 in a) && b[$1]++ < a[$1] is the main difference from your solution. This checks if the field exists as well as that the count doesn't exceed that of the first file.
Also, not sure why you set the field separator as | because there is no such character in the sample given.
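The same one-liner can be spelled out with comments, which may make the counting logic easier to follow. This is just a sketch of the command above with renamed arrays; the function name match_counts and the array names want and used are made up for illustration.

```shell
# match_counts LIST TABLE   (hypothetical helper name)
match_counts() {
  awk '
    # First file: count how many times each key is wanted.
    NR == FNR { want[$1]++; next }
    # Second file: print a line while the quota for its key is not used up.
    ($1 in want) && used[$1]++ < want[$1]
  ' "$1" "$2"
}
```

Called as match_counts example1.txt example2.txt, it should behave like the one-liner: a key that appears once in the first file is printed at most once, even if the second file lists it twice.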

Difference of column value received from operation done on 2 different files

I have following 2 files file1.txt and file2.txt with the data as given below-
Data in file1.txt
125
125
295
295
355
355
355
Data in file2.txt
125
125
295
355
I did below operation over the files and got following output-
Operation1-
sort file1.txt | uniq -c
2 125
2 295
3 355
Operation2-
sort file2.txt | uniq -c
2 125
1 295
1 355
Now, I want following output using the result of Operation1 and Operation2 -
I want to compare the result of Operation1 and Operation2 and get the output which will show the difference of values from column 1 of both the files, and it will show the column 2 as it is as given below-
0 125
1 295
2 355
Redirect the output of Operation1 and Operation2 into two files, say file1 and file2. Provided both files list the same values in the same order, run:
paste file1 file2 | awk '{print $1-$3,$2}'
you will have output
0 125
1 295
2 355
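If the two files might not contain exactly the same set of values, the counting and the subtraction can be done in one awk pass over the raw files instead of pasting the uniq -c results side by side. This is a minimal sketch under that assumption; the function name count_diff and the array name n are made up for illustration.

```shell
# count_diff FILE_A FILE_B   (hypothetical helper name)
count_diff() {
  awk '
    NR == FNR { n[$1]++; next }          # first file: count each value
    { n[$1]-- }                          # second file: subtract its occurrences
    END { for (k in n) print n[k], k }   # difference, then the value
  ' "$1" "$2" | sort -k2,2n
}
```

Called as count_diff file1.txt file2.txt on the original data files, it prints the difference followed by the value; a value that occurs only in the second file shows up with a negative difference.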

adding columns for specified rows & dividing by the number of rows using awk

So I'm really new to using linux and script commands, help would really be appreciated!
I have a file of 1050 rows and 8 columns. Example:
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
hand 8 9 0 101 32 21 somesentenceofwords
hand 9 0 2 101 23 12 somesentenceofwords
What I want to do is if the first column is the same for x number of rows then output the sum of the 6th column for those rows and divide it by the number of rows (an average essentially).
So in the example, since the first 4 rows are all anger, I want the average of column 6 over all rows with anger in column 1. It would compute (13 + 23 + 35 + 23) / 4. It would then do the same for arch, then hand and so on.
Example output:
anger 23.5
arch 34
hand 27.5
I tried this just to see if I can do it individually where each column would equal a specific letter string but couldn't even get that to work.
$ awk '{if($1="anger"){sum+=$6} {print sum}}' filename
Is this possible?
Using awk:
awk '!($1 in s){b[++i]=$1; s[$1]=0} {c[$1]++; s[$1]+=$6}
END{for (k=1; k<=i; k++) printf "%s %.1f\n", b[k], s[b[k]]/c[b[k]]}' file
anger 23.5
arch 34.0
hand 27.5
Pretty straightforward with awk:
$ awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file
hand 27.5
arch 34
anger 23.5
How does this work?
The block {a[$1]+=$6;b[$1]++} is executed for every line that is read. We create two maps, one storing the sum, for each key, and one storing the count for each key.
The block END{for (i in a) print i,a[i]/b[i]} is executed after all lines are read. We iterate over the keys of the first map, and print the key, and the division of the sum over the count (i.e. the mean).

Unix sort using unknown delimiter (last column)

My data looks like this:
Adelaide Crows 5 2 3 0 450 455 460.67 8
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
As you can tell, there is a mixture of spaces and tabs. I need to be able to sort by the last column, descending, so that the output looks like this:
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
Adelaide Crows 5 2 3 0 450 455 460.67 8
The whole data consists of all AFL teams.
Using sort, how can I achieve this? Am I right in attempting to use the $ character to start from the end of the line? I also need to sort by the second-last column after sorting by the last column, so that rows with the same number in the last column are ordered by the second-last column. Code so far:
sort -n -t$'\t' -k 9,9 -k 8,8 tmp
How do I take into account that the football team names will count as whitespace?
Here is the file being sorted (filename: 'tmp') sample data
You could copy the last field into the first position using awk, then sort by the first field, then get rid of the first field using cut.
awk '{print($NF" "$0)}' sample.txt | sort -k1,1 -n -r -t' ' | cut -f2- -d' '
Port Adelaide 5 5 0 0 573 386 916.05 20
Essendon 5 5 0 0 622 352 955.88 20
Sydney Swans 5 4 1 0 533 428 681.68 16
Hawthorn 5 4 1 0 596 453 620.64 16
Richmond 5 3 2 0 499 445 579.68 12
..
..
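The command above sorts by the last column only. To also break ties on the second-last column, the same trick can be extended by copying both fields to the front. This is a minimal sketch, assuming ties should also be sorted in descending order and that the columns are separated by spaces or tabs; the function name sort_by_last_two is made up for illustration.

```shell
# sort_by_last_two FILE   (hypothetical helper name)
sort_by_last_two() {
  # Prefix each line with its last and second-last fields, sort numerically
  # (descending) on those two keys, then cut the two helper fields off again.
  awk '{ print $NF, $(NF-1), $0 }' "$1" |
    LC_ALL=C sort -t ' ' -k1,1nr -k2,2nr |
    cut -d ' ' -f3-
}
```

Because awk prints $0 unchanged after the two helper fields, the original mixture of spaces and tabs in each line is preserved in the output.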
