awk command to merge the content of the same file - bash

I have an input file with the following content
1 1
2 1
3 289
4 1
5 2
0 Clear
1 Warning
2 Indeterminate
3 Minor
4 Major
5 Critical
I want to join the first block of lines with the message lines on the first column and obtain
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical

Just use awk:
awk '$1 in a { print $1, a[$1], $2; next } { a[$1] = $2 }' file
Output:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical
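The same logic written out with comments (an equivalent sketch of the one-liner above):
awk '
$1 in a { print $1, a[$1], $2   # key already seen: this is a label line, merge and print
          next }
{ a[$1] = $2 }                  # otherwise remember the value for this key
' file
Because the numeric block comes first, the array is fully populated before the label lines arrive.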

Using join and sed: sed creates a separate view of the file for each part, and join joins them on the common first field:
join <(sed '/^[0-9]* [0-9]* *$/!d' input) <(sed '/^[0-9]* [0-9]* *$/d' input)
Gives:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical
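For reference, the two process substitutions expand to these views of the sample input (the first sed keeps only the purely numeric lines, the second deletes them):
1 1
2 1
3 289
4 1
5 2
and
0 Clear
1 Warning
2 Indeterminate
3 Minor
4 Major
5 Critical
Both views are already sorted on the join field, and join silently drops unpairable lines such as "0 Clear".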

You can do this with Awk:
awk 'BEGIN{n=0}NR>6{n=1}n==0{a[$1]=$2}n==1{print $1,a[$1],$2}' file
or another way:
awk 'NR<=5{a[$1]=$2}$2~/[a-zA-Z]+/ && $1>0 && $1<=5{print $1,a[$1],$2}' file
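Both of the above hardcode where the first block ends (NR>6, NR<=5). A variant that instead keys off whether the second field is numeric (a sketch, assuming the labels never consist solely of digits):
awk '$2 ~ /^[0-9]+$/ { a[$1]=$2; next } $1 in a { print $1, a[$1], $2 }' file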

AWK assign upper value for rank assignment during tie

I'm working on rank assignment to a list of values that is sorted in a file.
A miniature example is
Input:
1
2
2
2
3
4
Instead of normal ranking when there is a tie, I need to assign the upper value. So the required output is
1 1
2 4 #Note that it is not 2; since we have three 2's, the upper bound is 4
2 4
2 4
3 5
4 6
I tried something like below, but it is not consistent.
$ awk ' BEGIN{t=0} NR==FNR { a[$1]++; next } { print $1,a[$1]+t; t=a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 6
2 6
3 4
4 2
This answer does normal ranking, so this question is not a duplicate.
Instead of doing a double pass or keeping everything in memory, we just use uniq -c and reconstruct the lines:
uniq -c file | awk '{n=n+$1;for(i=1;i<=$1;++i) print $2,n}' -
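For reference, the intermediate uniq -c output on the sample looks like this (count, then value; exact padding varies by implementation):
      1 1
      3 2
      1 3
      1 4
The awk keeps a running total n of the counts, which is exactly the upper rank, and reprints each value count times. (The trailing - is redundant; awk reads stdin by default.)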
Two passes with just awk:
$ awk 'NR==FNR{rank[$1]=NR; next} {print $1, rank[$1]}' file file
1 1
2 4
2 4
2 4
3 5
4 6
or one pass with a pipe:
$ nl file | sort -k2,2 -k1,1nr | awk '$2!=prev{rank=$1; prev=$2} {print $2, rank}'
1 1
2 4
2 4
2 4
3 5
4 6
If you don't have nl on your system you could use cat -n or awk '{print NR, $0}' to generate the line numbers.
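For reference, the intermediate stream on the sample looks like this; sorting by value and then by line number in reverse means the first line of each value group carries the group's highest original line number, which becomes the rank:
$ nl file | sort -k2,2 -k1,1nr
     1  1
     4  2
     3  2
     2  2
     5  3
     6  4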
Try this awk (the first pass counts frequencies; the second pass adds each new value's frequency to a running total, which is exactly the upper rank):
awk 'FNR==NR {++fq[$1]; next} p != $1{s+=fq[$1]} {print p=$1, s}' file file
1 1
2 4
2 4
2 4
3 5
4 6
Assumptions:
input data is already sorted
Sample data:
$ cat rank.dat
1
2
2
2
3
4
One awk idea requiring a single pass through the file:
awk '
function print_rank() {
for ( i=1 ; i<=cnt ; i++ )
print id,rank
}
$1 != id { print_rank() # if we have a new id, print last id
cnt=0 # reset counter
}
{ id=$1 # keep track of current id
rank++ # increment rank by 1 for each new row processed
cnt++ # keep track of number of times we see this id
}
END { print_rank() } # flush last id to stdout
' rank.dat
This generates:
1 1
2 4
2 4
2 4
3 5
4 6
Another awk:
$ awk ' NR==FNR { a[$1]++; next } { print $1, FNR + --a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 4
2 4
3 5
4 6
$
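The trick: the first pass counts occurrences, and on the second pass FNR + --a[$1] adds the number of still-remaining duplicates to the current line number, so every member of a tie group prints the group's last (highest) line number. For the three 2's on lines 2-4: 2+2, 3+1 and 4+0 all give 4.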

awk: print first column, then some values, and then all other columns

I want to print the first column, then a couple of columns with fixed values, like this command would do:
awk '{print $1,"1","2","1"}'
and then print all columns except the first after that...
I know this command prints all but the first column:
awk '{$1=""; print $0}'
But that gets rid of the first column.
In other words, this:
3 5 2 2
3 5 2 2
3 5 2 2
3 5 2 2
Needs to become this:
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
Any ideas?
Use a loop to iterate through the rest of the columns, like this:
awk '{print $1,"1","2","1";for(i=2;i<=NF;i++) print $i}'
As an example:
$ echo "3 5 2 2" | awk 'BEGIN{ORS=""}{print $1,"1","2","1";for(i=2;i<=NF;i++) print " "$i}'
3 1 2 1 5 2 2
$
Edit1:
$ echo "3 5 2 2" | awk 'BEGIN{ORS="\n";OFS="\n"}{print $1,"1","2","1 ";for(i=2;i<=NF;i++) print $i" "}'
3
1
2
1
5
2
2
$
Edit2:
$ echo "3 5 2 2" | awk '{print $1,"1","2","1";for(i=2;i<=NF;i++) print $i}'
3 1 2 1
5
2
2
$
Edit3:
$ echo "3 5 2 2
3 5 2 2
3 5 2 2
3 5 2 2" | awk '{printf("%s %s ", $1,"1 2 1");for(i=2;i<=NF;i++) printf("%s ", $i); printf "\n"}'
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
You are almost there, you just need to store the first column in a temporary variable:
{
    head=$1;   # Store $1 in head, used later in printf
    $1="";     # Empty $1; $0 is rebuilt and now starts with a leading OFS
    printf "%s 1 2 1%s\n", head, $0   # no space before the second %s: $0 already has one
}
And a full script:
echo "3 5 2 2" | awk '{head=$1;$1="";printf "%s 1 2 1%s\n", head, $0}'
Another solution with awk:
awk '{sub(/.*/, "1 2 1 "$2, $2)}1' File
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
Substitute the 2nd field with "1 2 1 " followed by the 2nd field itself.
You can do this using sed by replacing the first space by the string you want.
sed 's/ / 1 2 1 /' file
(OR)
With awk by replacing the first field($1):
awk '{$1=$1 " 1 2 1"}1' file
(I prefer the sed solution since it has fewer characters.)

Another split file in bash - based on difference between rows of column x

Hello stackoverflow users!
Generally I would like to tune up a script I am using, just to make it less sensitive to missing data.
My example data looks like this (tab delimited csv file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use an awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }'
This gives me the output I want: skip the first (header) line, take the 2nd-column value of the first data row as the delimiter (in this example '0'), and start a new file each time it reappears:
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that the rows with 0's are deleted), I would need to split the file based on the difference between the values of column 2 in rows n+1 and n: if value(row n+1) - value(row n) < 0, start a new file. Of course I would also need to maintain the file naming. The preferred way is bash with awk. Any advice? Thanks in advance!
Cheers!
Here is an awk command you can use:
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
    next                                    # skip the header line
}
!p || $2 < p {                              # no previous value yet, or column 2 decreased
    f = sprintf("file_no%02d.txt", fn++);
    print "Creating " f
}
{
    p = $2;                                 # remember column 2 for the next row
    print $0 > f
}' file
I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name just before processing starts.
Second, in the last action, save the current line's value so it can be compared against the next line.
Third, instead of comparing with zero, subtract the previous value from the current one and check whether the result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100

Comparing a few columns of a file with columns of another file

I have two data files 1.txt and 2.txt
1.txt contains valid lines.
For example.
1 2 1 2
1 3 1 3
In 2.txt I have an extra column, but if you ignore that, I have a few valid lines and a few invalid lines. There could be multiple occurrences of the same line in 2.txt.
For example:
1 2 1 2 1.9
1 3 1 3 3.4
1 3 1 3 3.4
2 3 2 3 5.6
2 3 2 3 5.6
The second and third lines are the same and valid.
The fourth and fifth lines are also the same but invalid.
I want to write a shell script which compares these two files and outputs two files, valid.txt and invalid.txt which look like these...
valid.txt :
1 2 1 2 1
1 3 1 3 2
and invalid.txt :
2 3 2 3 2
The last extra column of valid.txt and invalid.txt contains the number of times the line has been repeated in 2.txt.
This awk script works for the example data:
awk 'NR==FNR{sub(/ *$/,"");a[$0]++;next}   # 1st file: trim trailing blanks, count each line
{sub(/ [^ ]*$/,"")                         # 2nd file: drop the last field
 if($0 in a)
     v[$0]++                               # seen in file1: valid
 else
     n[$0]++                               # otherwise: invalid
}
END{
    for(x in v) print x,v[x] > "valid.txt"
    for(x in n) print x,n[x] > "inv.txt"
}' file1 file2
output:
kent$ head inv.txt valid.txt
==> inv.txt <==
2 3 2 3 2
==> valid.txt <==
1 3 1 3 2
1 2 1 2 1
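The for (x in ...) traversal order is unspecified in awk, which is why valid.txt above is not in input order. With GNU awk you can force a sorted traversal in the END block (a gawk-only sketch):
END{
    PROCINFO["sorted_in"]="@ind_str_asc"   # gawk extension: iterate indices in string order
    for(x in v)print x,v[x] > "valid.txt"
    for(x in n)print x,n[x] > "inv.txt"
}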

Help with duplicating rows based on a field using awk

I have the following data set where the 3rd field consists of 0's and 1's
Input
1 2 1
2 4 0
3 3 1
4 1 1
5 0 0
I wish to expand the data set to the following format
Duplicate each row based on the 2nd field and
Replace only the "new" 1's (obtained after duplication) in the 3rd field by 0
How can I do this with AWK?
Thanks
Output
1 2 1
1 2 0
2 4 0
2 4 0
2 4 0
2 4 0
3 3 1
3 3 0
3 3 0
4 1 1
awk '{print; $3=0; for (i=1; i<$2; i++) print}' inputfile
If you want to actually skip records with a zero in the second field (as your example seems to show):
awk '{if ($2>0) print; $3=0; for (i=1; i<$2; i++) print}' inputfile
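Written out with comments (equivalent to the second one-liner):
awk '{
    if ($2 > 0) print          # print the original row, keeping its 0/1 flag
    $3 = 0                     # any additional copies get 0 in the 3rd field
    for (i = 1; i < $2; i++)   # emit $2 - 1 further copies
        print
}' inputfile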
