Add lines with 0 for missing values in a datatable - bash

I have a dataset counting occurences of bins, for instance:
1 10
2 15
3 1
5 50
8 990
As you can see, I am missing bins in the first column. As I want to plot this data, I'm looking for a way to add those missing value, with a 0 on the second column, e.g. if I know my bins go up to 10:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
I'm looking for a unix/bash solution as it fits my pipeline and my files are rather big, but maybe R is more suited for this ?

EDIT: Thanks to karafaka sir, adding solutions which will capture very first line's digits too.
awk -v value=10 '$1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(prev<=value){print prev,"0";prev++}}}' Input_file
Let's say following is the Input_file:
cat Input_file
3 10
4 15
7 1
9 50
19 990
Then after running above code we will get following output.
1 0
2 0
3 10
4 15
5 0
6 0
7 1
8 0
9 50
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 990
Could you please try following.
awk -v value=10 'prev && $1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(prev<=value){print prev,"0";prev++}}}' Input_file
Adding a non-one liner form of solution too now.
awk -v value=10 '
prev && $1-prev>1{
while(++prev<$1){
print prev,"0"
}
}
{
prev=$1
print
}
END{
if(prev<value){
while(prev<=value){
print prev,"0"
prev++
}
}
}' Input_file

we can combine seq and awk to make the task easier:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' file <(seq 10)
You can do this as well:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$0}' f <(seq -f '%g 0' 10)
Test with your data:
kent$ cat f
1 10
2 15
3 1
5 50
8 990
kent$ awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' f <(seq 10)
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0

Using Bash and join:
$ join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file
Output:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0

another awk
$ awk -v mx=10 '{while(++k<$1) print k,0}1;
END {while(k++<mx) print k,0}' file
this will fill the first records if missing as well.

$ awk '{n[$1]=$2} END{for (i=1;i<=10;i++) print i,n[i]+0}' file
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0

Related

Delete lines with particular number of columns in linux [duplicate]

This question already has answers here:
Delete lines or rows in a tab-delimited file, by number of cells in that lines or rows
(2 answers)
Closed 1 year ago.
My file.fam looks like following containing around 22k lines. I want to delete rows containing less than 6 columns.
06S14031708 36125 0 0 2 2
06S14031716 38824 0 0 1 2
06S14031729 27949 0 0 2 2
06S14031742 30585 0 0 2 2
5 5 0 0 1 1
6 6 0 0 1
12 12 0 0 1 2
16 16 0 0 1 2
18_0004 21213 0 0 1 1
18_0006 35931 0 0 1 1
18_0008 31975 0 0 1 1
An awk version redirecting all lines w/ more than 5 "words" to a new file:
awk 'NF>=6' file.fam > file.fam.new
mv file.fam.new file.fam
Or a somewhat more unsightly variant using sed with inline replacement:
sed -i -r '/^\s*(\w+\s+){5}\w+\s*$/!d' file.fam

how to compare two column from same file?

I have long data file, file.txt
1 3
3 2
2 3
5 5
8 9
so out file should be, out.txt
1 3
1 2
1 5
1 9
3 3
3 2
3 5
Could you please try following.
awk '
FNR==NR{
a[++count]=$2
next
}
{
for(i=1;i<=count;i++){
print $1,a[i]
}
}
' Input_file Input_file

Find header value for first occurance of "1" instance in column

I have a matrix example:
1 3 5 8 10 12
50 1 1 1 1 1 1
100 0 0 1 1 1 1
150 0 0 1 1 1 1
200 0 0 0 1 1 1
250 0 0 0 0 1 1
300 0 0 0 0 1 1
350 0 0 0 0 0 1
For each row name (50, 100, 150, 200, etc.) I want to know what is the "header" value when the instance "1" first occurs. Based on the example the answer is:
50 1
100 5
150 5
200 8
250 10
300 10
350 12
I am not sure how to play with IFs and WHENs to get my answer from this format. R, Excel, bash, awk, all welcome as solutions.
You can do this using awk as following :
$ awk 'FNR==1{for(i=1; i<=NF; i++){a[i]=$i}; next} {for(i=2; i<=NF; i++){if($i=="1"){print $1, a[i-1]; break}}} ' file
50 1
100 5
150 5
200 8
250 10
300 10
350 12
Explanation :
For header i.e FNR==1 we are populating all values in the array a;
For all next lines we are checking which field equates to 1, if found print the col1 value i.e $1 and the corresponding value in the array a and break the loop.
Awk solution:
awk 'NR==1{ for(i=1;i<=NF;i++) h[i]=$i; next }
{
for(i=2;i<=NF;i++) { if($i==1) { n=h[i-1]; break } }
print $1,(n)?n:"None"; n=""
}' file

awk command to merge the content of the same file

I have an input file with the following content
1 1
2 1
3 289
4 1
5 2
0 Clear
1 Warning
2 Indeterminate
3 Minor
4 Major
5 Critical
I want to merge the first type of lines with the messages by the first column and obtain
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical
Just use awk:
awk '$1 in a { print $1, a[$1], $2; next } { a[$1] = $2 }' file
Output:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical
Using join/sed, sed creates different views of the file for each part and join joins on the common field:
join <(sed '/^[0-9]* [0-9]* *$/!d' input) <(sed '/^[0-9]* [0-9]* *$/d' input)
Gives:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical
You can do this with Awk:
awk 'BEGIN{n=0}NR>6{n=1}n==0{a[$1]=$2}n==1{print $1,a[$1],$2}' file
or another way:
awk 'NR<=5{a[$1]=$2}$2~/[a-zA-z]+/ && $1>0 && $1<=5{print $1,a[$1],$2}' file

Another split file in bash - based on difference between rows of column x

Hello stackoverflow users!
Generally I would like to tune up script I am using, just to make it more insensitive to missing data.
My example data looks like this (tab delimited csv file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }'
Which gives me output I want - omit 1st line, find 2nd column and set delimiter - in this example it will be '0':
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that rows with 0's are deleted) I would need to split file according to the subtracted value of rows 'n+1' and 'n' if this value is below 0 split file, so basically if (value_row_n+1)-value_row_n < 0 then split file. Of course I would need also to maintain the file naming. Preferred way is bash with awk use. Any advices? Thanks in advance!
Cheers!
Here is awk command that you can use:
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
next
}
!p || $2 < p {
f = sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{
p = $2;
print $0 > f
}' file
I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name just before starting the processing.
Second, in last condition save the value of current line to compare with the value of next line.
Third, instead the comparison with zero, do the substraction between previous value and current one to check if result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100

Resources