I have a dataset counting occurences of bins, for instance:
1 10
2 15
3 1
5 50
8 990
As you can see, I am missing bins in the first column. As I want to plot this data, I'm looking for a way to add those missing value, with a 0 on the second column, e.g. if I know my bins go up to 10:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
I'm looking for a unix/bash solution as it fits my pipeline and my files are rather big, but maybe R is more suited for this ?
EDIT: Thanks to karafaka sir, adding solutions which will capture very first line's digits too.
awk -v value=10 '$1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(prev<=value){print prev,"0";prev++}}}' Input_file
Let's say following is the Input_file:
cat Input_file
3 10
4 15
7 1
9 50
19 990
Then after running above code we will get following output.
1 0
2 0
3 10
4 15
5 0
6 0
7 1
8 0
9 50
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 990
Could you please try following.
awk -v value=10 'prev && $1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(prev<=value){print prev,"0";prev++}}}' Input_file
Adding a non-one liner form of solution too now.
awk -v value=10 '
prev && $1-prev>1{
while(++prev<$1){
print prev,"0"
}
}
{
prev=$1
print
}
END{
if(prev<value){
while(prev<=value){
print prev,"0"
prev++
}
}
}' Input_file
we can combine seq and awk to make the task easier:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' file <(seq 10)
You can do this as well:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$0}' f <(seq -f '%g 0' 10)
Test with your data:
kent$ cat f
1 10
2 15
3 1
5 50
8 990
kent$ awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' f <(seq 10)
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
Using Bash and join:
$ join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file
Output:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
another awk
$ awk -v mx=10 '{while(++k<$1) print k,0}1;
END {while(k++<mx) print k,0}' file
this will fill the first records if missing as well.
$ awk '{n[$1]=$2} END{for (i=1;i<=10;i++) print i,n[i]+0}' file
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
Using the columns 4 and 2, will create a report like the output file showed below. My code works fine but I believe it can be done more shorted :).
I have a doubt in the part of the split.
CNTLM = split ("20,30,40,60", LMT
It works but will be better to have exactly the values "10,20,30,40" as values in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
here is the code I use to get the output desired.
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
Output file I got
Code 10 20 30 40 Total
----------------------------------------------------
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
----------------------------------------------------
Total 9 6 4 5 = 24
----------------------------------------------------
PCT 37.5 25.0 16.7 20.8
----------------------------------------------------
Appreciate if code can be improved.
without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
END{for(r in k1)
{printf "%5s", r;
for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", OFS a[r,c]+0}
printf " =%7d\n", k1[r]};
printf "%5s", "Total";
for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
printf " =%7d", sum}' file | sort -nr
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
Total 9 6 4 5 = 24
I have multiple count files that look like this:
File1.tab
6 10 0
49 0 53
15 0 15
0 0 0
0 0 0
0 0 0
Other file:
File2.tab
3 1 2
29 0 29
4 0 4
0 0 0
0 0 0
0 0 0
I have over 30 files and I want to combine the second column of each file into one big file.
I know this question have already been asked and I found a similar here How to combine column from multiple text files?
I used the answer from previous question for my problem:
paste *.tab | awk '{i=2;while($i); {printf("%d ",$i);i+=3}printf("\n")}'
The problem is that zero values are not printed, I get something like this:
10 1
and I want something like this:
10 1
0 0
0 0
0 0
0 0
0 0
I cheked the printf format specifiers, but none works. How can I solve this problem?
You picked a bad "answer" to build on. Try this:
paste *.tab |
awk '{for (i=2; i<=NF; i+=3) printf "%s%s", (i>2?OFS:""), $i; print ""}'
I have a file with multiple columns (greater than 1000). Each column has numbers 0, 1 or some other. The tab delimited file looks like :
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
I want to calculate the occurrence of each unique digit for each column in the file. How do I do that using AWK or shell ?
P.S To calculate the occurrence of each unique digit in first column, i used AWK code :
awk '{h[$1]++}; END {for (k in h) print k, h[k]}' file > output-file
It gives the results as :
0 2
1 3
which means 0 occurs twice in column 1 and 1 occurs thrice in column 1.
I want to do the same for a file having over 1000 columns.
You just need to make the keys for associative array h contain both column number, i, and column value, $i:
$ awk '{for (i=1;i<=NF;i++) h[i" "$i]++}; END {for (k in h) print k, h[k]}' file | sort -n
1 0 2
1 1 3
2 0 3
2 1 1
2 2 1
3 0 5
The last line above indicates that column 3 has the value 0 occurring 5 times.
In more detail:
for (i=1;i<=NF;i++) h[i" "$i]++
This loops over all columns from the first, i-=1, to the last, i=NF. For each column, it updates the counter h for that column and its value.
END {for (k in h) print k, h[k]}
This prints a table of the output.
sort -n
Because for (k in h) does not produce keys in any particular order, we put the output through sort.
With awk 4.0 2D arrays
sample input matrix of n=3 columns containing integer values
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
4 0 0
7 -1 -2
output is vector of data values in column 0 that occur in input followed by matrix of n=3 columns with count of each data value in corresponding column of input matrix
-1 0 1 0
-2 0 0 1
0 2 4 6
1 3 1 0
2 0 1 0
4 1 0 0
7 1 0 0
code
awk '
NR==1 {ncols=NF}
{for(i=1; i <=NF; ++i) ++c[$i][i-1]}
END{
for(i in c) {
printf("%d ", i)
for(j=0; j < ncols; ++j) {
printf("%d ", j in c[i]?c[i][j]: 0)
}
printf("\n")
}
}
'
Hello stackoverflow users!
Generally I would like to tune up script I am using, just to make it more insensitive to missing data.
My example data looks like this (tab delimited csv file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }'
Which gives me output I want - omit 1st line, find 2nd column and set delimiter - in this example it will be '0':
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that rows with 0's are deleted) I would need to split file according to the subtracted value of rows 'n+1' and 'n' if this value is below 0 split file, so basically if (value_row_n+1)-value_row_n < 0 then split file. Of course I would need also to maintain the file naming. Preferred way is bash with awk use. Any advices? Thanks in advance!
Cheers!
Here is awk command that you can use:
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
next
}
!p || $2 < p {
f = sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{
p = $2;
print $0 > f
}' file
I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name just before starting the processing.
Second, in last condition save the value of current line to compare with the value of next line.
Third, instead the comparison with zero, do the substraction between previous value and current one to check if result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100