Delete lines with particular number of columns in linux [duplicate] - bash

My file.fam looks like the following and contains around 22k lines. I want to delete rows containing fewer than 6 columns.
06S14031708 36125 0 0 2 2
06S14031716 38824 0 0 1 2
06S14031729 27949 0 0 2 2
06S14031742 30585 0 0 2 2
5 5 0 0 1 1
6 6 0 0 1
12 12 0 0 1 2
16 16 0 0 1 2
18_0004 21213 0 0 1 1
18_0006 35931 0 0 1 1
18_0008 31975 0 0 1 1

An awk version that writes all lines with more than 5 "words" (fields) to a new file, which then replaces the original:
awk 'NF>=6' file.fam > file.fam.new
mv file.fam.new file.fam
Or a somewhat more unsightly variant using GNU sed with in-place replacement (note that it keeps only lines with exactly 6 fields made up of word characters):
sed -i -r '/^\s*(\w+\s+){5}\w+\s*$/!d' file.fam
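Before deleting anything from a 22k-line file, it can help to see how many lines have each field count. The following is a minimal sketch on a few sample rows; the temporary file and the variable names `counts` and `kept` are illustrative assumptions, not part of the original answer.

```shell
# Build a small sample resembling file.fam (hypothetical temp file).
tmp=$(mktemp)
printf '%s\n' \
  '06S14031708 36125 0 0 2 2' \
  '5 5 0 0 1 1' \
  '6 6 0 0 1' \
  '12 12 0 0 1 2' > "$tmp"

# Tally lines per field count; each output line is "<fields> <lines>".
counts=$(awk '{c[NF]++} END{for (n in c) print n, c[n]}' "$tmp" | sort -n)
echo "$counts"

# Keep only the lines that have at least 6 fields.
kept=$(awk 'NF>=6' "$tmp")
echo "$kept"

rm -f "$tmp"
```

If the tally shows only field counts of 5 and 6, the `NF>=6` filter and an `NF==6` filter behave the same on that file.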

Related

Common lines from 2 files based on 2 columns per file

I have two file:
file1:
1 imm_1_898835 0 908972 0 A
1 vh_1_1108138 0 1118275 T C
1 vh_1_1110294 0 1120431 A G
1 rs9729550 0 1135242 C A
file2:
1 exm1916089 0 865545 0 0
1 exm44 0 865584 0 G
1 exm46 0 865625 0 G
1 exm47 0 865628 A G
1 exm51 0 908972 0 G
1 exmF 0 1120431 C A
I want to obtain a file that is the overlap between file1 and file2 based on columns 1 and 4. For each match I would print the common values of columns 1 and 4, followed by column 2 from file1 and column 2 from file2.
For example, I want:
1 908972 imm_1_898835 exm51
1 1120431 vh_1_1110294 exmF
Could you please try the following.
awk 'FNR==NR{a[$1,$4]=$2;next} (($1,$4) in a){print $1,$4,a[$1,$4],$2}' file1 file2
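The one-liner relies on the common two-file awk idiom: `FNR==NR` is true only while the first file is being read, so that pass builds a lookup table and the second pass queries it. Here is a commented sketch of the same logic run on the sample data from the question; the temporary directory and the array name `id` are illustrative choices.

```shell
dir=$(mktemp -d)
printf '%s\n' \
  '1 imm_1_898835 0 908972 0 A' \
  '1 vh_1_1108138 0 1118275 T C' \
  '1 vh_1_1110294 0 1120431 A G' \
  '1 rs9729550 0 1135242 C A' > "$dir/file1"
printf '%s\n' \
  '1 exm1916089 0 865545 0 0' \
  '1 exm44 0 865584 0 G' \
  '1 exm46 0 865625 0 G' \
  '1 exm47 0 865628 A G' \
  '1 exm51 0 908972 0 G' \
  '1 exmF 0 1120431 C A' > "$dir/file2"

common=$(awk '
  # First file: remember column 2, keyed by columns 1 and 4.
  FNR == NR { id[$1, $4] = $2; next }
  # Second file: print only rows whose key was seen in the first file.
  ($1, $4) in id { print $1, $4, id[$1, $4], $2 }
' "$dir/file1" "$dir/file2")
echo "$common"

rm -rf "$dir"
```

The output follows the row order of the second file. If a key occurs more than once in file1, only the last column-2 value is kept.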

Add lines with 0 for missing values in a datatable

I have a dataset counting occurrences of bins, for instance:
1 10
2 15
3 1
5 50
8 990
As you can see, some bins are missing from the first column. As I want to plot this data, I'm looking for a way to add those missing values with a 0 in the second column, e.g. if I know my bins go up to 10:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
I'm looking for a unix/bash solution, as it fits my pipeline and my files are rather big, but maybe R is better suited for this?
EDIT: Thanks to karafaka, here is a solution that also fills in bins missing before the very first line.
awk -v value=10 '$1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{while(++prev<=value){print prev,"0"}}' Input_file
Let's say the following is the Input_file:
cat Input_file
3 10
4 15
7 1
9 50
19 990
Then after running the above code we will get the following output.
1 0
2 0
3 10
4 15
5 0
6 0
7 1
8 0
9 50
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 990
Could you please try the following. This version does not fill in bins missing before the first line.
awk -v value=10 'prev && $1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{while(++prev<=value){print prev,"0"}}' Input_file
Here is the same solution in multi-line form.
awk -v value=10 '
prev && $1-prev>1{
  while(++prev<$1){
    print prev,"0"
  }
}
{
  prev=$1
  print
}
END{
  while(++prev<=value){
    print prev,"0"
  }
}' Input_file
We can combine seq and awk to make the task easier:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' file <(seq 10)
You can do this as well:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$0}' f <(seq -f '%g 0' 10)
Test with your data:
kent$ cat f
1 10
2 15
3 1
5 50
8 990
kent$ awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' f <(seq 10)
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
Using Bash and join:
$ join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file
Output:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
Another awk:
$ awk -v mx=10 '{while(++k<$1) print k,0}1;
END {while(k++<mx) print k,0}' file
This will fill in the first records if they are missing as well.
$ awk '{n[$1]=$2} END{for (i=1;i<=10;i++) print i,n[i]+0}' file
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
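The array-based approach can be wrapped in a small shell function so that the highest bin is an argument rather than hard-coded. This is only a sketch: `fill_bins` is a hypothetical helper name, and it assumes bins are positive integers starting at 1, as in the question.

```shell
# fill_bins MAX FILE: print every bin from 1 to MAX, using 0 for missing bins.
# (Hypothetical helper; assumes integer bins in column 1, counts in column 2.)
fill_bins() {
  awk -v max="$1" '
    { count[$1] = $2 }                      # remember the count for each bin seen
    END {
      for (i = 1; i <= max; i++)
        print i, count[i] + 0               # "+ 0" turns an unset entry into 0
    }
  ' "$2"
}

tmp=$(mktemp)
printf '%s\n' '1 10' '2 15' '3 1' '5 50' '8 990' > "$tmp"
filled=$(fill_bins 10 "$tmp")
echo "$filled"
rm -f "$tmp"
```

Because the whole table is held in an array until END, this variant does not need sorted input, at the cost of keeping one entry per bin in memory.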

Combine count files into one file and keep zero values

I have multiple count files that look like this:
File1.tab
6 10 0
49 0 53
15 0 15
0 0 0
0 0 0
0 0 0
Other file:
File2.tab
3 1 2
29 0 29
4 0 4
0 0 0
0 0 0
0 0 0
I have over 30 files and I want to combine the second column of each file into one big file.
I know this question has already been asked, and I found a similar one here: How to combine column from multiple text files?
I used the answer from previous question for my problem:
paste *.tab | awk '{i=2;while($i){printf("%d ",$i);i+=3}printf("\n")}'
The problem is that zero values are not printed, I get something like this:
10 1
and I want something like this:
10 1
0 0
0 0
0 0
0 0
0 0
I checked the printf format specifiers, but none of them works. How can I solve this problem?
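You picked a bad "answer" to build on: while($i) treats a field whose value is 0 as false, so the loop stops at the first zero. Loop over the field numbers instead: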
paste *.tab |
awk '{for (i=2; i<=NF; i+=3) printf "%s%s", (i>2?OFS:""), $i; print ""}'
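To see the field-number loop at work, here is a minimal sketch that recreates the two sample files from the question in a temporary directory (the directory and the variable name `combined` are illustrative assumptions) and extracts the second column of each.

```shell
dir=$(mktemp -d)
printf '%s\n' '6 10 0' '49 0 53' '15 0 15' '0 0 0' '0 0 0' '0 0 0' > "$dir/File1.tab"
printf '%s\n' '3 1 2'  '29 0 29' '4 0 4'   '0 0 0' '0 0 0' '0 0 0' > "$dir/File2.tab"

# After paste, each row has 3 fields per file, so the wanted
# columns are fields 2, 5, 8, ... regardless of their values.
combined=$(paste "$dir/File1.tab" "$dir/File2.tab" |
  awk '{for (i = 2; i <= NF; i += 3) printf "%s%s", (i > 2 ? OFS : ""), $i; print ""}')
echo "$combined"

rm -rf "$dir"
```

The loop is bounded by NF rather than by the truth value of a field, which is why the zero rows survive. The step of 3 assumes every input file has exactly three columns.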

Another split file in bash - based on difference between rows of column x

Hello Stack Overflow users!
I would like to tune up the script I am using to make it less sensitive to missing data.
My example data looks like this (tab-delimited file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use an awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
  f=sprintf("file_no%02d.txt",fn++);
  print "Creating " f
}
{ print $0 > f }'
This gives me the output I want: it skips the 1st line, takes the 2nd column of the first data row as the delimiter value (in this example '0'), and starts a new file whenever that value appears again:
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that the rows with 0's have been deleted), I would need to split the file based on the difference between consecutive rows in column 2: if (value in row n+1) - (value in row n) < 0, start a new file. I would also need to keep the same file naming. The preferred way is bash with awk. Any advice? Thanks in advance!
Cheers!
Here is an awk command that you can use:
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
  next
}
!p || $2 < p {
  f = sprintf("file_no%02d.txt",fn++);
  print "Creating " f
}
{
  p = $2;
  print $0 > f
}' file
I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name before processing starts.
Second, in the last rule, save the current line's value so it can be compared with the next line's value.
Third, instead of comparing with zero, subtract the previous value from the current one and check whether the result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
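The idea shared by both answers, starting a new output file whenever column 2 drops below the previous row's value, can be tried on a small made-up sample. In this sketch the input rows, the `part_` prefix and the variable names are illustrative assumptions. It tests for "no file opened yet" with `f == ""` and forces numeric comparison with `+ 0`, one way to keep a 0 in column 2 from being mistaken for a missing previous value.

```shell
dir=$(mktemp -d)
# Tab-delimited sample with a header; column 2 restarts twice.
printf 'ColA\tColB\tColC\n3\t5.1\t12.1\n1\t10.2\t19.4\n8\t5.0\t46.8\n6\t10.0\t60.1\n1\t5.0\t51.9\n' > "$dir/input.tsv"

awk -v prefix="$dir/part_" '
  NR == 1 { next }                          # skip the header line
  f == "" || $2 + 0 < prev {                # first data row, or column 2 decreased
    if (f != "") close(f)                   # release the previous output file
    f = sprintf("%s%02d.txt", prefix, fn++)
  }
  { prev = $2 + 0; print > f }
' "$dir/input.tsv"

nfiles=$(ls "$dir"/part_*.txt | awk 'END{print NR}')
lines_first=$(awk 'END{print NR}' "$dir/part_00.txt")
lines_last=$(awk 'END{print NR}' "$dir/part_02.txt")
echo "$nfiles files; first has $lines_first lines, last has $lines_last"

rm -rf "$dir"
```

Closing each file before opening the next is optional for three outputs, but it avoids running into the open-file limit when an input splits into many parts.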

Help with duplicating rows based on a field using awk

I have the following data set, in which the 3rd field consists of 0's and 1's.
Input
1 2 1
2 4 0
3 3 1
4 1 1
5 0 0
I wish to expand the data set as follows:
Duplicate each row based on the 2nd field, and
replace only the "new" 1's (obtained after duplication) in the 3rd field with 0.
How can I do this with AWK?
Thanks
Output
1 2 1
1 2 0
2 4 0
2 4 0
2 4 0
2 4 0
3 3 1
3 3 0
3 3 0
4 1 1
awk '{print; $3=0; for (i=1; i<$2; i++) print}' inputfile
If you want to actually skip records with a zero in the second field (as your example seems to show):
awk '{if ($2>0) print; $3=0; for (i=1; i<$2; i++) print}' inputfile
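For readers who want to see the steps spelled out, here is a commented sketch of the second variant run on the sample input from the question. The variable names `input` and `expanded` are illustrative.

```shell
input='1 2 1
2 4 0
3 3 1
4 1 1
5 0 0'

expanded=$(printf '%s\n' "$input" | awk '
  $2 > 0 {
    print                                   # original row, 3rd field untouched
    $3 = 0                                  # every further copy gets 0 in field 3
    for (i = 1; i < $2; i++) print          # $2 - 1 additional copies
  }')
echo "$expanded"
```

Assigning to `$3` makes awk rebuild the record with the output field separator, so the copies are space-separated even if the input used tabs. Set `OFS` if a different separator is needed.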
