Combining loop with awk - for-loop

I need help combining an awk command with a loop.
I have two files, one Bedfile.bed and a Samplelist.txt that look like this:
Bedfile.bed
HiC_scaffold_2 1 50001
HiC_scaffold_2 400001 450001
HiC_scaffold_2 800001 850001
Samplelist.txt
sampleA
sampleB
sampleC
I would like to create a new bed file for each sample in Samplelist.txt, with the sample name added as a new column on every line and included in the output file name. For example, for the first two samples:
Bedfile_SampleA.bed
HiC_scaffold_2 1 50001 SampleA
HiC_scaffold_2 400001 450001 SampleA
HiC_scaffold_2 800001 850001 SampleA
Bedfile_SampleB.bed
HiC_scaffold_2 1 50001 SampleB
HiC_scaffold_2 400001 450001 SampleB
HiC_scaffold_2 800001 850001 SampleB
I have done this for one file but I have more than a hundred files, so I would like to do some sort of loop using a sample list.
awk ' {print $1"\t"$2"\t"$3"\t""SampleA"}' Bedfile.bed > Bedfile_SampleA.bed
Any suggestion?

$ awk -v OFS='\t' '
NR==FNR { samples[$0]; next }
FNR == 1 {
base = FILENAME
sub(/\..*/,"",base)
}
{
for ( sample in samples ) {
out = base "_" sample ".bed"
print $0 (NF ? OFS sample : "") > out
}
}
' Samplelist.txt Bedfile.bed
$ head Bedfile_*
==> Bedfile_sampleA.bed <==
HiC_scaffold_2 1 50001 sampleA
HiC_scaffold_2 400001 450001 sampleA
HiC_scaffold_2 800001 850001 sampleA
==> Bedfile_sampleB.bed <==
HiC_scaffold_2 1 50001 sampleB
HiC_scaffold_2 400001 450001 sampleB
HiC_scaffold_2 800001 850001 sampleB
==> Bedfile_sampleC.bed <==
HiC_scaffold_2 1 50001 sampleC
HiC_scaffold_2 400001 450001 sampleC
HiC_scaffold_2 800001 850001 sampleC
The above will work in any awk, provided the number of output files does not exceed the "too many open files" limit. If it does, it will still work with GNU awk, and there is a simple tweak to make it work with any awk.
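One possible form of such a tweak (a sketch, not necessarily the one this answer has in mind) is to keep the bed lines in memory and write one sample file at a time, closing it before moving on, so that only a single output file is open at any moment. The scratch directory, the two-line sample data and the fixed Bedfile_ prefix are assumptions made for the demonstration.

```shell
# Demo setup in a scratch directory (sample data shortened from the question).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'HiC_scaffold_2\t1\t50001\nHiC_scaffold_2\t400001\t450001\n' > Bedfile.bed
printf 'sampleA\nsampleB\n' > Samplelist.txt

awk -v OFS='\t' '
NR == FNR { bed[++n] = $0; next }        # first file: remember every bed line
NF {                                     # second file: one non-empty sample per line
    out = "Bedfile_" $0 ".bed"
    for (i = 1; i <= n; i++)
        print bed[i], $0 > out
    close(out)                           # release the file before the next sample
}' Bedfile.bed Samplelist.txt

cat Bedfile_sampleB.bed
```

Note the argument order: the bed file comes first here, because it is the one held in memory.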

This is very straightforward in awk. First you read the sample file into memory, and then you process the full bed file:
awk 'BEGIN{OFS="\t"}(FNR==NR){a[$0]; next}{for(i in a){f=FILENAME"."i; print $0,i > f}}' sample.txt bed.txt

You can do the operation and the loop all in AWK, but if you wanted to do the loop 'separately' for another reason, you could use:
while read -r sample
do
awk -v var="$sample" 'BEGIN{OFS="\t"} {print $0, var}' bedfile.bed > bedfile_"$sample".bed
done < samplelist.txt

How to delete every m-th and n-th line of a sequence of every K number of lines in a file

Is there any way to delete every m-th and n-th line of a sequence of every K lines from a file using sed or awk?
Example:
cat input.txt
Aline1
Aline2
Aline3
Aline4
Aline5
Aline6
Aline7
Aline8
Aline9
Bline1
Bline2
Bline3
Bline4
Bline5
Bline6
Bline7
Bline8
Bline9
...
I want to remove every 3rd (line3) & 7th (line7) lines of a sequence of every 9 lines of the file.
So the output will look like
Aline1
Aline2
Aline4
Aline5
Aline6
Aline8
Aline9
Bline1
Bline2
Bline4
Bline5
Bline6
Bline8
Bline9
...
I tried to combine the two conditions, but without success:
awk '(NR)%3 && (NR)%7' input.txt
Edit:
Here the 3rd and 7th lines refer to lines within each of these sequences (Aline*, Bline*, ...), each of which consists of 9 lines.
So the first 9 lines of the input file define sequence A, in which I want to remove the 3rd and 7th lines.
The next 9 lines of the input file define sequence B, and there I want to do the same. So this would correspond to the 12th and 16th lines of the original file.
PS. I do not want to match the text *line3 and *line7 and delete those lines, since in general these lines might contain anything.
Using GNU sed
$ sed '3~9d;7~9d' input_file
Aline1
Aline2
Aline4
Aline5
Aline6
Aline8
Aline9
Bline1
Bline2
Bline4
Bline5
Bline6
Bline8
Bline9
Using any awk:
$ awk '(NR%9) !~ /^[37]$/' file
Aline1
Aline2
Aline4
Aline5
Aline6
Aline8
Aline9
Bline1
Bline2
Bline4
Bline5
Bline6
Bline8
Bline9
A few awk ideas:
awk -v line1=3 -v line2=7 -v inc=9 ' # line1/line2 are line numbers to ignore; inc(rement) is added to line1/line2 for next set of lines to ignore
FNR==line1 { line1+=inc; next } # skip line1, add "inc" for next line number
FNR==line2 { line2+=inc; next } # skip line2, add "inc" for next line number
1' input.txt # print current line
# or
awk -v line1=3 -v line2=7 -v blk=9 '
(FNR % blk == line1) || (FNR % blk == line2) {next}
1' input.txt
# or
awk -v line1=3 -v line2=7 -v blk=9 '
(FNR % blk != line1) && (FNR % blk != line2)
' input.txt
# or
awk -v line1=3 -v line2=7 -v blk=9 '
BEGIN { lines[line1]; lines[line2] }
! ((FNR % blk) in lines)
' input.txt
All generate:
Aline1
Aline2
Aline4
Aline5
Aline6
Aline8
Aline9
Bline1
Bline2
Bline4
Bline5
Bline6
Bline8
Bline9
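The last variant generalises to any set of positions. The sketch below assumes a hypothetical comma-separated drop variable and computes a 1-based position within each block, so the final line of a block (position 9 here) could be dropped as well. A plain FNR % blk comparison cannot express that case directly, because the remainder for that line is 0.

```shell
# Drop the positions listed in "drop" from every block of "blk" lines.
# seq stands in for input.txt so the example is self-contained.
result=$(seq 1 18 | awk -v drop='3,7' -v blk=9 '
BEGIN {
    n = split(drop, d, ",")
    for (i = 1; i <= n; i++) skip[d[i]]      # positions to drop, stored as array keys
}
{ pos = (FNR - 1) % blk + 1 }                # position 1..blk within the current block
!(pos in skip)                               # print only lines whose position is kept
')
echo "$result" | tr '\n' ' '; echo
```

With the values above, lines 3, 7, 12 and 16 of the 18 input lines are removed.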
Use this combined modulo test, ( ( NR % 9 ) % 4 ) < 3 :
mawk 'BEGIN { for(__=length(___="CBA"); __; __--) {
for(_^=___; _!~!_; _++) {
print substr(___,__,!!_)_ } } }' |
{m,g}awk '($3 = ($2 = NR % 9 ) % 4 ) < 3'
A1 1 1
A2 2 2
A4 4 0
A5 5 1
A6 6 2
A8 8 0
A9 0 0
B1 1 1
B2 2 2
B4 4 0
B5 5 1
B6 6 2
B8 8 0
B9 0 0
C1 1 1
C2 2 2
C4 4 0
C5 5 1
C6 6 2
C8 8 0
C9 0 0
or better yet, simplify that to:
( ( NR % 9 ) + 1 ) % 4
The most compact form would be:
awk '(NR%9+1)%4'
Note that moving the + 1 inside the first modulo, as in ( ( NR + 1 ) % 9 ) % 4, is not equivalent: it also drops the 8th line of every block, because ( 8 + 1 ) % 9 is 0.
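A quick way to sanity-check a modulo shortcut of this kind is to list which line numbers it keeps over two 9-line blocks. The helper name kept is hypothetical, and seq stands in for a real input file.

```shell
# Print the line numbers (out of 1..18) that a given awk condition keeps.
kept() { seq 1 18 | awk "$1" | tr '\n' ' '; }

for cond in '(NR%9)%4 < 3' '(NR%9+1)%4'; do
    printf '%-14s -> %s\n' "$cond" "$(kept "$cond")"
done
```

Both conditions should report the same list, with 3, 7, 12 and 16 missing.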
This might work for you (GNU sed):
sed -E 'x;s/^/x/;/^x{9}$/{s///;x;b};/^x{3}$|^x{7}$/{x;d};x' file
For each line in the file, swap to the hold space and prepend an x to the counter kept there.
If the counter contains 9 x's, reset it, swap back to the pattern space and skip any further processing, i.e. print that line.
If the counter contains 3 or 7 x's, swap back to the pattern space and delete that line.
For all other lines, swap back to the pattern space and print the line as normal.
Of course, this can be done more easily using GNU sed-specific addresses:
sed '3~9d;7~9d' file
First, you need to extend your example with an Aline10 line if you want to see the results you displayed; otherwise Bline1 will be the 10th line and not the 11th, as would be intuitive. So I created a file with Aline1 up to Aline20.
With GNU sed you can easily remove every nth line with this syntax (here every 3rd line, starting from line 1):
$ cat test.txt | sed '1~3d'
Aline2
Aline3
Aline5
Aline6
Aline8
...
But you cannot chain two sed commands in a pipeline, because the second one sees renumbered lines, so this is wrong:
$ cat test.txt | sed '1~3d' | sed '1~7d'
Aline3
Aline5
Aline6
Aline8
Aline9
Aline11
Aline14
Aline15
Aline17
Aline18
However this is a piece of cake with awk
$ cat test.txt | awk '{ if (NR%3!=0&&NR%7!=0) printf "%s\n",$0 }'
Aline1
Aline2
Aline4
Aline5
Aline8
Aline10
Aline11
Aline13
Aline16
Aline17
Aline19

all pairs of consecutive lines sharing a field, using awk

I would like to process a multi-line, multi-field input file so that I get a file with all pairs of consecutive lines ONLY IF they have the same value as field #1.
That is, for each line, the output would contain the line itself + the next line, and would omit combinations of lines with different values in field #1.
It's better explained with an example.
Given this input:
1 this
1 that
1 nye
2 more
2 sit
I want to produce something like:
1 this 1 that
1 that 1 nye
2 more 2 sit
So far I've got this:
awk 'NR % 2 == 1 { i=$0 ; next } { print i,$0 } END { if ( NR % 2 == 1 ) { print i } }' input.txt
My output:
1 this 1 that
1 nye 2 more
2 sit
As you can see, my code is blind to field #1 value, and also (and more importantly) it omits "intermediate" results like 1 that 1 nye (once it's done with a line, it jumps to the next pair of lines).
Any ideas? My preferred language is awk/gawk, but if it can be done using unix bash it's ok as well.
Thanks in advance!
You can use this awk:
awk 'NR>1 && ($1 in a){print a[$1], $0} {a[$1]=$0}' file
1 this 1 that
1 that 1 nye
2 more 2 sit
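Note that the array-based command pairs a line with the most recent earlier line carrying the same first field, even if lines with other keys appeared in between. If pairs must be strictly adjacent, a sketch that remembers only the previous line could look like this (the variable names key and prev are illustrative, and the input is reordered from the question to show the difference):

```shell
result=$(printf '1 this\n1 that\n2 more\n1 nye\n2 sit\n' |
    awk '
    NR > 1 && $1 == key { print prev, $0 }   # same field 1 as the line just before
    { key = $1; prev = $0 }                  # remember the current line
    ')
echo "$result"
```

On this reordered input only "1 this 1 that" is printed, since no other two neighbouring lines share field 1.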
You can do it with simple commands. Assuming your input file is "test.txt" with content:
1 this
1 that
1 nye
2 more
2 sit
the following commands give the requested output:
sort -n test.txt > tmp1
(echo; cat tmp1) | paste - tmp1 | grep -E '^([0-9])+ *[^ ]* *\1'
Just for fun
paste -d" " filename <(sed 1d filename) | awk '$1==$3'

How do I grep non-zero words in a file?

I want to grep all strings that are not 0 in a file. Is there a way to do that?
The file looks something like the following:
0
0
0
0.12
0
0
and I would like the output to be:
0.12
With awk you can do:
awk '$1' file
This will skip all lines whose first field is 0 (or empty).
If you would like to print all words that are not 0, try this:
awk '{for (i=1;i<=NF;i++) if ($i!=0) printf "%s ",$i;print ""}' file
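Since the question asks about grep specifically, a minimal sketch with grep is to invert a whole-line match. It assumes one value per line and that zero is always written exactly as 0 (not 0.0 or 00).

```shell
# -x matches the whole line, -v inverts: keep every line that is not exactly "0".
result=$(printf '0\n0\n0\n0.12\n0\n0\n' | grep -vx '0')
echo "$result"
```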

Find repeating sections and output each section to an individual file?

Take a text file with lines like:
/user$ cat ORIGFILE
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
If a session number (e.g. 200289) occurs more than once, each repeating section should be written to its own file, like this:
/user$ cat se832p41iEC.200289
se832p41iEC.200289_EDI832I140401232506.txt
se832p41iEC.200289_EDI832I140401232507.txt
se832p41iEC.200289_EDI832I140401232508.txt
/user$ cat xe832p41iEC.201687
xe832p41iEC.201687_EDI832I140401232511.txt
xe832p41iEC.201687_EDI832I140401232512.txt
xe832p41iEC.201687_EDI832I140401232513.txt
/user$ cat NEWFILE
pt832p41iEC.213631_EDI832I140401232501.txt
pt832p41iEC.213632_EDI832I140401232502.txt
Thank you in advance.
Update: Just figured it out after @Jaypal's hint (thanks man):
First - sort ORIGFILE| uniq -u > NEWFILE
Second - sort ORIGFILE | uniq -D > AWKFILE
Last - awk -F_ '{print $0 > $1}' AWKFILE
Now that you have added your attempt, here is a way of doing it with awk:
$ ls
file
$ cat file
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
$ awk -F_ '{
a[$1] = (a[$1] ? a[$1] RS $0 : $0)
b[$1]++
}
END {
for(x in a) print a[x] > (b[x]>1 ? x : "NEWFILE")
}' file
$ ls
NEWFILE file se832p41iEC.200289 xe832p41iEC.201687
$ head *
==> NEWFILE <==
pt832p41iEC.213631_EDI832I140401232501.txt
pt832p41iEC.213632_EDI832I140401232502.txt
==> file <==
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
==> se832p41iEC.200289 <==
se832p41iEC.200289_EDI832I140401232506.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
==> xe832p41iEC.201687 <==
xe832p41iEC.201687_EDI832I140401232512.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
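An alternative sketch reads the input twice instead of buffering it: the first pass counts each prefix, the second routes every line to its file. It assumes the input is a regular file that can be read twice; the scratch directory and the four sample lines (taken from the question) are only for the demonstration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > file <<'EOF'
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
se832p41iEC.200289_EDI832I140401232508.txt
pt832p41iEC.213632_EDI832I140401232502.txt
EOF

awk -F_ '
NR == FNR { count[$1]++; next }                     # pass 1: count each prefix
{ print > (count[$1] > 1 ? $1 : "NEWFILE") }        # pass 2: route each line
' file file

head -- *
```

Lines keep their input order within each output file. With a very large number of distinct repeated prefixes, the open-file limit of non-GNU awks may become a concern.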

awk: filter a file with another file

I'm trying to filter a file with another file.
I have a file d3_tmp and m2p_tmp; They are as follows:
$ cat d3_tmp
0x000001 0x4d 2
0x1107ce 0x4e 2
0x111deb 0x6b 2
$ cat m2p_tmp
mfn=0x000001 ==> pfn=0xffffffffffffffff
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
I want to print out the lines in m2p_tmp whose second column is not equal to the first column of d3_tmp. (The files are split with \t and =)
So the desired result is:
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
However, after I use the following awk command:
awk -F '[\t=]' ' FNR==NR { print $1; a[$1]=1; next } !($2 in a){printf "%s \t 0\n", $2}' d3_tmp m2p_tmp
The result is:
0x000001
0x1107ce
0x111deb
0x000001 0
0x000002 0
0x000003 0
I'm not sure why "$2 in a" does not work.
Could anyone help?
Thank you very much!
Using awk
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} !($2 in a)' d3_tmp FS="[ =]" m2p_tmp
a[$i] collects every field of d3_tmp as a key of array a; the NR==FNR condition restricts this collection to the first file, d3_tmp.
In the second part, FS is set to a space or "=", and $2 of each line in m2p_tmp is checked against array a; the line is printed only if $2 is not in the array.
The question has been edited, so I have to change the code as well.
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
awk -v FS="[\t= ]" ' FNR==NR { a[$1]=$1; next } !($2 in a){print $0}' d3_tmp m2p_tmp
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
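A likely reason the original $2 in a test failed (an inference from the sample lines as shown in the question) is that with -F '[\t=]' the second field of an m2p_tmp line keeps the space that precedes ==>, so it never equals a key taken from d3_tmp. Printing the field in brackets makes this visible:

```shell
line='mfn=0x000001 ==> pfn=0xffffffffffffffff'

# Split on tab or "=" only: the trailing space stays in $2.
without_space=$(echo "$line" | awk -F '[\t=]' '{ print "[" $2 "]" }')
# Add the space to the separator set: $2 is the bare address.
with_space=$(echo "$line" | awk -F '[\t= ]' '{ print "[" $2 "]" }')

echo "$without_space"   # [0x000001 ]
echo "$with_space"      # [0x000001]
```

This is why the answers above include a space in FS.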
