Adding the last number in each file to the numbers in the following file - bash

I have some directories, each of which contains a file with a list of integers 1-N; the integers are not necessarily consecutive and the lists may be different lengths. What I want to achieve is a single file with a list of all those integers as though they had been generated as one list.
What I am trying to do is to add the final value N from file 1 to all the values in file 2, then take the new final value of file 2 and add it to all the values in file 3 etc.
I have tried this by setting a counter and looping over the files, resetting the counter when I get to the end of each file. The problem is that p keeps getting reset to 0 (the value assigned inside awk never makes it back to the shell), which is kind of obvious from the code, but I am not sure how else to do it.
What I tried:
p=0
for i in dirx/dir_*; do
(cd "$i" || exit;
awk -v p=$p 'NR>1{print last+p} {last=$0} END{$0=last; p=last; print}' file >> /someplace/bigfile)
done
This is similar to the answer suggested in the question "Replacing value in column with another value in txt file using awk".
Now I'm wondering whether I need an if/else: if it's the first dir then p=0, otherwise p = the last value from the previous file, though I'm not sure about that or how I'd get it to take the last value. I used awk because that's what I understand a small amount of and would usually use.

With GNU awk
gawk '{print $1 + last} ENDFILE {last = last + $1}' file ...
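Spelled out with comments (the same one-liner, relying on gawk's ENDFILE rule, where the fields of the file's last record are still available; the file names are the demo files shown below):
gawk '{ print $1 + last }            # shift every value by the running offset "last"
      ENDFILE { last = last + $1 }   # $1 is still this file's final value; update the offset
     ' a b c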
Demo:
$ cat a
1
2
4
6
8
$ cat b
2
3
5
7
$ cat c
1
2
3
$ gawk '{print $1 + last} ENDFILE {last = last + $1}' a b c
1
2
4
6
8
10
11
13
15
16
17
18
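If gawk's ENDFILE is not available, a minimal POSIX-awk sketch of the same idea (one number per line, as in the demo) is to freeze the running offset whenever a new file starts:
awk 'FNR == 1 { offset = last }                # new file: lock in the offset accumulated so far
     { last = $1 + offset; print last }' a b c
Applied to the original layout, the file list would be dirx/dir_*/file and the output redirected to /someplace/bigfile, with no shell counter needed.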

Related

Modify values of one column based on values of another column on a line-by-line basis

I'm looking to use bash/awk/sed in order to modify a document.
The document contains multiple columns. Column 5 currently has the value "A" in every row, and Column 6 is composed of increasing numbers. I'm attempting a script that goes through the document line by line, checks the value of Column 6, and if the value is greater than a certain integer (specifically 275), changes the value of Column 5 on that same line to "B".
while IFS="" read -r line ; do
awk 'BEGIN {FS = " "}'
Num=$(awk '{print $6}' original.txt)
if [ $Num > 275 ] ; then
awk '{ gsub("A","B",$5) }'
fi
done < original.txt >> edited.txt
For the above, I've tried setting the Num variable both inside and outside of the while loop.
I've also tried using a for loop and cat:
awk 'BEGIN {FS = " "}' original.txt
Num=$(awk '{print $6}' heterodimer_P49913/unrelaxed_model_1.pdb)
integer=275
for data in $Num ; do
if [ $data > $integer ] ; then
##Change value in other column to "B" for all lines containing column 6 values greater than "integer"
fi
done
Thanks in advance.
GNU AWK does not need an external while loop (it has an implicit loop over the input); if you need further explanation, read the awk info page. Let file.txt content be
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 A 300
and the task to be
checks the value of Column 6, if the value is greater than a certain
integer (specifically 275) the value of Column 5 in that same line is
changed to "B".
then it might be done using GNU AWK the following way
awk '$6>275{$5="B"}{print}' file.txt
which gives output
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 B 300
Explanation: the action setting the value of the 5th field ($5) to B is applied conditionally, only to rows where the value of the 6th field is greater than 275. The print action is applied unconditionally to all lines. Observe that the change, if applied, is made before printing.
(tested in GNU Awk 5.0.1)
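If the goal is to produce edited.txt as in the original attempt, the same command can simply be redirected (a minimal sketch using the question's file names):
awk '$6>275{$5="B"}{print}' original.txt > edited.txt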

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code, but it is super slow, so I wanted to ask if anybody could come up with a quicker way?
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Spawning several programs for every unique scaffold, and rescanning the whole input each time, is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n#F";
} continue { #p = #F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way); Perl then reads the lines, -a splits each line into the @F array, and the @p array is used to keep the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
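For comparison, here is a single-pass awk sketch of the same range-collapsing idea, assuming the input is sorted by scaffold and then by site as in the example, and using the same plain file names as the Perl example; the header line is rewritten to match the desired $indiv.bed:
awk 'NR == 1 { print "SCAFF", "SITE-START", "SITE-END"; next }  # rewrite the header
     $1 != scaff || $2 != prev + 1 {                            # new scaffold or a gap: close the open range
         if (NR > 2) print scaff, first, prev
         scaff = $1; first = $2
     }
     { prev = $2 }
     END { if (NR > 1) print scaff, first, prev }' indiv.txt > indiv.bed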

Count how many occurrences are greater than or equal to a defined value in a line

I have a file (F1) with N=10000 lines; each line contains M=20000 numbers. I have another file (F2) with N=10000 lines and only 1 column. How can I count the number of occurrences in line i of file F1 that are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit >
For now I've only succeeded in printing the number of occurrences that are higher than a defined value. Here is an example with a file with 3 lines and a defined value of 15 (sorry, it's very dirty code..):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR{a[FNR]=$1;next}
{count=0;for(i=1;i<=NF;i++)
{if($i >= a[FNR])
{count++}
};
print count
}' file2 file1
While processing file2 (where FNR==NR, i.e. the overall record number equals the current file's record number), store each value in array a with the current record number as the index.
For each line of file1, initialize count to 0.
Loop through the fields and increment the counter if a field's value is greater than or equal to the value stored at index FNR in array a.
Print the count value.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
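If you'd rather not hold file2 in an array (with 10000 values the array is harmless, but for illustration), a sketch that reads the matching threshold line on the fly with getline:
awk '{ if ((getline t < "file2") <= 0) exit   # read the threshold for this line of file1
       c = 0
       for (i = 1; i <= NF; i++) c += ($i >= t)
       print c
     }' file1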

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros.
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
This seems to be a pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set input field separator as \n (newline)
-v RS='(1\n)+' # set input record separator to one or more occurrences of 1 followed by a newline
NF # execute the block only if at least one field is found
print NF-1 # print the number of fields minus 1 to get the count of 0s
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
"reset" the line counter when 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option it defaults to whitespace - meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the last 1 and the end of the file will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
if you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
The same logic actually looks better in awk:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
More strange (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It
reads the file with the record separator set to the character 1 (the -0x31),
with autosplit -a (which splits each record into the array @F),
and prints the number of elements in @F, e.g. say @F+0 (or you could use say scalar @F).
Unfortunately, after the final 1 (as record separator) it reads an empty record and therefore prints the trailing 0.
It is an incorrect solution; I'm showing it only as an alternative curiosity.
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is a matter of printing the counters that precede a 0. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field if the second field is 0.
The simplest solution would be to use sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piped into awk '{print length}' you can get the count of 0 for each interval:
Output:
3
5
5
2

Apply a gawk script to multiple files in a folder

I would like to use the following awk line to remove every even line (and keep the odd lines) in a text file.
awk 'NR%2==1' filename.txt > output
The problem is that I struggle to either loop properly in awk or build a shell script to apply this to all *.txt files in a folder. I tried to use this one-liner
gawk 'FNR==1{if(o)close(o);o=FILENAME;
sub(/\.txt/,"_oddlines.txt",o)}{NR%2==1; print>o}'
but that didn't remove the even lines. And I am even less familiar with shell scripting. I use gawk under win7 or cygwin with bash. Many thanks for any kind of idea.
Your existing gawk one-liner is really close. Here it is formatted as a more readable script:
FNR == 1 {
if (o)
close(o)
o = FILENAME
sub(/\.txt/, "_oddlines.txt", o)
}
{
NR % 2 == 1
print > o
}
This should make the error obvious[1]. So now we remove that error:
FNR == 1 {
if (o)
close(o)
o = FILENAME
sub(/\.txt/, "_oddlines.txt", o)
}
NR % 2 == 1 {
print > o
}
$ awk -f foo.awk *.txt
and it works (and of course you can re-one-line-ize this).
(Normally I would do this with a for like the other answers, but I wanted to show you how close you were!)
[1] Per comment, maybe not quite so obvious?
Awk's basic language construct is the "pattern-action" statement. An awk program is just a list of such statements. The "pattern" is so named because originally they were mostly grep-like regular expression patterns:
$ awk '/^be.*st$/' < /usr/share/dict/web2
beanfeast
beast
[snip]
(Except for the slashes, this is basically just running grep, since it uses the default action, print.)
Patterns can actually contain two addresses, but it's more typical to use one, as in these cases. Patterns not enclosed within slashes allow tests like FNR == 1 (File-specific Number of this Record equals 1) or NR % 2 == 1 (Number of this Record—cumulative across all files!—mod 2 equals 1).
Once you hit the open brace, though, you're into the "action" part. Now NR % 2 == 1 simply calculates the result (true or false) and then throws it away. If you leave out the "pattern" part entirely, the "action" part is run on every input line. So this prints every line.
Note that the test NR % 2 == 1 is testing the cumulative record-number. So if some file has an odd number of lines ("records"), the next file will print out every even-numbered line (and this will persist until you hit another file with an odd number of lines).
For instance, suppose the two input files are A.txt and B.txt. Awk starts reading A.txt and has both FNR and NR set to 1 for the first line, which might be, e.g., file A, line 1. Since FNR == 1 the first "action" is done, setting o. Then awk tests the second pattern. NR is 1, so NR % 2 is 1, so the second "action" is done, printing that line to A_oddlines.txt.
Now suppose file A.txt contains only that one line. Awk now goes on to file B.txt, resetting FNR but leaving NR cumulative. The first line of B might be file B, line 1. Awk tries the first "pattern", and indeed, FNR == 1 so this closes the old o and sets up the new one.
But NR is 2, because NR is cumulative across all input files. So the second pattern (NR % 2 == 1) computes 2 % 2 (which is 0) and compares == 1 which is false, and thus awk skips the second "action" for line 1 of file B.txt. Line 2, if it exists, will have FNR == 2 and NR == 3, so that line will be copied out.
(I originally assumed, since your script was close to working, that you intended this and were just stuck a bit on syntax.)
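If what was actually wanted is the odd lines of each file (rather than the cumulative behaviour just described), a small sketch of the fix is to test FNR instead of NR in the second pattern, leaving the rest of the script unchanged:
FNR == 1 {
    if (o)
        close(o)
    o = FILENAME
    sub(/\.txt/, "_oddlines.txt", o)
}
FNR % 2 == 1 {
    print > o
}
Run it the same way, awk -f foo.awk *.txt.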
With GNU awk you could just do:
$ awk 'FNR%2{print > (FILENAME".odd")}' *.txt
This will create a .odd file for every .txt file in the current directory containing only the odd lines.
However sed has the upper hand on conciseness here. The following GNU sed command will remove all even lines and store the old file with the extension .bck for all .txt files in the current directory:
$ sed -ni.bck '1~2p' *txt
Demo:
$ ls
f1.txt f2.txt
$ cat f1.txt
1
2
3
4
5
$ cat f2.txt
6
7
8
9
10
$ sed -ni.bck '1~2p' *txt
$ ls
f1.txt f1.txt.bck f2.txt f2.txt.bck
$ cat f1.txt
1
3
5
$ cat f1.txt.bck
1
2
3
4
5
$ cat f2.txt
6
8
10
$ cat f2.txt.bck
6
7
8
9
10
If you don't want the backup files then simply:
$ sed -ni '1~2p' *txt
Personally, I'd use
for filename in *.txt; do
awk 'NR%2==1' "$filename" > "oddlines-$filename"
done
EDIT: quote filenames
You can try a for loop :
#!/bin/bash
for file in dir/*.txt
do
oddfile=$(echo "$file" | sed -e 's|\.txt|_odd\.txt|g') #This will create file_odd.txt
awk 'NR%2==1' "$file" > "$oddfile" # This will output it in the same dir.
done
Your problem is that NR%2==1 is inside the {NR%2==1; print>o} 'action block' and is not kicking in as a 'condition'. Use this instead:
gawk 'FNR==1{if(o)close(o);o=FILENAME;sub(/\.txt/,"_oddlines.txt",o)};
FNR%2==1{print > o}' *.txt

Resources