Adding a loop in awk - bash

I had a problem that was resolved in a previous post:
But because I had too many files it was not practical to do an awk on every file and then use a second script to get the output I wanted.
Here are some examples of my files:
3
10
23
.
.
.
720
810
980
And the script was used to see where the numbers from the first file fell in this other file:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
After that range was located, the mean values of the second column in the second file was estimated.
Here are the scripts:
awk -v start=$(head -n 1 file1) -v end=$(tail -n 1 file1) -f script file2
and
BEGIN {
sum = 0;
count = 0;
range_start = -1;
range_end = -1;
}
{
irow = int($1)
ival = $2 + 0.0
if (irow >= start && end >= irow) {
if (range_start == -1) {
range_start = NR;
}
sum = sum + ival;
count++;
}
else if (irow > end) {
if (range_end == -1) {
range_end = NR - 1;
}
}
}
END {
print "start =", range_start, "end =", range_end, "mean =", sum / count
}
How could I make a loop so that the mean for every file was estimated. My desired output would be something like this:
Name_of_file
start = number , end = number , mean = number
Thanks in advance.

.. wrap it in a loop?
for f in <files>; do
echo "$f";
awk -v start=$(head -n 1 "$f") -v end=$(tail -n 1 "$f") -f script file2;
done
Personally I would suggest combining them on one line (so that your results are block-data as opposed to file names on different lines from their results -- in that case replace echo "$f" with echo -n "$f " (to not add the newline).
EDIT: Since I suppose you're new to the syntax, <files> can either be a list of files (file1 file2 file 3), a list of files as generated by a glob (file*, files/data_*.txt, whatever), or a list of files generated by a command ( $(find files/ -name 'data' -type f), etc).

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
return 1
else
echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmutliple lines' ~/textfile.txt
I am on MacOS btw which has presented some problems with some of the solutions I've seen posted elsewhere being very linux specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
Using awk:
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
if necessary, one would need to properly escape any metacharacters inside FS | OFS :
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

How to get n random "paragraphs" (groups of ordered lines) from a file

I have a file (originally compressed) with a known structure - every 4 lines, the first line starts with the character "#" and defines an ordered group of 4 lines. I want to select randomly n groups (half) of lines in the most efficient way (preferably in bash/another Unix tool).
My suggestion in python is:
path = "origin.txt.gz"
unzipped_path = "origin_unzipped.txt"
new_path = "/home/labs/amit/diklag/subset.txt"
subprocess.getoutput("""gunzip -c %s > %s """ % (path, unzipped_path))
with open(unzipped_path) as f:
lines = f.readlines()
subset_size = round((len(lines)/4) * 0.5)
l = random.sample(list(range(0, len(lines), 4)),subset_size)
selected_lines = [line for i in l for line in list(range(i,i+4))]
new_lines = [lines[i] for i in selected_lines]
with open(new_path,'w+') as f2:
f2.writelines(new_lines)
Can you help me find another (and faster) way to do it?
Right now it takes ~10 seconds to run this code
The following script might be helpful. This is however, untested as we do not have an example file:
attempt 1 (awk and shuf) :
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
nrec=$(gunzip -c $path | awk '/^#/{c++}{END print c})'
awk '(NR==FNR){a[$1]=1;next}
!/^#/{next}
((++c) in a) { for(i=1;i<=4;i++) { print; getline } }' \
<(shuf -i 1-$nrec -n $count) <(gunzip -c $path) > $new_path
attempt 2 (sed and shuf) :
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
gunzip -c $path | sed ':a;N;$!ba;s/\n/__END_LINE__/g;s/__END_LINE__#/\n#/g' \
| shuf -n $count | sed 's/__END_LINE__/\n/g' > $new_path
In this example, the sed line will replace all newlines with the string __END_LINE__, except if it is followed by #. The shuf command will then pick $count random samples out of that list. Afterwards we replace the string __END_LINE__ again by \n.
attempt 3 (awk) :
Create a file called subset.awk containing :
# Uniform(m) :: returns a random integer such that
# 1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }
# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m, i,j,k) {
for (i = 1; i <= m ; i++) { permutation[i] = i }
for (i = 1; i <= m-1; i++) {
j = Uniform(i-1)
k = permutation[i]
permutation[i] = permutation[j]
permutation[j] = k
}
}
BEGIN{RS="\n#"; srand() }
{a[NR]=$0}
END{ KnuthShuffle(NR);
sub("#","",a[1])
for(r = 1; r <= count; r++) {
print "#"a[permutation[r]]
}
}
And then you can run :
$ gunzip -c <file.gz> | awk -c count=30 -f subset.awk > <output.txt>

How to split file by percentage of no. of lines?

How to split file by percentage of no. of lines?
Let's say I want to split my file into 3 portions (60%/20%/20% parts), I could do this manually, -_- :
$ wc -l brown.txt
57339 brown.txt
$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
bc <<< "34398 + 11466 + 11475"
57339
$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
34398 part1.txt
11466 part2.txt
11475 part3.txt
57339 total
But I'm sure there's a better way!
There is a utility that takes as arguments the line numbers that should become the first of each respective new file: csplit. This is a wrapper around its POSIX version:
#!/bin/bash
usage () {
printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}
# Collect csplit options
while getopts "ksf:n:" opt; do
case "$opt" in
k|s) args+=(-"$opt") ;; # k: no remove on error, s: silent
f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
*) usage; exit 1 ;;
esac
done
shift $(( OPTIND - 1 ))
fname=$1
shift
ratios=("$#")
len=$(wc -l < "$fname")
# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[#]}"; do
(( total += ratio ))
cumsums+=("$total")
done
# Don't need the last element
unset cumsums[-1]
# Array of numbers of first line in each split file
for sum in "${cumsums[#]}"; do
linenums+=( $(( sum * len / total + 1 )) )
done
csplit "${args[#]}" "$fname" "${linenums[#]}"
After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,
percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1
are all equivalent.
Usage similar to the case in the question is as follows:
$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
34403 part0
11468 part1
11468 part2
57339 total
Numbering starts with zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for .txt extension and which could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.
This solution plays nice with very short files (split file of two lines into two) and the heavy lifting is done by csplit itself.
$ cat file
a
b
c
d
e
$ cat tst.awk
BEGIN {
split(pcts,p)
nrs[1]
for (i=1; i in p; i++) {
pct += p[i]
nrs[int(size * pct / 100) + 1]
}
}
NR in nrs{ close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }
$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt
Change the " > " to just > to actually write to the output files.
Usage
The following bash script allows you to specify the percentage like
./split.sh brown.txt 60 20 20
you also can use the placeholder . which fills the percentage up to 100%.
./split.sh brown.txt 60 20 .
the splitted file is written to
part1-brown.txt
part2-brown.txt
part3-brown.txt
The script always generates as much part files as numbers are specified.
If the percentages sum up to 100, cat part* will always generate the original file (no duplicated or missing lines).
Bash Script: split.sh
#! /bin/bash
file="$1"
fileLength=$(wc -l < "$file")
shift
part=1
percentSum=0
currentLine=1
for percent in "$#"; do
[ "$percent" == "." ] && ((percent = 100 - percentSum))
((percentSum += percent))
if ((percent < 0 || percentSum > 100)); then
echo "invalid percentage" 1>&2
exit 1
fi
((nextLine = fileLength * percentSum / 100))
if ((nextLine < currentLine)); then
printf "" # create empty file
else
sed -n "$currentLine,$nextLine"p "$file"
fi > "part$part-$file"
((currentLine = nextLine + 1))
((part++))
done
BEGIN {
split(w, weight)
total = 0
for (i in weight) {
weight[i] += total
total = weight[i]
}
}
FNR == 1 {
if (NR!=1) {
write_partitioned_files(weight,a)
split("",a,":") #empty a portably
}
name=FILENAME
}
{a[FNR]=$0}
END {
write_partitioned_files(weight,a)
}
function write_partitioned_files(weight, a) {
split("",threshold,":")
size = length(a)
for (i in weight){
threshold[length(threshold)] = int((size * weight[i] / total)+0.5)+1
}
l=1
part=0
for (i in threshold) {
close(out)
out = name ".part" ++part
for (;l<threshold[i];l++) {
print a[l] " > " out
}
}
}
Invoke as:
awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
Replace " > " with > in script to actually write partitioned files.
The variable w expects space separated numbers. Files are partitioned in that proportion. For example "2 1 1 3" will partition files into four with number of lines in proportion of 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.
For large files the array a may consume too much memory. If that is an issue, here is an alternative awk script:
BEGIN {
split(w, weight)
for (i in weight) {
total += weight[i]; weight[i] = total #cumulative sum
}
}
FNR == 1 {
#get number of lines. take care of single quotes in filename.
name = gensub("'", "'\"'\"'", "g", FILENAME)
"wc -l '" name "'" | getline size
split("", threshold, ":")
for (i in weight){
threshold[length(threshold)+1] = int((size * weight[i] / total)+0.5)+1
}
part=1; close(out); out = FILENAME ".part" part
}
{
if(FNR>=threshold[part]) {
close(out); out = FILENAME ".part" ++part
}
print $0 " > " out
}
This passes through each file twice. Once for counting lines (via wc -l) and the other time while writing partitioned files. Invocation and effect is similar to the first method.
i like Benjamin W.'s csplit solution, but it's so long...
#!/bin/bash
# usage ./splitpercs.sh file 60 20 20
n=`wc -l <"$1"` || exit 1
echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
awk -v n=$n 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
uniq | xargs csplit -sfpart "$1"
(the if(r > 1 && r < n) and uniq bits are to prevent creating empty files or strange behavior for small percentages, files with small numbers of lines, or percentages that add to over 100.)
I just followed your lead and made what you do manually into a script. It may not be the fastest or "best", but if you understand what you are doing now and can just "scriptify" it, you may be better off should you need to maintain it.
#!/bin/bash
# thisScript.sh yourfile.txt 20 50 10 20
YOURFILE=$1
shift
# changed to cat | wc so I dont have to remove the filename which comes from
# wc -l
LINES=$(cat $YOURFILE | wc -l )
startpct=0;
PART=1;
for pct in $#
do
# I am assuming that each parameter is on top of the last
# so 10 30 10 would become 10, 10+30 = 40, 10+30+10 = 50, ...
endpct=$( echo "$startpct + $pct" | bc)
# your math but changed parts of 100 instead of parts of 10.
# change bc <<< to echo "..." | bc
# so that one can capture the output into a bash variable.
FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
LASTLINE=$( echo "$LINES * $endpct / 100" | bc )
# use sed every time because the special case for head
# doesn't really help performance.
sed -n $FIRSTLINE,${LASTLINE}p $YOURFILE > part${PART}.txt
$((PART++))
startpct=$endpct
done
# get the rest if the % dont add to 100%
if [[ $( "lastpct < 100" | bc ) -gt 0 ]] ; then
sed -n $FIRSTLINE,${LASTLINE}p $YOURFILE > part${PART}.txt
fi
wc -l part*.txt

i have a protein sequence file i want to count trimers in it using sed or grep

I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[#]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
in this code i am storing individual record in a different file which is not good. Also the program is very slow in 12 hours only 12000 records are accessed out of .5 million records.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in #rogerovo's perl answer that'd be much simpler than the above and more efficient and use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
This perl script processes cca 550'000 "trimmers"/sec. (random valid test sequences 0-8000 chars long, 100k records (~400MB) produce an 2GB output csv)
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my #a=split //,$seq;
my #trimmer;
# while there are unprocessed chars in the sequence...
while (scalar #a) {
# fill up the buffer with a char from the top of the sequence
push #trimmer, shift #a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar #trimmer == 3 ) {
$trimdb{(join "",#trimmer)}++;
# drop the first letter from buffer, for next loop
shift #trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[#]}; do
seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
You can iterate from minimal to maximal number and take first non existing element,
use List::Util qw( first );
my #arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
#seen{#arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my #data = qw/ 1 2 4 5 6 /;
say first_missing(#data);
#data = ( 4 .. 99, 101 .. 122 );
say first_missing(#data);
sub first_missing {
my $start = $_[0];
for my $i ( 1 .. $#_ ) {
my $expected = $start + $i;
return $expected unless $_[$i] == $expected;
}
return;
}
output
3
100
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{#a=sort { $a <=> $b } #F; %h=map {$_=>1} #a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{#a=sort { $a <=> $b } #F; %h=map {$_=>1} #a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
split("1 2 4 5 6",a)
for (i=1;a[i+1]==a[i]+1;i++) ;
print a[i]+1
}
$ awk -f tst.awk
3
Having fun with #Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;
sub increasing_stream {
my $start = int($_[0]);
return sub {
$start += 1 + (rand(1) > 0.9);
};
}
my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
my $next = $stream->();
say $next;
last unless $next == ++$first;
$first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }.

Resources