Count overlapping occurrences of a substring *in a very large file* using Bash

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.
Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?
For example,
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666
must return 6 occurrences for aaa. (Except there are maybe 10 million more lines of this.)

Find 6 overlapping substrings aaa in the string
line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
You don't want to see the strings, you want to count them.
When you try
# wrong
grep -o -F "aaa" <<< "${line}" | wc -l
you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how do you handle ${line}?
Start with
grep -Eo "a{3,}" <<< "${line}"
Result
aaa
aaaa
aaaaa
How many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa.
Compare the total count of characters with the number of lines (wc counts the newline that ends each line as a character):
match   lines  chars  add_to_total
aaa     1      4      1
aaaa    1      5      2
aaaaa   1      6      3
For each line, subtract 3 from the count of characters for that line.
When the result has 3 lines and 15 characters, calculate
15 characters - (3 lines * 3 characters) = 15 - 9 = 6
In code:
read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
Or for a file
read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
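Before trusting a formula like this on a huge file, it may help to cross-check it on a short string. The following is a minimal brute-force sketch in pure bash; count_brute is a hypothetical helper name, and the loop is far too slow for real data.

```shell
# Brute-force overlapping count in pure bash; only meant for short test strings.
# count_brute is a hypothetical name used for illustration.
count_brute() {
    local needle=$1 haystack=$2 count=0 i
    # Try every start offset and compare the slice of the same length as the needle.
    for ((i = 0; i + ${#needle} <= ${#haystack}; i++)); do
        if [[ ${haystack:i:${#needle}} == "$needle" ]]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
count_brute aaa "$line"    # prints 6, the same value the formula gives
```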
aaa was "easy", how about other search strings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are
abcdabc
abcdabcdabc
abcdabcdabcdabc
Use the test line
line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"
You need "abc(dabc)+", which gives
match            lines  chars  add_to_total
abcdabcdabcdabc  1      16     3
abcdabcdabcdabc  1      16     3
abcdabc          1      8      1
abcdabcdabc      1      12     2
For each line, subtract 4 from the count of characters and divide the result by 4. Equivalently, compute (characters / 4) - number of lines. When the result has 4 lines and 52 characters, calculate
(52 characters / 4) - 4 lines = 13 - 4 = 9
In code:
read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"
When you have a large file, you might want to split it first.
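If working out a formula per search string is not attractive, a generic alternative is to let awk restart the search one character after each hit. This is only a sketch under some assumptions: count_overlapping is a hypothetical helper name, the needle is a literal string without backslashes (awk -v interprets escape sequences), matches never span a line break, and lines are reasonably short, because substr() copies the rest of the line on every hit.

```shell
# Count overlapping occurrences of a literal string in a file, line by line.
# count_overlapping is a hypothetical helper name, not part of any tool.
count_overlapping() {
    local needle=$1 file=$2
    [ -n "$needle" ] || return 1    # an empty needle makes no sense here
    awk -v needle="$needle" '
        {
            rest = $0
            # index() returns the 1-based position of the first match, or 0.
            while ((pos = index(rest, needle)) > 0) {
                hits++
                # Resume one character after the match start so overlaps count.
                rest = substr(rest, pos + 1)
            }
        }
        END { print hits + 0 }
    ' "$file"
}
```

Usage would be something like count_overlapping aaa genome.txt, where genome.txt stands for your own file. The file is streamed once, so memory use depends on the longest line rather than on the file size.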

I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):
Use the summation method :
# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
${input……} | WHINY_USERS=1 {m,g}awk '
BEGIN {
    FS = "[^a]+(aa?[^a]+)*"
    OFS = "|"
    PROCINFO["sorted_in"] = "@ind_str_asc"
} {
    _ = ""
    OFS = "|"
    gsub("^[|]*|[|]*$",_, $!(NF=NF))
    split(_,__)
    split($-_,___,"[|]+")
    for (_ in ___) {
        __[___[_]]++
    }
    _____=____=_<_
    OFS = "\t"
    print " -- line # "(NR)
    for (_ in __) {
        print sprintf(" %20s",_), __[_], \
            ______=__[_] * (length(_)-2),\
            "| "(____+=__[_]), _____+=______
    }
    print "" }'
|
-- line # 1
aaa 3 3 | 3 3
aaaa 2 4 | 5 7
aaaaa 3 9 | 8 16
aaaaaaaaaaaaaaa 1 13 | 9 29
-- line # 2
aaa 1 1 | 1 1
aaaa 1 2 | 2 3
aaaaa 1 3 | 3 6
Print out all the copies of that substring :
{m,g}awk '{
    printf("%s%.*s",____=$(_=_<_),_, NF=NF)
    do { _+=gsub(__,_____)
    } while(index($+__,__))
    if(_) {
        ____=substr(____,-_<_,_)
        gsub(".", (":")__, ____)
        print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
    } else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'
|
aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
aaa:aaa:aaa:aaa:aaa:aaa
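The summation idea can also be written in a more readable form with grep, sort and uniq. This is a sketch rather than a drop-in replacement: it reports totals for the whole file instead of per line, and run_table is a hypothetical name.

```shell
# List each distinct run of 3 or more "a", how often it occurs,
# the hits it contributes, and a running total of hits.
# run_table is a hypothetical helper name.
run_table() {
    grep -Eo 'a{3,}' "$1" | sort | uniq -c |
        awk '{ hits = $1 * (length($2) - 2); total += hits; print $2, $1, hits, total }'
}
```

For a file that holds only the second test line, the last row ends in a running total of 6, in line with the table above.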

Related

Combining multiple awk output statements into one line

I have some ascii files I’m processing, with 35 columns each, and variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
I’ve done similar processing in the past, but by outputting temp files for each awk command, reading each successive temp file in to eventually create a final ascii file. Then, I would delete the temp files after. I’m hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step, that the above awk commands would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step for different data files, that I would also need the 2 new columns discussed earlier. This is simply appending a unique file name from what’s being catted to the last column of every row in a new ascii file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below would be the example foo_new.txt desired. The requested 2 columns of output from awk (last 2 columns). In this example, column 5 is the difference between column 3 and 2 plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt. The last column is an example of fname. These are computed in the shell script, and passed to awk. I don't care if the results in column 7 (fname) are at the end or placed between columns 4 and 5, so long as it gets along with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
The best luck so far, but unfortunately this is producing a file with the original output first, and the added output below it. I'd like to have the added output appended on as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt

Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2 plus 1), and a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but that has a unique value for each line. And you only want lines where column 4 matches F and 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
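Since the question mentions a shell loop that computes fname per input file, the same awk body could be wrapped in a small function. The sketch below assumes the 4-column sample layout; add_columns is a hypothetical name, and skipping rows where column 5 would be zero is my own choice, not something the question asks for.

```shell
# Append c5 = $3 - $2 + 1, c6 = $1 / c5 and a label to every matching row.
# add_columns is a hypothetical helper; $1 is the label, the rest are input files.
add_columns() {
    local label=$1
    shift
    awk -v fname="$label" '
        FNR == 1 { next }                       # drop the header of every input file
        $4 ~ /^F[0-9][0-9][0-9]$/ {
            c5 = $3 - $2 + 1
            if (c5 == 0) next                   # skip rows that would divide by zero
            print $0, c5, sprintf("%.1f", $1 / c5), fname
        }' "$@"
}
```

Inside the loop this would be called as add_columns "$fname" foo.txt >> foo_new.txt, with no temp files in between.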

I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special "first line", so if it's the first line, then preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example input you presented, and substituting 2 for N and 1 for another_column, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
This script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
I think gets closer to the output you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by that (N+1) you meant "the difference between two columns with 1 added".

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by value. The shell script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line; do
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
#I perform the stats calculations
# for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.

If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 input.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 2 $((cols+1))`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
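Because the assignment rules out awk, here is a sketch of how one column could be pulled out and summarized with bash built-ins plus sort, which the question already uses. It assumes bash 4 or later (for mapfile), integer values, and an input file that ends with a newline; column_values and column_stats are hypothetical names.

```shell
# Print one column of a whitespace-delimited file, one value per line.
column_values() {
    local col=$1 file=$2 fields
    while read -r -a fields; do
        printf '%s\n' "${fields[col - 1]}"    # arrays are 0-based, columns 1-based
    done < "$file"
}

# Sort the column and derive simple statistics with bash arithmetic only.
column_stats() {
    local sorted=() sum=0 n v
    mapfile -t sorted < <(column_values "$1" "$2" | sort -n)
    n=${#sorted[@]}
    if ((n == 0)); then
        return 1                              # nothing to summarize
    fi
    for v in "${sorted[@]}"; do
        sum=$((sum + v))
    done
    # Integer division truncates the mean; the median is the upper middle value.
    echo "min=${sorted[0]} max=${sorted[n - 1]} sum=$sum mean=$((sum / n)) median=${sorted[n / 2]}"
}
```

With the sample input, column_stats 2 input.txt prints min=1 max=57 sum=145 mean=24 median=36. Looping the column index from 1 to the number of fields in the first row covers every column.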

Search replace string in a file based on column in other file

If we have the first file like below:
(a.txt)
1 asm
2 assert
3 bio
4 Bootasm
5 bootmain
6 buf
7 cat
8 console
9 defs
10 echo
and the second like:
(b.txt)
bio cat BIO bootasm
bio defs cat
Bio console
bio BiO
bIo assert
bootasm asm
bootasm echo
bootasm console
bootmain buf
bootmain bio
bootmain bootmain
bootmain defs
cat cat
cat assert
cat assert
and we want the output to look like this:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
For each word in the second column of the first file, we search for it in every column of every line of the second file; if it is found, we replace it with the number from the first column of the first file. I managed this only for the first column and could not do it for the rest.
Here is the command I use:
awk 'NR==FNR{a[$2]=$1;next}{$1=a[$1];}1' a.txt b.txt
3 cat bio bootasm
3 defs cat
3 console
3 bio
3 assert
4 asm
4 echo
4 console
5 buf
5 bio
5 bootmain
5 defs
7 cat
7 assert
7 assert
How should I handle the other columns?
Thank you.

awk 'NR==FNR{h[$2]=$1;next} {for (i=1; i<=NF;i++) $i=h[$i];}1' a.txt b.txt
NR is the global record number (by default the line number) across all files. FNR is the line number within the current file. The NR==FNR block specifies what to do when the global line number equals the current file's line number, which is only true for the first file, i.e., a.txt. The next statement in this block skips the rest of the code, so the for loop only runs for the second file, i.e., b.txt.
First, we process the first file in order to store the word ids in an associative array: NR==FNR{h[$2]=$1;next}. After that, we can use these ids to map the words in the second file. The for loop (for (i=1; i<=NF;i++) $i=h[$i];) iterates over all columns and sets each column to a number instead of the string, so $i=h[$i] actually replaces the word at the ith column with its id. Finally, the 1 at the end of the script causes all lines to be printed out.
Produces:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
To make the script case-insensitive, add tolower calls into the array indices:
awk 'NR==FNR{h[tolower($2)]=$1;next} {for (i=1; i<=NF;i++) $i=h[tolower($i)];}1' a.txt b.txt
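One thing to keep in mind: with the one-liners above, a word that is not listed in a.txt is replaced by an empty string. If such words should stay as they are, a membership test can be added. This is a sketch; map_words is a hypothetical name and the file layout is assumed to match a.txt and b.txt from the question.

```shell
# Replace every word of the second file by its id from the first file,
# ignoring case; map_words is a hypothetical helper name.
map_words() {
    awk '
        NR == FNR { id[tolower($2)] = $1; next }    # first file: build word -> id table
        {
            for (i = 1; i <= NF; i++) {
                key = tolower($i)
                if (key in id) $i = id[key]         # unknown words are left untouched
            }
            print
        }' "$1" "$2"
}
```

It would be called as map_words a.txt b.txt.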

Divide and conquer! A bit archaic, but it does the job =)
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$1];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 1
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$2];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 2
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$3];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 3
awk 'NR==FNR{a[$2]=$0;next}{$1=a[$4];}1' a.txt b.txt | tr ' ' ',' | awk '{ print $1 }' FS="," > 4
paste 1 2 3 4 | tr '\t' ' '
gives:
3 7 3 4
3 9 7
3 8
3 3
3 2
4 1
4 10
4 8
5 6
5 3
5 5
5 9
7 7
7 2
7 2
In this case I just changed the column number in each command and pasted the results together, with a bit of editing in between.

{
cat a.txt; echo "--EndA--";cat b.txt
} | sed -n '1 h
1 !H
$ {
x
: loop
s/^ *\([[:digit:]]\{1,\}\) *\([^[:cntrl:]]*\)\(\n\)\(.*\)\2/\1 \2\3\4\1/
t loop
s/^ *[[:digit:]]\{1,\} *[^[:cntrl:]]*\n//
t loop
s/^[[:space:]]*--EndA--\n//
p
}
'
"--EndA--" can be replaced by another marker if there is a chance that it appears in one of the files (mainly a.txt).

Paste side by side multiple files by numerical order

I have many files in a directory with similar file names like file1, file2, file3, file4, file5, ..... , file1000. They are of the same dimension, and each one of them has 5 columns and 2000 lines. I want to paste them all together side by side in a numerical order into one large file, so the final large file should have 5000 columns and 2000 lines.
I tried
for x in $(seq 1 1000); do
paste `echo -n "file$x "` > largefile
done
Instead of writing all file names in the command line, is there a way I can paste those files in a numerical order (file1, file2, file3, file4, file5, ..., file10, file11, ..., file1000)?
for example:
file1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
...
file2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
....
file 3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
....
paste file1 file2 file3 .... file 1000 > largefile
largefile
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
....
Thanks.

If your current shell is bash: paste -d " " file{1..1000}
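Brace expansion generates all 1000 names whether or not the files exist, so paste stops with an error if one is missing. If that can happen, the list could be built in a loop instead. A sketch, assuming bash; paste_numbered is a hypothetical name.

```shell
# Paste prefix1 .. prefixN side by side in numeric order, skipping missing files.
# paste_numbered is a hypothetical helper; $1 = name prefix, $2 = highest number.
paste_numbered() {
    local prefix=$1 last=$2 files=() i
    for ((i = 1; i <= last; i++)); do
        if [[ -e $prefix$i ]]; then
            files+=("$prefix$i")
        fi
    done
    if ((${#files[@]} == 0)); then
        return 1                      # nothing to paste
    fi
    paste -d ' ' "${files[@]}"
}
```

For the files in the question that would be paste_numbered file 1000 > largefile.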

You need to rename the files with leading zeroes, like
paste <(ls -1 file* | sort -te -k2.1n) <(seq -f "file%04g" 1000) | xargs -n2 echo mv
The above is a dry run. Remove the echo if you are satisfied with the result.
or you can use e.g. perl
ls file* | perl -nlE 'm/file(\d+)/; rename $_, sprintf("file%04d", $1);'
and afterwards you can
paste file*

With zsh:
setopt extendedglob
paste -d ' ' file<->(n)
<x-y> matches positive decimal integer numbers from x to y. x and/or y can be omitted, so <-> is any positive decimal integer number. It could also be written [0-9]## (## being the zsh equivalent of regex +).
(n) is a glob qualifier. The n qualifier turns on numeric sorting, which sorts on all sequences of decimal digits appearing in the file names.

Split specific column(s)

I have this kind of records:
1 2 12345
2 4 98231
...
I need to split the third column into sub-columns to get this (separated by single-space for example):
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Can anybody offer me a nice solution in sed, awk, ... etc ? Thanks!
EDIT: the size of the original third column may vary record by record.

Awk
% echo '1 2 12345
2 4 98231
...' | awk '{
gsub(/./, "& ", $3)
print
}
'
1 2 1 2 3 4 5
2 4 9 8 2 3 1
...
[Tested with GNU Awk 3.1.7]
This takes every character (/./) in the third column ($3) and replaces (gsub()) it with itself followed by a space ("& ") before printing the entire line.
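If other columns follow the one being split, the trailing space that gsub adds ends up as a double space in the output. A small variation handles that and lets the column be chosen; this is a sketch, and split_column is a hypothetical name.

```shell
# Split the characters of one column into separate space-delimited fields.
# split_column is a hypothetical helper; $1 = column number, $2 = input file.
split_column() {
    awk -v col="$1" '{
        gsub(/./, "& ", $col)      # put a space after every character of the column
        sub(/ $/, "", $col)        # drop the trailing space that gsub leaves behind
        print
    }' "$2"
}
```

For example, split_column 3 records.log turns the line 1 2 12345 6 7 into 1 2 1 2 3 4 5 6 7.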

Sed solution:
sed -e 's/\([0-9]\)/\1 /g' -e 's/ \+/ /g'
The first sed expression replaces every digit with the same digit followed by a space. The second expression replaces every block of spaces with a single space, thus handling the double spaces introduced by the previous expression. With non-GNU seds you may need to use two sed invocations (one for each -e).

Using awk substr and printf:
[srikanth@myhost ~]$ cat records.log
1 2 12345 6 7
2 4 98231 8 0
[srikanth@myhost ~]$ awk '{ len=length($3); for(i=1; i<=NF; i++) { if(i==3) { for(j = 1; j <= len; j++){ printf "%s ", substr($3,j,1); } } else { printf "%s ", $i; } } printf("\n"); }' records.log
1 2 1 2 3 4 5 6 7
2 4 9 8 2 3 1 8 0
You can use this for more than three column records as well.

Using perl:
perl -pe 's/([0-9])(?! )/\1 /g' INPUT_FILE
Test:
[jaypal:~/Temp] cat tmp
1 2 12345
2 4 98231
[jaypal:~/Temp] perl -pe 's/([0-9])(?! )/\1 /g' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Using gnu sed:
sed 's/[0-9]/& /3g' INPUT_FILE
Test:
[jaypal:~/Temp] sed 's/[0-9]/& /3g' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Using gnu awk:
gawk '{print $1,$2,gensub(/./,"& ","G", $NF)}' INPUT_FILE
Test:
[jaypal:~/Temp] gawk '{print $1,$2,gensub(/./,"& ","G", $NF)}' tmp
1 2 1 2 3 4 5
2 4 9 8 2 3 1

If you don't care about spaces, this is a succinct version:
sed 's/[0-9]/& /g'
but if you need to remove spaces, we just chain another regexp:
sed 's/[0-9]/& /g;s/  */ /g'
Note this is compatible with the original sed, and thus will run on any UNIX-like system.

$ awk -F '' '$1=$1' data.txt | tr -s ' '
1 2 1 2 3 4 5
2 4 9 8 2 3 1

This might work for you:
echo -e "1 2 12345\n2 4 98231" | sed 's/\B\s*/ /g'
1 2 1 2 3 4 5
2 4 9 8 2 3 1
Most probably GNU sed only.
