Counting the number of 10-digit numbers in a file - bash

I need to count the total number of instances in which a 10-digit number appears within a file. All of the numbers have leading zeros, e.g.:
This is some text. 0000000001
Returns:
1
If the same number appears more than once, it is counted again, e.g.:
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
Returns:
3
Sometimes there are no spaces between the numbers, but each continuous string of 10 digits should be counted:
00000000010000000010000000000100000000010000000001
Returns:
5
How can I determine the total number of 10-digit numbers appearing in a file?

Try this:
grep -o '[0-9]\{10\}' inputfilename | wc -l
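A quick check against the tricky no-spaces case (assuming GNU grep; -o prints each non-overlapping match on its own line, so a 50-digit run yields five 10-digit matches):
$ printf '%s\n' 00000000010000000010000000000100000000010000000001 | grep -o '[0-9]\{10\}' | wc -l
5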

The last requirement - that you need to count multiple numbers per line - excludes grep; as far as I know, it can only count per line.
Edit: Obviously, I stand corrected by Nate :) grep's -o option is what I was looking for.
You can however do this easily with sed like this:
$ cat mkt.sh
# Mask every non-digit, turn each exact run of 10 digits into the word "num",
# then delete any leftover digits and dots; wc -w counts the surviving words.
sed -r -e 's/[^0-9]/./g' -e 's/[0-9]{10}/num /g' -e 's/[0-9.]//g' "$1"
$ for i in *.txt; do echo --- $i; cat $i; echo --- number count; ./mkt.sh $i|wc -w; done
--- 1.txt
This is some text. 0000000001
--- number count
1
--- 2.txt
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
--- number count
3
--- 3.txt
00000000010000000010000000000100000000010000000001
--- number count
5
--- 4.txt
1 2 3 4 5 6 6 7 9 0
11 22 33 44 55 66 77 88 99 00
123456789 0
--- number count
0
--- 5.txt
1.2.3.4.123
1234567890.123-AbceCMA-5553///q/\1231231230
--- number count
2
$

This might work for you:
cat <<! >test.txt
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
00000000010000000010000000000100000000010000000001
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 0 j
12345 67890 12 34 56 78 90
!
sed 'y/X/ /;s/[0-9]\{10\}/\nX\n/g' test.txt | sed '/X/!d' | sed '$=;d'
8
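A rough reading of that pipeline (assuming GNU sed, since the replacement text uses \n):
sed 'y/X/ /;s/[0-9]\{10\}/\nX\n/g' test.txt |  # neutralize any pre-existing X's, then put an X on its own line for each 10-digit run
  sed '/X/!d' |                                # keep only the X marker lines
  sed '$=;d'                                   # $= prints the last line's number (the marker count); d suppresses the lines themselves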

"I need to count the total number of instances in which a 10-digit number appears within a file. All of the numbers have leading zeros"
So I think this might be more accurate:
$ grep -o '0[0-9]\{9\}' filename | wc -l
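The difference shows up on a 10-digit run that does not start with a zero (a quick check, assuming GNU grep):
$ printf '%s\n' 1234567890 | grep -o '[0-9]\{10\}' | wc -l
1
$ printf '%s\n' 1234567890 | grep -o '0[0-9]\{9\}' | wc -l
0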

Related

Count overlapping occurrences of a substring *in a very large file* using Bash

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.
Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?
For example,
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
must return 6 occurrences for aaa: one in aaat, two overlapping in aaaa, and three overlapping in aaaaa. (Except there are maybe 10 million more lines of this.)
Find 6 overlapping substrings aaa in the string
line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
You don't want to see the strings, you want to count them.
When you try
# wrong
grep -o -F "aaa" <<< "${line}" | wc -l
you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how do we handle ${line}?
Start with
grep -Eo "a{3,}" <<< "${line}"
Result
aaa
aaaa
aaaaa
How many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa.
Compare the total count of characters with the number of lines (wc):
match    lines    chars    add_to_total
aaa        1        4          1
aaaa       1        5          2
aaaaa      1        6          3
For each line, subtract 3 from that line's character count: wc counts the trailing newline, so a run of length L shows up as L+1 characters, and such a run contains L-2 overlapping matches (L+1-3 = L-2).
When the result has 3 lines and 15 characters, calculate
15 characters - (3 lines * 3 characters) = 15 - 9 = 6
In code:
# lines = number of maximal runs of 3+ a's, chars = their total length, newlines included
read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
Or for a file
read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
aaa was "easy"; how about other search strings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are
abcdabc
abcdabcdabc
abcdabcdabcdabc
Use this test line:
line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"
you need "abc(dabc)+" and have
match              lines    chars    add_to_total
abcdabcdabcdabc      1        16         3
abcdabcdabcdabc      1        16         3
abcdabc              1         8         1
abcdabcdabc          1        12         2
Each match of "abc(dabc)+" with n repetitions of dabc is 4n+3 characters long, 4n+4 counting the newline, and contains n overlapping copies of abcdabc. So every line contributes chars/4 - 1 to the total, i.e. total = chars/4 - lines. When the result has 4 lines and 52 characters, calculate
52 characters / 4 = 13; 13 - 4 lines = 9
In code:
read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"
When you have a large file, you might want to split it first.
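One way to apply this to a huge file (a hedged sketch: split is GNU coreutils, -C keeps lines intact, and genome.txt is a placeholder name; safe here because a match never spans a line break):
split -C 1G genome.txt piece_
total=0
for f in piece_*; do
  read -r lines chars < <(grep -Eo "a{3,}" "$f" | wc -lc)
  (( total += chars - 3 * lines ))
done
echo "Substring count: $total"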
I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):
Use the summation method :
# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
${input……} | WHINY_USERS=1 {m,g}awk '
BEGIN {
    FS = "[^a]+(aa?[^a]+)*"
    OFS = "|"
    PROCINFO["sorted_in"] = "#ind_str_asc"
} {
    _ = ""
    OFS = "|"
    gsub("^[|]*|[|]*$", _, $!(NF=NF))
    split(_, __)
    split($-_, ___, "[|]+")
    for (_ in ___) {
        __[___[_]]++
    }
    _____ = ____ = _<_
    OFS = "\t"
    print " -- line # "(NR)
    for (_ in __) {
        print sprintf(" %20s", _), __[_], \
            ______ = __[_] * (length(_)-2), \
            "| "(____ += __[_]), _____ += ______
    }
    print "" }'
Output:
-- line # 1
aaa 3 3 | 3 3
aaaa 2 4 | 5 7
aaaaa 3 9 | 8 16
aaaaaaaaaaaaaaa 1 13 | 9 29
-- line # 2
aaa 1 1 | 1 1
aaaa 1 2 | 2 3
aaaaa 1 3 | 3 6
Print out all the copies of that substring :
{m,g}awk '{
    printf("%s%.*s", ____ = $(_=_<_), _, NF=NF)
    do {
        _ += gsub(__, _____)
    } while (index($+__, __))
    if (_) {
        ____ = substr(____, -_<_, _)
        gsub(".", (":")__, ____)
        print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
    } else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'
Output:
aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
aaa:aaa:aaa:aaa:aaa:aaa
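For comparison, a plainer awk take on the same summation idea, written with readable names - a sketch, not the one-liner above:
awk '{
    s = $0
    while (match(s, /aaa+/)) {          # each match is a maximal run of 3 or more consecutive a characters
        total += RLENGTH - 2            # a run of length L holds L-2 overlapping "aaa"
        s = substr(s, RSTART + RLENGTH) # continue scanning after the run
    }
}
END { print total }' file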

Is there a way to print lines from a file from n to m and than reverse their positions?

I'm trying to print text from line 10 to 20 and then reverse their positions.
I've tried this:
sed '10!G;h;$!d' file.txt
But it only prints from 10 to end of the file. Is there any way to stop it at line 20 by using only one sed command?
Almost there; you just need to replace $!d with the 'until' line number:
sed -n '10,20p' tst.txt
# Prints lines 10 <--> 20
sed -n '10!G;h;20p' tst.txt
# Prints REVERSE lines 10 <--> 20
output:
20
19
18
17
16
15
14
13
12
11
10
tst.txt:
1
2
3
4
...
19
20
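For the curious, the same command spread out with comments (GNU sed accepts inline comments; the hold space accumulates the range back to front):
sed -n '
  10!G  # on every line except 10, append the hold space (the lines gathered so far)
  h     # save the grown block back into the hold space
  20p   # by line 20 the pattern space holds lines 20 down to 10, so print it
' tst.txt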
You can use this to print a range of lines:
sed -n -e 10,20p file.txt | tac
tac will reverse the order of the lines
And for those of you without tac (like those mac users out there):
sed -n -e 10,20p file.txt | tail -r
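If you'd rather avoid the reverse step entirely, a small awk sketch that buffers just the range:
awk 'NR >= 10 && NR <= 20 { a[++n] = $0 }
     NR > 20 { exit }                            # stop reading once past the range
     END { for (i = n; i >= 1; i--) print a[i] }' file.txt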

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one extra character for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, even though it isn't supposed to.
Is there any solution? Or something I'm doing wrong?
I'm gonna venture a guess: isn't your awk expecting "U*X"-style newlines (LF), while your dico.txt has Windows-style ones (CR+LF)? That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
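If that is the cause, one possible fix is to strip the carriage return before measuring (dos2unix or fromdos on the file itself works too):
$ awk '{ sub(/\r$/, ""); print length }' dico.txt | sort -nr | uniq -c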

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by column. The shell script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line; do
    echo $(printf "%d\n" $line | sort -n) | tr ' ' '\t' > sorted.txt
    ....
    # I perform the stats calculations
    # for each row by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you want to analyze by columns, you'll need the number of columns (cols) first. head -n 1 gives you the first row, and awk's NF counts the number of fields, giving us the number of columns:
cols=$(head -n 1 test.txt | awk '{print NF}')
Then you can use cut with the tab delimiter to grab every column from input.txt and run it through sort -n, as you did in your original post:
$ for i in $(seq 1 "$cols"); do cut -f"$i" -d$'\t' input.txt; done | sort -n > output.txt
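Since the assignment forbids awk and friends, here is a hedged pure-bash sketch of the same column grab (only sort, which the snippets above already rely on, is external; assumes tab- or space-separated input.txt):
col=3                              # which column to pull, 1-indexed
values=()
while read -r -a fields; do        # read -a splits the line into an array on IFS
    values+=( "${fields[col-1]}" )
done < input.txt
printf '%s\n' "${values[@]}" | sort -n > output.txt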
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
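Min, max and sum can also be done in pure bash once output.txt is sorted (a sketch; bash arithmetic is integer-only, so the mean here is truncated):
sum=0; n=0
while read -r v; do
    (( n == 0 )) && min=$v   # first line of the sorted file is the minimum
    max=$v                   # last assignment wins, i.e. the maximum
    (( sum += v, n++ ))
done < output.txt
echo "min=$min max=$max sum=$sum mean=$(( sum / n ))"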

Dividing one file into separate based on line numbers

I have the following test file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
I want to separate it in a way that each file contains the last line of the previous file as the first line. The example would be:
file 1:
1
2
3
4
5
file2:
5
6
7
8
9
file3:
9
10
11
12
13
file4:
13
14
15
16
17
file5:
17
18
19
20
That would make 4 files with 5 lines and 1 file with 4 lines.
As a first step, I tried to test the following commands I wrote to get only the first file, which should contain the first 5 lines. I can't figure out why the awk command in the if statement prints the whole 20 lines instead of the first 5:
d=$(wc test)
a=$(echo $d | cut -f1 -d " ")
lines=$(echo $a/5 | bc -l)
integer=$(echo $lines | cut -f1 -d ".")
for i in $(seq 1 $integer); do
    start=$(echo $i*5 | bc -l)
    var=$((var+=1))
    echo start $start
    echo $var
    if [[ $var = 1 ]]; then
        awk 'NR<=$start' test
    fi
done
Thanks!
Why not just use the split utility available in your POSIX toolkit? It has an option to split on the number of lines, which you can give as 5:
split -l 5 input-file
From the man split page,
-l, --lines=NUMBER
put NUMBER lines/records per output file
Note that, -l is POSIX compliant also.
$ ls
$
$ seq 20 | awk 'NR%4==1{ if (out) { print > out; close(out) } out="file"++c } {print > out}'
$
$ ls
file1 file2 file3 file4 file5
$ cat file1
1
2
3
4
5
$ cat file2
5
6
7
8
9
$ cat file3
9
10
11
12
13
$ cat file4
13
14
15
16
17
$ cat file5
17
18
19
20
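The one-liner unpacked, with comments (functionally the same program):
seq 20 | awk '
NR % 4 == 1 {           # lines 1, 5, 9, ... each start a new chunk
    if (out) {
        print > out     # repeat the boundary line at the end of the old file
        close(out)      # close it so we do not accumulate open file handles
    }
    out = "file" ++c    # file1, file2, ...
}
{ print > out }         # every line goes to the current file
'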
If you're ever tempted to use a shell loop to manipulate text again, make sure to read https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice first to understand at least some of the reasons to use awk instead. To learn awk, get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Oh, and as to why your awk command awk 'NR<=$start' test didn't work - awk is not shell, it has no more access to shell variables (or vice-versa) than a C program does. To init an awk variable named awkstart with the value of a shell variable named start and then use that awk variable in your script you'd do awk -v awkstart="$start" 'NR<=awkstart' test. The awk variable can also be named start or anything else sensible - it is completely unrelated to the name of the shell variable.
You could improve your code by removing the unnecessary echo, cut and bc calls and doing it like this:
#!/bin/bash
for i in $(seq $(wc -l < test) ); do
    (( i % 4 != 1 )) && continue
    tail -n +"$i" test | head -5 > "file$(( 1 + i/4 ))"
done
But still the awk solution is much better. Reading the file only once and taking actions based on readily available information (like the linenumber) is the way to go. In shell you have to count the lines, there is no way around it. awk will give you that (and a lot of other things) for free.
Use split:
$ seq 20 | split -l 5
$ for fn in x*; do echo "$fn"; cat "$fn"; done
xaa
1
2
3
4
5
xab
6
7
8
9
10
xac
11
12
13
14
15
xad
16
17
18
19
20
Or, if you have a file:
$ split -l 5 test_file
