How many times does the letter "N" or a run of it (e.g. "NNNNN") appear in a text file? - shell

I am given a file.txt (text file) with a string of data. Example contents:
abcabccabbabNababbababaaaNNcacbba
abacabababaaNNNbacabaaccabbacacab
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
aaababababababacacacacccbababNbNa
abababbacababaaacccc
I want to find the number of distinct runs of "N" (the letter repeated one or more times) that are present in the file, using Unix commands.
I am unsure which commands to use, even after trying a range of different ones. For example:
$ grep -E -c "(N)+" file.txt
The output must be 6 (grep -c counts matching lines rather than individual matches, so this command gives 4).

One way:
$ sed 's/[^N]\{1,\}/\n/g' file.txt | grep -c N
6
How it works:
Replace all sequences of one or more non-N characters in the input with a newline.
This turns strings like abcabccabbabNababbababaaaNNcacbba into
N
NN
Count the number of lines that contain at least one N (ignoring the empty lines).
Regular-expression-free alternative:
$ tr -sc N ' ' < file.txt | wc -w
6
Uses tr to replace all runs of non-N characters with a single space, and counts the remaining words (which are the N sequences). The -s option may not even be needed, since wc -w ignores the width of the gaps between words.
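To see what that intermediate stream looks like, running just the tr stage on the sample file should give a single line (leading and trailing spaces included):
$ tr -sc N ' ' < file.txt
 N NN NNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN N N
wc -w then counts the six space-separated words.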

Using GNU awk (well, I just tested with gawk, mawk, busybox awk and awk version 20121220, and it seemed to work with all of them):
$ gawk -v RS="^$" -F"N+" '{print NF-1}' file
6
It reads in the whole file as a single record, uses the regex N+ as the field separator and outputs the field count minus one. For other awks:
$ awk -v RS="" -F"N+" '{c+=NF-1}END{print c}' file
It reads the input in empty-line-separated blocks (one record per block), and counts and sums the fields.
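To see why NF-1 counts the runs, watch awk split a small sample string on that separator:
$ echo 'abNNcdNef' | awk -F"N+" '{print NF}'
3
Two runs of N produce three fields, so NF-1 is the number of runs in the record.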

Here is an awk that should work on most systems.
awk -F'N+' '{a+=NF-1} END {print a}' file
6
It splits each line on runs of one or more Ns and then counts the number of fields minus 1 per line.
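Leading or trailing runs produce empty fields, so the arithmetic still works out:
$ echo 'NNabcN' | awk -F'N+' '{print NF-1}'
2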

If you have a text file, and you want to count the number of times a run of the letter N appears, you can do:
awk '{a+=gsub(/N+/,"")}END{print a}' file
This, however, will count a sequence that is split over multiple lines as two separate sequences. Example:
abcNNN
NNefg
If you want this to be counted as a single sequence, you should do:
awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' file
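Comparing the two on the example above:
$ printf 'abcNNN\nNNefg\n' | awk '{a+=gsub(/N+/,"")}END{print a}'
2
$ printf 'abcNNN\nNNefg\n' | awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}'
1
The second version joins the lines before counting, so the run spanning the line break is counted once.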

Related

Making bash output a certain word from a .txt file

I have a question on Bash:
Like the title says, I require bash to output a certain word, depending on where it is in the file. In my explicit example I have a simple .txt file.
I already found out that you can count the number of words within a file with the command:
wc -w < myFile.txt
An output example would be:
78501
There certainly is also a way to make cat (or some other command) show only word number x. Something like:
cat myFile.txt | wordno. 3125
desired-word
Note that I will welcome any command that gets this done, not only cat.
Alternatively or in addition, I would be happy to know how you can make certain characters in a file show, based on their place in it. Something like:
cat myFile.txt | characterno. 2342
desired-character
I already know how you can achieve this with a variable:
a="hello, how are you"
echo ${a:9:1}
w
The only problem is that a variable can only be so long. If it were as long as a whole .txt file, this wouldn't work.
I look forward to your answers!
You could use awk for this job: it splits the input at spaces and prints the field selected by wordnumber. tr is used to turn newlines into spaces first (deleting them outright would glue the last word of one line to the first word of the next):
cat myFile.txt | tr '\n' ' ' | awk -v wordnumber=5 '{ print $wordnumber }'
And if you want, for example, the 5th character, you could do it like so (note that -c counts bytes):
head -c 5 myFile.txt | tail -c 1
Since you have NOT shown samples of Input_file or expected output, I couldn't test this, but you could simply do it with awk as follows:
awk 'FNR==1{print substr($0,2342,1);next}' Input_file
Here FNR==1 tells awk to look only at the 1st line, and in substr we tell awk to take 1 character starting at position 2342; you could increase that length or keep it as per your need.
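For instance, substr takes a 1-based start position and a length:
$ echo 'hello' | awk 'FNR==1{print substr($0,2,3)}'
ell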
With gawk:
awk 'BEGIN{RS="[[:space:]]+"} NR==12345' file
or
gawk 'NR==12345' RS="[[:space:]]+" file
I'm setting the record separator to a sequence of whitespace characters, which includes newlines, and then printing the 12345th record.
To improve the average performance you can exit the script once the match is found:
gawk 'BEGIN{RS="[[:space:]]+"}NR==12345{print;exit}' file
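For example, picking the 4th whitespace-separated word of a small input:
$ printf 'one two\nthree four five\n' | gawk 'BEGIN{RS="[[:space:]]+"} NR==4{print;exit}'
four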

grep command to know whether two strings are in a specific order

I was trying to write a shell script to check if two strings are present in a file, also, I'm checking if they are in specific order.
Let's say the file.txt has the following text:
bcd
def
abc
I'm using the command: grep -q abc file.txt && grep -l bcd file.txt
This gives the output file.txt when the two strings are present in any order. I'd like to get the output only if abc comes before bcd. Please help me with this.
With grep PCRE option:
grep -Pzl 'abc[\s\S]*bcd' file.txt
-z - treat input and output data as sequences of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.
If PCRE (-P option) is not supported on your side:
grep -zl 'abc.*bcd' file.txt
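A quick check with the sample file (where abc comes after bcd) and a reordered scratch copy (ordered.txt is just for illustration):
$ grep -zl 'abc.*bcd' file.txt        # no output: abc comes after bcd
$ printf 'abc\ndef\nbcd\n' > ordered.txt
$ grep -zl 'abc.*bcd' ordered.txt
ordered.txt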
You can use awk instead of grep to print the file name only if bcd appears after abc:
awk '/abc/{p=NR} p && /bcd/{print FILENAME; exit}' file
awk -v RS='' '/abc.*bcd/{print FILENAME}' file.txt
You may re-assign RS (the record separator) from the default '\n' to '', so the whole file is processed as one record. Then it's no problem to use /abc.*bcd/ to check whether abc comes before bcd.
Note that this fails if there is an empty line in between: RS='' splits records on empty lines, so abc and bcd would land in different records and the pattern would not match.
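Demonstrating that caveat with an empty line between the two strings:
$ printf 'abc\n\nbcd\n' | awk -v RS='' '/abc.*bcd/{print "found"}'
This prints nothing, because abc and bcd end up in different records.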

Searching a file (grep/awk) for 2 carriage return/line-feed characters

I'm trying to write a script that'll simply count the occurrences of \r\n\r\n in a file. (Opening the sample file in vim binary mode shows me the ^M character in the proper places, and the newline is still read as a newline).
Anyway, I know there are tons of solutions, but they don't seem to get me what I want.
e.g. awk -e '/\r/,/\r/!d' or using $'\n' as part of the grep statement.
However, none of these seem to produce what I need. I can't find the \r\n\r\n pattern with grep's "trick", since that just expands one variable. The awk solution is greedy, and so gets me way more lines than I want/need.
Switching grep to binary/Perl/no-newline mode seems to be closer to what I want,
e.g. grep -UPzo '\x0D', but really what I want then is grep -UPzo '\x0D\x00\x0D\x00', which doesn't produce the output I want.
It seems like such a simple task.
By default, awk treats \n as the record separator. That makes it very hard to count \r\n\r\n. If we choose some other record separator, say a letter, then we can easily count the appearance of this combination. Thus:
awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a' file
Here, gsub returns the number of substitutions made. These are summed and, after the end of the file has been reached, we print the total number.
Example
Here, we use bash's $'...' construct to explicitly add carriage returns and linefeeds:
$ echo -n $'\r\n\r\n\r\n\r\na' | awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a'
2
Alternate solution (GNU awk)
We can tell it to treat \r\n\r\n as the record separator and then return the count (minus 1) of the number of records:
cat file <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
In awk, RS is the record separator and NR is the count of the number of records. Since we are using a multiple-character record separator, this requires GNU awk.
If the file ends with \r\n\r\n, the above would be off by one, because awk does not create an empty record after a trailing separator. To avoid that, the echo 1 is used to ensure that there is always at least one character after the last \r\n\r\n in the file.
Examples
Here, we use bash's $'...' construct to explicitly add carriage returns and linefeeds:
$ echo -n $'abc\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'abc\r\n\r\ndef' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'\r\n\r\n\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
$ echo -n $'1\r\n\r\n2\r\n\r\n3' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2

How to sort, uniq and display lines that appear more than X times

I have a file like this:
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.1
80.13.178.3
80.13.178.3
80.13.178.3
80.13.178.4
80.13.178.4
80.13.178.7
I need to display unique entries for repeated lines (similar to uniq -d), but only entries that occur more than twice (twice being just an example; I need the flexibility to define the lower limit).
Output for this example should be like this when looking for entries with three or more occurrences:
80.13.178.2
80.13.178.3
Feed the output from uniq -cd to awk
sort test.file | uniq -cd | awk -v limit=2 '$1 > limit{print $2}'
With pure awk:
awk '{a[$0]++}END{for(i in a){if(a[i] > 2){print i}}}' a.txt
It iterates over the file and counts the occurrences of every IP. At the end of the file it outputs every IP which occurs more than 2 times.
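If you want the limit configurable here as well, the threshold can be passed in the same way as in the uniq -cd answer (a small variation, not part of the original answer):
awk -v limit=2 '{a[$0]++} END {for (i in a) if (a[i] > limit) print i}' a.txt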

Grep penultimate line

Like the title says, how can I filter with grep (or similar bash tool) the line-before-the-last-line of a (variable length) file?
That is, show everything EXCEPT the penultimate line.
Thanks
You can use a combination of head and tail like this for example:
$ cat input
one
two
three
HIDDEN
four
$ head -n -2 input ; tail -n 1 input
one
two
three
four
From the coreutils head documentation:
‘-n k’
‘--lines=k’
Output the first k lines. However, if k starts with a ‘-’, print all but the last k lines of each file. Size multiplier suffixes are the same as with the -c option.
So the head -n -2 part prints all but the last two lines of its input, i.e. it strips the last two lines; tail -n 1 then adds the final line back, so only the penultimate line is missing.
This is unfortunately not portable. (POSIX does not allow negative values in the -n parameter.)
grep is the wrong tool for this. You can wing it with something like
# Get line count
count=$(wc -l <file)
# Subtract one
penultimate=$(expr $count - 1)
# Delete that line, i.e. print all other lines.
# This doesn't modify the file, just prints
# the requested lines to standard output.
sed "${penultimate}d" file
Bash has built-in arithmetic operators which are more elegant than expr; but expr is portable to other shells.
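For example, with bash's built-in arithmetic the subtraction needs no external command:
penultimate=$(( count - 1 ))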
You could also do this in pure sed but I don't want to think about it. In Perl or awk, it would be easy to print the previous line and then at EOF print the final line (sketched below).
Edit: I thought about sed after all.
sed -n '$!x;1!p' file
In more detail: unless we are at the last line ($), exchange the pattern space and the hold space (remember the current line; retrieve the previous line, if any). Then, unless this is the first line, print whatever is now in the pattern space (the previous line, except when we are on the last line).
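As a sketch of the awk idea mentioned above (print each line with a two-line delay, then flush the last line at end of input; assumes a non-empty file):
awk 'NR > 2 { print buf1 }       # print lines two behind the current one
     { buf1 = buf2; buf2 = $0 }  # slide a two-line window
     END { print buf2 }' file    # print the last line; buf1, the penultimate, is dropped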
awk one-liner (tested with seq 10):
kent$ seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)if(i!=NR-1)print a[i]}'
1
2
3
4
5
6
7
8
10
Using ed:
printf '%s\n' H '$-1d' wq | ed -s file # in-place file edit
printf '%s\n' H '$-1d' ',p' Q | ed -s file # write to stdout (Q quits without saving the deletion)
