shell: What the means of number of sentence - shell

I need to count number of sentences and paragraphs but I do not understand how to do this from a text file.
I can count the number of lines and words using the wc command but I do not understand the meaning of sentence and paragraph in text file. Is there any command in shell do this?
Here's how we count number of words and lines in a text file:
wc -w filename
wc -l filename
For sentences and paragraphs, here is what I tried:
grep -c \\. #to count number of sentences.
grep -o [.'\n'] #to count number of paragraph.
I do not understand how to count number of sentences and paragraphs in a text file.
Any ideas will be helpful.
for example:
Main article: SSID#Security of SSID hiding.
A simple but ineffective method to attempt to secure a wireless network is SSID (Service Set Identifier).[12][13] This provides very little protection against anything but the most casual intrusion efforts...
2 paragraph,and 3 sentence.

A first approximation can be obtained under the assumptions that:
Sentences end with a period and periods are only used for that (no
decimal numbers, no ellipsis, etc.)
Paragraphs are separated with exactly one empty line
(Of course those are not met in reality but it should get you started)
grep -oc \\.
will count the number of sentences, and
grep -c "^$"
will count the number of paragraphs. If your text is strongly formatted you may get to something that works, otherwise, you could consider using Natural Language Processing tools such as NLTK.

To count the number of sentences, you could count the number of peroids, question marks, and exclamation points. But then you run into the problem of an ellipsis (...). I suppose you could only count it if it has whitespace afterwards.
Paragraphs are another matter. Are they indented? How, with a tab? Then count them.
The big question is 'What is the delimiter between sentences and paragraphs?'
When you know that, define the delimiter regex, and count how many are in the file using the tool of your choice.

Related

Need help understanding why this string in grep pulls IP addresses rather than this other string

The following statement is from a homework question which I tested out and answered, but I'm just not understanding how come this line behaves the way it does and I want to understand why. I realize why this expression is flawed to find an IP address but I don't fully understand why it behaves the way it does since it seems as if the question mark doesn't actually behave as 0 or 1 times in like it's supposed to.
"user#machine:~$ grep -E '[01]?[0-9][0-9]?' "
To my understanding "[01]?" should look for any number 0-1 as indicated by the brackets while the question mark tells grep to look for zero or one instance only and similar with "[0-9]?". Thing is this line will print an unlimited number of digits far exceeding 3 digits. I ruled out that it was due to the 3rd bracket that didn't have a proceeding question mark since it would still print an unlimited amount of digits if I piped an echo or used a testing .txt file full of numbers.
This above example made me than wonder how to find IP's with grep the correct way. So I found countless examples like the following expression for IPv4 octets:
\.(25[0-5]\|2[0-4][0-9]\|[01][0-9][0-9]\|[0-9][0-9]).\
Is this telling me to look for any number 2-5 anywhere from 0-5 times? 0-5 is too many digits for an octet. Is it telling me to look for any number 0-5 up to 25 times? Again that's way too many digits for an octet. What does \2[0-4][0-9]\ mean in this case? I'm confused about how this expression finds numbers strictly between 1-255?
Look at it this way: x?[0-9]x? matches anything which contains a digit because both the x:es are optional. You might as well leave them out because they do not constrain the match at all.
25[0-5] looks for 25 followed by a digit in the range 0-5. In other words, the expression matches a number in the range 250-255.
The full expression in your example looks for a number in the range 00-255 by enumerating strings beginning with 25, 20-24, etc; though it's incomplete in that it doesn't permit single-digit numbers.
The expression matches a single octet (incompletely), not an entire IP address. Here is a common way to match an IPv4 address:
([3-9][0-9]?|2([0-4][0-9]?|5[0-9]?|[6-9])?|1([0-9][0-9]?)?)(\.([3-9][0-9]?|2([0-4][0-9]?|5[0-9]?|[6-9])?|1([0-9][0-9]?)?){3}
where the square brackets express character classes which match a single character out of a set, and the final curly braces {3} express a repetition.
Some regex dialects (e.g. POSIX grep) require backslashes before | and \( but I have used the extended notation (a la grep -E and most online regex exploration tools) which doesn't want backslashes.

How to determine the number of grouped numbers in a string in bash

I have a string in bash
string="123 abc 456"
Where numbers that are grouped together are considered 1 number.
"123" and "456" would be considered numbers in this case.
How can i determine the number of grouped together numbers?
so
"123"
is determined to be a string with just one number, and
"123 abc 456"
is determined to be a string with 2 numbers.
egrep -o '[0-9]+' <<<"$string" | wc -l
Explanation
egrep: This performs an extended regular expression match on the lines of a given file (or, in this case, a herestring). It usually returns lines of text within the string that contain at least one chunk of text that matches the supplied pattern. However, the -o flag tells it to return only those matching chunks, one per line of output.
'[0-9]+': This is the regular expression that the string is compared against. Here, we are telling it to match successive runs of 1 or more digits, and no other character.
<<< The herestring operator allows us to pass a string into a command as if were the contents of a file.
| This pipes the output of the previous command (egrep) to become the input for the next command (wc).
wc: This performs a word count, normally returning the number of words in a given argument. However, the -l tells it to do a line count instead.
UPDATE: 2018-08-23
Is there any way to adapt your solution to work with floats?
The regular expression that matches both integer numbers and floating point decimal numbers would be something like this: '[0-9]*\.?[0-9]+'. Inserting this into the command above in place of its predecessor, forms this command chain:
egrep -o '[0-9]*\.?[0-9]+' <<<"$string" | wc -l
Focussing now only on the regular expression, here's how it works:
[0-9]: This matches any single digit from 0 to 9.
*: This is an operator that applies to the expression that comes directly before it, i.e. the [0-9] character class. It tells the search engine to match any number of occurrences of the digits 0 to 9 instead of just one, but no other character. Therefore, it will match "2", "26", "4839583", ... but it will not match "9.99" as a singular entity (but will, of course, match the "9" and the "99" that feature within it). As the * operator matches any number of successive digits, this can include zero occurrences (this will become relevant later).
\.: This matches a singular occurrence of a period (or decimal point), ".". The backslash is a special character that tells the search engine to interpret the period as a literal period, because this character itself has special function in regular expression strings, acting as a wildcard to match any character except a line-break. Without the backslash, that's what it would do, which would potentially match "28s" if it came across it, where the "s" was caught by the wildcard period. However, the backslash removes the wildcard functionality, so it will now only match with an actual period.
?: Another operator, like the *, except this one tells the search engine to match the previous expression either zero or one times, but no more. In other words, it makes the decimal point optional.
[0-9]+: As before, this will match digits 0 to 9, the number of which here is determined by the + operator, which standards for at least one, i.e. one or more digits.
Applying this to the following string:
"The value of pi is approximately 3.14159. The value of e is about 2.71828. The Golden Ratio is approximately 1.61803, which can be expressed as (√5 + 1)/2."
yields the following matches (one per line):
3.14159
2.71828
1.61803
5
1
2
And when this is piped through the wc -l command, returns a count of the lines, which is 6, i.e. the supplied string contains 6 occurrences of number strings, which includes integers and floating point decimals.
If you wanted only the floating point decimals, and to exclude the integers, the regular expression is this:
'[0-9]*\.[0-9]+'
If you look carefully, it's identical to the previous regular expression, except for the missing ? operator. If you recall, the ? made the decimal point an optional feature to match; removing this operator now means the decimal point must be present. Likewise, the + operator is matching at least one instance of a digit following the decimal point. However, the * operator before it matches any number of digits, including zero digits. Therefore, "0.61803" would be a valid match (if it were present in the string, which it isn't), and ".33333" would also be a valid match, since the digits before the decimal point needn't be there thanks to the * operator. However, whilst "1.1111" could be a valid match, "1111." would not be, because + operator dictates that there must be at least one digit following the decimal point.
Putting it into the command chain:
egrep -o '[0-9]*\.[0-9]+' <<<"$string" | wc -l
returns a value of 3, for the three floating point decimals occurring in the string, which, if you remove the | wc -l portion of the command, you will see in the terminal output as:
3.14159
2.71828
1.61803
For reasons I won't go into, matching integers exclusively and excluding floating point decimals is harder to accomplish with Perl-flavoured regular expression matching (which egrep is not). However, since you're really only interested in the number of these occurrences, rather than the matches themselves, we can create a regular expression that doesn't need to worry about accurate matching of integers, as long as it produces the same number of matched items. This expression:
'[^.0-9][0-9]+(\.([^0-9]|$)|[^.])'
seems to be good enough for counting the integers in the string, which includes the 5, 1 and 2 (ignoring, of course, the √ symbol), returning these approximately matches substrings:
√5
1)
/2.
I haven't tested it that thoroughly, however, and only formulated it tonight when I read your comment. But, hopefully, you are beginning to get a rough sense of what's going on.
In case you need to know the number of grouped digits in string then following may help you.
string="123 abc 456"
echo "$string" | awk '{print gsub(/[0-9]+/,"")}'
Explanation: Adding explanation too here, following is only for explanation purposes.
string="123 abc 456" ##Creating string named string with value of 123 abc 456.
echo "$string" ##Printing value of string here with echo.
| ##Putting its output as input to awk command.
awk '{ ##Initializing awk command here.
print gsub(/[0-9]+/,"") ##printing value of gsub here(where gsub is for substituting the values of all digits in group with ""(NULL)).
it will globally substitute the digits and give its count(how many substitutions happens will be equal to group of digits present).
}' ##Closing awk command here.

Removing blankspace at the start of a line (size of blankspace is not constant)

I am a beginner to using sed. I am trying to use it to edit down a uniq -c result to remove the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises (I have tested it with a simple substitute) the first number but also drags in the prior blankspace for some reason.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trimming will not work as some lines contain double/triple digit numbers and so do not have the same amount of blankspace before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks. There's no other way than to strip it yourself.

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
I thought about a quick hack since Chinese characters are 3 bytes long in UTF8:
(pseudocode)
for each character:
if character (byte) begins with 1:
add 1 to total chinese chars
if it is a space:
add 1 to total "normal" words
if it is a newline:
break
Then take total chinese chars / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters.) A space between the two languages would be needed to give the correct count.

How to neglect the output of OCR Engine that has no meaning?

Tesseract OCR engine sometimes outputs text that has no meaning, i want to design an algorithm that neglects any text or word that has no meaning, below is some sort of output text that i want to neglect,my simple solution is to count the words in the recognized text that's separated by " " and the text which has too many words will be garbage(Hint: i'm scanning images which at most will contains 40 words) any idea will be helpful,thanks.
wo:>"|axnoA1wvw\
ldflfig
°J!9O‘ !P99W M9N 6 13!-|15!Cl ‘I-/Vl
978 89l9 Z0 3+ 3 'l9.l.
97 999 VLL lLOZ+ 3 9l!q°lN
wo0'|axno/(#|au1e>1e: new;
1=96r2a1ey\1 1uauud0|e/\e(]
|8UJB){ p8UJL|\7'
Divide the output text into words. Divide the words into triples. Count the triple frequencies, and compare to triple frequencies from text of a known-good text corpus (EG all the articles from some mailing list discussing what you intend to OCR, minus the header lines).
When I say "triples", I mean:
whe, hen, i, say, tri, rip, ipl, ple, les, i, mea, ean
...so "i" has a frequency of 2 in this short example, while the others are all frequency 1.
If you do a frequency count of each of these triples for a large document in your intended language, it should become possible to be reasonably accurate in guessing whether a string is in the same language.
Granted, it's heuristic.
I've used a similar approach for detecting English passwords in a password changing program. It worked pretty well, though there's no such thing as a perfect "obvious password rejecter".
Check the words against a dictionary?
Of course, this will have false-positives for things like foreign-phrases or code. The problem in general is intractable (ex. is this code or gibberish? :) ). The only (nearly) perfect method would be to use this as a heuristic to flag certain sections for human review.

Resources