Removing blank space at the start of a line (amount of blank space is not constant) - bash

I am a beginner with sed. I am trying to use it to edit down a uniq -c result to remove the spaces before the numbers, so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises the first number (I have tested this with a simple substitution), but for some reason the leading blank space is carried through.
To clarify, I would like to remove all spaces before the numbers; hardcoding a fixed-width trim will not work, as some lines contain double- or triple-digit numbers and so do not have the same amount of blank space before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.

It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks. There's no other way than to strip it yourself.
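That said, the stripping and the TSV conversion can be done in one pass. A minimal sketch, assuming GNU sed (BSD sed wants a literal tab instead of \t) and that comp-c.csv already holds the uniq -c output:
sed -E 's/^ *([0-9]+) /\1\t/' comp-c.csv > comp-c.tsv
This removes the leading blanks and turns the single space after the count into a tab, giving two columns: count and description.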

Related

Solved: Grep and Dynamically Truncate at Same Time

Given the following:
for z in ...   # condition which changes $z
do
    aptitude show "$z" | grep -E 'Uncompressed Size: |x' | sed 's/Uncompressed Size: //'
done
That means 3 items are output to the screen ($z, the Uncompressed Size, and x).
I want all of that to fit on one line, and I deem a line to be 100 characters.
So ($z, Uncompressed Size, x) must fit on one line. But x is very long and will have to be truncated. So there is a requirement to add up the characters "used" by $z and the Uncompressed Size, so that x can be truncated dynamically. I love scripting, and being able to do this I deem an absolute must. Needless to say, all 3 items being output change, hence the characters of the first two outputs must be calculated and subtracted from the characters allowed for x, and the sum of characters across all 3 items cannot exceed 100.
sed 's/.//5g'
Lmao, sometimes I wish I thought in simpler terms; complicated description + simple solution = a simple problem overcomplicated by the interpreter.
Thank you, Barmar
That only leaves sed to handle the remaining width (100 minus the number of characters used by $z, which is ${#z}).
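Putting those two pieces together, a sketch of the dynamic truncation might look like this; the variables z and rest are placeholders I've made up, and the numeric 'Ng' flag in sed (delete from the Nth match onwards) is GNU sed-specific:
z=somepackage                                   # placeholder package name
rest="Uncompressed Size: 1,234 k   a very long field that needs trimming"
keep=$(( 100 - ${#z} - 1 ))                     # characters left after $z and one space
printf '%s %s\n' "$z" "$(printf '%s' "$rest" | sed "s/.//$((keep + 1))g")"
The arithmetic expansion works out how many characters remain after $z and the separating space, and sed deletes everything from position keep+1 onwards, so the printed line never exceeds 100 characters.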

Need help understanding why this string in grep pulls IP addresses rather than this other string

The following statement is from a homework question which I tested out and answered, but I'm just not understanding why this line behaves the way it does, and I want to understand why. I realize why this expression is flawed for finding an IP address, but I don't fully understand why it behaves the way it does, since it seems as if the question mark doesn't actually behave as "zero or one times" like it's supposed to.
"user#machine:~$ grep -E '[01]?[0-9][0-9]?' "
To my understanding, "[01]?" should look for any number 0-1 as indicated by the brackets, while the question mark tells grep to look for zero or one instance only, and similarly for "[0-9]?". The thing is, this line will print an unlimited number of digits, far exceeding 3 digits. I ruled out that it was due to the third bracket expression not having a question mark after it, since it would still print an unlimited amount of digits if I piped in an echo or used a test .txt file full of numbers.
The example above then made me wonder how to find IPs with grep the correct way. So I found countless examples like the following expression for IPv4 octets:
\.(25[0-5]\|2[0-4][0-9]\|[01][0-9][0-9]\|[0-9][0-9]).\
Is this telling me to look for any number 2-5 anywhere from 0-5 times? 0-5 is too many digits for an octet. Is it telling me to look for any number 0-5 up to 25 times? Again that's way too many digits for an octet. What does \2[0-4][0-9]\ mean in this case? I'm confused about how this expression finds numbers strictly between 1-255?
Look at it this way: x?[0-9]x? matches anything which contains a digit, because both x's are optional. You might as well leave them out, because they do not constrain the match at all.
25[0-5] looks for 25 followed by a digit in the range 0-5. In other words, the expression matches a number in the range 250-255.
The full expression in your example looks for a number in the range 00-255 by enumerating strings beginning with 25, 20-24, etc; though it's incomplete in that it doesn't permit single-digit numbers.
The expression matches a single octet (incompletely), not an entire IP address. Here is a common way to match an IPv4 address:
([3-9][0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|1([0-9][0-9]?)?)(\.([3-9][0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|1([0-9][0-9]?)?)){3}
where the square brackets express character classes which match a single character out of a set, and the final curly braces {3} express a repetition.
Some regex dialects (e.g. POSIX grep's basic syntax) require backslashes before | and ( but I have used the extended notation (à la grep -E and most online regex exploration tools), which doesn't want the backslashes.
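As a quick sanity check of how such an octet alternation behaves, here is a small test, assuming grep -E together with -x (match whole lines) is available; the octet pattern below is the usual complete form rather than the incomplete expression from the question:
printf '%s\n' 0 7 42 199 249 255 256 999 | grep -Ex '25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]'
It prints 0, 7, 42, 199, 249 and 255 but rejects 256 and 999, because each alternative covers one slice of the 0-255 range.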

How to determine the number of grouped numbers in a string in bash

I have a string in bash
string="123 abc 456"
Where digits that are grouped together are considered one number.
"123" and "456" would be considered the numbers in this case.
How can I determine how many of these grouped numbers there are?
so
"123"
is determined to be a string with just one number, and
"123 abc 456"
is determined to be a string with 2 numbers.
egrep -o '[0-9]+' <<<"$string" | wc -l
Explanation
egrep: This performs an extended regular expression match on the lines of a given file (or, in this case, a herestring). It usually returns lines of text within the string that contain at least one chunk of text that matches the supplied pattern. However, the -o flag tells it to return only those matching chunks, one per line of output.
'[0-9]+': This is the regular expression that the string is compared against. Here, we are telling it to match successive runs of 1 or more digits, and no other character.
<<< The herestring operator allows us to pass a string into a command as if it were the contents of a file.
| This pipes the output of the previous command (egrep) to become the input for the next command (wc).
wc: This performs a word count, normally returning the number of words in a given argument. However, the -l tells it to do a line count instead.
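Putting it together with the string from the question, the intermediate and final outputs look like this:
string="123 abc 456"
egrep -o '[0-9]+' <<<"$string"            # prints 123 and 456, one per line
egrep -o '[0-9]+' <<<"$string" | wc -l    # prints 2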
UPDATE: 2018-08-23
Is there any way to adapt your solution to work with floats?
The regular expression that matches both integer numbers and floating point decimal numbers would be something like this: '[0-9]*\.?[0-9]+'. Inserting this into the command above in place of its predecessor, forms this command chain:
egrep -o '[0-9]*\.?[0-9]+' <<<"$string" | wc -l
Focussing now only on the regular expression, here's how it works:
[0-9]: This matches any single digit from 0 to 9.
*: This is an operator that applies to the expression that comes directly before it, i.e. the [0-9] character class. It tells the search engine to match any number of occurrences of the digits 0 to 9 instead of just one, but no other character. Therefore, it will match "2", "26", "4839583", ... but it will not match "9.99" as a singular entity (but will, of course, match the "9" and the "99" that feature within it). As the * operator matches any number of successive digits, this can include zero occurrences (this will become relevant later).
\.: This matches a singular occurrence of a period (or decimal point), ".". The backslash tells the search engine to interpret the period as a literal period, because this character has a special function in regular expression strings, acting as a wildcard that matches any character except a line break. Without the backslash, that's what it would do, which could potentially match "28s" if it came across it, with the "s" caught by the wildcard period. The backslash removes the wildcard functionality, so it will now only match an actual period.
?: Another operator, like the *, except this one tells the search engine to match the previous expression either zero or one times, but no more. In other words, it makes the decimal point optional.
[0-9]+: As before, this will match digits 0 to 9, the number of which here is determined by the + operator, which stands for at least one, i.e. one or more digits.
Applying this to the following string:
"The value of pi is approximately 3.14159. The value of e is about 2.71828. The Golden Ratio is approximately 1.61803, which can be expressed as (√5 + 1)/2."
yields the following matches (one per line):
3.14159
2.71828
1.61803
5
1
2
And when this is piped through the wc -l command, it returns a count of the lines, which is 6; i.e. the supplied string contains 6 number strings, counting both integers and floating point decimals.
If you wanted only the floating point decimals, and to exclude the integers, the regular expression is this:
'[0-9]*\.[0-9]+'
If you look carefully, it's identical to the previous regular expression, except for the missing ? operator. If you recall, the ? made the decimal point an optional feature to match; removing this operator now means the decimal point must be present. Likewise, the + operator matches at least one digit following the decimal point. However, the * operator before it matches any number of digits, including zero digits. Therefore, "0.61803" would be a valid match (if it were present in the string, which it isn't), and ".33333" would also be a valid match, since the digits before the decimal point needn't be there thanks to the * operator. However, whilst "1.1111" would be a valid match, "1111." would not be, because the + operator dictates that there must be at least one digit following the decimal point.
Putting it into the command chain:
egrep -o '[0-9]*\.[0-9]+' <<<"$string" | wc -l
returns a value of 3, for the three floating point decimals occurring in the string, which, if you remove the | wc -l portion of the command, you will see in the terminal output as:
3.14159
2.71828
1.61803
For reasons I won't go into, matching integers exclusively and excluding floating point decimals is harder to accomplish without Perl-flavoured regular expression features such as lookarounds (which egrep does not offer). However, since you're really only interested in the number of these occurrences, rather than the matches themselves, we can create a regular expression that doesn't need to worry about accurate matching of integers, as long as it produces the same number of matched items. This expression:
'[^.0-9][0-9]+(\.([^0-9]|$)|[^.])'
seems to be good enough for counting the integers in the string, which includes the 5, 1 and 2 (ignoring, of course, the √ symbol), returning these approximately matched substrings:
√5
1)
/2.
I haven't tested it that thoroughly, however, and only formulated it tonight when I read your comment. But, hopefully, you are beginning to get a rough sense of what's going on.
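As an aside, if your grep supports Perl-compatible patterns (GNU grep's -P option), lookarounds make "integers only" more direct. This is just a sketch, and note it treats a digit run followed by a sentence-ending period (like the "/2." above) as a decimal, so it is not an exact substitute for the expression given here:
string="123 abc 456"
grep -oP '(?<![0-9.])[0-9]+(?![0-9.])' <<<"$string" | wc -l    # prints 2
Here "123" and "456" count as integers; any digit run touching a "." is skipped.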
In case you need to know the number of digit groups in a string, then the following may help you.
string="123 abc 456"
echo "$string" | awk '{print gsub(/[0-9]+/,"")}'
Explanation: the following is for explanation purposes only.
string="123 abc 456"       ##Creating a string named string with the value 123 abc 456.
echo "$string"             ##Printing the value of string here with echo.
|                          ##Passing its output as input to the awk command.
awk '{                     ##Starting the awk command here.
print gsub(/[0-9]+/,"")    ##Printing the return value of gsub here (gsub substitutes every group of digits with "" (NULL)).
                           ##It substitutes the digit groups globally and returns the count: the number of substitutions made equals the number of digit groups present.
}'                         ##Closing the awk command here.
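A quick demonstration that gsub's return value is the number of substitutions it made:
string="123 abc 456"
echo "$string" | awk '{n = gsub(/[0-9]+/, ""); print n, "->", $0}'    # prints: 2 ->  abc
The two digit groups are replaced with nothing, and the count of replacements (2) is exactly the number of grouped digits.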

shell: What is meant by the number of sentences?

I need to count the number of sentences and paragraphs in a text file, but I do not understand how to do this.
I can count the number of lines and words using the wc command, but I do not understand how a sentence or a paragraph is defined in a text file. Is there any shell command to do this?
Here's how we count number of words and lines in a text file:
wc -w filename
wc -l filename
For sentences and paragraphs, here is what I tried:
grep -c \\. #to count number of sentences.
grep -o [.'\n'] #to count number of paragraph.
I do not understand how to count number of sentences and paragraphs in a text file.
Any ideas will be helpful.
for example:
Main article: SSID#Security of SSID hiding.
A simple but ineffective method to attempt to secure a wireless network is SSID (Service Set Identifier).[12][13] This provides very little protection against anything but the most casual intrusion efforts...
2 paragraphs and 3 sentences.
A first approximation can be obtained under the assumptions that:
Sentences end with a period, and periods are only used for that (no decimal numbers, no ellipses, etc.)
Paragraphs are separated by exactly one empty line
(Of course those are not met in reality but it should get you started)
grep -o '\.' | wc -l
will count the number of sentences (note that grep -c alone would only count the lines that contain a period, undercounting lines with several sentences), and
grep -c "^$"
will count the number of paragraphs. If your text is strongly formatted you may get to something that works, otherwise, you could consider using Natural Language Processing tools such as NLTK.
To count the number of sentences, you could count the number of periods, question marks, and exclamation points. But then you run into the problem of an ellipsis (...). I suppose you could only count a period if it has whitespace afterwards.
Paragraphs are another matter. Are they indented? How, with a tab? Then count them.
The big question is 'What is the delimiter between sentences and paragraphs?'
When you know that, define the delimiter regex, and count how many are in the file using the tool of your choice.
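For example, under the assumptions that sentences end with ., ? or ! and paragraphs are separated by blank lines (file.txt is just a placeholder name), a rough sketch would be:
grep -o '[.?!]' file.txt | wc -l              # sentence count: one match per terminator
awk 'BEGIN{RS=""} END{print NR}' file.txt     # paragraph count: RS="" is awk's paragraph mode
Both are approximations in exactly the ways discussed above: abbreviations, ellipses and decimal numbers will all inflate the sentence count.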

Sorting lines with numbers and word characters

I recently wrote a simple utility in Perl to count the words in a file and determine their frequency, that is, how many times each appears.
It's all fine, but I'd like to sort the result to make it easier to read. An output example would be:
4:an
2:but
5:does
10:end
2:etc
2:for
As you can see, it's ordered by word, not by frequency. But with a little help from :sort I could reorganize that. Using n, numbers like 10 go to the right place (even though they start with 1); add a little ! and the order gets reversed, so the word that appears most often comes first.
:sort! n
10:end
5:does
4:an
2:for
2:etc
2:but
The problem is: when the number is repeated it gets sorted by word — which is nice — but remember, the order was reversed!
for -> etc -> but
How can I fix that? Will I have to use some Vim scripting to iterate over each line checking whether it starts with the previous number, and marking relevant lines to sort them after the number changes?
tac | sort -nr
does this, so select the lines with shift+V and use !
From the vim :help sort:
The details about sorting depend on the library function used. There is no
guarantee that sorting is "stable" or obeys the current locale. You will have
to try it out.
As a result, you might want to perform the sorting in your Perl script instead; it wouldn't be hard to extend your Perl sort to be stable, see perldoc sort for entirely too many details.
If you just want this problem finished, then you can replace your :sort command with this:
!sort -rn --stable (it might be easiest to use Shift-V to visually select the lines first, or use a range for the sort, or something similar, but if you're writing vim scripts, none of this will be news to you. :)
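Outside of Vim, the same idea is easy to check on the sample data from the question, assuming GNU sort (--stable, or -s, disables the last-resort comparison so that tied lines keep their input order):
printf '%s\n' 4:an 2:but 5:does 10:end 2:etc 2:for | sort -rn --stable
This prints 10:end, 5:does, 4:an, and then 2:but, 2:etc, 2:for: counts in descending order, with the words within each tied count staying in the alphabetical order they arrived in.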
