Number of integers in a file using Command Line Interface - bash

How can I count the number of integers in a file using egrep?
I tried to solve it as a pattern-matching problem. My difficulty is how to represent a continuous run of characters in the range [0-9] that has a space before its beginning and a space or a dot after its end. I think the latter can be handled with \< and \> respectively. The run must also not contain a dot in the middle, otherwise it is not an integer. I am unable to convert this logic into a regular expression with the tools and techniques available to me.
My name is 2322.
33 is my sister.
I am blessed with a son named 55.
Why are you so 69. Is everything 33.
66.88 is not an integer
55whereareyou?
The right answer should be 5 i.e. for 2322, 33, 55, 69 and 33.

grep -Eo '(^| )([0-9]+[\.\?\=\:]?( |$))+' | wc -w
-E               extended regex
-o               extract what was found (print only the matched text)
(^| )            starts at the beginning of a line or with a space
[0-9]+           digits
[\.\?\=\:]?      optional dot, question mark, etc.
( |$)            ends with end of line or a space
(...)+           repeat 1 time or more (to detect integers like "123 456")
wc -w            count words
Note: 123. 123? 123: are also counted as integers
Test:
#!/bin/bash
exec 3<<EOF
My name is 2322.
33 is my sister.
I am blessed with a son named 55.
Why are you so 69. Is everything 33.
66.88 is not an integer
55whereareyou?
two integers 123 456.
how many tables in room 400? 50.
50? oh I thought it was 40.
23: It's late, 23:00 already
EOF
grep -Eo '(^| )([0-9]+[\.\?\=\:]?( |$))+' <&3 | \
tee >(sleep 0.5; echo -n "integer counted: "; wc -w; )
Outputs:
2322.
33
55.
69.
33.
123 456.
400? 50.
50?
40.
23:
integer counted: 12

Based on the observation that you want 66.88 excluded, I'm guessing
grep -Ec '[0-9]\.?( |$)' file
which finds a digit, optionally followed by a dot, followed by either a space or end of line.
The -c option says to report the number of lines which contain a match (so not strictly the number of matches, if there are lines which contain multiple matches) and the -E option enables extended regular expression syntax, i.e. what was traditionally called egrep (though the command name is now obsolescent).
If you need to count matches, the -o option prints each match on a separate line, which you can then pass to wc -l (or in lucky cases combine with grep -c, but check first; this doesn't work e.g. with GNU grep currently).

On my Ubuntu machine this works fine:
grep -P '((^)|(\s+))[-+]?\d+\.?((\s+)|($))' test

Related

Inconsistency in output field separator

We have to find the difference (d) between the last two numbers and display the rows with the highest values of d in ascending order.
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after each $ sign, except before the last column (avg). Please explain why this is happening.
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not specified | as the output field separator, but it still appears. Why is this happening, and why is the difference zero too?
I am just 6 days old in Unix, so please answer even if it's easy.
Your field separator is only the pipe symbol, so the surrounding whitespace is part of each field, and that is what you see in the output. When FS is longer than one character it is treated as a regular expression, in which the pipe has a special meaning (alternation) and needs to be escaped. In your second case, " | " therefore means that "space or space" is the field separator.
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
A slight rewrite eliminates the dependency on the number of fields and fixes the format.

Awk, or similar command, to get the last column and do some action with it in Bash

I'm writing a script to grab the last update/patch date on a few hundred servers. Lacking another tool due to various reasons, I've decided to write a script. At the moment I'm using the following command:
sudo yum history | grep [0-9] | grep -E 'Update|Install|U|I' | awk 'NR==1'
That gives me the first line with an action on it, but only the first line. I toyed with the idea of grabbing the first 5 rows, but that may not be applicable to every situation.
sudo yum history | grep [0-9] | grep -E 'Update|Install|U|I' | awk 'NR>=1&&NR<=5'
So I would like to check the last column or two and if more than x packages have been updated or installed then to grab that row.
Generically speaking the output of yum history is:
18 | first last <username> | 2018-08-30 19:41 | E, I, U | 43 ss
17 | first last <username> | 2018-07-10 15:28 | E, I, U | 230 EE
16 | first last <username> | 2018-04-25 20:08 | E, I, U | 44 ss
15 | first last <username> | 2018-01-30 20:57 | E, I, O, U | 108 EE
14 | first last <username> | 2018-01-30 20:39 | Install | 4
The issue I'm running into is that the last two columns can differ in their column position, and the last column may be purely numeric or may contain letters or special characters. I want to ignore any last column that has any character that is not numeric, and to evaluate whether the last column has more than 20 packages installed or updated. If the last column shows more than 20 packages, I want to grab that row and only that row.
Use a regular expression, matching for the number in the last column. To print all history records with >=20 alterations:
sudo yum history | \
perl -ne '/\| *(\d+)[^\|]*$/ and $1>=20 and print($_)'
Or, if you only want the time stamp from matching history records:
sudo yum history | \
perl -ne '
@col=split(/\|/);
$col[4]=~/^ *(\d+)/ and $1>=20 and print($col[2],"\n")'
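Use awk (the +0 forces a numeric comparison, since the last field may carry trailing letters such as "43 ss"):
sudo yum history | awk -F ' *\\| *' '$4 ~ /\<(Install|Update|U|I)\>/ && $5+0 > 20'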

How to split string using regex to split between +,-,*,/ symbols?

I need to tell Ruby in regex to split before and after the + - * / symbols in my program.
Examples:
I need to turn "1+12" into [1.0, "+", 12.0]
and "6/0.25" into [6.0, "/", 0.25]
There could be cases like "3/0.125" but highly unlikely. If first two I listed above are satisfied it should be good.
On the Ruby docs, "hi mom".split(%r{\s*}) #=> ["h", "i", "m", "o", "m"]
I looked up a cheat-sheet to try to understand %r{\s*}, and I know that the stuff inside %r{} such as \s are skipped and \s means white space in regex.
'1.0+23.7'.scan(/(((\d\.?)+)|[\+\-\*\/])/)
instead of splitting, match with capture groups to parse your inputs:
(?<operand1>(?:\d+(?:\.\d+)?)|(?:\.\d+))\s*(?<operator>[+\/*-])\s*(?<operand2>(?:\d+(?:\.\d+)?)|(?:\.\d+))
explanation:
I've used named groups, (?<groupName>regex), but they aren't necessary and could just be plain ()'s; either way, the sub-captures will still be available as 1, 2, and 3. Also note the (?:regex) constructs, which are for grouping only: they do not "remember" anything and won't mess up your captures.
(?:\d+(?:\.\d+)?)|(?:\.\d+) first number: either leading digit(s) followed optionally by a decimal point and digit(s), OR a leading decimal point followed by digit(s)
\s* zero or more spaces in between
[+\/*-] operator: character class meaning a plus, division sign, minus, or multiply.
\s* zero or more spaces in between
(?:\d+(?:\.\d+)?)|(?:\.\d+) second number: same pattern as first number.
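As a rough sketch of how the pattern explained above might be used from Ruby: the constant name EXPRESSION and the helper parse_expression are made up for illustration, and converting the operands with to_f is an assumption based on the output the question asks for. Note that the pattern is not anchored, so it only looks for the first "operand operator operand" run in the string.

```ruby
# Pattern from the answer above, with named groups for both operands and the operator.
EXPRESSION = /(?<operand1>(?:\d+(?:\.\d+)?)|(?:\.\d+))\s*(?<operator>[+\/*-])\s*(?<operand2>(?:\d+(?:\.\d+)?)|(?:\.\d+))/

# Hypothetical helper: returns [Float, String, Float], or nil when nothing matches.
def parse_expression(text)
  m = EXPRESSION.match(text)
  return nil unless m
  # Named groups can be read by symbol; m[1], m[2], m[3] would work as well.
  [m[:operand1].to_f, m[:operator], m[:operand2].to_f]
end

p parse_expression("1+12")      # => [1.0, "+", 12.0]
p parse_expression("6 / 0.25")  # => [6.0, "/", 0.25]
p parse_expression("hello")     # => nil
```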
I arrived a little late to this party, and found that many of the good answers had already been taken. So, I set out to expand on the theme slightly and compare the performance and robustness of each of the solutions. It seemed like a fun way to entertain myself this morning.
In addition to the 3 examples given in the question, I added test cases for each of the four operators, as well as for some new edge cases. These edge cases included handling of negative numbers and arbitrary spaces between operands, as well as how each of the algorithms handled expected failures.
The answers revolved around 3 methods: split, scan, and match. I also wrote new solutions using each of these 3 methods, specifically respecting the additional edge cases that I added here. I ran all of the algorithms against this full set of test cases, and ended up with a table of pass/fail results.
Next, I created a benchmark that created 1,000,000 test strings that each of the solutions would be able to parse properly, and ran each solution against that sample set.
On first benchmarking, Cary Swoveland's solution performed far better than the others, but didn't pass the added test cases. I made very minor changes to his solution to produce a solution that supported both negative numbers and arbitrary spaces, and included that test as Swoveland+.
The final results printed to the console are here (note: horizontal scroll to see all results):
| Test Case | match | match | scan | scan |partition| split | split | split | split |
| | Gaskill | sweaver | Gaskill | techbio |Swoveland| Gaskill |Swoveland|Swoveland+| Lilue |
|------------------------------------------------------------------------------------------------------|
| "1+12" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "6/0.25" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "3/0.125" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "30-6" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "3*8" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "20--4" | Pass | -- | Pass | -- | Pass | Pass | -- | Pass | Pass |
| "33+-9" | Pass | -- | Pass | -- | Pass | Pass | -- | Pass | Pass |
| "-12*-2" | Pass | -- | Pass | -- | Pass | Pass | -- | Pass | Pass |
| "-72/-3" | Pass | -- | Pass | -- | Pass | Pass | -- | Pass | Pass |
| "34 - 10" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| " 15+ 9" | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "4*6 " | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| "b+0.5" | Pass | Pass | Pass | -- | -- | -- | -- | -- | -- |
| "8---0.5" | Pass | Pass | Pass | -- | -- | -- | -- | -- | -- |
| "8+6+10" | Pass | -- | Pass | -- | -- | -- | -- | -- | -- |
| "15*x" | Pass | Pass | Pass | -- | -- | -- | -- | -- | -- |
| "1.A^ff" | Pass | Pass | Pass | -- | -- | -- | -- | -- | -- |
ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin14]
============================================================
user system total real
match (Gaskill): 4.770000 0.090000 4.860000 ( 5.214996)
match (sweaver2112): 4.640000 0.040000 4.680000 ( 4.911849)
scan (Gaskill): 7.360000 0.080000 7.440000 ( 7.719646)
scan (techbio): 12.930000 0.140000 13.070000 ( 13.791613)
partition (Swoveland): 5.390000 0.050000 5.440000 ( 5.648762)
split (Gaskill): 5.150000 0.100000 5.250000 ( 5.455094)
split (Swoveland): 3.860000 0.060000 3.920000 ( 4.040774)
split (Swoveland+): 4.240000 0.040000 4.280000 ( 4.537570)
split (Lilue): 7.540000 0.090000 7.630000 ( 8.022252)
In order to keep this post from being far too long, I've included the complete code for this test at https://gist.github.com/mgaskill/96f04e7e1f72a86446f4939ac690759a
The robustness test cases can be found in the first table above. The Swoveland+ solution is:
f,op,l = formula.split(/\b\s*([+\/*-])\s*/)
return [f.to_f, op, l.to_f]
Including a \b metacharacter before the operator in the split pattern ensures that the previous character is a word character, which gives support for negative numbers in the second operand. The \s* expressions support arbitrary spaces between operands and operator. These changes incur less than 10% performance overhead for the additional robustness.
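To illustrate what the \b buys, here is a small sketch that runs the same split with and without it. WITH_BOUNDARY is the Swoveland+ pattern quoted above; WITHOUT_BOUNDARY and the helper split_formula are hypothetical names introduced only for this comparison.

```ruby
# WITH_BOUNDARY is the Swoveland+ pattern; WITHOUT_BOUNDARY drops the \b for comparison.
WITH_BOUNDARY    = /\b\s*([+\/*-])\s*/
WITHOUT_BOUNDARY = /\s*([+\/*-])\s*/

# Hypothetical helper: split once and convert both operands to floats.
def split_formula(formula, pattern)
  first, op, last = formula.split(pattern)
  [first.to_f, op, last.to_f]
end

# With \b, the second "-" is not preceded by a word character, so it stays with the operand.
p split_formula("20--4", WITH_BOUNDARY)     # => [20.0, "-", -4.0]
# Without \b, both "-" characters are treated as operators and the second operand is lost.
p split_formula("20--4", WITHOUT_BOUNDARY)  # => [20.0, "-", 0.0]
p split_formula("34 - 10", WITH_BOUNDARY)   # => [34.0, "-", 10.0]
```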
The solutions that I provided are here:
def match_gaskill(formula)
return [] unless (match = formula.match(/^\s*(-?\d+(?:\.\d+)?)\s*([+\/*-])\s*(-?\d+(?:\.\d+)?)\s*$/))
return [match[1].to_f, match[2], match[3].to_f]
end
def scan_gaskill(formula)
return [] unless (match = formula.scan(/^\s*(-?\d+(?:\.\d+)?)\s*([+*\/-])\s*(-?\d+(?:\.\d+)?)\s*$/))[0]
return [match[0][0].to_f, match[0][1], match[0][2].to_f]
end
def split_gaskill(formula)
match = formula.split(/(-?\d+(?:\.\d+)?)\s*([+\/*-])\s*(-?\d+(?:\.\d+)?)/)
return [match[1].to_f, match[2], match[3].to_f]
end
The match and scan solutions are very similar, but perform significantly differently, which is very interesting, because they use the exact same regex to do the work. The split solution is slightly simpler, and only splits on the entire expression, capturing each operand and the operator, separately.
Note that none of the split solutions was able to properly identify failures. Adding this support requires additional parsing of the operands, which significantly increases the overhead of the solution, typically running about 3 times slower.
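The difference in failure handling can be seen in a minimal sketch that puts the anchored match regex and the Swoveland+ split regex from above side by side. The method names parse_with_match and parse_with_split are hypothetical, chosen only for this illustration.

```ruby
# Anchored match, as in match_gaskill above: anything that is not exactly
# "operand operator operand" yields an empty array.
def parse_with_match(formula)
  m = formula.match(/^\s*(-?\d+(?:\.\d+)?)\s*([+\/*-])\s*(-?\d+(?:\.\d+)?)\s*$/)
  return [] unless m
  [m[1].to_f, m[2], m[3].to_f]
end

# Unanchored split, as in the Swoveland+ solution above: extra or invalid input
# is silently dropped or converted to 0.0 by to_f.
def parse_with_split(formula)
  first, op, last = formula.split(/\b\s*([+\/*-])\s*/)
  [first.to_f, op, last.to_f]
end

p parse_with_match("8+6+10")  # => []
p parse_with_split("8+6+10")  # => [8.0, "+", 6.0]
p parse_with_match("15*x")    # => []
p parse_with_split("15*x")    # => [15.0, "*", 0.0]
```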
For both performance and robustness, match is the clear winner. If robustness isn't a concern, but performance is, use split. On the other hand, scan provided complete robustness, but was more than 50% slower than the equivalent match solution.
Also note that using an efficient way to extract the results from the solution into the result array is as important to performance as is the algorithm chosen. The technique of capturing the results array into multiple variables (used in Swoveland) outperformed the map solutions dramatically. Early testing showed that the map extraction solution more than doubled the runtimes of even the highest-performing solutions, hence the exceptionally high runtime numbers for Lilue.
I think this could be useful:
"1.2+3.453".split('+').flat_map{|elem| [elem, "+"]}[0...-1]
# => ["1.2", "+", "3.453"]
"1.2+3.453".split('+').flat_map{|elem| [elem.to_f, "+"]}[0...-1]
# => [1.2, "+", 3.453]
Obviously this works only for +. But you can change the split character.
EDIT:
This version works for every operator:
"1.2+3.453".split(%r{(\+|\-|\/|\*)}).map do |x|
unless x =~ /(\+|\-|\/|\*)/ then x.to_f else x end
end
# => [1.2, "+", 3.453]
R = /
(?<=\d) # match a digit in a positive lookbehind
[^\d\.] # match any character other than a digit or period
/x # free-spacing regex definition mode
def split_it(str)
f,op,l = str.delete(' ').partition(R)
[convert(f), op, convert(l)]
end
def convert(str)
(str =~ /\./) ? str.to_f : str.to_i
end
split_it "1+12"
#=> [1, "+", 12]
split_it "3/ 5.2"
#=> [3, "/", 5.2]
split_it "-4.1 * 6"
#=> [-4.1, "*", 6]
split_it "-8/-2"
#=> [-8, "/", -2]
The regex can of course be written in the conventional way:
R = /(?<=\d)[^\d\.]/

Substitute Date issue/ and unterminated error

I am in need of some help and maybe some knowledge.
I am trying to change all the dates in a .text document from dd/mm/yyyy to dd.mm.yyyy. I am not going to lie, using sed confuses me so much! Can any of you help me?
`# DAY SEP #2Month SEP #3year min2,max4
sed 's/\[0-3]?[0-9\][.\/]\([0-1]*[0-9]\)[-\/.]\([0-9]\{2,4\}\)/\2.\1\3/'`
Here is my error: sed: file Frank_Alvarado_hw2.sed line 5: unterminated `s' command.
Presidency ,President ,Wikipedia Entry,Took office ,Left office ,Party ,Portrait ,Thumbnail,Home State
1,George Washington,http://en.wikipedia.org/wiki/George_Washington,30/04/1789,4/03/1797,Independent ,GeorgeWashington.jpg,thmb_GeorgeWashington.jpg,Virginia
If those escapes are confusing in sed then use:
sed -r 's~([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})~\1.\2.\3~g' file
i.e.
Use of -r option for extended regex
Use of alternate regex delimiters like ~ to avoid escaping / in your pattern
PS: On OSX use sed -E instead of sed -r
sed 's/\[0-3]?[0-9\][.\/]\([0-1]*[0-9]\)[-\/.]\([0-9]\{2,4\}\)/\2.\1\3/'
Reading the command from left to right:
\[               escapes the `[`, so it is a literal character (not a class open)
]?               0 or 1 occurrence of `]`
[ (after the ?)  class open
] (after the \/) class close, so the class is any of `0123456789\\][./`
first \(         start of group 1
first \)         end of group 1
second \(        start of group 2
second \)        end of group 2
\3               3rd group (missing)
So mainly, the replacement refers to a third group that the pattern never defines.

Regexp issue involving reverse polish calculator

I'm trying to use a regular expression to solve a reverse polish calculator problem, but I'm having issues with converting the mathematical expressions into conventional form.
I wrote:
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
which prints:
35 (29-1)(+5) *
expected
(35*((29-1)+5))
but I'm getting a different result. What am I doing wrong?
I'm assuming you meant you tried
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
^ ^
Anyway, you have to use the quantifier + instead of *, since otherwise you will match an empty string for \d* as one of your captures, hence the (+5):
/(\d+) (\d+) (\W)/
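A quick way to see the difference between the two quantifiers is to run both patterns on the input from the question. This is only a sketch; the variable names are made up for illustration. Note that even with \d+, a single gsub pass does not yet produce the expected result.

```ruby
rpn = '35 29 1 - 5 + *'

# \d* may match the empty string, so " 5 +" is treated as a triad with an empty first operand.
with_star = rpn.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
# \d+ requires at least one digit in each operand, so " 5 +" is left alone.
with_plus = rpn.gsub(/(\d+) (\d+) (\W)/, '(\1\3\2)')

puts with_star  # 35 (29-1)(+5) *
puts with_plus  # 35 (29-1) 5 + *
```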
I would further extend/constrain the expression to something like:
/([\d+*\/()-]+)\s+([\d+*\/()-]+)\s+([+*\/-])/
([\d+*\/()-]+)   Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
\s+              Whitespace.
([\d+*\/()-]+)   Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
\s+              Whitespace.
([+*\/-])        Valid operators: +, -, *, and /.
...and instead of using gsub, use sub in a while loop that quits when it detects that no more substitutions can be made. This is very important because otherwise, you will violate the order of operations. For example, take a look at this Rubular demo. You can see that by using gsub, you might potentially replace the second triad of atoms, "5 + *", when really a second iteration should substitute an "earlier" triad after substituting the first triad!
WARNING: The - (minus) character must appear first or last in a character class, since otherwise it will specify a range! (Thanks to @JoshuaCheek.)
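The "sub in a loop" idea described above could be sketched roughly as follows. The constant TRIAD holds the extended pattern from this answer; the method name rpn_to_infix is hypothetical, and the sketch assumes well-formed input made of non-negative integers and the four operators.

```ruby
# Triad pattern from the answer above: atom, whitespace, atom, whitespace, operator.
TRIAD = /([\d+*\/()-]+)\s+([\d+*\/()-]+)\s+([+*\/-])/

# Hypothetical helper: rewrite the leftmost triad, one at a time, until nothing changes.
def rpn_to_infix(rpn)
  expr = rpn.dup
  loop do
    # sub! returns nil when no substitution was made, which ends the loop.
    break unless expr.sub!(TRIAD, '(\1\3\2)')
  end
  expr
end

puts rpn_to_infix('35 29 1 - 5 + *')  # (35*((29-1)+5))
```

Because each pass replaces only the leftmost triad, an already parenthesised sub-expression is available as a single atom on the next pass, which is what preserves the order of operations.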

Resources