Regular expression in bash to match multiple conditions - bash

I would like to implement a regular expression in bash that allows me to verify a series of characteristics on a dataset.
A sample is attached below:
id, date of birth, grade, expulsion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
The id is required to have only 3 digits, the year of birth must be before 2000, the minimum grade is 5.60 with the second decimal place being other than 0, and there must be at least one expulsion or serious misdemeanor.
The result of executing the regular expression should be:
582, 1999-05-12, 8.51, 0, 1
I have tried to implement the following regular expression and it does not give me any result.
grep -E "^\d{0,3},[0-2][0-9][0-9][0-9].*,[1-5].[0-5][1-9],[1-9],[1-9]$"
Any idea?

If it is mandatory to use grep, would you please try:
grep -E '^[0-9]{1,3},1[0-9]{3}(-[0-9]{2}){2},(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]),([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*)[[:space:]]?$' input_file
Result:
582,1999-05-12,8.51,0,1
[0-9]{1,3} matches if id has 1-3 digits. (I have interpreted only 3 digits like that. If it means differently, tweak the regex accordingly.)
1[0-9]{3}(-[0-9]{2}){2} matches if the birth year is before 2000 (exclusive).
(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]) matches if grade is greater than 5.60 with the second decimal place being other than 0.
([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*) matches if either or both of expulsion and serious misdemeanor have a non-zero value.

Regular expressions do not understand numeric values, and they certainly do not understand boolean logic. All they know is text. You'll need to use an actual programming language like Awk or Perl to do this.
Here's an example:
$ perl -l -a -F, -E'say if length($F[0])>3 || $F[2] < 5.60' foo.txt
123,2005-01-01,5.36,1,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
This call to perl splits apart the fields on commas, and then prints the line if the length of the first column is over 3, or the value of the third column is less than 5.60.
This is just a starting point, but this is the direction to go.
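Following that direction, here is a minimal awk sketch that checks all of the stated conditions at once. The function name filter_students is made up for illustration. It assumes comma-separated fields, that the year is the first four characters of the date, and that "only 3 digits" means at most three digits (the same reading as the grep answer above).

```shell
# filter_students (hypothetical name): print records that meet every condition.
# Assumptions: comma-separated fields, year = first four characters of the date,
# and "only 3 digits" read as "at most three digits".
# Conditions, in order: id is 1-3 digits; birth year before 2000;
# grade at least 5.60 with a non-zero second decimal;
# at least one expulsion or serious misdemeanor.
filter_students() {
  awk -F, '
    length($1) <= 3 && $1 ~ /^[0-9]+$/ &&
    substr($2, 1, 4) + 0 < 2000 &&
    $3 + 0 >= 5.60 && $3 ~ /\.[0-9][1-9]$/ &&
    ($4 + 0 > 0 || $5 + 0 > 0)
  ' "$@"
}

# Sample data from the question; prints 582,1999-05-12,8.51,0,1
printf '%s\n' \
  '123,2005-01-01,5.36,1,1' \
  '582,1999-05-12,8.51,0,1' \
  '9274,2001-25-12,9.65,0,0' \
  '21,2006-14-05,0.53,4,1' | filter_students
```

Because the comparisons are numeric rather than textual, changing a threshold means editing one number instead of reworking a character class.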

Related

Finding number range with grep

I have a database in this format:
username:something:UID:something:name:home_folder
Now I want to see which users have a UID ranging from 1000-5000. This is what I tried to do:
ypcat passwd | grep '^.*:.*:[1-5][0-9]\{2\}:'
My thinking is this: I go to the third column and find numbers that start with a number from 1-5, the next number can be any number - range [0-9] and that range repeats itself 2 more times making it a 4 digit number. In other words it would be something like [1-5][0-9][0-9][0-9].
My output, however, lists even UIDs that are greater than 5000. What am I doing wrong?
Also, I realize the code I wrote could potentially list numbers up to 5999. How can I restrict the numbers to 1000-5000?
EDIT: I'm intentionally not using awk since I want to understand what I'm doing wrong with grep.
There are several problems with your regex:
As Sundeep pointed out in a comment, ^.*:.*: will match two or more columns, because the .* parts can match field delimiters (":") as well as field contents. To fix this, use ^[^:]*:[^:]*: (or, equivalently, ^\([^:]*:\)\{2\}; see the notes on bracket expressions and basic vs extended RE syntax below).
[0-9]\{2\} will match exactly two digits, not three
As you realized, it matches numbers starting with "5" followed by digits other than "0"
As a result of these problems, the pattern ^.*:.*:[1-5][0-9]\{2\}: will match any record with a UID or GID in the range 100-599.
To do it correctly with grep, use grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):' (again, see Sundeep's comments).
[Added in edit:]
Concerning bracket expressions and what ^ means in them, here's the relevant section of the re_format man page:
A bracket expression is a list of characters enclosed in '[]'. It
normally matches any single character from the list (but see below).
If the list begins with '^', it matches any single character (but see
below) not from the rest of the list. If two characters in the list
are separated by '-', this is shorthand for the full range of
characters between those two (inclusive) in the collating sequence,
e.g. '[0-9]' in ASCII matches any decimal digit.
(bracket expressions can also contain other things, like character classes and equivalence classes, and there are all sorts of special rules about things like how to include characters like "^", "-", "[", or "]" as part of a character list, rather than negating, indicating a range, class, or end of the expression, etc. It's all rather messy, actually.)
Concerning basic vs. extended RE syntax: grep -E uses the "extended" syntax, which is just different enough to mess you up. The relevant differences here are that in a basic RE, the characters "(){}" are treated as literal characters unless escaped (if escaped, they're treated as RE syntax indicating grouping and repetition); in an extended RE, this is reversed: they're treated as RE syntax unless escaped (if escaped, they're treated as literal characters).
That's why I suggest ^\([^:]*:\)\{2\} in the first bullet point, but then actually use ^([^:]*:){2} in the proposed solution -- the first is basic syntax, the second is extended.
The other relevant difference -- and the reason I switched to extended for the actual solution -- is that only extended RE allows | to indicate alternatives, as in this|that|theother (which matches "this" or "that" or "theother"). I need this capability to match a 4-digit number starting with 1-4 or the specific number 5000 ([1-4][0-9]{3}|5000). There's no portable way to do this in a single basic RE (GNU grep accepts \| in basic syntax, but that is an extension), so grep -E and the extended syntax are the right choice here.
(There are also many other RE variants, such as Perl-compatible RE (PCRE). When using regular expressions, always be sure to know which variant your regex tool uses, so you don't use syntax it doesn't understand.)
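To make the difference concrete, here is a small sketch that runs the extended pattern against a few made-up records. The sample data and the function names are hypothetical. The second function stays within basic syntax by giving grep two patterns with -e, since a single basic RE cannot portably express the alternation.

```shell
# Hypothetical records in username:something:UID:something:name:home_folder form.
sample_passwd() {
  printf '%s\n' \
    'alice:x:999:100:Alice:/home/alice' \
    'bob:x:1000:100:Bob:/home/bob' \
    'carol:x:4321:100:Carol:/home/carol' \
    'dave:x:5000:100:Dave:/home/dave' \
    'erin:x:5001:100:Erin:/home/erin' \
    'frank:x:12345:500:Frank:/home/frank'
}

# Extended syntax: one pattern, with | for the alternation.
uid_range_ere() {
  grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):'
}

# Basic syntax: escaped grouping and repetition, and two -e patterns
# in place of the alternation.
uid_range_bre() {
  grep -e '^\([^:]*:\)\{2\}[1-4][0-9]\{3\}:' -e '^\([^:]*:\)\{2\}5000:'
}

# Prints the bob, carol and dave records only.
sample_passwd | uid_range_ere
```

Note that the GID of 100 in the fourth field no longer causes a match, because [^:]* cannot run across a field delimiter.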
ypcat passwd | awk -F: '$3 >= 1000 && $3 <= 5000 {print $1}'
awk can do the task in a simple manner. Here ":" is set as the delimiter between the fields, and the condition is that the third field should be at least 1000 and at most 5000. If the condition is met, the first field is printed.

AWK - I need to write a one line shell command that will count all lines that

I need to write this solution as an AWK command. I am stuck on the last question:
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
Example(s):
This is the format of lines we want to print. Lines that do not match this format should be skipped:
(10) This is a sample line from file.txt that your script should count.
(117) And this is another line your script should count.
Lines like this, as well as other non-matching lines, should be skipped:
15 this line should not be printed
and this line should not be printed
Thanks in advance, I'm not really sure how to tackle this in one line.
This is not a homework solution service. But I think I can give a few pointers.
One idea would be to create a counter, and then print the result at the end:
awk '<COND> {c++} END {print c}'
I'm getting a bit confused by the terminology. First you claim that the lines should be counted, but in the examples, it says that those lines should be printed.
Now of course you could do something like this:
awk '<COND>' file.txt | wc -l
The first part will print out all lines that satisfy the condition, and the output will be piped to wc -l, which is a separate program that counts the number of lines.
Now as to what the condition <COND> should be, I leave to you. I strongly suggest that you google regular expressions and awk, it shouldn't be too hard.
I think the requirement is very clear
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
1. begin with a decimal number in parenthesis
2. containing a mix of both upper and lower case letters
3. end with a period
Check all three conditions. Note that 2. doesn't say "only", so the line can contain other classes of characters, but it should have at least one uppercase and one lowercase character.
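One possible reading of the three conditions, sketched with awk, is below. The function name count_matching and the sample lines are made up for illustration. It assumes "mix of both upper and lower case letters" means at least one letter of each case anywhere on the line.

```shell
# count_matching (hypothetical name): count lines that start with "(digits)",
# contain at least one uppercase and one lowercase letter, and end with a period.
# "c + 0" makes awk print 0 rather than an empty line when nothing matches.
count_matching() {
  awk '/^\([0-9]+\)/ && /[A-Z]/ && /[a-z]/ && /\.$/ { c++ } END { print c + 0 }' "$@"
}

# Made-up sample input; prints 2
printf '%s\n' \
  '(10) This is a sample line that should be counted.' \
  '(117) And this is another line that should be counted.' \
  '15 this line should not be counted' \
  'and this line should not be counted' | count_matching
```

For the file in the exercise, the call would be count_matching file.txt.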
The example mixes the concepts of printing and counting. If that is part of the exercise, it is very poorly worded, or perhaps it assumes that the counting will be done by piping the output of a filtering script to wc. Regardless, more attention should have been paid to the wording, especially for a student exercise.
Please comment if anything is not clear and I'll add more details...

How to format a US currency string using python or sed

I have numerous invoices that I sent to clients with this string at the bottom:
Total: 1,000.00
or whatever the amount. Some are 2 figures, some 5 figures + the decimal part.
The thing is that the number's format is inconsistent across all invoices. Sometimes it's 1.000,00, and it keeps switching the dot and the comma.
So with grep, awk and sed, I am able to extract only the amount from all invoices, without the dollar sign, in order to sum them up to a grand total. But the dot and comma switching confuses Python, obviously.
So in Python (could be in sed as well), I am looking to convert the third char from the right to a dot and then, from there on, convert every fourth char it finds to a comma.
In other words, it has to separate the digits into groups of 3 from the right and add a comma between each of them, except for the first group at the far right, which would be 2 digits separated by a dot.
Hope that is clear enough...
Try this:
yourstring = yourstring[:(len(yourstring)-3)].replace(".",",") + "." + yourstring[-2:]
I tried this in Python and I think it works.
sed 's/$/ /
:coma
s/\([0-9]\)[.]\([0-9]\{3\}\)/\1,\2/g;t coma
:dot
s/\([0-9]\),\([0-9][0-9][^0-9]\)/\1.\2/g;t dot
s/ $//
' YourFile
This applies a general, looping substitution to every number on each line.
It first changes every dot that is followed by three digits into a comma, then changes the last comma (the one followed by exactly two digits) into a dot.
A trick is needed to handle a number at the end of the line: a space is appended first and removed at the end (this could be optimized with a prior test).
It is POSIX compliant.
Well, the simplest way I've found to handle this is using a bit of sed, some bash and, for the final print, printf, which allows easy currency formatting with "%'.2f" (note the ' character, it is mandatory; the grouping it produces also depends on the current locale):
# Get rid of every character that is not a digit, leaving each amount in cents
totals=$( echo "$totals" | sed 's/[^0-9]*//g' )
# Sum up the amounts (10# forces base 10, so a leading 0 is not read as octal)
sum=0
for n in $totals; do
    sum=$(( sum + 10#$n ))
done
# Put back the commas at each thousand, the dot at decimals and the $ sign
sansdec=$(( ${#sum} - 2 ))
sum="${sum:0:$sansdec}.${sum: -2}"
printf "%s" "\$"
printf "%'.2f\n" "$sum"
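An alternative sketch is to normalize each amount to a plain decimal first and let awk do the addition. The function names are made up for illustration. It assumes one amount per line and that every amount ends in exactly two decimal digits, as described in the question.

```shell
# normalize_amounts (hypothetical name): turn 1,000.00 or 1.000,00 into 1000.00.
# Assumption: every amount ends in a separator followed by exactly two digits.
normalize_amounts() {
  # 1) mark the decimal separator with '#', 2) drop the remaining separators,
  # 3) turn the mark into a dot.
  sed -e 's/[.,]\([0-9][0-9]\)$/#\1/' -e 's/[.,]//g' -e 's/#/./'
}

# sum_amounts (hypothetical name): add up the normalized amounts.
# LC_ALL=C keeps the decimal point a dot regardless of the user's locale.
sum_amounts() {
  normalize_amounts | LC_ALL=C awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Prints 2025.50
printf '%s\n' '1,000.00' '1.000,00' '25,50' | sum_amounts
```

The result is a plain number, so it can still be passed to printf "%'.2f" for display.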

How to implement Siri/Cortana like functionality in commandline?

I would like to implement a small subset of siri/cortana like features in command line.
For e.g.
$ What is the sum of 100 and 1000
> Response: 1100
$ What is the product of 10 and 12
> Response: 120
The questions are predefined regular expressions. It needs to call the matching function in ruby.
Pattern: What is the sum of (\d)+ and (\d)+
Ruby method to call: sum(a,b)
Any pointers/suggestion is appreciated.
That sounds exactly like cucumber, maybe take a look and see if you can just use their classes to hack something together :) ?
You could do something like the following:
question = gets.chomp
/\A.*(sum |product |quotient |difference )\D+([0-9]+)\D+([0-9]+).*\z/.match question
send($1.strip, $2.to_i, $3.to_i)
Quick explanation for anyone that may be new to matching in Ruby:
This gets a line of input from the command line and scans it for a function name (i.e. sum, product, etc) followed by a space and potentially some non-digit characters. Then, it looks for a first number (similarly followed by a space and 0 or more non-digit characters) and a second number followed by nothing or anything. The parentheses determine what gets assigned to the variables preceded by a $, i.e. the substring that matches the contents of the first set of parentheses gets assigned to $1.
Next, it calls the method whose name is the value of $1 (stripped of the trailing space that the capture group includes) with the arguments (cast to integers) found in $2 and $3.
Obviously, this isn't generalized at all--you're putting the method names in the regex, and it's taking a fixed number of arguments--but it'll hopefully be useful for getting you on the right track.
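The same dispatch idea can be sketched in plain shell rather than Ruby, which may help when experimenting directly at the command line. This is not the Ruby approach above, only an analogous sketch. The function names and the error message are made up, only sum and product are handled, and sed -E is assumed to be available.

```shell
# Functions a question can be dispatched to (hypothetical names).
sum()     { echo $(( $1 + $2 )); }
product() { echo $(( $1 * $2 )); }

# answer (hypothetical name): find an operation name and two numbers in the
# question, then call the shell function of the same name with those numbers.
answer() {
  # Reduce the question to "name a b"; prints nothing if it does not match.
  parsed=$(printf '%s\n' "$1" |
    sed -n -E 's/^.*(sum|product)[^0-9]+([0-9]+)[^0-9]+([0-9]+).*$/\1 \2 \3/p')
  if [ -z "$parsed" ]; then
    echo 'Sorry, I did not understand that.' >&2
    return 1
  fi
  # Intentionally unquoted: split "name a b" into three positional parameters.
  set -- $parsed
  "$1" "$2" "$3"
}

answer 'What is the sum of 100 and 1000'    # prints 1100
answer 'What is the product of 10 and 12'   # prints 120
```

Only names listed in the pattern can ever be called, which keeps arbitrary input from running arbitrary commands.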

gnu/unix sort numerical only using first column?

With regular strings, if the first field matches, we sort by the next field and so on, and things work as we expect.
echo -e 'a c\na b' | sort #regular string sort
a b
a c
With numbers, if the first field matches, we…switch to string sort on subsequent fields? Why? I would think it would compare each field numerically.
echo -e '1 22\n1 3' | sort -n #numeric sort
1 22
1 3
FYI, using sort (GNU coreutils) 5.97 on RHEL 5.5.
What am I missing here? I know I can use -k to pick the field I want to sort on, but that drastically reduces the flexibility of input allowed, as it requires the user to know the numbers of fields.
Thanks!
Sadly you haven't missed anything. This apparently simple task - split lines into fields and then sort numerically on all of them - can't be done by the unix sort program. You just have to figure out how many columns there are and name them all individually as keys.
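What's happening when you specify -n with no other options is that the whole line is being passed to the "convert string to number" routine, which converts the number at the start of the line and ignores the rest. The split into fields is not done at all. Lines whose leading numbers compare equal are then ordered by a last-resort ordinary string comparison of the whole line, which is why "1 22" comes before "1 3".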
Your first example, without -n, is also doing whole-line comparison. It's not comparing "a" to "a" then "b" to "c". It's comparing "a b" to "a c".
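If the number of columns is not known in advance, the key list can be generated rather than typed. Here is a minimal sketch of that idea. The function name is made up. It assumes the input is a regular file (not a pipe), that fields are separated by blanks, and that the first line has as many fields as any other.

```shell
# sort_all_numeric (hypothetical name): sort a file numerically on every field,
# left to right, by building one -kN,Nn key per field of the first line.
sort_all_numeric() {
  file=$1
  n=$(awk 'NR == 1 { print NF; exit }' "$file")
  keys=''
  i=1
  while [ "$i" -le "${n:-0}" ]; do
    keys="$keys -k$i,${i}n"
    i=$((i + 1))
  done
  # $keys is intentionally unquoted so that it splits into separate options.
  sort $keys "$file"
}

# Example with the data from the question, written to a temporary file.
tmpfile=$(mktemp)
printf '1 22\n1 3\n' > "$tmpfile"
sort_all_numeric "$tmpfile"   # prints "1 3" then "1 22"
rm -f "$tmpfile"
```

For two columns this is equivalent to running sort -k1,1n -k2,2n on the file.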

Resources