I have a file of addresses that I am attempting to scrub, and I am using sed to get rid of unwanted characters and formatting. In this case, I have zip codes followed by a period:
Mr. John Doe
Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL
33487.
(for the time being, ignore the new lines; I am just focusing on the zip and period for now)
I want to remove the period (.) from the zip as my first step in cleaning this up. I tried to use substrings in sed as follows (using "|" as a delimiter, since it is easier for me to see):
sed 's|\([0-9]{4}\)\.|\1|g' test.txt
Unfortunately, it doesn't remove the period; it just prints it out as part of the substring. I based my attempt on this post:
Replace period surrounded by characters with sed
A point in the right direction would be greatly appreciated.
You specified four digits ({4}) but the zip has five, and in a basic regular expression you have to escape the { and } for them to act as a repetition count, for example:
sed 's|\(^[0-9]\{5\}\).*|\1|g' test.txt
Notice that you also have a space after the dot, so you might want to trim everything following the five digits. To be safe, you might also want to require that the digits are at the start of the line with ^.
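A minimal sketch of the brace-escaping point, assuming a hypothetical helper name (strip_zip_period) and only the zip line from the question. With escaped braces the interval works; with the unescaped {4} from the original command the braces are literal characters, so nothing matches:

```shell
# Hypothetical helper: drop a period that directly follows a 5-digit zip
# at the start of a line (basic regular expression, so braces are escaped).
strip_zip_period() {
    sed 's|^\([0-9]\{5\}\)\.|\1|'
}

# The zip line from the question: the period is removed.
printf '33487.\n' | strip_zip_period

# For contrast, the original pattern with unescaped {4}: the braces are
# taken literally, the pattern never matches, and the period stays.
printf '33487.\n' | sed 's|\([0-9]{4}\)\.|\1|g'
```

Lines that do not start with five digits, such as "Mr. John Doe", pass through unchanged.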
In my case, if I type info sed which is more complete than man sed, I find this:
'-r'
'--regexp-extended'
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that 'egrep' accepts; they
can be clearer because they usually have less backslashes, but are
a GNU extension and hence scripts that use them are not portable.
*Note Extended regular expressions: Extended regexps.
And under Appendix A Extended regular expressions you can read:
The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, braces ('{}'),
and '|'. While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them _to match a
literal character_. '|' is special here because '\|' is a GNU extension
- standard basic regular expressions do not provide its functionality.
Examples:
'abc?'
becomes 'abc\?' when using extended regular expressions. It
matches the literal string 'abc?'.
'c\+'
becomes 'c+' when using extended regular expressions. It matches
one or more 'c's.
'a\{3,\}'
becomes 'a{3,}' when using extended regular expressions. It
matches three or more 'a's.
'\(abc\)\{2,3\}'
becomes '(abc){2,3}' when using extended regular expressions. It
matches either 'abcabc' or 'abcabcabc'.
'\(abc*\)\1'
becomes '(abc*)\1' when using extended regular expressions.
Backreferences must still be escaped when using extended regular
expressions.
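To see the quoted rules in action, here is a small sketch using grep (the helper names and sample strings are made up for illustration). The same interval is written once in basic syntax and once in extended syntax, followed by the escaped '?' case from the examples above:

```shell
# The same "three or more a's" interval in both syntaxes.
three_or_more_bre() { grep 'a\{3,\}'; }     # basic: braces escaped
three_or_more_ere() { grep -E 'a{3,}'; }    # extended: braces bare

# Both commands print the lines "aaa" and "aaaa".
printf 'aa\naaa\naaaa\n' | three_or_more_bre
printf 'aa\naaa\naaaa\n' | three_or_more_ere

# In extended syntax an escaped '?' is a literal question mark,
# so only the line "abc?" is printed.
printf 'abc?\nab\n' | grep -E 'abc\?'
```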
Basic Solution: Use a Range Atom to Handle Your Posted Input
An easy (but slightly naive) way to do this with your posted input is to look for:
start of line
followed by exactly 5 digits (a standard US ZIP Code)
followed by zero or more non-period characters (e.g. a ZIP+4 suffix; restricting this to non-periods keeps the match from running past the first period)
followed by a literal period
and just replace the whole match with the captured part of the match. For example:
With BSD sed or without extended expressions:
sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/'
With GNU sed and extended regular expressions:
sed -r 's/^([[:digit:]]{5}[^.]*)\./\1/'
Either way, given your posted input you end up with:
Mr. John Doe
Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL
33487
Advanced Solution: Handle ZIP Codes Properly
The main caveat is that the solution above works with your posted sample, but won't match if the ZIP Code is properly at the end of the last line of the address as it should be in a standardized USPS address. That's fine if you've got a custom format, but it will likely cause you problems with standardized or corrected addresses such as:
Mr. John Doe
12345 Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL 33487.
The following will work with both your posted input and a more typical USPS address, but your mileage on other non-standard inputs may vary.
# More reliable, but much harder to read.
sed -r 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
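A quick way to try the end-of-line pattern on both address layouts is sketched below. The helper name is hypothetical, the ZIP+4 line is a made-up sample, and -E is used in place of -r on the assumption that you want a flag both GNU and BSD sed accept:

```shell
# Hypothetical helper wrapping the end-of-line pattern from above.
strip_trailing_zip_period() {
    sed -E 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
}

# "Mr. John Doe" is left alone because its period does not follow a zip;
# the two zip lines lose their trailing period (and trailing blanks).
printf '%s\n' \
    'Mr. John Doe' \
    'Caravelle Drive, Ponte Vedra FL 33487.' \
    '33487-1234. ' \
    | strip_trailing_zip_period
```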
I have a database in this format:
username:something:UID:something:name:home_folder
Now I want to see which users have a UID ranging from 1000-5000. This is what I tried to do:
ypcat passwd | grep '^.*:.*:[1-5][0-9]\{2\}:'
My thinking is this: I go to the third column and find numbers that start with a number from 1-5, the next number can be any number - range [0-9] and that range repeats itself 2 more times making it a 4 digit number. In other words it would be something like [1-5][0-9][0-9][0-9].
My output, however, lists even UID's that are greater than 5000. What am I doing wrong?
Also, I realize the code I wrote could potentially list numbers up to 5999. How can I limit the numbers to 1000-5000?
EDIT: I'm intentionally not using awk since I want to understand what I'm doing wrong with grep.
There are several problems with your regex:
As Sundeep pointed out in a comment, ^.*:.*: will match two or more columns, because the .* parts can match field delimiters (":") as well as field contents. To fix this, use ^[^:]*:[^:]*: (or, equivalently, ^\([^:]*:\)\{2\}; see the notes on bracket expressions and basic vs. extended RE syntax below).
[0-9]\{2\} will match exactly two digits, not three
As you realized, it matches numbers starting with "5" followed by digits other than "0"
As a result of these problems, the pattern ^.*:.*:[1-5][0-9]\{2\}: will match any record with a UID or GID in the range 100-599.
To do it correctly with grep, use grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):' (again, see Sundeep's comments).
[Added in edit:]
Concerning bracket expressions and what ^ means in them, here's the relevant section of the re_format man page:
A bracket expression is a list of characters enclosed in '[]'. It
normally matches any single character from the list (but see below).
If the list begins with '^', it matches any single character (but see
below) not from the rest of the list. If two characters in the list
are separated by '-', this is shorthand for the full range of
characters between those two (inclusive) in the collating sequence,
e.g. '[0-9]' in ASCII matches any decimal digit.
(bracket expressions can also contain other things, like character classes and equivalence classes, and there are all sorts of special rules about things like how to include characters like "^", "-", "[", or "]" as part of a character list, rather than negating, indicating a range, class, or end of the expression, etc. It's all rather messy, actually.)
Concerning basic vs. extended RE syntax: grep -E uses the "extended" syntax, which is just different enough to mess you up. The relevant differences here are that in a basic RE, the characters "(){}" are treated as literal characters unless escaped (if escaped, they're treated as RE syntax indicating grouping and repetition); in an extended RE, this is reversed: they're treated as RE syntax unless escaped (if escaped, they're treated as literal characters).
That's why I suggest ^\([^:]*:\)\{2\} in the first bullet point, but then actually use ^([^:]*:){2} in the proposed solution -- the first is basic syntax, the second is extended.
The other relevant difference -- and the reason I switched to extended for the actual solution -- is that only extended RE allows | to indicate alternatives, as in this|that|theother (which matches "this" or "that" or "theother"). I need this capability to match a 4-digit number starting with 1-4 or the specific number 5000 ([1-4][0-9]{3}|5000). There's no way to do this in a single standard (POSIX) basic RE (GNU grep accepts \| as an extension, but that isn't portable), so grep -E and the extended syntax are the right tool here.
(There are also many other RE variants, such as Perl-compatible RE (PCRE). When using regular expressions, always be sure to know which variant your regex tool uses, so you don't use syntax it doesn't understand.)
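The extended pattern can be checked against a few boundary values. This is only a sketch: the helper name and the sample records are made up, laid out in the username:something:UID:something:name:home_folder format from the question:

```shell
# Hypothetical filter: keep records whose third field (UID)
# is between 1000 and 5000 inclusive.
uid_1000_to_5000() {
    grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):'
}

# Made-up records; only bob (1000) and carol (5000) are printed.
# erin's GID of 500 is ignored because the pattern is anchored to field 3.
printf '%s\n' \
    'alice:x:999:100:Alice:/home/alice' \
    'bob:x:1000:100:Bob:/home/bob' \
    'carol:x:5000:100:Carol:/home/carol' \
    'dave:x:5001:100:Dave:/home/dave' \
    'erin:x:60000:500:Erin:/home/erin' \
    | uid_1000_to_5000
```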
ypcat passwd | awk -F: '$3 >= 1000 && $3 <= 5000 {print $1}'
awk can do this task simply. Here ":" is set as the delimiter between the fields, and the condition requires the third field to be at least 1000 and at most 5000. If the condition is met, awk prints the first field.
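For a quick check of the awk approach without a NIS server, the sketch below feeds made-up records through a hypothetical wrapper function. It assumes the inclusive range 1000-5000 that the question asks for:

```shell
# Hypothetical wrapper: print the user name (field 1) when the UID
# (field 3) is between 1000 and 5000 inclusive. awk compares numerically.
uid_range_awk() {
    awk -F: '$3 >= 1000 && $3 <= 5000 { print $1 }'
}

# Made-up records; prints "bob" and "carol".
printf '%s\n' \
    'alice:x:999:100:Alice:/home/alice' \
    'bob:x:1000:100:Bob:/home/bob' \
    'carol:x:5000:100:Carol:/home/carol' \
    'dave:x:5001:100:Dave:/home/dave' \
    | uid_range_awk
```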
I want to print sharp characters ('#') instead of the password in Ruby code:
puts " found password: #{pass.tr('?','#')}"
I need as many sharp '#' characters output as characters in a password.
How to do it right?
The method .tr is intended to swap specific characters; it cannot do a wildcard match. Even if you extended the character list to cover many characters, there is a risk that you miss or forget a special character that is allowed in passwords on your system.
A simple variant of what you have is to use .gsub instead:
pass.gsub(/./,'#')
This uses regular expressions to find groups of characters to swap. The simple Regexp /./ matches any single character. The Ruby core documentation on regular expressions includes a brief introduction, in case you have not used them much before.
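The "replace any single character" idea is not specific to Ruby. As a rough cross-check in the shell (a sketch only: the function name and the sample password are made up, and the input is assumed to be a single line), sed can do the same substitution:

```shell
# Hypothetical helper: print one '#' per character of the argument.
mask_password() {
    printf '%s\n' "$1" | sed 's/./#/g'
}

# A made-up 10-character password, including '?' and other punctuation.
mask_password 'p@ss?w0rd!'
```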
I have an assignment in which I have to send to a file an unlimited list of parameters, the file will have to print the strings which are repeated in the following way:
NumNumNumCharCharChar...
Num- number
Char-character
every three following numbers are the same, as well as the three next characters, then another three numbers and then another three characters.
The string must start with numbers and end with characters in a repeated way.
In order to solve this question, you may use only grep/egrep (your choice), which means that the solution must be a regular expression.
OK, this is what I thought to do for the egrep:
egrep "^([0-9][0-9][0-9][a-b][a-b][a-b])\1*$"
Your attempt is almost correct. The backreference \1 will require repetitions of the matching string, not the matching pattern. Allow the pattern to repeat instead. Inside the repetitions, you do want backreferences:
egrep '^(([0-9])\2{2}([a-z])\3{2})+$' file
As a shell scripting tweak, I switched to single quotes (double quotes are less safe) and I extended the lowercase class to [a-z]. Note that the outer parentheses are group 1, so the backreferences to the inner parenthesized expressions will be \2 and \3.
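To try the pattern without a file, a few made-up strings can be piped in. This is a sketch with a hypothetical function name; it uses grep -E, the modern spelling of egrep, and relies on backreference support in extended syntax, which common grep implementations provide but POSIX does not require:

```shell
# Hypothetical wrapper around the pattern from the answer.
triple_groups() {
    grep -E '^(([0-9])\2{2}([a-z])\3{2})+$'
}

# Only "111aaa" and "222bbb333ccc" are printed: the other lines either mix
# different digits or letters inside a triple or end with a stray character.
printf '%s\n' '111aaa' '222bbb333ccc' '123abc' '111aab' '111aaa2' \
    | triple_groups
```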
How should I use 'sed' command to find and replace the given word/words/sentence without considering any of them as the special character?
In other words, how do I treat the find and replace parameters as plain text?
In following example I want to replace 'sagar' with '+sagar' then I have to give following command
sed "s/sagar/\\+sagar/g"
I know that \ should be escaped with another \, but I can't do this manipulation by hand,
as there are so many special characters and combinations of them.
I am going to take find and replace parameters as input from user screen.
I want to execute the sed from c# code.
Simply, I do not want regular expression of sed to use. I want my command text to be treated as plain text?
Is this possible?
If so how can I do it?
While there may be sed versions that have an option like --noregex_matching, most of them don't have that option. Because you're getting the search and replace input by prompting a user, your best bet is to scan the user input strings for regexp special characters and escape them as appropriate.
Also, will your users expect for example, their all caps search input to correctly match and replace a lower or mixed case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
Note that there are far fewer special characters on the replacement side: '&' means "the complete string that was matched", and then there are the numbered replacement groups, like \1, \2, .... Given users that have no knowledge or expectation that they can use such characters, the likelihood of them using \1 in their required substitution is pretty low. More likely they may have a valid use for &, so you'll have to scan (at least) for that and replace it with \&. In a basic sed, that's about it. (There may be others in the latest GNU seds, or in some of the seds that have their genesis as PC tools.)
For a replacement string, you shouldn't have to escape the + char at all. You probably do for \. Again, you can scan your user's "naive" input and add escape chars as needed.
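The scan-and-escape idea can be sketched as a pair of small shell functions. Everything here is an assumption for illustration: the function names are made up, "/" is assumed to be the s-command delimiter, the pattern is assumed to be a basic regular expression, and user input is assumed to contain no newlines:

```shell
# Hypothetical helpers (names are illustrative, not part of sed).

# Backslash-escape the characters that are special in a basic RE
# ( \ . * ^ $ [ ) plus the "/" delimiter.
escape_sed_pattern() {
    printf '%s\n' "$1" | sed 's|[\\.*^$/[]|\\&|g'
}

# Backslash-escape the characters that are special in a replacement
# ( \ and & ) plus the "/" delimiter.
escape_sed_replacement() {
    printf '%s\n' "$1" | sed 's|[\\&/]|\\&|g'
}

# Replace every literal occurrence of $1 with $2 on standard input.
replace_literal() {
    sed "s/$(escape_sed_pattern "$1")/$(escape_sed_replacement "$2")/g"
}

# The example from the question: "sagar" becomes "+sagar".
printf 'sagar owes 1.50 (not 1x50)\n' | replace_literal 'sagar' '+sagar'

# Special characters on both sides are treated as plain text:
# only the literal "1.50" is replaced, and "&" and "/" are inserted as-is.
printf 'sagar owes 1.50 (not 1x50)\n' | replace_literal '1.50' 'a&b/c'
```

The escaping commands use "|" as their own delimiter so that "/" can sit inside the bracket expression without any ambiguity.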
Finally, if you're doing this for a "package" that will be distributed, and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around: some have their roots in Unix/Linux, and others, particularly super-sed, (I'm pretty sure) got started as PC standalones and have a very different feature set.
IHTH.
I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outer parentheses in your code create a capture group, which is why the match comes back with more than one element. Compare that to the last example I gave, where the group sits inside the look-ahead: wrapping the whole pattern in /( ... )/ without the /(?= ... )/ is unnecessary, since the first result in most regular expression engines is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
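Outside Ruby, the same "skip the label, keep the rest" idea can be approximated with a plain sed substitution instead of \K. This is a sketch only: the function name is made up, and it assumes each value follows the first colon on its line:

```shell
# Hypothetical helper: delete everything up to and including the first
# colon plus any following whitespace, and print only lines that had one.
value_after_colon() {
    sed -n 's/^[^:]*:[[:space:]]*//p'
}

# The sample data from the question; the header line has no colon,
# so it is skipped and only the three values are printed.
printf '%s\n' \
    '| Example Data' \
    '| Title: This is a sample title' \
    '| Content: This is sample content' \
    '| Date: 12/21/2012' \
    | value_after_colon
```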