Rewrite Word with capitals - mod-rewrite

I'm trying to write RewriteRule rules that select a word in capital letters and rewrite it into a query. The capitalised word could be in different positions. Other single capital letters are to be ignored.
An example would be finding the word KELPIE - note it is the only word in full capitals
http://www.atestdomain.com.au/DogsBigBlackKELPIE.htm
needs to become
http://www.atestdomain.com.au/animals/search.php?keyword=&category=2&dogtype=KELPIE&location_id=2&submit=Search

Something like this is what you're after.
RewriteRule ([A-Z]{2,})\.htm$ animals/search.php?keyword=&category=2&dogtype=$1&location_id=2&submit=Search [NS,NE,B,DPI,L]
Note, though, that this still can't differentiate between an uppercase keyword and one preceded by a single-letter word (which looks like it would be uppercased in your scheme). For example, DogsBigAKELPIE.htm would yield AKELPIE rather than KELPIE.
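The capture done by that pattern can be tried outside Apache. The following is only a Ruby sketch of the regex part, not mod_rewrite itself; the constant KEYWORD_PATTERN, the helper uppercase_keyword and the second and third file names are made up for illustration.

```ruby
# Same pattern as in the RewriteRule; \z plays the role of $ for a single-line path.
KEYWORD_PATTERN = /([A-Z]{2,})\.htm\z/

# Hypothetical helper: returns the run of two or more capitals that sits
# directly before ".htm", or nil when there is none.
def uppercase_keyword(path)
  match = KEYWORD_PATTERN.match(path)
  match && match[1]
end

puts uppercase_keyword("DogsBigBlackKELPIE.htm")   # KELPIE
puts uppercase_keyword("DogsBigAKELPIE.htm")       # AKELPIE (the single-letter word "A" is swallowed)
p uppercase_keyword("DogsBigBlackKelpie.htm")      # nil
```

The second call shows the ambiguity: a regex alone cannot tell where a one-letter capitalised word ends and the keyword begins.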

Related

Can someone help me with Ruby regex to check any word with letters starting with t and ending with r and replace with word Twitter? Thank you

I find that Rubular is very useful for working out how regexes work in Ruby.
You have two questions here. First, what regex will recognise what you want. Second, how to replace that found string with something else.
Your regex will be something like /\bt\w*r\b/. The elements here are \b, which is a word boundary. Then we have the letter t, then any number of word characters \w*, then the letter r, and finally another word boundary \b. (Without the word-boundary characters, your regex will find t...r inside other words too, so it will match parts of words like 'stress', 'stirs' etc.)
To do the replacement you want the gsub method.
new_string = your_string.gsub(/\bt\w*r\b/i, 'Twitter')
This will substitute the string Twitter for the found regex. The i on the end of the regex makes it case-insensitive - omit this if you want it to only find the lower-case text as in the regex.
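To see the effect of the word boundaries and of the i flag, here is a small sketch. The method name twitterize, its ignore_case keyword and the sample sentence are made up for this example.

```ruby
# Replaces every whole word that starts with "t" and ends with "r".
# ignore_case toggles the /i flag described above.
def twitterize(text, ignore_case: true)
  pattern = ignore_case ? /\bt\w*r\b/i : /\bt\w*r\b/
  text.gsub(pattern, "Twitter")
end

sentence = "the Tiger ate my tumblr under stress"
puts twitterize(sentence)                      # the Twitter ate my Twitter under stress
puts twitterize(sentence, ignore_case: false)  # the Tiger ate my Twitter under stress
```

"stress" is left alone in both calls because the t...r inside it is not surrounded by word boundaries.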

Discard contractions from string

I have a special use case where I want to discard all contractions from a string and select only words that are made up of letters and contain no special characters.
For example:
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
string.scan(/\b[A-z][a-z]+\b/)
#=> ["Achilles", "Ada", "Stackoverflow", "James", "ll", "ve"]
Note: It's not discarding the whole words I'll and I've; their tails ll and ve are still matched.
Can someone please help me discard every whole word that contains a contraction?
Try this Regex:
(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)
Explanation:
(?:(?<=\s)|(?<=^)) - finds the position immediately preceded by either start of the line or by a white-space
[a-zA-Z]+ - matches 1+ occurrences of a letter
(?=\s|$) - The substring matched above must be followed by either a whitespace or end of the line
Update:
To make sure that not all the letters are in upper case, use the following regex:
(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)
The only thing added here is (?=\S*[a-z]), which means that the word must contain at least one lowercase letter.
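As a quick check, the updated regex can be run against the string from the question. This is only a sketch; the constant MIXED_CASE_WORD and the method mixed_case_words are names invented here.

```ruby
# Regex from the update above: a run of letters that is delimited by
# whitespace (or the line edges) and contains at least one lowercase letter.
MIXED_CASE_WORD = /(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)/

def mixed_case_words(text)
  # The pattern has no capture groups, so scan returns the matched words.
  text.scan(MIXED_CASE_WORD)
end

sample = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
p mixed_case_words(sample)
# => ["Achilles", "Ada", "Stackoverflow", "James"]
```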
I know that there's an accepted answer already, but I'd like to give my own shot:
(?<=\s|^)\w+[a-z]\w*
This regex is shorter and more efficient (157 steps against 315 for the accepted answer).
The explanation is rather simple:
(?<=\s|^) - This is a positive lookbehind. It means that we want strings preceded by a whitespace character or the start of the string.
\w+[a-z]\w* - This means that we want strings composed of word characters (letters, but note that \w also accepts digits and the underscore) containing at least one lowercase letter, thus discarding words which are entirely uppercase. Together with the positive lookbehind, the regex discards the contractions in the example string. Because there is no trailing check, though, it can still match the leading part of a word such as "don" in "don't"; appending (?=\s|$) prevents that.
NOTE: this regex won't take one-letter words into account. If you want that, use \w*[a-z]\w* instead, at a small efficiency cost.
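A short sketch of how this shorter regex behaves, next to a variant with a trailing lookahead. The constant names and the second test sentence are made up for illustration, and the STRICT variant is an addition for comparison, not part of the answer above.

```ruby
SHORT  = /(?<=\s|^)\w+[a-z]\w*/           # regex from the answer
STRICT = /(?<=\s|^)\w+[a-z]\w*(?=\s|$)/   # same, plus a check on what follows the word

sample = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
p sample.scan(SHORT)             # => ["Achilles", "Ada", "Stackoverflow", "James"]

# A contraction whose first part has two or more letters shows the difference.
p "don't stop now".scan(SHORT)   # => ["don", "stop", "now"]
p "don't stop now".scan(STRICT)  # => ["stop", "now"]
```

On the string from the question both patterns give the same result, because every contraction there starts with the single letter I.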

Replace non-word characters, unless given sequence matches

I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like a case structure: the first alternative that matches consumes the corresponding string, and no further matching is done on it. Thus, ###.*?### will consume a marker (like ###hl###); nothing else will be matched inside it. We also match any sequence of word characters. If either of those is captured, we can just return it as-is ($1). If not, then we match any other character (i.e. not inside a marker, and not a word character) and replace it with "X".
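If only the two exact delimiters from the question should be protected, the same alternation idea can be written with an explicit list. This is a sketch under that assumption; DELIMITER and mask_non_word are hypothetical names, and the m flag is added so that newlines are replaced too.

```ruby
# Only these two literal strings are kept; any other "#" is replaced.
DELIMITER = Regexp.union("###hl###", "###endhl###")

def mask_non_word(str, replacement = "X")
  # Group 1 holds a delimiter or a run of word characters; when it is nil,
  # the match was a single other character, which gets replaced.
  str.gsub(/(#{DELIMITER}|\w+)|./m) { Regexp.last_match(1) || replacement }
end

puts mask_non_word("Jim-Bob's email ###hl###address###endhl### is: jb#example.com")
# JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom
puts mask_non_word("a ### b")
# aXXXXXb
```

Unlike ###.*?###, this version does not treat a stray pair of ### sequences as a marker.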
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say your system never produces "\r", so that character is free; we can use it as a temporary replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings which do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I get utf8?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers: more and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need, by listing them.
But there is a shortcut. The Unicode data already records which characters are uppercase, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further; for instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
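A small sketch comparing the two patterns on lines that contain Slovenian capitals. The helper headings, the constant names and the sample text are made up for illustration; each line is tested on its own, so \A and \z take the place of ^ and $.

```ruby
STRICT_HEADING = /\A[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*\z/
LOOSE_HEADING  = /\A[[:upper:]\s]+\z/

# Returns the lines of text that match the given heading pattern.
def headings(text, pattern)
  text.each_line.map(&:chomp).select { |line| line.match?(pattern) }
end

text = <<~TEXT
  ŽIVALI
  Some text which is never capitalized.
  ČRNA ŠKATLA
  I AM A HEADING
TEXT

p headings(text, STRICT_HEADING)  # => ["ŽIVALI", "ČRNA ŠKATLA"]
p headings(text, LOOSE_HEADING)   # => ["ŽIVALI", "ČRNA ŠKATLA", "I AM A HEADING"]
p "ŽIVALI".match?(/\A[A-Z]{2,}\z/) # => false, since [A-Z] covers ASCII only
```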
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b

Rewriting URLs with mod_rewrite for languages

I need some URL rewriting for my website using mod_rewrite but I can't figure out the regular expressions.
Here is what the current URLs may look like:
http://mydomain.com/zenphoto/pages/xyz?locale=en_US
http://mydomain.com/zenphoto/pages/xyz?locale=de_DE
http://mydomain.com/zenphoto/gallery_1?locale=de_DE
http://mydomain.com/zenphoto/gallery_n?locale=de_DE
xyz may contain different strings, e.g. legal, about, etc.
And that's how I'd like the URLs to be used:
http://mydomain.com/zenphoto/de/pages/xyz
http://mydomain.com/zenphoto/en/pages/xyz
http://mydomain.com/zenphoto/de/gallery_1
http://mydomain.com/zenphoto/en/gallery_n
I should mention that only de and en shall be possible. Any other strings shall be rerouted to de.
Could somebody help me please? :-)
Thanks,
Robert
RewriteEngine on
RewriteCond %{QUERY_STRING} ^locale=(en|de)_[A-Z]{2}$
RewriteRule ^zenphoto/pages/([a-z]+)$ /zenphoto/%1/pages/$1
RewriteCond %{QUERY_STRING} ^locale=(en|de)_[A-Z]{2}$
RewriteRule ^zenphoto/gallery_([0-9]+)$ /zenphoto/%1/gallery_$1
A RewriteRule pattern is matched against the URL path only, never the query string, so the ?locale= part has to be tested with a RewriteCond on %{QUERY_STRING}. Each RewriteCond applies to the RewriteRule that follows it.
For the first example, the RewriteCond says: "The query string starts (^) with "locale=", followed by "en" or (| means "or") "de", which is a group (there are parentheses -> it's a group), then an underscore ("_") and two uppercase letters, and there is nothing after ($ means it's the end)". The RewriteRule pattern says: "The path starts with "zenphoto/pages/", then has a sequence of lowercase letters (+ means "one or more", and [a-z] means "a letter in [a, b, ..., y, z]"), which is another group, and there is nothing after".
After a space comes the new URL, where $n stands for the n-th group of the RewriteRule pattern and %n for the n-th group of the RewriteCond.
In each rule, the first URL is the real one and the second is the 'pretty' one.
You have to put a backslash before special chars like ?,+,{,},(,),[,],*,.,| if you want to match one literally in your URL.
Edit:
If you want to avoid infinite loops, you should add the flag [L] (L = Last) at the end of each RewriteRule line.
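The group logic can be tried outside Apache. The following Ruby sketch is not mod_rewrite; it only mimics the matching, assuming the path and the query string arrive separately (which is how Apache hands them to mod_rewrite). The names LOCALE_QUERY and pretty_path are made up, and the fallback to "de" follows the requirement stated in the question.

```ruby
LOCALE_QUERY = /\Alocale=(en|de)_[A-Z]{2}\z/

# Hypothetical helper: maps a real path plus query string to the pretty path.
def pretty_path(path, query)
  # Group 1 of the query pattern is the language; anything else falls back to "de".
  lang = query.to_s[LOCALE_QUERY, 1] || "de"
  case path
  when %r{\Azenphoto/pages/([a-z]+)\z}
    "/zenphoto/#{lang}/pages/#{Regexp.last_match(1)}"
  when %r{\Azenphoto/gallery_([0-9]+)\z}
    "/zenphoto/#{lang}/gallery_#{Regexp.last_match(1)}"
  end
end

puts pretty_path("zenphoto/pages/legal", "locale=en_US")  # /zenphoto/en/pages/legal
puts pretty_path("zenphoto/gallery_1", "locale=de_DE")    # /zenphoto/de/gallery_1
puts pretty_path("zenphoto/pages/about", "locale=fr_FR")  # /zenphoto/de/pages/about
```

Paths that match neither pattern return nil, which corresponds to a request that no rule touches.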

Resources