Dealing with special characters in Nokogiri / Regex - ruby

I am getting the text from the body of an HTML doc as below. When I try to regex scan for the term "Exhibit 99", I get no matches, i.e., an empty array. However, in the HTML I do see "Exhibit 99", although inspect element shows it as Exhibit&nbsp;99. How can I get rid of these HTML characters and search for "Exhibit 99" as if it were a regular string?
require 'nokogiri'
require 'open-uri'
url = "https://www.sec.gov/Archives/edgar/data/1467373/000146737316000912/fy16q3plc8-kbody.htm"
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(url))
body = doc.css("body").text
body.scan(/exhibit 99/i)

Unicode character space
You can use:
body.scan(/exhibit\p{Zs}99/i)
From the documentation about Unicode character’s General Category:
/\p{Z}/ - 'Separator'
/\p{Zs}/ - 'Separator: Space'
It matches an ordinary space or a non-breaking space, but not a tab or a newline. The string should be encoded in UTF-8. See this related question for more information.
Non-word character
A more permissive regex would be:
body.scan(/exhibit\W99/i)
This allows any character other than a letter, a digit or an underscore between exhibit and 99. It would match a space, a non-breaking space, a tab, a dash, and so on.
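As a quick check of the patterns above, here is a small sketch. The sample string is made up for illustration (it is not taken from the SEC filing) and uses \u00A0 to stand in for the non-breaking space that &nbsp; turns into once Nokogiri extracts the text.

```ruby
# Hypothetical sample text: the first "Exhibit 99" contains U+00A0 (non-breaking space),
# the second a plain ASCII space, the third a hyphen.
body = "See Exhibit\u00A099, exhibit 99 and Exhibit-99."

plain     = body.scan(/exhibit 99/i)       # only the ASCII-space variant
space_sep = body.scan(/exhibit\p{Zs}99/i)  # ASCII space and non-breaking space
non_word  = body.scan(/exhibit\W99/i)      # any single non-word separator, hyphen included

puts plain.length
puts space_sep.length
puts non_word.length
```

Which of the two is preferable depends on how strict the search should be: \p{Zs} only tolerates space-like separators, while \W also accepts punctuation.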

Related

Regex to select all the commas from string that do not have any white space around them

I want to select all the commas in a string that do not have any white space around. Suppose I have this string:
"He,she, They"
I want to select only the comma between he and she. I tried this in rubular and came up with this regex:
(,[^(,\s)(\s,)])
This selects the comma that I want, but also selects an s which is a character after it.
In your regex (,[^(,\s)(\s,)]) you capture a comma followed by a negated character class, which matches any single character that is not one of those listed. It could also be written as (,[^)(,\s]), and it will capture, for example, ,s in a group.
What you could do is use a positive lookahead and a positive lookbehind to check that what is on the left and what is on the right is a \S non-whitespace character:
(?<=\S),(?=\S)
In Ruby, you may use [[:space:]] to match any (Unicode) whitespace and [^[:space:]] to match any char other than whitespace. Using these character classes inside lookarounds solves the problem:
/(?<=[^[:space:]]),(?=[^[:space:]])/
Here,
(?<=[^[:space:]]) - a positive lookbehind that matches a location that is immediately preceded with a non-whitespace char (if the string start position should also be matched, replace with (?<![[:space:]]))
, - a comma
(?=[^[:space:]]) - a positive lookahead that matches a location that is immediately followed with a non-whitespace char (if the string end position should also be matched, replace with (?![[:space:]])).
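The lookaround pattern can be tried out with a short sketch like the one below. The "a,b,c" string and the variable names are only illustrative; the point is that lookarounds do not consume the neighbouring characters, so adjacent commas are all found, whereas a pattern that matches the neighbours as well can skip one.

```ruby
re = /(?<=[^[:space:]]),(?=[^[:space:]])/

str = "He,she, They"
first_index = str.index(re)   # position of the only comma with no whitespace around it
marked      = str.gsub(re, ";")  # replace just that comma

# Compare a consuming pattern with the lookaround version on adjacent commas.
consuming_count  = "a,b,c".scan(/[^\s],[^\s]/).length
lookaround_count = "a,b,c".scan(re).length

puts first_index
puts marked
puts consuming_count
puts lookaround_count
```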
Try the regex below; I hope it helps:
re = /[^\s](,)[^\s]/m
str = 'check ,my,domain, qwe,sd'
# Print the match result. Because the pattern contains a capture group,
# scan yields an array of captures for each match.
str.scan(re) do |match|
  puts match.to_s
end

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question: it should only match if the opening and closing tags have the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub(/\[\/?\d\]/, "")
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub(/(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>')
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
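To see the named backreference at work on the string from the question's update, here is a minimal sketch. The extra content group and the variable names are additions for illustration only.

```ruby
# number must be repeated in the closing tag; content is whatever sits in between.
tagged = /\[(?<number>[0-9])\](?<content>.+?)\[\/\k<number>\]/

str = "[2]first[/2] [1]second[/2]"

pairs    = str.scan(tagged)                  # one [number, content] array per match
stripped = str.gsub(tagged, '\k<content>')   # drop the tags only where the numbers agree

p pairs
puts stripped
```

Only the [2]...[/2] pair matches; the mismatched [1]second[/2] is left untouched.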
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
    \[     # match a left bracket
    (\d+)  # capture one or more digits in capture group 1
    \]     # match a right bracket
    .+?    # match one or more characters lazily
    \[\/   # match a left bracket and forward slash
    \1     # match the contents of capture group 1
    \]     # match a right bracket
    /x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

regex any non-digit with exception

I've got strings like these:
+996999966966AA
-996999966966AA
I am using this code:
"+996999966966AA".gsub!(/\D/, "")
to get rid of any character except digits, but the + sign is also being stripped. How can my code retain the +?
Use:
[^+\d]
to match anything that isn't + or a digit.
You can also use \W, the "non-word character" class, which matches any character that is not a word character (alphanumeric or underscore). For example, this captures the sign and the digits in group 1:
(\W\d+)\w+
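The two suggestions behave differently on the sample strings, as the following sketch shows. Variable names are illustrative. Note that it uses gsub rather than gsub!, since gsub! returns nil when nothing was replaced.

```ruby
numbers = ["+996999966966AA", "-996999966966AA"]

# [^+\d] removes everything except "+" and digits, so a leading "-" is dropped.
kept_plus = numbers.map { |s| s.gsub(/[^+\d]/, "") }

# (\W\d+)\w+ captures any single non-word sign character plus the digits after it.
with_sign = numbers.map { |s| s[/(\W\d+)\w+/, 1] }

p kept_plus
p with_sign
```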

Why won't my simple regex pattern match and remove a file extension?

I have a string:
app_copy--28.ipa
The result I want is:
app_copy
The number after -- could be of variable length, so I want to match everything including and after --.
I've tried a few patterns, but none are matching for some reason:
gsub("--\*", "")
gsub("--*", "")
gsub("--*.ipa", "")
gsub("--\[0-9].ipa", "")
What am I missing?
Let's take a look at your test patterns:
"--\*" is actually equivalent to "--*" (since the \* is an escape sequence).
"--*" will match a single - character, followed by zero or more - characters.
"--*.ipa" will match a single - character, followed by zero or more - characters, followed by any single character, followed by a literal ipa.
"--\[0-9].ipa" is actually equivalent to "--[0-9].ipa" (since the \[ is an escape sequence), which will match a literal --, followed by a single decimal digit, followed by any single character, followed by a literal ipa.
However, none of these patterns would work as you used them because gsub will not treat it as a regular expression:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally…
You'd need to convert your pattern to a Regexp (using Regexp.new), or use a regular expression literal.
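The difference between a String pattern and a Regexp can be checked with a short sketch such as this one (the variable names are only for illustration):

```ruby
name = "app_copy--28.ipa"

# A String pattern is taken literally: gsub looks for the three characters "--*".
as_string  = name.gsub("--*", "")

# A Regexp, whether built with Regexp.new or written as a literal, is interpreted.
as_regexp  = name.gsub(Regexp.new("--.*"), "")
as_literal = name.gsub(/--.*/, "")

puts as_string
puts as_regexp
puts as_literal
```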
Try this pattern
--.*
This pattern will find any literal --, followed by zero or more of any character.
For example:
"app_copy--28.ipa".gsub(/--.*/, "") # => "app_copy"
Don't use gsub to try to change the string, simply use a pattern to match the part you want:
"app_copy--28.ipa"[/^(.+?)--/, 1] # => "app_copy"
String's [] takes a lot of different types of parameters. You can pass in a pattern, and the index of the capture that you want, to extract just that part. From the documentation:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
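As the quoted documentation says, the capture can be given by index or by name. A small sketch, where the group name base and the second file name are made up for illustration:

```ruby
filename = "app_copy--28.ipa"

by_index = filename[/^(.+?)--/, 1]               # capture group 1
by_name  = filename[/^(?<base>.+?)--/, "base"]   # same capture, referenced by name
no_match = "plain.ipa"[/^(.+?)--/, 1]            # nil when the pattern does not match

puts by_index
puts by_name
p no_match
```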
How about this?
str = "app_copy--28.ipa"
# Look for the full "--" separator so that a single hyphen in the name is not mistaken for it
str[0...str.index("--")]
# => "app_copy"
str = "app_copy--28.ipa"
str.split("--").first
# => "app_copy"

Why do I get the Regexp warning "warning: nested repeat operator ? and * was replaced with '*'"

I have a regular expression for parsing Norwegian street addresses:
STREET_ADDRESS_PATTERN = <<-REGEX
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
REGEX
It worked earlier, and I can't remember if I changed something, and in which case what I changed. In any case, now I'm getting this warning:
warning: nested repeat operator ? and * was replaced with '*'
And the match is returning nil. Can anybody see why I'm getting this warning?
Note: I'm currently using this (fake) address to test the expression: "Storgata 38H, 0273 Oslo".
Let's take a look at something you're doing to the poor regular expression engine:
(?<street_name>[\w\D\. ]+)\s+
The problem is inside the character class: [\w\D\. ]+. The following definitions are from Ruby's Regexp class documentation:
/\w/ - A word character ([a-zA-Z0-9_])
/\D/ - A non-digit character ([^0-9])
You're telling the engine to select:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
_
every character that is NOT 0123456789
. and spaces
In other words, every possible character. You'd do just as well to use:
(?<street_name>.+)
because that's going to be pretty greedy. This Rubular example shows your pattern is allowing the engine to capture everything thrown at it, including almost the entire string Storgata 38H, 0273 Oslo: http://rubular.com/r/nMfcB0cUdu
Also, [\.] is the same as [.] because a period loses its special meaning as a wildcard inside the brackets. You don't need to escape it to make it literal because it already is literal there.
I'd strongly recommend using Rubular to take a look at each section of your regex, and try matching against several other possible address strings, and see if Rubular says the patterns are going to match what you expect. Once you've done that, try putting together the complete pattern. As is, I think your subsections are interacting and masking some problems that will come back to bite you later.
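If Rubular is not at hand, the greediness described above can also be checked directly in Ruby. This is a minimal sketch using the test address from the question:

```ruby
address = "Storgata 38H, 0273 Oslo"

# \w and \D together cover every character, so each one is matched.
all_matched = address.scan(/[\w\D]/).length == address.length

# The street_name group therefore swallows as much as it can and only
# gives back enough for the trailing \s+ to match.
greedy_name = address[/(?<street_name>[\w\D\. ]+)\s+/, "street_name"]

puts all_matched
puts greedy_name
```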
My hope was that [\w\D] would select all word characters except numbers... Any way to do that?
Ah. Let's dive into the documentation again:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
You want to use the /[[:alpha:]]/ pattern. As displayed it would capture only one character, but it'd be within any of the POSIX set of "letter" characters, which is the range you want:
[4] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]/)
[
[0] "æ",
[1] "ø",
[2] "a",
[3] "n",
[4] "d",
[5] "å"
]
Here's a wee tweak:
[5] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]+/)
[
[0] "æ",
[1] "ø",
[2] "and",
[3] "å"
]
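Applied to the street name, this could look roughly like the sketch below. The pattern and the second address are hypothetical and only cover the street name and house number; they are not a full replacement for the original expression. The last line shows a character-class alternative for "word characters except digits", which, unlike [[:alpha:]], also allows the underscore and is limited to ASCII letters.

```ruby
# Letters (including non-ASCII ones), periods and spaces, matched lazily
# so that the house number is left for the next group.
street = /\A(?<street_name>[[:alpha:]. ]+?)\s+(?<house_number>\d+)/

m1 = street.match("Storgata 38H, 0273 Oslo")
m2 = street.match("Øvre Slottsgate 12, 0157 Oslo")

puts m1[:street_name], m1[:house_number]
puts m2[:street_name], m2[:house_number]

# [^\W\d] means: not a non-word character and not a digit.
no_digit_words = "Storgata 38H".scan(/[^\W\d]+/)
p no_digit_words
```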
Oh, now I see what I did. I replaced the ' delimiters of the string with <<-REGEX which means that all backslashes in the expression must now be escaped. Changing back to single ticks fixed the issue. After sepp2k's recommendation I further edited the Regex string into a literal:
STREET_ADDRESS_PATTERN = /
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
/xi
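The effect of the heredoc on the backslashes can be reproduced with a shortened, hypothetical fragment of the pattern. In an unquoted heredoc, \d becomes d and \s becomes a space before any Regexp sees the text. If the pattern is then compiled in extended mode, where whitespace is ignored, the resulting ")? *" would collapse to ")?*", which is a likely source of the nested repeat warning. A heredoc with a single-quoted identifier keeps the backslashes intact.

```ruby
# Unquoted heredoc: escapes are processed as in a double-quoted string.
processed = <<-REGEX
(?<house_number>\d+)(?<entrance>[A-Z])?\s*,
REGEX

# Single-quoted heredoc identifier: backslashes are kept verbatim.
verbatim = <<-'REGEX'
(?<house_number>\d+)(?<entrance>[A-Z])?\s*,
REGEX

puts processed
puts verbatim

# The verbatim text compiles to the intended pattern.
pattern = Regexp.new(verbatim, Regexp::EXTENDED)
match = pattern.match("38H,")
puts match[:house_number], match[:entrance]
```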
