Extract values after pattern in Ruby string - ruby

I have a string like this:
"<root><some ProdCode=\"40\" ProducerName=\"demo1\" ProdCode=\"40\" Need_Confirmation=\"1\"/><some ProdCode=\"40\" ProducerName=\"demo1\" ProdCode=\"40\" Need_Confirmation=\"1\"/></root>"
I'm trying to pull the content from this string which is between =\"content\" and put it in an array, like ["40","demo1","40","1",40......]

You can use scan to select elements by a regexp pattern, then strip the surrounding quote characters.
string.scan(/"[^"]+"/).map { |element| element.delete('\\"') }
Explanation of pattern:
/ – regexp starts
" – first char should be "
[^"]+ – next should be any char except ". + sign says that number of such chars should be at least 1.
" – next should be again "
/ – regexp ends
So string.scan(/"[^"]+"/) would return:
["\"40\"", "\"demo1\"", "\"40\"", "\"1\"", "\"40\"", "\"demo1\"", "\"40\"", "\"1\""]
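Then we can strip the quote characters with the delete method. The backslashes in the output above are only how inspect displays a quote inside a string; they are not part of the matched text.
A convenient tool for building regexps is http://rubular.com/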
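As a variation on the scan approach above, a capture group can do the quote stripping during the match itself. This is a minimal sketch; the variable names xml, values and pairs are illustrative and not from the question:

```ruby
# Sample input in the same shape as the question's string.
xml = '<root><some ProdCode="40" ProducerName="demo1" ProdCode="40" Need_Confirmation="1"/></root>'

# When the pattern contains a capture group, scan returns one array per match
# holding only the captured text, so the quotes never reach the result.
values = xml.scan(/="([^"]*)"/).flatten
p values

# Two groups give attribute name/value pairs instead.
pairs = xml.scan(/(\w+)="([^"]*)"/)
p pairs
```

Using [^"]* rather than [^"]+ also keeps empty attribute values (="") in the result, which may or may not be what you want.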

When your string is this simple you can use scan + regular expression like this:
result = html.scan(/ProdCode="\d+?"/)
If it is more complex you can use an HTML parser like Nokogiri or Oga.

Related

How can I select quoted strings that are outside html tags?

I am working on a syntax highlighter in ruby. From this input string (processed per line):
"left"<div class="wer">"test"</div>"right"
var car = ['Toyota', 'Honda']
How can I find "left", and "right" in the first line, 'Toyota', and 'Honda' on the second line?
I have (["'])(\\\1|[^\1]*?)\1 to highlight the quoted strings. I am struggling with the negative look behind part of the regex.
I tried appending another regex (?![^<]*>|[^<>]*<\/), but I can't get it to work with quoted strings. It works with simple alphanumeric only.
You can match one or more tokens by creating groups using parentheses in regex, and using | to create an or condition:
/("left")|("right")|('Toyota')|('Honda')/
Here's an example:
http://rubular.com/r/C8ONnxKYEV
EDIT
Just saw that the title of your question specifies that you want to search outside HTML tags.
Unfortunately this isn't possible using only regular expressions. The reason is that HTML, like any language built on nested delimiters such as "", '', and (), isn't regular. In other words, regexes have no way of tracking levels of nesting, so you'll need to use a parser along with your regex. If you're doing this strictly in Ruby, consider using a tool like Nokogiri or Mechanize to properly parse and interact with the DOM.
Description
This Ruby script first finds the HTML elements and replaces each one with a placeholder. Note that this is not perfect and is susceptible to many edge cases. The script then looks for all the single- and double-quoted values.
str = %Q["left" <div class="wer">"test"</div>"right"\n]
str = str + %Q<var car = ['Toyota', 'Honda']>
puts "SourceString: \n" + str + "\n\n"
str.gsub!(/(?:<([a-z]+)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>).*?<\/\1>/i, '_')
puts "SourceString after replacement: \n" + str + "\n\n"
puts "array of quoted values"
str.scan(/"[^"]*"|'[^']*'/)
Sample Output
SourceString:
"left" <div class="wer">"test"</div>"right"
var car = ['Toyota', 'Honda']
SourceString after replacement:
"left" _"right"
var car = ['Toyota', 'Honda']
=> ["\"left\"", "\"right\"", "'Toyota'", "'Honda'"]
Live Example
https://repl.it/CRGo
HTML Parsing
I do recommend using an HTML parsing engine instead. This one seems pretty decent for Ruby: https://www.ruby-toolbox.com/categories/html_parsing
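The replace-then-scan idea can also be done in a single pass by matching whole elements, stray tags and quoted strings with one alternation, then discarding the markup tokens. The following is only a sketch with the same limitations as the script above (no nested elements of the same name, no quotes containing escaped quotes); TOKEN and quoted_outside_elements are hypothetical names:

```ruby
# One alternation, tried left to right at each position:
#   1. a whole element <tag ...>...</tag> (\2 refers back to the tag name)
#   2. any other lone tag
#   3. a double-quoted string
#   4. a single-quoted string
# The outer group makes scan return the full token as the first capture.
TOKEN = /(<([a-z]+)[^>]*>.*?<\/\2>|<[^>]*>|"[^"]*"|'[^']*')/i

def quoted_outside_elements(line)
  line.scan(TOKEN)
      .map(&:first)                               # keep the whole token only
      .reject { |token| token.start_with?("<") }  # drop elements and tags
end

p quoted_outside_elements(%q{"left"<div class="wer">"test"</div>"right"})
p quoted_outside_elements(%q{var car = ['Toyota', 'Honda']})
```

Because the element alternative comes first, a quoted string inside an element is consumed as part of that element and never reported on its own.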

Ruby Regex Group Replacement

I am trying to perform regular expression matching and replacement on the same line in Ruby. I have some libraries that manipulate strings in Ruby and add special formatting characters to it. The formatting can be applied in any order. However, if I would like to change the string formatting, I want to keep some of the original formatting. I'm using regex for that. I have the regular expression matching correctly what I need:
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, 'New Text')
However, what I really want is the matching from the first grouping found in:
(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))
to be appended to New Text and replaced as opposed to just New Text. I'm trying to reference the match in the form of
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, '\1' + 'New Text')
but my understanding is that \1 only works when using \d or \k. Is there any way to reference that specific capturing group in my replacement string? Additionally, since I am using an asterisk for the [], I know that this grouping could occur more than once. Therefore, I would like to have the last matching occurrence yielded.
My expected input/output with a sample is:
Input: "\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
Input: "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
So the last grouping is found and appended.
You can use the following regex with back-reference \\1 in the replacement:
reg = /(\\e\[(?:[0-9]{1,2}|[3,9][0-8])m)+Text/
mystring = "\\e[1mHello there\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
mystring = "\\e[1mHello there\\e[44m\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
Output of the IDEONE demo:
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
Mind that your input has a backslash \ that needs escaping in a regular string literal. To match it inside the regex, we use a double backslash, as we are looking for a literal backslash.
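If the strings instead contain real escape characters (a double-quoted "\e" in Ruby is the single ESC byte, not a backslash followed by e), the same back-reference technique applies with \e written once in the pattern. This sketch assumes that case; ANSI_BEFORE_TEXT and recolor are hypothetical names, and \d{1,2} is a simplification of the original character classes:

```ruby
# One or more ANSI sequences directly before "Text". When a capture group is
# repeated with +, the group keeps the text of its LAST repetition.
ANSI_BEFORE_TEXT = /(\e\[\d{1,2}m)+Text/

def recolor(string)
  # In a single-quoted replacement, \1 inserts the captured group.
  string.gsub(ANSI_BEFORE_TEXT, '\1New Text')
end

first  = recolor("\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m")
second = recolor("\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m")

# inspect shows the escape characters instead of sending them to the terminal.
puts first.inspect
puts second.inspect
```

The block form string.gsub(ANSI_BEFORE_TEXT) { "#{$1}New Text" } should behave the same way and avoids thinking about backslash escaping in the replacement string.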

Why pipes are not deleted using "gsub" in Ruby?

I would like to delete from notes everything starting from the example_header. I tried to do:
example_header = <<-EXAMPLE
-----------------
---| Example |---
-----------------
EXAMPLE
notes = <<-HTML
Hello World
#{example_header}
Example Here
HTML
puts notes.gsub(Regexp.new(example_header + ".*", Regexp::MULTILINE), "")
but the output is:
Hello World
||
Why || isn't deleted?
The pipes in your regular expression are being interpreted as the alternation operator. Your regular expression will replace the following three strings:
"-----------------\n---"
" Example "
"---\n-----------------"
You can solve your problem by using Regexp.escape to escape the string when you use it in a regular expression (ideone):
puts notes.gsub(Regexp.new(Regexp.escape(example_header) + ".*",
Regexp::MULTILINE),
"")
You could also consider avoiding regular expressions and just using the ordinary string methods instead (ideone):
puts notes[0, notes.index(example_header)]
Pipes are part of regexp syntax (they mean "or"). You need to escape them with a backslash in order to have them count as actual characters to be matched.
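To see the escaping at work, here is a self-contained sketch built from the question's own heredocs. It prints what Regexp.escape produces and compares the regex route with a plain string method; escaped, trimmed and kept are illustrative variable names:

```ruby
example_header = <<-EXAMPLE
-----------------
---| Example |---
-----------------
EXAMPLE

notes = <<-HTML
Hello World
#{example_header}
Example Here
HTML

# Regexp.escape backslash-escapes every character that is special in a
# pattern, so the pipes are matched literally instead of acting as "or".
escaped = Regexp.escape(example_header)
puts escaped

# The /m flag lets . match newlines, so .* runs to the end of the string.
trimmed = notes.sub(/#{escaped}.*/m, "")

# Without any regexp: keep only what precedes the header.
kept = notes.partition(example_header).first

puts trimmed
puts kept
```

Note that partition returns the whole string as its first element when the header is not found, whereas notes.index would return nil in that case.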

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should give me an array like this:
[910, -6.2580000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me?
It would be great if the answer could clarify the use of the parentheses in the matching.
Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that only needs a very simple regular expression to strip everything that cannot be part of a number:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]
The brackets have different meanings.
[] defines a character class: it matches a single character that belongs to the class.
() defines a capturing group: the substring matched by the part inside the parentheses is saved so you can refer to it afterwards.
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
^^^^ here ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
See it here on Regexr
^ anchors the pattern to the start of the string and $ to the end.
It allows whitespace \s before and after each number and an optional fractional part (?:\.\d+)?. The outer group followed by + means this number pattern must match at least once.
maybe /(-?\d+(\.\d+)?)+/
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]
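The answers above either validate the whole line or extract the numbers, but not both. A possible way to combine them is to check the line against an anchored pattern first and only then scan it. This is a sketch under that reading of the question; NUMBER, NUMBER_LINE and parse_numbers are hypothetical names:

```ruby
# A single number: optional minus sign, digits, optional fractional part.
NUMBER = /-?\d+(?:\.\d+)?/

# A whole line made only of whitespace-separated numbers.
# \A and \z anchor to the start and end of the string.
NUMBER_LINE = /\A\s*#{NUMBER}(?:\s+#{NUMBER})*\s*\z/

# Returns an array of Integers/Floats, or nil if the line contains anything
# other than numbers.
def parse_numbers(line)
  return nil unless line.match?(NUMBER_LINE)

  line.scan(NUMBER).map { |n| n.include?(".") ? n.to_f : n.to_i }
end

p parse_numbers("910 -6.258000 6.290")
p parse_numbers("blabla9999 some more text 1.1")
p parse_numbers("7 8")
```

Unlike /-?\d+\.?\d+/, which needs at least two digits per match, this NUMBER pattern also accepts single-digit values such as 7.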

Simple Ruby Regex Question

I have a string in Ruby:
str = "<TAG1>Text 1<TAG1>Text 2"
I want to use gsub to get a string like this:
want = "<TAG2>Text 1</TAG2><TAG2>Text 2</TAG2>"
In other words, I want to save everything in between a <TAG1> and EITHER: 1) the next occurrence of a "<", or 2) the end of the string.
The best regex I could come up with was:
regex = /<TAG1>(.*)(?:<|$)/
But the problem with this is that it'll just match the entire str, where what I want is both matches within str. In other words, the end-of-string anchor ($) seems to take precedence over the "<" character. Is there a way to flip it around?
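/<TAG1>([^<]*)/ will match that. If there's no < it'll go all the way to the end of the string; otherwise it will stop when it hits a <. Your problem is that . matches < as well, and the greedy * grabs as much as it can. An alternative would be /<TAG1>(.*?)(?:<|$)/, which makes the * non-greedy. Note, however, that this version consumes the < that ends the match, so with gsub the following <TAG1> would no longer be found; turning the final group into a lookahead, (?=<|$), avoids that.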
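Applied with gsub, the negated character class gives the string the question asks for. A minimal sketch using the question's own input; result and lazy are illustrative variable names:

```ruby
str = "<TAG1>Text 1<TAG1>Text 2"

# [^<]* stops at the next "<" or at the end of the string, whichever comes
# first, so each <TAG1> section is matched separately. \1 inserts the
# captured text into the replacement.
result = str.gsub(/<TAG1>([^<]*)/, '<TAG2>\1</TAG2>')
puts result

# Equivalent lazy version: the lookahead checks for "<" or end of string
# without consuming it, so the next <TAG1> can still be matched.
lazy = str.gsub(/<TAG1>(.*?)(?=<|$)/, '<TAG2>\1</TAG2>')
puts lazy
```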
