How can I create a nokogiri case insensitive Xpath selector? - ruby

I'm using nokogiri to select the 'keywords' attribute like this:
puts page.parser.xpath("//meta[@name='keywords']").to_html
One of the pages I'm working with has the keywords label with a capital "K" which has motivated me to make the query case insensitive.
<meta name="keywords"> AND <meta name="Keywords">
So, my question is: What is the best way to make a nokogiri selection case insensitive?
EDIT: Tomalak's suggestion below works great for this specific problem. I'd also like to use this example to understand nokogiri better, though, and I have a couple of questions that I haven't been able to answer by searching. For example, are the regex 'pseudo classes' described in the Nokogiri docs appropriate for a problem like this?
I'm also curious about the matches?() method in nokogiri. I have not been able to find any clarification on the method. Does it have anything to do with the 'matches' concept in XPath 2.0 (and therefore could it be used to solve this problem)?
Thanks very much.

Nokogiri allows custom XPath functions. The nokogiri docs that you link to show an inline class definition for when you're only using it once. If you have a lot of custom functions or if you use the case-insensitive match a lot, you may want to define it in a class.
class XpathFunctions
  def case_insensitive_equals(node_set, str_to_match)
    node_set.find_all { |node| node.to_s.downcase == str_to_match.to_s.downcase }
  end
end
Then call it like any other XPath function, passing in an instance of your class as the 2nd argument.
page.parser.xpath("//meta[case_insensitive_equals(@name,'keywords')]",
                  XpathFunctions.new).to_html
In your Ruby method, node_set will be bound to a Nokogiri::XML::NodeSet. In the case where you're passing in an attribute value like @name, it will be a NodeSet with a single Nokogiri::XML::Attr. So calling to_s on it gives you its value. (Alternatively, you could use node.value.)
Unlike using XPath translate where you have to specify every character, this works on all the characters and character encodings that Ruby works on.
Also, if you're interested in doing other things besides case-insensitive matching that XPath 1.0 doesn't support, it's just Ruby at this point. So this is a good starting point.

Wrapped for legibility:
puts page.parser.xpath("
  //meta[
    translate(
      @name,
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
      'abcdefghijklmnopqrstuvwxyz'
    ) = 'keywords'
  ]
").to_html
There is no "to lower case" function in XPath 1.0, so you have to use translate() for this kind of thing. Add accented letters as necessary.
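As a rough sketch of how the two translate() alphabets could be generated instead of typed by hand, the Ruby helper below builds the predicate from the search value itself. Only the letters that occur in the value need translating, because any other character in the attribute makes the comparison fail anyway, so accented letters in the value are covered without listing them. The helper name ci_attr_equals is hypothetical (it is not part of Nokogiri), and it assumes the value contains no single quote, since XPath 1.0 has no escape for one inside a single-quoted literal.

```ruby
# Hypothetical helper: builds an XPath 1.0 predicate that compares an
# attribute to a value case-insensitively via translate().
def ci_attr_equals(attr, value)
  # Assumption: values with a single quote are rejected rather than escaped.
  raise ArgumentError, "value must not contain a single quote" if value.include?("'")

  target = value.downcase
  # Pair each distinct character with its upper-case form, keeping only
  # real one-to-one case pairs (skips digits, spaces, and e.g. "ß" -> "SS").
  pairs = target.chars.uniq
                .map { |c| [c.upcase, c] }
                .select { |up, low| up != low && up.length == 1 }

  upper = pairs.map(&:first).join
  lower = pairs.map(&:last).join
  "translate(@#{attr}, '#{upper}', '#{lower}') = '#{target}'"
end

puts ci_attr_equals("name", "Keywords")
# prints: translate(@name, 'KEYWORDS', 'keywords') = 'keywords'
```

The resulting string can be interpolated into the query, for example "//meta[#{ci_attr_equals('name', 'keywords')}]".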

Related

How do I write a CSS selector that looks for an element starting with text in a case-insensitive way?

I'm using Rails 5.0.1 with Nokogiri. How do I select a CSS element whose text starts with a certain string in a case insensitive way? Right now I can search for something in a case-sensitive way using
doc.css("#select_id option:starts-with('ABC')")
but I would like to know how to disregard case when looking for an option that starts with certain text?
Summary: It's ugly. You're better off just using Ruby:
doc.css('select#select_id > option').select{ |opt| opt.text =~ /^ABC/i }
Details
Nokogiri uses libxml2, which uses XPath to search XML and HTML documents. Nokogiri transforms CSS-like expressions into XPath. For example, for your CSS-like selector, this is what Nokogiri actually searches for:
Nokogiri::CSS.xpath_for("#select_id option:starts-with('ABC')")
#=> ["//*[@id = 'select_id']//option[starts-with(., 'ABC')]"]
The expression you wrote is not actually CSS. There is no :starts-with() pseudo-class in CSS, not even proposed in Selectors 4. What there is is the starts-with() function in XPath, and Nokogiri is (somewhat surprisingly) allowing you to mix XPath functions into your CSS and carrying them over to the XPath it uses internally.
The libxml2 library is limited to XPath 1.0, and in XPath 1.0 case-insensitive searches are done by translating all characters to lowercase. The XPath expression you'd want is thus:
//select[@id='select_id']/option[starts-with(translate(.,'ABC','abc'),'abc')]
(Assuming you only care about those characters!)
I'm not sure that you CAN write CSS+XPath in a way that Nokogiri would produce that expression. You'd need to use the xpath method and feed it that query.
Finally, you can create your own custom CSS pseudo-classes and implement them in Ruby. For example:
class MySearch
  def insensitive_starts_with(nodes, str)
    nodes.find_all { |n| n.text =~ /^#{Regexp.escape(str)}/i }
  end
end
doc.css("select#select_id > option:insensitive_starts_with('ABC')", MySearch.new)
...but all this gives you is re-usability of your search code.
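The regex inside such a helper is worth getting right wherever it ends up running. Here is a minimal sketch on plain strings, using a hypothetical helper name starts_with_ci?. Regexp.escape keeps characters such as "." in the prefix from acting as wildcards, and \A anchors at the start of the string, whereas ^ also matches after every newline.

```ruby
# Hypothetical helper: true if text begins with prefix, ignoring case.
def starts_with_ci?(text, prefix)
  # \A anchors at the very start of the string; Regexp.escape makes the
  # prefix match literally even if it contains regex metacharacters.
  text.match?(/\A#{Regexp.escape(prefix)}/i)
end

puts starts_with_ci?("abc def", "ABC")   # true
puts starts_with_ci?("abc", "A.C")       # false: the dot is literal
puts starts_with_ci?("x\nabc", "ABC")    # false: \A ignores later lines
```

For simple ASCII prefixes, text.downcase.start_with?(prefix.downcase) would likely do the same job without a regex.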

How can I remove HTML using Ruby regular expressions?

I want to remove everything contained within two HTML tags, as well as the tags themselves, using regular expressions in Ruby. Here's an example:
<tag>a bunch of stuff between the tags, no matter what it is</tag>
Basically, I want to use gsub! to filter all instances of this type out, like so:
text_file_contents.gsub!(/appropriate regex/, '')
What would be a good Ruby regular expression for doing so?
As has been said in the comments, use an HTML parser. If, however, you just want to remove everything between two tags and don't care about nesting (e.g. <tag><tag></tag></tag>), then you can simply use:
text_file_contents.gsub!(/<tag>.*?<\/tag>/m, '')
(The m flag lets . match newlines, so content that spans several lines is removed too.) But again, this is flaky. Nokogiri is really easy to use and will be a lot more stable, so please use that:
require 'nokogiri'
doc = Nokogiri::XML(yourfile)
doc.search('//tag').each do |node|
  node.remove
end
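To see why the regex approach is flaky, here is a small sketch on a made-up input string with one multi-line element and one nested element. The m flag is used so that . also matches newlines. The lazy match stops at the first closing tag, so the nested case leaves debris behind.

```ruby
# Made-up sample: one element spanning two lines, then a nested element.
text = "keep <tag>drop\nthis</tag> keep <tag>a<tag>b</tag>c</tag> end"

# Lazy match from each <tag> to the nearest </tag>.
cleaned = text.gsub(/<tag>.*?<\/tag>/m, '')

puts cleaned.inspect
# prints: "keep  keep c</tag> end"
```

The first element is removed cleanly, but for the nested one the match ends at the inner closing tag, leaving "c</tag>" in the output.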

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
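For reference, a short sketch of the ways to get at the capture group, using a made-up html string. String#[] with a capture index and MatchData#[] both return only the group, while MatchData#to_s (which is what puts calls) returns the entire match.

```ruby
# Made-up sample input.
html = "<head><title>foobar</title></head>"
pattern = /<title>(.*?)<\/title>/

match = pattern.match(html)
whole = match.to_s          # entire match, including the tags
inner = match[1]            # first capture group only
same  = html[pattern, 1]    # same result via String#[]

# String#[] returns nil when the pattern does not match.
missing = "no title here"[pattern, 1]

puts whole   # <title>foobar</title>
puts inner   # foobar
```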

ruby regex links not already in anchor tag

I am using ruby 1.8.7. I am not using rails.
How do I find all the links that are not already inside an anchor tag?
s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }
The output of above string should be
www.b.com
www.c.com
I know "b" tag before www.a.com complicates the case but that's what I have to work with.
You are going to want to use a real XML parser (Nokogiri will do). Regexes are unsuitable for a task like this. Especially so in ruby 1.8.7 where negative look behind is not supported.
Dirty way to get rid of anchor tags. Doesn't work the way you want if they're nested. Also use a real parser ;-)
s.gsub(%r[<a\b.*?</a>]i, "")
=> " www.b.com <div>www.c.com</div> "
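Building on that, here is a rough sketch of the full round trip for the sample string: strip the anchors, then scan what is left. The www-only pattern is an assumption based on the sample input. Real link detection would need a broader pattern, and nested or malformed anchor tags would still trip it up.

```ruby
s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }

# Drop every <a ...>...</a> span (lazy match, case-insensitive, multi-line).
without_anchors = s.gsub(%r{<a\b.*?</a>}im, "")

# Assumption: a "link" is anything that starts with "www." followed by
# word characters, dots, or hyphens.
links = without_anchors.scan(/\bwww\.[\w.-]+/)

puts links
# prints:
# www.b.com
# www.c.com
```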

Using upper-case and lower-case xpath functions in selenium IDE

I am trying to write an XPath query using the XPath functions lower-case or upper-case, but they do not seem to work in Selenium (where I test my XPath before I apply it).
Example that does NOT work:
//*[.=upper-case('some text')]
I have no problem locating the nodes I need in complex path and even using aggregated functions, as long as I don't use the upper and lower case.
Has anyone encountered this before? Does it make sense?
Thanks.
upper-case() and lower-case() are XPath 2.0 functions. Chances are your platform supports XPath 1.0 only.
Try:
translate('some text','abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')
which is the XPath 1.0 way to do it. Unfortunately, this requires knowledge of the alphabet the text uses. For plain English, the above probably works, but if you expect accented characters, make sure you add them to the list.
In most environments you are using XPath out of a host language of some sort, and can use the host language's capabilities to work around this XPath 1.0 limitation by externally providing upper- and lower-case variants of the search string to translate().
For example, in Python:
search = 'Some Text'
lc = search.lower()
uc = search.upper()
xpath = f"//p[contains(translate(., '{lc}', '{uc}'), '{uc}')]"
This would produce the following XPath expression:
//p[contains(translate(., 'some text', 'SOME TEXT'), 'SOME TEXT')]
which searches case-insensitively and works for arbitrary search text.
If you are going to need upper case in multiple places in your XSLT, you can define variables for the lower-case and upper-case alphabets and then use them in your translate() calls everywhere. It should make your XSLT much cleaner.
Example at XSL/XPATH : No upper-case function in MSXML 4.0 ?
