How do I select an element with its text content via CSS not XPath? - ruby

"Nokogiri: How to select nodes by matching text?" can do this via XPath; however, I am looking for a way to use a CSS selector that matches the text of an element.
PyQuery and PHPQuery can do this. Isn't there a jQuery API lib for Ruby?

Nokogiri (now) implements jQuery selectors, making it possible to search the text of a node:
For instance:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
=> "bar"

This cannot be done with pure CSS; you'll have to combine it with Ruby code:
doc = Nokogiri::HTML("<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>")
p doc.css('li').select{|li|li.text =~ /Apple/}

Related

Excluding contents of <span> from text using Watir

Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have more complicated HTML, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
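Note that chomp only strips a trailing suffix, so this approach works when the span sits at the end of the h1. A minimal sketch with plain strings standing in for what Watir's text would return (the strings are assumptions based on the HTML in the question, and `top_level_text` is a hypothetical helper):

```ruby
# Hypothetical helper: remove the child's text from the end of the parent's
# text. String#chomp(suffix) removes the suffix only if the string ends with it.
def top_level_text(parent_text, child_text)
  parent_text.chomp(child_text).strip
end

h1_text   = 'This is the text I want I do not want this text'
span_text = 'I do not want this text'

puts top_level_text(h1_text, span_text)           # prints "This is the text I want"

# If the child text is not at the end, chomp leaves the string unchanged.
puts top_level_text('before SPAN after', 'SPAN')  # prints "before SPAN after"
```

If the span can appear in the middle of the heading, the Nokogiri approach of keeping only text nodes is likely the safer choice.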
If there is only one h1, you can omit the id:
# String#remove comes from ActiveSupport, not core Ruby
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more:
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  # use the element passed in, not a hard-coded @b.h1
  el.text.chomp(el.children.collect(&:text).join(' '))
end

Can I use Selenium and Nokogiri to locate an element based on a nearby label?

Let's say I want to scrape the "Weight" attribute from the following content on a website:
<div>
<h2>Details</h2>
<ul>
<li><b>Height:</b>6 ft</li>
<li><b>Weight:</b>6 kg</li>
<li><b>Age:</b>6</li>
</ul>
</div>
All I want is "6 kg". But it's not labeled, and neither is anything around it. But I know that I always want the text after "Weight:". Is there a way of selecting an element based on the text near it or in it?
In pseudocode, this is what it might look like:
require 'selenium-webdriver'
require 'nokogiri'
doc = parsed document
div_of_interest = doc.div where text of h2 == "Details"
element_of_interest = <li> element in div_of_interest with content that contains the string "Weight:"
selected_text = (content in element) minus ("<b>Weight:</b>")
Is this possible?
You can write the following code
p driver.find_elements(xpath: "//li").detect{|li| li.text.include?'Weight'}.text[/:(.*)/,1]
output
"6 kg"
My suggestion is to use Watir, which is a wrapper around the Ruby Selenium bindings, where you can simply write:
p b.li(text: /Weight/).text[/:(.*)/,1]
Yes.
require 'nokogiri'
Nokogiri::HTML.parse(File.read(path_to_file))
.css("div > ul > li")
.children # get the child nodes of each 'li' (the 'b' elements and the text nodes)
.each_slice(2) # pair a 'b' item and the text following it
.find{|b, text| b.text == "Weight:"}
.last # extract the text element
.text
will return
"6 kg"
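You can locate the element through pure XPath: use the contains() function, which returns a Boolean indicating whether its second argument is found in the first, and pass it . (the string value of the node, including its descendants) and the target string. Using text() here would not match, because "Weight:" sits inside the <b> child rather than directly in the <li>.
xpath_locator = '//div/ul/li[contains(., "Weight:")]'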
value = driver.find_element(:xpath, xpath_locator).text.partition('Weight:').last
Then just get the value after "Weight:".

Get element in particular index nokogiri

How can I get the element at index 2?
For example in following HTML I want to display the third element i.e a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at which returns the first Node that matches the selector. //div[2] is an XPath selector that will return the second <div> found. search could be used instead of at, but it returns a NodeSet, which is like an array, and would mean I'd need to extract that particular node.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag, means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way is that it will be slower and needlessly memory-intensive on big documents, since search finds every node in the document that matches the selector, and all but one are then thrown away. search, css and xpath all behave that way, so if you only need the first matching node, use at or its at_css and at_xpath equivalents, and provide a sufficiently specific selector to find just the tag you want.
'body:nth-child(2)' doesn't work because you're not using it right, according to ":nth-child()" and how I understand it works. nth-child takes the tag supplied and matches it only if it is the "nth" child of its parent. So you're asking for a body tag that is the second child of its "html" parent, not for a child of body. At best that matches body itself, because the skeleton of a correctly formed HTML document is:
<html>
<head></head>
<body></body>
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use div:nth-child(3), which says "find a div that is the third child of its parent", the parent here being "body", and which results in the second div tag.
Back to how Nokogiri can be told to parse a document. Meditate on the difference between these:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
If you can modify the HTML, add ids and classes so that what you are looking for is easy to target (and add the body tag).
If you cannot modify the HTML, keep your selector simple and access the second element of the array:
html_doc.css('div')[1]

Extract Image from Instagram using Nokogiri?

I'm trying to extract an image from Instagram using Nokogiri. I've tried so many things that I don't even think it's a good idea to show what I've done so far.
I'm starting with:
image_url = Nokogiri::HTML(open('http://instagram.com/p/g3mXJ1p109/'))
And I've noticed the picture on Instagram.com is in the following div:
<div class="Image iLoaded iWithTransition Frame" src="http://distilleryimage9.ak.instagram.com/b711daf4508c11e385ff1234c61f9f0f_8.jpg"></div>
Ok, just one thing I've tried:
Nokogiri::HTML(open(pic)).css('body script').children.first
and it gives me this:
#<Nokogiri::XML::CDATA:0x767c904 "\nwindow._csrf_token = '88b78a58e333056bcc67e338f06ce786';\nwindow._jscalls = [\n\n['bluebar', 'init', []],\n\n['framework/config', 'init', [{staticRoot: '//d36xtkk24g8jdx.cloudfront.net/bluebar/89c8068'}]],\n\n [\"lib\\/fullpage\\/transitions\",\"bootstrap\",[{\"componentName\":null,\"moduleName\":\"lib\\/ui\\/media\\/DesktopPPage\",\"props\":{\"viewer\":null,\"shortcode\":\"g3mXJ1p109\",\"prerelease\":false,\"staticRoot\":\"\\/\\/d36xtkk24g8jdx.cloudfront.net\\/bluebar\\/89c8068\",\"media\":{\"code\":\"g3mXJ1p109\",\"comments\":{\"nodes\":[]},\"date\":1384805104.0,\"likes\":{\"count\":0,\"viewer_has_liked\":false,\"nodes\":[]},\"owner\":{\"username\":\"marunbai\",\"requested_by_viewer\":false,\"profile_pic_url\":\"http:\\/\\/images.ak.instagram.com\\/profiles\\/anonymousUser.jpg\",\"id\":\"549577518\",\"followed_by_viewer\":false},\"is_video\":false,\"id\":\"592110592901733693\",\"display_src\":\"http:\\/\\/distilleryimage9.ak.instagram.com\\/b711daf4508c11e385ff1234c61f9f0f_8.jpg\"}}}]],\n\n];\n">
Nokogiri cannot evaluate the JavaScript. As a result, if you look at the HTML that Nokogiri sees, the div tag you see will not be there.
However, the page does contain a meta-element with the image source. You can extract the desired value from there:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://instagram.com/p/g3mXJ1p109/')) # Kernel#open no longer opens URLs on Ruby 3.0+
p doc.at_css('meta[property="og:image"]')['content']
#=> "http://distilleryimage9.ak.instagram.com/b711daf4508c11e385ff1234c61f9f0f_8.jpg"

Inserting an element in local HTML file

I am trying to write a Ruby script that would read a local HTML file, and insert some more HTML (basically a string) into it after a certain #divid.
I am kinda noob so please don't hesitate to put in some code here.
Thanks
I was able to do this with the following:
doc = Nokogiri::HTML(open('file.html'))
data = "<div>something</div>"
doc.children.css("#divid").first.add_next_sibling(data)
And then (over)write the file with the modified document:
File.open("file.html", 'w') {|f| f.write(doc.to_html) }
This is a slightly more correct way to do it:
require 'nokogiri'
html = '<html><body><div id="certaindivid">blah</div></body></html>'
doc = Nokogiri::HTML(html)
doc.at_css('div#certaindivid').add_next_sibling('<div>junk goes here</div>')
print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div id="certaindivid">blah</div>
<div>junk goes here</div>
</body></html>
Notice the use of .at_css(), which finds the first occurrence of the target node and returns it, avoiding getting a nodeset back, and relieving you of the need to grab the .first() node.
