Get all tags following a certain one with Mechanize? (Ruby)

How can I get all elements following a given one, like:
<div id="exemple">
<h2 class="target">foo</h2>
<p>bla bla</p>
<ul>
<li>bar1</li>
<li>bar2</li>
<li>bar3</li>
</ul>
<h4>baz</h4>
<ul>
<li>lot</li>
</ul>
<div>of</div>
<p>possible</p>
<p>tags</p>
after
</div>
I need to detect <h2 class="target"> and get all tags up to the next <h4>, ignoring the <h4> and all following tags (if no <h4> exists, I have to get all tags to the end of the parent [here: the end of the <div>]).
The content is dynamic and unpredictable. The only rule is: we know there is a target and there is an <h4> (or the end of the parent element). I need to get all tags between both and exclude all the others.
With this example, I need to get the following HTML:
<h2 class="target">foo</h2>
<p>bla bla</p>
<ul>
<li>bar1</li>
<li>bar2</li>
<li>bar3</li>
</ul>
I can already get the target with: target = page.at('#exemple .target')
I know the next_sibling method, but how can I test the tag type of the current node?
I am thinking of something like this to walk the node tree:
html = ''
while target && target.name != 'h4'
html << target.to_html
target = target.next_sibling
end
How can I do this?

You can subtract the ones you don't want from your nodeset:
h2 = page.at('h2')
(h2.search('~ *') - h2.search('~ h4','~ h4 ~ *')).each do |el|
# el is not a h4 and does not follow a h4
end
XPath might make more sense here, but this works with plain CSS selectors.
Your idea of iterating over next siblings can work too:
el = page.at('h2 ~ *')
while el && el.name != 'h4'
# do something with el
el = el.at('+ *')
end

Looks like you want to return the h2 element and its following siblings. I'm not clear on whether you want to keep or discard the h4; if you want to keep it, the XPath would be:
//h2[@class="target"] | //h2[@class="target"]/following-sibling::*
If you need to exclude the h4:
//h2[@class="target"] | //h2[@class="target"]/following-sibling::*[not(self::h4)]
Edit: If you need to exclude an h4 and anything beyond:
//h2[@class="target"] | //h2[@class="target"]/following-sibling::*[not(self::h4) and not(preceding-sibling::h4)]

Related

Scraping content from html page

I'm using Nokogiri to scrape web pages. The structure of the page is an unordered list containing multiple list items, each of which has a link, an image and text, all contained in a div.
I'm trying to find a clean way to extract the elements in each list item so I can have each li contained in an array or hash like so:
li[0] = ['Acme co 1', 'image1.png', 'Customer 1 details']
li[1] = ['Acme co 2', 'image2.png', 'Customer 2 details']
At the moment I get all the elements in one go then store them in separate arrays. Is there a better, more idiomatic way of doing this?
This is the code atm:
data = Nokogiri::HTML(html)
images = []
names = []
data.css('ul li img').each {|l| images << l}
data.css('ul li a').each {|a| names << a.text }
This is the html I'm working from:
<ul class="customers">
<li>
<div>
Acme co 1
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 1 details
</div>
</div>
</li>
<li>
<div>
Acme co 2
<div class="customer-image">
<img src="image2.png"/>
</div>
<div class=" customer-description">
Customer 2 details
</div>
</div>
</li>
</ul>
Thanks
Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more brief and idiomatic by replacing your #each methods with #map:
data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)
This simplifies your code slightly, but your original version wasn't too bad. Note that the simplification may not generalise if you are, for example, scraping images from multiple regions on the page; in that case, reverting to something like your original approach is fine.

XPath in RSelenium for indexing list of values

Here is an example of html:
<li class="index i1"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample.com">
<li class="index i2"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample2.com">
I would like to take the value of every href in the <a> elements. What makes this a list is the class on the first li, whose name changes: i1, i2, ...
So I have a counter and increment it as I take each value.
i <- 1
stablestr <- "index "
myVal <- paste(stablestr , i, sep="")
So even if I try just to access the element with the myVal index using this:
profile <- remDr$findElement(using = 'xpath', "//*/input[@li = myVal]")
profile$highlightElement()
or the href using this
profile <- remDr$findElement(using = 'xpath', "/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']")
profile$highlightElement()
Is there anything wrong with the XPath?
Your HTML structure is invalid. Your <li> tags are not closed properly, and it seems you are confusing <ol> with <li>. But for the sake of the question, I assume the structure is as you write, with properly closed <li> tags.
Then, constructing myVal is not right. It will yield "index 1" while you want "index i1". Use "index i" for stablestr.
Now for the XPath:
//*/input[@li = myVal]
This is obviously wrong, since there is no input in your XML. Also, myVal inside the quoted string is literal text: the XPath engine never sees your R variable, and Selenium has no XPath variable bindings, so you have to build the string in R, e.g. with paste0. Finally, the * seems to be unnecessary. Try this:
remDr$findElement(using = 'xpath', paste0("//li[@class='", myVal, "']"))
In your second XPath, there are also some errors:
/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']
Three problems: myVal is again literal text rather than your variable; div[@id='bare'] should be div[@class='bare'] (bare is a class in your HTML); and a[@class='tlt'] will not match because the class attribute is actually 'tlt mhead'.
The first two issues are easy to fix. The third one is not. You could use contains(@class, 'tlt'), but that would also match if the class were, e.g., 'tltt', which is probably not what you want; anyway, it might suffice for your use case. Fixed XPath, built in R:
paste0("/li[@class='", myVal, "']/ol[@id='rem']/div[@class='bare']/h3/a[contains(@class, 'tlt')]")

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
@parsed = Nokogiri::HTML(f, nil, 'windows-1251')
puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> with id=="f5". With my current XPath selector it's not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to stay directly next to its divs, in the same order as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath :
//*[self::div or self::span][@id="f5"]
xpathtester demo
The XPath above will find elements named either div or span whose id attribute equals "f5".
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

Watir: how to make a call to an element of a page object, inside a line like $browser.link.click?

We have page object elements like:
link(:test_link, xpath: ".//a[@id = '3']")
unordered_list(:list, id: 'test')
And the code:
def method(elementcontainer, elementlink)
elementcontainer = elementcontainer.downcase.gsub(' ', '_')
elementlink = elementlink.downcase.gsub(' ', '_')
object = send("#{elementcontainer}_element")
object2 = send("#{elementlink}_element")
total_results_1 = object.element.links(id: '3').length
total_results_2 = object.element.links(object2).length
end
The last 2 lines contain the mystery.
The total_results_1 is able to get the number of links contained in the unordered list that have id = '3'.
total_results_2 does not work (of course). I don't want to write the identification of the links again in the middle of the code; that is already done in the page object.
How is it possible to write something like the total_results_2 line, but in a working version?
I might be misunderstanding the question, but I do not believe you need to create a method for what you want. It can all be done using the page object accessors.
Say we have the following page (I matched this to your accessors, though it seems unlikely that all links would have the same id):
<html>
<body>
<a id="3" href="#">1</a>
<ul id="test">
<li><a id="3" href="#">2</a></li>
<li><a id="3" href="#">3</a></li>
<li><a id="3" href="#">4</a></li>
</ul>
<a id="3" href="#">5</a>
</body>
</html>
As you did, you could define the list with the accessor:
unordered_list(:list, id: 'test')
To get the links with id 3, but only those within the list, you could:
Define the links as a collection, i.e. use links instead of link.
Use a block to locate the elements. This allows you to account for the nesting, i.e. to locate links within the list element.
This would be done with:
links(:test_link){ list_element.link_elements(:id => '3') }
All together, your page object would be:
class MyPage
include PageObject
unordered_list(:list, id: 'test')
links(:test_link){ list_element.link_elements(:id => '3') }
end
To find the number of links, you would access the element collection and check its length.
browser = Watir::Browser.new
browser.goto('your_test_page.htm')
page = MyPage.new(browser)
puts page.test_link_elements.length
#=> 3

Returning a list of <li> WebElements via find_element_by_xpath

So I am using a combination of Selenium and Python 2.7 (and, if it matters, the browser is Firefox). I am new to XPath, but it seems very useful for fetching WebElements.
I have the following HTML file that I am parsing through:
<html>
<head></head>
<body>
..
<div id="childItem">
<ul>
<li class="listItem"><img/><span>text1</span></li>
<li class="listItem"><img/><span>text2</span></li>
...
<li class="listItem"><img/><span>textN</span></li>
</ul>
</div>
</body>
</html>
Now I can use the following code to get a list of all the li elements:
root = element.find_element_by_xpath('..')
child = root.find_element_by_id('childItem')
list = child.find_elements_by_css_selector('ul > li.listItem')
I am wondering how I can do this in an XPath statement. I have tried a few statements, but the simplest is:
list = child.find_element_by_xpath('li[@class="listItem"]')
But I always end up getting the error:
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"li[@class=\\"listItem\\"]"}';
As I do have a work around (the first three lines) this is not critical for me, but I would like to know what I am doing wrong.
You are missing the .// at the start of the XPath:
list = child.find_elements_by_xpath('.//li[@class="listItem"]')
The .// means to search anywhere within the child element, and find_elements (plural) returns all matches as a list.
