Scraping content from html page - ruby

I'm using nokogiri to scrape web pages. The structure of the page is made of an unordered list containing multiple list items each of which has a link, an image and text, all contained in a div.
I'm trying to find a clean way to extract the elements in each list item so I can have each li contained in an array or hash like so:
li[0] = ['Acme co 1', 'image1.png', 'Customer 1 details']
li[1] = ['Acme co 2', 'image2.png', 'Customer 2 details']
At the moment I get all the elements in one go then store them in separate arrays. Is there a better, more idiomatic way of doing this?
This is the code atm:
data = Nokogiri::HTML(html)
images = []
names = []
data.css('ul li img').each {|l| images << l}
data.css('ul li a').each {|a| names << a.text }
This is the html I'm working from:
<ul class="customers">
<li>
<div>
Acme co 1
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 1 details
</div>
</div>
</li>
<li>
<div>
Acme co 2
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 2 details
</div>
</div>
</li>
</ul>
Thanks

Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more brief and idiomatic by replacing your #each methods with #map:
data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)

This simplifies your code slightly, but your original version wasn't too bad.
My simplification may not generalise if you are, for example, scraping images from multiple regions on the page. In that case, reverting to something like your original may be fine.

Related

Using XPath to get complete text of paragraph with span inside

<ul>
<li class="xyz">
<div class="divClass">
<span class="ContentItem---status---dL0iS">
<span>Success</span>
</span>
<p class="ContentItem---title---37IqA">
<span>Test Check</span>
: Please display the text
</p>
</div>
</li>
<li class="xyz">
<div class="divClass">
<span class="ContentItem---status---dL0iS">
<span>Not COMPLETED</span>
</span>
<p class="ContentItem---title---37IqA">
<span>Knowledge</span> A Team
</p>
</div>
</li>
.... and so on
</ul>
This is my HTML structure. The text "Test Check" is inside a span, and ": Please display the text" is inside the enclosing paragraph tag.
I need to check whether the structure contains the complete text "Test Check: Please display the text".
I have tried several approaches but couldn't work out the complete path. This is what I tried:
//span[text()='Test Check']/p[text()=': Please display the text']
Can you please provide me the xpath for this?
Here is one possible way to find and retrieve that element within the given HTML text, using Python and BeautifulSoup. I hope this solves your problem.
import re

from bs4 import BeautifulSoup


def get_tag_if_present(html_text):
    soup_obj = BeautifulSoup(html_text, "html.parser")
    # Find every text node that contains "Test Check".
    test_check = soup_obj.find_all(text=re.compile(r"Test Check"))
    result_val = "NOT FOUND"
    if test_check:
        for each_value in test_check:
            parent_tag_span = each_value.parent
            if parent_tag_span.name == "span":
                parent_p_tag = parent_tag_span.parent
                # Accept only a span nested in a p that also holds the rest of the text.
                if parent_p_tag.name == "p" and "Please display the text" in parent_p_tag.get_text():
                    result_val = parent_p_tag
                    break
    return result_val
The returned result_val holds the p tag that contains the matching text. The function returns "NOT FOUND" if no such element exists.
I've assumed that the text lives in a "span" tag nested inside a "p" tag; feel free to remove those conditions to match the text anywhere in the given HTML.

Get content after header tag with Nokogiri

I am playing with Nokogiri just to learn it and am trying to write a little CL scraper. Right now I am trying to match up each State on the main page with the cities underneath. Below is a snippet of the HTML:
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
I can already pull out just this div class of "colmask" easy enough. But now I am just trying to get the UL directly after each h4, but can't find a way to do it so far. Suggestions?
You can get ul elements after h4 using following-sibling:
require 'nokogiri'
html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
EOF
doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
puts node.to_html
end
To select ul after an h4 with exact text:
puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map{ |state|
cities = state.next_element.search('li')
[state.text, cities.map(&:text)]
}.to_h
At this point states_and_cities is a hash of arrays:
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
If you're concerned about having a big structure, it'd be very easy to convert states to a hash where each state's name is a key, and the associated value is the state's node. Then, that node could be grabbed to find only the cities for the particular state.
However, if you're running this code to generate content for a web-page on the fly, then you're going about it wrong. The information for states and cities should be dumped into a database where it can be accessed much more quickly. Then you won't have to do it every time the page is generated.
Being kind and gentle to other sites is important. Research the HTTP HEAD request: it's your key to determining whether you should retrieve a page in full. Also, learn how to read the cache information from the HTTP headers returned by a server, which tells you what your minimum refresh rate should be. Finally, pay attention to the robots.txt file, which tells you what the site considers safe for you to scrape; ignoring it can lead to being banned.
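As a rough illustration of the HEAD idea, the sketch below uses Ruby's standard Net::HTTP to fetch only the headers and pull out the caching hints. The helper names max_age_seconds and cache_info are hypothetical, and which headers a given server actually sends is an assumption you would need to check per site.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: extract max-age (in seconds) from a
# Cache-Control header value, or nil when it is absent.
def max_age_seconds(cache_control)
  match = cache_control.to_s.match(/max-age=(\d+)/i)
  match && match[1].to_i
end

# Hypothetical helper: send a HEAD request (headers only, no body)
# and return the caching-related header values.
def cache_info(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  {
    last_modified: response['Last-Modified'],
    etag: response['ETag'],
    max_age: max_age_seconds(response['Cache-Control'])
  }
end

# Usage (needs network access; the URL is a placeholder):
#   info = cache_info('https://example.com/')
#   puts info[:max_age]
```

Comparing Last-Modified or ETag with the values saved from the previous run is one way to decide whether a full GET is worth making.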

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
@parsed = Nokogiri::HTML(f, nil, 'windows-1251')
puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> elements with id f5. That isn't possible with my current XPath selector. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to be directly underneath the div from the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5".
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each section header. Inconveniently, however, the designer of the web page decided to break up the data into three columns, adding an extra div (with classes such as column right).
My current method of extraction uses XPath.
For the section headers, I use this expression (get all h1 elements within a div with the given id):
//div[@id="relevantID"]//h1
This returns a list of h1 elements. Looping over them, I apply an additional selector to each matched h1 to look up the next ul node and retrieve all its li nodes:
following-sibling::ul//li
But thanks to the designer's aesthetics, this fails in the one case I've marked in the HTML, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the HTML to make a selector match is considered good practice (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary; hints and tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
You can combine following-sibling and preceding-sibling with the union operator | to also pick up li elements that sit in the same div before the h1. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 using //div[@id="relevantID"]//h1 and retrieving all li items for each h1 using following-sibling::ul//li as a second step, you could combine this into following-sibling::ul//li | preceding-sibling::ul//li.

Watir. How to make a call to an element of page object, inside a line like $browser.link.click?

We have page object elements like:
link(:test_link, xpath: ".//a[@id = '3']")
unordered_list(:list, id: 'test')
And the code:
def method(elementcontainer, elementlink)
elementcontainer = elementcontainer.downcase.gsub(' ', '_')
elementlink = elementlink.downcase.gsub(' ', '_')
object = send("#{elementcontainer}_element")
object2 = send("#{elementlink}_element")
total_results_1 = object.element.links(id: '3').length
total_results_2 = object.element.links(object2).length
end
The last 2 lines contain the mystery.
The total_results_1 is able to get the number of links contained in the unordered list that have id = '3'.
total_results_2 does not work (of course). I don't want to repeat the link locator in the middle of the code; that is already defined in the page object.
How is it possible to write something like the total_results_2 line, but in a working version?
I might be misunderstanding the question, but I do not believe you need to create a method for what you want. It can all be done using the page object accessors.
Say we have the following page (I matched this to your accessors, though it seems unlikely that all links would have the same id):
<html>
<body>
<a id="3" href="#">1</a>
<ul id="test">
<li><a id="3" href="#">2</a></li>
<li><a id="3" href="#">3</a></li>
<li><a id="3" href="#">4</a></li>
</ul>
<a id="3" href="#">5</a>
</body>
</html>
As you did, you could define the list with the accessor:
unordered_list(:list, id: 'test')
To get the links that have id 3 but sit only within the list, you could:
Define the links as a collection - ie use links instead of link.
Use a block to locate the elements. This would allow you to consider the element nesting - ie locate links within the list element.
This would be done with:
links(:test_link){ list_element.link_elements(:id => '3') }
All together, your page object would be:
class MyPage
include PageObject
unordered_list(:list, id: 'test')
links(:test_link){ list_element.link_elements(:id => '3') }
end
To find the number of links, you would access the element collection and check its length.
browser = Watir::Browser.new
browser.goto('your_test_page.htm')
page = MyPage.new(browser)
puts page.test_link_elements.length
#=> 3
