Scraping separate sets of items from a website with Ruby & Nokogiri - ruby

I'm working on a project for school and I'm having issues with finding the correct CSS selectors within the website's HTML in order to pull in the data I'm looking for. This is also my very first time with web scraping & I'm fairly new to Ruby as well so I apologize if this is a silly question.
I have successfully parsed the first set of data (although I'm sure there are better ways to do this, my method IS working but even feedback on this is welcome):
The website is platinumgod.co.uk for reference.
The HTML I scraped for the first part is as follows (along with the first item listed as an example):
<div class="repentanceitems-container">
<h2>
"Repentance Items "
<span class="rep-item-ttl">(169)</span>
</h2>
<li class="textbox" data-tid="42.5" data-cid="42" data-sid="263">
<a
<div onclick class="item reb-itm-new re-itm263"></div>
<span>
<p class="item-title">Clear Rune</p>
<p class="r-itemid">ItemID: 263</p>
<p class="pickup">"Rune mimic"</p>
<p class="quality">Quality: 2</p>
<p>"When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)"</p>
<p>Drops a random rune on the floor when picked up</p>
<p>The recharge time of this item depends on the Rune/Soul Stone held:</p>
<p>1 room: Soul of Lazarus</p>
<p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p>
<p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p>
<p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p>
<p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p>
<p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p>
<ul>
<p>Type: Active</p>
<p>Recharge time: Varies</p>
<p>Item Pool: Secret Room, Crane Game</p>
</ul>
<p class="tags">* Secret Room</p>
</span>
</a>
</li>
This is just an example of one item in the Repentance Items category, so this is my code to parse all the information for each item in that category:
# Repentance Items
repentance_items = []
html.at(".repentanceitems-container").css("li.textbox").each do |item |
item_name = item.css("a span p.item-title").text
item_id = item.css("a span p.r-itemid").text.sub(/^ItemID: /, "")
pickup_text = item.css("a span p.pickup").text.gsub("\"", "")
quality = item.css("a span p.quality").text.sub(/^Quality: /, "")
use = item.css(".quality ~ p:not(.tags)").map { |row| row.text }
item_type = item.css("a span ul")
item.css("a span ul").each.map do |child|
item_type = child.css("p")[0].text.sub(/^Type: /, "")
if child.css("p")[1].text.match "Recharge time"
recharge_time = child.css("p")[1].text.sub(/^Recharge time: /, "")
item_pool = child.css("p")[2].text.sub(/^Item Pool: /, "").gsub(/,\s*$/m, "").split(", ")
else
recharge_time = "N/A"
item_pool = child.css("p")[1].text.sub(/^Item Pool: /, "").gsub(/,\s*$/m, "").split(", ")
end
repentance_items << {name: item_name, item_id: item_id, pickup_text: pickup_text, quality: quality, use: use, item_type: item_type, recharge_time: recharge_time, item_pool: item_pool}
end
end
The problem I'm facing is when I try to scrape the next category, which is Repentance Item Trinkets, I'm not sure what the CSS selectors should be in order to get this information because a lot of the same classes are used as in the Repentance Items HTML & so I just get the same items I did before. The HTML for the trinkets is the following (along with the first item listed as an example):
<div class="repentanceitems-container">
<h2>
"Repentance Trinkets "
<span class="a-item-ttl">(61)</span>
</h2>
<li class="textbox" data-tid="1000" data-cid="804" data-sid="10129">
<a
<div onclick class="item rep-item rep-trink rep-junxx129"></div>
<span>
<p class="item-title">Jawbreaker</p>
<p class="r-itemid">TrinketID: 129</p>
<p class="pickup">"Don't chew on it"</p>
<p>Tears have a chance to become a tooth, dealing x3.2 damage, similar to Tough Love</p>
<p>The chance to fire a tooth with this trinket is affected by your Luck stat</p>
<p>At +0 luck you have ~12% chance for this effect to activate</p>
<p>At +9 luck every tear you fire will be a tooth</p>
<p class="tags">*, </p>
</span>
</a>
</li>
I'm not sure where to begin in order to select only these items. If I go by the same selectors used in the first part of my code, it obviously just re-pulls in the Repentance Items & not the Trinkets.
Hopefully I've explained this well enough but please feel free to ask me more questions & I'll do my best to explain better.
Thank you all so much in advance for helping me!

Maybe you could start by breaking your first selector line into two parts: one to capture each container, and then another to look for the items inside it.
This could look like this (not tested):
repentance_items = []
repentance_trinkets = []
# css returns every matching container; at would return only the first one
html.css(".repentanceitems-container").each do |container|
  # Check which category you are in, and so which array should receive the results, something like:
  repentance_target = if container.css('h2').text =~ /items/i
    repentance_items
  else
    repentance_trinkets
  end
  container.css("li.textbox").each do |item|
    # your current logic
    # append to the correct results array
    repentance_target << ...
  end
end
In the end the two arrays should be populated with the correct items.
This is a rough draft, but I hope it helps. Let me know if something is not clear.

Related

undefined method `children' for nil:NilClass (NoMethodError)

I'm trying to revive a simple example of parsing a site with Nokogiri and ran into the error undefined method `children' for nil:NilClass (NoMethodError).
require 'open-uri'
url = 'http://www.cubecinema.com/programme'
html = open(url)
puts html
require 'nokogiri'
doc = Nokogiri::HTML(html)
showings = doc.css('.showing').map do |showing|
showing_id = showing['id'].split('_').last.to_i
tags = showing.css('.tags a')
.map{|tag| tag.text.strip}
title_el = showing.at_css('h1 a')
.children
.delete_if{|c| c.name == 'span'}
title = title_el.text.strip
dates = showing.at_css('.start_and_pricing')
.inner_html
.strip
.split('<br>')
.map(&:strip)
.map{|d| DateTime.parse(d)}
description = showing.at_css('.copy')
.text
.delete('[more...]')
.strip
{id: showing_id,
title: title,
tags: tags,
dates: dates,
description: description}
end
I found a possible solution at https://translate.googleusercontent.com/translate_c?anno=2&depth=1&rurl=translate.google.com&sl=auto&sp=nmt4&tl=ru&u=https://github.com/dwightjack/grunt-email-boilerplate/issues/12&xid=25657,15700023,15700186,15700191,15700248,15700253&usg=ALkJrhgLkK2xqf-6SfL3K16DBRdtdNH0Cw, but it's not clear what the premailer subtasks are, and reading the site didn't really help. Where do I need to write down these subtasks? I would be very grateful for clarification, either of my mistake or of how these subtasks need to be defined; I don't understand it myself, possibly for lack of experience.
I am not able to leave a comment due to lack of reputation, so I can only give advice in the answers section.
So, I think you should first check what showing.at_css('h1 a') returns before calling children on it. at_css returns nil when nothing matches the selector, and nil has no children method. Hope it helps.
I ran your program locally, and I can't find any h1 tags within the section of markup you are scraping.
The reason you are getting this error is that Nokogiri returns nil for that search, and you then call children on nil, which raises the NoMethodError.
This is the section of code you are attempting to retrieve "h1 a" from.
<div class="showing" id="event_10427"> <div class="event_image"> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<img src="/media/diary/thumbnails/MSJ_vvlive.jpg.600x0_q45.jpg" alt="Picture for event Vula Viel - “Do Not Be Afraid” Album Tour"></a> <span class="tags"> music </span> </div> <!-- div event_image --> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<p><span class="pre_title"> Ear Trumpet Music presents </span></p> <h3>Vula Viel - “Do Not Be Afraid” Album Tour</h3> <span class="post_title"> </span> </a> <p></p>
<div class="event_details"> <p class="start_and_pricing"> Thu 28 March // 20:00 <br> </p> <p class="copy">The trio of music makers called Vula Viel weave sparse polyrhythms and intricate rhythm structures around ... [<a class="more" href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">more</a>]</p> </div> </div>
As you can see there are no h1 tags, therefore Nokogiri returns nil for your search.
You can either change the selector if it was a mistake on your part (in the markup above the title is an h3 nested inside the a, so that would be 'a h3'), or, if not every showing has the element, check whether
title_el = showing.at_css('a h3')
returns nil before you call children on it.

Result of xpath is object text error, how do i get around this in Ruby on a site built around hiding everything?

My company uses ways to hide most data on their website and I'm trying to create a driver that will scan closed jobs to populate an array used to create new jobs, thus requiring no user input / database access for users.
I did research and it seems this can't be done the way i'm doing it:
# Scan page and place 4 different Users into an array
String name = [nil, nil, nil, nil]
String compare_name = nil
c = 0
tr = 1
while c < 4
String compare_name = driver.find_element(:xpath, '//*[@id="job_list"]/tbody/tr['+tr.to_s+']/td[2]/span[1]/a/span/text()[2]').gets
if compare_name != name[c]
name[c] = compare_name
c = +1
tr = +1
else if compare_name == name[c]
tr = +1
end
end
end
Also i am a newb learning as i go, so this might not be optimal or whatever just how i've learned to do what i want.
Now the website code for the item i want on the screen:
<span ng-if="job.customer.company_name != null &&
job.customer.company_name != ''" class="pointer capitalize ng-scope" data-
toggle="tooltip" data-placement="top" title="" data-original-title="406-962-
5835">
<a href="/#/edit_customer/903519"class="capitalize notranslate">
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
</a>
</span>
Xpath can find the Diago Stein area, but because of it being a text object it doesn't work. Now to note something all the class titles, button names, etc are all the same with everything else on the page. They always do that which makes it even harder to scan because those same things are likely elsewhere that might not have anything to do with this area of the site.
Is there any way to grab this text without knowing what might be in the text area based on the HTML? Note "Name Stuff" is the name of a company i hid it with this generic one for privacy.
Thanks for any ideas or suggestions and help.
EDIT: Clarification, i will NOT know the name of the company or the user name (in this case Diago Stein) the entire purpose of this part of the code is to populate an array with the customers name from this table on the closed page.
You can back your XPath up one level to
//*[@id="job_list"]/tbody/tr[' + tr.to_s + ']/td[2]/span[1]/a/span
then grab the innerText. The SPAN is
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
The problem is that this HTML has some conditionals in it which makes it hard to read, hard to figure out what's actually there. If we strip out the conditional, we are left with
<span class="ng-binding">Name Stuff<br>Diago Stein</span>
If we take the innerText of this, we get
Name Stuff
Diago Stein
You can then split the string on the newline: part 0 is 'Name Stuff' and part 1 is 'Diago Stein'. So you use your locator to find the SPAN, get its text, split it on the newline, and take the second part to get your desired string.
This code isn't tested but it should be something like
name = driver.find_element(:xpath, "//*[@id='job_list']/tbody/tr[#{tr}]/td[2]/span[1]/a/span").text.split("\n")[1]
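The splitting step can be tried without a browser. The sketch below assumes the element's text arrives as a single string with the company name on the first line and the user name on the second; the method name second_line is hypothetical:

```ruby
# Returns the second non-empty line of the text, or nil if there is none.
# Stripping each line guards against stray spaces around the <br>.
def second_line(inner_text)
  inner_text.lines.map(&:strip).reject(&:empty?)[1]
end

puts second_line("Name Stuff\n Diago Stein")
```

Dropping empty lines first means a blank line between the two names does not shift the index.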

XPath in RSelenium for indexing list of values

Here is an example of html:
<li class="index i1"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample.com">
<li class="index i2"
<ol id="rem">
<div class="bare">
<h3>
<a class="tlt mhead" href="https://www.myexample2.com">
I would like to take the value of every href in the a elements. The entries differ in the class of the outer li, whose name changes: i1, i2.
So I have a counter and change it when I go to take the value.
i <- 1
stablestr <- "index "
myVal <- paste(stablestr , i, sep="")
so even if I just try to access the general li with the myVal index using this
profile<-remDr$findElement(using = 'xpath', "//*/input[@li = myVal]")
profile$highlightElement()
or the href using this
profile<-remDr$findElement(using = 'xpath', "/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']")
profile$highlightElement()
Is there anything wrong with xpath?
Your HTML structure is invalid. Your <li> tags are not closed properly, and it seems you are confusing <ol> with <li>. But for the sake of the question, I assume the structure is as you write, with properly closed <li> tags.
Then, constructing myVal is not right. It will yield "index 1" while you want "index i1". Use "index i" for stablestr.
Now for the XPath:
//*/input[@li = myVal]
This is obviously wrong since there is no input in your XML. Also, you didn't prefix the variable with $. And finally, the * seems to be unnecessary. Try this:
//li[@class = $myVal]
In your second XPath, there are also some errors:
/li[@class=myVal]/ol[@id='rem']/div[@id='bare']/h3/a[@class='tlt']
           ^                        ^                        ^
           missing $                should be @class         is actually 'tlt mhead'
The first two issues are easy to fix. The third one is not. You could use contains(@class, 'tlt'), but that would also match if the class is, e.g., tltt, which is probably not what you want. Anyway, it might suffice for your use-case. Fixed XPath:
/li[@class=$myVal]/ol[@id='rem']/div[@class='bare']/h3/a[contains(@class, 'tlt')]

How to check box in Capybara if there are no name, id or label text?

I am newbie here. Please advise. How to select checkbox in my case?
<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
The following doesn't work for me:
When /I check box "([^\"]+)"$/ do |label|
page.check(label)
end
step: And I check box "Dog - Wikipedia, the free encyclopedia"
If you can change the html, wrap the input and span in a label element
<ul class="phrases-list" style="">
<li>
<label>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
</label>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
which has the added benefit of clicks on the "Dog - Wikipedia ..." text triggering the checkbox too. With that change your step should work as written. If you can't modify the html then things get more difficult.
Something like
find('span', text: label).find(:xpath, './preceding-sibling::input').set(true)
should work, although I'm curious how you're using these checkboxes from JS with nothing tying them to any specific value
Let's assume that you are prevented from changing the HTML. In this case, it would probably be easiest to query for the element via XPath. For example:
# Here's the XPath query
q = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
# Use the query to find the checkbox. Then, check the checkbox.
page.find(:xpath, q).set(true)
Okay - it's not as bad as it looks! Let's analyze this XPath so we can understand what it's doing:
//span
This first part says "Search the entire HTML document and discover all 'span' elements". Of course, there are probably a LOT of "span" elements in the HTML document, so we'll need to restrict this:
//span[contains(text(), 'Dog - Wikipedia')]
Now we're only searching for the "span" elements that contain the text "Dog - Wikipedia". Presumably, this text will uniquely identify the desired "span" element on the page (if not, then just search for more of the text).
At this point, we have the "span" element that is adjacent to the desired "input" element. So, we can query for the "input" element using the "preceding-sibling::" XPath Axis:
//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input

Regex a regexed match in 1 search? Other minor regex questions

I have an email that has some html code that I'm looking to regex. I'm using a gmail gem to read my emails and using nokogiri fails when reading through gmail. Thus I'm looking for a regex solution
What I'd like to do is to scan for the section that is labeled important title and then look at the unordered list within that section, capturing the urls. The html code that is labeled important title is provided below.
I wasn't sure how to do this so I thought the proper way to do it, was to regex for the section called important title and capture everything up to the end of the unordered list. Then within this match, subsequently find the links.
To find the links, I used this regex which works fine: (?:")([^"]*)(?:" )
To capture the section called important title however, I wanted to simply use the following regex (?:important title).*(?:<\/ul>). From my understanding that would look for important title then as many characters as possible, followed by </ul>. However from the below, it only captures </h3>. The new line character is causing it to stop. Which is one of my questions: why is . which is supposed to capture all characters, not capturing a new line character? If that's by design, I don't need more than a simply 'its by design'...
So assuming it's by design, I then tried (?:important title)((.|\s)*)(?:<\/ul>) and that's giving me 2 matches for some reason. The first matches the entire code that I need, stopping at </ul> and the second match is literally just a blank string. I don't get why that's the case...
Finally my last and most important question is, do I need to do 2 regexes to get the links? Or is there a way to combine both regexes so that my "link regex" only searches within my "section regex"?
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>
An example with nokogiri:
# encoding: utf-8
require 'nokogiri'
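html_doc = '''
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>
'''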
doc = Nokogiri::HTML.parse(html_doc)
doc.search('//h3[text()="the important title "]/following-sibling::ul[1]/li/a/@href').each do |link|
puts link.content
end
The regex way uses the anchor \G, which matches the position at the end of the previous match. Since this anchor is initialized to the start of the string at the beginning, you must add (?!\A) (not at the start of the string) to forbid this case, so that the first match can only use the second entry point.
To be more readable, the whole pattern uses the extended mode (or verbose mode, or comment mode, or free-spacing mode...), which allows comments inside the pattern and ignores spaces. This mode can be set or unset inline with (?x) and (?-x).
pattern = Regexp.new('
# entry points
(?:
\G (?!\A) # contiguous to the precedent match
|
<h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s*
)
<li>
<a \s+ href=" (?<url> [^"]* ) " [^>]* >
(?<txt> (?> [^<]+ | <(?!/a>) )* )
\s* </a> \s* </li> \s*', Regexp::EXTENDED | Regexp::IGNORECASE)
html_doc.scan(pattern) do |url, txt|
puts "\nurl: #{url}\ntxt: #{txt}"
end
The first match uses the second entry point: <h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s* and all subsequent matches use the first: \G (?!\A)
After the last match, since there are no more contiguous li tags (there is only a closing ul tag), the pattern fails. To succeed again, the regex engine would have to find a new second entry point.
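For comparison, the two-step approach described in the question also works once the multiline flag is set: in Ruby, . does not match a newline unless the pattern carries /m. The following is only a sketch under that approach; the sample text is a shortened, made-up version of the email, and SAMPLE_EMAIL_HTML and important_links are hypothetical names:

```ruby
# Shortened, hypothetical version of the email body from the question.
SAMPLE_EMAIL_HTML = <<~'EMAIL'
  <h3>other title</h3>
  <ul>
  <li><a href="http://www.link.com/ignored" target="_blank">skip me</a></li>
  </ul>
  <h3>the important title </h3>
  <ul>
  <li><a href="http://www.link.com/23232=
  .32434" target="_blank">first link</a></li>
  <li><a href="http://www.link.com/287=
  .059400" target="_blank">third link</a></li>
  </ul>
EMAIL

# Step 1 isolates the section, step 2 scans only that section for links.
def important_links(html)
  # /m lets "." match newlines; ".*?" is lazy, so it stops at the first </ul>.
  section = html[/important title.*?<\/ul>/m]
  return [] unless section

  # [^"]* crosses line breaks too; the newline inside each URL is removed afterwards.
  section.scan(/href="([^"]*)"/).flatten.map { |url| url.delete("\n") }
end

puts important_links(SAMPLE_EMAIL_HTML)
```

Two passes are easier to read than the single \G pattern, at the cost of scanning the section twice.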
I have html that I'm looking to regex.
Use the nokogiri gem: http://nokogiri.org/
It's the de facto standard for searching html. Ignore the requirements that are listed; they are out of date.
require 'nokogiri'
require 'open-uri'
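#doc = Nokogiri::HTML(open('http://www.some_site.com'))
html_doc = Nokogiri::HTML(<<'END_OF_HTML')
<h3>not important</h3>
<ul>
<li>first link»</li>
<li>second link »</li>
</ul>
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=.32434">first link</a></li>
<li><a href="http://www.link.com/234234468=.059400">second link</a></li>
<li><a href="http://www.link.com/287=.059400">third link</a></li>
<li><a href="http://www.link.com/4234501=.059400">fourth link</a></li>
<li><a href="http://www.link.com/34517=.059400">5th link</a></li>
</ul>
END_OF_HTML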
a_tags = html_doc.xpath(
'//h3[text()="the important title "]/following-sibling::ul[1]//a'
)
a_tags.each do |tag|
puts tag.content
puts tag['href']
end
--output:--
first link
http://www.link.com/23232=.32434
second link
http://www.link.com/234234468=.059400
third link
http://www.link.com/287=.059400
fourth link
http://www.link.com/4234501=.059400
5th link
http://www.link.com/34517=.059400
