internet.find(:xpath, '/html/body/div[1]/div[10]/div[2]/div[2]/div[1]/div[1]/div[1]/div[5]/div/div[2]/a').text
I am looping through a series of pages and sometimes this xpath will not be available. How do I continue to the next url instead of throwing an error and stopping the program? Thanks.
First, stop using xpaths like that - they're going to be ultra-fragile and horrendous to read. Without seeing the HTML I can't give you a better one - but at least part way along there has to be an element with an id you can target instead.
Next, you could rescue the exception raised by find and ignore it, or better yet, you could check whether the page has the element first:
if internet.has_xpath?(...)
  internet.find(:xpath, ...).text
  ...
else
  ... whatever you want to do when it's not on the page
end
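The first option, rescuing the failure and moving on to the next URL, might look roughly like the sketch below. It is plain Ruby so that it runs without a browser: ElementNotFound, fetch_text, and the PAGES hash are hypothetical stand-ins. With Capybara you would rescue Capybara::ElementNotFound around the find call instead.

```ruby
# Hypothetical stand-in for Capybara::ElementNotFound.
class ElementNotFound < StandardError; end

# Hypothetical fake pages: URL => text of the wanted element (nil when absent).
PAGES = {
  'http://example.com/a' => 'first',
  'http://example.com/b' => nil,
  'http://example.com/c' => 'third'
}.freeze

# Stand-in for visiting a page and calling find(...).text.
def fetch_text(url)
  PAGES.fetch(url) || raise(ElementNotFound, "no match on #{url}")
end

# Collects the text from every page that has the element, skipping the rest.
def collect_texts(urls)
  results = []
  urls.each do |url|
    begin
      results << fetch_text(url)
    rescue ElementNotFound
      next # skip this page and carry on with the next URL
    end
  end
  results
end
```

Rescuing only the specific "not found" error keeps other failures, such as a network error, visible instead of silently swallowing them.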
As an alternative to the accepted answer, you could consider the #first method, which accepts a count argument as the number of expected matches, or nil to allow empty results as well:
internet.first(:xpath, ..., count: nil)&.text
That returns the element's text if one is found and nil otherwise, so there is no exception to rescue and ignore.
See Capybara docs
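The &. (safe navigation) operator is what lets the one-liner above return nil instead of raising when nothing matched. A minimal plain-Ruby sketch of that behaviour, with a hypothetical Node struct standing in for a matched element and a hypothetical helper first_text:

```ruby
# Hypothetical stand-in for a matched element.
Node = Struct.new(:text)

# Returns the text of the first match, or nil when there are no matches.
def first_text(matches)
  # Array#first returns nil for an empty array; &. then skips the .text call.
  matches.first&.text
end
```

The caller still has to handle the nil, for example with `next if text.nil?` inside the loop over URLs.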
I am trying to write an XPath expression which can return the URL associated with the next page of a search.
The URL which leads to the next page of the search is always the href in the a tag following the tag span class="navCurrentPage". I have been trying to use a following-sibling term to pull the next URL. My search in the Chrome console is:
$x('//span[@class="navCurrentPage"][1]/following-sibling::a/@href[1]')
I thought that by specifying @href[1] I would only get back one URL (thinking the [1] chooses the first element in the list), but instead Chrome (and Scrapy) are returning four URLs. I don't understand why. Please help me to understand how to select the one URL that I am looking for.
Here is the URL where you can find the HTML giving me trouble:
https://www.yachtworld.com/core/listing/cache/searchResults.jsp?cit=true&slim=quick&ybw=&sm=3&searchtype=advancedsearch&Ntk=boatsEN&Ntt=&is=false&man=&hmid=102&ftid=101&enid=0&type=%28Sail%29&fromLength=35&toLength=50&fromYear=1985&toYear=2010&fromPrice=&toPrice=&luom=126&currencyid=100&city=&rid=100&rid=101&rid=104&rid=105&rid=107&rid=108&rid=112&rid=114&rid=115&rid=116&rid=128&rid=130&rid=153&pbsint=&boatsAddedSelected=-1
Thank you for the help.
Operator precedence: //x[1] means /descendant-or-self::node()/child::x[1], which selects every x that is the first x child of its parent. You want (//x)[1], which selects the first node among all the descendants named x. For the query in the question, that would be (//span[@class="navCurrentPage"]/following-sibling::a)[1]/@href.
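To make the difference concrete, here is a toy model in plain Ruby rather than a real XPath engine. The Elem struct, the sample tree, and the helper names per_parent_first and overall_first are all made up for illustration; they only mimic what //a[1] and (//a)[1] select.

```ruby
Elem = Struct.new(:name, :id, :children)

def elem(name, id, *children)
  Elem.new(name, id, children)
end

# <root><div><a id=1/><a id=2/></div><div><a id=3/><a id=4/></div></root>
TREE = elem('root', 0,
            elem('div', 0, elem('a', 1), elem('a', 2)),
            elem('div', 0, elem('a', 3), elem('a', 4)))

# All nodes in document order (the descendant-or-self axis).
def all_nodes(node)
  [node] + node.children.flat_map { |child| all_nodes(child) }
end

# Mimics //a[1]: for every node, take its first child with the given name.
def per_parent_first(root, name)
  all_nodes(root).map { |n| n.children.find { |c| c.name == name } }.compact
end

# Mimics (//a)[1]: gather every match first, then take the first one.
def overall_first(root, name)
  all_nodes(root).select { |n| n.name == name }.first
end

per_parent_first(TREE, 'a').map(&:id) # => [1, 3]
overall_first(TREE, 'a').id           # => 1
```

The per-parent version returns one node for each parent that has a match, which is why a predicate such as @href[1] can still yield several results.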
An XPath index applies within every matching context, so to get only the first item, take the first result from the returned list. In Scrapy:
response.xpath('//span[@class="navCurrentPage"][1]/following-sibling::a/@href[1]').extract_first()
Just add .extract_first() or .get() to fetch the first item.
see the scrapy documentation here.
I've found this very helpful to make sure you have the bracket in the right place.
What is the XPath expression to find only the first occurrence?
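Also note that XPath positions start at 1, so the first occurrence is [1]; [0] only applies when you index the returned list in the host language.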
I want to wait until the required text is present.
I use:
@browser.text('required text').when_present
And I get an error:
ArgumentError: wrong number of arguments (given 1, expected 0)
How can this be implemented? I want to wait only for the text, without any dependency on an HTML element.
Using page-object gem without any dependency to html element:
@page.wait_until { @page.text.include? 'required text' }
Reference
The right syntax is @browser.text.include?('required text')
This should help you get rid of the error.
Rather than looking for text for the whole page, it may be more efficient to create an element for the text.
span(:ur_name, :text=>/partial string/)
or however you prefer to define it, and then call the element-level wait functions for the object. Of course, you can also declare the element dynamically if the text is dynamic.
span(:ur_name, :text => "#{variable_containing_text}")
page.ur_name_element.when_present
A small benefit is that this ensures your text is not displayed out of place, especially if duplicates are possible. If the element doesn't support the text locator, you can still identify the element, check for the text, and then use the element wait.
Better yet, you can do it like this:
Watir::Wait.until(10) { @browser.text.include?('required text') }
The default timeout is 30 seconds if not set; it is set to 10 seconds in the example above.
Reference
wait_until
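What wait helpers of this kind do is poll a block until it returns a truthy value or a deadline passes. A rough plain-Ruby sketch of that idea follows; wait_until, WaitTimeout, and the interval parameter are hypothetical names chosen for illustration, not Watir's actual implementation.

```ruby
# Hypothetical error raised when the condition never becomes true.
class WaitTimeout < StandardError; end

# Polls the block until it returns a truthy value or timeout seconds elapse.
def wait_until(timeout: 10, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end
```

With a browser object it would be called the same way as the examples above, for instance wait_until(timeout: 10) { @browser.text.include?('required text') }.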
As per the latest docs, use .wait_until(&:present?) on an element located by its text:
@browser.element(text: 'required text').wait_until(&:present?)
Every time I want to get the value of my DomAttr I get a TypeError.
My code:
Wanted = page.getByXPath("//span[contains(.,'Some')]/parent::a/@href");
This returns
[DomAttr[name=href value=URLSTRING]]
Now I want to get the value (=URLSTRING) with Wanted.getNodeValue();
but every time I get the error
Cannot find function getNodeValue in object [DomAttr[name=href value=
The same happens when I use getValue.
Please help me.
There are some things that make no sense in the code (particularly, because it is not complete). However, I think I can guess what the issue is.
getByXPath is actually returning a List (funny thing you missed the part of the code in which you specify it as a list and replaced it with a Wanted).
Note you should probably also have type warnings in the code too.
Now, you can see that the returned value is in square brackets. That means it is a List (confirming first assumption).
Finally, although you happened to miss that part of the code too, I guess you are directly applying the getValue to the list instead of the DomAttr elements in the list.
How to solve it: If you need more than one result, iterate over the elements of the list (that Wanted word over there). If you need one result, then use the getFirstByXPath method.
Were my guesses right?
I have the following line in a long loop
page = Nokogiri::HTML(open(topic[:url].first)).xpath('//ul[@class = "pages"]//li').first
Sometimes my Ruby application crashes, raising the "End of file reached" exception on this line.
How can I resolve this problem? Just a begin;raise;end block?
It is a script that performs a forum backup, so it is important that it doesn't skip any thread.
Thanks in advance.
In addition to @Phrogz's excellent advice (in particular about at_css with the simpler expression), I would pull the raw XML [content] separately:
# open returns an IO-like object, so read it into a string before calling strip
page = if (content = open(topic[:url].first).read).strip.length > 0
  Nokogiri::HTML(content).xpath('//ul[@class = "pages"]//li').first
end
I would suggest that you first fix the underlying issue so that you do not get this error.
Does the same URL always cause the problem? (Output it in your log files.) If so, perhaps you need to URI encode the URL.
Is it random, and therefore likely related to a connection hiccup or server problem? If so, you should rescue the specific error and then retry one or more times to get the crucial data.
Secondarily, you should know that the CSS syntax for that query is far simpler:
page = Nokogiri.HTML(...).at_css('ul.pages li')
Not only is this less than half the bytes, it allows for cases like <ul class="foo pages"> that the XPath would miss.
Using at_css (or at_xpath) is the same as .css(...).first, but is faster and simpler.
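The suggestion above to rescue the specific error and retry could be sketched as follows. It is plain Ruby with no network access: with_retries and its attempts and wait parameters are hypothetical, and EOFError is assumed to be the exception behind the "End of file reached" message.

```ruby
# Runs the block, retrying when EOFError is raised, up to attempts tries in total.
def with_retries(attempts: 3, wait: 0)
  tries = 0
  begin
    tries += 1
    yield
  rescue EOFError
    raise if tries >= attempts # give up and let the caller see the error
    sleep wait
    retry
  end
end
```

In the backup loop the fetch would go inside the block, for example page = with_retries(attempts: 5, wait: 2) { Nokogiri::HTML(open(topic[:url].first)) }. Re-raising after the last attempt means a thread that keeps failing stops the script instead of being skipped silently.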
So I've created and published a Sinatra app to Heroku without any issues. I've even tested it locally with rackup to make sure it functions fine. There are a series of API calls to various places after a zip code is consumed from the URL, but Heroku just wants to tell me there is a server error.
I've added an error page that tries to give me more description, however, it tells me it can't perform a `count' for #, which I assume means hash. Here's the code that I think it's trying to execute...
if weather_doc.root.elements["weather"].children.count > 1
  curr_temp = weather_doc.root.elements["weather/current_conditions/temp_f"].attributes["data"]
else
  raise error(404, "Not A Valid Zip Code!")
end
If anyone wants to bang on it, it can be reached at, http://quiet-journey-14.heroku.com/ , but there's not much to be had.
On Ruby versions before 1.8.7, Hash doesn't have a count method (Enumerable#count was added in 1.8.7). It does have a length method. If # really does refer to a hash object, then the problem is that you're calling a method that doesn't exist.
That # doesn't refer to Hash, it's the first character of #<Array:0x2b2080a3e028>. The part between the < and > is not shown in browsers (hiding the tags themselves), but visible with View Source.
Your real problem is not related to Ruby though, but to your navigation in the HTML or XML document (via DOM). Your statement
weather_doc.root.elements["weather"].children.count > 1
navigates the HTML/XML document, selecting the 'weather' elements, and (tries to) count the children. The result of the children call does not have a count method here. Use length instead.
BTW, are you sure that the document contains a <weather> tag? Because that's what you're trying to select.
If you want to see what's behind #, try
raise probably_hash.class.to_s
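A small plain-Ruby sketch of the same kind of inspection, done without raising just to see the class. The helper name size_of is hypothetical; respond_to?, length, and count are standard Ruby.

```ruby
# Returns the number of items, preferring length and falling back to count.
def size_of(collection)
  if collection.respond_to?(:length)
    collection.length
  elsif collection.respond_to?(:count)
    collection.count
  else
    raise TypeError, "#{collection.class} is not countable"
  end
end
```

Checking respond_to? first makes the code work whether or not the object in question happens to define count on the Ruby version in use.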