EOF with Nokogiri - ruby

I have the following line in a long loop
page = Nokogiri::HTML(open(topic[:url].first)).xpath('//ul[#class = "pages"]//li').first
Sometimes my Ruby application crashes raising the "End of file reached " exception in this line.
How can I resolve this problem? Just a begin;raise;end block?
Is a script that performs a forum backup, so is important that doesn't skip any thread.
Thanks in advance.

In addition to #Phrogz's excellent advice (in particular about at_css with the simpler expression), I would pull the raw xml [content] separately:
page = if (content = open(topic[:url].first)).strip.length > 0
Nokogiri::HTML(content).xpath('//ul[#class = "pages"]//li').first
end

I would suggest that you should first to fix the underlying issue so that you do not get this error.
Does the same URL always cause the problem? (Output it in your log files.) If so, perhaps you need to URI encode the URL.
Is it random, and therefor likely related to a connection hiccup or server problem? If so, you should rescue the specific error and then retry one or more times to get the crucial data.
Secondarily, you should know that the CSS syntax for that query is far simpler:
page = Nokogiri.HTML(...).at_css('ul.pages li')
Not only is this less than half the bytes, it allows for cases like <ul class="foo pages"> that the XPath would miss.
Using at_css (or at_xpath) is the same as .css(...).first, but is faster and simpler.

Related

Ruby Capybara Element not found error handling

internet.find(:xpath, '/html/body/div[1]/div[10]/div[2]/div[2]/div[1]/div[1]/div[1]/div[5]/div/div[2]/a').text
I am looping through a series of pages and sometimes this xpath will not be available. How do I continue to the next url instead of throwing an error and stopping the program? Thanks.
First, stop using xpaths like that - they're going to be ultra-fragile and horrendous to read. Without seeing the HTML I can't give you a better one - but at least part way along there has to be an element with an id you can target instead.
Next, you could catch the exception returned from find and ignore it or better yet you could check if page has the element first
if internet.has_xpath?(...)
internet.find(:xpath, ...).text
...
else
... whatever you want to do when it's not on the page
end
As an alternative to accepted answer you could consider #first method that accepts count argument as the number of expected matches or null to allow empty results as well
internet.first(:xpath, ..., count: nil)&.text
That returns element's text if one's found and nil otherwise. And there's no need to ignore rescued exception
See Capybara docs

How to fix strings being frozen randomly

I'm running into an issue where my script is failing in random places with:
Error: can't modify frozen String: "Please use text available here - Test jira"
These lines caused the error:
description = "Please use text available here - #{#jira[:url]}"
unless previous_jira.nil?
description << <<~PREVIOUSJIRACREATED
Please close previous jira's:
#{previous_jira}
PREVIOUSJIRACREATED
end
I think it's a pretty simple line and I am not freezeing it on purpose or anything like that. But can't understand why I am getting the error.
I have string interpolation all over my code, and the script started failing randomly at different places. The above code is just one example. I was able to identify a couple of high offenders and placed begin/rescue blocks around them, but I am afraid my code is getting ugly with the rescue blocks all over.
I tried searching, but I only got articles explaining what freeze is and how to dup the object.
I am on Ruby 2.7.0p0.

Parsing huge (~100mb) kml (xml) file taking *hours* without any sign of actual parsing

I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.
The parsing code is good, in fact I'll share it just for the heck of it, even though this code doesn't have much to do with my problem:
geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)
kmldoc.css("//Placemark").each_with_index do |placemark, i|
puts i
tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
h = HorryParcel.new
h.owner_name = tds.shift.text
tds.shift
tds.each_slice(2) do |k, v|
col = k.text.downcase
eval("h.#{col} = v.text")
end
coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map {|x| x.split(",")}
points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
geo_shape = geofactory.polygon(geofactory.linear_ring(points))
proj_shape = geo_shape.projection
h.geo_shape = geo_shape
h.proj_shape = proj_shape
h.save
end
Anyway, I've tested this code with a much, much smaller sample of kml and it works.
However, when I load the real thing, ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks and during this multi-hour period, not a single i value has been put to the command line. Oddly enough it hasn't timed out yet, but even if this works there has got to be a better way to do this thing.
I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.
Here's a sample of the kml (w/ just one placemark) if that helps.
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
<name>justone.kml</name>
<Style id="PolyStyle00">
<LabelStyle>
<color>00000000</color>
<scale>0</scale>
</LabelStyle>
<LineStyle>
<color>ff0000ff</color>
</LineStyle>
<PolyStyle>
<color>00f0f0f0</color>
</PolyStyle>
</Style>
<Folder>
<name>justone</name>
<open>1</open>
<Placemark id="ID_010161">
<name>STUART CHARLES A JR</name>
<Snippet maxLines="0"></Snippet>
<description>""</description>
<styleUrl>#PolyStyle00</styleUrl>
<MultiGeometry>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</MultiGeometry>
</Placemark>
</Folder>
</Document>
</kml>
EDIT:
99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.
class ClassName
attr_reader :before, :after
def go
#before = Time.now
run_actual_code
#after = Time.now
puts "process took #{(#after - #before) seconds} to complete"
end
def run_actual_code
...
end
end
The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.
For a huge XML file, you should not use default XML parser from Nokogiri, because it parses as DOM. A much better parsing strategy for large XML files is SAX. Luckly we are, Nokogiri supports SAX.
The downside is that using a SAX parser all logic should be done with callbacks. The idea is simple: The sax parser starts to read a file and let you know whenever it finds something interesting, for example a tag opening, a tag close, or a text. You will be able to bind callbacks to these events, and extract whatever you need.
Of course you don't want to use a SAX parser to load all file into the memory and work with it there - this is exactly what SAX want to avoid. You will need to do whatever you want with this file part-by-part.
So this is basically a rewrite your parsing with callbacks logic. To learn more about XML DOM vs SAX parsers, you might want to check this FAQ from cs.nmsu.edu
I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem. Less. Less was a built long time ago & is a part of unix by default in most cases.
http://en.wikipedia.org/wiki/Less_%28Unix%29
Not related to the stylesheet language ("LESS"), less is a text viewer (cannot edit files, only read them) that does not load the entire document it is reading until you have scanned through the entire thing yourself. I.e., it loads the first "page", so to speak, and waits for you to call for the next one.
If a ruby script could somehow pipe "pages" of text into...oh wait....the XML structure wouldn't allow it due to the fact that it wouldn't have the closing delimeters from the end of the undigested text file......So what you would have to do is some custom work on the front end, cut out those first couple parent brackets so that you can pluck out the XML children one by one and have the last closing parent brackets break the script because the parser will think it is finished and come across another closing bracket I guess.
I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-lot blocks of text into ruby (or python, etc) via less or something similar to it. Perhaps something more primitive than less I'm not sure

HtmlUnit getByXpath returns null

I am coding with Groovy, however, I don't believe its a language specific set of questions.
I actually have two questions
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling javascript creates a mess in my cmd prompt.
Second Question
I am wanting to also get the image but am having trouble because when I attempt to get the XPath (via firebug) it shows up as: //*[#id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPATH was off by one in the predicate filter for the 4th div of the body, it should be the 3rd div. It appears the HTML for the site can/does change from when you had origionally snagged the XPATH using Firebug. You may need to adjust your XPATH to accommodate for potential change and be less sensitive to some differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPATH that you listed will work. It may look odd/short(and may not be the most efficient), but // starts at the root node and looks throughout every node in the tree, * matches on any element(to include the img) and the [] predicate filter restricts it to those that have an id attribute who's value equals "gmi-ResViewSizer_img".
There are many other options for XPATHs that could work as well. It will also depend on how often the HTML structure changes. This is one that also works for the page referenced to select that img:
/html/body/div/div/div/div/img[1]
I had the same problem, I solved when I realize iframe tags on page, try call
((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getElementByXPath(...
where n is the position in frame in iframe collection. It's work for me !!!
Thanks a lot.

Issues with Sinatra and Heroku

So I've created and published a Sinatra app to Heroku without any issues. I've even tested it locally with rackup to make sure it functions fine. There are a series of API calls to various places after a zip code is consumed from the URL, but Heroku just wants to tell me there is an server error.
I've added an error page that tries to give me more description, however, it tells me it can't perform a `count' for #, which I assume means hash. Here's the code that I think it's trying to execute...
if weather_doc.root.elements["weather"].children.count > 1
curr_temp = weather_doc.root.elements["weather/current_conditions/temp_f"].attributes["data"]
else
raise error(404, "Not A Valid Zip Code!")
end
If anyone wants to bang on it, it can be reached at, http://quiet-journey-14.heroku.com/ , but there's not much to be had.
Hash doesn't have a count method. It has a length method. If # really does refer to a hash object, then the problem is that you're calling a method that doesn't exist.
That # doesn't refer to Hash, it's the first character of #<Array:0x2b2080a3e028>. The part between the < and > is not shown in browsers (hiding the tags themselves), but visible with View Source.
Your real problem is not related to Ruby though, but to your navigation in the HTML or XML document (via DOM). Your statement
weather_doc.root.elements["weather"].children.count > 1
navigates the HTML/XML document, selecting the 'weather' elements, and (tries to) count the children. The result of the children call does not have a method count. Use length instead.
BTW, are you sure that the document contains a tag <weather>? Because that's what your're trying to select.
If you want to see what's behind #, try
raise probably_hash.class.to_s

Resources