Why do I get bad output when using Nokogiri "search"? - ruby

I want to scrape data from specific divs on a CarFax report. However, when I search for divs, I always get this weird garbage output.
I tried search(#divId) , search(.divClass), and even tried to grab all divs with search('div'). Each time I get similar results: the div's content is partially truncated and the tags are all messed up.
This is the URL I am loading into my agent: https://gist.github.com/atkolkma/8024287
This is the code (user and pass omitted):
require "rubygems"
require "mechanize"
scraper = Mechanize.new
scraper.user_agent_alias = 'Mac Safari'
scraper.follow_meta_refresh = true
scraper.redirect_ok = true
scraper.get("http://www.carfaxonline.com")
form = scraper.page.forms.first
form.j_username = "******"
form.j_password = "*****"
scraper.submit(form)
scraper.get("http://www.carfaxonline.com/api/report?vin=1G1AT58H697144202&track=true")
puts scraper.page.search("#headerBodyType")
This is what the file returns when I run it:
</div>4 DRderBodyType">
What I expect is:
<div id="headerBodyType"> SEDAN 4 DR </div>
The strangest thing is, if I copy the HTML source, save it as a new file, upload it and search it, I get the correct output! I've uploaded the copied HTML to my chevy-pics dot com domain and run the following code:
scraper2 = Mechanize.new
scraper2.get("http://www.chevy-pics.com/test.html")
puts scraper2.page.search("#headerBodyType")
I get this as output, as expected:
<div id="headerBodyType"> SEDAN 4 DR </div>

I can reproduce this by changing the line endings on the file in my editor to Mac OS 9, which uses a single \r (carriage return) character. When you use puts on the resulting string, the console returns to the start of the line each time this character is seen, but doesn’t start a new line. Each line therefore overwrites what was there before and you end up with the corrupted output you are seeing.
You should be able to tell if this is the case by using p instead of puts. You should see something like "<div id=\"headerBodyType\">\r SEDAN 4 DR\r </div>" as the output. Notice the \r characters used as newlines.
The actual result you get from the query is correct; it’s only the display that is causing the problems. The solution is probably just to use gsub on the text to convert \r to the more normal \n. I don’t know the best place to do this. It might be possible to change the entire document before Mechanize hands it over to Nokogiri for parsing, but I don’t know how you’d do that.
You may need to just change any results you get. As a start, try:
puts scraper.page.search("#headerBodyType").to_s.gsub("\r", "\n")
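A small sketch of the effect described above, using a hard-coded string in place of the real query result. The string is an assumption modelled on the expected output, not the actual CarFax response.

```ruby
# Hypothetical stand-in for the markup returned by the search.
raw = "<div id=\"headerBodyType\">\r SEDAN 4 DR\r </div>"

# inspect (which is what p prints) makes the carriage returns visible.
visible = raw.inspect

# Converting each \r to \n gives output a terminal displays line by line.
fixed = raw.gsub("\r", "\n")

puts visible
puts fixed
```

If the source might mix \r\n and bare \r, a pattern such as /\r\n?/ in the gsub call avoids doubling the line breaks.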

Related

Qt : Reading the text file and Displaying in LineEdit

I have an input file and a batch file. When the batch file is executed using the System command,
a corresponding outfile is generated.
Now I want a particular text (position 350 to 357) from that outfile to be displayed on to my lineedit widget
Here is that part of my code:
system("C:/ORG_Class0178.bat")
Now the outfile will be generated
File.open("C:/ORG_Class0178_out.txt", 'r').each do |line|
var = line[350..357]
puts var
# To test whether the file is being read.
#responseLineEdit = Qt::LineEdit.new(self)
#responseLineEdit.setFont Qt::Font.new("Times New Roman", 12)
#responseLineEdit.resize 100,20
#responseLineEdit.move 210,395
#responseLineEdit.setText("#{var}")
end
When I do test whether the file is being read using puts statement, I get the exact required output in editor. However, the same text is not being displayed on LineEdit. Suggestions are welcome.
EDIT: A weird observation here. It works fine when I try to read the input file and display it; however, it does not work with the output file. The puts statement does give the answer in the editor, confirming that the output file does contain the required text. I am confused by this scenario.
There is nothing wrong with the code fragments shown.
Note that var is a local variable. Are the second and third code fragments in the same context? If they are in the same method, and var is not touched in-between, it will work.
If the fragments belong to different methods of the same class, then an instance variable (@var) will solve the problem.
If all that does not help, use Pry to chase the problem. Follow the link to find the pre-requisites and how to use. Place binding.pry in your code, and your program will stop at that line. Then inspect what your variables are doing.
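To illustrate the instance-variable point from this answer, here is a minimal sketch. The class and method names are hypothetical and not taken from the question; only the 350..357 slice comes from the original code.

```ruby
# Hypothetical class; the names are illustrative, not from the question.
class ReportReader
  def read_field(line)
    # Stored in an instance variable so other methods can see it.
    @var = line[350..357]
  end

  def field_text
    # A local variable named var would not be visible here; @var is.
    @var.to_s
  end
end

reader = ReportReader.new
reader.read_field("x" * 350 + "ABCDEFGH" + "tail")
puts reader.field_text
```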
Try 'rb' instead of 'r':
File.open("C:/ORG_Class0178_out.txt", 'rb').each do |line|
  var = line[350..357]
  puts var
end

Ruby Nokogiri - How to prevent Nokogiri from printing HTML character entities

I have an HTML file which I am parsing using Nokogiri and then generating HTML from it like this:
htmltext = File.open("input.html").read
h_doc = Nokogiri::HTML(htmltext)
# ... modify h_doc ...
File.open("output.html", 'w+') do |file|
  file.write(h_doc)
end
The question is how to prevent Nokogiri from printing HTML character entities (&lt;, &gt;, &amp;) in the final generated HTML file.
Instead of HTML character entities (&lt;, &gt;, &amp;) I want to print the actual characters (<, >, etc.).
As an example, it is printing the HTML like
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output like this
<title><%= ("/emailclient=sometext")%></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
Best suggestion I have, replace those sequences with something else beforehand, cut it up with Nokogiri, then replace them back. Your input is not XML/HTML, there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&amp;".</div>
which would render as:
To write "&", you need to write "&".
Even worse in this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
if you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
EDIT I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
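The other suggestion in this answer (swap the template sequences for something else, run Nokogiri, then swap them back) could look roughly like the sketch below. The helper names and the placeholder format are hypothetical, and the Nokogiri step is left as a comment so the example runs without the gem.

```ruby
# Hypothetical helpers; the placeholder format is an arbitrary choice.
def protect_erb(html)
  saved = []
  protected_html = html.gsub(/<%.*?%>/m) do |tag|
    saved << tag
    "ERBPLACEHOLDER#{saved.size - 1}X"
  end
  [protected_html, saved]
end

def restore_erb(html, saved)
  # Swap each placeholder back for the tag stored at that index.
  html.gsub(/ERBPLACEHOLDER(\d+)X/) { saved[Regexp.last_match(1).to_i] }
end

source = '<title><%= ("/emailclient=sometext") %></title>'
safe, saved = protect_erb(source)
# ... parse and modify `safe` with Nokogiri here, then serialize it ...
puts restore_erb(safe, saved)
```

The placeholder contains only letters and digits, so an HTML serializer has no reason to escape it. It would need changing if the real documents could contain the same text.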
You absolutely can prevent Nokogiri from transforming your entities. It's a built-in function, even; no voodoo or hacking needed. Be warned, I'm not a Nokogiri guru and I've only got this to work when I'm acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.
When you create or load your document you need to include the NOENT option. That's it. You're done, you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to call a doc with options, below is my personal favorite method.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As for the usefulness of this feature... I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.

Uploading and parsing text document in Rails

In my application, the user must upload a text document, the contents of which are then parsed by the receiving controller action. I've gotten the document to upload successfully, but I'm having trouble reading its contents.
There are several threads on this issue. I've tried more or less everything recommended on these threads, and I'm still unable to resolve the problem.
Here is my code:
file_data = params[:file]
contents = ""
if file_data.respond_to?(:read)
contents = file_data.read
else
if file_data.respond_to?(:path)
File.open(file_data, 'r').each_line do |line|
elts = line.split
#
#
end
end
end
So here are my problems:
file_data doesn't 'respond_to?' either :read or :path. According to some other threads on the topic, if the uploaded file is less than a certain size, it's interpreted as a string and will respond to :read. Otherwise, it should respond to :path. But in my code, it responds to neither.
If I try to take out the if statements and straight away attempt File.open(file_data, 'r'), I get an error saying that the file wasn't found.
Can someone please help me find out what's wrong?
PS, I'm really sorry that this is a redundant question, but I found the other threads unhelpful.
Are you actually storing the file? Because if you are not, of course it can't be found.
First, find out what you're actually getting for file_data by adding debug output of file_data.inspect. It may be something you don't expect, especially if the form isn't set up correctly (i.e. :multipart => true).
Rails should wrap the uploaded file in a special object providing a uniform interface, so something as simple as this should work:
file_data.read.each_line do |line|
elts = line.split
#
#
end
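The snippet above only relies on the upload object responding to read. A minimal sketch of that idea outside Rails follows. The parse_upload helper is hypothetical, and StringIO is used purely as a stand-in for the uploaded file object.

```ruby
require "stringio"

# Hypothetical helper: accepts any IO-like object that responds to read,
# which is the interface the answer above relies on.
def parse_upload(io)
  io.read.each_line.map { |line| line.split }
end

# StringIO stands in for the uploaded file object in this sketch.
upload = StringIO.new("alpha beta\ngamma delta epsilon\n")
rows = parse_upload(upload)
p rows
```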

Filling a textform - String too small?

I currently have to do a job where I have to copy the code of a website into a textfield.
I'm using watir to do the browser handling. As far as I know, I can only fill the field using the set function, which means that I have to do something like
browser.text_field(:id => "text").set sitetext
With sitetext being the code of the website that I'm copying into it.
I've loaded the code from a file into an array before and then pushed it into the string (probably not the best choice, but easiest for me right now), using the following code.
contentArray=Array.new
inputFile=File.open("my-site.html")
inputFile.each{|line| contentArray<<line}
inputFile.close
Now when I execute the first command to fill in the text_field, it slowly types in all the letters (is there an easy way to speed this up?), but after 692 characters it stops in the middle of the sentence.
[I pasted the text that was entered into charcounter.com, that's how I know this number.]
Where is the problem? Is ruby giving my strings a limited size for some reason? Can I somehow lift this barrier?
Is there another way to fill the text_field?
Try the .value method
browser.text_field(:id => "text").value=(open('my-site.html') { |f| f.read })
OR
I'm thinking the misprinting of umlauts etc is something to do with the codepage settings on your machine and the file you're reading from. You might have to experiment going from one code page to another ... I'm guessing your source file is CP850 or perhaps even UTF-8 and I think you need western european to get umlauts... but being Australian I really have no idea =)
http://en.wikipedia.org/wiki/ISO_8859
e.g.
require 'iconv'
browser.text_field(:id => "text").value=(
Iconv.iconv('CP850', 'ISO-8859-1', open('my-site.html') { |f| f.read })
)
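On Ruby 1.9 and later, String#encode can do this kind of conversion without iconv, which was removed from the standard library in Ruby 2.0. The sketch below assumes the source bytes are ISO-8859-1 and that UTF-8 is wanted. Both are guesses, just like the code pages in the answer above.

```ruby
# Hypothetical input: the bytes of "für" as stored in an ISO-8859-1 file.
latin1_bytes = "f\xFCr".dup.force_encoding("ISO-8859-1")

# Transcode to UTF-8; the first argument is the target encoding.
utf8 = latin1_bytes.encode("UTF-8")

puts utf8
```

For a file, the same source and target encodings can be given when opening it, for example with a mode string such as "r:ISO-8859-1:UTF-8".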

REXML is wrapping long lines. How do I switch that off?

I am creating an XML document using REXML
File.open(xmlFilename,'w') do |xmlFile|
xmlDoc = Document.new
# add stuff to the document...
xmlDoc.write(xmlFile,4)
end
Some of the elements contain quite a few arguments and hence the corresponding lines can get quite long. If they get longer than 166 chars, REXML inserts a line break. This is of course still perfectly valid XML, but my workflow includes some diffing and merging, which works best if each element is contained in one line.
So, is there a way to make REXML not insert these line-wrapping line breaks?
Edit: I ended up pushing the finished XML file through tidy as the last step of my script. If someone knew a nicer way to do this, I would still be grateful.
As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to fix it by overwriting the Formatters::Pretty class's write_text method so that it uses the configurable @width attribute instead of the hard-coded 80.
require "rubygems"
require "rexml/document"
include REXML
long_xml = "<root><tag>As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to *fix* it by overwriting the Formatters::Pretty class's write_text method.</tag></root>"
xml = Document.new(long_xml)
# fix bug in REXML::Formatters::Pretty
class MyPrecious < REXML::Formatters::Pretty
  def write_text( node, output )
    s = node.to_s()
    s.gsub!(/\s/,' ')
    s.squeeze!(" ")
    # The Pretty formatter code mistakenly used 80 instead of the @width variable
    # s = wrap(s, 80-@level)
    s = wrap(s, @width-@level)
    s = indent_text(s, @level, " ", true)
    output << (' '*@level + s)
  end
end
printer = MyPrecious.new(5)
printer.width = 1000
printer.compact = true
printer.write(xml, STDOUT)
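Where the installed REXML already uses @width inside write_text, the subclass may not be needed. That is an assumption about the installed version, so check your copy of rexml/formatters/pretty.rb. A minimal sketch under that assumption, writing to a string instead of STDOUT:

```ruby
require "rexml/document"

# Build a text node well past 80 characters.
long_text = Array.new(40, "word").join(" ")
doc = REXML::Document.new("<root><tag>#{long_text}</tag></root>")

formatter = REXML::Formatters::Pretty.new(4)
formatter.compact = true   # keep text-only elements on one line when they fit
formatter.width = 1000     # assumed to be honored by the installed REXML

out = +""
formatter.write(doc, out)
puts out
```

If the text still wraps at 80 characters, the installed formatter has the hard-coded value and the override shown above is the way to go.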
Short answer: yes and no.
REXML uses different formatters based on the value you specify for indent. If you leave the default -1, it uses REXML::Formatters::Default. If you give it a value like 4, it uses REXML::Formatters::Pretty. The pretty formatter does have logic in it to wrap lines (though it looks like it wraps at 80, not 166), when dealing with text (not tags or attributes). For example, the contents of
<p> a paragraph tag </p>
would be wrapped at 80 characters, but
<a-tag with='a' long='list' of='attributes'/>
would not be wrapped.
Anyway, the 80 is hard-coded in rexml/formatters/pretty.rb and not configurable. And if you use the default formatter with no indent, then it's mostly just a raw dump without added line breaks. You could try the transitive formatter (see docs for Document.write), but it's broken in some versions of Ruby and might require a code hack. It probably isn't what you want anyway.
You might try taking a look at Builder::XmlMarkup from the builder gem.
