I want to scrape data from specific divs on a CarFax report. However, when I search for divs, I always get this weird garbage output.
I tried `search('#divId')` and `search('.divClass')`, and even tried to grab all divs with `search('div')`. Each time I get similar results: the div's content is partially truncated and the tags are all mangled.
This is the URL I am loading into my agent: https://gist.github.com/atkolkma/8024287
This is the code (user and pass omitted):
require "rubygems"
require "mechanize"
scraper = Mechanize.new
scraper.user_agent_alias = 'Mac Safari'
scraper.follow_meta_refresh = true
scraper.redirect_ok = true
scraper.get("http://www.carfaxonline.com")
form = scraper.page.forms.first
form.j_username = "******"
form.j_password = "*****"
scraper.submit(form)
scraper.get("http://www.carfaxonline.com/api/report?vin=1G1AT58H697144202&track=true")
puts scraper.page.search("#headerBodyType")
This is what the file returns when I run it:
</div>4 DRderBodyType">
What I expect is:
<div id="headerBodyType"> SEDAN 4 DR </div>
The strangest thing is, if I copy the HTML source, save it as a new file, upload it and search it, I get the correct output! I've uploaded the copied HTML to my chevy-pics dot com domain and run the following code:
scraper2 = Mechanize.new
scraper2.get("http://www.chevy-pics.com/test.html")
puts scraper2.page.search("#headerBodyType")
I get this as output, as expected:
<div id="headerBodyType"> SEDAN 4 DR </div>
I can reproduce this by changing the line endings on the file in my editor to Mac OS 9 style, which uses a single \r (carriage return) character. When you use puts on the resulting string, the console returns to the start of the line each time this character is seen, but doesn't start a new line. Each line therefore overwrites what was there before, and you end up with the corrupted output you are seeing.
You should be able to tell if this is the case by using p instead of puts. You should see something like "<div id=\"headerBodyType\">\r SEDAN 4 DR\r </div>" as the output. Notice the \r characters used as newlines.
The actual result you get from the query is correct; it's just the display that is causing the problems. The solution is probably just to use gsub on the text to convert \r to the more normal \n. I don't know the best place to do this; it might be possible to change the entire document before Mechanize hands it over to Nokogiri for parsing, but I don't know how you'd do that.
You may need to just change any results you get; as a start, try:
puts scraper.page.search("#headerBodyType").to_s.gsub("\r", "\n")
I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.
The parsing code is good; in fact, I'll share it just for the heck of it, even though it doesn't have much to do with my problem:
geofactory = RGeo::Geographic.projected_factory(
  :projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs",
  :projection_srid  => 3361
)

f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)

kmldoc.css("//Placemark").each_with_index do |placemark, i|
  puts i
  tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
  h = HorryParcel.new
  h.owner_name = tds.shift.text
  tds.shift
  tds.each_slice(2) do |k, v|
    col = k.text.downcase
    eval("h.#{col} = v.text")
  end
  coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map { |x| x.split(",") }
  points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
  geo_shape = geofactory.polygon(geofactory.linear_ring(points))
  proj_shape = geo_shape.projection
  h.geo_shape = geo_shape
  h.proj_shape = proj_shape
  h.save
end
Anyway, I've tested this code with a much, much smaller sample of kml and it works.
However, when I load the real thing, Ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks, and during this multi-hour period not a single i value has been printed to the command line. Oddly enough it hasn't timed out yet, but even if this works, there has got to be a better way to do this.
I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.
Here's a sample of the kml (w/ just one placemark) if that helps.
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
<name>justone.kml</name>
<Style id="PolyStyle00">
<LabelStyle>
<color>00000000</color>
<scale>0</scale>
</LabelStyle>
<LineStyle>
<color>ff0000ff</color>
</LineStyle>
<PolyStyle>
<color>00f0f0f0</color>
</PolyStyle>
</Style>
<Folder>
<name>justone</name>
<open>1</open>
<Placemark id="ID_010161">
<name>STUART CHARLES A JR</name>
<Snippet maxLines="0"></Snippet>
<description>""</description>
<styleUrl>#PolyStyle00</styleUrl>
<MultiGeometry>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</MultiGeometry>
</Placemark>
</Folder>
</Document>
</kml>
EDIT:
99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (instead of my laptop) and run it until it either times out or finishes.
class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    # ...
  end
end
The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.
For a huge XML file, you should not use Nokogiri's default XML parser, because it builds the whole document in memory as a DOM. A much better parsing strategy for large XML files is SAX. Luckily for us, Nokogiri supports SAX.
The downside is that with a SAX parser, all the logic has to be done through callbacks. The idea is simple: the SAX parser starts reading the file and lets you know whenever it finds something interesting, for example an opening tag, a closing tag, or text. You bind callbacks to these events and extract whatever you need.
Of course you don't want to use a SAX parser to load the whole file into memory and work with it there; that is exactly what SAX is meant to avoid. You will need to do whatever you want with this file part by part.
So this basically means rewriting your parsing logic with callbacks. To learn more about DOM vs. SAX parsers, you might want to check this FAQ from cs.nmsu.edu
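To make this concrete, here is a minimal, untested sketch of what a SAX version could look like for the KML above. The element name and file name come from the question; the coordinate handling is just a stub:

require 'nokogiri'

# Collects the text of each <coordinates> element and hands it off one
# placemark at a time, so the whole file is never held in memory.
class PlacemarkHandler < Nokogiri::XML::SAX::Document
  def initialize
    @in_coordinates = false
    @buffer = ""
  end

  def start_element(name, attrs = [])
    @in_coordinates = true if name == "coordinates"
  end

  def characters(string)
    @buffer << string if @in_coordinates
  end

  def end_element(name)
    return unless name == "coordinates"
    @in_coordinates = false
    process(@buffer.strip)
    @buffer = ""
  end

  def process(coords)
    # one placemark's worth of "lon,lat,alt lon,lat,alt ..." text;
    # split it and save it here instead of accumulating anything
    puts coords[0, 40]
  end
end

Nokogiri::XML::SAX::Parser.new(PlacemarkHandler.new).parse(File.open("horry_parcels.kml"))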
I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem: less. less was built a long time ago and is part of Unix by default in most cases.
http://en.wikipedia.org/wiki/Less_%28Unix%29
Not to be confused with the stylesheet language ("LESS"), less is a text viewer (it cannot edit files, only read them) that does not load the entire document it is reading until you have scrolled through the whole thing yourself. That is, it loads the first "page", so to speak, and waits for you to call for the next one.
If a ruby script could somehow pipe "pages" of text into... oh wait... the XML structure wouldn't allow it, because a chunk cut from the middle wouldn't have the closing delimiters that sit at the end of the undigested file. So you would have to do some custom work on the front end: cut out those first couple of parent tags so that you can pluck out the XML children one by one, and accept that the last closing parent tags will break the script, because the parser will think it is finished and then come across another closing tag.
I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-line blocks of text into ruby (or python, etc.) via less or something similar to it (perhaps something even more primitive than less; I'm not sure), as sketched below.
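The closest analogue in plain Ruby is probably IO.foreach, which streams the file one line at a time instead of slurping it. An untested sketch of the chunking idea, assuming each Placemark tag sits on its own line as in the sample above:

require 'nokogiri'

# Nothing is held in memory except the fragment for the placemark
# currently being assembled.
fragment = ""
IO.foreach("horry_parcels.kml") do |line|
  fragment << line if !fragment.empty? || line.include?("<Placemark")
  if line.include?("</Placemark>")
    doc = Nokogiri::XML(fragment)  # a small standalone document, cheap to parse
    # ...pull whatever you need out of doc here...
    fragment = ""
  end
end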
I am learning Ruby and trying to manipulate Excel data.
my goal:
To be able to extract email addresses from an excel file and place them in a text file one per line and add a comma to the end.
my ideas:
I think my answer lies in the use of the spreadsheet gem and File.new.
What I am looking for is direction; I would like to hear any tips or hints for accomplishing my goal. Thanks.
Please do not post exact code; I'm only looking for direction and would like to figure it out myself...
thanks, karen
UPDATE:
So, regex seems to be able to find all matching strings and store them in an array. I'm having some trouble setting that up, but I should be able to figure it out... For right now, to get started, I will extract only the column labeled "E Mail". The question I have now is:
`parse_csv = CSV.parse(read_csv, :headers => true)`
The default value for :skip_blanks is set to false. I need to set it to true, but nowhere can I find the correct syntax for doing so... I was assuming something like
`parse_csv = CSV.parse(read_csv, :headers => true :skip_blanks => true)`
But no.....
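(For what it's worth, the two options form a single hash, so they just need a comma between them: `parse_csv = CSV.parse(read_csv, :headers => true, :skip_blanks => true)`.)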
Save your Excel file as CSV (comma-separated values) and work with Ruby's CSV library.
Besides spreadsheet (which can read and write), you can read Excel and other file types with RemoteTable.
gem install remote_table
and
require 'remote_table'
t = RemoteTable.new('/path/to/file.xlsx', headers: :first_row)
When you write the CSV, as @aug2uag says, you can use Ruby's standard library (no gem install required):
require 'csv'
puts [name, email].to_csv
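Putting the two together, an untested sketch (assuming the column header in the sheet is literally "E Mail" as in the question, and placeholder file names):

require 'remote_table'

t = RemoteTable.new('/path/to/file.xlsx', headers: :first_row)

File.open('emails.txt', 'w') do |f|
  t.each do |row|                            # rows come back as hashes keyed by header
    email = row['E Mail'].to_s.strip
    f.puts "#{email}," unless email.empty?   # one address per line, with a trailing comma
  end
end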
Personally, I'd keep it as simple as possible and use a CSV.
Here is some pseudocode of how that would work:
read in your file line by line
extract your fields using regex or cell count (depending on how consistent the email address location is), and insert them into an array
iterate through the array and write the values in the fashion you wish (to console, or file)
The code in the comment you had is a great start; however, puts will only write to the console, not to a file. You will also need to figure out how you will know you are getting the email address.
Hope this helps.
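Since you asked for direction rather than exact code, come back to this only if you get stuck; it is one untested way the pseudocode above could look with plain CSV and a crude regex (the file names, the header, and the regex are placeholders):

require 'csv'

rows = CSV.read("contacts.csv", :headers => true, :skip_blanks => true)
emails = rows.map { |row| row["E Mail"] }
             .compact
             .grep(/\A[^@\s]+@[^@\s]+\z/)    # keep only things shaped like an address

File.open("emails.txt", "w") do |f|
  emails.each { |email| f.puts "#{email}," } # one per line, with a trailing comma
end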
In my application, the user must upload a text document, the contents of which are then parsed by the receiving controller action. I've gotten the document to upload successfully, but I'm having trouble reading its contents.
There are several threads on this issue. I've tried more or less everything recommended on these threads, and I'm still unable to resolve the problem.
Here is my code:
file_data = params[:file]
contents = ""
if file_data.respond_to?(:read)
  contents = file_data.read
else
  if file_data.respond_to?(:path)
    File.open(file_data, 'r').each_line do |line|
      elts = line.split
      #
      #
    end
  end
end
So here are my problems:
file_data doesn't 'respond_to?' either :read or :path. According to some other threads on the topic, if the uploaded file is less than a certain size, it's interpreted as a string and will respond to :read. Otherwise, it should respond to :path. But in my code, it responds to neither.
If I try to take out the if statements and straight away attempt File.open(file_data, 'r'), I get an error saying that the file wasn't found.
Can someone please help me find out what's wrong?
PS, I'm really sorry that this is a redundant question, but I found the other threads unhelpful.
Are you actually storing the file? Because if you are not, of course it can't be found.
First, find out what you're actually getting for file_data by adding debug output of file_data.inspect. It may be something you don't expect, especially if the form isn't set up correctly (i.e. missing :multipart => true).
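For example, a bare-bones multipart form would look something like this (untested; the action name is a placeholder):

<%= form_tag({ :action => :upload }, :multipart => true) do %>
  <%= file_field_tag :file %>
  <%= submit_tag "Upload" %>
<% end %>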
Rails should enclose the uploaded file in a special object providing a uniform interface, so that something as simple as this should work:
file_data.read.each_line do |line|
  elts = line.split
  #
  #
end
I'm modding a TextMate bundle even though I'm a complete beginner at Ruby. The problem I'm trying to solve is moving the caret to a certain position after the command has produced its output.
Basically what happens is this:
I hit a key combo which triggers a command that filters through the document, inserts text at the relevant places, then exits by replacing the document with the new filtered text.
What I want to happen next is for the caret to move back to where it originally was. I was pretty happy when I found the TextMate.go_to function, but I can only get it partly to work. The function:
positionY = ENV['TM_LINE_NUMBER']
positionX = ENV['TM_LINE_INDEX']
...
TextMate.go_to :line => positionY, :column => positionX; #column no worky
I can get the caret to the right line, but the column parameter isn't working. I've tried shifting the parameters around and even calling the function with just the column param, but no luck. I've also tried a hard-coded integer, but positionX prints the correct line index, so I doubt the problem is there.
This is the only documentation I've found on this method, but I took a look in the textmate.rb and to my untrained eyes it seems I'm using it properly.
I know this can be achieved by macros, but I want to avoid that if possible.
I also know that you can use markers if you choose "Insert as snippet" but then I'd have to clear the document first, and I haven't really figured out how to do this either without using the "Replace document" option.
Anyone?
Let's look at the source code of the bindings:
def go_to(options = {})
  default_line = options.has_key?(:file) ? 1 : ENV['TM_LINE_NUMBER']
  options = {:file => ENV['TM_FILEPATH'], :line => default_line, :column => 1}.merge(options)
  if options[:file]
    `open "txmt://open?url=file://#{e_url options[:file]}&line=#{options[:line]}&column=#{options[:column]}"`
  else
    `open "txmt://open?line=#{options[:line]}&column=#{options[:column]}"`
  end
end
Rather hackishly, the binding sets up a txmt:// URL and calls open on it in the shell.
So the first thing to do would be to construct an open URL and type it into Terminal or your browser, to see whether TextMate is respecting the column parameter. If that works, then perhaps there is a bug in your version's implementation of TextMate.go_to.
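For example, with a file already open in TextMate, run something like this in Terminal (10 and 5 are arbitrary test values):

open "txmt://open?line=10&column=5"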