Parsing large file with SaxMachine seems to be loading the whole file into memory - ruby

I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks - rather it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.
On a smaller file (50 MB) it also appears to be loading the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load all of the listing elements into this one ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big 1.6 GB file because too much memory is used :( )

Here's a Reader that will yield each listing's XML to a block, so you can process each Listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'

# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end

Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to specifically reference gregwebs' version in your Gemfile like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'

I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer that is planning on merging my changes.
The new maintainer and I have been using my fork without issue for a year now.

You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html directly and implement the handler yourself.
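To give an idea of the shape, here's a minimal sketch of such a handler for this feed (assuming the element names from your FeedListing class and the same Listing.import call; note the hash keys end up as strings rather than symbols):
require 'nokogiri'

# Minimal sketch of a hand-rolled SAX handler: each <listing> is imported
# as soon as its closing tag is seen, so only one listing is in memory at a time.
class ListingHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[id title description url]

  def start_element(name, attrs = [])
    if name == 'listing'
      @listing = {}
    elsif @listing && FIELDS.include?(name)
      @current = name
      @listing[@current] = ''
    end
  end

  def characters(text)
    @listing[@current] << text if @listing && @current
  end

  def end_element(name)
    if name == 'listing'
      Listing.import(@listing) # same call as in your perform method
      @listing = nil
    elsif FIELDS.include?(name)
      @current = nil
    end
  end
end

Nokogiri::XML::SAX::Parser.new(ListingHandler.new).parse(File.open('large_feed.xml'))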

Related

Parsing Large xml file [duplicate]

So I'm attempting to parse a 400k+ line XML file using Nokogiri.
The XML file has this basic format:
<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
  <DisorderList count="6760">
    *** Repeated Many Times ***
    <Disorder id="17601">
      <OrphaNumber>166024</OrphaNumber>
      <Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
      <DisorderSignList count="18">
        <DisorderSign>
          <ClinicalSign id="2040">
            <Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
          </ClinicalSign>
          <SignFreq id="640">
            <Name lang="en">Very frequent</Name>
          </SignFreq>
        </DisorderSign>
    </Disorder>
    *** Repeated Many Times ***
  </DisorderList>
</JDBOR>
Here is the code I've created to parse and return each DisorderSign id and name into a database:
require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()

symptomsList = []

@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()
  name = x.at('ClinicalSign').element_children().text()
  symptomsList.push([signId, name])
end

symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
This works perfectly on the test files I've used, although they were much smaller, around 10,000 lines.
When I attempt to run this on the large XML file, it simply does not finish. I left it on overnight and it seemed to just lock up. Is there any fundamental reason the code I've written would make this very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.
Thank you for any help.
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as some sort of libxml2 data structure and that will probably be larger than the raw XML file.
Then you do things like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally iterate over that array:
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
I find that SAX parsers work better with large data sets but they are more cumbersome to work with. You could try creating your own SAX parser something like this:
class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }
    elsif(name == 'ClinicalSign')
      @key = :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key = :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true
    end
  end

  def characters(str)
    @data[@key] += str if(@key && @in_name)
  end

  def end_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end
The structure should be pretty clear: you watch for the opening of the elements that you're interested in and do a bit of bookkeeping setup when they open, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
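For completeness, wiring the handler up to the parser is just a couple of lines (plain Nokogiri SAX API, nothing specific to this problem):
require 'nokogiri'

parser = Nokogiri::XML::SAX::Parser.new(D.new)

# Streams the file; memory use stays roughly flat no matter how big Temp.xml is.
parser.parse(File.open("Temp.xml"))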
A SAX parser is definitely what you want to be using. If you're anything like me and can't jive with the Nokogiri documentation, there is an awesome gem called Saxerator that makes this process really easy.
An example of what you are trying to do:
require 'saxerator'

parser = Saxerator.parser(File.new("Temp.xml"))

parser.for_tag(:DisorderSign).each do |sign|
  signId = sign[:ClinicalSign][:id]
  name = sign[:ClinicalSign][:name]
  Symptom.create!(:name => name, :id => signId)
end
You're likely running out of memory because symptomsList is getting too large. Why not perform the SQL within the xpath loop?
require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()

@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()
  name = x.at('ClinicalSign').element_children().text()
  Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end
It's possible too that the file is just too large for the buffer to handle. In that case you could chop it up into smaller temp files and process them individually.
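If you do go the splitting route, here is a rough sketch of one way to do it. It assumes each <Disorder> element opens and closes on its own line, as in the sample above, and wraps every chunk in a dummy root so each temp file stays well-formed:
# Rough sketch: split Temp.xml into files of ~500 <Disorder> records each.
chunk = []
records = 0
file_index = 0
inside = false

flush = lambda do
  File.write("chunk_#{file_index}.xml", "<DisorderList>\n#{chunk.join}</DisorderList>\n")
  chunk = []
  records = 0
  file_index += 1
end

File.foreach("Temp.xml") do |line|
  inside = true if line.include?("<Disorder ")
  chunk << line if inside
  if line.include?("</Disorder>")
    inside = false
    records += 1
    flush.call if records >= 500
  end
end

flush.call unless chunk.empty?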
You can also use Nokogiri::XML::Reader. It's more memory intensive than the Nokogiri::XML::SAX parser, but you can keep the XML structure, e.g.:
class NodeHandler < Struct.new(:node)
  def process
    # Node processing logic
    # e.g.
    signId = node.at('ClinicalSign').attribute('id').text()
    name = node.at('ClinicalSign').element_children().text()
  end
end

Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
  if node.name == 'DisorderSign' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    NodeHandler.new(
      Nokogiri::XML(node.outer_xml).at('./DisorderSign')
    ).process
  end
end
Based on this blog

How do I test reading a file?

I'm writing a test for one of my classes which has the following constructor:
def initialize(filepath)
  @transactions = []
  File.open(filepath).each do |line|
    next if $. == 1
    elements = line.split(/\t/).map { |e| e.strip }
    transaction = Transaction.new(elements[0], Integer(elements[1]))
    @transactions << transaction
  end
end
I'd like to test this by using a fake file, not a fixture. So I wrote the following spec:
it "should read a file and create transactions" do
filepath = "path/to/file"
mock_file = double(File)
expect(File).to receive(:open).with(filepath).and_return(mock_file)
expect(mock_file).to receive(:each).with(no_args()).and_yield("phrase\tvalue\n").and_yield("yo\t2\n")
filereader = FileReader.new(filepath)
filereader.transactions.should_not be_nil
end
Unfortunately this fails because I'm relying on $. to equal 1 and increment on every line and for some reason that doesn't happen during the test. How can I ensure that it does?
Global variables make code hard to test. You could use each_with_index:
File.open(filepath) do |file|
  file.each_with_index do |line, index|
    next if index == 0 # zero based
    # ...
  end
end
But it looks like you're parsing a CSV file with a header line. Therefore I'd use Ruby's CSV library:
require 'csv'

CSV.foreach(filepath, col_sep: "\t", headers: true, converters: :numeric) do |row|
  @transactions << Transaction.new(row['phrase'], row['value'])
end
You can (and should) use IO#each_line together with Enumerable#each_with_index, which will look like:
File.open(filepath).each_line.each_with_index do |line, i|
  next if i == 0 # each_with_index is zero-based, so this skips the header line
  # …
end
Or you can drop the first line, and work with others:
File.open(filepath).each_line.drop(1).each do |line|
  # …
end
If you don't want to mess around with mocking File for each test, you can try FakeFS, which implements an in-memory file system based on StringIO that cleans up automatically after your tests.
This way your tests don't need to change if your implementation changes.
require 'fakefs/spec_helpers'

describe "FileReader" do
  include FakeFS::SpecHelpers

  def stub_file(file, content)
    FileUtils.mkdir_p File.dirname(file)
    File.open(file, 'w') { |f| f.write(content) }
  end

  it "should read a file and create transactions" do
    file_path = "path/to/file"
    stub_file file_path, "phrase\tvalue\nyo\t2\n"

    filereader = FileReader.new(file_path)
    expect( filereader.transactions ).to_not be_nil
  end
end
Be warned: this is an implementation of most of the file access in Ruby, passing it back onto the original method where possible. If you are doing anything advanced with files you may start running into bugs in the FakeFS implementation. I got stuck with some binary file byte read/write operations which weren't implemented in FakeFS quite how Ruby implemented them.

Parse/read Large XML file with minimal memory footprint

I have a very large XML file (300mb) of the following format:
<data>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
</data>
Now I need to read it and iterate through the point nodes doing something for each. Currently I'm doing it with Nokogiri like this:
require 'nokogiri'

xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end
However, that's not very efficient, since it parses the whole file at once, creating a huge memory footprint (several GB).
Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?
EDIT
The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of returning a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.
Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end
Someone should make this a gem, perhaps me ;)
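If you'd rather not depend on the gist's Xml::Parser wrapper, roughly the same thing can be done with Nokogiri::XML::Reader directly. A sketch (save_id is the method from the question; only one <point> is materialized at a time):
require 'nokogiri'

reader = Nokogiri::XML::Reader(File.open("large_file.xml"))

reader.each do |node|
  next unless node.name == 'point' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

  # Parse just this one <point> fragment; the rest of the file stays streamed.
  point = Nokogiri::XML(node.outer_xml).at('./point')
  save_id(point.at('./id').text)
  # point.at('./time').text works the same way
end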
Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:
require 'nokogiri'

class IDCollector < Nokogiri::XML::SAX::Document
  attr :ids

  def initialize
    @ids = []
    @inside_id = false
  end

  def start_element(name, attrs)
    # NOTE: This is simplified. You need some kind of stack manipulations
    # (push in start_element / pop in end_element)
    # to correctly pick `.//data/point/id` elements.
    @inside_id = true if name == 'id'
  end

  def end_element(name)
    @inside_id = false
  end

  def cdata_block(string)
    @ids << string if @inside_id
  end
end

collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))

p collector.ids # => ["1371308", "1371308", "1371308"]
According to the documentation, Nokogiri::XML::SAX::Parser "is a SAX style parser that reads its input as it deems necessary."
You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.
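For example, PushParser lets you feed the document to the same IDCollector in whatever chunk size you like (a sketch; the 4 KB chunk size is arbitrary):
collector = IDCollector.new
parser = Nokogiri::XML::SAX::PushParser.new(collector)

# Feed the parser by hand, 4 KB at a time.
File.open('large_file.xml') do |f|
  parser << f.read(4096) until f.eof?
end
parser.finish

p collector.ids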
If you use JRuby, you can take advantage of vtd-xml, which has the most efficient in-memory model, 3-5x more efficient than DOM.
http://vtd-xml.sf.net

How to crawl the right way?

I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but anything larger takes too much time and it freezes or something.
I have divided the code in pieces here so it would be easy to read, the whole script/code is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.
My question: how can I fix this so it won't take a week to crawl everything? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary lib:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
Loop through the @urls array and grab every element node that I want with XPath:
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
    m = e.text
    m = m.to_s
    next if m.empty?
    @titleSv << m
  }
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
(0..#ID.length - 1).each do |index|
row << [#ID[index], #titleSv[index], #titleEn[index], #identifier[index], #typeOfLevel[index], #typeOfResponsibleBody[index], #courseTyp[index], #credits[index], #degree[index], #preAcademic[index], #subjectCodeVhs[index], #descriptionSv[index], #lastedited[index], #expires[index]]
end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
Libraries
You're using a lot of libraries here that probably aren't necessary.
You use both REXML and Nokogiri. They both do the same job, except Nokogiri is much better at it (benchmark).
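As a rough sketch of what dropping REXML looks like for the title lookup (remove_namespaces! is my assumption here, so the namespaced and non-namespaced variants of your XPath collapse into one):
xml = Nokogiri::XML(open(url))
xml.remove_namespaces!

title = xml.at_xpath('/educationInfo/titles/title[1]')
@titleSv << title.text unless title.nil? || title.text.empty?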
Use Hashes
Instead of storing data at index in 15 arrays, have one set of hashes.
For instance,
require 'set'

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Collect the data, then write to file
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
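Something along these lines (a sketch; the column list is abbreviated and the filename is the one from your question):
require 'csv'

CSV.open("eduction_normal.csv", "wb") do |csv|
  items.each do |item|
    csv << [item[:id], item[:title_sv], item[:title_en]] # ...and the rest of the columns
  end
end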
Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
  m = e.text
  m = m.to_s
  next if m.empty?
  @titleSv << m
}
Move what's common to a method
def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    str = e.text.to_s
    next if str.empty?
  end
  str
end
And move anything constant to another hash
xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}
Now you can combine these techniques to make for much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
I hope this helps!
It won't work without your fixes. And I believe you should do as @Ian Bishop said and refactor your parsing code:
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML

  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end

class Crawler < Pioneer::Base
  include REXML

  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you're going to have to go concurrent.
One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
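A sketch of what that change could look like (assuming the peach gem's Enumerable#peach and the items set from @Ian Bishop's answer):
require 'peach'

# 8 threads; at least one per core, more if the loop spends time waiting on HTTP.
items.to_a.peach(8) do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
end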
Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
