I am using Ruby 1.9.3p0. My program uses a lot of memory when it runs for more than 4 hours. I am using the following libraries:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cgi'
require 'domainatrix'
The following code runs more than 10,000 times, and I suspect it may cause a leak.
File.open('output.txt', 'a') do |file|
  output.each_line do |item|
    item = item.match(/^[^\s]+/)
    item = item.to_s
    if item = item.match(/[a-zA-Z0-9\-_]+\..+\.[a-zA-Z]+$/)
      item = item.to_s
      if item.length > 1
        #puts "item: #{item}"
        #item = item.to_s
        item = Domainatrix.parse(item)
        puts "subdomain: #{item.subdomain}"
        if (item.domain == domain)
          file.puts item.subdomain
          puts item.subdomain
        end
      end
    end
  end
end
On the other hand, I am using a hash table to store every link.
What do you think might be causing Ruby to use so much memory?
UPDATE
Also, I believe a file opened with File.open should be closed after it is used. Is that true?
First, don't require 'rubygems'; it is not needed in Ruby 1.9.
You forgot the closing ')' after the if condition:
if (item.domain == domain
That would cause a syntax error before the program even runs.
And yes, File.open closes the file automatically when it is called with a block, as it is here.
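To check the File.open behaviour for yourself, here is a minimal sketch. The helper name block_form_closes_file? and the 'example' line are made up for illustration; only File.open and IO#closed? come from Ruby itself.

```ruby
# Hypothetical helper: opens a file in append mode with a block, keeps a
# reference to the handle, and reports whether Ruby closed it afterwards.
def block_form_closes_file?(path)
  handle = nil
  File.open(path, 'a') do |file|
    handle = file          # remember the File object used inside the block
    file.puts 'example'    # placeholder content, not from the question
  end
  handle.closed?           # true once the block has returned
end
```

If you call File.open without a block, you get the File object back and have to call close on it yourself.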
Related
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.priceangels.com/site-map.html"
doc = Nokogiri::HTML(open(url))
doc.css('.lav1').each do |item|
puts item.text
end
doc.css('.masonry-brick').each do |item|
puts item.text
end
This is my first time using Nokogiri. The first each loop behaves as expected. The second each loop fails to find any matches.
Does Nokogiri not recognise class names with dashes (hyphens)?
How do I get Nokogiri to find the '.masonry-brick' classes?
doc.css("ul.sitemap-item a").each do |me|
puts me.text
end
Is this what you were looking for?
Also, for an element such as
<div class="hello world">
you can match on the whole attribute value:
doc.css("div[@class='hello world']")
You can use that if you're having problems with spaces.
EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts "#{textarea}"
create_file.close
end
What am I doing wrong with my loop?
OK, now you're looping over the lines of the input file. When you do that, you get strings that end in a newline. So you're trying to create a file with a newline in its name, which is not legal on Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
You can also use File.foreach instead of File.open(...).each_line; it closes the file for you when the loop finishes:
File.foreach('sources.txt') do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
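Here is a small sketch of the same idea as two helpers, in case it is easier to see in isolation. The names read_source_urls and output_filename are made up, and skipping blank lines is my own assumption about what you would want for a sources file.

```ruby
# Hypothetical helper: returns each line of the file without its trailing
# newline, skipping blank lines (an assumption, not part of the question).
def read_source_urls(path)
  File.foreach(path).map(&:chomp).reject(&:empty?)
end

# Hypothetical helper: builds the output filename for one source URL.
def output_filename(source_url)
  "#{source_url}.txt"
end
```

With the newline stripped, output_filename returns a name you can pass straight to File.open.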
You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')
I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but with more than that it takes too much time and freezes or something.
I have divided the code into pieces here so it is easy to read; the whole script is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do all of this in Nokogiri. I personally find REXML easier to use.
My question: how can I fix it so that it won't take a week to crawl all this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary lib:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
Loop through the @urls array and grab every element node that I want with XPath:
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
    |e| m = e.text
    m = m.to_s
    next if m.empty?
    @titleSv << m
  }
  # ... the same pattern repeats for the other fields ...
end
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
Libraries
You're using several libraries here that probably aren't all necessary.
You use both REXML and Nokogiri. They both do the same job, except that Nokogiri is much faster at it (benchmark).
Use Hashes
Instead of storing data at index in 15 arrays, have one set of hashes.
For instance,
require 'set'
items = Set.new
doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end
items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Collect the data, then write to file
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
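A rough sketch of that last step, assuming each item is a hash keyed by symbols. The COLUMNS list and the items_to_csv name are placeholders I made up; use whatever fields you actually collect.

```ruby
require 'csv'

# Assumed column order; extend it with the rest of your fields.
COLUMNS = [:id, :title_sv, :title_en].freeze

# Hypothetical helper: turns a collection of item hashes into CSV text,
# one row per item. Missing keys become empty fields.
def items_to_csv(items)
  CSV.generate do |csv|
    items.each do |item|
      csv << COLUMNS.map { |key| item[key] }
    end
  end
end
```

You would then write the result once, for example with File.write, instead of juggling one array per column.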
Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
  |e| m = e.text
  m = m.to_s
  next if m.empty?
  @titleSv << m
}
Move what's common to a method
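# Returns the text of the first matching element that is not empty,
# or an empty string if nothing matches.
def get_value(xml, path)
  xml.elements.each(path) do |e|
    str = e.text.to_s
    return str unless str.empty?
  end
  ''
end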
And move anything constant to another hash
xml_paths = {
:title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
:title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
...
}
Now you can combine these techniques to get much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
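Put together, it could look something like the sketch below. I've left out the ns: alternatives to keep it short, and the names XML_PATHS, first_text and build_item are only for illustration.

```ruby
require 'rexml/document'

# Assumed subset of the paths; the namespaced variants are omitted here.
XML_PATHS = {
  :title_sv => '/educationInfo/titles/title[1]',
  :title_en => '/educationInfo/titles/title[2]'
}.freeze

# Hypothetical helper: text of the first non-empty match, or '' if none.
def first_text(xml, path)
  xml.elements.each(path) do |e|
    str = e.text.to_s
    return str unless str.empty?
  end
  ''
end

# Hypothetical helper: builds one item hash by looking up every path.
def build_item(xml)
  XML_PATHS.each_with_object({}) do |(key, path), item|
    item[key] = first_text(xml, path)
  end
end
```

Adding a field is then one more entry in the hash instead of another copied block.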
I hope this helps!
This won't work without some fixes on your side, and I believe you should refactor your parsing code as @Ian Bishop suggested.
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'
class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end
  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end
class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end
  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end
Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you're going to have to go concurrent.
One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
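If you want to see the shape of it without installing any extra gems, here is a sketch using plain Ruby threads and a Queue. This is not the peach or pmap API; parallel_map and its arguments are names I made up. On MRI, threads still help when the work is mostly waiting on the network.

```ruby
require 'thread'

# Hypothetical helper: runs the block for every item on thread_count worker
# threads and returns the results in the original order.
def parallel_map(items, thread_count)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  results = Array.new(items.size)

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          item, index = queue.pop(true)   # non-blocking pop
        rescue ThreadError
          break                           # queue is empty, worker is done
        end
        results[index] = yield(item)      # each worker writes its own slot
      end
    end
  end

  workers.each(&:join)
  results
end
```

Usage would be something like parallel_map(urls, 20) { |url| open(url).read }, with the thread count tuned to what the remote server tolerates.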
I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to stream the file or read it in chunks. Instead it appears to load the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs past 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.
On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter
  class FeedListing
    include ::SAXMachine
    element :id
    element :title
    element :description
    element :url
    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end
  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end
  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load all of the listing elements into this one Ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)
      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big file (1.6gb) because too much memory is used :(
Here's a Reader loop that handles each listing's XML on its own, so you can process each Listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'
# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end
  def end_element?
    node_type == TYPE_END_ELEMENT
  end
  def opens?(name)
    element? && self.name == name
  end
  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end
  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name
    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end
  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to specifically reference the gregwebs version in your Gemfile like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer that is planning on merging my changes.
The new maintainer and I have been using my fork without issue for a year now.
You are right, SAXMachine reads the whole document eagerly. Have a look at its handler sources: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html directly and implement the handler yourself.
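To give an idea of what a hand-written handler looks like, here is a sketch. Note that it uses REXML's stream parser from Ruby's bundled libraries rather than the Nokogiri SAX API linked above, only so that it runs without extra gems; the callbacks have different names in Nokogiri but the structure is the same. ListingListener, FIELDS and each_listing are names I made up, and the field list is taken from the FeedListing class in the question.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the fields of one <listing> at a time and
# hands each finished listing to a block, so nothing else stays in memory.
class ListingListener
  include REXML::StreamListener

  FIELDS = %w[id title description url].freeze

  def initialize(&on_listing)
    @on_listing = on_listing
    @current = nil   # hash for the listing being read, nil outside a listing
    @field = nil     # name of the field whose text is being read
  end

  def tag_start(name, _attrs)
    if name == 'listing'
      @current = {}
    elsif @current && FIELDS.include?(name)
      @field = name
      @current[@field] = String.new
    end
  end

  def text(data)
    @current[@field] << data if @current && @field
  end

  def tag_end(name)
    if name == 'listing'
      @on_listing.call(@current)
      @current = nil
    elsif name == @field
      @field = nil
    end
  end
end

# Hypothetical entry point: io can be an open File or any other IO object.
def each_listing(io, &block)
  REXML::Document.parse_stream(io, ListingListener.new(&block))
end
```

You would call it as each_listing(file) { |attrs| Listing.import(attrs) }, so each record is imported as soon as its closing tag is seen.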
I am writing some Ruby code, not Rails, and I need to handle something like this:
found 1 match
found 2 matches
I have Rails installed so maybe I might be able to add a require clause at the top of the script, but does anyone know of a RUBY method that pluralizes strings? Is there a class I can require that can deal with this if the script isn't Rails but I have Rails installed?
Edit: All of these answers were close but I checked off the one that got it working for me.
Try this method as a helper when writing Ruby, not Rails, code:
require 'active_support/inflector' # provides String#pluralize
def pluralize(number, text)
  return text.pluralize if number != 1
  text
end
Actually all you need to do is
require 'active_support/inflector'
and that will extend the String type.
you can then do
"MyString".pluralize
which will return
"MyStrings"
For 2.3.5, try:
require 'rubygems'
require 'active_support/inflector'
That should get it. If not, try
sudo gem install activesupport
and then the requires.
Inflector is overkill for most situations.
def x(n, singular, plural=nil)
if n == 1
"1 #{singular}"
elsif plural
"#{n} #{plural}"
else
"#{n} #{singular}s"
end
end
Put this in common.rb, or wherever you like your general utility functions and...
require "common"
puts x(0, 'result') # 0 results
puts x(1, 'result') # 1 result
puts x(2, 'result') # 2 results
puts x(0, 'match', 'matches') # 0 matches
puts x(1, 'match', 'matches') # 1 match
puts x(2, 'match', 'matches') # 2 matches
I personally like the linguistics gem, which is definitely not Rails-related.
# from its front page
require 'linguistics'
Linguistics.use :en
"box".en.plural #=> "boxes"
"mouse".en.plural #=> "mice"
# etc
This works for me (using ruby 2.1.1 and actionpack 3.2.17):
~$ irb
>> require 'action_view'
=> true
>> include ActionView::Helpers::TextHelper
=> Object
>> pluralize(1, 'cat')
=> "1 cat"
>> pluralize(2, 'cat')
=> "2 cats"
require 'active_support'
require 'active_support/inflector'
inf = ActiveSupport::Inflector::Inflections.new
That gets you the inflector, though I'm not sure how you use it.
my solution:
# Custom pluralize - will return text without the number as the default pluralize.
def cpluralize(number, text)
return text.pluralize if number != 1
return text.singularize if number == 1
end
So you can have 'review' returned if you call cpluralize(1, 'reviews')
Hope that helps.
I've defined a helper function for that, I use it for every user editable model's index view :
def ovyka_counter(array, name=nil, plural=nil)
name ||= array.first.class.human_name.downcase
pluralize(array.count, name, plural)
end
Then you can call it from the view:
<%= ovyka_counter @posts %>
For internationalization (i18n), you may then add this to your locale YAML files:
activerecord:
models:
post: "Conversation"