Ruby/Mechanize: any way to drain RAM? -> failed to allocate memory

I've built a script which votes for me on a website...
The Ruby script works quite well, but after a few minutes it stops with this error: link to the screenshot.
So I inspected the Windows Task Manager: the memory allocated to ruby.exe grows after each loop!
Here is the offending piece of code:
class VoteWebsite
  def self.main
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    while $stop != true
      page = agent.get 'http://website.com/vote.php'
      reports_divs = page.search(".//div[@class='Pad1Color2']")
      tds = reports_divs.search("td")
      i = 3; j = 0; ouiboucle = 0; voteboucle = 0
      while i < tds.length
        result = tds[i].to_s.scan(/<font class="ColorRed"><b>(.*?) :<\/b><\/font>/)
        type = result[0].to_s[2..-3]
        k = i
        case type
        when "Type of vote"
          j = i + 1; i += 4
          result2 = tds[j].to_s.scan(/<div id="btn(.*?)">/)
          id = result2[0].to_s[2..-3]
          monvote = define_vote($vote_type, tds[k].to_s, $vote_auto)
          page2 = agent.get 'http://website.com/AJAX_Vote.php?id=' + id + '&vote=' + monvote
          voteboucle += 1
          # ...
        else
          # ...
        end
      end
    end
  end
end

VoteWebsite.main
I thought that turning all the variables inside this method into global variables might fix the problem, but the code is quite big and there are plenty of variables inside this method.
So is there any way (any Ruby instruction) to free all these variables at the end of each loop?

The problem came, in fact, from Mechanize's history: see this answer, use the Mechanize::History#clear method, or even just set the Mechanize::History#max_size attribute to a reasonable value.
#!/usr/bin/env ruby
require 'mechanize'

class GetContent
  def initialize(url)
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    agent.history.max_size = 0
    while true
      page = agent.get url
    end
  end
end

myPage = GetContent.new('http://www.nypost.com/')
Hope it helps!
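If you would rather keep some history around, clearing it periodically also works; a minimal sketch using the loop from the question (Mechanize::History is array-like, so clear empties it):

while $stop != true
  page = agent.get 'http://website.com/vote.php'
  # ... process the page and vote ...
  agent.history.clear # drop the cached pages so memory stays flat
end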

You can always force the garbage collector to kick in:
GC.start
As a note, this doesn't look very Ruby. Packing multiple statements onto one line using ; is bad form, and using $-prefixed variables is probably a relic of the code being ported from something else.
Remember that $-prefixed variables are global variables in Ruby; they can cause tons of problems if used carelessly, so they should be reserved for very specific circumstances. The best alternative is an @-prefixed instance variable, or, if you must, a declared CONSTANT.
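For illustration (the names here are made up):

$vote_type = :up                            # global: visible and mutable from anywhere
class VoteWebsite
  VOTE_URL = 'http://website.com/vote.php'  # constant: fixed configuration, warns if reassigned
  def initialize(vote_type)
    @vote_type = vote_type                  # instance variable: scoped to one object
  end
end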

Related

Why is this ruby code returning a blank page instead of filling it up with user names?

I want to collect the names of users in a particular group, called Nature, on the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'

def getInitUser()
  agent1 = Mechanize.new
  number = 0
  while number <= 500
    address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
    logfile2 = File.new("Fotolog/Users.csv", "a")
    tryConut = 0
    begin
      page = agent1.get(address)
    rescue
      tryConut = tryConut + 1
      if tryConut < 5
        retry
      end
      return
    end
    arrayUsers = []
    # search for the users
    page.search("a[class=img_border_radius").map do |opt|
      link = opt.attributes['href'].text
      link = link.gsub("http://www.fotolog.com/", "").gsub("/", "")
      arrayUsers << link
      logfile2.print("#{link}\n")
    end
    number = number + 100
  end
  return arrayUsers
end

arrayUsers = getInitUser()
arrayUsers.each do |user|
  getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using, but from Inspect Element it seems to be the correct class, doesn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
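Two things in that snippet would explain the empty file (a guess from reading the code, not from running it against Fotolog): the address string uses single quotes, so #{number} is never interpolated and the scheme is doubled ("http://http://"); also, the CSS selector is missing its closing bracket. A corrected sketch of those lines:

address = "http://www.fotolog.com/nature/participants/#{number}/" # double quotes so #{number} interpolates
page.search("a[class=img_border_radius]").each do |opt|           # note the closing "]"
  # ...
end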

How should I use a recursive method in Ruby

I wrote a simple web crawler using Mechanize; now I'm stuck on how to get the next page recursively. Below is the code.
def self.generate_page # generate a Mechanize page object, the first page
  agent = Mechanize.new
  url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
  page = agent.get(url)
  page
end

def self.next_page(n_page) # get the next page recursively by clicking the "next" link shown on each page
  puts n_page
  # if I don't use puts I get nothing; when using puts, I get
  #<Mechanize::Page:0x007fd341c70fd0>
  #<Mechanize::Page:0x007fd342f2ce08>
  #<Mechanize::Page:0x007fd341d0cf70>
  #<Mechanize::Page:0x007fd3424ff5c0>
  #<Mechanize::Page:0x007fd341e1f660>
  #<Mechanize::Page:0x007fd3425ec618>
  #<Mechanize::Page:0x007fd3433f3e28>
  #<Mechanize::Page:0x007fd3433a2410>
  #<Mechanize::Page:0x007fd342446ca0>
  #<Mechanize::Page:0x007fd343462490>
  #<Mechanize::Page:0x007fd341c2fe18>
  #<Mechanize::Page:0x007fd342d18040>
  #<Mechanize::Page:0x007fd3432c76a8>
  # which are the results I want
  np = Mechanize.new.click(n_page.link_with(:text => /next/)) unless n_page.link_with(:text => /next/).nil?
  result = next_page(np) unless np.nil?
  result # here the value is empty, I don't know what is wrong
end

def self.get_page # trying to pass the result of the next_page() method
  puts next_page(generate_page)
  # it seems the result is never passed here
end
I followed these two links, What is recursion and how does it work?
and Ruby recursive function,
but still can't figure out what's wrong. Hope someone can help me out. Thanks!
There are a few issues with your code:
You shouldn't be calling Mechanize.new more than once.
From a stylistic perspective, you are doing too many nil checks.
Unless you have a preference for recursion, it'll probably be easier to do it iteratively.
To have your next_page method return an array containing every linked page in the chain, you could write this:
# store the Mechanize agent in a constant so it is created only once
AGENT = Mechanize.new

# a helper method to DRY up the code
def click_to_next_page(page)
  link = page.link_with(:text => /next/)
  link && AGENT.click(link) # returns nil when there is no next link
end

# repeatedly visits the next page until none exists;
# returns all seen pages as an array
def get_all_next_pages(n_page)
  results = []
  np = click_to_next_page(n_page)
  np && results.push(np)
  while np
    np = click_to_next_page(np)
    np && results.push(np)
  end
  results
end

# testing it out (I'm not actually running this)
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = AGENT.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages
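If you do prefer recursion, the same traversal can be written by accumulating pages in the return value; a minimal sketch, reusing the click_to_next_page helper above:

def next_pages(page)
  np = click_to_next_page(page)
  np ? [np] + next_pages(np) : [] # stop when there is no next link
end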

warning: already initialized

I'm new to coding in Ruby and I am wondering why I get a warning when running the code below.
I checked a few answers to similar questions but can't seem to make them work for me.
Would you know why this is happening and how to fix it?
Thank you so much!
Here is the warning I get in the terminal:
test_Amazon.rb:9: warning: already initialized constant PAGE_URL
test_Amazon.rb:9: warning: previous definition of PAGE_URL was here
Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

for $i in (1..5)
  PAGE_URL = "http://www.amazon.com/Best-Sellers/zgbs/automotive/?pg=#$i"
  page = Nokogiri::HTML(open(PAGE_URL))
  page.css(".zg_itemWrapper").each do |item|
    price = item.at_css(".zg_price .price").text
    asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp
    product_name = item.at_css(".zg_title a")[:href].split("/")[3]
    puts "#{asin} #{price} #{product_name}"
  end
end
Uppercase identifiers are in fact constants in Ruby. You get this warning whenever you reassign a constant. To avoid the warning in your example, use a local variable instead of a constant to store the URL:
5.times do |i|
  page_url = "http://www.amazon.com/Best-Sellers/zgbs/automotive/?pg=#{i + 1}"
  page = Nokogiri::HTML(open(page_url))
  page.css(".zg_itemWrapper").each do |item|
    # ...
  end
end
Another thing you should avoid is global variables like $i. There is almost never a good reason to have a variable that is globally accessible across your whole codebase.
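The warning itself is easy to reproduce in isolation:

X = 1
X = 2  # warning: already initialized constant X
       # warning: previous definition of X was here
puts X # => 2 (the reassignment still happens; it is a warning, not an error)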

Ruby EOFError with open-uri and loop

I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
  url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'

class WebCrawler
  def self.Spider(root)
    eNDCHARS = %{.,'?!:;}
    num_documents = 0
    token_list = []
    url_repository = Hash.new
    url_frontier = Queue.new

    url_frontier.push(root.to_s)
    while !url_frontier.empty? && num_documents < 10
      url = url_frontier.pop
      if !url_repository.has_key?(url)
        document = open(url)
        html = document.read
        # extract URLs
        links = URI.extract(html, ['http']).collect { |u| eNDCHARS.index(u[-1]) ? u.chop : u }
        links.each do |link|
          url_frontier.push(link)
        end
        # tokenize
        Tokenizer.tokenize(document).each do |word|
          token_list.push(IndexStructures::Term.new(word, url))
        end
        # add to the repository
        url_repository[url] = true
        num_documents += 1
      end
    end
    # sort by term (primary) and document id (secondary) in reverse
    # to aid in the construction of the inverted index
    return num_documents, token_list.sort_by! { |term| [term.term, term.document_id] }.reverse!
  end
end
I encountered the same error, but with Watir-webdriver running Firefox in headless mode. What I found out was that if I ran two of my applications in parallel and destroyed "headless" in one of them, it automatically killed the other one as well, with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to prematurely closing the file handle externally while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.
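One more note: URI.extract pulls every http-looking string out of the raw HTML, including ones buried in JavaScript or truncated mid-URL, so some of the queued fetches are likely to die mid-read. A sketch of rescuing around each fetch to keep the crawl alive (this works around the symptom rather than diagnosing the root cause):

begin
  document = open(url)
  html = document.read
rescue EOFError, OpenURI::HTTPError => e
  warn "skipping #{url}: #{e.class}"
  next # move on to the next URL in the frontier
end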

ruby: multiple identical or synced instances of mechanize?

As far as I know from what I've read elsewhere, Ruby Mechanize is not thread-safe. Thus, to accelerate some 'get's, I opted to instantiate several independent Mechanize objects and use them in parallel. This seems to work OK.
By the way, I would like to make all the instances as similar as possible, to the point of sharing 'everything' they could know (cookies, etc.).
Is there any way to make deep copies of an already 'configured' Mechanize object? My aim is to configure only one of them and make clones of it.
For instance, if I can create a Mechanize object like this (only an example, but suppose there are a lot more configured attributes):
agent = Mechanize.new { |a| a.read_timeout = 20; a.max_history = 1 }
How can I get copies of it that don't interfere with each other while 'get'ing?
agent2 = agent.dup                         # the copies are not thread-safe
agent2 = Marshal.load(Marshal.dump(agent)) # throws an error
This appears to work until you change a value for max_history or read_timeout:
class Mechanize
  def clone
    Mechanize.new do |a|
      a.cookie_jar   = cookie_jar
      a.max_history  = max_history
      a.read_timeout = read_timeout
    end
  end
end
Testing:
agent1 = Mechanize.new { |a| a.max_history = 30; a.read_timeout = 30 }
agent2 = agent1.clone
agent2.max_history == 30               # => true
agent2.cookie_jar == agent1.cookie_jar # => true
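Note that a.cookie_jar = cookie_jar copies a reference, so all the clones share one jar, which matches the goal of sharing cookies. A minimal sketch of driving the clones in parallel (assuming the clone patch above; the URL is a placeholder):

agents = 4.times.map { agent1.clone }
threads = agents.map do |ag|
  Thread.new { ag.get('http://example.com/') } # each thread drives its own agent
end
threads.each(&:join)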
