I'm new to coding in ruby and I am wondering why I get a warning when running the code below.
I checked a few answers to similar questions but can't seem to make it work for me.
Would you know why this is happening and how to fix it?
Thank you so much!
Here is the warning I get in the terminal:
test_Amazon.rb:9: warning: already initialized constant PAGE_URL
test_Amazon.rb:9: warning: previous definition of PAGE_URL was here
Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

for $i in (1..5)
  PAGE_URL = "http://www.amazon.com/Best-Sellers/zgbs/automotive/?pg=#$i"
  page = Nokogiri::HTML(open(PAGE_URL))
  page.css(".zg_itemWrapper").each do |item|
    price = item.at_css(".zg_price .price").text
    asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp
    product_name = item.at_css(".zg_title a")[:href].split("/")[3]
    puts "#{asin} #{price} #{product_name}"
  end
end
Uppercase identifiers are in fact constants in Ruby, and you get this warning whenever you reassign one. To avoid the warning in your example, use a local variable instead of a constant to store the URL:
5.times do |i|
  page_url = "http://www.amazon.com/Best-Sellers/zgbs/automotive/?pg=#{i + 1}"
  page = Nokogiri::HTML(open(page_url))
  page.css(".zg_itemWrapper").each do |item|
    ...
  end
end
Another thing you should avoid is global variables like $i. There is almost never a reason to have a variable that is globally accessible in your whole codebase.
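If you really do want a constant for the part of the URL that never changes, a sketch of that approach is to assign it exactly once, outside the loop, so it is never reassigned (the constant name here is illustrative):

BASE_URL = "http://www.amazon.com/Best-Sellers/zgbs/automotive/?pg="

5.times do |i|
  page_url = BASE_URL + (i + 1).to_s
  page = Nokogiri::HTML(open(page_url))
  # ... same scraping as above ...
end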
I am trying to fetch results from Google and save them to a file, but the results are getting repeated.
Also, when I save them to the file, only the last link gets written.
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    strList = str.split(%r{=|&})
    $url = strList[1].gsub("h%3Fv%3D", "h?v=")
    $heading = link.text
    $res = $url
    if ($url.to_s.include? "webcache")
      next
    elsif ($url.to_s.include? "channel")
      next
    end
    puts $res
  end
end
for link in linky do
  File.open("aaa.htm", 'w') { |file| file.write($res) }
end
I also tried this variant, with the file write inside the loop:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    strList = str.split(%r{=|&})
    $url = strList[1].gsub("h%3Fv%3D", "h?v=")
    $heading = link.text
    $res = $url
    if ($url.to_s.include? "webcache")
      next
    elsif ($url.to_s.include? "channel")
      next
    end
    puts $res
    File.open("aaa.htm", 'w') { |file| file.write($res) }
  end
end
This is really two questions, and it's clear you're just starting out with Ruby. You will get better with practice, but it would help to keep reading up on the fundamentals of the language; this looks a bit like PHP written in Ruby.
First up, the links are quite probably showing up multiple times because they are present more than once in the page. You aren't doing anything to catch that.
Secondly, you have a global variable (these tend to cause problems and should only really be used if you can't find an alternative) which you are putting each URL into, but every time you do that, you overwrite what you had before. So every time you execute $res = $url you are overwriting whatever was in $res with the last $url you got.
If you made an array instead of having the single value $res (it can be a local variable too), then you could just use myArray.push(url) to add each new URL to it.
When you have got all the URLs in your array, you could use myArray.uniq to get rid of the duplicates before you write it out to your file.
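Putting that advice together, a minimal sketch (variable names are illustrative; the regex and filters are taken from the question):

urls = []
for link in linky do
  if link.href.to_s =~ /url.q/
    url = link.href.to_s.split(%r{=|&})[1].gsub("h%3Fv%3D", "h?v=")
    next if url.include?("webcache") || url.include?("channel")
    urls.push(url)
  end
end

# Deduplicate once, then write everything in a single pass.
File.open("aaa.htm", 'w') do |file|
  urls.uniq.each { |u| file.puts(u) }
end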
It looks like you don't really know Ruby yet.
Please do not use global variables unless you really need them; in this case you don't, this is not PHP. Simple assignment is enough. :)
To iterate through a collection, use the dedicated #each method. In your case you'd like to filter the collection of links and keep those that match your needs: valid_links = links.select { |link| ... }.
Return false from the block for links that don't match your needs and true for those that do.
For the file output, iterate through the collection inside the File.open block (you will have valid_links to go through).
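A minimal sketch of that shape (the regex and the webcache/channel filters come from the question):

valid_links = linky.select do |link|
  href = link.href.to_s
  href =~ /url.q/ && !href.include?("webcache") && !href.include?("channel")
end

File.open("aaa.htm", 'w') do |file|
  valid_links.each { |link| file.puts(link.href) }
end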
I want to collect the names of users in a particular group, called Nature, on the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'

def getInitUser()
  agent1 = Mechanize.new
  number = 0
  while number <= 500
    address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
    logfile2 = File.new("Fotolog/Users.csv", "a")
    tryConut = 0
    begin
      page = agent1.get(address)
    rescue
      tryConut = tryConut + 1
      if tryConut < 5
        retry
      end
      return
    end
    arrayUsers = []
    # search for the users
    page.search("a[class=img_border_radius").map do |opt|
      link = opt.attributes['href'].text
      link = link.gsub("http://www.fotolog.com/", "").gsub("/", "")
      arrayUsers << link
      logfile2.print("#{link}\n")
    end
    number = number + 100
  end
  return arrayUsers
end

arrayUsers = getInitUser()
arrayUsers.each do |user|
  getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using. But from the inspect element, it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
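For reference, a hedged sketch of fixes for problems visible in the snippet itself, independent of the class you suspect:

# This sketch is hypothetical; it only addresses what is visible above.
# 1. The URL's scheme is doubled ('http://http://...').
# 2. Single-quoted strings do not interpolate, so #{number} is sent literally;
#    double quotes are needed.
# 3. The attribute selector is missing its closing bracket.
address = "http://www.fotolog.com/nature/participants/#{number}/"
page.search('a[class="img_border_radius"]').each do |opt|
  # ... same link extraction as above ...
end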
I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[#id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
require 'mechanize'
require 'nokogiri'

m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
# Here you can do whatever is needed with the divs;
# this maps their content into an array.
divs = html.xpath('//div').map { |div| div.content }
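As a side note (assuming a reasonably recent Mechanize), the page Mechanize returns is already parsed with Nokogiri under the hood, so the same document is reachable without re-parsing:

divs = result.parser.xpath('//div').map { |div| div.content }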
There are two things wrong:
1. The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
2. The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.
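For example, a sketch of that chain (the XPath is the one from the question, so it will only match if the ID actually exists on the page):

require 'mechanize'

agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
# xpath is delegated to the page's Nokogiri parser.
puts agent.page.xpath("//*[@id='item_52b3985a70d58']/div[4]")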
I am using Ruby 1.9.3p0. The program that I wrote uses a lot of memory when I run it for more than 4 hours. I am using the following gems:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cgi'
require 'domainatrix'
The following code is run more than 10,000 times, and I suspect it may cause a leak.
File.open('output.txt', 'a') do |file|
  output.each_line do |item|
    item = item.match(/^[^\s]+/)
    item = item.to_s
    if item = item.match(/[a-zA-Z0-9\-_]+\..+\.[a-zA-Z]+$/)
      item = item.to_s
      if item.length > 1
        #puts "item: #{item}"
        #item = item.to_s
        item = Domainatrix.parse(item)
        puts "subdomain: #{item.subdomain}"
        if (item.domain == domain)
          file.puts item.subdomain
          puts item.subdomain
        end
      end
    end
  end
end
On the other hand, I am using a hash table to store every link.
What do you think may cause the Ruby to use a lot of memory?
UPDATE
Also, I believe the file opened with File.open should be closed after it is used. Is that true?
First, don't require 'rubygems'; it's not needed in Ruby 1.9.
You forgot the closing ')' after the if condition:
if (item.domain == domain
A missing parenthesis like that causes a syntax error before the program even runs.
And yes, File.open with a block closes the file automatically when the block finishes.
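To illustrate the File.open point:

# Block form: the file handle is closed automatically,
# even if the block raises an exception.
File.open('output.txt', 'a') do |file|
  file.puts 'a line'
end # closed here

# Non-block form: you are responsible for closing it yourself.
file = File.open('output.txt', 'a')
file.puts 'a line'
file.close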
I've built a script which votes for me on a website...
The Ruby script works quite well, but after a few minutes it stops with these errors: link of the screen-shot
So I inspected the Windows Task Manager, and the memory allocated to ruby.exe grows after each loop!
Here is the incriminating piece of code:
class VoteWebsite
  def self.main
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    while $stop != true
      page = agent.get 'http://website.com/vote.php'
      reports_divs = page.search(".//div[@class='Pad1Color2']")
      tds = reports_divs.search("td")
      i = 3; j = 0; ouiboucle = 0; voteboucle = 0
      while i < tds.length
        result = tds[i].to_s.scan(/<font class="ColorRed"><b>(.*?) :<\/b><\/font>/)
        type = result[0].to_s[2..-3]
        k = i
        case type
        when "Type of vote"
          j = i + 1; i += 4
          result2 = tds[j].to_s.scan(/<div id="btn(.*?)">/)
          id = result2[0].to_s[2..-3]
          monvote = define_vote($vote_type, tds[k].to_s, $vote_auto)
          page2 = agent.get 'http://website.com/AJAX_Vote.php?id=' + id + '&vote=' + monvote
          voteboucle += 1
          .
          .
          .
        else
          .
          .
          .
        end
      end
    end
  end
end

VoteWebsite.main
I think that declaring all the variables inside the method as global variables might fix this problem, but the code is quite big and there are plenty of variables inside this method.
So is there any way (any Ruby instruction) to clear all these variables at the end of each loop?
The problem came, in fact, from Mechanize's history: see this answer, or the Mechanize::History#clear method, or even just set the Mechanize::History#max_size attribute to a reasonable value.
#!/usr/bin/env ruby
require 'mechanize'

class GetContent
  def initialize(url)
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    agent.history.max_size = 0
    while true
      page = agent.get url
    end
  end
end
myPage = GetContent.new('http://www.nypost.com/')
Hope it helps!
You can always force the garbage collector to kick in:
GC.start
As a note, this doesn't look very Ruby. Packing multiple statements onto one line using ; is bad form, and using $-type variables is probably a relic of it being ported from something else.
Remember that $-prefixed variables are global variables in Ruby; they can cause tons of problems if used carelessly and should be reserved for very specific circumstances. The best alternative is an @-prefixed instance variable, or, if you must, a declared CONSTANT.
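For instance, a sketch of the same structure with instance variables instead of globals (variable names taken from the question; the values passed to new are placeholders):

require 'mechanize'

class VoteWebsite
  def initialize(vote_type, vote_auto)
    @vote_type = vote_type # instead of $vote_type
    @vote_auto = vote_auto # instead of $vote_auto
    @stop = false          # instead of $stop
  end

  def main
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    until @stop
      # ... voting loop as above, reading @vote_type / @vote_auto ...
      @stop = true # placeholder so this sketch terminates
    end
  end
end

VoteWebsite.new(:up, true).main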