I am trying to parse the URL shown in the doc variable below. My issue is with the job variable: when I return it, it returns every job title on the page instead of the specific job title for the given review. Does anyone have advice on how to return the specific job title I'm referring to?
require 'nokogiri'
require 'open-uri'
# Fetch the Glassdoor reviews page
doc = Nokogiri::HTML(open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))
reviews = []
current_review = Hash.new
doc.css('.employerReview').each do |item|
  pro = item.parent.css('p:nth-child(1) .notranslate').text
  con = item.parent.css('p:nth-child(2) .notranslate').text
  job = item.parent.css('.review-microdata-heading .i-occ').text
  puts job
  advice = item.parent.css('p:nth-child(3) .notranslate').text
  current_review = {'pro' => pro, 'con' => con, 'advice' => advice}
  reviews << current_review
end
Looks like item.parent is #MainCol in each case, in other words the entire column, so every selector matches against the whole page. Changing item.parent.css to item.css should solve your problem.
I want to collect the names of users in a particular group, called Nature, on the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'
def getInitUser()
  agent1 = Mechanize.new
  number = 0
  while number <= 500
    address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
    logfile2 = File.new("Fotolog/Users.csv", "a")
    tryCount = 0
    begin
      page = agent1.get(address)
    rescue
      tryCount = tryCount + 1
      if tryCount < 5
        retry
      end
      return
    end
    arrayUsers = []
    # search for the users
    page.search("a[class=img_border_radius").map do |opt|
      link = opt.attributes['href'].text
      link = link.gsub("http://www.fotolog.com/", "").gsub("/", "")
      arrayUsers << link
      logfile2.print("#{link}\n")
    end
    number = number + 100
  end
  return arrayUsers
end
arrayUsers = getInitUser()
arrayUsers.each do |user|
  getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using, but from the inspect element it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
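For what it's worth, two things in the snippet above would stop any rows from being written regardless of which class you target (this reading is based on the code alone, not verified against the live site):

```ruby
number = 0

# 1. Single-quoted strings don't interpolate, and the scheme is doubled,
#    so the literal text 'http://http://...#{number}/' gets requested.
#    With double quotes and a single "http://" the address comes out as intended:
address = "http://www.fotolog.com/nature/participants/#{number}/"

# 2. The attribute selector is missing its closing bracket; it should read:
#    page.search("a[class=img_border_radius]")
```

With those fixed, the silent `return` in the rescue clause is also worth revisiting, since it makes any repeated network error produce an empty result with no message.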
I am trying to retrieve data from GitHub Archive (https://www.githubarchive.org/) and am having trouble retrieving data when I add a range. It works when I use http://data.githubarchive.org/2015-01-01-15.json.gz, but I get a `open_http': 404 Not Found (OpenURI::HTTPError) message when using http://data.githubarchive.org/2015-01-01-{0..23}.json.gz.
Using curl http://data.githubarchive.org/2015-01-01-{0..23}.json.gz seems to work, though.
Basically, my goal is to write a program that retrieves the top 42 most active repositories over a certain time range.
Here's my code; please let me know if I'm using the API incorrectly or if there are issues with the code.
require 'open-uri'
require 'zlib'
require 'yajl'
require 'pry'
require 'date'
events = Hash.new(0)
type = 'PushEvent'
after = '2015-01-01T13:00:00Z'
before = '2015-01-02T03:12:14-03:00'
f_after_time = DateTime.parse(after).strftime('%Y-%m-%d-%H')
f_before_time = DateTime.parse(before).strftime('%Y-%m-%d-%H')
base = 'http://data.githubarchive.org/'
# query = '2015-01-01-15.json.gz'
query = '2015-01-01-{0..23}.json.gz'
url = base + query
uri = URI.encode(url)
gz = open(uri)
js = Zlib::GzipReader.new(gz).read
Yajl::Parser.parse(js) do |event|
  if event['type'] == type
    if event['repository']
      repo_name = event['repository']['url'].gsub('https://github.com/', '')
      events[repo_name] += 1
    elsif event['repo'] # to account for the older API
      repo_name = event['repo']['url'].gsub('https://github.com/', '')
      events[repo_name] += 1
    end
  end
end
# Sort events based on # of events and return top 42 repos
sorted_events = events.sort_by {|_key, value| value}.reverse.first(42)
sorted_events.each { |e| puts "#{e[0]} - #{e[1]} events" }
Curly braces aren't valid in a URL, and URL-encoding them wouldn't help either: curl expands {0..23} on the client side into 24 separate requests, which is why it appears to work there, while open-uri sends the braces literally and the server responds with 404. You'll need to loop over the hours yourself and make one request per file.
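One way around it is to build the 24 hourly URLs yourself and fetch them one at a time. A minimal sketch (the helper name is mine, and the actual fetch is left commented out so the sketch stays self-contained):

```ruby
require 'open-uri'
require 'zlib'

# Build one URL per hour; GitHub Archive names files YYYY-MM-DD-H.json.gz
# with an unpadded hour, matching the working single-file URL above.
def hourly_urls(date, hours)
  hours.map { |h| "http://data.githubarchive.org/#{date}-#{h}.json.gz" }
end

urls = hourly_urls('2015-01-01', 0..23)
# Each file can then be fetched and unzipped individually, e.g.:
# urls.each { |u| js = Zlib::GzipReader.new(open(u)).read; ... }
```

Accumulating the event counts across all 24 parsed files in the same `events` hash then gives you the full range before the final sort.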
So basically I asked a question earlier and got an answer which solved that question, but I now realise I need more help, as I have spent the last few hours trying to fix this but keep getting a "nil:NilClass" error. Basically I need to go through every show listed on this site (so through each letter, and through each page that letter has) and get the following:
1. The shows title (this part I have done)
2. Then either copy the page URL for each show and add "/episodes/" to the end of it, or click the show and then the episodes tab and copy that URL.
This is what I have so far:
require 'mechanize'
shows = Array.new
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
  agent.get letter_link[:href]
  agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
  while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
    agent.get next_page_link[:href]
    agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
  end
end
require 'pp'
pp shows
So an end result would look something like the following:
Title: Game of Thrones
URL: http://www.tv.com/shows/game-of-thrones/episodes/
I have tried everything (even writing it from scratch) but just can't seem to add the extra parts, so I was hoping someone here may be able to help me do so. Thanks.
How about this:
require 'mechanize'
shows = {}
base_uri = "http://www.tv.com/"
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
  agent.get letter_link[:href]
  letter = letter_link.text.upcase
  # use + rather than << so base_uri isn't mutated on every pass
  shows[letter] = agent.page.search('//li[@class="show"]/a').map { |show_link| { show_link.text => base_uri + show_link[:href].to_s + 'episodes/' } }
  while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
    agent.get next_page_link[:href]
    shows[letter] << agent.page.search('//li[@class="show"]/a').map { |show_link| { show_link.text => base_uri + show_link[:href].to_s + 'episodes/' } }
  end
  shows[letter].flatten!
end
puts shows
This will create the following structure Hash[letter] => Array[{ShowName => LinktoEpisodes}]
Example:
{"A"=>[{"A Show"=>"http://www.tv.com/shows/a-show/episodes/"},{"Another Show"=>"http://www.tv.com/shows/another-show/episodes/"},...],"B"=>[B SHOWS....],....}
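The flatten! at the end is what keeps each letter's array one level deep, since every pagination pass pushes a whole array of hashes. A small illustration with made-up shows:

```ruby
shows = {}

# First page of results for a letter: an array of one-entry hashes.
shows['A'] = [{ 'A Show' => 'http://www.tv.com/shows/a-show/episodes/' }]

# A later page's results arrive as another array, so << nests it one level deep...
shows['A'] << [{ 'Another Show' => 'http://www.tv.com/shows/another-show/episodes/' }]

# ...and flatten! restores the flat Array[{ShowName => LinktoEpisodes}] shape.
shows['A'].flatten!
```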
Hope this helps.
I am using Nokogiri to parse html. For the website shown, I am trying to create an array of hashes where each hash will contain the pros, cons, and advice sections for a given review shown on the site. I am having trouble doing this and was hoping for some advice here. When I return a certain element, I don't get the right content shown on the site. Any ideas?
require 'open-uri'
require 'nokogiri'
# Fetch the Glassdoor reviews page
doc = Nokogiri::HTML(open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))
reviews = []
current_review = Hash.new
doc.css('.employerReview').each do |item|
  pro = item.parent.css('p:nth-child(1) .notranslate').text
  con = item.parent.css('p:nth-child(2) .notranslate').text
  advice = item.parent.css('p:nth-child(3) .notranslate').text
  current_review = {'pro' => pro, 'con' => con, 'advice' => advice}
  reviews << current_review
end
Try this instead:
reviews = []
doc.css('.employerReview').each do |item|
  pro, con, advice = item.css('.description .notranslate text()').map(&:to_s)
  reviews << {'pro' => pro, 'con' => con, 'advice' => advice}
end
It's also preferred in Ruby to use symbol keys, so unless you need them to be strings, I'd do:
reviews << { pro: pro, con: con, advice: advice }
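With symbol keys, lookups change accordingly; a tiny self-contained illustration (the sample values are made up):

```ruby
review = { pro: 'Good pay', con: 'Long hours', advice: 'Keep hiring' }

# Symbol keys must be looked up with symbols; a string key finds nothing.
review[:pro]    # => "Good pay"
review['pro']   # => nil
```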
At the moment I'm having a look on Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
  page_links << ll
end
puts page_links.size
This works, but page_links includes not only the search results: it also includes Google's own links (Login, Pictures, ...).
The result links have the style class "l". Is it possible to select only the links with class == "l"? How do I achieve this?
Is it possible to modify the user agent alias? If I own a website that includes Google Analytics or similar, which browser/client will I see in GA when visiting my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
In cases like yours I use Nokogiri DOM search.
Here is your code, rewritten a little:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
# maybe you'd better use 'h3.r > a.l' here
page.parser.css("a.l").each do |ll|
  # page.parser here is a Nokogiri::HTML::Document
  page_links << ll
  puts ll.text + " => " + ll["href"]
end
puts page_links.size
Probably this article is a good place to start:
getting-started-with-nokogiri
By the way samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
  cls = ll.attributes.attributes['class']
  page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element, and ll.attributes.attributes is a Hash containing the attributes of the link; hence the need for ll.attributes.attributes to get at the actual class, and for the nil check before comparing the value to 'l'.
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method, which returns a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
I believe the selector you are looking for is :dom_id, e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')