I am trying to scrape the website https://www.bananatic.com/es/forum/games/
and extract the tags "name", "views" and "replies".
My big problem is getting the non-empty content of the "name" tag. Can you help me?
I need to save only the elements that actually have text.
This is my code. I have three variables:
per saves what is inside the replies.
pir saves what is inside the views.
res saves what is inside the names.
Each array should contain only elements that actually have some text, but in the names array empty strings like [" "] are being saved, and I want to keep those out.
require 'nokogiri'
require 'open-uri'
require 'pp'
require 'csv'

unless File.readable?('data.html')
  url = 'https://www.bananatic.com/de/forum/games/'
  data = URI.open(url).read
  File.open('data.html', 'wb') { |f| f << data }
end

data = File.read('data.html')
document = Nokogiri::HTML(data)

per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
              .map { |node| node.to_s[/\d+/] }
p per

pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
              .map { |node| node.to_s[/\w+/] }
p pir

links2 = document.css('.topics ul li div')
res = links2.map do |lk|
  name = lk.css('.name p a').inner_text
  [name]
end
p res
To fix it I added a regular expression, but the attempt failed.
I just replaced .inner_text with .to_s[/\w+/], but it doesn't work.
Now I have an array with nil values and some stray letters "a" that I don't know where they come from.
These might help: XPath and CSS.
For your CSS selectors, check this out: https://kittygiraudel.github.io/selectors-explained/
The following will get you what you are looking for:
document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map { |node| node.to_s.strip }
If you want to understand where your array is coming from, take one step back and just print out lk.css('.name p a').to_s. But the real issue is that your selectors are just off.
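For example, a quick debugging sketch (assuming the data.html cache from your script is present) that prints what each selector actually matches:
require 'nokogiri'

document = Nokogiri::HTML(File.read('data.html'))

# Print the raw HTML each candidate node yields before mapping it.
document.css('.topics ul li div').each_with_index do |lk, i|
  puts "#{i}: #{lk.css('.name p a').to_s.inspect}"
end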
All that being said, looking at the structure of the page, you would be better off with something like this:
require 'nokogiri'
require 'open-uri'

url = "https://www.bananatic.com/de/forum/games/"
doc = Nokogiri::HTML(URI.open(url))

# Set a root node set to start from
topics = doc.xpath('//div[@class="topics"]/ul/li')

# Loop over the set
details = topics.filter_map do |topic|
  next unless topic.at_xpath('.//div[@class="name"]') # skip ones without the needed info

  # Map details into a Hash
  { name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
    post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
    replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip,
    views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip }
end
The result of details would be:
[{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
{:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
{:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
{:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
{:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
{:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
{:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
{:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
{:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
{:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
{:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
{:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
{:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
{:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
{:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
{:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
{:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
{:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
{:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
Related
I'm trying to find a way to pull content directly below a header tag and group it into an array based on the header text.
I think I found a solution that is VERY similar to this, but it won't work, and I'm wondering if that's because the website I'm scraping does not have the 'li' elements grouped into 'ul' tags.
My code:
require 'nokogiri'
require 'open-uri'

BASE_URL = "https://www.hornellanimalshelter.org/donate.html"
doc = Nokogiri::HTML(open(BASE_URL))

cats = doc.search('.box-09_cnt h4')
cats_and_items = cats.map { |cat|
  items = cat.next_element.search('li')
  {name: cat.text, items: items.map(&:text)}
}
=> [{:name=>"Toys & Enrichment", :items=>[]},
    {:name=>"Office Supplies", :items=>[]},
    {:name=>"Cleaning Supplies", :items=>[]},
    {:name=>"Food & Treats", :items=>[]},
    {:name=>"Kennel Care", :items=>[]},
    {:name=>"& More!", :items=>[]}]
As you can see above, it won't pull any of the li elements, but it seems to work fine with something simple like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map { |state|
  cities = state.next_element.search('li')
  [state.text, cities.map(&:text)]
}.to_h
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
Any thoughts? Much appreciated in advance!
Something like this maybe (untested):
data = doc.search('h4').map do |h4|
  [h4.text, h4.search('+ ul li').map(&:text)]
end
and then to get a hash:
h = Hash[data]
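If the + combinator gives you trouble in a node-relative search, here is an alternative sketch using XPath's following-sibling axis, shown against the inline HTML from the question rather than the live page:
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT

# For each header, collect the items of the first <ul> sibling that follows it.
data = doc.search('h4').map do |h4|
  [h4.text, h4.xpath('following-sibling::ul[1]//li').map(&:text)]
end

p data.to_h
# => {"Alabama"=>["auburn", "birmingham"], "Alaska"=>["anchorage / mat-su", "fairbanks"]}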
When I try to run this via the terminal I can parse and display the data, but when I type in pets_array = [] I am not seeing anything.
My code is as follows:
require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end

CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
To be more precise, you could access each anchor tag with class .result-title.hdrlnk inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can use the pets_array variable to store the text from each iterated element; there is no need to push.
If you want to write the data stored in the array, you can write it directly; there is no need to redefine it as an empty array first (which is the reason you get a blank csv file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }
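Note that csv << pets_array writes all the titles into a single CSV row. If you want one title per row instead (an assumption about the desired output), a sketch:
CSV.open('pets.csv', 'w') do |csv|
  pets_array.each { |title| csv << [title] }
end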
I have a working program that searches Google using Mechanize. However, when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls_to_log = str_list[1]
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had an issue with success('log'); I don't know why, so I commented it out.
str_list = str.split(%r{=|&})
urls_to_log = str_list[1]
next if urls_to_log.split('/')[2] == "webcache.googleusercontent.com"
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
There are well-tested wheels for tearing URLs apart into their component parts, so use them. Ruby comes with URI, which allows us to easily extract the host, path or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
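For example, a sketch that filters the sites.txt file from the question (sites_filtered.txt is a hypothetical output name):
require 'uri'

urls = File.readlines("#{Dir.pwd}/temp/sites.txt", chomp: true)

# Keep only URLs whose host is not under googleusercontent.com.
kept = urls.reject { |url| URI.parse(url).host.to_s.end_with?('googleusercontent.com') }

File.write("#{Dir.pwd}/temp/sites_filtered.txt", kept.join("\n"))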
I might be crazy, but I have been trying to gather all my favorite news sites and scrape them in one Ruby file. I would like to use these sites to scrape headlines and hopefully create a custom page for my site. So far I have been able to scrape the headlines from all three sites individually. I am looking to put all three URLs into an array and use Nokogiri just once. Can anyone help me?
require 'nokogiri'
require 'open-uri'

url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))
@feeds = data.css('.post')
@feeds.each do |feed|
  puts feed.css('.headline').text.strip
end

url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))
@modmyi = data2.css('.title')
@modmyi.each do |mmi|
  puts mmi.css('span').text
end

url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))
@cnn = data3.css('.cd__content')
@cnn.each do |cn|
  puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into a class of its own:
require 'nokogiri'
require 'open-uri'

class TitleLoader < Struct.new(:url, :outer_css, :inner_css)
  def titles
    load_posts.map { |post| extract_title(post) }
  end

  private

  def read_document
    Nokogiri::HTML(open(url))
  end

  def load_posts
    read_document.css(outer_css)
  end

  def extract_title(post)
    post.css(inner_css).text.strip
  end
end
And then use that class like this:
urls = [
  ['http://www.engadget.com', '.post', '.headline'],
  ['http://www.modmyi.com', '.title', 'span'],
  ['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]

urls.map { |args| TitleLoader.new(*args).titles }.flatten
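If you also want to know which headlines came from which site (an assumption about the custom-page goal), a sketch that builds a Hash keyed by URL:
# Map each source URL to its list of headlines.
by_site = urls.map { |url, outer, inner|
  [url, TitleLoader.new(url, outer, inner).titles]
}.to_h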
What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have the exact same content, "(hiring)". Then I will write a list of links to a file, 'jobs.html'. (If it is a violation to publish these websites I will quickly take down the direct URL. I thought it might be useful as a reference to what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
  <ol>
    <li>#waywire</li>
    <li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
    <li>Adafruit Industries</li>
    <li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
    etc...
What I have tried:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.each { |link| puts link['href'] }
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end

find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file but it is in XML format? Not sure how to target just the value and create a link from that. Not sure where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
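For example, a sketch (continuing from the code above) that extracts each matched link's href and writes them out like the jobs.html list:
# Pull the href out of each Mechanize link object and write one per line.
hrefs = links.map(&:href)
File.write('./jobs.html', hrefs.join("\n"))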