Given a subreddit like /r/pics, how can I scrape all the images in Ruby?
I looked through Reddit's API, but there doesn't seem to be anything for this, yet a site like "redditery" is already doing it - http://www.redditery.com/r/aww
Check out Nokogiri; it will be able to perform this task.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://www.reddit.com/r/aww"))
doc.css('div#siteTable').css('a').each {|x| puts x['href']}
That should output links to the images (this code isn't tested, but it should be pretty close).
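If you only want direct image links rather than every anchor in the listing, a minimal sketch (untested; the file-extension filter is just a rough heuristic) might look like:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://www.reddit.com/r/aww"))

# Collect every link in the listing, then keep only those that point
# straight at an image file (rough heuristic based on the extension).
links = doc.css('div#siteTable a').map { |a| a['href'] }.compact
image_links = links.select { |href| href =~ /\.(jpe?g|png|gif)\z/i }

puts image_links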
Thank y'all within the community and the moderators for being so cool and willing to help so quickly! Just wanted to lead in with that. I could really use some help with this basic automation script that I am running. I am trying to select the search bar on Google.com and enter some text. I have gotten some help from friends, but they were stuck as well. In learning this, I was hoping to get some help from the experts and just ask these questions that I have, because Google ain't got shit!
1) How to select the search field and enter text.
Mine looks something like this. I've tried XPath, different values, ids, and classes.
require 'ruby'
require 'watir-webdriver'
browser = Browser::browser.new :firefox
browser.goto 'http://google.com'
browser.text_field(:value => 'Search').set('google search')
2) When I inspect the element and find unique characteristics to that value (i.e. href, id, title, class, name), which are the ones that I can actually utilize to call either the button, text_field, or link?
3) I understand html and css pretty well. Can someone please explain how to properly utilize xpath?
Y'all rock, I feel like there are tons of people out there who have these same questions as I do, and can't find the damn answers anywhere, so I ask all of you automation experts, would you mind dropping a knowledge bomb and learnin us?
There are a number of errors in this script. Here's a working version:
# require 'ruby' # don't require ruby
require 'watir-webdriver' # corrected typo in gem name
browser = Watir::Browser.new :firefox # corrected Browser::browser.new
browser.goto 'http://google.com'
browser.text_field(:title => 'Search').set('google search') # changed :value to :title
In terms of identifying page elements, it's generally considered good practice to use the id attribute since it's unique to the page. You can use the attributes that you've listed, but they have to exist as attributes for the given HTML element. AFAIK, watir-webdriver supports using the majority of standard HTML tag attributes for location and can also locate elements based on their index, via regular expression, and by combining multiple locators. For example, you could substitute any of these in the script above:
browser.text_field(:title => /Search/).set('google search')
browser.text_field(:class => 'gsfi').set('google search')
browser.text_field(:id => 'lst-ib').set('google search')
browser.text_field(:name => 'q', :class => 'gsfi').set('google search')
If you haven't already, I'd suggest checking out http://watirwebdriver.com/ and https://github.com/watir/watir/wiki. And if you're curious about using xpath to find tricky elements, check out
https://jkotests.wordpress.com/2012/08/28/locate-element-via-custom-attribute-css-and-xpath/.
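As for question 3, watir-webdriver also accepts an :xpath locator directly. Here's a minimal sketch (assuming the search box still carries title="Search"; swap in whatever attribute you see in the inspector):
require 'watir-webdriver'

browser = Watir::Browser.new :firefox
browser.goto 'http://google.com'

# //input[@title='Search'] reads as: any <input> element, anywhere in the
# document, whose title attribute equals 'Search'.
browser.text_field(:xpath => "//input[@title='Search']").set('google search')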
I need to pull out the difference between two text files on the internet. I tried looking at existing answers, and all of them point me to Nokogiri.
Any solution on how to pull out the differences using Nokogiri in Ruby? Or is there a better way to do this?
You can use the diff-lcs gem.
require 'diff/lcs'
require 'open-uri'
text1 = URI.parse('http://www.example.org/text1.txt').read
text2 = URI.parse('http://www.example.org/text2.txt').read
diff = Diff::LCS.diff(text1, text2)
Unfortunately, as you declined to provide an example of output even after several people asked you about it, I can't say much more than this.
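That said, if the files are ordinary line-oriented text, here's a minimal sketch (untested; the example.org URLs are placeholders) that reports line-level changes:
require 'diff/lcs'
require 'open-uri'

lines1 = URI.parse('http://www.example.org/text1.txt').read.lines
lines2 = URI.parse('http://www.example.org/text2.txt').read.lines

# Diffing arrays of lines (rather than whole strings) reports changes per line.
Diff::LCS.diff(lines1, lines2).each do |hunk|
  hunk.each do |change|
    # change.action is '-' for a removed line and '+' for an added one
    puts "#{change.action} #{change.element}"
  end
end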
The site I want to index is fairly big, 1.x million pages. I really just want a json file of all the URLs so I can run some operations on them (sorting, grouping, etc).
The basic Anemone loop worked well:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
But (because of the site size?) the terminal froze after a while. Therefore, I installed MongoDB and used the following:
require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

$stdout = File.new('sitemap.json','w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end
It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning - I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me at least). Can anyone who's done this before give me some tips?
If anyone out there needs <= 100,000 URLs, the Ruby Gem Spidr is a great way to go.
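For reference, a minimal Spidr sketch (untested, with example.com standing in for your site) that dumps every visited URL to a JSON file:
require 'spidr'
require 'json'

urls = []

Spidr.site('http://www.example.com/') do |spider|
  # every_url yields each unique URL the crawler encounters
  spider.every_url do |url|
    urls << url.to_s
  end
end

File.write('sitemap.json', JSON.pretty_generate(urls))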
This is probably not the answer you wanted to see, but I highly advise that you don't use Anemone, and perhaps not Ruby for that matter, for crawling a million pages.
Anemone is not a maintained library and fails on many edge cases.
Ruby is not the fastest language and uses a global interpreter lock, which means that you can't have true threading capabilities. I think your crawling will probably be too slow. For more information about threading, I suggest you check out the following links.
http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/
Does ruby have real multithreading?
You can try using Anemone with Rubinius or JRuby, which are much faster, but I'm not sure about the extent of their compatibility.
I had some mild success going from Anemone to Nutch but your mileage may vary.
I want to extract all URLs from a folder using Ruby, but I have no idea how to do this. Please, someone help me. I have spent a lot of time on Google but I could not find any suggestions.
Thx
Ruby's URI class can scan a document and return all URLs; look at the extract method.
Wrap that in a loop that scans your directory using Dir::glob or Dir::entries and reads each file using File.read.
If you want, you can write a quick parser-based scanner using Nokogiri, but it's probably going to have the same results. URI's method is easier.
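A minimal sketch of that approach (the glob pattern and folder path are placeholders to adjust):
require 'uri'

# Read every file in the folder and pull out anything that looks like an
# http(s) URL. URI.extract does the actual scanning.
urls = Dir.glob('/path/to/folder/*').flat_map do |file|
  next [] unless File.file?(file)
  URI.extract(File.read(file), ['http', 'https'])
end

puts urls.uniq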
You can use Nokogiri to parse and search HTML documents.
> require 'nokogiri'
> require 'open-uri'
> doc = Nokogiri::HTML(open("http://www.example.com"))
> doc.css("a").map{|node| node.attr("href")}
=> ["http://www.iana.org/domains/special"]
Is there any gem in Ruby to generate a summary of a URL, similar to what Facebook does when you post a link?
None that I'm aware of, but it shouldn't be too hard to roll your own. In the simplest case, you can just require 'open-uri' and then use the open method to retrieve the contents of the site, or go for one of the HTTP libraries.
Once you've got the document, all you have to do is use something like Nokogiri or Hpricot to get the title, the first paragraph of text, and an image, and you're done.
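Here's a rough sketch of that idea (untested; selectors this naive will need tuning per site):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.example.com'))

# Grab the pieces a link preview usually shows: title, first paragraph,
# and the first image on the page (if any).
title     = doc.at('title') ? doc.at('title').text : nil
paragraph = doc.at('p') ? doc.at('p').text : nil
image     = doc.at('img') ? doc.at('img')['src'] : nil

puts title
puts paragraph
puts image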
Generating a thumbnail isn't a straightforward task. The page has to be rendered, the window captured, shrunk down, then stored or returned. While it would be possible for a gem to do it, there would be significant overhead.
There are websites that can create the thumbnails, then you can reference the image:
Websnapr
Webthumb
ShrinkTheWeb
iWEBTOOL
I haven't tried them, but there's a good page discussing the first two on The Accidental Technologist.
If you need some text from the page, it's simple to grab some, but making it sensible is a different problem:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.example.com'))
page_text = doc.text
print page_text.gsub(/\s+/, ' ').squeeze(' ')[0..99]
# >> IANA — Example domains Domains Numbers Protocols About IANA Example Domains As described in RFC 2606