Extracting all URLs from a page using Ruby [duplicate] - ruby

I am trying to extract all the URLs from the raw output of some Ruby code:
require 'open-uri'
reqt = open("http://www.google.com").read
reqt.each_line { |line|
if line =~/http/ then
puts URI.extract(line)
end }
What am I doing wrong? I am getting extra lines along with URLs.

Remember that a URL doesn't have to start with "http": it can be a relative URL, resolved against the address of the current page. IMO it is best to use Nokogiri to parse the HTML:
require 'open-uri'
require 'nokogiri'
reqt = URI.open("http://www.google.com") # on Ruby 3.0+, plain open() no longer fetches URLs
doc = Nokogiri::HTML(reqt)
doc.xpath('//a[@href]').each do |a|
puts a.attr('href')
end
But if you really want to find only the absolute URLs, add a simple condition:
puts a.attr('href') if a.attr('href') =~ /^http/i
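If you would rather keep the relative links than discard them, they can be resolved against the page address. This is only a minimal sketch using URI.join from the standard library; the base address and the href values are made-up sample data, not something taken from the page in the question:

```ruby
require 'uri'

# Hypothetical page address and href values, for illustration only.
base  = 'http://www.example.com/docs/index.html'
hrefs = ['/about', 'page2.html', 'http://other.example.org/x']

# URI.join resolves each href against the base; absolute URLs pass through unchanged.
absolute = hrefs.map { |href| URI.join(base, href).to_s }

puts absolute
```

In real code the hrefs array would come from the Nokogiri loop above.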

You can do this instead:
require 'open-uri'
reqt = URI.open("http://www.google.com").read # on Ruby 3.0+, plain open() no longer fetches URLs
urls = reqt.scan(/[[:lower:]]+:\/\/[^\s"]+/)
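As for the extra lines in the question: URI.extract without a scheme list accepts anything that looks like scheme:something, which in HTML can include things like inline CSS. Passing a list of schemes narrows it down. A small sketch on a made-up HTML string (the sample text is an assumption; note also that URI.extract is marked obsolete in current Ruby, although it is still available):

```ruby
require 'uri'

# Made-up sample text, for illustration only.
text = '<a href="http://example.com/a">x</a> ' \
       '<span style="color:red">mailto:bob@example.com</span> ' \
       'https://example.org/b?c=1 end'

# Restrict extraction to the schemes we care about.
urls = URI.extract(text, %w[http https])

puts urls
```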

Related

Ruby crawl site, add URL parameter

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
Tried this, not working though, errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
spider.every_url do |url|
link= url+"?foo=bar"
response = open(link).read
end
end
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran through those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
link = "http://www.example.com"+row[0]+"?foo=bar"
encoded_url = URI.encode(link)
response = open(encoded_url).read
puts encoded_url
puts
end
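One thing to watch with url+"?foo=bar" is that it produces a malformed address when the URL already has a query string. A hedged sketch of appending the parameter through the URI class instead; with_param is a hypothetical helper name, and this is not claimed to be the cause of the 405 error above:

```ruby
require 'uri'

# Hypothetical helper: append key=value to a URL, keeping any existing query.
def with_param(url, key, value)
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query || '')  # existing pairs, or []
  params << [key, value]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts with_param('http://www.example.com/page1', 'var', 'param1')
puts with_param('http://www.example.com/page2?x=1', 'var', 'param1')
```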

scanning a webpage for urls with ruby and regex

I'm trying to create an array of all links found at the below url. Using page.scan(URI.regexp) or URI.extract(page) returns more than just urls.
How do I get just the urls?
require 'net/http'
require 'uri'
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)
If you are just trying to extract links (<a href="..."> elements) from the text file then it seems better to parse it as real HTML with Nokogiri, and then extract the links this way:
require 'nokogiri'
require 'open-uri'
# Parse the raw HTML text
doc = Nokogiri.parse(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
# Extract all a-elements (HTML links)
all_links = doc.css('a')
# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }.uniq.
sort.delete_if { |h| h.empty? }
# Print out some of them
puts links.grep(/store/)
http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...

Capture something specific within a string with a Regular Expression

Not quite sure what I should do at this point. I am using a regular expression to capture JSON within the HTML on a website. I've set up a for loop to go through every line in the array to find {"page_cur":
I've attempted to push it to an array with the following code.
line.push(json_text)
The result was the entire sites source code. What am I doing wrong?
require 'mechanize'
require 'json'
mechanize = Mechanize.new
url = mechanize.get('http://www.hypem.com/')
page = Array.new
page = url.body.split(/\n/)
json_text = Array.new
#look through every line of code
for line in page
#find {"page_cur":
next unless line =~ /^\s*\{"page_cur":/
#delete </script> tag on the last line
line.sub! /<.script>/, ''
#push into array?
end
If you are trying to push into the json_text array it should be json_text.push(line)
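Putting that together, the loop might look roughly like the sketch below. The page array here is invented sample data standing in for url.body.split(/\n/), and parsing with JSON.parse afterwards is an extra step assumed for illustration:

```ruby
require 'json'

# Invented sample lines standing in for the real page source.
page = [
  '<script>',
  '  {"page_cur": 1, "title": "demo"}</script>',
  '<p>other markup</p>'
]

json_text = []
page.each do |line|
  # Skip everything except the line that starts the JSON object.
  next unless line =~ /^\s*\{"page_cur":/
  # Drop the trailing </script> tag, then collect the line.
  json_text.push(line.sub(%r{</script>}, '').strip)
end

parsed = json_text.map { |text| JSON.parse(text) }
puts parsed.first['page_cur']
```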

File Creation/Loop Problems in Ruby

EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts "#{textarea}"
create_file.close
end
What am I doing wrong with my loop?
Ok, now you're looping over the lines of the input file. When you do that, you get strings that end in a newline. So you're trying to create a file with a newline in the middle of its name, which is not legal in Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
source_url.chomp!
# ... rest of code goes here ...
end
You can also use File.foreach instead of File.open(...).each_line:
File.foreach('sources.txt') do |source_url|
source_url.chomp!
# ... rest of code goes here ...
end
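To see the effect of chomp without touching the network, here is a small self-contained sketch. It writes a temporary stand-in for sources.txt (the two host names are sample data) and builds the output filenames only:

```ruby
require 'tempfile'

# Temporary stand-in for sources.txt, with sample contents.
sources = Tempfile.new('sources')
sources.write("www.flickr.com\nwww.example.com\n")
sources.close

# Without a block, File.foreach returns an enumerator over the lines.
filenames = File.foreach(sources.path).map do |source_url|
  "#{source_url.chomp}.txt" # chomp removes the trailing newline
end
sources.unlink

puts filenames
```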
You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')

Ruby Regex Help

I want to extract the members' home site links from a site.
They look like this:
<a href="http://www.ptop.se" target="_blank">
I tested this regex at http://www.rubular.com/:
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
links = content.scan(/<a href="(.*?)" target="_blank">/)
links.each {|link| puts #{link}
}
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good ruby HTML/XML parsing libraries e.g. Hpricot or Nokogiri.
If you need to log in on the site you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(URI.open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(URI.open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
There are several issues with your code:
1. I don't know what you mean by using #{link}. Outside a string literal, # starts a comment, so puts #{link} prints only a blank line. If you want to interpolate the link, make sure you wrap it in quotes, i.e. "#{link}".
2. String#scan accepts a block. Use it to loop through the matches.
3. The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'
url = "http://itproffs.se/forumv2/"
URI.open(url) do |page|
content = page.read()
content.scan(/<a href="(.*?)" target="_blank">/) do |match|
match.each { |link| puts link}
end
end
There're better ways to do it, I am sure. But this should work.
Hope it helps
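A side note on why the inner match.each is there: when the regex has a capture group, scan yields an array of captures for every match. A quick sketch on a made-up HTML string (the second link is invented sample data), using flatten to get a plain list:

```ruby
# Made-up sample markup, for illustration only.
html = '<a href="http://www.ptop.se" target="_blank">home</a> ' \
       '<a href="http://www.example.com" target="_blank">other</a>'

# scan returns one array of captures per match, e.g. [["..."], ["..."]];
# flatten turns that into a simple array of strings.
links = html.scan(/<a href="(.*?)" target="_blank">/).flatten

puts links
```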
