I am trying to extract all the URLs from the raw HTML fetched by this Ruby code:
require 'open-uri'
reqt = open("http://www.google.com").read
reqt.each_line { |line|
  if line =~ /http/ then
    puts URI.extract(line)
  end
}
What am I doing wrong? I am getting extra lines along with URLs.
Remember that the URL doesn't have to start with "http" - it could be a relative URL, a path relative to the current page. IMO it is best to use Nokogiri to parse the HTML:
require 'open-uri'
require 'nokogiri'
reqt = open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end
But if you really want to find only the absolute URLs, add a simple condition:
puts a.attr('href') if a.attr('href') =~ /^http/i
You can do this instead:
require 'open-uri'
reqt = open("http://www.google.com").read
urls = reqt.scan(/[[:lower:]]+:\/\/[^\s"]+/)
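If you want to stick with URI.extract from the question, note that it also takes an optional list of schemes, which keeps mailto: links and other noise out of the result (a small sketch of the same idea):
require 'open-uri'
require 'uri'
reqt = open("http://www.google.com").read
# Limit extraction to http/https URLs instead of post-filtering lines
puts URI.extract(reqt, ['http', 'https'])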
Related
I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
I tried this, but it isn't working; it errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
  spider.every_url do |url|
    link = url + "?foo=bar"
    response = open(link).read
  end
end
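If the 405s come from individual resources that refuse a plain GET (an assumption, but the error suggests it), wrapping the fetch in begin/rescue keeps one bad URL from aborting the crawl. A minimal sketch:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
  spider.every_url do |url|
    link = "#{url}?foo=bar"
    begin
      response = open(link).read
    rescue OpenURI::HTTPError => e
      # Some endpoints reject GET; log and keep crawling
      warn "skipping #{link}: #{e.message}"
    end
  end
end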
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran through those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  encoded_url = URI.encode(link)
  response = open(encoded_url).read
  puts encoded_url
  puts
end
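One caveat: URI.encode has long been deprecated and was removed in Ruby 3.0, so on newer Rubies the same line could use the default parser's escape method instead (a sketch):
encoded_url = URI::DEFAULT_PARSER.escape(link)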
I'm trying to create an array of all the links found at the URL below. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.
How do I get just the URLs?
require 'net/http'
require 'uri'
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)
If you are just trying to extract links (<a href="..."> elements) from the text file, then it seems better to parse it as real HTML with Nokogiri and then extract the links this way:
require 'nokogiri'
require 'open-uri'
# Parse the raw HTML text
doc = Nokogiri.parse(open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
# Extract all a-elements (HTML links)
all_links = doc.css('a')
# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }.uniq.
        sort.delete_if { |h| h.empty? }
# Print out some of them
puts links.grep(/store/)
http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
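If you only want one family of links in the first place, the filter can also live in the selector itself instead of a later grep; Nokogiri supports standard CSS attribute matchers like *= (a sketch against the same doc):
# Same result as links.grep(/store/), filtered at selection time
store_links = doc.css('a[href*="store"]').map { |a| a['href'] }.uniq.sort
puts store_links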
Not quite sure what I should do at this point... I am using a regular expression to capture JSON within the HTML of a website. I've set up a for loop to go through every line in the array to find {"page_cur":
I've attempted to push it to an array with the following code.
line.push(json_text)
The result was the entire site's source code. What am I doing wrong?
require 'mechanize'
require 'json'
mechanize = Mechanize.new
url = mechanize.get('http://www.hypem.com/')
page = url.body.split(/\n/)
json_text = Array.new
# look through every line of code
for line in page
  # find {"page_cur":
  next unless line =~ /^\s*\{"page_cur":/
  # delete the </script> tag on the last line
  line.sub! /<.script>/, ''
  # push into array?
end
If you are trying to push into the json_text array, it should be json_text.push(line).
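Put together, the corrected loop might look like this (same code as the question, with the push added; handing the captured line to JSON.parse is a hypothetical next step and assumes the line is valid JSON once the tag is stripped):
require 'mechanize'
require 'json'
mechanize = Mechanize.new
url = mechanize.get('http://www.hypem.com/')
json_text = Array.new
url.body.split(/\n/).each do |line|
  # find {"page_cur":
  next unless line =~ /^\s*\{"page_cur":/
  # delete the </script> tag on the last line
  line.sub! /<.script>/, ''
  # this is the missing piece: actually store the line
  json_text.push(line)
end
# hypothetical next step: parse the captured JSON
data = JSON.parse(json_text.first) unless json_text.empty?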
EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts "#{textarea}"
create_file.close
end
What am I doing wrong with my loop?
Ok, now you're looping over the lines of the input file. When you do that, you get strings that end in a newline. So you're trying to create a file with a newline in the middle of its name, which is not legal in Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
You can also use File.foreach instead of File.open(...).each_line:
File.foreach('sources.txt') do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
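Putting it together, a corrected version of the loop from the question would be (same code, with the chomp added):
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.foreach('sources.txt') do |source_url|
  source_url.chomp!  # strip the trailing newline so the filename is legal
  puts "Visiting #{source_url}"
  page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
  textarea = page.css('textarea')
  File.open("#{source_url}.txt", 'w') { |f| f.puts textarea }
end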
You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')
I want to extract the links to members' home sites from a site.
They look like this:
<a href="http://www.ptop.se" target="_blank">
I tested it on this site:
http://www.rubular.com/
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page|
  content = page.read()
  links = content.scan(/<a href="(.*?)" target="_blank">/)
  links.each { |link| puts #{link}
  }
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good Ruby HTML/XML parsing libraries, e.g. Hpricot or Nokogiri.
If you need to log in to the site, you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(open(url))
doc.search("//a[#target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(open(url))
doc.xpath("//a[#target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
1. I don't know what you mean by using #{link}. Outside of a string literal, #{link} is parsed as a comment, so puts prints a blank line; if you want string interpolation, wrap it in quotes, i.e. "#{link}".
2. String#scan accepts a block. Use it to loop through the matches.
3. The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'
url = "http://itproffs.se/forumv2/"
open(url) do |page|
  content = page.read()
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There are better ways to do it, I'm sure, but this should work.
Hope it helps.
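For completeness, a Nokogiri version of the same extraction (a sketch, reusing the profile URL from the question):
require 'nokogiri'
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
doc = Nokogiri::HTML(open(url))
# CSS attribute selector: every link that opens in a new window
doc.css('a[target="_blank"]').each { |a| puts a['href'] }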