I have a requirement to read a series of URLs from a text file and then retrieve the pages and output a list of links.
The code has issues whenever the input URLs contain fragment identifiers (#). I tried escaping these with %23 but this didn't seem to help.
The error given is from OpenURI and is 404.
#requirements
require 'nokogiri'
require 'open-uri'
#opening each line in input text file
line_num=0
text=File.open('input.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
print "#{line_num += 1} #{line}"
open('output.txt', 'a') { |f|
f.puts "#{line_num} #{line}"
}
uri = URI.parse(URI.encode(line.strip))
page = Nokogiri::HTML(open(uri))
links = page.css("div.product-carousel-container a")
#loop through links if present
e = 0
while e < links.length
open('output.txt', 'a') { |f|
f.puts links[e]["href"]
}
e += 1
end
end
Problem
The fragment part of a URI should not be sent to the server.
From Wikipedia: Fragment Identifier
The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the web server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a web resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
Solution
Strip the fragment part of the URI before passing it to open.
require "uri"
u = URI.parse "http://example.com#fragment"
u.fragment = nil
u.to_s #=> "http://example.com"
You're 90% of the way there. The client is responsible for processing the fragment.
Your code is already using URI to parse the string, so let the parsed object remove the fragment:
require 'open-uri'
uri = URI.parse('http://foo.com/index.html#bar')
uri # => #<URI::HTTP http://foo.com/index.html#bar>
uri.fragment = nil
uri # => #<URI::HTTP http://foo.com/index.html>
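Applied to the loop in the question, the same idea might look like the sketch below. The helper name without_fragment and the sample lines are assumptions made for illustration; the cleaned string is what would then be handed to open-uri.

```ruby
require 'uri'

# Hypothetical helper: returns the URL string with its fragment removed.
def without_fragment(raw_url)
  uri = URI.parse(raw_url.strip)
  uri.fragment = nil
  uri.to_s
end

# Sample input lines (made up), standing in for the lines of input.txt.
lines = ["http://example.com/products#reviews\n", "http://example.com/about\n"]
cleaned = lines.map { |line| without_fragment(line) }
puts cleaned
```

Because the fragment is dropped before the request is made, the server only ever sees the part of the URL it can resolve.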
Related
This is a follow-up question to this post.
I am new to Ruby and want to create a script that will search a file for a pattern. However, I want to only replace part of it, i.e. remove all http:// patterns matches but only when they are followed by a valid url.
If "valid url" means that the string is parseable as a URL, then you might try using URI.parse. For example:
require 'uri'
IO.readlines(input_file).each do |line|
line.gsub(%r;(https?://\S+);) do |url|
URI.parse(url) && '' rescue url
end
end
However, the URI module is very lax. You'll find strings like not-an-uri are considered valid "generic" URIs.
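A small sketch of that laxness, plus one possible stricter check. The helper name http_url? is hypothetical, and requiring an http(s) scheme with a non-empty host is only one assumption about what "valid" should mean here.

```ruby
require 'uri'

# URI.parse accepts a bare word as a relative reference.
lax = URI.parse('not-an-uri')
puts lax.class          # URI::Generic
puts lax.scheme.inspect # nil

# Hypothetical stricter check: require an http(s) URI with a non-empty host.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

puts http_url?('not-an-uri')
puts http_url?('http://example.com/')
```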
You might want to check whether the captured URL can be fetched and returns a successful HTTP status. That is significantly more resource intensive, so operating over a large input file would be very slow. It could also be considered a security risk, since the script makes requests to whatever URLs appear in the input.
require 'uri'
require 'net/http'
def valid_url?(url)
uri = URI.parse(url)
Net::HTTP.get_response(uri).is_a? Net::HTTPSuccess
rescue
return false
end
IO.readlines(input_file).each do |line|
line.gsub(%r;(https?://\S+);) do |url|
valid_url?(url) ? '' : url
end
end
I need to open some web pages using open-uri in Ruby and then parse the content of those pages using Nokogiri.
I just did:
require 'open-uri'
content_file = open(user_input_url)
This worked for http://www.google.co.in and http://google.co.in, but fails when the user gives input like www.google.co.in or google.co.in.
One thing I can do for such input is prepend http:// or https:// and return the content of the page that opens. But this seems like a big hack to me.
Is there a better way to achieve this in Ruby (i.e. converting this user input to valid open-uri URLs)?
uri = URI("www.google.com")
if uri.instance_of?(URI::Generic)
uri = URI::HTTP.build({:host => uri.to_s})
end
content_file = open(uri)
There are other ways as well; see http://www.ruby-doc.org/stdlib-2.0.0/libdoc/uri/rdoc/URI/HTTP.html
Prepend the scheme if not present and then use URI which will check the URL validity:
require 'uri'
url = 'www.google.com/a/b?c=d#e'
url.prepend "http://" unless url.start_with?('http://', 'https://')
url = URI(url) # it will raise error if the url is not valid
open url
Unfortunately, an "object oriented" version of what you need is more verbose and even more hackish:
require 'uri'
case url = URI.parse('www.google.com/a/b?c=d#e')
when URI::HTTP, URI::HTTPS
# no-op
when URI::Generic
# We need to split url.path at the first '/', since URI::Generic interprets
# 'www.google.com/a/b' as a single path
host, path = url.path.split '/', 2
url = URI::HTTP.build host: host ,
path: "/#{path}" ,
query: url.query ,
fragment: url.fragment
else
raise "unsupported url class (#{url.class}) for #{url}"
end
open url
If you're open to suggestions, don't rack your brain too much over this: I have faced this problem often, and I'm quite sure there is no "polished" way to do it.
You need to prepend http:// to the URLs. Without an explicit scheme the URI could be anything, e.g. a local file; a URI is not necessarily an HTTP URL.
You can check either by using the URI class or by using a regex:
user_input_url = URI.parse(user_input_url).scheme ?
user_input_url :
"http://#{user_input_url}"
user_input_url = user_input_url =~ /\Ahttps?:\/\// ?
user_input_url :
"http://#{user_input_url}"
def instance_to_hash(instance)
hash = {}
instance.instance_variables.each {|var| hash[var[1..-1].to_sym] = instance.instance_variable_get(var) }
hash
end
def url_compile(url)
# if url without 'http://', 'https://', '//' at start of string
# then prepend '//'
url.prepend '//' unless url.start_with?('http://', 'https://', '//')
uri = URI(url)
if uri.instance_of?(URI::Generic) # if scheme nil then assume it HTTPS
uri = URI::HTTPS.build(instance_to_hash(uri))
end
uri
end
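The "prepend a scheme when none is present" advice from the answers above can be wrapped in one small helper. This is only a sketch: the name with_default_scheme and the choice of http as the default are assumptions, and the check looks for any scheme followed by ://, not just http and https.

```ruby
require 'uri'

# Hypothetical helper: prepend a default scheme when the input has none,
# then let URI.parse raise if the result is still not a valid URI.
def with_default_scheme(input, default = 'http')
  candidate = input.strip
  unless candidate.match?(%r{\A[a-z][a-z0-9+.\-]*://}i)
    candidate = "#{default}://#{candidate}"
  end
  URI.parse(candidate)
end

puts with_default_scheme('www.google.co.in')
puts with_default_scheme('https://google.co.in')
```

The returned object is a URI::HTTP (or URI::HTTPS), so it can be passed to open-uri directly.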
I have written code in Ruby that displays the timeline for a specific user. I would like to search Twitter to find every user who has mentioned a word. My code is currently:
require 'rubygems'
require 'oauth'
require 'json'
# Now you will fetch /1.1/statuses/user_timeline.json,
# returns a list of public Tweets from the specified
# account.
baseurl = "https://api.twitter.com"
path = "/1.1/statuses/user_timeline.json"
query = URI.encode_www_form(
"q" => "Obama"
)
address = URI("#{baseurl}#{path}?#{query}")
request = Net::HTTP::Get.new address.request_uri
# Print data about a list of Tweets
def print_timeline(tweets)
tweets.each do |tweet|
require 'date'
d = DateTime.parse(tweet['created_at'])
puts " #{tweet['text'].delete ","} , #{d.strftime('%d.%m.%y')} , #{tweet['user']['name']}, #{tweet['id']}"
end
end
# Set up HTTP.
http = Net::HTTP.new address.host, address.port
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
# If you entered your credentials in the first
# exercise, no need to enter them again here. The
# ||= operator will only assign these values if
# they are not already set.
consumer_key = OAuth::Consumer.new(
"")
access_token = OAuth::Token.new(
"")
# Issue the request.
request.oauth! http, consumer_key, access_token
http.start
response = http.request request
# Parse and print the Tweet if the response code was 200
tweets = nil
puts "Text,Date,Name,id"
if response.code == '200' then
tweets = JSON.parse(response.body)
print_timeline(tweets)
end
nil
How would I possibly change this code to search all of twitter for a specific word?
The easiest approach would be to use the 'Twitter' gem. Refer to this link for more information and for the type of the search results. Once you have all the correct authorization attributes in place (OAuth token, OAuth secret, etc.), you should be able to search as follows:
Twitter.search('Obama')
or
Twitter.search('Obama', options = {})
The Twitter API suggests the URI you should be using for global search is https://api.twitter.com/1.1/search/tweets.json and this means:
Your base_url component would be https://api.twitter.com
Your path component would be /1.1/search/tweets.json
Your query component would be the text you are searching for.
The query part takes a lot of values depending upon the API spec. Refer to the specification and you can change it as per your requirement.
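Put together, the three components might be assembled like this. It is a sketch under assumptions: the count value and the sample_body JSON are made up for illustration, no request is actually sent, and the rest of the question's OAuth and Net::HTTP code is assumed to stay as it is. The search endpoint wraps its results in a "statuses" array, so the parsing step differs slightly from the timeline version.

```ruby
require 'uri'
require 'json'

baseurl = 'https://api.twitter.com'
path    = '/1.1/search/tweets.json'
# "q" is the search text; "count" (an assumed value here) caps the results.
query   = URI.encode_www_form('q' => 'Obama', 'count' => 10)
address = URI("#{baseurl}#{path}?#{query}")
puts address.request_uri

# Made-up response body, shaped like a search result, to show the parsing step.
sample_body = '{"statuses":[{"text":"hello","user":{"name":"someone"}}]}'
tweets = JSON.parse(sample_body)['statuses']
names = tweets.map { |tweet| tweet['user']['name'] }.uniq
puts names
```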
Tip: Try using a REPL such as irb (I'd recommend pry), which makes it a lot easier to explore APIs. Also, check out the Faraday gem, which can be easier to use than the default HTTP library in Ruby, IMO.
I have a list of files on a server and would like to load and parse only the ID3 from each file.
The code below loads the entire file, which is (obviously) very time consuming when batched.
require 'mp3info'
require 'open-uri'
uri = "http://blah.com/blah.mp3"
Mp3Info.open(open(uri)) do |mp3|
puts mp3.tag.title
puts mp3.tag.artist
puts mp3.tag.album
puts mp3.tag.tracknum
end
This solution works for ID3v2 (the current standard). ID3v1 stores its metadata at the end of the file rather than the beginning, so it won't work in those cases.
This reads the first 4096 bytes of the file, which is an arbitrary cutoff. As far as I could tell from the ID3 documentation, there is no limit on the tag size, but 4 KB was the point at which I stopped getting parsing errors in my library.
I was able to build a simple dropbox audio player, which can be seen here:
soundstash.heroku.com
and open-sourced the code here: github.com/miketucker/Dropbox-Audio-Player
require 'open-uri'
require 'stringio'
require 'net/http'
require 'uri'
require 'mp3info'
url = URI.parse('http://example.com/filename.mp3') # turn the string into a URI
http = Net::HTTP.new(url.host, url.port)
req = Net::HTTP::Get.new(url.path) # init a request with the url path
req.range = (0..4095) # limit the load to the first 4096 bytes (the range is inclusive)
res = http.request(req) # load the start of the mp3 file
child = {} # prepare an empty hash to store the metadata we grab
Mp3Info.open( StringIO.open(res.body) ) do |m| # do the parsing
child['title'] = m.tag.title
child['album'] = m.tag.album
child['artist'] = m.tag.artist
child['length'] = m.length
end
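Rather than guessing a cutoff, the ID3v2 header itself reports how large the tag is: the first 10 bytes hold the "ID3" marker, the version, a flags byte, and a 4-byte size in which only the low 7 bits of each byte are used. A minimal sketch, assuming you have already fetched at least the first 10 bytes (for instance with a Range request like the one above). The helper name id3v2_tag_size is hypothetical, and an optional footer, if present, is not accounted for.

```ruby
# Hypothetical helper: given the first bytes of a file, return the total
# size of the ID3v2 tag (10-byte header plus body), or nil if there is none.
def id3v2_tag_size(header)
  return nil unless header && header.bytesize >= 10
  return nil unless header.byteslice(0, 3) == 'ID3'

  # Bytes 6..9 hold the size as four 7-bit ("synchsafe") values.
  size_bytes = header.byteslice(6, 4).bytes
  body_size = size_bytes.reduce(0) { |acc, byte| (acc << 7) | (byte & 0x7F) }
  body_size + 10
end

# Made-up header: version 2.3, no flags, size bytes 0, 0, 2, 1.
sample_header = "ID3\x03\x00\x00\x00\x00\x02\x01".b
puts id3v2_tag_size(sample_header)
```

A second Range request sized from this value would then fetch the whole tag and nothing more.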
How can I check if a string is a valid URL?
For example:
http://hello.it => yes
http:||bra.ziz, => no
If this is a valid URL, how can I check whether it refers to an image file?
Notice:
As pointed out by @CGuess, there is a bug report on this issue, and it has been documented for over 9 years that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
# Correct URL
end
Like Alexander Günther said in the comments, it checks if a string contains a URL.
To check if the string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
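The difference between "contains a URL" and "is a URL" can be seen in a short sketch. It builds the pattern with URI::RFC2396_Parser#make_regexp, which is, as far as I know, what URI::regexp delegates to; treat that equivalence as an assumption. The sample strings are made up, apart from the two from the question.

```ruby
require 'uri'

# Unanchored pattern: finds an http(s) URL anywhere inside a string.
http_pattern = URI::RFC2396_Parser.new.make_regexp(%w[http https])
# Anchored pattern: the whole string must be the URL.
anchored = /\A#{http_pattern}\z/

['http://hello.it', 'visit http://hello.it', 'http:||bra.ziz,'].each do |candidate|
  contains = http_pattern.match?(candidate)
  whole    = anchored.match?(candidate)
  puts "#{candidate.inspect}: contains=#{contains} whole=#{whole}"
end
```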
Similar to the answers above, I find using this regex to be slightly more accurate:
URI::DEFAULT_PARSER.regexp[:ABS_URI]
That will invalidate URLs with spaces, as opposed to URI.regexp which allows spaces for some reason.
I have recently found a shortcut that is provided for the different URI regexps. You can access any of URI::DEFAULT_PARSER.regexp.keys directly from URI::#{key}.
For example, the :ABS_URI regexp can be accessed from URI::ABS_URI.
The problem with the current answers is that a URI is not a URL.
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
Since URLs are a subset of URIs, it is clear that matching specifically for URIs will successfully match undesired values. For example, URNs:
"urn:isbn:0451450523" =~ URI::regexp
=> 0
That being said, as far as I know, Ruby doesn't have a default way to parse URLs, so you'll most likely need a gem to do so. If you need to match URLs specifically in HTTP or HTTPS format, you could do something like this:
uri = URI.parse(my_possible_url)
if uri.kind_of?(URI::HTTP) or uri.kind_of?(URI::HTTPS)
# do your stuff
end
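For the second half of the question (whether the URL refers to an image), one rough approach is to look at the extension of the parsed path. This is only a sketch: the helper name image_url? and the extension list are assumptions, and an extension says nothing about what the server actually returns, so checking the Content-Type of a real response would be more reliable.

```ruby
require 'uri'

# Assumed list of extensions to treat as images.
IMAGE_EXTENSIONS = %w[.png .jpg .jpeg .gif .webp].freeze

# Hypothetical helper: true when the string parses as an http(s) URL
# whose path ends in one of the extensions above.
def image_url?(string)
  uri = URI.parse(string)
  return false unless uri.is_a?(URI::HTTP)

  IMAGE_EXTENSIONS.include?(File.extname(uri.path.to_s).downcase)
rescue URI::InvalidURIError
  false
end

puts image_url?('http://hello.it/logo.PNG')
puts image_url?('http://hello.it')
puts image_url?('http:||bra.ziz,')
```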
I prefer the Addressable gem. I have found that it handles URLs more intelligently.
require 'addressable/uri'
SCHEMES = %w(http https)
def valid_url?(url)
parsed = Addressable::URI.parse(url) or return false
SCHEMES.include?(parsed.scheme)
rescue Addressable::URI::InvalidURIError
false
end
This is a fairly old entry, but I thought I'd go ahead and contribute:
String.class_eval do
def is_valid_url?
uri = URI.parse self
uri.kind_of? URI::HTTP
rescue URI::InvalidURIError
false
end
end
Now you can do something like:
if "http://www.omg.wtf".is_valid_url?
p "huzzah!"
end
For me, I use this regular expression:
/\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
Option:
i - case insensitive
x - ignore whitespace in regex
You can set this method to check URL validation:
def valid_url?(url)
return false if url.include?("<script")
url_regexp = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
url =~ url_regexp ? true : false
end
To use it:
valid_url?("http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby")
Testing with wrong URLs:
http://ruby3arabi - result is invalid
http://http://ruby3arabi.com - result is invalid
http:// - result is invalid
http://test.com\n<script src=\"nasty.js\"> (Just simply check "<script")
127.0.0.1 - result is invalid (IP addresses are not supported)
Test with correct URLs:
http://ruby3arabi.com - result is valid
http://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com/article/1 - result is valid
https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en - result is valid
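The cases listed above can be turned into a quick self-check. The sketch below repeats the method from this answer (with the regexp moved into a constant) and compares each URL against the result the answer lists for it; the EXPECTED table and the summary message are additions for illustration.

```ruby
URL_REGEXP = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix

def valid_url?(url)
  return false if url.include?('<script')

  URL_REGEXP.match?(url)
end

# Each URL from the answer, paired with the result the answer lists for it.
EXPECTED = {
  'http://ruby3arabi' => false,
  'http://http://ruby3arabi.com' => false,
  'http://' => false,
  "http://test.com\n<script src=\"nasty.js\">" => false,
  '127.0.0.1' => false,
  'http://ruby3arabi.com' => true,
  'http://www.ruby3arabi.com' => true,
  'https://www.ruby3arabi.com' => true,
  'https://www.ruby3arabi.com/article/1' => true,
  'https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en' => true
}.freeze

failures = EXPECTED.reject { |url, expected| valid_url?(url) == expected }
puts failures.empty? ? 'all cases behave as listed' : "unexpected: #{failures.keys.inspect}"
```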
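In general,
/\A#{URI::regexp}\z/
will work well, but if you only want to match http or https, you can pass those in as options to the method:
/\A#{URI::regexp(%w(http https))}\z/
That tends to work a little better if you want to reject protocols like ftp://. Note the \A and \z anchors: in Ruby, ^ and $ match at line boundaries, so with those a multi-line string containing a URL on any one line would pass.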
This is a little bit old but here is how I do it. Use Ruby's URI module to parse the URL. If it can be parsed then it's a valid URL. (But that doesn't mean accessible.)
URI supports many schemes, plus you can add custom schemes yourself:
irb> uri = URI.parse "http://hello.it" rescue nil
=> #<URI::HTTP:0x10755c50 URL:http://hello.it>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"http",
"query"=>nil,
"port"=>80,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
irb> uri = URI.parse "http:||bra.ziz" rescue nil
=> nil
irb> uri = URI.parse "ssh://hello.it:5888" rescue nil
=> #<URI::Generic:0x105fe938 URL:ssh://hello.it:5888>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"ssh",
"query"=>nil,
"port"=>5888,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
See the documentation for more information about the URI module.
You could also use a regex, such as the one at http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm. Assuming this regex is correct (I haven't fully checked it), the following will show the validity of the URL.
# A regex literal is used here because "\w" and "\d" inside a double-quoted
# string lose their backslashes before Regexp.new ever sees them.
url_regex = %r{((https?|ftp|file):((//)|(\\))+[\w\d:\#%/;$()~_?+\-=\\.&]*)}
urls = [
  "http://hello.it",
  "http:||bra.ziz"
]
urls.each do |url|
  if url =~ url_regex
    puts "%s is valid" % url
  else
    puts "%s not valid" % url
  end
end
The above example outputs:
http://hello.it is valid
http:||bra.ziz not valid