Sanitizing URL strings

Say we have a string
url = "http://example.com/foo/baz/../../."
Obviously, we know from the Unix shell that ../../. essentially means to go up two directories. Hence, this URL is really going to http://example.com/. My question is, given these ../ characters in a string, how can we sanitize the URL string to point at the actual resource?
For example:
url = "http://example.com/foo/baz/../../hello.html"
url = process(url)
url = "http://example.com/hello.html"
Another:
url = "http://example.com/foo/baz/../."
url = process(url)
url = "http://example.com/foo/"
Keep in mind, the function still has to be able to take in normal URLs (e.g. http://example.com) and return them as is if there is nothing to sanitize.

The addressable gem can do this.
require 'addressable'
Addressable::URI.parse("http://example.com/foo/baz/../../hello.html").normalize.to_s
#=> "http://example.com/hello.html"

#!/usr/bin/env ruby
# ======
## defs:
def process(url)
  url_components = url.split('/')
  url_components2 = url_components.dup
  current_index = 0
  url_components.each do |component|
    if component == '..'
      url_components2.delete_at(current_index)
      url_components2.delete_at(current_index-1)
      current_index -= 1
    elsif component == '.'
      url_components2.delete_at(current_index)
    else
      current_index += 1
    end
  end
  url_resolved = url_components2.join('/')
  return url_resolved
end
# =======
## tests:
urls = [
"http://example.com/foo/baz/../../.",
"http://example.com/foo/baz/../../hello.html",
"http://example.com/foo/baz/../."
]
urls.each do |url|
print url, ' => '
puts process(url)
end
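
If adding a gem is not an option, the same idea can be sketched with only the standard library. This is a minimal sketch and not what addressable does internally: collapse_dot_segments is a hypothetical helper name, and it assumes the input is an absolute URL that URI.parse accepts. Note that it also drops empty path segments (a//b becomes a/b), which may not suit every caller.

```ruby
require 'uri'

# Hypothetical helper: collapse "." and ".." segments in the path of a URL.
def collapse_dot_segments(url)
  uri = URI.parse(url)
  # Nothing to sanitize when there is no path (e.g. "http://example.com").
  return url if uri.path.nil? || uri.path.empty?

  # A path ending in "/", "/." or "/.." refers to a directory, so keep the slash.
  trailing_slash = uri.path.end_with?('/', '/.', '/..')

  stack = []
  uri.path.split('/').each do |segment|
    case segment
    when '', '.' then next        # skip empty and current-directory segments
    when '..'    then stack.pop   # go up one level; popping an empty stack is a no-op
    else              stack.push(segment)
    end
  end

  path = '/' + stack.join('/')
  path += '/' if trailing_slash && path != '/'
  uri.path = path
  uri.to_s
end

puts collapse_dot_segments('http://example.com/foo/baz/../../hello.html')
```

Using a stack avoids the index bookkeeping of the loop above, and going through URI.parse keeps the scheme, host and query string out of the segment handling.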

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time by defining the filename, and specifically output them into a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.

Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
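
One possible way to deal with it, as a rough sketch: loop over the array and derive each output name from the input name. The helper name csv_path_for is hypothetical, and the loop in the comment assumes the input_folder and output_folder variables from the question.

```ruby
# Hypothetical helper: map an input XML path to a CSV path with the same
# base name inside output_folder.
def csv_path_for(xml_path, output_folder)
  File.join(output_folder, File.basename(xml_path, '.xml') + '.csv')
end

puts csv_path_for('H:/input/people.xml', 'H:/output')

# Assumed usage with the question's folders: handle one file per iteration
# instead of concatenating the whole array onto a string.
#
#   Dir[File.join(input_folder, '*.xml')].each do |xml_path|
#     doc = Nokogiri::XML(File.read(xml_path))
#     File.open(csv_path_for(xml_path, output_folder), 'w') do |out|
#       # ... write the rows for this document ...
#     end
#   end
```

File.join also takes care of a trailing slash on the folder name, so the manual slash checks at the top of the script would no longer be needed.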
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub as they don't always do what you think they do. Both expect a regular expression, but will also take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
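
As a small aside, and assuming Ruby 2.5 or later, String#delete_prefix expresses the same "remove it only at the start" intent without indexes or regular expressions. The method name strip_wd below is a hypothetical wrapper used only for illustration.

```ruby
# Hypothetical wrapper: remove a leading "wd:" and nothing else.
def strip_wd(name)
  name.delete_prefix('wd:')
end

strip_wd('wd:bar')      # => "bar"
strip_wd('wd:barwd:')   # => "barwd:"
strip_wd('bar')         # => "bar"
```

Unlike foo[0,3] = '', this does not modify the original string and leaves strings without the prefix untouched.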

You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Why doesn't my web-crawling method find all the links?

I'm trying to create a simple web-crawler, so I wrote this:
(The get_links method takes a parent link from which we start searching.)
require 'nokogiri'
require 'open-uri'
def get_links(link)
link = "http://#{link}"
doc = Nokogiri::HTML(open(link))
links = doc.css('a')
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
array = hrefs.select {|i| i[0] == "/"}
host = URI.parse(link).host
links_list = array.map {|a| "#{host}#{a}"}
end
(The search_links method takes the array returned by get_links and searches each link in it.)
def search_links(urls)
urls = get_links(link)
urls.uniq.each do |url|
begin
links = get_links(url)
compare = urls & links
urls << links - compare
urls.flatten!
rescue OpenURI::HTTPError
warn "Skipping invalid link #{url}"
end
end
return urls
end
This method finds most of the links on the website, but not all.
What did I do wrong? Which algorithm should I use?

Some comments about your code:
def get_links(link)
link = "http://#{link}"
# You're assuming the protocol is always http.
# This isn't the only protocol used on the web.
doc = Nokogiri::HTML(open(link))
links = doc.css('a')
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
# You can write these two lines more compactly as
# hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
array = hrefs.select {|i| i[0] == "/"}
# I guess you want to handle URLs that are relative to the host.
# However, URLs relative to the protocol (starting with '//')
# will also be selected by this condition.
host = URI.parse(link).host
links_list = array.map {|a| "#{host}#{a}"}
# The value assigned to links_list will implicitly be returned.
# (The assignment itself is futile, the right-hand-part alone would
# suffice.) Because this builds on `array` all absolute URLs will be
# missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to directly get to the href attributes of a elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
And comments about the second function:
def search_links(urls)
urls = get_links(link)
urls.uniq.each do |url|
begin
links = get_links(url)
compare = urls & links
urls << links - compare
urls.flatten!
# How about using a Set instead of an Array and
# thus have the collection provide uniqueness of
# its items, so that you don't have to?
rescue OpenURI::HTTPError
warn "Skipping invalid link #{url}"
end
end
return urls
# This function isn't recursive, it just calls `get_links` on two
# 'levels'. Thus you search only two levels deep and return findings
# from the first and second level combined. (Without the "zeroth"
# level - the URL passed into `search_links`. Unless, of course, it
# also occurred on the first or second level.)
#
# Is this what you intended?
end
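
To illustrate the Set suggestion and a traversal that goes deeper than two levels, here is a minimal breadth-first sketch. It is not the asker's code: crawl and max_pages are hypothetical names, and the block stands in for get_links so that the traversal can be tried without any network access.

```ruby
require 'set'

# Hypothetical breadth-first crawl. The block receives a URL and must return
# an array of URLs found on that page (the role get_links plays above).
def crawl(start_url, max_pages: 100)
  seen  = Set.new([start_url])   # every URL discovered so far
  queue = [start_url]            # URLs whose links still need fetching

  until queue.empty? || seen.size >= max_pages
    url = queue.shift
    links = begin
      yield(url)
    rescue StandardError => e
      warn "Skipping #{url}: #{e.message}"
      []
    end

    links.each do |link|
      break if seen.size >= max_pages
      # Set#add? returns nil when the link is already known, so each URL
      # is queued at most once.
      queue << link if seen.add?(link)
    end
  end

  seen.to_a
end

# A tiny in-memory "site" used in place of real HTTP requests.
site = { 'a' => %w[b c], 'b' => %w[a d], 'c' => [], 'd' => %w[a] }
p crawl('a') { |url| site.fetch(url, []) }
```

The queue keeps the crawl going until no new links turn up, and the Set makes the uniq, & and flatten! bookkeeping unnecessary. The max_pages cap is only there so that a large site cannot keep the loop running indefinitely.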

You should probably be using mechanize:
require 'mechanize'
agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map{|a| page.uri.merge(a[:href]).to_s}
# if you want to remove links with a different host (hyperlinks?)
links.reject!{|l| URI.parse(l).host != page.uri.host}
Otherwise you'll have trouble converting relative urls to absolute properly.
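
If you would rather stay with open-uri and Nokogiri, the standard library's URI.join does the same relative-to-absolute resolution. A minimal sketch follows; absolutize is a hypothetical helper name, and silently dropping hrefs that fail to parse is an assumption you may want to change.

```ruby
require 'uri'

# Hypothetical helper: resolve every href against the URL of the page it
# was found on, skipping hrefs that are not valid URIs.
def absolutize(base_url, hrefs)
  hrefs.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil
    end
  end.compact
end

base = 'http://example.com/a/b.html'
p absolutize(base, ['/c', 'd.html', 'http://other.org/x', '//cdn.example.com/y.js'])
```

This covers host-relative, document-relative, absolute and protocol-relative links in one call, so the i[0] == "/" filter and the manual "#{host}#{a}" concatenation are not needed.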

Dynamically check if a field in JSON is nil without using eval

Here's an extract of the code that I am using:
def retrieve(user_token, quote_id, check="quotes")
end_time = Time.now + 15
match = false
until Time.now > end_time || match
@response = http_request.get(quote_get_url(quote_id, user_token))
eval("match = !JSON.parse(@response.body)#{field(check)}.nil?")
end
match.eql?(false) ? nil : @response
end
private
def field (check)
hash = {"quotes" => '["quotes"][0]',
"transaction-items" => '["quotes"][0]["links"]["transactionItems"]'
}
hash[check]
end
I was informed that using eval in this manner is not good practice. Could anyone suggest a better way of dynamically checking the existence of a JSON node (field?). I want this to do:
pseudo: match = !JSON.parse(@response.body) + dynamic-path + .nil?

Store paths as arrays of path elements (['quotes', 0]). With a little helper function you'll be able to avoid eval. It is, indeed, completely inappropriate here.
Something along these lines:
class Hash
def deep_get(path)
path.reduce(self) do |memo, path_element|
return unless memo
memo[path_element]
end
end
end
path = ['quotes', 0]
hash = JSON.parse(response.body)
match = !hash.deep_get(path).nil?
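
On Ruby 2.3 and later, the built-in Hash#dig and Array#dig do the same walk without reopening Hash. A minimal sketch, where PATHS and field_present? are hypothetical names that mirror the question's field method:

```ruby
require 'json'

# Paths stored as arrays of keys and indexes instead of strings to eval.
PATHS = {
  'quotes'            => ['quotes', 0],
  'transaction-items' => ['quotes', 0, 'links', 'transactionItems']
}.freeze

# Hypothetical helper: true when the value at the named path is not nil.
def field_present?(body, check)
  path = PATHS.fetch(check)
  !JSON.parse(body).dig(*path).nil?
end

body = '{"quotes":[{"links":{"transactionItems":[1]}}]}'
field_present?(body, 'quotes')             # => true
field_present?(body, 'transaction-items')  # => true
field_present?('{"quotes":[]}', 'quotes')  # => false
```

dig returns nil as soon as an intermediate value is nil. It raises a TypeError if an intermediate value is something that cannot be dug into, such as a string, so this assumes the response has the expected overall shape.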

ruby cgi wont return method calls, but will return parameters

my environment: ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux]
The thing is, I can make Ruby return the parameters sent over GET, but when I try to use them as arguments to my methods in an if/else, Ruby won't return anything and I end up with a blank page.
ph and pm return correctly:
http://127.0.0.1/cgi-bin/test.rb?hostname=node00.abit.dk&macadd=23:14:41:51:63
returns:
node00.abit.dk 23:14:41:51:63
Connection to the database (MySQL) works fine
When I test the method newHostName it outputs correctly:
puts newHostName
returns (which is correct)
node25.abit.dk
the code:
#!/usr/bin/ruby
require 'cgi'
require 'sequel'
require 'socket'
require 'timeout'
DB = Sequel.connect(:adapter=>'mysql', :host=>'localhost', :database=>'nodes', :user=>'nodeuser', :password=>'...')
#cgi-part to work
#takes 2 parameters:
#hostname & macadd
cgi = CGI.new
puts cgi.header
p = cgi.params
ph = p['hostname']
pm = p['macadd']
def nodeLookup(hostnameargv)
hostname = DB[:basenode]
h = hostname[:hostname => hostnameargv]
h1 = h[:hostname]
h2 = h[:macadd]
ary = [h1, h2]
return ary
end
def lastHostName()
#TODO: replace with correct sequel-code and NOT raw SQL
DB.fetch("SELECT hostname FROM basenode ORDER BY id DESC LIMIT 1") do |row|
return row[:hostname]
end
end
def newHostName()
org = lastHostName
#Need this 'hack' to make ruby grep for the number
#nodename e.g 'node01.abit.dk'
var1 = org[4]
var2 = org[5]
var3 = var1 + var2
sum = var3.to_i + 1
#puts sum
sum = "node" + sum.to_s + ".abit.dk"
return sum
end
def insertNewNode(newhost, newmac)
newnode = DB[:basenode]
newnode.insert(:hostname => newhost, :macadd => newmac)
return "#{newnode.count}"
end
#puts ph
#puts pm
#puts newHostName
cgi.out() do
cgi.html do
begin
if ph == "node00.abit.dk"
puts newHostName
else
puts nodeLookup(ph)
end
end
end
end
I feel like I'm missing something here. Any help is very much appreciated!
//M00kaw

How about modifying the last lines of your code as follows? CGI HTML generation methods take a block and use the return value of the block as their content. So you should make newHostName or nodeLookup(ph) the return value of the block passed to cgi.html(), rather than calling puts, which prints the content to standard output and returns nil. That's why cgi.html() got an empty string (nil.to_s).
#puts newHostName
cgi.out() do
cgi.html do
if ph == "node00.abit.dk"
newHostName
else
nodeLookup(ph)
end
end
end
p.s. It's conventional to indent your ruby code with 2 spaces :-)
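
The difference between printing inside a block and returning a value from it can be seen without CGI at all. The wrap_html method below is a hypothetical stand-in for a block-taking helper such as cgi.html, not its real implementation.

```ruby
# Stand-in for a block-taking HTML helper: it wraps whatever the block returns.
def wrap_html
  "<HTML>#{yield}</HTML>"
end

returned = wrap_html { 'node25.abit.dk' }
printed  = wrap_html { puts 'node25.abit.dk' }   # puts prints, then returns nil

p returned   # the host name ends up inside the tags
p printed    # the tags are empty because nil.to_s is ""
```

The second call still prints the host name, but it is written before the wrapper builds its string, which is why nothing appears inside the tags.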

Extract path and url from config file via regex

I have a gitmodules file like this:
[submodule "dotfiles/vim/bundle/cucumber"]
path = dotfiles/vim/bundle/cucumber
url = git://github.com/tpope/vim-cucumber.git
[submodule "dotfiles/vim/bundle/Command-T"]
path = dotfiles/vim/bundle/Command-T
url = git://github.com/vim-scripts/Command-T.git
What I want to do is, for each submodule, get the path and url as a hash or some other structure that keeps the data like this:
submodule: cucumber (path -> 'path', url -> 'url')
How can I do it with a regex? Or is there a more efficient way of parsing this kind of file?

This file format is something of a standard and so I imagine there is a gem or other code floating around that will parse it. On the other hand, it's easy to parse and encapsulated little text problems like this are "the fun part" of development, so why not reinvent the wheel? It's kind of like playing a game...
require 'pp'
def scangc
result = h = {}
open '../.gitconfig', 'r' do |f|
while s = f.gets
s.strip!
if s[0..0] == '['
result[s[1..-2].to_sym] = h = Hash.new
next
end
raise 'expected =' unless s['=']
a = s.strip.split /\s+=\s+/
h[a[0].to_sym] = a[1]
end
end
pp result
end
scangc
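
For the exact structure asked for in the question (submodule name mapped to its path and url), a variant that parses a string rather than a file might look like the sketch below. parse_gitmodules is a hypothetical name, and the sketch assumes every section header has the form [submodule "some/path"] and that the last path component is a suitable key.

```ruby
# Hypothetical parser: returns { "name" => { path: "...", url: "..." } }.
def parse_gitmodules(text)
  result  = {}
  current = nil

  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#', ';')   # blanks and comments

    if (m = line.match(/\A\[submodule "(.+)"\]\z/))
      # Key each section by the last component of the quoted name.
      current = result[File.basename(m[1])] = {}
    elsif current && (m = line.match(/\A(\w+)\s*=\s*(.*)\z/))
      current[m[1].to_sym] = m[2]
    end
  end

  result
end

sample = <<~GITMODULES
  [submodule "dotfiles/vim/bundle/cucumber"]
  path = dotfiles/vim/bundle/cucumber
  url = git://github.com/tpope/vim-cucumber.git
  [submodule "dotfiles/vim/bundle/Command-T"]
  path = dotfiles/vim/bundle/Command-T
  url = git://github.com/vim-scripts/Command-T.git
GITMODULES

p parse_gitmodules(sample)
```

Lines that match neither pattern are ignored rather than raising, which is a design choice; raising, as scangc does, is the stricter alternative.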

I would do it like this in python:
import re
x = """[submodule "dotfiles/vim/bundle/cucumber"]
path = dotfiles/vim/bundle/cucumber
url = git://github.com/tpope/vim-cucumber.git
[submodule "dotfiles/vim/bundle/Command-T"]
path = dotfiles/vim/bundle/Command-T
url = git://github.com/vim-scripts/Command-T.git"""
submodules = re.findall(r'\[submodule.*/(.*)"\]', x)
paths = re.findall(r"path\s*=\s*(.*)", x)
urls = re.findall(r"url\s*=\s*(.*)", x)
group = zip(submodules,zip(paths,urls))
submodule_dict = dict([(z[0],{'path':z[1][0],'url':z[1][1]}) for z in group])
Which creates submodule_dict as
{'Command-T': {'path': 'dotfiles/vim/bundle/Command-T',
'url': 'git://github.com/vim-scripts/Command-T.git'},
'cucumber': {'path': 'dotfiles/vim/bundle/cucumber',
'url': 'git://github.com/tpope/vim-cucumber.git'}}
