I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time, hard-coding the input filename and writing to a fixed output filename:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
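As a starting point, here is a minimal sketch of one way to handle it: loop over the array one filename at a time and derive each output name from the input name. It assumes each CSV should keep the XML file's base name; csv_path_for and each_xml_with_target are hypothetical helper names, not part of the original code.

```ruby
# Hypothetical helper: map an input XML path to an output CSV path
# that keeps the same base name.
def csv_path_for(xml_path, output_folder)
  base = File.basename(xml_path, '.xml')   # "H:/input/a.xml" -> "a"
  File.join(output_folder, "#{base}.csv")  # -> "H:/output/a.csv"
end

# Dir[] returns an Array, so yield one filename at a time
# together with the CSV path it should be written to.
def each_xml_with_target(input_folder, output_folder)
  xml_paths = Dir[File.join(input_folder, '*.xml')].sort_by { |f| File.mtime(f) }
  xml_paths.each do |xml_path|
    yield xml_path, csv_path_for(xml_path, output_folder)
  end
end
```

Inside the block you would read and parse xml_path as before, then open the yielded CSV path for writing instead of a hard-coded name. File.join also copes with a trailing slash on either folder, so the manual slash checks become unnecessary.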
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
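If the comparison against '' looks too terse, String#empty? states the same intent directly. A small sketch (has_content? is a made-up helper name used only for illustration):

```ruby
# Hypothetical helper: true when the string has at least one character.
def has_content?(value)
  !value.empty?
end

has_content?('foo') # => true
has_content?('')    # => false
```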
Be careful with sub and gsub, as they may not do what you think they do. Both accept a regular expression, but will also take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
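Another option, assuming Ruby 2.5 or later, is String#delete_prefix, which removes the prefix only when it is actually at the start and returns a new string instead of modifying the original. The helper name strip_wd_prefix below is hypothetical:

```ruby
# Hypothetical helper: drop a leading "wd:" and nothing else.
def strip_wd_prefix(name)
  name.delete_prefix('wd:')
end

strip_wd_prefix('wd:barwd:') # => "barwd:"
strip_wd_prefix('bar')       # => "bar"
```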
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
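To illustrate why the CSV class is worth using over hand-built strings, here is a small sketch that builds CSV text in memory. It assumes each record is a Hash keyed by column name; records_to_csv is a hypothetical helper, not something from the code above.

```ruby
require 'csv'

# Hypothetical helper: turn an array of hashes into CSV text,
# emitting the columns in the order given by headers.
def records_to_csv(headers, records)
  CSV.generate do |csv|
    csv << headers
    records.each { |record| csv << headers.map { |h| record[h] } }
  end
end

csv_text = records_to_csv(
  %w[name Postal_Code],
  [{ 'name' => 'Chuck, Bob', 'Postal_Code' => '12345' }]
)
puts csv_text
# name,Postal_Code
# "Chuck, Bob",12345
```

The field containing a comma is quoted automatically, which the hand-rolled %Q{...} approach in the question has to get right by itself.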
I was wondering if anyone can help me understand the Ruby code below? I'm pretty new to Ruby programming and am having trouble understanding what each function does.
When I run this with my twitter username and password as parameter, I get a stream of twitter feed samples. What do I need to do with this code to only display the hashtags?
I'm trying to gather the hashtags every 30 seconds, then sort from least to most occurrences of the hashtags.
Not looking for solutions, but for ideas. Thanks!
require 'eventmachine'
require 'em-http'
require 'json'
usage = "#{$0} <user> <password>"
abort usage unless user = ARGV.shift
abort usage unless password = ARGV.shift
url = 'https://stream.twitter.com/1/statuses/sample.json'
def handle_tweet(tweet)
return unless tweet['text']
puts "#{tweet['user']['screen_name']}: #{tweet['text']}"
end
EventMachine.run do
http = EventMachine::HttpRequest.new(url).get :head => { 'Authorization' => [ user, password ] }
buffer = ""
http.stream do |chunk|
buffer += chunk
while line = buffer.slice!(/.+\r?\n/)
handle_tweet JSON.parse(line)
end
end
end
puts "#{tweet['user']['screen_name']}: #{tweet['text']}"
That line shows you a user name followed by the content of the tweet.
Let's take a step back for a sec.
Hash tags appear inside the tweet's content--this means they're inside tweet['text']. A hash tag always takes the form of a # followed by a bunch of non-space characters. That's really easy to grab with a regex. Ruby's core API facilitates that via String#scan. Example:
"twitter is short #foo yawn #bar".scan(/\#\w+/) # => ["#foo", "#bar"]
What you want is something like this:
def handle_tweet(tweet)
return unless tweet['text']
# puts "#{tweet['user']['screen_name']}: #{tweet['text']}" # OLD
puts tweet['text'].scan(/\#\w+/).to_s
end
tweet['text'].scan(/#\w+/) is an array of strings. You can do whatever you want with that array. Supposing you're new to Ruby and want to print the hash tags to the console, here's a brief note about printing arrays with puts:
puts array # => "#foo\n#bar"
puts array.to_s # => '["#foo", "#bar"]'
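For the counting part of the question, one possible idea is to tally the scanned tags in a Hash and sort by count. This is only a sketch: count_hashtags is a hypothetical helper, and lowercasing the tags so that #Foo and #foo count together is an assumption about what you want.

```ruby
# Hypothetical helper: count hashtags across a batch of tweet texts
# and return [tag, count] pairs sorted from least to most frequent.
def count_hashtags(texts)
  counts = Hash.new(0)
  texts.each do |text|
    text.scan(/#\w+/).each { |tag| counts[tag.downcase] += 1 }
  end
  # Ties are broken alphabetically so the order is stable.
  counts.sort_by { |tag, count| [count, tag] }
end

count_hashtags(['short #foo yawn #bar', 'again #Foo'])
# => [["#bar", 1], ["#foo", 2]]
```

To report every 30 seconds, you could collect the texts in an array inside handle_tweet and print and clear the tally from a timer such as EventMachine.add_periodic_timer(30).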
#Load Libraries
require 'eventmachine'
require 'em-http'
require 'json'
# Looks like this section assumes you're calling this from commandline.
usage = "#{$0} <user> <password>" # $0 returns the name of the program
abort usage unless user = ARGV.shift # Return first argument passed when program called
abort usage unless password = ARGV.shift
# The URL
url = 'https://stream.twitter.com/1/statuses/sample.json'
# method which, when called later, prints out the tweets
def handle_tweet(tweet)
return unless tweet['text'] # Ensures tweet object has 'text' property
puts "#{tweet['user']['screen_name']}: #{tweet['text']}" # write the result
end
# Create an HTTP request obj to URL above with user authorization
EventMachine.run do
http = EventMachine::HttpRequest.new(url).get :head => { 'Authorization' => [ user, password ] }
# Initiate an empty string for the buffer
buffer = ""
# Read the stream by line
http.stream do |chunk|
buffer += chunk
while line = buffer.slice!(/.+\r?\n/) # cut each line at newline
handle_tweet JSON.parse(line) # send each tweet object to handle_tweet method
end
end
end
Here's a commented version of what the source is doing. If you just want the hashtag, you'll want to rewrite handle_tweet to something like this:
def handle_tweet(tweet)
  return unless tweet['text']
  tweet['text'].scan(/#\w+/) do |tag|
    puts tag
  end
end
I'm trying to find a robust method of joining partial url path segments together. Is there a quick way to do this?
I tried the following:
puts URI::join('resource/', '/edit', '12?option=test')
I expect:
resource/edit/12?option=test
But I get the error:
`merge': both URI are relative (URI::BadURIError)
I have used File.join() in the past for this but something does not seem right about using the file library for urls.
URI's API is not necessarily great.
URI::join will work only if the first one starts out as an absolute uri with protocol, and the later ones are relative in the right ways... except I try to do that and can't even get that to work.
This at least doesn't error, but why is it skipping the middle component?
URI::join('http://somewhere.com/resource', './edit', '12?option=test')
I think maybe URI just kind of sucks. It lacks significant api on instances, such as an instance #join or method to evaluate relative to a base uri, that you'd expect. It's just kinda crappy.
I think you're going to have to write it yourself. Or just use File.join and other File path methods, after testing all the edge cases you can think of to make sure it does what you want/expect.
edit 9 Dec 2016 I figured out the addressable gem does it very nicely.
base = Addressable::URI.parse("http://example.com")
base + "foo.html"
# => #<Addressable::URI:0x3ff9964aabe4 URI:http://example.com/foo.html>
base = Addressable::URI.parse("http://example.com/path/to/file.html")
base + "relative_file.xml"
# => #<Addressable::URI:0x3ff99648bc80 URI:http://example.com/path/to/relative_file.xml>
base = Addressable::URI.parse("https://example.com/path")
base + "//newhost/somewhere.jpg"
# => #<Addressable::URI:0x3ff9960c9ebc URI:https://newhost/somewhere.jpg>
base = Addressable::URI.parse("http://example.com/path/subpath/file.html")
base + "../up-one-level.html"
# => #<Addressable::URI:0x3fe13ec5e928 URI:http://example.com/path/up-one-level.html>
Have uri as a URI::Generic or a subclass thereof, then append to its path:
uri.path += '/123'
Enjoy!
06/25/2016 UPDATE for skeptical folk
require 'uri'
uri = URI('http://ioffe.net/boris')
uri.path += '/123'
p uri
Outputs
<URI::HTTP:0x2341a58 URL:http://ioffe.net/boris/123>
The problem is that resource/ is relative to the current directory, but /edit refers to the top level directory due to the leading slash. It's impossible to join the two directories without already knowing for certain that edit contains resource.
If you're looking for purely string operations, simply remove the leading or trailing slashes from all parts, then join them with / as the glue.
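A minimal sketch of that string-only approach (join_segments is a hypothetical name; it assumes you want a relative result with no leading or trailing slash, and it skips nil and empty parts):

```ruby
# Hypothetical helper: strip leading/trailing slashes from every part,
# drop parts that end up empty, then glue the rest with "/".
def join_segments(*parts)
  parts.compact
       .map { |part| part.to_s.gsub(%r{\A/+|/+\z}, '') }
       .reject(&:empty?)
       .join('/')
end

join_segments('resource/', '/edit', '12?option=test')
# => "resource/edit/12?option=test"
```

Because only the ends of each part are trimmed, slashes inside a part (for example 'a/b/c') are left alone.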
The way to do it using URI.join is:
URI.join('http://example.com', '/foo/', 'bar')
Pay attention to the trailing slashes. You can find the complete documentation here:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/uri/rdoc/URI.html#method-c-join
As you noticed, URI::join won't combine paths with repeated slashes, so it doesn't fit the bill.
Turns out it doesn't require a lot of Ruby code to achieve this:
module GluePath
def self.join(*paths, separator: '/')
paths = paths.compact.reject(&:empty?)
last = paths.length - 1
paths.each_with_index.map { |path, index|
_expand(path, index, last, separator)
}.join
end
def self._expand(path, current, last, separator)
if path.start_with?(separator) && current != 0
path = path[1..-1]
end
unless path.end_with?(separator) || current == last
path = [path, separator]
end
path
end
end
The algorithm takes care of consecutive slashes, preserves start and end slashes, and ignores nil and empty strings.
puts GluePath::join('resource/', '/edit', '12?option=test')
outputs
resource/edit/12?option=test
Use this code:
File.join('resource/', '/edit', '12?option=test').
gsub(File::SEPARATOR, '/').
sub(/^\//, '')
# => resource/edit/12?option=test
example with empty strings:
File.join('', '/edit', '12?option=test').
gsub(File::SEPARATOR, '/').
sub(/^\//, '')
# => edit/12?option=test
Or, if you are able to pass segments like resource/, edit/, and 12?option=test, use this. Here http: is only a placeholder to get a valid URI. This works for me.
URI.
join('http:', 'resource/', 'edit/', '12?option=test').
path.
sub(/^\//, '')
# => "resource/edit/12"
A not optimized solution. Note that it doesn't take query params into account. It only handles paths.
class URL
def self.join(*str)
str.map { |path|
new_path = path
# Check the first character
if path[0] == "/"
new_path = new_path[1..-1]
end
# Check the last character
if path[-1] != "/"
new_path += "/"
end
new_path
}.join
end
end
This question is nearly a decade old, yet it seems that there is no perfect solution posted.
A handful of posted answers fail to handle multiple //, e.g. stuff like path = path[1..-1] if path.start_with?('/')
Answers that simply call File.join(*paths) seem to be the accepted "Ruby way," yet they fail in cases where you pass a URI object, e.g. File.join(URI.join('some/path')) fails with TypeError: no implicit conversion of URI::Generic into String.
Below is what I ended up using:
module UrlHelper
def self.join(*paths)
# yes, Ruby's stdlib really does lack a functional join method for URLs
File.join(*paths.map(&:to_s))
end
end
You can use File.join('resource/', '/edit', '12?option=test')
I improved @Maximo Mussini's script to make it work gracefully:
SmartURI.join('http://example.com/subpath', 'hello', query: { token: 'secret' })
=> "http://example.com/subpath/hello?token=secret"
https://gist.github.com/zernel/0f10c71f5a9e044653c1a65c6c5ad697
require 'uri'
module SmartURI
SEPARATOR = '/'
def self.join(*paths, query: nil)
paths = paths.compact.reject(&:empty?)
last = paths.length - 1
url = paths.each_with_index.map { |path, index|
_expand(path, index, last)
}.join
if query.nil?
return url
elsif query.is_a? Hash
return url + "?#{URI.encode_www_form(query.to_a)}"
else
raise "Unexpected input type for query: #{query}, it should be a hash."
end
end
def self._expand(path, current, last)
if path.start_with?(SEPARATOR) && current != 0
path = path[1..-1]
end
unless path.end_with?(SEPARATOR) || current == last
path = [path, SEPARATOR]
end
path
end
end
You can use this:
URI.join('http://exemple.com', '/a/', 'b/', 'c/', 'd')
=> #<URI::HTTP http://exemple.com/a/b/c/d>
URI.join('http://exemple.com', '/a/', 'b/', 'c/', 'd').to_s
=> "http://exemple.com/a/b/c/d"
See: http://ruby-doc.org/stdlib-2.4.1/libdoc/uri/rdoc/URI.html#method-c-join-label-Synopsis
My understanding of URI::join is that it thinks like a web browser does.
To evaluate it, point your mental web browser to the first parameter, and keep clicking links until you browse to the last parameter.
For example, URI::join('http://example.com/resource/', '/edit', '12?option=test'), you would browse like this:
http://example.com/resource/, click a link to /edit (a file at the root of the site)
http://example.com/edit, click a link to 12?option=test (a file in the same directory as edit)
http://example.com/12?option=test
If the first link were /edit/ (with a trailing slash), or /edit/foo, then the next link would be relative to /edit/ rather than /.
This page possibly explains it better than I can: Why is URI.join so counterintuitive?
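The browser analogy can be checked directly with the standard library. The three calls below differ only in the slashes around edit; the URLs are just the example ones from above:

```ruby
require 'uri'

# "/edit" is a file at the site root, so "12" replaces it.
puts URI.join('http://example.com/resource/', '/edit', '12?option=test')
# http://example.com/12?option=test

# "/edit/" is a directory at the site root, so "12" lands inside it.
puts URI.join('http://example.com/resource/', '/edit/', '12?option=test')
# http://example.com/edit/12?option=test

# "edit/" is a directory relative to /resource/.
puts URI.join('http://example.com/resource/', 'edit/', '12?option=test')
# http://example.com/resource/edit/12?option=test
```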
This is my simple take on this problem: just split up all the path segments and join them together again. This only works if you're only working with relative path segments, but if that's all you want to do, it is handy.
def join_paths *paths
paths.map{|p| p.split('/')}
.flatten
.reject(&:empty?)
.compact
.join('/')
end
Then you can use it like so:
join_paths 'foo/', '/bar', 'a/b/c', 'd' #=> "foo/bar/a/b/c/d"
I'm currently using Mongrel to develop a custom web application project.
I would like Mongrel to use a defined HTTP handler based on a regular expression. For example, every time someone calls a URL like http://test/bla1.js or http://test/bla2.js, the same HTTP handler is called to manage the request.
My code so far looks like this:
http_server = Mongrel::Configurator.new :host => config.get("http_host") do
listener :port => config.get("http_port") do
uri Regexp.escape("/[a-z0-9]+.js"), :handler => BLAH::CustomHandler.new
uri '/ui/public', :handler => Mongrel::DirHandler.new("#{$d}/public/")
uri '/favicon', :handler => Mongrel::Error404Handler.new('')
trap("INT") { stop }
run
end
end
As you can see, I am trying to use a regex instead of a string here:
uri Regexp.escape("/[a-z0-9]+.js"), :handler => BLAH::CustomHandler.new
but that does not work. Any solution?
Thanks for that.
You should consider creating a Rack application instead. Rack is:
the standard for Ruby web applications
used internally by all popular Ruby web frameworks (Rails, Merb, Sinatra, Camping, Ramaze, ...)
much easier to extend
ready to be run on any application server (Mongrel, Webrick, Thin, Passenger, ...)
Rack has a URL mapping DSL, Rack::Builder, which allows you to map different Rack applications to particular URL prefixes. You typically save it as config.ru, and run it with rackup.
Unfortunately, it does not allow regular expressions either. But because of the simplicity of Rack, it is really easy to write an "application" (a lambda, actually) that will call the proper app if the URL matches a certain regex.
Based on your example, your config.ru may look something like this:
require "my_custom_rack_app" # Whatever provides your MyCustomRackApp.
js_handler = MyCustomRackApp.new
default_handlers = Rack::Builder.new do
map "/public" do
run Rack::Directory.new("my_dir/public")
end
# Uncomment this to replace Rack::Builder's 404 handler with your own:
# map "/" do
# run lambda { |env|
# [404, {"Content-Type" => "text/plain"}, ["My 404 response"]]
# }
# end
end
run lambda { |env|
if env["PATH_INFO"] =~ %r{/[a-z0-9]+\.js}
js_handler.call(env)
else
default_handlers.call(env)
end
}
Next, run your Rack app on the command line:
% rackup
If you have mongrel installed, it will be started on port 9292. Done!
You have to inject new code into part of Mongrel's URIClassifier, which is otherwise blissfully unaware of regular expression URIs.
Below is one way of doing just that:
#
# Must do the following BEFORE Mongrel::Configurator.new
# Augment some of the key methods in Mongrel::URIClassifier
# See lib/ruby/gems/XXX/gems/mongrel-1.1.5/lib/mongrel/uri_classifier.rb
#
Mongrel::URIClassifier.class_eval <<-EOS, __FILE__, __LINE__
# Save original methods
alias_method :register_without_regexp, :register
alias_method :unregister_without_regexp, :unregister
alias_method :resolve_without_regexp, :resolve
def register(uri, handler)
if uri.is_a?(Regexp)
unless (@regexp_handlers ||= []).any? { |(re,h)| re==uri ? h.concat(handler) : false }
@regexp_handlers << [ uri, handler ]
end
else
# Original behaviour
register_without_regexp(uri, handler)
end
end
def unregister(uri)
if uri.is_a?(Regexp)
raise Mongrel::URIClassifier::RegistrationError, "\#{uri.inspect} was not registered" unless (@regexp_handlers ||= []).reject! { |(re,h)| re==uri }
else
# Original behaviour
unregister_without_regexp(uri)
end
end
def resolve(request_uri)
# Try original behaviour FIRST
result = resolve_without_regexp(request_uri)
# If a match is not found with non-regexp URIs, try regexp
if result[0].blank?
(@regexp_handlers ||= []).any? { |(re,h)| (m = re.match(request_uri)) ? (result = [ m.pre_match + m.to_s, (m.to_s == Mongrel::Const::SLASH ? request_uri : m.post_match), h ]) : false }
end
result
end
EOS
http_server = Mongrel::Configurator.new :host => config.get("http_host") do
listener :port => config.get("http_port") do
# Can pass a regular expression as URI
# (URI must be of type Regexp, no escaping please!)
# Regular expression can match any part of an URL, start with "^/..." to
# anchor match at URI beginning.
# The way this is implemented, regexp matches are only evaluated AFTER
# all non-regexp matches have failed (mostly for performance reasons.)
# Also, for regexp URIs, the :in_front is ignored; adding multiple handlers
# to the same URI regexp behaves as if :in_front => false
uri %r{^/[a-z0-9]+\.js}, :handler => BLAH::CustomHandler.new
uri '/ui/public', :handler => Mongrel::DirHandler.new("#{$d}/public/")
uri '/favicon', :handler => Mongrel::Error404Handler.new('')
trap("INT") { stop }
run
end
end
Seems to work just fine with Mongrel 1.1.5.
This is a newbie question as I am attempting to learn Ruby by myself, so apologies if it sounds like a silly question!
I am reading through the examples of why's (poignant) guide to ruby and am in chapter 4. I typed the code_words Hash into a file called wordlist.rb
I opened another file and typed the first line as require 'wordlist.rb' and the rest of the code as below
#Get evil idea and swap in code
print "Enter your ideas "
idea = gets
code_words.each do |real, code|
idea.gsub!(real, code)
end
#Save the gibberish to a new file
print "File encoded, please enter a name to save the file"
ideas_name = gets.strip
File::open( 'idea-' + ideas_name + '.txt', 'w' ) do |f|
f << idea
end
When I execute this code, it fails with the following error message:
C:/MyCode/MyRubyCode/filecoder.rb:5: undefined local variable or method `code_words' for main:Object (NameError)
I use Windows XP and Ruby version ruby 1.8.6
I know I should be setting something like a ClassPath, but not sure where/how to do so!
Many thanks in advance!
While the top level of every file is executed in the same context, each file has its own script context for local variables. In other words, each file has its own set of local variables that can be accessed throughout that file, but not in other files.
On the other hand, constants (CodeWords), globals ($code_words) and methods (def code_words) would be accessible across files.
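The scoping rule described above can be checked with a small self-contained experiment. The sketch writes a throwaway file into a temporary directory and requires it; the file name and the names demo_words and DEMO_WORDS are made up for the demo, and the heredoc syntax assumes Ruby 2.3 or later.

```ruby
require 'tmpdir'

# Write a throwaway file that defines a local variable and a constant,
# then require it.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'wordlist_demo.rb')
  File.write(path, <<~RUBY)
    demo_words = { 'catapult' => 'chucky go-go' }
    DEMO_WORDS = { 'catapult' => 'chucky go-go' }
  RUBY
  require path
end

# The local variable stayed inside the required file...
puts defined?(demo_words).inspect # nil
# ...but the constant is visible here.
puts DEMO_WORDS['catapult']       # chucky go-go
```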
Some solutions:
CodeWords = {:real => "code"}
$code_words = {:real => "code"}
def code_words
{:real => "code"}
end
An OO solution that is definitely too complex for this case:
# first file
class CodeWords
DEFAULT = {:real => "code"}
attr_reader :words
def initialize(words = nil)
@words = words || DEFAULT
end
end
# second file
print "Enter your ideas "
idea = gets
code_words = CodeWords.new
code_words.words.each do |real, code|
idea.gsub!(real, code)
end
#Save the gibberish to a new file
print "File encoded, please enter a name to save the file"
ideas_name = gets.strip
File::open( 'idea-' + ideas_name + '.txt', 'w' ) do |f|
f << idea
end
I think the problem might be that the require executes the code in another context, so the runtime variable is no longer available after the require.
What you could try is making it a constant:
CodeWords = { :real => 'code' }
That will be available everywhere.
Here is some background on variable scopes etc.
I was just looking at the same example and was having the same problem.
What I did was change the variable name in both files from code_words to $code_words.
This would make it a global variable and thus accessible by both files, right?
My question is: wouldn't this be a simpler solution than making it a constant and having to write CodeWords = { :real => 'code' }, or is there a reason not to do it?
A simpler way would be to use the Marshal.dump feature to save the code words.
# Save to File
code_words = {
'starmonkeys' => 'Phil and Pete, those prickly chancellors of the New Reich',
'catapult' => 'chucky go-go', 'firebomb' => 'Heat-Assisted Living',
'Nigeria' => "Ny and Jerry's Dry Cleaning (with Donuts)",
'Put the kabosh on' => 'Put the cable box on'
}
# Serialize
# Marshal writes binary data, so open the file in binary mode
File.open('codewords', 'wb') do |f|
  Marshal.dump(code_words, f)
end
Now at the beginning of your file you would put this:
# Load the Serialized Data
code_words = File.open('codewords', 'rb') { |f| Marshal.load(f) }
Here's the easy way to make sure you can always include a file that's in the same directory as your app, put this before the require statement
$:.unshift File.dirname(__FILE__)
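$: is the global variable holding Ruby's load path (also available as $LOAD_PATH), the list of directories that require searches. It is roughly Ruby's counterpart to a "CLASSPATH".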
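A slightly more defensive variant of the same idea, as a sketch: it uses the longer $LOAD_PATH name, expands the directory to an absolute path, and avoids adding it twice. The variable name here is only illustrative.

```ruby
# $: and $LOAD_PATH are two names for the same Array of directories.
puts $:.equal?($LOAD_PATH) # true

# Add this file's directory to the front of the load path, once.
here = File.expand_path(File.dirname(__FILE__))
$LOAD_PATH.unshift(here) unless $LOAD_PATH.include?(here)
```

On Ruby 1.9.2 and later, require_relative 'wordlist' loads a file relative to the current one without touching the load path at all; it is not available on Ruby 1.8.6.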