Ruby regular expression to extract key values - ruby

I have a string like one of the following:
case1:
str = "type=\"text/xsl\" href=\"http://skdjf.sdjhshf/CDA0000=.xsl\""
case2:
str = "href=\"http://skdjf.sdjhshf/CDA0000=.xsl\" type=\"text/xsl\""
I need to extract the values like
type -> text/xsl
href -> http://skdjf.sdjhshf/CDA0000=.xsl
Here is my regular expression that fails.
str.match(/type="(.*)"/)[1]
#this works in second case
=>"text/xsl"
str.match(/href="(.*)"/)[1]
#this works in first case
=>"http://skdjf.sdjhshf/CDA0000=.xsl"
In failure cases the whole string is matched.
Any idea?

I agree with John Watts' comment. Use something like Nokogiri to parse XML; it is a breeze. If you still want to stick with regex parsing, you could do something like:
str.split(' ').map{ |part| part.match( /(.+)="(.+)"/ )[1..2] }
and you will get results as below:
> str = "type=\"text/xsl\" href=\"http://skdjf.sdjhshf/CDA0000=.xsl\""
=> "type=\"text/xsl\" href=\"http://skdjf.sdjhshf/CDA0000=.xsl\""
> str2 = "href=\"http://skdjf.sdjhshf/CDA0000=.xsl\" type=\"text/xsl\""
=> "href=\"http://skdjf.sdjhshf/CDA0000=.xsl\" type=\"text/xsl\""
> str.split(' ').map{ |part| part.match( /(.+)="(.+)"/ )[1..2] }
=> [["type", "text/xsl"], ["href", "http://skdjf.sdjhshf/CDA0000=.xsl"]]
> str2.split(' ').map{ |part| part.match( /(.+)="(.+)"/ )[1..2] }
=> [["href", "http://skdjf.sdjhshf/CDA0000=.xsl"], ["type", "text/xsl"]]
You can then put these pairs in a hash or wherever you want to have them.
With Nokogiri you can get hold of a node and then do something like node['href'] in your case. Probably much easier.
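The patterns in the question fail because .* is greedy: it runs on to the last double quote in the string, so it only gives the right answer when the attribute happens to come last. A minimal regex-only sketch, assuming the values never contain a double quote, is to match [^"]* instead and collect every pair with scan. The method name extract_attributes is made up for illustration:

```ruby
# Hypothetical helper: collect every key="value" pair into a hash.
# Assumes attribute values contain no double quotes.
def extract_attributes(str)
  str.scan(/(\w+)="([^"]*)"/).to_h
end

str  = "type=\"text/xsl\" href=\"http://skdjf.sdjhshf/CDA0000=.xsl\""
str2 = "href=\"http://skdjf.sdjhshf/CDA0000=.xsl\" type=\"text/xsl\""

attrs = extract_attributes(str)
attrs['type'] # => "text/xsl"
attrs['href'] # => "http://skdjf.sdjhshf/CDA0000=.xsl"

# Attribute order no longer matters.
extract_attributes(str2) == attrs # => true
```

Because the whole string is scanned at once, there is no need to split on spaces first, which also keeps values that contain spaces intact.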

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time by defining the filename, and specifically output them into a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
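One possible way to deal with it, sketched under the assumption that each XML file should produce a CSV with the same base name in the output folder, is to loop over the array and derive each output path with File.basename. The helper name csv_path_for and the file name employees.xml are hypothetical:

```ruby
# Hypothetical helper: map an input XML path to its CSV path in output_folder.
def csv_path_for(xml_path, output_folder)
  File.join(output_folder, File.basename(xml_path, '.xml') + '.csv')
end

input_folder  = 'H:/input'
output_folder = 'H:/output'

# Dir[] returns an array, so handle one filename at a time.
Dir[File.join(input_folder, '*.xml')].sort_by { |f| File.mtime(f) }.each do |xml_path|
  xml = File.read(xml_path)
  out = csv_path_for(xml_path, output_folder)
  # ... parse xml and write the rows to out here ...
  puts "#{xml_path} -> #{out} (#{xml.bytesize} bytes read)"
end

csv_path_for('H:/input/employees.xml', 'H:/output') # => "H:/output/employees.csv"
```

File.join also takes care of the trailing-slash bookkeeping that the original code does by hand.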
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
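For comparison, the same check is often spelled with String#empty?, which states the intent directly. A tiny sketch, with variable names that are only illustrative:

```ruby
value = 'foo'
has_content = !value.empty?   # => true

value = ''
no_content = value.empty?     # => true

# Whitespace-only strings count as content unless they are stripped first.
blank = '   '
blank.empty?                  # => false
blank.strip.empty?            # => true
```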
Be careful with sub and gsub, as they may not do what you think they do. Both expect a regular expression, but will also take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
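If the goal is only to drop a leading prefix, two other options are an anchored pattern or String#delete_prefix (available since Ruby 2.5). A short sketch; unlike the indexed assignment above, neither call modifies the original string:

```ruby
foo = 'wd:barwd:'

# \A anchors the pattern to the start of the string, so only a leading "wd:" is removed.
anchored = foo.sub(/\Awd:/, '')      # => "barwd:"

# delete_prefix removes the prefix only when the string starts with it.
stripped = foo.delete_prefix('wd:')  # => "barwd:"
untouched = 'bar'.delete_prefix('wd:') # => "bar"

foo # => "wd:barwd:" (unchanged)
```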
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
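To see what the CSV class buys you over joining strings by hand, here is a small sketch that builds the CSV text in memory with CSV.generate. The records array and the records_to_csv name are made up for illustration; the point is that quoting and missing values are handled for you:

```ruby
require 'csv'

# Hypothetical helper: turn an array of hashes into CSV text.
# The header row is the union of all keys, in first-seen order.
def records_to_csv(records)
  headers = records.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << headers
    records.each { |record| csv << headers.map { |key| record[key] } }
  end
end

records = [
  { 'name' => 'Bob "Chuck"', 'Age' => '45' }, # embedded quotes get escaped
  { 'name' => 'Al' }                          # missing Age becomes an empty field
]

puts records_to_csv(records)
# name,Age
# "Bob ""Chuck""",45
# Al,
```

The hand-rolled %Q version in the question would emit the embedded quotes unescaped, which most CSV readers reject.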

Extract url params in ruby

I would like to extract parameters from a URL. I have the following path pattern:
pattern = "/foo/:foo_id/bar/:bar_id"
And an example URL:
url = "/foo/1/bar/2"
I would like to get {foo_id: 1, bar_id: 2}. I tried to convert the pattern into something like this:
"\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)"
I failed at the first step, when I tried to escape the slashes in the pattern:
formatted = pattern.gsub("/", "\/")
Do you know how to fix this gsub? Or maybe you know a better way to do this.
EDIT:
It is plain Ruby. I am not using RoR.
As I said above, you only need to escape slashes in a Regexp literal, e.g. /foo\/bar/. When defining a Regexp from a string it's not necessary: Regexp.new("foo/bar") produces the same Regexp as /foo\/bar/.
As to your larger problem, here's how I'd solve it, which I'm guessing is pretty much how you'd been planning to solve it:
PATTERN_PART_MATCH = /:(\w+)/
PATTERN_PART_REPLACE = '(?<\1>.+?)'
def pattern_to_regexp(pattern)
expr = Regexp.escape(pattern) # just in case
.gsub(PATTERN_PART_MATCH, PATTERN_PART_REPLACE)
Regexp.new(expr)
end
pattern = "/foo/:foo_id/bar/:bar_id"
expr = pattern_to_regexp(pattern)
# => /\/foo\/(?<foo_id>.+?)\/bar\/(?<bar_id>.+?)/
str = "/foo/1/bar/2"
expr.match(str)
# => #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
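One caveat with an unanchored, lazy .+? at the end of the pattern is that it stops after a single character, so "/foo/1/bar/22" would give bar_id "2". A hedged variant of the same idea, assuming parameter values never contain a slash, anchors the expression and matches [^/]+ instead. The method name extract_params is hypothetical:

```ruby
PARAM_MATCH   = /:(\w+)/
PARAM_REPLACE = '(?<\1>[^/]+)' # one path segment per parameter (assumption)

def pattern_to_regexp(pattern)
  body = Regexp.escape(pattern).gsub(PARAM_MATCH, PARAM_REPLACE)
  Regexp.new("\\A#{body}\\z") # anchor at both ends of the string
end

# Hypothetical helper: returns a hash of params, or nil when the URL does not match.
def extract_params(pattern, url)
  match = pattern_to_regexp(pattern).match(url)
  return nil unless match

  match.named_captures.transform_keys(&:to_sym)
end

pattern = "/foo/:foo_id/bar/:bar_id"
extract_params(pattern, "/foo/1/bar/22")  # => {:foo_id=>"1", :bar_id=>"22"}
extract_params(pattern, "/foo/1/baz/22")  # => nil
```

The values come back as strings; call to_i on them if integers are needed, as in the question.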
Try this:
regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
matches = "/foo/1/bar/2".match(regex)
Hash[matches.names.zip(matches[1..-1])]
IRB output:
2.3.1 :032 > regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
=> /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
2.3.1 :033 > matches = "/foo/1/bar/2".match(regex)
=> #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
2.3.1 :034 > Hash[matches.names.zip(matches[1..-1])]
=> {"foo_id"=>"1", "bar_id"=>"2"}
I'd advise reading this article on how Rack parses query params. The above works for the example you gave, but does not extend to other params.
http://codefol.io/posts/How-Does-Rack-Parse-Query-Params-With-parse-nested-query
This might help you; the foo id and bar id will be dynamic.
require 'json'
# url to scan
url = "/foo/1/bar/2"
# scan the ids from the url (\d+ so that multi-digit ids stay intact)
id = url.scan(/\d+/)
# use gsub to replace the whole url with a hash literal built from the ids
url_with_id = url.gsub(url, "{foo_id: #{id[0]}, bar_id: #{id[1]}}")
# output
=> "{foo_id: 1, bar_id: 2}"
If you want to change the string into a hash:
url_hash = eval(url_with_id)
=> {:foo_id=>1, :bar_id=>2}
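The same hash can be built without going through a string and eval. A minimal sketch, assuming the parameter names are known in advance and appear in the same order as the numbers in the URL:

```ruby
url  = "/foo/1/bar/2"
keys = %i[foo_id bar_id]            # assumed to be known in advance
ids  = url.scan(/\d+/).map(&:to_i)  # => [1, 2]

url_hash = keys.zip(ids).to_h       # => {:foo_id=>1, :bar_id=>2}
```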

Extracting a part of a string in Ruby

I know this is an easy question, but I want to extract one part of a string in Rails.
I would do this as in Java, by knowing the beginning and end characters of the string and extracting what lies between them, but I want to do it the Ruby way, which is why I need your help.
My string is:
STACK OVER AND FLOW
And I want the numerical values between quotation marks => 99999 and the value of the link => STACK OVER AND FLOW
How should I parse this string in ruby ?
Thanks.
If you need to parse html:
> require 'nokogiri'
> str = %q[STACK OVER AND FLOW]
> doc = Nokogiri.parse(str)
> link = doc.at('a')
> link.text
=> "STACK OVER AND FLOW"
> link['href'][/(\d+)/, 1]
=> "99999"
http://nokogiri.org/
This should work if you have only one link in the string:
str = %{STACK OVER AND FLOW }
num = str.match(/href=".*?'(\d*)'.*?/)[1].to_i
name = str.match(/>(.*?)</)[1].strip
Way to get both at a time:
str = "STACK OVER AND FLOW "
num, name = str.scan(/launchRemote\('(\d+)'[^>]+>\s*(.*?)\s*</).first
# => ["99999", "STACK OVER AND FLOW"]

How to replace every occurrence of a pattern in a string using Ruby?

I have an XML file which is too big. To make it smaller, I want to replace all tags and attribute names with shorter versions of the same thing.
So, I implemented this:
string.gsub!(/<(\w+) /) do |match|
case match
when 'Image' then 'Img'
when 'Text' then 'Txt'
end
end
puts string
which deletes all opening tags but does not do much else.
What am I doing wrong here?
Here's another way:
class String
def minimize_tags!
{"image" => "img", "text" => "txt"}.each do |from,to|
gsub!(/<#{from}\b/i,"<#{to}")
gsub!(/<\/#{from}>/i,"<\/#{to}>")
end
self
end
end
This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way. I did a quick speed test of these two methods using the HTML source of this stackoverflow page itself as the test string, and my way was about 6x faster...
Here's the beauty of using a parser such as Nokogiri:
This lets you manipulate selected tags (nodes) and their attributes:
require 'nokogiri'
xml = <<EOT
<xml>
<Image ImagePath="path/to/image">image comment</Image>
<Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT
doc = Nokogiri::XML(xml)
doc.search('Image').each do |n|
n.name = 'img'
n.attributes['ImagePath'].name = 'path'
end
doc.search('Text').each do |n|
n.name = 'txt'
n.attributes['TextFont'].name = 'font'
n.attributes['TextSize'].name = 'size'
end
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <img path="path/to/image">image comment</img>
# >> <txt font="courier" size="9">this is the text</txt>
# >> </xml>
If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.
The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.
Try this:
string.gsub!(/(<\/?)(\w+)/) do |match|
tag_mark = $1
case $2
when /^image$/i
"#{tag_mark}Img"
when /^text$/i
"#{tag_mark}Txt"
else
match
end
end
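For what it's worth, the code in the question deletes the tags because the block receives the entire match (for example "<Image "), not the captured group, so no when branch fires, the case returns nil, and gsub substitutes an empty string. A compact sketch of the same fix with a lookup table; the names TAG_NAMES and shorten_tags are hypothetical:

```ruby
TAG_NAMES = { 'Image' => 'Img', 'Text' => 'Txt' }.freeze

# Hypothetical helper: shorten known tag names in opening and closing tags.
def shorten_tags(xml)
  xml.gsub(%r{(</?)(\w+)}) do
    # $1 is "<" or "</", $2 is the tag name; unknown names are kept as they are.
    "#{$1}#{TAG_NAMES.fetch($2, $2)}"
  end
end

shorten_tags('<Image ImagePath="a">x</Image><Other/>')
# => "<Img ImagePath=\"a\">x</Img><Other/>"
```

This variant is case-sensitive and leaves attribute names alone; it only illustrates how the block's return value is used.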

Getting portion of href attribute using hpricot

I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and returns the text following that until the next forward slash '/'.
So, given:
One
Two
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need a regex, but you can use one. Here are two examples, one with a regex and the other without, using Nokogiri, which should be compatible with Hpricot for your use, and which uses CSS accessors:
require 'nokogiri'
html = %q[
One
Two
]
doc = Nokogiri::HTML(html)
doc.css('a[@href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[@href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = 'One'
s =~ /abc\/([^\/]*)/
return $1
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").each do |a|
return a.somemethodtogettheattribute("href").split("/")[2] # 2, because the string starts with '/'
end
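Once the href value is in hand, whichever parser produced it, the extraction itself is plain string work. A small sketch that accepts the prefix with or without a leading slash; the helper name id_after_abc and the sample href values are assumptions, since the original markup is not shown here:

```ruby
# Hypothetical helper: return the path segment that follows "abc/", or nil.
def id_after_abc(href)
  href[%r{\A/?abc/([^/]+)}, 1]
end

id_after_abc('/abc/12345/one') # => "12345"
id_after_abc('abc/67890/two')  # => "67890"
id_after_abc('/xyz/12345/one') # => nil
```

Returning nil for non-matching links makes it easy to filter a list with map and compact.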
