Ruby - Reading and editing XML file

I am writing a Ruby (1.9.3) script that reads XML files from a folder and edits them if necessary.
My issue is that I was given XML files converted by Tidy, but its output is a little strange, for example:
<?xml version="1.0" encoding="utf-8"?>
<XML>
<item>
<ID>000001</ID>
<YEAR>2013</YEAR>
<SUPPLIER>Supplier name test,
Coproration</SUPPLIER>
...
As you can see, the <SUPPLIER> value has an extra CRLF. I don't know why Tidy behaves this way, but I am addressing it with a Ruby script. I am having trouble, though, because I need to check whether the last character of a line is ">" or the first is "<" so that I can tell if something is wrong with the markup.
I have tried:
Dir.glob("C:/testing/corrected/*.xml").each do |file|
puts file
File.open(file, 'r+').each_with_index do |line, index|
first_char = line[0,1]
if first_char != "<"
# copy this line to the previous line and delete this one?
end
end
end
I also feel like I should copy the original file's content to a temporary file as I read it and then overwrite the original. Is that the best way? Any tips are welcome, as I do not have much experience in altering a file's content.
Regards

Does that extra \n always appear in the <SUPPLIER> node? As others have suggested, Nokogiri is a great choice for parsing XML (or HTML). You could iterate through each <SUPPLIER> node and remove the \n character, then save the XML as a new file.
require 'nokogiri'
# read and parse the old file
file = File.read("old.xml")
xml = Nokogiri::XML(file)
# replace \n and any additional whitespace with a space
xml.xpath("//SUPPLIER").each do |node|
node.content = node.content.gsub(/\n\s*/, " ")
end
# save the output into a new file
File.open("new.xml", "w") do |f|
f.write xml.to_xml
end
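If Nokogiri is not an option, the line-based idea from the question can also work. The following is only a minimal sketch using the standard library: join_broken_lines is a hypothetical helper name, the sample string is made up, and it assumes that every intact line in the Tidy output starts with "<", so any other line is a continuation of the previous one.

```ruby
# Hypothetical helper: merge every line that does not start with "<"
# into the line before it, separated by a single space.
def join_broken_lines(text)
  merged = []
  text.each_line do |line|
    line = line.chomp # drops a trailing "\n" or "\r\n"
    if merged.any? && !line.lstrip.start_with?("<")
      merged[-1] = "#{merged[-1]} #{line.strip}"
    else
      merged << line
    end
  end
  merged.join("\n") + "\n"
end

# Made-up sample mirroring the broken <SUPPLIER> value in the question.
broken = "<item>\n<SUPPLIER>Supplier name test,\nCorporation</SUPPLIER>\n</item>\n"
fixed = join_broken_lines(broken)
puts fixed
```

Reading the whole file with File.read, fixing the string, and writing it back with File.write avoids editing a file while iterating over it. Writing to a temporary file first and renaming it afterwards is the safer variant for large or important files.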

Related

Append new lines to a csv from json.parse

I'm more of a sysadmin (Chef) than a Ruby guy, so this may be a five-minute fix.
I am working on a task where I write a Ruby script that pulls JSON data from multiple files, parses it, and writes the desired fields to a single .csv file. Basically, it pulls metadata about AWS accounts and puts it in an accountant-friendly format.
I got a lot of help from another Stack Overflow question on how to solve the problem for a single file (json.parse help).
My issue is that I am trying to pull the same data from multiple JSON files listed in an array. I can get it to loop through each file with the code below.
require 'csv'
require "json"
delim_file = CSV.open("delimited_test.csv", "w")
aws_account_list = %w(example example2)
aws_account_list.each do |account|
json_file = File.read(account.to_s + "_aws.json")
parsed_json = JSON.parse(json_file)
delim_file = CSV.open("delimited_test.csv", "w")
# This next line could be a problem if you ran this code multiple times
delim_file << ["EbsOptimized", "PrivateDnsName", "KeyName", "AvailabilityZone", "OwnerId"]
parsed_json['Reservations'].each do |inner_json|
inner_json['Instances'].each do |instance_json|
delim_file << [[instance_json['EbsOptimized'].to_s, instance_json['PrivateDnsName'], instance_json['KeyName'], instance_json['Placement']['AvailabilityZone'], inner_json['OwnerId']],[]]
end
delim_file.close
end
end
However, whenever I run it, it overwrites the same single row in the .csv file every time. I have tried adding a \n string to the end of the array, and converting the array to a string with hashes and appending a \n, but all that does is add a line to the same row that it overwrites.
How would I go about reading each JSON file and then appending each file's metadata to a new row? This looks like a simple case of writing the right loop, but I can't figure it out.
You declared your file like this:
delim_file = CSV.open("delimited_test.csv", "w")
To fix your issue, all you have to do is change "w" to "a":
delim_file = CSV.open("delimited_test.csv", "a")
See the docs for IO.new for a description of the available file modes. In short, w creates an empty file at the given filename, overwriting any existing file, and writes to that. a only creates the file if it doesn't exist, and appends otherwise. Because you currently have it at w, the file is truncated every time CSV.open runs, which in your loop is once per account. With a, it'll append to what's already there.
You need to open the file in append mode. Use:
delim_file = CSV.open("delimited_test.csv", "a")
'a' Write-only, starts at end of file if file exists, otherwise creates a new file for writing.
'a+' Read-write, starts at end of file if file exists, otherwise creates a new file for reading and writing.
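One thing to keep in mind with "a" is that the header row is appended again on every run, as the comment in the question's code already suspects. A possible alternative, shown here only as a sketch, is to open the CSV once with "w" outside the loop and write every account's rows through the same handle. The instance_rows helper, the HEADER constant, and the sample JSON written to a temporary directory are all made up for illustration; the field names come from the question.

```ruby
require "csv"
require "json"
require "tmpdir"

HEADER = %w[EbsOptimized PrivateDnsName KeyName AvailabilityZone OwnerId]

# Hypothetical helper: turn one parsed JSON document into an array of CSV rows.
def instance_rows(parsed_json)
  parsed_json["Reservations"].flat_map do |reservation|
    reservation["Instances"].map do |instance|
      [instance["EbsOptimized"].to_s, instance["PrivateDnsName"], instance["KeyName"],
       instance["Placement"]["AvailabilityZone"], reservation["OwnerId"]]
    end
  end
end

accounts = %w[example example2]

csv_text = Dir.mktmpdir do |dir|
  # Made-up sample data standing in for the real *_aws.json files.
  accounts.each_with_index do |account, i|
    doc = { "Reservations" => [{ "OwnerId" => "owner#{i}", "Instances" => [
      { "EbsOptimized" => false, "PrivateDnsName" => "ip-#{i}", "KeyName" => "key#{i}",
        "Placement" => { "AvailabilityZone" => "us-east-1a" } }
    ] }] }
    File.write(File.join(dir, "#{account}_aws.json"), JSON.generate(doc))
  end

  csv_path = File.join(dir, "delimited_test.csv")
  # Open the CSV once; the block form closes it after all accounts are written.
  CSV.open(csv_path, "w") do |csv|
    csv << HEADER
    accounts.each do |account|
      parsed = JSON.parse(File.read(File.join(dir, "#{account}_aws.json")))
      instance_rows(parsed).each { |row| csv << row }
    end
  end
  File.read(csv_path)
end

puts csv_text
```

With this layout each run produces one header followed by one row per instance, and rerunning the script replaces the file instead of growing it.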

Zip::ZipFile: How to modify contents of inner textfiles without unpacking zip?

Cheers,
As a beginner to Ruby, I am currently solving my smaller-world problems with Ruby to get accustomed to it. Right now I am trying to modify the contents of a text file within a zip container.
The structure is:
ZIP
>> diretory/
>> mytext.text
And I am able to iterate over the contents
Zip::ZipFile.open(file_path) do |zipfile|
files = zipfile.select(&:file?)
files.each do |zip_entry|
## ....?
end
end
...but I find it very difficult to modify the text file without unpacking it.
Any help appreciated!
So with the help of Ben, here's one solution:
require "rubygems"
require "zip/zip"
zip_file_name = "src/test.zip"
Zip::ZipFile.open(zip_file_name) do |zipfile|
files = zipfile.select(&:file?)
files.each do |zip_entry|
contents = zipfile.read(zip_entry.name)
zipfile.get_output_stream(zip_entry.name){ |f| f.puts contents + ' added some text' }
end
zipfile.commit
end
I thought I had tried this before. Anyway, thanks a lot!
This snippet adds " added some text" to the end of myFile.txt.
Zip::File.open(file_path) do |zipfile|
contents = zipfile.read('myFile.txt')
zipfile.get_output_stream('myFile.txt') { |f| f.puts contents + ' added some text' }
end
For some reason, the modifications to the zip file aren't saved if the writing (the call to get_output_stream) is done while using each to iterate over the archive's files.
Edit: To modify files while iterating over them via each, open the archive with Zip::ZipFile.open (see Chris's answer for an example).
Hopefully, this snippet will help point you in the right direction.

Invalid characters before my XML in Ruby

When I look at an XML file, it looks fine, and starts with <?xml version="1.0" encoding="utf-16le" standalone="yes"?>
But when I read it in Ruby and print it to stdout, there are two ?s in front of that: ??<?xml version="1.0" encoding="utf-16le" standalone="yes"?>
Where do these come from, and how do I remove them? Parsing the file like this with REXML fails immediately. Removing the first two characters and then parsing it gives me this error:
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16le" s>
What is the right way to handle this?
Edit: Below is my code. The ftp.get downloads the xml from an ftp server. (I wonder if that might be relevant.)
xml = ftp.get
puts xml
until xml[0,1] == "<" # to remove the 2 invalid characters
puts xml[0,2]
xml.slice! 0
end
puts xml
document = REXML::Document.new(xml)
The last puts prints the correct xml. But because of the two invalid characters, I've got the feeling something else went wrong. It shouldn't be necessary to remove anything. I'm at a loss what the problem might be, though.
Edit 2: I'm using Net::FTP to download the XML, but with this new method that lets me read the contents into a string instead of a file:
class Net::FTP
def gettextcontent(remotefile, &block) # :yield: line
f = StringIO.new()
begin
retrlines("RETR " + remotefile) do |line|
f.puts(line)
yield(line) if block
end
ensure
f.close
return f
end
end
end
Edit 3: It seems to be caused by StringIO (in Ruby 1.8.7) not supporting unicode. I'm not sure if there's a workaround for that.
Those two characters are most likely a Unicode BOM (byte order mark): bytes that tell whoever is reading the file what the byte order is.
As long as you know what the encoding of the file is, it should be safe to strip them, since they aren't actual content.
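As a rough illustration of stripping the BOM, here is a sketch that decodes UTF-16LE bytes and drops a leading U+FEFF. It relies on the String encoding API from Ruby 1.9 or later, so it does not apply as-is to Ruby 1.8.7; strip_utf16le_bom is a hypothetical helper name and the sample bytes are made up.

```ruby
# Hypothetical helper: decode raw UTF-16LE bytes to UTF-8 and
# remove a leading byte order mark (U+FEFF) if one is present.
def strip_utf16le_bom(raw_bytes)
  text = raw_bytes.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  text.sub(/\A\uFEFF/, "")
end

# Made-up sample: the two BOM bytes FF FE followed by UTF-16LE encoded XML.
xml_bytes = "\xFF\xFE".b + '<?xml version="1.0"?><a/>'.encode(Encoding::UTF_16LE).b
clean = strip_utf16le_bom(xml_bytes)
puts clean
```

The two "?" characters in the question are what the bytes FF FE look like when printed without being interpreted as UTF-16LE.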
To answer my own question: the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it. REXML also has trouble handling Unicode in Ruby 1.8.7.
The most attractive solution would of course be to upgrade to 1.9.3, but that's not practical for this project right now.
So what I ended up doing was to avoid StringIO and simply download to a file on disk, and then process the XML with Nokogiri instead of REXML.
Together, that solves all my problems.

Remove strings beginning with 'AUTO_INCREMENT=' in 2 files

I am trying to create a ruby script that loads 2 .sql files and removes all strings that begin with 'AUTO_INCREMENT='
There are multiple occurrences of this in my .sql files and all I want is them to be removed from both files.
Thanks for any help or input as I am new to ruby and decided to give it a try.
Given the right regexp (the one below might not be the most correct given the syntax), and the answer given there to a similar question, it is rather straightforward to put a script together:
file_names = ['file1.sql', 'file2.sql']
file_names.each do |file_name|
text = File.read(file_name)
File.open(file_name, 'wb') do |file|
file.write(text.gsub(/\s*AUTO_INCREMENT\s*(\=\s*[0-9]+)?/, ""))
end
end
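Note that the "=" part of that pattern is optional, so it also matches a bare AUTO_INCREMENT column attribute. If only the AUTO_INCREMENT=<number> table option should go, a stricter pattern may be closer to what is wanted. The sketch below runs on a made-up SQL string rather than the real files, and assumes the option is always written without spaces around "=".

```ruby
# Made-up sample resembling a MySQL dump.
sql = <<~SQL
  CREATE TABLE t (
    id INT NOT NULL AUTO_INCREMENT
  ) ENGINE=InnoDB AUTO_INCREMENT=42 DEFAULT CHARSET=utf8;
SQL

# Remove only "AUTO_INCREMENT=<digits>" plus the whitespace in front of it.
cleaned = sql.gsub(/\s*AUTO_INCREMENT=\d+/, "")
puts cleaned
```

The column definition keeps its AUTO_INCREMENT attribute, while the table option with the counter value is removed.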
Have you tried using Regex for this? If you want to remove the whole line, you could simply match ^AUTO_INCREMENT=.+$ and replace it with an empty string. That pattern should match an entire line beginning with AUTO_INCREMENT.
Here's a good site to learn Regex if you aren't familiar with it:
Hope that works for you.
You should read up on IO, String, Array for more details on methods you can use.
Here's how you might read, modify, and save the contents of one file:
# Opens a file for reading.
file = File.open("file1.txt")
# Reads all the contents into the string 'contents'.
contents = file.read
file.close
# Splits contents into an array of strings, one for each line.
lines = contents.split("\n")
# Delete any lines that start with AUTO_INCREMENT=
lines.reject! { |line| line =~ /^AUTO_INCREMENT=/ }
# Join the lines together into one string again.
new_contents = lines.join("\n")
# Open file for writing.
file = File.open("file1.txt", "w")
# Save new contents.
file.write(new_contents)
file.close

My file is getting shorter and I don't know why

I have a requirement where I need to edit part of an XML file and save it, but with my code some part of the XML file is not being saved. I want to change <mtn:ttl>4</mtn:ttl> to <mtn:ttl>9</mtn:ttl>. The code below makes that change, but when writing the file, only part of it is saved or its formatting changes. Can anyone tell me how to solve this? The original XML file is 79 KB, but after editing and saving it becomes 78 KB.
require "rexml/text"
require "rexml/document"
include REXML
File.open("c://conf//cad-mtn-config.xml") do |config_file|
# Open the document and edit the file
config = Document.new(config_file)
if testField.to_s.match(/<mtn:ttl>/)
config.root.elements[4].elements[11].elements[1].elements[1].elements[1].elements[8].text="9"
# Write the result to a new file.
formatter = REXML::Formatters::Default.new
File.open("c://mtn-3//mtn-2.2//conf//cad-mtn-config.xml", 'w') do |result|
formatter.write(config, result)
end
end
end
It looks like you're trying to use regular expressions; why not just use REXML? The only requirement is that you need to know the URL of the namespace. Note that if the element were just ttl rather than mtn:ttl, you would not need the namespace.
require 'rexml/document'
file_path = "path to file"
contents = File.read(file_path)
xml_doc = REXML::Document.new(contents)
xml_doc.add_namespace('mtn', "http://url to mtn namespace")
xml_doc.root.elements.each('mtn:ttl') do |element|
element.text = "9"
end
File.open(file_path, "w") do |data|
data << xml_doc
end
