Invalid characters before my XML in Ruby - ruby

When I look at the XML file, it looks fine and starts with <?xml version="1.0" encoding="utf-16le" standalone="yes"?>
But when I read it in Ruby and print it to stdout, there are two ?s in front of that: ??<?xml version="1.0" encoding="utf-16le" standalone="yes"?>
Where do these come from, and how do I remove them? Parsing it like this with REXML fails immediately. Removing the first two characters and then parsing gives me this error:
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16le" s>
What is the right way to handle this?
Edit: Below is my code. The ftp.get downloads the XML from an FTP server. (I wonder if that might be relevant.)
xml = ftp.get
puts xml
until xml[0,1] == "<" # to remove the 2 invalid characters
puts xml[0,2]
xml.slice! 0
end
puts xml
document = REXML::Document.new(xml)
The last puts prints the correct xml. But because of the two invalid characters, I've got the feeling something else went wrong. It shouldn't be necessary to remove anything. I'm at a loss what the problem might be, though.
Edit 2: I'm using Net::FTP to download the XML, but with this new method that lets me read the contents into a string instead of a file:
class Net::FTP
  def gettextcontent(remotefile, &block) # :yield: line
    f = StringIO.new
    begin
      retrlines("RETR " + remotefile) do |line|
        f.puts(line)
        yield(line) if block
      end
    ensure
      f.close
      return f
    end
  end
end
Edit 3: It seems to be caused by StringIO (in Ruby 1.8.7) not supporting unicode. I'm not sure if there's a workaround for that.

Those two characters are most likely a Unicode BOM (byte order mark): bytes at the start of the file that tell whoever is reading it what the byte order is. For UTF-16LE that is the byte pair FF FE.
As long as you know what the encoding of the file is, it is safe to strip them; they aren't actual content.
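On Ruby 1.9+, stripping that BOM and transcoding to UTF-8 before parsing is straightforward; here is a minimal sketch (read_utf16le_xml is a made-up helper name, not part of any library):

```ruby
# Hypothetical helper: drop a UTF-16LE BOM (the byte pair FF FE) if present,
# then transcode the raw bytes to UTF-8 so REXML can parse them.
def read_utf16le_xml(raw_bytes)
  if raw_bytes.bytes.first(2) == [0xFF, 0xFE]
    raw_bytes = raw_bytes.byteslice(2, raw_bytes.bytesize - 2)
  end
  raw_bytes.force_encoding("UTF-16LE").encode("UTF-8")
end
```

The same helper works whether the bytes come from FTP or from File.binread on a downloaded file.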

To answer my own question, the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it, and REXML also has trouble handling Unicode in Ruby 1.8.7.
The most attractive solution would of course be to upgrade to 1.9.3, but that's not practical for this project right now.
So what I ended up doing is avoiding StringIO by simply downloading to a file on disk, and then parsing the XML with Nokogiri instead of REXML.
Together, that solves all my problems.

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # header found, nothing to do
  else
    # no header found. We rewind and let Nokogiri parse the whole file
    xml_file.rewind
  end
  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.

Stripping byte order mark from zip stream in ruby

I know that File supposedly takes encoding: 'bom|utf-8', but as far as I can tell there's no equivalent for streams. My server is getting a zip file containing one CSV that has the BOM. It seems silly to save the CSV as a file versus just using CSV.new(Zip::InputStream::open(zip_file).get_next_entry.get_input_stream), but as far as I can tell none of those can detect and strip the byte order mark (BOM), and CSV fails trying to parse the header if the BOM is there.
I see that CSV.new takes encoding as an option, but, in 2.3.0 at least, it doesn't recognize bom (ArgumentError: unknown encoding name - bom)
Looks like handling the BOM is implemented in IO - maybe you can wrap your zip stream around an IO object?
https://ruby-doc.org/core-2.3.1/IO.html#method-c-new-label-Open+Mode
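For plain files (as opposed to arbitrary streams), the bom|utf-8 external encoding mentioned above does exactly this. A minimal sketch, using a temp file to stand in for a real CSV on disk:

```ruby
require 'tempfile'

# Write a small CSV prefixed with the UTF-8 BOM (bytes EF BB BF)
tmp = Tempfile.new(['bom_demo', '.csv'])
tmp.binmode
tmp.write("\xEF\xBB\xBF".b + "name,qty\nwidget,3\n".b)
tmp.close

# The 'bom|utf-8' external encoding strips the BOM, if present, on read
text = File.read(tmp.path, mode: 'r:bom|utf-8')
```

Zip entry streams don't accept this mode string, though, which is why a stream-based approach has to check and skip the BOM manually.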
Since you can rewind streams, the answer is to read the first byte, see if it starts a BOM, consume the rest of the BOM if so, and otherwise rewind the stream.
BYTE_ORDER_MARKS_LENGTHS =
  {"\xEF".bytes.first => 2, "\xFE".bytes.first => 1, "\xFF".bytes.first => 1}

# checks if input_stream starts with a byte order mark and if so skips over it
def skip_bom(input_stream)
  entry = BYTE_ORDER_MARKS_LENGTHS[input_stream.read(1).bytes.first]
  if entry
    input_stream.read(entry)
  else
    input_stream.rewind
  end
end
My situation was similar, but I also needed to remove extra double-quotes:
Zip::File.open(zipfolder) do |zipfile|
  zipfile.each do |zip_entry|
    zip_entry.get_input_stream.each_line do |line|
      line_without_bom_or_quotes = line.force_encoding('UTF-8').gsub('"', '')
      row = CSV.parse_line(line_without_bom_or_quotes)
      puts "DETAIL: #{row.inspect}"
    end
  end
end
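Putting the pieces together for the CSV case: after skipping the BOM on the stream, the rest can go straight to CSV. A minimal sketch, using StringIO as a stand-in for the zip entry stream:

```ruby
require 'csv'
require 'stringio'

# StringIO stands in here for the zip entry's input stream
data = "\xEF\xBB\xBF".b + "name,qty\nwidget,3\n".b
io = StringIO.new(data)

# Skip the UTF-8 BOM if present; otherwise rewind to the start
io.rewind unless io.read(3) == "\xEF\xBB\xBF".b

# Parse the remainder as UTF-8 CSV with a header row
rows = CSV.new(io.read.force_encoding('UTF-8'), headers: true).read
```
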

Ruby writing zip file works on Mac but not on Windows / How to receive zip file in Net::HTTP

Actually, I'm writing a Ruby script which accesses an API based on HTTP POST calls.
The API returns a zip file containing text documents when I call it with specific POST parameters. At the moment I'm doing that with the Net::HTTP package.
Now my problem:
It seems to return the zip file as a string, as far as I know. I can see "PK" (which I suppose is part of the PK header of zip files) and the text from the documents.
And the Content-Type header is telling me "application/x-zip-compressed; name="somename.zip"".
When I save the zip file like so:
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "w") do |file|
  file.write result.body
end
I can unzip it on my MacBook without further problems. But when I run the same code on my Win10 PC, it tells me that the file is corrupt or not a zip file.
Does it have something to do with the encoding? Can I change it so it works on both?
Or is this completely the wrong approach for receiving a zip file from a POST request?
PS:
My ruby-version on Mac:
ruby 2.2.3p173
My ruby-version on Windows:
ruby 2.2.4p230
Many thanks in advance!
The problem is due to the way Windows handles line endings (\r\n for Windows, whereas OS X and other Unix-based operating systems use just \n). When using File.open, a mode of just w makes the file subject to line-ending conversion, so any occurrence of byte 0x0A (\n) is converted into the bytes 0x0D 0x0A (\r\n), which effectively breaks the zip.
When opening the file for write, use the mode wb instead, as this will suppress any line ending changes.
http://ruby-doc.org/core-2.2.0/IO.html#method-c-new-label-IO+Open+Mode
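Equivalently, File.binwrite opens the file in binary mode in a single call. A quick sketch showing that the bytes survive untouched (the payload is made-up sample data, not a real zip):

```ruby
require 'tempfile'

# Bytes containing 0x0A, which text mode on Windows would expand to 0x0D 0x0A
payload = "PK\x03\x04\nbinary\n".b

tmp = Tempfile.new(['demo', '.zip'])
tmp.close

# binwrite is equivalent to File.open(path, 'wb') { |f| f.write(payload) }
File.binwrite(tmp.path, payload)
```
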
Many thanks! Just as you posted the solution, I found it out myself.
So much trouble because of one missing 'b' :/
Thank you very much!
The solution (see Ben Y's answer):
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "wb") do |file|
  file.write result.body
end

convert xml to utf-8 encoding

I have an xml that starts with
<?xml version='1.0' encoding='ISO-8859-8'?>
when I attempt to do
Hash.from_xml(my_xml)
I get a #<REXML::ParseException: No close tag for /root/response/message> (REXML::ParseException)
In the message tag there are indeed characters in the above encoding. I need to parse that XML, so I am guessing that I need to convert it all to UTF-8, or something else that the parser will like.
Is there a way to do this? (Other approaches, e.g. with Nokogiri, are also fine.)
Nokogiri seems to do the right thing:
# test.xml
<?xml version='1.0' encoding='ISO-8859-8'?>
<what>
<body>דה</body>
</what>
xml = Nokogiri::XML(File.read 'test.xml')
puts xml.at_xpath('//body').content
# => "דה"
You can also tell Nokogiri what encoding to use (e.g., Nokogiri::XML(File.read('test.xml'), nil, 'ISO-8859-8')), but that doesn't seem to be necessary here.
If that doesn't help, you might want to check that your XML is well-formed.
You can then convert the XML to UTF-8 if you like:
xml2 = xml.serialize(:encoding => 'UTF-8') {|c| c.format.as_xml }
If you just want to convert your Nokogiri XML to a hash, take a look at some of the solutions here: Convert a Nokogiri document to a Ruby Hash, or you can just do: Hash.from_xml(xml2).
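For reference, the transcoding step itself needs nothing beyond String#encode. A minimal sketch (the two bytes are the ISO-8859-8 encodings of ד and ה, matching the sample document above):

```ruby
# 0xE3 and 0xE4 are the ISO-8859-8 code points for ד and ה
hebrew = "\xE3\xE4".b.force_encoding('ISO-8859-8')

# Transcode to UTF-8 before handing the text to a parser that expects it
utf8 = hebrew.encode('UTF-8')
```
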

Ruby - Reading and editing XML file

I am writing a Ruby (1.9.3) script that reads XML files from a folder and then edit it if necessary.
My issue is that I was given XML files converted by Tidy, but its output is a little strange, for example:
<?xml version="1.0" encoding="utf-8"?>
<XML>
<item>
<ID>000001</ID>
<YEAR>2013</YEAR>
<SUPPLIER>Supplier name test,
Coproration</SUPPLIER>
...
As you can see, the SUPPLIER element has an extra CRLF. I don't know why it behaves this way, but I am addressing it with a Ruby script. I'm having trouble, though, as I need to check whether the last character of a line is ">" or the first is "<", so that I can tell whether something is wrong with the markup.
I have tried:
Dir.glob("C:/testing/corrected/*.xml").each do |file|
  puts file
  File.open(file, 'r+').each_with_index do |line, index|
    first_char = line[0,1]
    if first_char != "<"
      # copy this line to the previous line and delete this one?
    end
  end
end
I also feel like I should copy the original file's content to a temporary file as I read it, and then overwrite. Is that the best way? Any tips are welcome, as I do not have much experience altering file contents.
Regards
Does that extra \n always appear in the <SUPPLIER> node? As others have suggested, Nokogiri is a great choice for parsing XML (or HTML). You could iterate through each <SUPPLIER> node and remove the \n character, then save the XML as a new file.
require 'nokogiri'

# read and parse the old file
file = File.read("old.xml")
xml = Nokogiri::XML(file)

# replace \n and any additional whitespace with a space
xml.xpath("//SUPPLIER").each do |node|
  node.content = node.content.gsub(/\n\s+/, " ")
end

# save the output into a new file
File.open("new.xml", "w") do |f|
  f.write xml.to_xml
end
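If you'd rather not pull in Nokogiri, a plain line-based repair is also possible: buffer the lines and merge any line that doesn't start with < into the previous one. A rough sketch (join_continuation_lines is a made-up helper, and it assumes the Tidy output otherwise puts one element per line):

```ruby
# Hypothetical helper: merge continuation lines (lines not starting with "<")
# back into the element they belong to.
def join_continuation_lines(lines)
  out = []
  lines.each do |line|
    if out.any? && !line.lstrip.start_with?("<")
      # continuation of the previous line's text content: merge it upward
      out[-1] = out[-1].chomp + " " + line.lstrip
    else
      out << line
    end
  end
  out
end
```
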
