Preserving whitespace / line breaks with REXML - ruby

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will be reading this file and depending on those newlines to properly parse the content. Generally in this situation I'd use a CDATA section to preserve the whitespace, but this isn't an option, as the code that parses this data later is not expecting one. It's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
# {perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.

I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that off should solve your problem.
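For what it's worth, here is a minimal sketch of that round trip with the indent argument left off. The sample XML is copied from the question; the variable names are only for illustration, and I have only checked the behaviour described in the comments, not every REXML version.

```ruby
require "rexml/document"

# Sample input copied from the question.
xml = <<~XML
  <someElement>
  some.namespace.something1=somevalue1
  some.namespace.something2=somevalue2
  some.namespace.something3=somevalue3
  </someElement>
XML

doc = REXML::Document.new(xml, { :respect_whitespace => :all })

# No indent argument: the document is written without pretty-printing,
# so the text inside <someElement> is emitted as it was read.
out = String.new
doc.write(out)

puts out
```

If you do need indentation elsewhere in the file, this won't help by itself; it only shows that the text node survives when no pretty-printing is requested.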

Related

Loading data from YAML in ruby changing the encoding/byte structure of data?

I am trying to write a method to remove some blacklisted characters, like BOM characters, using their UTF-8 values. I managed to achieve this by adding a method to the String class with the following logic:
def remove_blacklist_utf_chars
  self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
  self
end
Now to make it useful across the applications and reusable I create a config in a yml file. The yml structure is something like,
:blacklist_utf_chars:
  :zero_width_space: '"\u{200b}"'
(Edit) Also as suggested by Drenmi this didn't work,
:blacklist_utf_chars:
  :zero_width_space: \u{200b}
The problem I am facing is that the method remove_blacklist_utf_chars does not work when I load the blacklisted characters from the yml file.
But when I pass the characters directly in the method, and not via the yml file, the method works.
So basically
self.force_encoding("UTF-8").gsub!("\u{200b}".force_encoding("UTF-8"), "") -- works.
but,
self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "") -- doesn't work.
I printed the value of config[:blacklist_utf_chars][:zero_width_space] and it's equal to "\u{200b}"
I got this idea by referring: https://stackoverflow.com/a/5011768/2362505.
Now I am not sure what exactly is happening when the blacklist chars list is loaded via yml in ruby code.
EDIT 2:
On further investigation I observed that there is an extra \ getting added while reading the hash from the yaml.
So,
puts config[:blacklist_utf_chars][:zero_width_space].dump
prints:
"\\u{200b}"
But then if I just define the yaml as:
:blacklist_utf_chars:
  :zero_width_space: 200b
and do,
ch = "\u{#{config[:blacklist_utf_chars][:zero_width_space]}}"
self.force_encoding("UTF-8").gsub!(ch.force_encoding("UTF-8"), "")
I get
/Users/harshsingh/dir/to/code/utils.rb:121: invalid Unicode escape (SyntaxError)
The "\u{200b}" syntax is used for escaping Unicode characters in Ruby source code. It won’t work inside Yaml.
The equivalent syntax for a Yaml document is the similar "\u200b" (which also happens to be valid in Ruby). Note the lack of braces ({}), and also the double quotes are required, otherwise it will be parsed as literal \u200b.
So your Yaml file should look like this:
:blacklist_utf_chars:
  :zero_width_space: "\u200b"
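A quick way to see the difference is to load both forms and compare. This is only a sketch: the key names (double_quoted, plain, hex_only) are made up for the demo and use plain string keys rather than the symbol keys from the question. The last part shows one way around the "invalid Unicode escape" error from EDIT 2: a "\u{...}" escape is resolved when Ruby parses the source, so it cannot be built by interpolation, but a hex string can be converted to the character at runtime.

```ruby
require "yaml"

# Single-quoted heredoc, so Ruby itself leaves the backslashes alone
# and the YAML parser sees exactly what is written here.
yaml_text = <<~'YAML'
  double_quoted: "\u200b"
  plain: \u200b
  hex_only: "200b"
YAML

config = YAML.load(yaml_text)

zwsp = config["double_quoted"]  # the real zero-width space, one character
literal = config["plain"]       # six literal characters: \ u 2 0 0 b

# Building the character from a hex code point at runtime.
from_hex = config["hex_only"].hex.chr(Encoding::UTF_8)

cleaned = "foo\u200bbar".gsub(zwsp, "")

puts zwsp.length     # 1
puts literal.length  # 6
puts cleaned         # foobar
```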
If you puts the value and get the output "\u{200b}", it means the quotes are included in your string. I.e., you're actually calling:
self.force_encoding("UTF-8").gsub!('"\u{200b}"'.force_encoding("UTF-8"), "")
Try changing your YAML file to:
:blacklist_utf_chars:
  :zero_width_space: \u{200b}

replace the first or nth line of file with ruby

How would I replace the first line of a text file or xml file using ruby? I'm having problems replicating a strange xml API and need to edit the document instruction after I create the XML file. It is strange that I have to do this, but in this case it is necessary.
If you are editing XML, use a tool specially designed for the task. sub, gsub and regex are not good choices if the XML being manipulated is not under your control.
Use Nokogiri to parse the XML, locate nodes and change them, then emit the updated XML.
There are many examples on SO showing how to do this, plus the tutorials on the Nokogiri site.
There are a couple different ways you can do this:
Use ARGF (assuming that your ruby program takes a file name as a command line parameter)
ruby -e "puts ARGF.to_a[n]" yourfile.xml
Open the file regularly then read n lines
File.open("yourfile") { |f|
  line = nil
  n.times { line = f.gets }
  puts line
}
This approach is less intensive on memory, as only a single line is considered at a time. It is also the simplest method.
Use IO.readlines() (will only work if the entire file will fit in memory!)
IO.readlines("yourfile")[n]
IO.readlines(...) will read every line from your file into an array.
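In all of the above examples, n identifies the line you want. Note that the array-based versions (ARGF.to_a[n] and IO.readlines(...)[n]) are zero-indexed, while the n.times loop leaves you with the nth line counting from one.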
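The snippets above only read a line. To actually replace one, a rough sketch could look like the following. The helper name replace_line is made up for this example, it counts lines from one, and it reads the whole file into memory, so it is only suitable for files that fit there. It also rewrites every line ending as "\n". If the first line is an XML declaration, a parser-based approach is still the safer route.

```ruby
require "tempfile"

# Hypothetical helper: replace line `line_number` (1-based) of the file
# at `path` with `new_line`, then write the file back in place.
def replace_line(path, line_number, new_line)
  lines = File.readlines(path, chomp: true)
  unless line_number.between?(1, lines.size)
    raise ArgumentError, "file has #{lines.size} lines, cannot replace line #{line_number}"
  end
  lines[line_number - 1] = new_line
  File.write(path, lines.join("\n") + "\n")
end

# Demo on a throwaway file: swap out the first line.
result = Tempfile.create("replace_line_demo") do |file|
  file.write("<?xml version=\"1.0\"?>\n<root>\n</root>\n")
  file.flush
  replace_line(file.path, 1, '<?xml version="1.0" encoding="UTF-8"?>')
  File.read(file.path)
end

puts result
```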

Freemarker Interpolation stripping whitespace?

I seem to be having issues with leading/trailing spaces in textareas!
If the last user has typed values into a textarea with leading/trailing spaces across multiple lines, they all disappear, with the exception of one space at the beginning and end.
Example:
If the textbox had the following lines: (quotes present only to help illustrate spaces)
" 3.0"
" 2.2 "
"0.3 "
it would be saved in the backend as
"<textarea id=... > 3.0\n 2.2 \n0.3 </textarea>"
My template (for this part) is fairly straightforward (entire template, not as easy...): ${label} ${textField}
When I load up the values again, I notice getTextField() is properly getting the desired string, quoted earlier... But when I look at the html page it's showing
" 3.0"
"2.2"
"0.3 "
And of course when "View Sourcing" it doesn't have the string seen in getTextField()
What I've tried:
Ensure the backend has setWhitespaceStripping(false); set
Adding the <#ftl strip_whitespace=false>
Adding the <#nt> on the same line as ${textField}
No matter what I've tried, I'm not having luck keeping the spaces after the interpolation.
Any help would be very appreciated!
Maybe you are inside a <#compress>...</#compress> (or <@compress>...</@compress>) block. Those filter the whole output at runtime and reduce whitespace regardless of where it comes from. I recommend not using this directive. It makes the output somewhat smaller, but it has runtime overhead, and can corrupt output in cases like this.
FreeMarker interpolations don't remove whitespace from the inserted value, or change the value in any way. The exception is if you are lexically inside an <#escape ...>...</#escape> block, in which case the escaping expression is applied automatically. It's unlikely that you have an escaping expression that corrupts whitespace, but to be sure, you can check whether there's any <#escape ...> in the same template file (no need to check elsewhere, as it's not a runtime directive).
strip_whitespace and #nt only remove whitespace during parsing (that is, before execution), so they are unrelated.
You can also check if the whitespace is still there in the inserted value before inserting like this:
${textField?replace(" ", "[S]")?replace("\n", "[N]")?replace("\t", "[T]")}
If you find that they were already removed, that probably means they were removed before the value was put into the data-model. So then it wasn't FreeMarker.

Why are there blank nodes/attributes when using LibXML Ruby?

Using the Gem libxml-ruby, when we parse XML like so:
document = LibXML::XML::Parser.string( xmlData ).parse
for n in document.root.children
  # Do something
end
What we actually get is something like this:
root
-node empty
-node with data
-node empty
Same thing with attributes, there's a blank one padding between those we actually care about. What we end up needing to use is :options => LibXML::XML::Parser::Options::NOBLANKS
Why? :(
(Not necessarily an answer, but need formatting.)
What does the XML look like?
This XML:
<baz>
<plugh>ohai</plugh>
</baz>
may contain whitespace text nodes for the CR/LF and indentation between the <baz> and <plugh> opening tags, and the same for between the closing tags. This may or may not be significant whitespace depending on the nature of the XML. Structurally, it's different than:
<baz><plugh>ohai</plugh></baz>
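To make the difference concrete, here is a small sketch. Note that it uses REXML from the Ruby standard distribution instead of libxml-ruby (purely so that it runs without extra gems), so the class names differ from what LibXML reports, but the whitespace text nodes show up the same way. The variable names are just for the demo.

```ruby
require "rexml/document"

indented = "<baz>\n  <plugh>ohai</plugh>\n</baz>"
compact  = "<baz><plugh>ohai</plugh></baz>"

indented_root = REXML::Document.new(indented).root
compact_root  = REXML::Document.new(compact).root

# The indented version has a text node before and after <plugh>,
# holding the line break and the indentation.
indented_children = indented_root.children.map(&:class)
compact_children  = compact_root.children.map(&:class)

# Iterating over element children only skips the whitespace text nodes.
element_names = indented_root.elements.map(&:name)

p indented_children
p compact_children
p element_names
```

Parser options like NOBLANKS do this filtering at parse time; iterating over elements only is the other common way to ignore the padding.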

Ruby - Writing Hpricot data to a file

I am currently doing some XML parsing and I've chosen to use Hpricot because of its ease of use and syntax. However, I am running into some problems. I need to write a piece of XML data that I have found out to another file, but when I do this the format is not preserved. For example, if the content should look like this:
<dict>
<key>item1</key><value>12345</value>
<key>item2</key><value>67890</value>
<key>item3</key><value>23456</value>
</dict>
And assuming that there are many entries like this in the document. I am iterating through the 'dict' items by using
hpricot_element = Hpricot(xml_document_body)
f = File.new('some_new_file.xml', 'w')
(hpricot_element/:dict).each { |dict| f.write( dict.to_original_html ) }
After using the above code, I would expect the output to look exactly like the XML shown above. However, to my surprise, the output of the file looks more like this:
<dict>\n", " <key>item1</key><value>12345</value>\n", " <key>item2</key><value>67890</value>\n", " <key>item3</key><value>23456</value\n", " </dict>
I've tried splitting at the "\n" characters and writing to the file one line at a time, but that didn't seem to work either, as it did not recognize the "\n" characters. Any help is greatly appreciated. It might be a very simple solution, but I am having trouble finding it. Thanks!
hpricot_element = Hpricot::XML(xml_document_body)
File.open('some_new_file.xml', 'w') {|f| f.write xml_document_body }
Don't use an XML parser if you want the original XML to be written. It is unnecessary. You should still use one if you want to further process the data, though.
Also, for XML, you should be using Hpricot::XML instead of just Hpricot.
My solution was to just replace the literal '\n' characters with line breaks and remove the extra punctuation by simply adding two gsubs that looked like the following:
f.write( dict.to_original_html.gsub('\n', "\n").gsub('", "', '') )
I don't know why I didn't see this before. Like I said, it might be an easy answer that I wasn't seeing and that's exactly how it turned out. Thanks for all the answers!
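If you are not tied to Hpricot, here is a hedged alternative sketch using REXML from the standard distribution rather than the Hpricot calls above. It selects the dict elements and writes each one's markup out as a string, which keeps the line breaks and indentation inside the element. The sample document and the temp-directory output path are assumptions made for the demo.

```ruby
require "rexml/document"
require "tmpdir"

# Sample input modelled on the question; the <root> wrapper is assumed.
xml_document_body = <<~XML
  <root>
  <dict>
    <key>item1</key><value>12345</value>
    <key>item2</key><value>67890</value>
  </dict>
  </root>
XML

doc = REXML::Document.new(xml_document_body)

# to_s without an indent argument serializes the element without
# pretty-printing, so the original whitespace is kept.
dict_markup = doc.get_elements("//dict").map(&:to_s)

output_path = File.join(Dir.tmpdir, "some_new_file.xml")
File.open(output_path, "w") do |f|
  dict_markup.each { |markup| f.puts(markup) }
end

puts File.read(output_path)
```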

Resources