Loading data from YAML in ruby changing the encoding/byte structure of data? - ruby

I am trying to write a method to remove some blacklisted characters like bom characters using their UTF-8 values. I am successful to achieve this by creating a method in String class with the following logic,
def remove_blacklist_utf_chars
self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
self
end
Now to make it useful across the applications and reusable I create a config in a yml file. The yml structure is something like,
:blacklist_utf_chars:
:zero_width_space: '"\u{200b}"'
(Edit) Also as suggested by Drenmi this didn't work,
:blacklist_utf_chars:
:zero_width_space: \u{200b}
The problem I am facing is that the method remove_blacklist_utf_chars does not work when I load the utf-encoding of blacklist characters from yml file
But when I directly pass these in the method and not via the yml file the method works.
So basically
self.force_encoding("UTF-8").gsub!("\u{200b}".force_encoding("UTF-8"), "") -- works.
but,
self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "") -- doesn't work.
I printed the value of config[:blacklist_utf_chars][:zero_width_space] and its equal to "\u{200b}"
I got this idea by referring: https://stackoverflow.com/a/5011768/2362505.
Now I am not sure how what exactly is happening when the blacklist chars list is loaded via yml in ruby code.
EDIT 2:
On further investigation I observed that there is an extra \ getting added while reading the hash from the yaml.
So,
puts config[:blacklist_utf_chars][:zero_width_space].dump
prints:
"\\u{200b}"
But then if I just define the yaml as:
:blacklist_utf_chars:
:zero_width_space: 200b
and do,
ch = "\u{#{config[:blacklist_utf_chars][:zero_width_space]}}"
self.force_encoding("UTF-8").gsub!(ch.force_encoding("UTF-8"), "")
I get
/Users/harshsingh/dir/to/code/utils.rb:121: invalid Unicode escape (SyntaxError)

The "\u{200b}" syntax is used for escaping Unicode characters in Ruby source code. It won’t work inside Yaml.
The equivalent syntax for a Yaml document is the similar "\u200b" (which also happens to be valid in Ruby). Note the lack of braces ({}), and also the double quotes are required, otherwise it will be parsed as literal \u200b.
So your Yaml file should look like this:
:blacklist_utf_chars:
:zero_width_space: "\u200b"

If you puts the value, and get the output "\u{200b}", it means the quotes are included in your string. I.e., you're actually calling:
self.force_encoding("UTF-8").gsub!('"\u{200b}"'.config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
Try changing your YAML file to:
:blacklist_utf_chars:
:zero_width_space: \u{200b}

Related

Should arguments to a custom directive be escaped?

I have created a custom directive for a documentation project of mine which is built using Sphinx and reStructuredText. The directive is used like this:
.. xpath-try:: //xpath[#expression="here"]
This will render the XPath expression as a simple code block, but with the addition of a link that the user can click to execute the expression against a sample XML document and view the matches (example link, example rendered page).
My directive specifies that it does not have content, takes one mandatory argument (the xpath expression) and recognises a couple of options:
class XPathTryDirective(Directive):
has_content = False
required_arguments = 1
optional_arguments = 0
final_argument_whitespace = True
option_spec = {
'filename': directives.unchanged,
'ns_args': directives.unchanged,
}
def run(self):
xpath_expr = self.arguments[0]
node = xpath_try(xpath_expr, xpath_expr)
...
return [node]
Everything seems to be working exactly as intended except that if the XPath expression contains a * then the syntax highlighting in my editor (gVim) gets really messed up. If I escape the * with a backslash, then that makes my editor happy, but the backslash comes through in the output.
My questions are:
Are special characters in an argument to a directive supposed to be escaped?
If so, does the directive API provide a way to get the unescaped version?
Or is it working fine and the only problem is my editor is failing to highlight things correctly?
It may seem like a minor concern but as I'm a novice at rst, I find the highlighting to be very helpful.
Are special characters in an argument to a directive supposed to be escaped?
No, I think that there is no additional processing performed on arguments of rst directives. Which matches your observation: whatever you specify as an argument of the directive, you are able to get directly via self.arguments[0].
Or is it working fine and the only problem is my editor is failing to highlight things correctly?
Yes, this seems to be the case. Character * is used for emphasis/italics in rst and it gets more attention during syntax highlighting for some reason.
This means that the solution here would be to tweak or fix vim syntax file for restructuredtext.

Can't YAML::load an YAML:dumped XML value

When trying to YAML::load a value produced by YAML::dump I get an error "did not find expected key while parsing a block mapping at line 1 column 1"
The YAML::dump value was written to an XML file as:
<format_store>---:text_formatting: '':url_pattern: ''</format_store>
If I look into the database, it is a text field with line breaks in it.
---
:text_formatting: ''
:url_pattern: ''
So it looks like the conversion from YAML::dump into the XML format dropped the line breaks.
I explicitly use the YAML::dump format for text fields. XML does not allow line breaks in element values. It would have to be escaped in some way and I assumed YAML would take care of that.
Is there a better way to dump/load text fields or is there someting I'm missing here?
Option 1: Wrap the YAML content in a <![CDATA]]> as suggested in Adding a new line/break tag in XML.
Option 2: Configure your YAML library to dump mappings using flow style (e.g {':text_formatting' : '', ':url_pattern' : ''). The exact method for accomplishing this will depend on the YAML library you are using and may require a bit of custom coding.

How do I filter file names out of a SQLite dump?

I'm trying to filter out all file names from an SQLite text dump using Ruby. I'm not very handy/familiar with regex and need a way to read, and write to a file, another dump of image files that are within the SQLite dump. I can filter out everything except stuff like this:
VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);
and this:
src="/images/9/94/folder%2FGraph.JPG"
I can't figure out the easiest way to filter through this. I've tried using split and other functions, but instead of splitting the string into an array by the character specified, it just removed the character.
You should be able to use .gsub('%2', ' ') the %2 with a space, while quoted, it should be fine.
Split does remove the character that is being split, though. So you may not want to do that, or if you do, you may want to use the Array#join method with the argument of the character you split with to put it back in.
I want to 'extract' the file name from the statements above. Say I have src="/images/9/94/folder%2FGraph.JPG", I want folder%2FGraph.JPG to be extracted out.
If you want to extract what is inside the src parameter:
foo = 'src="/images/9/94/folder%2FGraph.JPG"'
foo[/^src="(.+)"/, 1]
=> "/images/9/94/folder%2FGraph.JPG"
That returns a string without the surrounding parenthesis.
Here's how to do the first one:
bar = "VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);"
bar.split(',')[4][1..-2]
=> "/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG"
Not everything in programming is a regex problem. Somethings, actually, in my opinion, most things, are not candidates for a pattern. For instance, the first example could be written:
foo.split('=')[1][1..-2]
and the second:
bar[/'(.+?)'/, 1]
The idea is to use whichever is most clean and clear and understandable.
If all you want is the filename, then use a method designed to return only the filename.
Use one of the above and pass its output to File.basename. Filename.basename returns only the filename and extension.

Preserving whitespace / line breaks with REXML

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will be reading this file and depending on those newlines to properly parse the content. Generally in this situation i'd use a CDATA to preserve the whitespace, but this isn't an option as the code that parses this data later is not expecting one - it's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that off should solve yor problem.

Write # in yaml (in the string)

I'm new using yml files (for translations in my framework).
I'm trying to add a "#" inside the translation (will be a twitter share... blabla).
Is this possible, because the file translate it like a comment...
Just put the value inside single or double quotes and it won't be treated like a comment. Something like:
en:
twitter:
share: "#hashtag"

Resources