why does psych yaml interpreter add line breaks around 80 characters? - ruby

Psych has been the default YAML engine since Ruby 1.9.3.
Why, oh why, does Psych add a line break in its output? Check the example below.
ruby -v # => ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-linux]
require 'yaml'
"this absolutely normal sentence is more than eighty characters long because it IS".to_yaml
# => "--- this absolutely normal sentence is more than eighty characters long because it\n IS\n...\n"
YAML::ENGINE.yamler = 'syck'
"this absolutely normal sentence is more than eighty characters long because it IS".to_yaml
# => "--- this absolutely normal sentence is more than eighty characters long because it IS\n"

You'll have to configure psych’s #to_yaml options. You'll most likely find it here:
ruby-1.9.3-p125/ext/psych/emitter.c
And then you can do something like this:
yaml.to_yaml(options = {:line_width => -1})

yaml.to_yaml(options = {:line_width => -1})
solves the problem, but RuboCop reports:
Useless assignment to variable - options.
so
yaml.to_yaml(line_width: -1)
is better.
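As a rough check of the line_width option, the sketch below serializes the sentence from the question with and without it. It assumes a Ruby where Psych is the YAML engine (1.9.3 or later); the variable names are only illustrative.

```ruby
require 'yaml'

sentence = "this absolutely normal sentence is more than eighty characters long because it IS"

# Default behaviour: Psych may fold the scalar once it passes the preferred width.
wrapped = sentence.to_yaml

# line_width: -1 asks the emitter not to fold long lines.
unwrapped = sentence.to_yaml(line_width: -1)

puts wrapped
puts unwrapped

# Either way, loading the YAML gives back the original string.
puts YAML.load(wrapped) == sentence
puts YAML.load(unwrapped) == sentence
```

The option only changes how the file looks to a human reader or a diff tool; the loaded value is the same in both cases.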

Why does it matter whether YAML wraps the line or not when it serializes the data?
The question is, after wrapping it, can YAML reconstruct the correct line later when it reloads the file? And, the answer is, yes, it can:
require 'yaml'
puts '"' + YAML.load("this absolutely normal sentence is more than eighty characters long because it IS".to_yaml) + '"'
Which outputs:
"this absolutely normal sentence is more than eighty characters long because it IS"
Data that has been serialized is in a format that YAML understands. That's an important concept, as the data is YAML's at that point. We can mess with it in an editor, and add/subtract/edit, but the data is still YAML's, because it has to reload and reparse the data in order for our applications to use it. So, after the data makes a round-trip through YAML-land, if the data returns in the same form as it left, then everything is OK.
We'd have a problem if it was serialized and then corrupted during the parsing stage, but that doesn't happen.
You can modify some of YAML's Psych driver's behavior when it's serializing data. See the answers for "Documentation for Psych to_yaml options?" for more information.

Related

Unicode & Ruby - Expected behavior?

Ruby Version: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
Readline Version: 6.2
I'm working with some emojis, and most of them behave correctly, with the exception of two: the 🌭 and 🍾 emojis. Here is some terminal output:
(byebug) "🌭"
"\u{1F32D}"
(byebug) "🛍"
"🛍"
(byebug) "🍾"
"\u{1F37E}"
Can someone tell me what's going on here? Is it just some encoding screwiness with irb? I might be snow-blind since I've been wrestling with this for so long so if there's any more information required to answer this please let me know.
Ruby may show a string with various backslash encodings for various reasons, one of which is irregular characters. For example:
"
"
# => "\n"
'"'
# => "\""
This doesn't mean the string contains an actual backslash, but rather that the version shown by inspect contains one. This is a long tradition dating back at least to the era of C in the 1970s where \n and such have been understood to mean "newline character".
In the case of emoji you might find that some are displayed and others aren't. This may be an interaction between the version of Ruby you're using and the terminal settings. As emoji are constantly being introduced you might find older ones display properly but Ruby's not confident enough with new ones to render them as-is, perhaps concerned that's an invalid Unicode character. Rather than showing something blank or the infamous question mark character, it shows the literal code for the character.
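To see that the escape is only a display choice, here is a small sketch comparing what inspect shows with what the string actually holds. The variable name is illustrative, and the exact inspect output is left unstated because it depends on the Ruby version.

```ruby
hot_dog = "\u{1F32D}"

# inspect (what irb and byebug display) may escape characters that this Ruby
# build does not consider printable; the string itself is unchanged.
p hot_dog

# puts writes the raw bytes, so the terminal decides how to draw the glyph.
puts hot_dog

puts hot_dog.length           # number of characters
puts hot_dog.valid_encoding?  # whether the bytes are valid in the string's encoding
puts hot_dog.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

If the length is 1 and the encoding is valid, the data is fine and only the inspect representation differs.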

Remove &thinsp; from Ruby String

I am trying to parse some data and am having trouble removing a &thinsp; symbol. I know that it is just a "space", but I'm really having trouble removing it from the string.
My code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('my_page.hmtl')
price = page.search('#product_buy .price').text.to_s.gsub(/\s+/, "").gsub("&thinsp;","").gsub("&#8201;", "")
puts price
As a result I always get "4 162", with those spaces. I don't know what to do.
Has anyone run into this issue before? Thank you.
HTML escape codes don't mean anything to Ruby's regex engine. Looking for "&thinsp;" will look for those literal characters, not a thin space. Instead, Ruby >= 1.9 supports Unicode escapes in strings, meaning that you can use the Unicode code point corresponding to a thin space to make your substitution. The code point for a thin space is U+2009, so you can reference it in a double-quoted Ruby string as "\u2009".
Additionally, instead of calling some_string.gsub('some_string', ''), you can just call some_string.delete('some_string').
Note that this isn't appropriate for all situations, because delete removes all instances of all characters appearing in the intersection of its arguments, while gsub will remove only segments matching the pattern provided. For example, 'hellohi'.gsub('hello', '') == 'hi', while 'hellohi'.delete('hello') == 'i'.
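In your specific case, I'd use something like the following (note the double quotes: inside single quotes, \u2009 and \s are not interpreted as escapes):
price = page.search('#product_buy .price').text.delete("\u2009\s")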
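A minimal sketch of the idea, with no Mechanize involved. The price_text value is a made-up stand-in for the scraped text, assumed to contain a thin space (U+2009) and a trailing ordinary space.

```ruby
# "\u2009" is THIN SPACE; this string stands in for the scraped price text.
price_text = "4\u2009162 "

# Double quotes are required so that \u2009 becomes the actual character.
cleaned_with_delete = price_text.delete("\u2009 ")

# A character class does the same and also covers tabs and newlines.
cleaned_with_gsub = price_text.gsub(/[\u2009\s]/, "")

puts cleaned_with_delete
puts cleaned_with_gsub
```

delete is enough when you only need to drop individual characters; gsub is the better fit once the thing to remove is a pattern rather than a set of characters.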

printing the first letter of a variable returns a number?

I'm fairly new to Ruby and was mucking around with the basics when I came across a problem.
When I tried to print the first letter of a variable, it printed a number instead.
The code was:
name = "Max"
print name[0]
Instead of printing the letter M, it printed 77.
Could someone please tell me what I did wrong?
The behaviour of this operator is different across various versions of Ruby. You're probably using an older one, in which case this is to be expected.
Here's an excerpt from the docs for Ruby 1.8.7's String class
If passed a single Fixnum, returns the code of the character at that
position.
This has been changed, and newer versions of Ruby (1.9.x and above) simply return the character as a String. See the docs for Ruby 2.1.0:
If passed a single index, returns a substring of one character at that
index.
Ruby 1.9.3, which I happen to have installed on the machine I'm using, behaves exactly as the 2.1.0 docs describe:
"Mwada"[0]
=> "M"
"Mwada"[0].class
=> String
If it's ruby 1.8.x, run #chr on a number representing the character:
"Mwada"[0].chr # => "M"
If it's ruby 1.9.x and above, everything will work as you would expect it to:
"Mwada"[0] # => "M"
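If the same script has to run on both 1.8 and 1.9+, a sketch like the following avoids name[0] altogether. It assumes plain ASCII text; on 1.8, a start/length slice of a multibyte string would cut a character in half.

```ruby
name = "Max"

# name[0, 1] (start, length) returns a one-character String on 1.8 and 1.9+.
first_letter = name[0, 1]
puts first_letter

# unpack("C") reads the first byte as an integer, which is the number
# that 1.8 returned from name[0].
code = first_letter.unpack("C").first
puts code

# Integer#chr turns the code back into a one-character String.
puts code.chr
```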
Hmm, I ran your code through irb (the interactive Ruby console) and got 'M' when looking for name[0]. You can open irb by simply typing irb on the command line and test this for yourself.
irb > name = "Max"
=> "Max"
irb > print name[0]
M => nil
Can you tell me more about the context in which you requested name[0]? Could name have been reassigned to something else? Are you calling .to_i (convert to integer) anywhere in your code?
I just checked this issue in a Ruby book that covers both Ruby 1.8 and Ruby 1.9. The book is The Ruby Programming Language by David Flanagan & Yukihiro Matsumoto.
The book says: "In Ruby 1.8, a string is like an array of bytes or 8-bit character codes
s = 'hello' # Ruby 1.8
s[0] # 104: the ASCII character code for the first character 'h'
Ruby 1.9 returns single-character strings rather than character codes"
s = 'hello' # Ruby 1.9
s[0] # 'h': the first character of the string, as a string
(please note some text was left out in the quote above)
In relation to your question directly, I also tested the String#bytes method (with to_a) on 'Max' in my Ruby 1.9 environment.
print name.bytes.to_a
[77, 97, 120] => nil
and it printed the ASCII codes for 'Max'; 77 is the ASCII code for 'M'.
I am quite a new Ruby programmer as well. I am still learning Ruby, and I have found the book above worthwhile. So far I have managed to read only the first 70 pages or so, but I'll definitely try to finish the book :-)

Split utf8 string regardless of ruby version

str = "é-du-Marché"
I get the first char via
str.split(//).first
How can I get the rest of the string regardless of my Ruby version?
String does not have a first method, so you need a split as well. When you do the split in Unicode mode (UTF-8, to be exact), you have access to the first (and the other) characters.
My solution:
puts RUBY_VERSION
str = "é-du-Marché"
p str.split(//u, 2)
Test with ruby 1.9.2:
1.9.2
["\u00E9", "-du-March\u00E9"]
Test with ruby 1.8.6:
1.8.6
["\303\251", "-du-March\303\251"]
With first and last you get your results:
str.split(//u, 2).first is the first character
str.split(//u, 2).last is the string after the first character.
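Put together, the split with a limit of 2 can be destructured in one step. This sketch assumes Ruby 1.9 or later, because it writes é as a \u escape to keep the source file ASCII-only; first and rest are illustrative names.

```ruby
str = "\u00E9-du-March\u00E9"  # "é-du-Marché"

# The empty pattern splits between characters; the limit of 2 stops after
# the first split, so the second element is the untouched remainder.
first, rest = str.split(//u, 2)

puts first
puts rest
```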
str[1..-1] should normally return everything after the first character.
The argument is a range of indexes: it starts at 1, which skips the first character, and ends at -1, which Ruby counts from the back, so it means the last character.
Note that indexing by character like this only works for multibyte strings in Ruby 1.9 and later. To mimic this behavior in older versions, you'll have to loop over the bytes yourself and figure out what needs to be removed from the data, because Ruby 1.8 indexes strings by byte.
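The difference between characters and bytes can be seen directly on Ruby 1.9 or later. This is only an illustration of the point above, and it assumes the string is UTF-8 encoded.

```ruby
str = "\u00E9-du-March\u00E9"  # "é-du-Marché"

# Range indexing counts characters on 1.9+, so this drops exactly one character.
rest = str[1..-1]
puts rest

puts str.length    # characters: 11
puts str.bytesize  # bytes: 13, because each "é" takes two bytes in UTF-8

# The first character is stored as the two bytes 195 and 169 (octal 303 251).
puts str.bytes.to_a[0, 2].inspect
```

The two leading bytes match the "\303\251" that Ruby 1.8.6 printed in the earlier answer.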
UPDATE:
You could try this as well, but I can't guarantee that it will work for every multibyte char:
str = "é-du-Marché"
substring = str.mb_chars[1..-1]
mb_chars returns a proxy class (part of ActiveSupport) that directs the call to the appropriate implementation when dealing with UTF-8, UTF-32 or UTF-16 encoded characters (i.e. multibyte chars).
More detailed info can be found here: http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
But I do not know if this exists in older Rails versions.
UPDATE2:
Ruby 1.8 treats any string as just a bunch of bytes; calling size() on it returns the number of bytes used to store the data. To determine the characters regardless of the encoding, try this:
char_array = str.scan(/./m)
substring = char_array[1..-1].join
This should normally do the trick. Try looking at http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 which explains how to treat multibyte data in older Ruby versions.
EDIT3:
Playing around with the scan & join operations brings me closer to your problem & solution. I honestly don't have the time to get the full solution working, but if you use scan(/./mu), the u flag makes the regex match in UTF-8 mode, which should work across Ruby versions.

Strange \n in base64 encoded string in Ruby

The built-in Base64 library in Ruby is adding some '\n's, and I'm unable to find out the reason. Take this particular example:
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'base64'
=> true
irb(main):003:0> str = "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
=> "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
irb(main):004:0> Base64.encode64(str)
=> "MTExMC0tYWQ2Y2EwYjA2ZTFmYmViN2U2NTE4YTA0MThhNzNhNmUwNGE2NzA1\nNA==\n"
The \n's are at the last position and the 6th position from the end. The decoder (Base64.decode64) returns the original string perfectly. The strange thing is that these \n's don't add any value to the encoded string. When I remove the newlines from the output string, the decoder still decodes it perfectly.
irb(main):005:0> Base64.decode64(Base64.encode64(str).gsub("\n", '')) == str
=> true
What's more, I used a JS library to produce the base64 encoded output of the same input string, and its output comes without the \n's.
Is this a bug or something else? Has anybody faced this issue before?
FYI,
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
Edit: Since I wrote this answer, Base64.strict_encode64() was added, which does not add newlines.
The docs are somewhat confusing: the b64encode method is supposed to add a newline for every 60th character, and the example for the encode64 method is actually using the b64encode method.
It seems the pack("m") method of the Array class, which encode64 uses, also adds the newlines. I would consider it a design bug that this is not optional.
You could either remove the newlines yourself, or if you're using rails, there's ActiveSupport::CoreExtensions::Base64::Encoding with the encode64s method.
In ruby-1.9.2 you have Base64.strict_encode64, which doesn't add that \n (newline) at the end.
Use the strict_encode64 method. encode64 adds a \n after every 60 encoded characters.
Yeah, this is quite normal. The doc gives an example demonstrating the line-splitting. base64 does the same thing in other languages too (e.g. Python).
The reason content-free newlines are added at the encode stage is that base64 was originally devised as an encoding mechanism for sending binary content in e-mail, where the line length is limited. Feel free to strip them out if you don't need them.
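For comparison, here is a short sketch that runs the string from the question through both encoders. It assumes Ruby 1.9.2 or later, where strict_encode64 and strict_decode64 are available; the variable names are illustrative.

```ruby
require 'base64'

str = "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"

wrapped = Base64.encode64(str)         # newline after every 60 encoded characters
strict  = Base64.strict_encode64(str)  # no newlines at all

puts wrapped.inspect
puts strict.inspect

# Stripping the newlines from encode64 gives the strict form.
puts wrapped.delete("\n") == strict
# pack("m0") is the lower-level equivalent of strict_encode64.
puts [str].pack("m0") == strict
# decode64 ignores the newlines, so both forms decode to the original.
puts Base64.decode64(wrapped) == str
puts Base64.strict_decode64(strict) == str
```

strict_encode64 is usually the better fit for tokens, headers and URLs, where an embedded newline would break the surrounding format; encode64 remains appropriate for MIME-style bodies.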
Seems they've got to be stripped/ignored, like:
Base64.encode64(str).gsub(/\n/, '')
The \n added when using Base64#encode64 is correct, check this post out: https://glaucocustodio.github.io/2014/09/27/a-reminder-about-base64encode64-in-ruby/

Resources