How to get a substring of input in various languages in Ruby? - ruby

Given a string that may contain English, Japanese (wide characters), or other languages,
how can I get the first character or a substring of it?
For example: "Give" => "G"
"日本" => "日"
Thanks!

This is built in to Ruby, as long as the string has the correct encoding set:
$ ruby -ve 'p "日本".encoding, "日本"[0]'
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
#<Encoding:UTF-8>
"日"
There is no need to use mb_chars or ActiveSupport.
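As a small sketch of the point above (the variable names are only for illustration), indexing and slicing work per character once the encoding is right, and force_encoding can relabel a string whose bytes are UTF-8 but whose encoding tag is wrong:

```ruby
english  = "Give"
japanese = "日本語"

english[0]      # first character: "G"
japanese[0]     # first character: "日"
japanese[0, 2]  # first two characters: "日本"

# Assumption for illustration: the same bytes arrive tagged as binary,
# for example from a socket. Indexing then works per byte, not per character.
binary = japanese.dup.force_encoding(Encoding::BINARY)
binary[0].bytesize   # 1, only the first byte of "日"

# Relabel the bytes as UTF-8 (no conversion happens) to index by character.
fixed = binary.dup.force_encoding(Encoding::UTF_8)
fixed[0]             # "日" again
```

force_encoding only changes the label on the bytes; if the input is really in another encoding such as Shift_JIS, String#encode is the method that converts it.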

You can use ActiveSupport's Chars class (via mb_chars):
string = "日本"
string.mb_chars[0]
=> "日"

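If you have ActiveSupport (it ships with Rails, alongside ActiveRecord), you can use mb_chars.
Or you can use the core String methods:
str = '日本'
str.chars.take(1)
# => ["日"]
'chars' returns the string's characters according to its encoding, and 'take' takes as many of them as you want. (The similarly named 'codepoints' returns integer code points such as 26085, not characters.) Or you can use
str.chars.to_a[0]
# => "日"
On Ruby 1.9, 'chars' without a block returns an enumerator, and 'to_a' converts it to an array. That is fine for short strings but wasteful for big ones, where str[0] or str.each_char.first avoids building the whole array.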
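To make the difference between code points and characters concrete, here is a short sketch using only core String methods:

```ruby
str = "日本"

code_points = str.codepoints          # integer code points: [26085, 26412]
characters  = str.chars               # one-character strings: ["日", "本"]

# each_char without a block returns an enumerator, so first(n) stops
# after n characters instead of building an array of the whole string.
first_only = str.each_char.first(1)   # ["日"]

# A code point can be turned back into a character with an explicit encoding.
from_code_point = 26085.chr(Encoding::UTF_8)   # "日"
```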

Related

Ruby: How does concatenation affect the String in memory? [duplicate]

Why does concatenating to a string not change its object_id? My understanding was that Strings are immutable because Strings are essentially Arrays of characters, and Arrays cannot be changed in memory since they are contiguous. Yet, as demonstrated below, instantiating a String and then adding characters does not change its object_id. How does concatenation affect the String in memory?
2.1.2 :131 > t1 = "Hello "
=> "Hello "
2.1.2 :132 > t1.object_id
=> 70282949828720
2.1.2 :133 > t2 = t1
=> "Hello "
2.1.2 :134 > t2.object_id
=> 70282949828720
2.1.2 :135 > t2 << "HEY THERE MATE"
=> "Hello HEY THERE MATE"
2.1.2 :136 > t2.object_id
=> 70282949828720
2.1.2 :137 > t1.object_id
=> 70282949828720
2.1.2 :138 >
How come concatenating to a string does not change its object_id?
Because it's still the same string it was before.
My understanding was that Strings are immutable
No, they are not immutable. In Ruby, strings are mutable.
because Strings are essentially Arrays of Characters,
They are not. In Ruby, strings are mostly a factory for iterators (each_line, each_char, each_codepoint, each_byte). String implements a subset of the Array protocol, but that does not make it an array.
and Arrays cannot be changed in memory since they are contiguous.
Wrong, arrays are mutable in Ruby.
Yet, as demonstrated below, instantiating a String and then adding characters does not change its object_id. How does concatenation affect the String in memory?
The Ruby Language Specification does not prescribe any particular in-memory representation of strings. Any representation is fine, as long as it supports the semantics specified in the Ruby Language Specification.
Here are the String sources of a few Ruby implementations:
Rubinius:
kernel/common/string.rb
kernel/bootstrap/string.rb
vm/builtin/string.cpp
Topaz:
topaz/objects/stringobject.py
Cardinal:
src/classes/String.pir
IronRuby:
Ruby/Builtins/MutableString.cs
JRuby:
core/src/main/java/org/jruby/RubyString.java
Ruby strings are not immutable, in contrast to languages like Python and Java. The underlying char array is internally resized to accommodate the appended characters.
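A minimal sketch of that behavior (variable names are illustrative): << mutates the existing object, so every reference sees the change, while += builds a new String and rebinds only the variable it is applied to.

```ruby
greeting = "Hello ".dup        # dup guarantees an unfrozen string
alias_of_greeting = greeting   # second reference to the same object
id_before = greeting.object_id

greeting << "HEY THERE MATE"   # mutates the existing object in place
same_object = greeting.object_id == id_before                 # true
alias_sees_change = alias_of_greeting.end_with?("MATE")       # true

greeting += "!"                # builds a new String and rebinds the variable
new_object = greeting.object_id != id_before                  # true
alias_unchanged = alias_of_greeting == "Hello HEY THERE MATE" # true
```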
If you want an immutable string in Ruby (for example, Bad Things can happen if you mutate a value that is in use as a hash key), use a symbol:
my_sym = :foo
or
my_sym = my_string.to_sym
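A symbol is not the only option. The sketch below, assuming plain Ruby with no extra libraries, shows that freeze makes an individual string immutable and that Hash#[]= stores a frozen copy of an unfrozen string key, which limits the hash-key risk mentioned above:

```ruby
name = "foo".dup
name.freeze
begin
  name << "bar"              # attempt to mutate a frozen string
  mutated = true
rescue RuntimeError          # FrozenError is a RuntimeError subclass (Ruby 2.5+)
  mutated = false
end

key = "id".dup
table = {}
table[key] = 1               # Hash#[]= dups and freezes unfrozen String keys
key << "_changed"            # mutating the original leaves the stored key alone
stored_key = table.keys.first

as_symbol = name.to_sym      # :foo
```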

Generate string for Regex pattern in Ruby

In Python, I found rstr, which can generate a string from a regex pattern.
Python also has this function, which returns the character ranges of a pattern:
re.sre_parse.parse(pattern)
#..... ('range', (97, 122)) ....
But in Ruby I didn't find anything similar.
So how can I generate a string from a regex pattern in Ruby (reverse regex)?
I want something like this:
"/[a-z0-9]+/".example
#tvvd
"/[a-z0-9]+/".example
#yt
"/[a-z0-9]+/".example
#bgdf6
"/[a-z0-9]+/".example
#564fb
"/[a-z0-9]+/" is my input.
The outputs must be valid strings that match my regex pattern.
Here the outputs were tvvd, yt, bgdf6 and 564fb, generated by the "example" method.
I need that method.
Thanks for your advice.
You can also use the Faker gem https://github.com/stympy/faker and then use this call:
Faker::Base.regexify(/[a-z0-9]{10}/)
In Ruby:
/qweqwe/.to_s
# => "(?-mix:qweqwe)"
When you write a Regexp literal you get a Regexp object; to convert it to a String, you can use Regexp#to_s. The conversion spells out the regexp's options, as you can see in the example. According to the documentation, it returns:
a string containing the regular expression and its options (using the (?opts:source) notation). This string can be fed back in to Regexp::new to a regular expression with the same semantics as the original.
Also, you can use Regexp#inspect, which:
produces a generally more readable version of rxp.
/ab+c/ix.inspect #=> "/ab+c/ix"
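The round trip described above can be sketched as follows, using only core Regexp methods:

```ruby
pattern = /ab+c/ix

source_text = pattern.source     # "ab+c", the pattern text without options
readable    = pattern.inspect    # "/ab+c/ix"
as_string   = pattern.to_s       # options embedded in (?opts:source) form

# Feeding the to_s form back into Regexp.new keeps the matching behavior,
# including the case-insensitive flag.
rebuilt = Regexp.new(as_string)
match_index = rebuilt =~ "xABBBc"   # 1, the match starts at the second character
```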
Note that the above methods only convert a Regexp into a String. To match or select strings against a regexp, use other methods. For example, if you have a source array (or a string that you split with #split), you can grep it and get a result array:
array = "test,ab,yr,OO".split( ',' )
# => ['test', 'ab', 'yr', 'OO']
array = array.grep /[a-z]/
# => ["test", "ab", "yr"]
And then convert the array back into a string:
array.join(',')
# => "test,ab,yr"
Or just use the #scan method with a slightly changed regexp:
"test,ab,yr,OO".scan( /[a-z]+/ )
# => ["test", "ab", "yr"]
However, if you really need a random string that matches the regexp, you have to write your own method (please refer to the post) or use the ruby-string-random library. The library:
generates a random string based on Regexp syntax or Patterns.
The code will look like the following:
pattern = '[aw-zX][123]'
result = StringRandom.random_regex(pattern)
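For the "write your own method" route, here is a deliberately tiny sketch. The helper random_from_class is hypothetical (it is not part of any gem) and handles only a single bracket class such as [a-z0-9] built from literal characters and simple ranges; a general regex reverser needs a real parser, which is what the gems in this thread provide.

```ruby
# Hypothetical helper, not part of any gem: supports only one bracket class
# such as "[a-z0-9]" made of literal characters and simple ranges.
def random_from_class(char_class, length_range = 1..8)
  body = char_class[/\A\[(.+)\]\z/, 1]
  raise ArgumentError, "expected a pattern like [a-z0-9]" if body.nil?

  # Expand "a-z" style ranges and collect single characters.
  alphabet = body.scan(/(.)-(.)|(.)/).flat_map do |from, to, single|
    single ? [single] : (from..to).to_a
  end

  # Pick a random length, then a random character for each position.
  Array.new(rand(length_range)) { alphabet.sample }.join
end

example = random_from_class("[a-z0-9]")   # a random 1 to 8 character string
```

The result is random, so the only guarantee is that it matches /\A[a-z0-9]{1,8}\z/.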
A bit late to the party, but - originally inspired by this stackoverflow thread - I have created a powerful ruby gem which solves the original problem:
https://github.com/tom-lord/regexp-examples
/this|is|awesome/.examples #=> ['this', 'is', 'awesome']
/https?:\/\/(www\.)?github\.com/.examples #=> ['http://github.com', 'http://www.github.com', 'https://github.com', 'https://www.github.com']
UPDATE: Regular expressions are now supported in the string_pattern gem, and in my comparison it is 30 times faster than other gems
require 'string_pattern'
/[a-z0-9]+/.generate
To see a speed comparison: https://repl.it/#tcblues/Comparison-generating-random-string-from-regular-expression
I created a simple way to generate strings from a pattern without the mess of regular expressions; take a look at the string_pattern gem project: https://github.com/MarioRuiz/string_pattern
To install it: gem install string_pattern
This is an example of use:
# four characters. optional: capitals and numbers, required: lower
"4:XN/x/".gen # aaaa, FF9b, j4em, asdf, ADFt
Maybe you can find what you are looking for over here.

Ruby 1.8.7 vs 1.9* String[Fixnum] differences

Ruby 1.8.7:
"abc"[0]
=> 97
Ruby 1.9*:
"abc"[0]
=> "a"
Is there a way I can safely write the code above to produce the second result in both 1.8.7 and 1.9*? My solution so far is "abc".split('').first, but that doesn't seem very clever.
"abc"[0].chr
produces the 2nd result in both versions.
1.8: http://ruby-doc.org/core-1.8.7/Integer.html#method-i-chr
1.9: http://ruby-doc.org/core-1.9.3/String.html#method-i-chr
If you want the first character of a string, as a string, then add a length in the brackets:
"abc"[0,1]
Note that in 1.8, most of these answers will only work for characters in the ASCII range:
irb(main):001:0> "ā"[0].chr
=> "\304"
irb(main):002:0> "ā"[0,1]
=> "\304"
irb(main):003:0> "ā"[0..0]
=> "\304"
Though of course it depends on your encoding.
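The 1.8 behavior shown above can still be reproduced on a modern Ruby with the byte-oriented methods, which makes the difference easy to see. This is a sketch assuming a UTF-8 source file:

```ruby
s = "ā"                             # U+0101, two bytes in UTF-8

char_count = s.length               # 1 character
byte_count = s.bytesize             # 2 bytes
first_byte = s.getbyte(0)           # 196, octal 304, what 1.8's "ā"[0] returned
half_char  = s.byteslice(0, 1)      # "\xC4", the same broken half as "\304"
whole_char = s[0, 1]                # "ā" on 1.9+, indexing is per character
```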
What about
"abc"[0].ord
? Note that in 1.9 this returns the integer 97 (the first result), not the string "a".
http://ruby-doc.org/core-1.9.3/String.html#method-i-ord

How do I remove a non-breaking space in Ruby

I have a string that looks like this:
d = "foo\u00A0\bar"
When I check the length, it says that it is 7 characters long. I checked online and found out that it is a non-breaking space. Could someone show me how to remove all the non-breaking spaces in a string?
In case you do not care about the non-breaking space specifically, but about any "special" Unicode whitespace character that might appear in your string, you can replace it using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) match not only ASCII characters but all Unicode characters of the class.
For more details see the Ruby documentation
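A short sketch of the difference, using a sample string without the stray backspace from the question (that simplification is an assumption for readability):

```ruby
d = "foo\u00A0bar"                          # NBSP between the words

with_s       = d.gsub(/\s/, "")             # unchanged: \s is ASCII whitespace only
with_bracket = d.gsub(/[[:space:]]/, "")    # "foobar"
with_literal = d.gsub("\u00A0", "")         # "foobar", removes only the NBSP

# strip ignores NBSP too; a regexp with [[:space:]] can trim the ends instead.
padded  = "\u00A0 text \u00A0"
trimmed = padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")   # "text"
```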
irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"
It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I checked with method ord what the character was and used chr to represent it in gsub
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
I couldn't understand why strip doesn't remove something that looked like a space to me, so I checked what character 160 actually is: it is not an ASCII character but U+00A0, the non-breaking space.
d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead use d.gsub(/\302\240/,"")
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
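The octal escapes in the 1.8 workaround are simply the UTF-8 bytes of the non-breaking space, which a modern Ruby can confirm:

```ruby
nbsp = "\u00A0"

code_point  = nbsp.ord                           # 160
utf8_bytes  = nbsp.bytes                         # [194, 160]
octal_bytes = utf8_bytes.map { |b| b.to_s(8) }   # ["302", "240"]
escape_bytes = "\302\240".bytes                  # the same two bytes
```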

How to extract a single character (as a string) from a larger string in Ruby?

What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?
In Ruby 1.9, it's easy: Strings are encoding-aware sequences of characters, so you can just index into a string and you will get a single-character string out of it:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.
Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.
'abc'[1..1] # => "b"
'abc'[1].chr # => "b"
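On Ruby 1.9 and later the three indexing forms agree inside the string but differ slightly at and past the end, which is worth knowing when picking one. A small sketch:

```ruby
word = "Hello"

by_index  = word[1]        # "e"
by_length = word[1, 1]     # "e"
by_range  = word[1..1]     # "e"

# Past the end, the forms differ:
past_index = word[5]       # nil, there is no sixth character
at_end     = word[5, 1]    # "", a start equal to the length gives an empty string
beyond_end = word[6, 1]    # nil, a start beyond the length gives nil

# Negative indexes count from the end.
last_char = word[-1]       # "o"
```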
