What is the simplest way to get UTF-8 substring in Julia - utf-8

A UTF-8 string in Julia cannot be sliced by character position with the slice operator, because the operator indexes the string by byte, not by character. For example:
s = "ポケットモンスター"
s[1:4]
s[1:4] will be "ポケ", not "ポケット".
I would like to know the simplest and most readable way to get a UTF-8 substring in Julia.

Perhaps this question calls attention to some missing functions in the standard string library (which is supposed to undergo changes in the next version of Julia). In the meantime, if we define:
substr(s,i,j) = s[chr2ind(s,i):chr2ind(s,j)]
Then
substr(s,1,4)
returns "ポケット".

You might want to consider using UTF32String instead of UTF8String, if you are going to be doing this a lot, and only converting to UTF8String if necessary, when you are finished.

Related

Regexp argument to readlines

I'm trying to pass in /\!|\.|\?/ to the separator argument for readlines. It seems it's not possible. Or is it?
f.readlines(/\!|\.|\?/)
I know the alternative is to use read and split, which accepts a Regexp, but I want to know whether this is also possible with readlines.
IO#readlines expects a string, not a regular expression. But the desired behaviour might be easily achieved with read + split since according to the documentation readlines “reads the entire file”:
f.read.split /\!|\.|\?/
Please also read the valuable comment by @tom-lord, which suggests a significant improvement.
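A minimal sketch of the read + split approach, using a StringIO in place of a real file so that it runs on its own. The sample text and variable names are made up for illustration; the character class /[!.?]/ matches the same separators as /\!|\.|\?/.

```ruby
require "stringio"

# StringIO stands in for an opened File; the text is illustrative only.
f = StringIO.new("Hello there! How are you? Fine. Thanks")

# Read the whole content, split on any of ! . ? and trim whitespace.
sentences = f.read.split(/[!.?]/).map(&:strip)

p sentences
```

Note that split drops the separators themselves, so the punctuation is not part of the resulting strings.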

Julia: Strange characters in my string

I scraped some text from the internet and put it in a UTF8String. I can use this string normally, but when I select certain characters (strange characters with accents, such as ú in my case), which I believe are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain them.
Any way to solve this?
EDIT:
I have a variable word of type SubString{UTF8String}
When I call method(word), no problems occur. When I call method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).
Julia indexes strings by byte position instead of character position. That is far more efficient for a variable-length encoding like UTF-8, but it means some operations need a bit more boilerplate.
The problem is that some code points are encoded as multiple bytes, and when you slice the string with 2:end you can end up starting in the middle of a character (which is invalid, so you get an error).
The solution is to use the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
PS. Sorry for a less than clear answer on my phone.
EDIT:
I tried this, and it seems that SubString{UTF8String} and UTF8String behave differently on slicing. I've reported it as bug #7811 on GitHub.

How do I extract strings from a string?

I have a long string, consisting of multiple sentences of various lengths, separated by a "-".
I want to iterate over the string and extract everything between the -'s, preferably to an array.
From another thread I found something that gets me pretty close, but not all the way:
longString.scan( /-([^-]*)-/)
Needless to say, I am new to Ruby, and especially to RegEx.
What's wrong with using String#split?
longString.split('-')
Why not just use string.split()?
longString.split('-');
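To illustrate why split gets all the way there while the scan pattern does not, here is a small sketch with a made-up input string (the variable names are assumptions). Each scan match consumes both the leading and the trailing "-", so the next match cannot reuse that hyphen and every other segment is skipped.

```ruby
# Illustrative input; any hyphen-separated string behaves the same way.
long_string = "a-b-c-d-e"

# Each match eats its closing "-", so "c" has no opening "-" left to match.
scanned = long_string.scan(/-([^-]*)-/)

# split simply cuts at every "-" and keeps all segments.
parts = long_string.split("-")

p scanned
p parts
```

With this input, scan finds only "b" and "d" (each wrapped in its own one-element array, because the pattern has a capture group), while split returns all five segments.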

Ruby hexacode to unicode conversion

I crawled a website which contains Unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
How do I convert it back in Ruby to the original Unicode text, which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
Short answer: you should be able to 'puts a' and see the string printed out. For me, at least, I can print out that string in both 1.8.7 and 1.9.2.
Long answer:
First thing: it depends on whether you're using Ruby 1.8.7 or 1.9.2, since the way strings and encodings are handled changed.
In 1.8.7:
Strings are just lists of bytes. When you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. If you do a[0], you'll get the first byte. If you want to get each character, things are pretty darn tricky.
In 1.9.2:
Strings are lists of bytes with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. If not, you'll have to set it (as per Mike Lewis's answer). If you do a[0], you'll get the first character (the heart). If you want each byte, you can do a.bytes.
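A short sketch of the 1.9-and-later behaviour described above, assuming a UTF-8 string that starts with the heart character (U+2665). The sample string is made up for illustration.

```ruby
# "\u2665" is the heart character; the rest is plain ASCII.
a = "\u2665 ok"

first_char  = a[0]            # character-based indexing
first_bytes = a.bytes.first(3) # the heart takes three bytes in UTF-8

char_count = a.length   # counts characters
byte_count = a.bytesize # counts bytes

p first_char, first_bytes, char_count, byte_count
```

The character count and the byte count differ because the heart is a single character stored as three bytes.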
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P
Here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
This scans for the ASCII string '\u' followed by four hexadecimal digits and replaces it with the corresponding Unicode character.
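As a sketch of the gsub approach on a shortened version of the string from the question: the pattern below restricts the match to exactly four hexadecimal digits, which is an assumption about the escape format rather than something stated in the original answer.

```ruby
# Literal backslash-u sequences, as they came back from the crawler.
a = "\\u2665 \\uc624 \\ube60!"

# Replace each \uXXXX escape with the character for that code point.
# pack("U") produces a UTF-8 encoded string.
decoded = a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }

puts decoded
```

Characters outside the Basic Multilingual Plane are usually escaped as two \uXXXX surrogate halves, which this sketch does not recombine.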
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.

Calculating the size of Array::pack format string

How do you calculate the length of the string that would be returned by Array::pack? Is there something like Python's calcsize?
array.pack(format).length, I would say (pack the array, then measure the resulting string). Not really the fastest method, but it works.
By writing an interpreter for the format string that complies with the specification in the Array::pack documentation.
Or by reusing the existing implementation to count the number of characters instead of appending them to a string.
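A minimal sketch of the "pack and measure" idea, assuming the format contains only fixed-width directives. The format string and the dummy values are made up for illustration; directives such as "a*" or "U" produce a length that depends on the data, so this trick does not apply to them.

```ruby
# N = 32-bit big-endian integer, n = 16-bit big-endian integer,
# a4 = 4-byte string padded with null bytes.
format = "Nna4"

# Dummy values of the right types, one per directive.
dummy = [0, 0, ""]

# Pack once and measure the result in bytes.
size = dummy.pack(format).bytesize

puts size
```

Here the size is 4 + 2 + 4 bytes.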
