How do you use unicode characters within a regular expression in Ruby? - ruby

I am attempting to write a line of code that takes a line of Japanese text and deletes a certain set of characters. However, I am having trouble using Unicode characters inside the regular expression.
I am currently using text.gsub(/《.*?》/u, '') but I get the error
'gsub': invalid byte sequence in Windows-31J (Argument error)
Can anyone tell me what I am doing incorrectly?
Example text : その仕草《しぐさ》があまりに無造作《むぞうさ》だったので
Expected result: その仕草があまりに無造作だったので
Thanks
edit: # encoding: utf-8 is present at the top of the script.

Try this:
text.encode('utf-8', 'utf-8').gsub(/《.*?》/u, '')
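The error message suggests that Ruby has tagged the string as Windows-31J even though its bytes are UTF-8. A minimal sketch, assuming that diagnosis is right: relabel the bytes as UTF-8 before matching. The helper name strip_readings is hypothetical, and the mislabeled string only simulates the situation from the question.

```ruby
# Hypothetical helper: removes every 《...》 reading from a line of text.
# Assumption: the bytes really are UTF-8, whatever encoding Ruby has tagged them with.
def strip_readings(text)
  utf8 = text.dup.force_encoding('UTF-8') # relabels only; no bytes are changed
  utf8.gsub(/《.*?》/, '')
end

# Simulate the question's situation: UTF-8 bytes tagged as Windows-31J.
mislabeled = 'その仕草《しぐさ》があまりに無造作《むぞうさ》だったので'.dup.force_encoding('Windows-31J')
puts strip_readings(mislabeled)
```

If the text comes from a file, it may be cleaner to state the encoding when reading it, so the string is tagged correctly from the start.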

Related

json parser error unexpected token

I am getting a json response array as below.
"[{\"id\":\"23886\",\"item_type\":2,\"name\":\"Equalizer\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/0\\/tf1_media_ingest95290_image\\/tf1_media_ingest95290_image_0_208x277.jpg\",\"url\":\"\\/films\\/media-23886-Equalizer.html\",\"duration\":\"2h27mn\",\"durationtime\":\"8865\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12771\",\"hdbundle\":\"12771\",\"teaser\":\"23887\",\"att_getter\":\"Tout le monde a le droit \\u00e0 la justice\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_6\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"4.0\",\"criticsrating\":\"3.0\",\"onThisPf\":1},{\"id\":\"23998\",\"item_type\":2,\"name\":\"Le Labyrinthe\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/1\\/tf1_media_ingest94727_image\\/tf1_media_ingest94727_image_1_208x277.jpg\",\"url\":\"\\/films\\/media-23998-Le_Labyrinthe.html\",\"duration\":\"1h48mn\",\"durationtime\":\"6533\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12699\",\"hdbundle\":\"12699\",\"teaser\":\"23999\",\"att_getter\":\"Saurez-vous r\\u00e9chapper du labyrinthe ?\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_1\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"3.5\",\"criticsrating\":\"4.0\",\"onThisPf\":1},{\"id\":\"23688\",\"item_type\":2,\"name\":\"Gone 
Girl\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/0\\/tf1_media_ingest92895_image\\/tf1_media_ingest92895_image_0_208x277.jpg\",\"url\":\"\\/films\\/media-23688-Gone_Girl.html\",\"duration\":\"2h22mn\",\"durationtime\":\"8579\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12507\",\"hdbundle\":\"12507\",\"teaser\":\"23689\",\"att_getter\":\"Il ne faut pas se fier aux apparences...\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_2\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"4.0\",\"criticsrating\":\"4.5\",\"onThisPf\":1}]"
When I try to parse it, I get an "unexpected token" parser error. At first I believed this was due to the quotes at the beginning and end of the response, but I was wrong about that, and I am not sure why it happens.
Is there anything wrong with the JSON response array?
I tried to parse it as follows:
JSON.parse(File.read('demo'))
The demo file contains the JSON response pasted above.
First of all, the JSON you posted is a Ruby String literal, and Ruby parses the resulting string as JSON without error. However, if you paste that literal into a file, the file will not contain valid JSON because of the escape sequences, the most numerous of which is \".
In a double-quoted Ruby string, the two-character sequence \" is converted to one character, a ". In a file, that same sequence stays two characters long: a \ and a ". In other words, escape sequences that are legal inside a Ruby String do not represent the same thing when pasted into a file.
Another example: in a double-quoted Ruby String the escape sequence \u20AC is a single character, the Euro sign. However, if you paste that sequence into a file, it will be six characters long: a \, a u, a 2, a 0, an A, and a C.
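The difference can be checked directly in Ruby. In this small sketch a single-quoted literal stands in for the raw file content, because single quotes keep the backslashes as they are. The helper valid_json? is hypothetical, and the JSON is shortened to one field.

```ruby
require 'json'

# Hypothetical helper: true when the text parses as JSON.
def valid_json?(text)
  JSON.parse(text)
  true
rescue JSON::ParserError
  false
end

as_ruby_literal = "[{\"id\":\"23886\"}]" # double quotes: \" becomes a single " character
as_file_content = '[{\"id\":\"23886\"}]' # single quotes keep the backslashes, like a pasted file

puts as_ruby_literal              # [{"id":"23886"}]
puts valid_json?(as_ruby_literal) # true
puts valid_json?(as_file_content) # false
puts "\u20AC".length              # 1
puts '\u20AC'.length              # 6
```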
Response to comment:
There is an invisible byte order mark (BOM) at the start of the json, which you can see by executing:
p resp
...which produces the output:
\xEF\xBB\xBF[{\"id\":\"2388\" .....
The UTF-8 representation of the BOM is the byte sequence
0xEF,0xBB,0xBF
Byte order has no meaning in UTF-8, so its only use in UTF-8 is to
signal at the start that the text stream is encoded in UTF-8.
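You can skip the first 3 bytes like this (String#[] counts characters, so this drops exactly the BOM only when each BOM byte counts as one character, as in the p output above; in a string tagged as UTF-8 the BOM is a single character):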
resp[3..-1]
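A byte-based variant avoids depending on how the string happens to be tagged. This is a minimal sketch; the helper name strip_utf8_bom is hypothetical, and the sample response is shortened to one field.

```ruby
require 'json'

# Hypothetical helper: drops a leading UTF-8 BOM (bytes EF BB BF), if present,
# and returns the remaining bytes tagged as UTF-8. It works on bytes, so the
# encoding tag of the input does not matter.
def strip_utf8_bom(str)
  bytes = str.b # binary copy of the input
  bytes = bytes.byteslice(3..-1) if bytes.start_with?("\xEF\xBB\xBF".b)
  bytes.force_encoding('UTF-8')
end

resp = "\xEF\xBB\xBF[{\"id\":\"23886\"}]"
puts JSON.parse(strip_utf8_bom(resp)).first['id']
```

When the data comes from a file, opening it with the mode 'r:BOM|UTF-8' should make Ruby skip the BOM while reading, which avoids the manual step.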
I had this error when reading JSON files, and it turned out that JSON.parse somehow did not like my UTF-8-encoded files. When I first re-encoded the files to ISO 8859-1 (a superset of ASCII), everything went fine.
Try this. It works.
require 'json'
my_obj = JSON.parse("your json string", :symbolize_names => true)
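For illustration, here is what the symbolize_names option changes. The helper name parse_with_symbols is hypothetical and the JSON is a shortened version of the response in the question. Note that the option only affects the hash keys; it does not remove a byte order mark.

```ruby
require 'json'

# Hypothetical helper: parses JSON text and returns hashes keyed by symbols.
def parse_with_symbols(json_text)
  JSON.parse(json_text, symbolize_names: true)
end

items = parse_with_symbols('[{"id":"23886","name":"Equalizer","sd":true}]')
puts items.first[:name] # Equalizer
puts items.first[:sd]   # true
```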

How can I get the char for a given UTF-8 code in Ruby 2.1

I was wondering if there is a way to get the character for a given UTF-8 code (that is, a Unicode code point)?
E.g.:
1103 => "я" (Russian letter)
Using Array#pack with U directive (UTF-8 character):
[1103].pack('U')
# => "я"
Another approach is "\u{hex}", e.g. "\u{4355}". This syntax accepts only hexadecimal numbers, not decimal, so 1103 has to be written as "\u{44F}". The U+XXXX notation (e.g. U+184B) is the most commonly used one for referencing Unicode characters.
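The approaches side by side, as a small sketch. The helper name char_for is hypothetical; Integer#chr with an explicit encoding is a further option not mentioned above.

```ruby
# Hypothetical helper: the character for a decimal Unicode code point.
def char_for(code_point)
  [code_point].pack('U')
end

puts char_for(1103)            # я
puts 1103.chr(Encoding::UTF_8) # я (Integer#chr with an explicit encoding)
puts "\u{44F}"                 # я (1103 is 44F in hexadecimal)
puts 1103.to_s(16)             # 44f
puts 'я'.ord                   # 1103
```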

Incompatible character encodings error

I'm trying to run a ruby script which generates translated HTML files from a JSON file. However I get this error:
incompatible character encodings: UTF-8 and CP850
Ruby
translation_hash = JSON.parse(File.read('translation_master.json').force_encoding("ISO-8859-1").encode("utf-8", replace: nil))
It seems to get stuck on this line of the JSON:
Json
"3": "Klassisch geschnittene Anzüge",
because there is a special character "ü". The JSON file's encoding is ANSI. Any ideas what could be wrong?
Try adding # encoding: UTF-8 to the top of the Ruby file. This tells Ruby to read the source file, including its string literals, as UTF-8. If that doesn't work, find out which encoding the text actually uses and change the line accordingly.
IMHO your code should work if the JSON file really is encoded as ISO-8859-1 and is valid JSON.
So you should first verify that ISO-8859-1 is the correct encoding and that the file is valid JSON.
# read the file and tag it with the encoding you assume is correct
json_or_not = File.read('translation_master.json').force_encoding("ISO-8859-1")
# print the result and check whether anything looks garbled
puts json_or_not
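A self-contained sketch of the conversion step, using an in-memory string instead of the real file. The helper name parse_latin1_json is hypothetical. It assumes the bytes are ISO-8859-1; files described as "ANSI" on Windows are often Windows-1252, which differs from ISO-8859-1 in a few positions, so the source encoding may need adjusting.

```ruby
require 'json'

# Hypothetical helper: parses JSON whose bytes are ISO-8859-1 and returns
# a structure containing UTF-8 strings.
def parse_latin1_json(bytes)
  utf8 = bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
  JSON.parse(utf8)
end

# In ISO-8859-1 the letter "ü" is the single byte 0xFC.
raw = "{\"3\": \"Klassisch geschnittene Anz\xFCge\"}".b
hash = parse_latin1_json(raw)
puts hash['3']          # Klassisch geschnittene Anzüge
puts hash['3'].encoding # UTF-8
```

The CP850 in the error message is probably the Windows console code page picked up as Ruby's default external encoding. That is a guess from the message, not something the question confirms.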

How to use şŞıİçÇöÖüÜĞğ characters in a regular expression in Ruby?

I tried using a regular expression to capture names:
r[1].scan(/^([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞıİçÇöÖüÜĞğ])*\s([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞıİçÇöÖüÜĞğ])*/u)
But, it gives me an error:
syntax error, unexpected $end, expecting ')'
...atches = r[1].scan(/^([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞ�...
...
I see that the problem is the Turkish characters I'm using. Is it possible to use the Unicode values of the characters in a regexp? How can I use these problematic characters in this regexp?
Use Ruby 1.9 or later, where regular expressions understand Unicode character properties.
Go with /\p{Word}+\p{Space}\p{Word}*/
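If the upper/lower-case structure of the original pattern matters, the Unicode properties \p{Lu} and \p{Ll} can replace the hand-written character classes. This is a sketch rather than a drop-in replacement: the helper name leading_name is hypothetical, it anchors with \A instead of ^, and it is stricter than the original pattern, which also allowed uppercase Turkish letters after the first one.

```ruby
# Hypothetical helper: returns a leading "Firstname Lastname" pair, or nil.
# \p{Lu} matches any uppercase letter and \p{Ll} any lowercase letter,
# including the Turkish ones.
def leading_name(line)
  line[/\A\p{Lu}\p{Ll}*\s\p{Lu}\p{Ll}*/]
end

puts leading_name('Şükrü Öztürk geldi') # Şükrü Öztürk
p leading_name('şükrü öztürk')          # nil
```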

CreateTextfile() > write does not work

VBScript raises an "Illegal Argument" error when I try to write the text shown below to a file using the following code. If I change resultStr to some test text, it works. What could be the problem?
Set resFile = fs.CreateTextfile(resFilePath, true)
resFile.write resultStr
resFile.close
Contents of resultStr:
Your string looks like it contains non-ASCII characters. You need to pass True as the third (unicode) argument to CreateTextfile so that the text file is created with a Unicode encoding (UTF-16 on Windows) instead of ASCII.
If you want to write UTF-8 to the file, see Writing UTF8 text to file.
