How to match any quoted strings containing Cyrillic symbols - ruby

I need to parse a lot of text files and replace any quoted strings containing Cyrillic symbols. The strings may contain newlines, non-alphabetic characters, and special symbols (for example '$' or an escaped quote).
Can anyone help with a regex?
From comments:
For example, in this PHP code:
function hello($word) {
$word2 = "ха-ха!";
echo "Привет, $word $word2\n";
}
hello('Мир');
I need to match "ха-ха!", "Привет, $word $word2\n" and 'Мир'.

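This should work for double-quoted strings that start with a Cyrillic character and sit on one line:
str = 'The cat is under the "таблица"'
regex = /"\p{Cyrillic}+.*?\.?"/ui
# scan yields every match; match would stop at the first one
str.scan(regex){|s| do_stuff_with_each_matching s}
# or, to replace each match in place...
str.gsub!(regex){|s| method_that_translates_russian s}
Check it out live at http://rubular.com/r/0Mwbfinjvp.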
http://www.ruby-doc.org/core-1.9.3/Regexp.html
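The question also asks for single-quoted strings, escaped quotes, and strings spanning several lines. One way to cover those, sketched here under assumptions, is to match every quoted literal first and then keep only the ones containing a Cyrillic character. The names QUOTED, CYRILLIC, cyrillic_literals and replace_cyrillic_literals are hypothetical, and the `$lang = "en";` line is added to the sample only to show a literal that is skipped. This is not a PHP parser: quotes inside comments or heredocs would still be picked up.

```ruby
# Any single- or double-quoted literal. A backslash plus the character after
# it is consumed as a pair, so an escaped quote does not end the literal.
# The m flag lets the dot in that pair match a newline as well.
QUOTED = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m
CYRILLIC = /\p{Cyrillic}/

# Returns every quoted literal in source that contains a Cyrillic character.
def cyrillic_literals(source)
  source.scan(QUOTED).select { |literal| literal.match?(CYRILLIC) }
end

# Replaces each such literal with the block's result; other literals are kept.
def replace_cyrillic_literals(source)
  source.gsub(QUOTED) { |literal| literal.match?(CYRILLIC) ? yield(literal) : literal }
end

# Single-quoted heredoc: no interpolation, backslashes stay literal.
php = <<~'PHP'
  function hello($word) {
    $lang = "en";
    $word2 = "ха-ха!";
    echo "Привет, $word $word2\n";
  }
  hello('Мир');
PHP

puts cyrillic_literals(php)
puts replace_cyrillic_literals(php) { |literal| "t(#{literal})" }
```

Matching all literals first keeps the quotes paired correctly, so text between two unrelated strings is never mistaken for a string body.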

".*[^a-zA-Z\d]+.*" matches any quoted character sequence containing at least one non-alphanumeric character.
i.e. it matches "aa$bb" and "a1$b1"
It doesn't match "aabb" or a$b.
I hope this is what you want (add whatever escaping your language requires).
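A quick way to check those claims in Ruby is to run the pattern against the four samples mentioned above. The variable names are only for illustration.

```ruby
pattern = /".*[^a-zA-Z\d]+.*"/

# Single-quoted Ruby strings, so the double quotes and $ are plain characters.
samples = ['"aa$bb"', '"a1$b1"', '"aabb"', 'a$b']

results = samples.map { |sample| pattern.match?(sample) }

samples.zip(results).each do |sample, matched|
  puts "#{sample}: #{matched}"
end
```

Because `.*` is greedy and quotes and spaces are themselves non-alphanumeric, a line such as `"aabb" "ccdd"` is matched as one long string, so this pattern is only safe when each line holds at most one quoted string.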

Related

Unable to substitute escaped characters in string

I have this string:
str = "no,\"contact_last_name\",\"token\""
=> "no,\"contact_last_name\",\"token\""
I want to remove the escaped double-quote characters (\"). I use gsub:
result = str.gsub('\\"','')
=> "no,\"contact_last_name\",\"token\""
It appears that gsub has not removed the escaped double quotes from the string.
Why am I trying to do this? I have this csv file:
no,"contact_last_name","token",company,urbanization,sec-"property_address","property_address",city-state-zip,ase,oel,presorttrayid,presortdate,imbno,encodedimbno,fca,"property_city","property_state","property_zip"
1,MARIE A JEANTY,1083123,,,,17 SW 6TH AVE,DANIA BEACH FL 33004-3260,Electronic Service Requested,,T00215,12/14/2016,00-314-901373799-105112-33004-3260-17,TATTTADTATTDDDTTFDDFATFTDDDTTFADTTDFAAADDATDAATTFDTDFTTAFFTTATFFF,017,DANIA BEACH,FL, 33004-3260
When I try to open it with CSV, I get the following error:
CSV.foreach(path, headers: true) do |row|
end
CSV::MalformedCSVError: Illegal quoting in line 1.
Once I removed those double quoted strings in the first row (the header), the error went away. So I am trying to remove those double quoted strings before I run it through CSV:
file = File.open "file.csv"
contents = file.read
"no,\"contact_last_name\",\"token\" ... "
contents.gsub!('\\"','')
So again, my question is: why is gsub not removing the specified characters? Note that this actually does work:
contents.gsub /"/, ""
as if the string is ignoring the \ character.
There is no escaped double quote in this string:
"no,\"contact_last_name\",\"token\""
The interpreter recognizes the text above as a string because it is enclosed in double quotes. And because of the same reason, the double quotes embedded in the string must be escaped; otherwise they signal the end of the string.
The enclosing double quote characters are part of the language, not part of the string. The backslash (\) escape is likewise the language's way of putting characters that otherwise have special meaning (double quotes, for example) inside a string.
The actual string stored in the str variable is:
no,"contact_last_name","token"
You can check this for yourself if you tell the interpreter to put the string on screen (puts str).
To answer the question in the title: your attempts to substitute the escaped characters failed simply because the string doesn't contain the character sequence you tried to find and replace.
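This can be verified directly. The sketch below uses the string from the question and checks that it holds no backslash at all, which is why the two-character pattern \" never matches.

```ruby
str = "no,\"contact_last_name\",\"token\""

puts str                        # the raw characters, without any backslashes
puts str.include?("\\")         # false: no backslash is stored in the string
puts str.gsub('\\"', '') == str # true: the pattern backslash-quote never occurs

# Removing the quotes themselves only needs the quote character.
without_quotes = str.delete('"')
puts without_quotes
```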
And the actual problem is that the CSV file is malformed. The 6th value on the first row (sec-"property_address") doesn't follow the format of a correctly encoded CSV file.
It should read either sec-property_address or "sec-property_address"; i.e. the value should be either not enclosed in quotes at all or completely enclosed in quotes. Having it partially enclosed in quotes confuses Ruby's CSV parser.
The string looks fine; you're not understanding what you're seeing. Meditate on this:
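If the file cannot be fixed at its source, one possible workaround is to strip the quotes from the header line only before handing the text to CSV. The sketch below uses a shortened sample (the first seven columns) and assumes that no header name contains a comma or a quote that genuinely needs quoting.

```ruby
require "csv"

# Shortened sample of the file from the question (first seven columns only).
contents = <<~CSV_TEXT
  no,"contact_last_name","token",company,urbanization,sec-"property_address","property_address"
  1,MARIE A JEANTY,1083123,,,,17 SW 6TH AVE
CSV_TEXT

# Split off the first line, drop its quotes, and leave the data rows untouched.
header, body = contents.split("\n", 2)
cleaned = "#{header.delete('"')}\n#{body}"

table = CSV.parse(cleaned, headers: true)
puts table.headers.inspect
puts table[0]["contact_last_name"]
```

Only the header is rewritten, so quoted values in the data rows keep their meaning.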
"no,\"contact_last_name\",\"token\"" # => "no,\"contact_last_name\",\"token\""
'no,"contact_last_name","token"' # => "no,\"contact_last_name\",\"token\""
%q[no,"contact_last_name","token"] # => "no,\"contact_last_name\",\"token\""
%Q#no,"contact_last_name","token"# # => "no,\"contact_last_name\",\"token\""
When looking at a string that is delimited by double-quotes, it's necessary to escape certain characters, such as embedded double-quotes. Ruby, along with many other languages, has multiple ways of defining a string to remove that need.

What is an escape character in Ruby?

I would like to split lines that contain [ (an opening bracket). However, when I type this as /[/, it is treated as a comment.
You need to escape the [ char like /\[/.
I infer that you're using string.split, which can take a regex (the part between the slashes) as the delimiter on which to split the string into a list.
Regexes treat the [ and ] characters specially: together they form a character class, which matches any one of the characters inside.
[abc] => matches a, b, or c
Since you need to match the [ symbol literally, you have to escape it with a backslash (\).
So, write your split as:
string.split(/\[/)
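A small sketch of the escaped pattern in use, with made-up sample strings. It also shows that a plain string delimiter needs no escaping, since String#split treats a string argument literally.

```ruby
line = "name[first][last]"

by_regex  = line.split(/\[/)  # escaped bracket inside a regex
by_string = line.split("[")   # a string delimiter is taken literally

p by_regex
p by_string

# Selecting only the lines that contain an opening bracket.
lines = ["plain text", "items[0]", "no brackets here"]
with_bracket = lines.grep(/\[/)
p with_bracket
```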

Adding underscore to variable

I have a variable title. It can look like:
title = 'One two three'
Is it possible to replace the blanks with underscores?
Sure! What you want is either gsub or gsub! depending on your use case.
title = "One two three".gsub(/\s+/, "_")
replaces each run of whitespace with a single underscore and stores the result in title.
If title already holds the string, you can do
title.gsub!(/\s+/, "_")
and it will perform the same substitution on title in place.
Yes, you can use the gsub method:
title = 'One two three'.gsub(/ /, '_')
title = 'One two three'.tr(" ", "_")
You can also split the string, automatically removing extra white space with .split and then rejoin the words with .join('_')
So title.split.join('_')
This has the benefit of not putting underscores or hyphens or whatever in the place of any trailing or leading spaces.
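The approaches in the answers above differ once the input has leading, trailing, or repeated spaces. The comparison below uses a made-up title to show each result side by side.

```ruby
title = "  One  two three "

runs_collapsed = title.gsub(/\s+/, "_")  # each run of whitespace becomes one underscore
each_space     = title.gsub(/ /, "_")    # every single space becomes an underscore
with_tr        = title.tr(" ", "_")      # same result as the line above, character by character
split_join     = title.split.join("_")   # surrounding whitespace is dropped first

p runs_collapsed
p each_space
p with_tr
p split_join
```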

regex any non-digit with exception

I've got strings like these:
+996999966966AA
-996999966966AA
I am using this code:
"+996999966966AA".gsub!(/\D/, "")
to get rid of every character except digits, but the + sign is also stripped. How can my code retain the +?
Use:
[^+\d]
to match anything that isn't + or a digit.
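Applied to the two strings from the question, the character class could be used like this. Inside the brackets the + is a literal character, so it needs no escaping.

```ruby
numbers = ["+996999966966AA", "-996999966966AA"]

# Remove everything that is neither a plus sign nor a digit.
cleaned = numbers.map { |number| number.gsub(/[^+\d]/, "") }

p cleaned
```

The sketch uses gsub rather than gsub! because gsub! returns nil when nothing was replaced, which is easy to trip over when the result is used directly.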
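You can also use \W ("non-word character"), which matches any character that is not a word character (alphanumeric or underscore). For example, this pattern captures the leading sign and the digits in group 1 and leaves the trailing letters outside the group: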
(\W\d+)\w+

How to find string containing regex special chars with regex

I have the following piece of code:
details =~ /.#{action.name}.*/
If action.name contains an ordinary string such as "abcd", everything works,
but if action.name contains special characters such as . or /, I get an exception.
Is there a way to match against action.name without having to put \ before every special character in it?
You can escape all special characters using Regexp::escape.
Try:
details =~ /.#{Regexp.escape(action.name)}.*/
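A minimal sketch of the difference, using a hypothetical Action struct and sample values in place of the asker's objects. The name contains an unmatched parenthesis, which makes the unescaped pattern invalid.

```ruby
# Hypothetical stand-ins for the objects in the question.
Action = Struct.new(:name)
action = Action.new("v1.0 (beta")
details = "ran v1.0 (beta) ok"

# Without escaping, the "(" is read as the start of a group and the
# regex fails to compile at runtime.
unescaped_error =
  begin
    details =~ /.#{action.name}.*/
    nil
  rescue RegexpError => e
    e
  end
puts "unescaped: #{unescaped_error.class}"

# With Regexp.escape, every special character is matched literally.
position = details =~ /.#{Regexp.escape(action.name)}.*/
puts "escaped match starts at index #{position}"
```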
