Split string with special characters in Ruby - ruby

I need to split a string using special characters like " < > = and white space.
Example of the string:
<fileset dir="/tmp/test/my_test" includes="all"/>
By now I tried using different combinations but no results
Example:
line.split(/<=>"\s+/).each do |line_parsed|
puts line_parsed
end

Regex is not the right tool for parsing XML. You can use any XML parser you like, here I am using Nokogiri:
require 'nokogiri'
doc = Nokogiri::XML(line)
fileset = doc.css('fileset').first
fileset.attr 'dir'
#=> "/tmp/test/my_test"
fileset.attr 'includes'
#=> "all"
If you have another loop à la each_line around the code you showed us, chances are you can drop this altogether and parse the whole document in one run.

Try enclosing the special characters in a "character class" ([...]) and moving the repetition character (+) outside:
parts = line.split(/[<=>"\s]+/)
# => ["", "fileset", "dir", "/tmp/test/my_test", "includes", "all", "/"]
parts[1] # => "fileset"
parts[2] # => "dir"
parts[3] # => "/tmp/test/my_test"

Related

Deleting an unwanted character in a string

I am parsing a CSV file and converting each element in UTF-8 :
CSV.foreach(#data_source, { :col_sep => ';' , quote_char: "\x00", :encoding => "CP850"}) do |row|
row.map! {|x| x.force_encoding('UTF-8') unless x.nil? ; x.scrub!("") unless x.nil? ; x.delete('\u2FEC') unless x.nil? }
end
The script then does a bunch of calculations and then saves the data in xlsx format using axlsx gem.
I added x.delete('\u2FEC') unless x.nil? because I found that in the source file, there was this strange sequence that later causes and "Unreadable content" error in Excel.
I found that it solves the "Unreadable content" issue but it not only deletes the "\u2FEC" sequence, it deletes every occurence of the character "2" too.
Do you have any idea how I can get rid of only "\u2FEC" and not of every "2" in my rows ?
Thanks.
Single-quoted strings don't support Unicode escapes. (In fact, they don't support any escapes other than \' and \\.)
You need to use either a double-quoted string or enter the character directly into the single-quoted string instead of a Unicode escape sequence.

Use ARGV[] argument vector to pass a regular expression in Ruby

I am trying to use gsub or sub on a regex passed through terminal to ARGV[].
Query in terminal: $ruby script.rb input.json "\[\{\"src\"\:\"
Input file first 2 lines:
[{
"src":"http://something.com",
"label":"FOO.jpg","name":"FOO",
"srcName":"FOO.jpg"
}]
[{
"src":"http://something123.com",
"label":"FOO123.jpg",
"name":"FOO123",
"srcName":"FOO123.jpg"
}]
script.rb:
dir = File.dirname(ARGV[0])
output = File.new(dir + "/output_" + Time.now.strftime("%H_%M_%S") + ".json", "w")
open(ARGV[0]).each do |x|
x = x.sub(ARGV[1]),'')
output.puts(x) if !x.nil?
end
output.close
This is very basic stuff really, but I am not quite sure on how to do this. I tried:
Regexp.escape with this pattern: [{"src":".
Escaping the characters and not escaping.
Wrapping the pattern between quotes and not wrapping.
Meditate on this:
I wrote a little script containing:
puts ARGV[0].class
puts ARGV[1].class
and saved it to disk, then ran it using:
ruby ~/Desktop/tests/test.rb foo /abc/
which returned:
String
String
The documentation says:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backlash followed by ‘d’, instead of a digit.
That means that the regular expression, though it appears to be a regex, it isn't, it's a string because ARGV only can return strings because the command-line can only contain strings.
When we pass a string into sub, Ruby recognizes it's not a regular expression, so it treats it as a literal string. Here's the difference in action:
'foo'.sub('/o/', '') # => "foo"
'foo'.sub(/o/, '') # => "fo"
The first can't find "/o/" in "foo" so nothing changes. It can find /o/ though and returns the result after replacing the two "o".
Another way of looking at it is:
'foo'.match('/o/') # => nil
'foo'.match(/o/) # => #<MatchData "o">
where match finds nothing for the string but can find a hit for /o/.
And all that leads to what's happening in your code. Because sub is being passed a string, it's trying to do a literal match for the regex, and won't be able to find it. You need to change the code to:
sub(Regexp.new(ARGV[1]), '')
but that's not all that has to change. Regexp.new(...) will convert what's passed in into a regular expression, but if you're passing in '/o/' the resulting regular expression will be:
Regexp.new('/o/') # => /\/o\//
which is probably not what you want:
'foo'.match(/\/o\//) # => nil
Instead you want:
Regexp.new('o') # => /o/
'foo'.match(/o/) # => #<MatchData "o">
So, besides changing your code, you'll need to make sure that what you pass in is a valid expression, minus any leading and trailing /.
Based on this answer in the thread Convert a string to regular expression ruby, you should use
x = x.sub(/#{ARGV[1]}/,'')
I tested it with this file (test.rb):
puts "You should not see any number [0123456789].".gsub(/#{ARGV[0]}/,'')
I called the file like so:
ruby test.rb "\d+"
# => You should not see any number [].

Replace single quote with backslash single quote

I have a very large string that needs to escape all the single quotes in it, so I can feed it to JavaScript without upsetting it.
I have no control over the external string, so I can't change the source data.
Example:
Cote d'Ivoir -> Cote d\'Ivoir
(the actual string is very long and contains many single quotes)
I'm trying to this by using gsub on the string, but can't get this to work:
a = "Cote d'Ivoir"
a.gsub("'", "\\\'")
but this gives me:
=> "Cote dIvoirIvoir"
I also tried:
a.gsub("'", 92.chr + 39.chr)
but got the same result; I know it's something to do with regular expressions, but I never get those.
The %q delimiters come in handy here:
# %q(a string) is equivalent to a single-quoted string
puts "Cote d'Ivoir".gsub("'", %q(\\\')) #=> Cote d\'Ivoir
The problem is that \' in a gsub replacement means "part of the string after the match".
You're probably best to use either the block syntax:
a = "Cote d'Ivoir"
a.gsub(/'/) {|s| "\\'"}
# => "Cote d\\'Ivoir"
or the Hash syntax:
a.gsub(/'/, {"'" => "\\'"})
There's also the hacky workaround:
a.gsub(/'/, '\#').gsub(/#/, "'")
# prepare a text file containing [ abcd\'efg ]
require "pathname"
backslashed_text = Pathname("/path/to/the/text/file.txt").readlines.first.strip
# puts backslashed_text => abcd\'efg
unslashed_text = "abcd'efg"
unslashed_text.gsub("'", Regexp.escape(%q|\'|)) == backslashed_text # true
# puts unslashed_text.gsub("'", Regexp.escape(%q|\'|)) => abcd\'efg

Remove all non-alphabetical, non-numerical characters from a string?

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.
You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you wanted also punctuation, you can also build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡MarcAndré!"
For all character properties, you can refer to the doc.
string.gsub(/[^[:alnum:]]/, "")
The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.
You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using ruby since you tagged that in your post. You could go through the array, put it through a test using a regexp, and if it passes remove/keep it based on the regexp you use.
A regexp you might use might go something like this:
[^.!,^-#]
That will tell you if its not one of the characters inside the brackets. However, I suggest that you look up regular expressions, you might find a better solution once you know their syntax and usage.
If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.

Cleaning up 'smart' characters from Word in Ruby

I need to clean up various Word 'smart' characters in user input, including but not limited to the following:
– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK
Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?
The HTMLEntities gem will decode the entities to UTF-8.
You could use iconv to transliterate to the closest ASCII equivalents or simple gsub or tr calls. James Grey has some blogs about converting between various character sets showing how to do the transliterations.
require 'htmlentities'
chars = [
'–', # EN DASH
'‘', # LEFT SINGLE QUOTATION MARK
'’' # RIGHT SINGLE QUOTATION MARK
]
decoder = HTMLEntities.new('expanded')
chars.each do |c|
puts "#{ c } => #{ decoder.decode(c) } => #{ decoder.decode(c).tr('–‘’', "-'")} => #{ decoder.decode(c).encoding }"
end
# >> – => – => - => UTF-8
# >> ‘ => ‘ => ' => UTF-8
# >> ’ => ’ => ' => UTF-8
Some gsubs sound like the best bet, especially if you're planning to load an entire extra library to do basically the same thing.

Resources