How do I parse YAML with nil values? - ruby

I apologize for the very specific issue I'm posting here but I hope it will help others that may also run across this issue. I have a string that is being formatted to the following:
[[,action1,,],[action2],[]]
I would like to translate this to valid YAML so that it can be parsed which would look like this:
[['','action1','',''],['action2'],['']]
I've tried a bunch of regular expressions to accomplish this but I'm afraid that I'm at a complete loss. I'm ok with running multiple expressions if needed. For example (ruby):
puts s.gsub!(/,/,"','") # => [[','action1','',']','[action2]','[]]
puts s.gsub!(/\[',/, "['',") # => [['','action1','',']','[action2]','[]]
That's getting there, but I have a feeling I'm starting to go down a rat-hole with this approach. Is there a better way to accomplish this?
Thanks for the help!

This does the job for the empty fields (ruby1.9):
s.gsub(/(?<=[\[,])(?=[,\]])/, "''")
Or for ruby1.8, which doesn't support zero-width look-behind:
s.gsub(/([\[,])(?=[,\]])/, "\\1''")
Quoting non-empty fields can be done with one of these:
s.gsub(/(?<=[\[,])\b|\b(?=[,\]])/, "'")
s.gsub(/(\w+)/, "'\\1'")
In the above I'm making use of zero-width positive look behind and zero-width positive look ahead assertions (the '(?<=' and '(?=').
I've looked for some ruby specific documentation but could not find anything that explains these features in particular. Instead, please let me refer you to perlre.
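As a minimal sketch, the two substitutions above can be chained and the result handed to Ruby's standard YAML library. The variable names are illustrative only, and the look-behind form assumes Ruby 1.9 or later:

```ruby
require 'yaml'

s = "[[,action1,,],[action2],[]]"

# Step 1: insert '' wherever a field is empty, i.e. at any position that
# follows "[" or "," and is immediately followed by "," or "]".
with_empties = s.gsub(/(?<=[\[,])(?=[,\]])/, "''")

# Step 2: wrap every remaining bare word in single quotes.
quoted = with_empties.gsub(/(\w+)/, "'\\1'")

# Step 3: the string is now valid YAML flow syntax.
parsed = YAML.load(quoted)

puts quoted
p parsed
```

Note that this approach turns `[]` into `['']`, as the question asks; it does not preserve truly empty arrays.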

It would be easier to just parse it, then output valid YAML.
Since I don't know Ruby, here is an example in Perl.
Since you only want a subset of YAML, that appears to be similar to JSON, I used the JSON module.
I've been wanting an excuse to use Regexp::Grammars, so I used it to parse the data.
I guarantee it will work, no matter how deep the arrays are.
#! /usr/bin/env perl
use strict;
#use warnings;
use 5.010;
#use YAML;
use JSON;
use Regexp::Grammars;
my $str = '[[,action1,,],[action2],[],[,],[,[],]]';
my $parser = qr{
<match=Array>
<token: Text>
[^,\[\]]*
<token: Element>
(?:
<.Text>
|
<MATCH=Array>
)
<token: Array>
\[
(?:
(?{ $MATCH = [qw'']; })
|
<[MATCH=Element]> ** (,)
)
\]
}x;
if( $str =~ $parser ){
say to_json $/{match};
}else{
die $@ if $@;
}
Which outputs:
[["","action1","",""],["action2"],[],["",""],["",[],""]]
If you really want YAML, just uncomment "use YAML;" and replace to_json() with Dump():
---
-
  - ''
  - action1
  - ''
  - ''
-
  - action2
- []
-
  - ''
  - ''
-
  - ''
  - []
  - ''
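For readers who want the same parse-then-dump idea in Ruby, here is a rough sketch using the standard library's StringScanner. It is not a port of the Regexp::Grammars code above; the method names parse_array and parse_nested are made up for this example, and it assumes the input contains only brackets, commas, and plain text:

```ruby
require 'strscan'
require 'yaml'

# Recursively parse one "[...]" group; empty fields become "".
def parse_array(scanner)
  scanner.scan(/\[/) or raise ArgumentError, "expected '[' at #{scanner.pos}"
  items = []
  unless scanner.scan(/\]/) # "[]" is a truly empty array
    loop do
      # An element is either a nested array or a (possibly empty) run of text.
      items << (scanner.check(/\[/) ? parse_array(scanner) : scanner.scan(/[^,\[\]]*/))
      break if scanner.scan(/\]/)
      scanner.scan(/,/) or raise ArgumentError, "expected ',' or ']' at #{scanner.pos}"
    end
  end
  items
end

def parse_nested(text)
  scanner = StringScanner.new(text)
  result = parse_array(scanner)
  raise ArgumentError, "trailing input at #{scanner.pos}" unless scanner.eos?
  result
end

data = parse_nested('[[,action1,,],[action2],[],[,],[,[],]]')
puts data.to_yaml
```

Because the parser recurses on every `[`, nesting depth is limited only by the call stack.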

Try this:
s.gsub(/([\[,])(?=[,\]])/, "\\1''")
.gsub(/([\[,])(?=[^'\[])|([^\]'])(?=[,\]])/, "\\+'");
EDIT: I'm not sure about the replacement syntax. That's supposed to be group #1 in the first gsub, and the highest-numbered participating group -- $+ -- in the second.

Related

Deleting all special characters from a string - ruby

I was doing the challenges from pythonchallenge writing code in ruby, specifically this one. It contains a really long string in page source with special characters. I was trying to find a way to delete them/check for the alphabetical chars.
I tried using the scan method, but I think I might not be using it properly. I also tried delete! like this:
a = "PAGE SOURCE CODE PASTED HERE"
a.delete! "!", "#" #and so on with special chars, does not work(?)
a
How can I do that?
Thanks
You can do this:
a.gsub!(/[^0-9A-Za-z]/, '')
Try gsub:
a.gsub!(/[!#%&"]/,'')
You can try the regexp on rubular.com.
If you want something more general, you can list the valid characters and remove everything that's not in the list:
a.gsub!(/[^abcdefghijklmnopqrstuvwxyz ]/,'')
When you give multiple arguments to String#delete, it's the intersection of those arguments that is deleted. a.delete! "!", "#" deletes the intersection of the sets ! and #, which is empty, so nothing is deleted and the method returns nil.
What you wanted to do is a.delete! "!#" with the characters to delete passed as a single string.
Since the challenge is asking to clean up the mess and find a message in it, I would go with a whitelist instead of deleting special characters. The delete method accepts ranges with - and negations with ^ (similar to a regex) so you can do something like this: a.delete! "^A-Za-z ".
You could also use regular expressions as shown by @arieljuod.
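A small sketch of the three String#delete behaviours described above; the sample string is invented for illustration:

```ruby
a = 'He!!o, #wor#ld'

# Two arguments: only characters present in BOTH sets are deleted.
# "!" and "#" share nothing, so the string comes back unchanged.
two_args = a.delete('!', '#')

# One argument listing every character to delete.
one_arg = a.delete('!#')

# Whitelist: "^" negates the set, so everything except letters and
# spaces is deleted.
whitelist = a.delete('^A-Za-z ')

# The bang version returns nil when it changes nothing.
bang_result = a.dup.delete!('!', '#')

p two_args, one_arg, whitelist, bang_result
```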
gsub is one of the most used Ruby methods in the wild.
specialname="Hello!#$#"
cleanedname = specialname.gsub(/[^a-zA-Z0-9\-]/,"")
I think a.gsub(/[^A-Za-z0-9 ]/, '') works better in this case. Otherwise, if you have a sentence, which typically should start with a capital letter, you will lose your capital letter. You would also lose any 1337 speak, or other possible crypts within the text.
Case in point:
phrase = "Joe can't tell between 'large' and large."
=> "Joe can't tell between 'large' and large."
phrase.gsub(/[^a-z ]/, '')
=> "oe cant tell between large and large"
phrase.gsub(/[^A-Za-z0-9 ]/, '')
=> "Joe cant tell between large and large"
phrase2 = "W3 a11 f10a7 d0wn h3r3!"
phrase2.gsub(/[^a-z ]/, '')
=> " a fa dwn hr"
phrase2.gsub(/[^A-Za-z0-9 ]/, '')
=> "W3 a11 f10a7 d0wn h3r3"
If you don't want to change the original string (for example, if you just want to solve the challenge), you can iterate over the characters and print the ones that match:
str.each_char do |letter|
if letter =~ /[a-z]/
p letter
end
end
You will have to write your own string sanitizing function; you can easily do it with a regex and the gsub method.
Atomic sample:
your_text.gsub!(/[!#\[;\]^%*\(\);\-_\/&\\|$\{#\}<>:`~"]/,'')
API sample:
Route: post 'api/sanitize_text', to: 'api#sanitize_text'
Controller:
def sanitize_text
return render_bad_request unless params[:text].present?
# Use gsub rather than gsub!: gsub! returns nil when nothing was replaced.
sanitized_text = params[:text].gsub(/[!#\[;\]^%*\(\);\-_\/&\\|$\{#\}<>:`~"]/,'')
render_response( {safe_text: sanitized_text})
end
Then you call it
POST /api/sanitize_text?text=abcdefghijklmnopqrstuvwxyz123456<>$!#%23^%26*[]:;{}()`,.~'"\|/

Ruby gsub / regex with several arguments [duplicate]

I'm new to ruby and I'm trying to solve a problem.
I'm parsing several text fields from which I want to remove the header, which can have different values. It works fine when the header is always the same:
variable = variable.gsub(/(^Header_1:$)/, '')
But when I put in several arguments it doesn't work:
variable = variable.gsub(/(^Header_1$)/ || /(^Header_2$)/ || /(^Header_3$)/ || /(^Header_4$)/ || /^:$/, '')
You can use Regexp.union:
regex = Regexp.union(
/^Header_1/,
/^Header_2/,
/^Header_3/,
/^Header_4/,
/^:$/
)
variable.gsub(regex, '')
Please note that ^something$ will not match a line that contains anything more than something :)
That's because in Ruby ^ matches at the beginning of a line and $ at the end of a line (use \A and \z to anchor to the whole string).
So I intentionally removed the $.
Also, you do not need parentheses (capture groups) when you only want to remove the matched string.
You can also use it like this:
headers = %w[Header_1 Header_2 Header_3]
regex = Regexp.union(*headers.map{|s| /^#{s}/}, /^\:$/, /etc/)
variable.gsub(regex, '')
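To see why the `||` version in the question only ever removes the first header, and how Regexp.union behaves on multi-line text, here is a short sketch. The header names and sample text are assumptions made for illustration:

```ruby
# Every Regexp object is truthy, so `a || b` simply evaluates to `a`.
a = /^Header_1$/
b = /^Header_2$/
first_only = a || b

# Build one pattern from a list of hypothetical header names.
# Regexp.escape guards against names that contain regex metacharacters.
headers = %w[Header_1 Header_2 Header_3 Header_4]
regex = Regexp.union(headers.map { |h| /^#{Regexp.escape(h)}:?$/ })

# ^ and $ anchor to lines, so headers anywhere in the text are matched.
text = "Header_1:\nbody line\nHeader_3\nmore body\n"
cleaned = text.gsub(regex, '')

p first_only
p cleaned
```

The header lines themselves are emptied but their newlines remain; add `\n?` to the pattern if the blank lines should go too.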
And of course you can remove headers without explicitly defining them.
Most likely there is white space after each header?
If so, you can do it as simply as:
variable = "Header_1 something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
variable = "Header_BLAH something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
Just use a proper regexp:
variable.gsub(/^(Header_1|Header_2|Header_3|Header_4|:)$/, '')
If the header is always the same format of Header_n, where n is some integer value, then you can simplify your regex greatly:
/Header_\d+/
will find every one of these:
%w[Header_1 Header_2 Header_3].grep(/Header_\d+/)
[
[0] "Header_1",
[1] "Header_2",
[2] "Header_3"
]
Tweaking it to handle finding words, not substrings:
/^Header_\d+$/
or:
/\bHeader_\d+\b/
As mentioned, using Regexp.union is a good start, but, used blindly, can result in very slow or inefficient patterns, so think ahead and help out the engine by giving it useful sub-patterns to work with:
values = %w[foo bar]
/Header_(?:\d+|#{ values.join('|') })/
=> /Header_(?:\d+|foo|bar)/
Unfortunately, Ruby doesn't have the equivalent to Perl's Regexp::Assemble module, which can build highly optimized patterns from big lists of words. Search here on Stack Overflow for examples of what it can do. For instance:
use Regexp::Assemble;
my @values = ('Header_1', 'Header_2', 'foo', 'bar', 'Header_3');
my $ra = Regexp::Assemble->new;
foreach (@values) {
$ra->add($_);
}
print $ra->re, "\n";
=> (?-xism:(?:Header_[123]|bar|foo))

More Efficient Way to Find/Replace Non-Escaped Characters

I'm trying to find the best way to find and replace (in Ruby 1.9.2) all instances of a special code (%x) preceded by zero, or an even number of backslashes.
In other words:
%x --> FOO
\%x --> \%x
\\%x --> \\FOO
\\\%x --> \\\%x
\\\\%x --> \\\\FOO
etc.
There may be multiple instances in a string: "This is my %x string with two %x codes."
With help from the questions asked here and here I got the following code to do what I want:
str.gsub(/
(?<!\\) # Not preceded by a single backslash
((?:\\\\)*) # Eat up any sets of double backslashes - match group 1
(%x) # Match the code itself - match group 2
/x,
# Keep the double backslashes (match group 1) then put in the sub
"\\1foo")
That regex seems kind of heavyweight, though. Since this code will be called with reasonable frequency in my application, I want to make sure I'm not missing a better (cleaner/more efficient) way to do this.
I can imagine two alternative regular expressions:
Using a look-behind assertion, as in your code. (look-behind-2)
Matching one more character, before the back-slashes. (alternative)
Other than that, I see only a minor optimization for your regular expression: the "%x" is constant, so you do not have to capture it. (look-behind-1)
I am not sure which of these is actually more efficient. Therefore, I created a small benchmark:
$ perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $test = '%x \%x \\%x \\\%x \\\\%x \\\\\%x \\\\%x \\\%x \\%x \%x %x';
cmpthese 1_000_000, {
'look-behind-1' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)\%x/${1}foo/g },
'look-behind-2' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)(\%x)/${1}foo/g },
'alternative' => sub { (my $t = $test) =~ s/((?:^|[^\\])(?:\\\\)*)\%x/${1}foo/g },
};
Results:
                  Rate   alternative look-behind-2 look-behind-1
alternative   145349/s            --          -23%          -26%
look-behind-2 188324/s           30%            --           -5%
look-behind-1 197239/s           36%            5%            --
As you can clearly see, the alternative regular expression is far behind the look-behind approach and capturing the "%x" is slightly slower than not capturing it.
regards, Matthias
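For reference, the "look-behind-1" variant might look like this in Ruby. This is only a sketch: the method name replace_code and the default replacement are made up, and the block form of gsub is used so that backslashes in the replacement text are not reinterpreted:

```ruby
# Replace %x only when it is preceded by zero or an even number of backslashes.
def replace_code(str, replacement = 'FOO')
  # (?<!\\)      the run of backslashes must not itself follow a backslash
  # ((?:\\\\)*)  zero or more backslash pairs, kept in group 1
  # %x           the code itself (constant, so not captured)
  str.gsub(/(?<!\\)((?:\\\\)*)%x/) { "#{$1}#{replacement}" }
end

# Single-quoted literals: '\\' is one backslash in the actual string.
samples = ['%x', '\%x', '\\\\%x', '\\\\\\%x', 'two %x codes %x here']
samples.each { |s| puts "#{s} --> #{replace_code(s)}" }
```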

Regular expression help

I am currently doing a bunch of processing on a string using regular expressions with gsub() but I'm chaining them quite heavily which is starting to get messy. Can you help me construct a single regex for the following:
string.gsub(/\.com/,'').gsub(/\./,'').gsub(/&/,'and').gsub(' ','-').gsub("'",'').gsub(",",'').gsub(":",'').gsub("#39;",'').gsub("*",'').gsub("amp;",'')
Basically the above removes the following:
.com
.
,
:
*
switches '&' for 'and'
switches ' ' for '-'
switches ' for ''
Is there an easier way to do this?
You can combine the ones that remove characters:
string.gsub(/\.com|[.,:*]/,'')
The pipe | means "or". The right side of the or is a character class; it means "one of these characters".
A translation table is more scalable as you add more options:
translations = Hash.new
translations['.com'] = ''
translations['&'] = 'and'
...
translations.each{ |from, to| string = string.gsub(from, to) }
Building on Tim's answer:
You can pass a block to String#gsub, so you could combine them all, if you wanted:
string.gsub(/\.com|[.,:*& ']/) do |sub|
case(sub)
when '&'
'and'
when ' '
'-'
else
''
end
end
Or, building off echoback's answer, you could use a translation hash in the block (you may need to call translations.default = '' to get this working):
string.gsub(/\.com|[.,:*& ']/) {|sub| translations[sub]}
The biggest perk of using a block is only having one call to gsub (not the fastest function ever).
Hope this helps!
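A variation on the translation-table idea, sketched under the assumption of Ruby 1.9 or later, where gsub accepts a Hash as its second argument: the pattern is built from the hash keys with Regexp.union, so a single call handles every rule. The sample input is invented:

```ruby
# Order matters: '.com' must come before '.' so the longer match wins.
translations = {
  '.com' => '', '.' => '', ',' => '', ':' => '', '*' => '', "'" => '',
  '&' => 'and', ' ' => '-'
}

# Regexp.union escapes plain strings, so '.' and '*' are matched literally.
pattern = Regexp.union(translations.keys)

# Each match is looked up in the hash and replaced by its value.
slug = "Tom & Jerry's Site.com: fun, *free*".gsub(pattern, translations)
puts slug
```

Adding a rule then only means adding a key to the hash.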

Ruby: Escaping special characters in a string

I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\)/
string.gsub(pattern){|match|"\\" + match} # prefix each match with a backslash
end
I have changed above function like this:
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
string.gsub(pattern){|match|"\\" + match}
end
This is working great for regex
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
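Putting the block form together with the characters listed in the question, a minimal sketch could look like the following. The method name escape_characters is illustrative, and this is not a substitute for a database driver's own escaping when building SQL:

```ruby
# Prefix each of ' " . * / \ - with a backslash.
def escape_characters(string)
  # A character class is enough; "-" goes last so it is taken literally.
  string.gsub(/['".*\/\\-]/) { |match| "\\#{match}" }
end

# %q(...) behaves like a single-quoted string, so \D below is a literal
# backslash followed by D.
start_string = %q("My" 'name' *is* -john- .doe. /ok?/ C:\Drive)
escaped = escape_characters(start_string)
puts escaped
```

Using a block sidesteps the confusing rules for backslashes inside a gsub replacement string, which is what tripped up the '\\\0' attempt in the question.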
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at escape_string / quote method in Mysql class here
