Selecting Content in quotes using regex - ruby

If I have:
["eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e", "f9e52e06-697a-57af-9566-d05fabb001a4",
"19edb822-eccb-5289-8fee-a39cdda66cd5", "83d3ad63-b468-5a1e-ba6c-6b69eb4a3dc5"]
(where the entire thing is a string)
Is there a simple regular expression that I can use to select content within the quotes (quotes included)?
The above comes out as a string, so I want to use a regex to select each id within the quotes (along with the quotes) and store them in a Ruby array.

Simply use this regex:
"[^"]*"
[^"]* means: match any character except " (that is, [^"]) zero or more times (that is, *).

Try using the String#scan method with the regular expression /"[^"]+"/:
ids = str.scan(/"[^"]+"/) # => [ "eaacbf7e-...", "f9e52e06-...", ...]
puts ids
"eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e"
"f9e52e06-697a-57af-9566-d05fabb001a4"
"19edb822-eccb-5289-8fee-a39cdda66cd5"
"83d3ad63-b468-5a1e-ba6c-6b69eb4a3dc5"
That expression breaks down like so:
str.scan(/"[^"]+"/)
#         │├──┘│└─ Another literal quotation mark (").
#         ││   └─ Match one or more of the previous thing.
#         │└─ A class matching any character except (^) quotation marks.
#         └─ A literal quotation mark (").
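As a rough sketch of how scan behaves on a shortened version of the input (the variable names here are only illustrative), the first call keeps the quotes and the second uses a capture group to drop them. The JSON.parse line is just a cross-check and assumes the string really is valid JSON:

```ruby
require 'json'

# Shortened sample input; the whole thing is one String.
str = '["eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e", "f9e52e06-697a-57af-9566-d05fabb001a4"]'

# Each match includes the surrounding quotation marks.
quoted = str.scan(/"[^"]+"/)

# With a capture group, scan returns one array per match, so flatten it.
bare = str.scan(/"([^"]+)"/).flatten

# Cross-check against the JSON parser (assumes valid JSON input).
parsed = JSON.parse(str)

puts quoted
p bare == parsed
```

If the quotes are not needed in the result, the capture-group form saves a later cleanup step.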

Why are you getting the string in that format? It looks like JSON output; if it is, it should be parsed with the JSON module.
require 'json'
require 'pp'
foo = [
"eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e",
"f9e52e06-697a-57af-9566-d05fabb001a4",
"19edb822-eccb-5289-8fee-a39cdda66cd5",
"83d3ad63-b468-5a1e-ba6c-6b69eb4a3dc5"
]
foo.to_json
=> "[\"eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e\",\"f9e52e06-697a-57af-9566-d05fabb001a4\",\"19edb822-eccb-5289-8fee-a39cdda66cd5\",\"83d3ad63-b468-5a1e-ba6c-6b69eb4a3dc5\"]"
That's probably the string you're getting. If you parse it using the JSON parser, you'll get back a Ruby array:
pp JSON[ foo.to_json ]
=> ["eaacbf7e-37b3-509e-b2d1-ddce7f0e1f6e",
"f9e52e06-697a-57af-9566-d05fabb001a4",
"19edb822-eccb-5289-8fee-a39cdda66cd5",
"83d3ad63-b468-5a1e-ba6c-6b69eb4a3dc5"]

Related

Find youtube url in json file with ruby

For testing purpose my json file (test.json) consists of only the string I want to find:
"https://www.youtube.com/watch?v=hBIZF3sDFTI"
Somehow I cannot find the string in file with this ruby code:
if not File.foreach("test.json").grep(/https://www.youtube.com/watch?v=hBIZF3sDFTI/).any?
puts("string not in file")
end
Output: "string not in file"
But the string is in the file.
Searching for other strings works fine, so it must be a problem with this particular string.
Any help is much appreciated!
Problems
Your regex pattern isn't valid, because the unescaped forward slashes inside it end the regex literal early. Specifically:
/https://www.youtube.com/watch?v=hBIZF3sDFTI/
is not a valid regular expression. Your String is also not a valid JSON object.
Solution
You need to escape regular expression metacharacters like . and ? before trying to use your pattern, and keep unescaped / characters out of a /.../ literal. For example, you could call Regexp.escape on the String like so:
Regexp.escape 'https://www.youtube.com/watch?v=hBIZF3sDFTI'
#=> "https://www\\.youtube\\.com/watch\\?v=hBIZF3sDFTI"
Then, assuming you have a valid JSON object, you could match the expression as follows:
require 'json'
str = 'https://www.youtube.com/watch?v=hBIZF3sDFTI'
json = str.to_json
#=> "\"https://www.youtube.com/watch?v=hBIZF3sDFTI\""
pattern = Regexp.escape str
json.match pattern
#=> #<MatchData "https://www.youtube.com/watch?v=hBIZF3sDFTI">
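Applied to the file check from the question, that might look roughly like the sketch below. The temporary file is only a stand-in for test.json, and the variable names are made up for illustration. Since the URL is fixed text, String#include? is shown as an alternative that avoids regular expressions altogether:

```ruby
require 'tempfile'

url = 'https://www.youtube.com/watch?v=hBIZF3sDFTI'

# Build the pattern from the escaped String instead of a /.../ literal.
pattern = Regexp.new(Regexp.escape(url))

# Stand-in for test.json: a file holding only the quoted URL.
file = Tempfile.new(['test', '.json'])
file.write(%("#{url}"\n))
file.close

found_by_regex   = File.foreach(file.path).grep(pattern).any?
found_by_include = File.foreach(file.path).any? { |line| line.include?(url) }

puts 'string not in file' unless found_by_regex

file.unlink
```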

YAML: error parsing a string containing a square bracket as its first character

I'm parsing a YAML file in Ruby and some of the input is causing a Psych syntax error:
require 'yaml'
example = "my_key: [string] string"
YAML.load(example)
Resulting in:
Psych::SyntaxError: (<unknown>): did not find expected key
while parsing a block mapping at line 1 column 1
from [...]/psych.rb:456:in `parse'
I received this YAML from an external API that I do not have control over. Editing the input to force parsing as a string, using my_key: '[string] string' as noted in "Do I need quotes for strings in YAML?", fixes the issue. However, I don't control how the input is received.
Is there a way to force the input to be parsed as a string for some keys such as my_key? Is there a workaround to successfully parse this YAML?
One approach would be to process the response before reading it as YAML. Assuming it's a string, you could use a regex to replace the problematic pattern with something valid. For example:
resp_str = "---\nmy_key: [string] string\n"
re = /(\: )(\[[a-z]*?\] [a-z]*?)(\n)/
# Use \1, \2, \3 backreferences; "#{$1}" would be interpolated before gsub! runs.
resp_str.gsub!(re, "\\1'\\2'\\3")
#=> "---\nmy_key: '[string] string'\n"
Then you can do
YAML.load(resp_str)
#=> {"my_key"=>"[string] string"}
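A slightly more general sketch of the same idea follows. The helper name quote_bracketed_values is hypothetical, and it assumes simple one-line "key: value" pairs where the value starts with a bracketed part followed by more text. It uses the block form of gsub, where $1 and $2 are set for each match:

```ruby
require 'yaml'

# Hypothetical helper: wrap values such as "[string] string" in single
# quotes so YAML reads them as strings. Real flow sequences like [a, b]
# are left alone because nothing follows their closing bracket.
def quote_bracketed_values(yaml_text)
  yaml_text.gsub(/^(\s*[\w-]+: )(\[[^\]\n]*\] .+)$/) do
    key = $1
    value = $2
    # A single quote inside a single-quoted YAML scalar is written as ''.
    "#{key}'#{value.gsub("'", "''")}'"
  end
end

raw = "my_key: [string] string\nother: [a, b]\n"
result = YAML.load(quote_bracketed_values(raw))
p result
```

This is still a text-level workaround, so it is worth checking against real responses from the API before relying on it.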
It does not work because square brackets have a special meaning in YAML, denoting arrays:
YAML.load "my_key: [string]"
#⇒ {"my_key"=>["string"]}
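and [foo] bar is not valid YAML, because text follows the closing bracket of the array. One could escape the square brackets explicitly, although the backslashes then stay in the parsed value: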
YAML.load "my_key: \\[string\\] string"
#⇒ {"my_key"=>"\\[string\\] string"}
Also, one might implement a custom Psych parser.
There is a very simple, native solution. If you would like a value to be read as a string, you can always put quotes around it:
YAML.load "my_key: '[string]'"
=> {"my_key"=>"[string]"}

Use ARGV[] argument vector to pass a regular expression in Ruby

I am trying to use gsub or sub on a regex passed through terminal to ARGV[].
Query in terminal: $ ruby script.rb input.json "\[\{\"src\"\:\"
Input file first 2 lines:
[{
"src":"http://something.com",
"label":"FOO.jpg","name":"FOO",
"srcName":"FOO.jpg"
}]
[{
"src":"http://something123.com",
"label":"FOO123.jpg",
"name":"FOO123",
"srcName":"FOO123.jpg"
}]
script.rb:
dir = File.dirname(ARGV[0])
output = File.new(dir + "/output_" + Time.now.strftime("%H_%M_%S") + ".json", "w")
open(ARGV[0]).each do |x|
x = x.sub(ARGV[1], '')
output.puts(x) if !x.nil?
end
output.close
This is very basic stuff really, but I am not quite sure on how to do this. I tried:
Regexp.escape with this pattern: [{"src":".
Escaping the characters and not escaping.
Wrapping the pattern between quotes and not wrapping.
Meditate on this:
I wrote a little script containing:
puts ARGV[0].class
puts ARGV[1].class
and saved it to disk, then ran it using:
ruby ~/Desktop/tests/test.rb foo /abc/
which returned:
String
String
The documentation says:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by ‘d’, instead of a digit.
That means that the argument, though it looks like a regular expression, isn't one. It's a string, because ARGV can only return strings, since the command line can only contain strings.
When we pass a string into sub, Ruby recognizes it's not a regular expression, so it treats it as a literal string. Here's the difference in action:
'foo'.sub('/o/', '') # => "foo"
'foo'.sub(/o/, '') # => "fo"
The first can't find "/o/" in "foo" so nothing changes. The second can find /o/ though, and returns the result after removing the first "o" (sub only replaces the first match).
Another way of looking at it is:
'foo'.match('/o/') # => nil
'foo'.match(/o/) # => #<MatchData "o">
where match finds nothing for the string but can find a hit for /o/.
And all that leads to what's happening in your code. Because sub is being passed a string, it's trying to do a literal match for the regex, and won't be able to find it. You need to change the code to:
sub(Regexp.new(ARGV[1]), '')
but that's not all that has to change. Regexp.new(...) will convert what's passed in into a regular expression, but if you're passing in '/o/' the resulting regular expression will be:
Regexp.new('/o/') # => /\/o\//
which is probably not what you want:
'foo'.match(/\/o\//) # => nil
Instead you want:
Regexp.new('o') # => /o/
'foo'.match(/o/) # => #<MatchData "o">
So, besides changing your code, you'll need to make sure that what you pass in is a valid expression, minus any leading and trailing /.
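One way to make the script tolerant of both forms is sketched below. The helper name regexp_from_arg is hypothetical; it strips one optional pair of surrounding slashes before calling Regexp.new. The last line shows Regexp.escape for the case where the argument is meant as literal text, such as the [{"src":" prefix from the question:

```ruby
# Hypothetical helper: turn a command-line argument into a Regexp,
# accepting either "o" or "/o/" as input.
def regexp_from_arg(arg)
  source = arg.sub(%r{\A/(.*)/\z}m, '\1')
  Regexp.new(source)
end

# Sample strings stand in for ARGV[1] here.
p 'foo'.sub(regexp_from_arg('/o/'), '')
p 'a1b22'.gsub(regexp_from_arg('\d+'), '')

# For literal text full of metacharacters, escape it first.
literal = Regexp.new(Regexp.escape('[{"src":"'))
p '[{"src":"http://something.com",'.sub(literal, '')
```

Note that Regexp.new raises RegexpError when the argument is not a valid pattern, so a script taking user input may want to rescue that.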
Based on this answer in the thread Convert a string to regular expression ruby, you should use
x = x.sub(/#{ARGV[1]}/,'')
I tested it with this file (test.rb):
puts "You should not see any number [0123456789].".gsub(/#{ARGV[0]}/,'')
I called the file like so:
ruby test.rb "\d+"
# => You should not see any number [].

Rule for a regex enclosed by two /

How can I match everything between a pair of / characters with treetop? I would also like to match escaped / characters as well. For example, if I were to parse a "regex":
/blarg: dup\/md5 [0-9a-zA-Z]{32}/
The result would return:
blarg: dup\/md5 [0-9a-zA-Z]{32}
This should match everything inside two / characters including escaped slashes. I'm using Ruby's DATA __END__ feature so that everything can run in a single file.
Also, note that you can tag parts of a parsed expression and then use them as functions. In the example below I tagged inside. This could also have been accessed as elements[1] instead of being tagged.
This works similarly to matching a string, which you can find in the Treetop docs.
require 'treetop'
Treetop.load_from_string DATA.read
parser = RegexParser.new
puts parser.parse('/blarg: dup\/md5 [0-9a-zA-Z]{32}/').inside.text_value
# => blarg: dup\/md5 [0-9a-zA-Z]{32}
__END__
grammar Regex
  rule regex
    '/' inside:('\/' / !'/' .)* '/'
  end
end
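For comparison only, here is a rough plain-Ruby counterpart that does not use Treetop at all. It is a regular expression, not a grammar rule, and the constant and method names are made up for illustration. It treats a backslash followed by any character as an escaped pair, which is a slightly stricter assumption than the rule above:

```ruby
# Hypothetical plain-regex counterpart to the Treetop rule:
# a slash, then escaped pairs or non-slash characters, then a slash.
SLASH_DELIMITED = %r{\A/((?:\\.|[^/\\])*)/\z}

def inside_slashes(text)
  match = SLASH_DELIMITED.match(text)
  match && match[1]
end

puts inside_slashes('/blarg: dup\/md5 [0-9a-zA-Z]{32}/')
p inside_slashes('/a/b/')
```

The second call returns nil because the middle slash is not escaped.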

Remove all non-alphabetical, non-numerical characters from a string?

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters?
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.
You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you also wanted to keep punctuation, you can build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡Marc-André!"
For all character properties, you can refer to the doc.
string.gsub(/[^[:alnum:]]/, "")
The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub!(/[^[:alnum:]]/, '') }
puts z.inspect
I borrowed Jeremy's suggested regex.
You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using Ruby since you tagged that in your post. You could go through the array, test each string against a regexp, and remove or keep it based on whether it matches.
A regexp you might use might go something like this:
[^.!,^\-#]
That matches any character that is not one of the characters inside the brackets (the hyphen is escaped so that it is not read as a range). However, I suggest that you look up regular expressions; you might find a better solution once you know their syntax and usage.
If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string in the array) you could do one of the following:
foo.each{ |s| s.gsub!(/\p{^Alnum}/, '') } # Change every string in place…
bar = foo.map{ |s| s.gsub(/\p{^Alnum}/, '') } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep(/\A\p{Alnum}+\z/)
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.
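Putting the two cases side by side on a small made-up array (the sample strings are only for illustration, and assume UTF-8 source with precomposed accented letters):

```ruby
strings = ['Marc-André!', 'ça va?', 'b12398!', 'yöwza']

# Case 1: clean every string, leaving the original array untouched.
cleaned = strings.map { |s| s.gsub(/[^[:alnum:]]/, '') }

# Case 2: keep only the strings that are already purely alphanumeric.
only_alnum = strings.grep(/\A[[:alnum:]]+\z/)

p cleaned
p only_alnum
```

Using map with the non-bang gsub avoids mutating the input, which also sidesteps errors on frozen string literals.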

Resources