Ruby Regular Expression lookahead to Split at pipe unless contained in brackets - ruby

I'm trying to decode the following string:
body = '{type:paragaph|class:red|content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]}'
body << '{type:image|class:grid|content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]}'
I need the string to split at the pipes, but not where a pipe is contained within square brackets. To do this I think I need a lookahead, as described here: How to split string by ',' unless ',' is within brackets using Regex?
My attempt (still splits at every pipe):
x = self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/ *\|(?!\]) */)}
->
[
["type:paragaph", "class:red", "content:[class:intro", "body:This is the introduction paragraph.][body:This is the second paragraph.]"]
["type:image", "class:grid", "content:[id:1", "title:image1][id:2", "title:image2][id:3", "title:image3]"]
]
Expecting:
->
[
["type:paragaph", "class:red", "content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]"]
["type:image", "class:grid", "content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]"]
]
Does anyone know the regex required here?
Is it possible to adapt the regex from this question? I can't seem to modify it correctly: Regular Expression to match underscores not surrounded by brackets?
I modified the answer here Split string in Ruby, ignoring contents of parentheses? to get:
self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/\|\s*(?=[^\[\]]*(?:\[|$))/)}
Seems to do the trick, though I'm not sure whether it has any shortfalls.

Dealing with nested structures that have identical syntax is going to make things difficult for you.
You could try a recursive descent parser (a quick Google turned up https://github.com/Ragmaanir/grammy - not sure if any good)
Personally, I'd go for something really hacky - some gsubs that convert your string into JSON, then parse with a JSON parser :-). That's not particularly easy either, though, but here goes:
require 'json'
b1 = body.gsub(/([^\[\|\]\:\}\{]+)/,'"\1"').gsub(':[',':[{').gsub('][','},{').gsub(']','}]').gsub('}{','},{').gsub('|',',')
JSON.parse('[' + b1 + ']')
It wasn't easy because the string format apparently uses [foo:bar][baz:bam] to represent an array of hashes. If you have a chance to modify the serialised format to make it easier, I would take it.
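As an alternative to rewriting the string into JSON, here is a minimal sketch of a direct parser. It is not the approach from this answer: the parse_record helper is hypothetical, and it assumes only one level of bracket nesting, that every field contains a colon, and that keys themselves never contain a colon or a pipe.

```ruby
body = '{type:paragaph|class:red|content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]}'
body += '{type:image|class:grid|content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]}'

# Hypothetical helper: turns the inside of one {...} record into a Hash.
def parse_record(record)
  # A field is a run of bracket groups and/or characters that are neither "|" nor "[".
  field = /(?:\[[^\]]*\]|[^|\[])+/
  record.scan(field).to_h do |item|
    key, value = item.split(':', 2)
    if value.start_with?('[')
      # "[a:1|b:2][c:3]" becomes an array of hashes, one per bracket group.
      value = value.scan(/\[([^\]]*)\]/).map do |(group)|
        group.split('|').to_h { |pair| pair.split(':', 2) }
      end
    end
    [key, value]
  end
end

records = body.scan(/\{(.*?)\}/).map { |(record)| parse_record(record) }
p records[1]['content'][2]['title']  # prints "image3"
```

Because values are split on the first colon only, a value such as a URL survives intact, which the gsub chain above would not handle.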

I modified the answer here Split string in Ruby, ignoring contents of parentheses? to get:
self.body.scan(/\{(.*?)\}/).map {|m| m[0].split(/\|\s*(?=[^\[\]]*(?:\[|$))/)}
Seems to do the trick. If it has any shortfalls please suggest something better.
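To make the behaviour concrete, the sketch below runs that split against the sample data and compares it with a scan-based variant that matches the fields instead of the separators. The scan pattern is an illustration, not something from the linked answers. Both assume brackets are balanced and never nested.

```ruby
body = '{type:paragaph|class:red|content:[class:intro|body:This is the introduction paragraph.][body:This is the second paragraph.]}'
body += '{type:image|class:grid|content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]}'

# Split on a pipe only if the next bracket character ahead is "[" or there is none,
# which means the pipe itself sits outside any [...] group.
outside_pipe = /\|\s*(?=[^\[\]]*(?:\[|$))/
by_split = body.scan(/\{(.*?)\}/).map { |m| m[0].split(outside_pipe) }

# Alternative: match each field as a run of whole [...] groups and ordinary characters.
field = /(?:\[[^\]]*\]|[^|\[])+/
by_scan = body.scan(/\{(.*?)\}/).map { |m| m[0].scan(field) }

p by_split[1]
# prints ["type:image", "class:grid", "content:[id:1|title:image1][id:2|title:image2][id:3|title:image3]"]
p by_split == by_scan  # prints true
```

One shortfall to be aware of: with nested brackets such as content:[a:[x|y]], the lookahead sees "[" or "]" in the wrong order and the split points are no longer reliable.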

Related

Using a ruby regular expression

I'm completely new to Ruby so I was just wondering if someone could help me out.
I have the following String:
"<planKey><key>OR-J8U</key></planKey>"
What is the regex I have to write to get the center part OR-J8U?
Use the following:
str = "<planKey><key>OR-J8U</key></planKey>"
str[/(?<=\<key\>).*(?=\<\/key\>)/]
#=> "OR-J8U"
This captures anything in between opening and closing 'key' tags using lookahead and lookbehinds
If you want to get the literal string OR-J8U then you could simply use that string in the regular expression; the - character only has special meaning inside a character class, so it does not need to be escaped here:
/OR-J8U/
Though, I believe you want any string that is enclosed within <planKey><key> and </key></planKey>. In that case ice's answer is useful if you allow for an empty string:
/(?<=\<key\>).*(?=\<\/key\>)/
If you don't allow for an empty string, replace the * with +:
/(?<=\<key\>).+(?=\<\/key\>)/
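The difference between the two quantifiers only shows up when the tag is empty. A small sketch, using a made-up empty sample alongside the string from the question:

```ruby
# One string with content between the tags, one with an empty <key> element.
filled = "<planKey><key>OR-J8U</key></planKey>"
empty  = "<planKey><key></key></planKey>"

allow_empty   = /(?<=<key>).*(?=<\/key>)/   # zero or more characters
require_chars = /(?<=<key>).+(?=<\/key>)/   # at least one character

filled_match = filled[allow_empty]    # "OR-J8U"
empty_star   = empty[allow_empty]     # "" (an empty match still counts)
empty_plus   = empty[require_chars]   # nil (nothing to match)

# A capture group gives the same result without lookarounds.
captured = filled[/<key>(.*?)<\/key>/, 1]  # "OR-J8U"

p filled_match, empty_star, empty_plus, captured
```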
If you prefer a more general approach (any string enclosed within any tags), then I believe the common opinion is not to use a regular expression. Instead consider using an HTML parser. On SO you can find some questions and answers in that regard.

In ruby, get substring based on substring match

In Ruby,
I have one string like this :
"name":"jucy","id":123,"property":"abc"
I would like to get '123' from the id.
what is the easy way to get it?
I don't want to create JSON and parse it, it could be a way but too much for this case.
Load the JSON parser and parse it.
Yes, you thought that would be too much work. It isn't. Why? An extensive JSON library comes with Ruby. The library is probably already loaded. It's very easy to use. It's very fast. It's very flexible, you'll have the whole data structure to work with.
And, most importantly, writing your own parser for balanced delimiters (ie. quotes) is either a lot of work to get right, or too simple and it misses plenty of edge cases like spaces or escapes. This answer and this answer are good examples of that.
The only caveat is that your string isn't quite valid JSON: it needs the hash delimiters around it.
require 'json'
almost_json = '"name":"jucy","id":123,"property":"abc"'
my_hash = JSON.parse('{' + almost_json + '}')
puts my_hash["id"]
'"name":"jucy","id":123,"property":"abc"'[/"id":(\d+),/, 1].to_i
Here are a few ways other than converting the JSON string to a hash.
s = '"name":"jucy","id":123,"property":"abc"'
Regex with a capture group
s[/\"id\":\s*(\d+)/,1] #=> "123"
Regex with a positive lookbehind
s[/(?<=\"id\":)\d+/] #=> "123"
This requires the number of spaces between the colon and first digit to be fixed. I have assumed zero spaces.
Regex with \K (forget everything matched so far)
s[/\"id\":\s*\K\d+/] #=> "123"
Find the index where "id": begins, take the rest of the string, let to_i parse the leading digits, and convert back to a string
s[s.index('"id":')+5..-1].to_i.to_s #=> "123"
Look for digits...
...if, as in your example, the substring is comprised of the only digits in the string:
s[/\d+/] #=> "123"
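The fixed-spacing caveat on the lookbehind version is easy to see with a variant of the input that has a space after the colon (the spaced string below is a made-up example, not from the question):

```ruby
require 'json'

# Same data as in the question, but with a space after "id":
spaced = '"name":"jucy","id": 123,"property":"abc"'

lookbehind = spaced[/(?<="id":)\d+/]     # nil: the digits no longer directly follow "id":
capture    = spaced[/"id":\s*(\d+)/, 1]  # "123"
keep_out   = spaced[/"id":\s*\K\d+/]     # "123"

# The JSON parser is indifferent to the spacing and returns an Integer.
parsed = JSON.parse("{#{spaced}}")["id"]  # 123

p lookbehind, capture, keep_out, parsed
```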

String parse using regex

I have a string which is a function call. I want to parse it and obtain the parameters:
"add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
It has a total of 6 parameters and is a mixture of urls, integers and decimals. I can't figure out the regex for the split method which I will be using. Please help!
This is what I have come up with - which is wrong.
/('(.*\/[0-9]*)',)|([0-9]*,)/
Treating the string like a CSV might work:
require 'csv'
str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
p CSV.parse(str[13..-2], :quote_char => "'").first
# => ["http://abc.com/page/1/", "This is the title, it is long", "39.677765", "-45.4343", "34454", "http://abc.com/images/image_1.jpg"]
Assuming all non-numeric parameters are enclosed in single quotes, as in your example
string.scan( /'.+?'|[-0-9.]+/ )
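Note that scan keeps the surrounding quotes and returns everything as strings. A possible follow-up step, sketched here under the same assumption about single-quoted parameters, strips the quotes and converts the numeric tokens:

```ruby
str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"

args = str.scan(/'.+?'|[-0-9.]+/).map do |token|
  if token.start_with?("'")
    token[1..-2]      # quoted parameter: drop the surrounding quotes
  elsif token.include?('.')
    Float(token)      # decimal parameter
  else
    Integer(token)    # whole-number parameter
  end
end

p args.length  # prints 6
p args[1]      # prints "This is the title, it is long"
p args[4]      # prints 34454
```

Float() and Integer() raise on malformed input, which is usually preferable here to the silent 0 that to_f and to_i would return.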
You really don't want to be parsing things this complex with a reg-ex; it just won't work in the long run. I'm not sure if you just want to parse this one string, or if there are lots of strings in this form which vary in exact contents. If you give a bit more info about your end goal, you might be able to get some more detailed help.
For parsing things this complex in the general case, you really want to perform proper tokenization (i.e. lexical analysis) of the string. In the past with Ruby, I've had good experiences doing this with Citrus. It's a nice gem for parsing complex tokens/languages like you're trying to do. You can find more about it here:
https://github.com/mjijackson/citrus

Convert Ruby string to *nix filename-compatible string

In Ruby I have an arbitrary string, and I'd like to convert it to something that is a valid Unix/Linux filename. It doesn't matter what it looks like in its final form, as long as it is visually recognizable as the string it started as. Some possible examples:
"Here's my string!" => "Heres_my_string"
"* is an asterisk, you see" => "is_an_asterisk_you_see"
Is there anything built-in (maybe in the file libraries) that will accomplish this (or close to this)?
By your specifications, you could accomplish this with a regex replacement. This regex will match all characters other than word characters (letters, digits and underscores), whitespace and hyphens:
s/[^\w\s_-]+//g
This will remove any extra whitespace in between words, as shown in your examples:
s/(^|\b\s)\s+($|\s?\b)/\\1\\2/g
And lastly, replace the remaining spaces with underscores:
s/\s+/_/g
Here it is in Ruby:
def friendly_filename(filename)
  filename.gsub(/[^\w\s_-]+/, '')
          .gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
          .gsub(/\s+/, '_')
end
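A shorter variant is sketched below. It is not the method above: simple_filename is a hypothetical name, and it uses strip to drop leading and trailing whitespace instead of the middle gsub, so unlike friendly_filename it never leaves a trailing underscore.

```ruby
# Hypothetical helper: a compact alternative to the three-step version.
def simple_filename(name)
  name.gsub(/[^\w\s\-]+/, '')  # drop everything except word chars, whitespace, hyphens
      .strip                   # remove leading/trailing whitespace left behind
      .gsub(/\s+/, '_')        # collapse each whitespace run into one underscore
end

p simple_filename("Here's my string!")          # prints "Heres_my_string"
p simple_filename("* is an asterisk, you see")  # prints "is_an_asterisk_you_see"
```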
I realize the question asks for plain Ruby and that the purpose isn't quite the same (*nix filename compatibility), but if you are using Rails, there is a method called parameterize that should help.
In rails console:
"Here's my string!".parameterize => "here-s-my-string"
"* is an asterisk, you see".parameterize => "is-an-asterisk-you-see"
I think that parameterize, being designed to produce URL-safe strings, may work just as well for filenames :)
You can read more about it here:
http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-parameterize
There are also a whole lot of other helpful methods.

Using regex to replace all spaces NOT in quotes in Ruby

I'm trying to write a regex to replace all spaces that are not included in quotes so something like this:
a = 4, b = 2, c = "space here"
would return this:
a=4,b=2,c="space here"
I spent some time searching this site and I found a similar q/a ( Split a string by spaces -- preserving quoted substrings -- in Python ) that would replace all the spaces inside quotes with a token that could be re-substituted in after wiping all the other spaces...but I was hoping there was a cleaner way of doing it.
It's worth noting that any regular expression solution will fail in cases like the following:
a = 4, b = 2, c = "space" here"
While it is true that you could construct a regexp to handle the three-quote case specifically, you cannot solve the problem in the general sense. This is a mathematically provable limitation of simple DFAs, of which regexps are a direct representation. To perform any serious brace/quote matching, you will need the more powerful pushdown automaton, usually in the form of a text parser library (ANTLR, Bison, Parsec).
With that said, it sounds like regular expressions should be sufficient for your needs. Just be aware of the limitations.
This seems to work:
result = string.gsub(/( |(".*?"))/, "\\2")
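Here is that substitution run against the sample from the question and against the unbalanced-quote input mentioned earlier, as a quick check of where it holds and where it does not:

```ruby
# Each match is either a single space (group 2 is empty, so it is deleted)
# or a whole quoted run (group 2 holds it, so it is put back unchanged).
pattern = /( |(".*?"))/

balanced   = 'a = 4, b = 2, c = "space here"'
unbalanced = 'a = 4, b = 2, c = "space" here"'

balanced_result   = balanced.gsub(pattern, "\\2")
unbalanced_result = unbalanced.gsub(pattern, "\\2")

puts balanced_result    # prints a=4,b=2,c="space here"
puts unbalanced_result  # prints a=4,b=2,c="space"here"
```

In the second case the stray trailing quote has no partner, so the regex treats "space" as the quoted part and strips the space before here.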
I consider this very clean:
mystring.scan(/((".*?")|([^ ]))/).map { |x| x[0] }.join
I doubt gsub could do any better (assuming you want a pure regex approach).
Try this one. Strings in single or double quotes are also matched (so you need to filter them out if you only want the spaces):
/( |("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/

Resources