Split string suppressing all null fields - ruby

I want to split a string suppressing all null fields
Command:
",1,2,,3,4,,".split(',')
Result:
["", "1", "2", "", "3", "4", ""]
Expected:
["1", "2", "3", "4"]
How to do this?
Edit
OK, just to sum up all the good answers posted.
What I wanted was for split (or some other method) not to generate empty strings. It looks like that isn't possible.
So the solution is a two-step process: split the string as usual, then delete the empty strings from the resulting array.
The second part is exactly this question
(and its duplicate)
So I would use
",1,2,,3,4,,".split(',').delete_if(&:empty?)
The solution proposed by Nikita Rybak and by user229426 is to use the reject method. According to the docs, reject returns a new array, whereas delete_if modifies the array in place, which suits me better since I don't want a copy. select, proposed by Mark Byers, also returns a new array and needs a negated condition.
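The difference can be checked directly. A minimal sketch (the variable names are only for illustration): reject builds a new array and leaves the receiver alone, while delete_if modifies the receiver and returns it.

```ruby
fields = ",1,2,,3,4,,".split(',')

# reject returns a new array; fields still holds the empty strings at this point.
kept = fields.reject(&:empty?)
kept_is_copy = !kept.equal?(fields)

# delete_if removes elements in place and returns the receiver itself.
same = fields.delete_if(&:empty?)
same_is_receiver = same.equal?(fields)

p kept              # => ["1", "2", "3", "4"]
p kept_is_copy      # => true
p same_is_receiver  # => true
```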
steenslag proposed to replace commas with space and then use split by space:
",1,2,,3,4,,".gsub(',', ' ').split(' ')
Actually, the documentation says that the space pattern stands for whitespace. But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
Mark Byers proposed another solution: just use regular expressions. This seems to be what I need, although it implies that you have to be a master of regexps. Still, it is a great solution! For example, if I want fields that may contain spaces, with any other non-alphanumeric symbol acting as a separator, I can rewrite it as
",1,2, ,3 3,4 4 4,,".scan(/\w+[\s*\w*]*/)
the result is:
["1", "2", "3 3", "4 4 4"]
But again, regexps are very unintuitive and require experience.
Summary
I expect split to work with whitespace as if whitespace were a comma or even a regexp. I expect it to do not produce empty strings. I think this is a bug in ruby or my misunderstanding.
Made it a community question.

There's a reject method in Array:
",1,2,,3,4,,".split(',').reject { |s| s.empty? }
Or if you prefer Symbol#to_proc:
",1,2,,3,4,,".split(',').reject(&:empty?)

Hoping to illuminate a bit here:
But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
If you look at the docs for String#split you'll see that split with ' ' is a special case:
If pattern is a single space, str is split on whitespace,
with leading whitespace and runs of contiguous whitespace characters ignored.
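A small sketch of the difference, using a made-up input string with a leading space and a doubled space:

```ruby
s = " 1  2 3"

by_space  = s.split(' ')    # special case: leading and repeated whitespace ignored
by_single = s.split(/\s/)   # every single whitespace character is a separator
by_run    = s.split(/\s+/)  # runs collapse, but the leading empty field remains

p by_space   # => ["1", "2", "3"]
p by_single  # => ["", "1", "", "2", "3"]
p by_run     # => ["", "1", "2", "3"]
```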
You also mention:
I expect it to do not produce empty strings. I think this is a bug in ruby or my misunderstanding.
The problem probably lies between the keyboard and the chair. ;-)
split will happily produce empty strings as it should, because there are times when you would definitely want this ability, and there are plenty of easy ways to work around it. Consider if you were splitting a csv from an Excel file. Anywhere you see ',,' would be an empty column, not a column you should just get rid of.
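To make the CSV point concrete, here is a sketch with a made-up row. Interior empty columns are kept by default; trailing ones are dropped unless a negative limit is passed as the second argument.

```ruby
row = "alice,,42,,"   # hypothetical CSV row; the empty columns are meaningful

default_fields = row.split(',')      # trailing empty fields are suppressed
all_fields     = row.split(',', -1)  # negative limit keeps every field

p default_fields  # => ["alice", "", "42"]
p all_fields      # => ["alice", "", "42", "", ""]
```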
Regardless, you've seen a bunch of solutions - and here's another one that might show you the things you can do with ruby and split!
It seems you want to split up data between multiple commas, so why not try that and see what happens?
a = ",1,2,,3,4,,5,,,,6,,,".split(/,+/)
It's a simple enough regular expression: /,+/ means one or more commas, so we'll split on that.
This almost gives you what you want, except that you also want to ignore the leading empty field. You'll note that split ignores the empty field on the end because (from the String#split docs):
If the limit parameter is omitted, trailing null fields are suppressed.
So that means we can either remove that empty string at the front of the array or just strip the leading commas before splitting. We can use gsub for that:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'')
If you print that out you'll see that our leading empty "field" is now gone. So we can combine them all in one line:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'').split(/,+/)
And you have another solution!
And incidentally, this points out another possibility: we can just clean up our string entirely before sending it to split if we want a simple split. I'll leave it to you to figure out what this one is doing:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/,+/,',').gsub(/^,/,'').split(',')
There are lots of ways to do things in Ruby. If it seems that Ruby isn't doing what you want, then take a look at the docs and realize that it probably works the way that it does for a reason (there are plenty of people who would be upset if split wasn't able to spit out empty fields :)
Hope that helps!

You could use split followed by select:
",1,2,,3,4,,".split(',').select{|x|!x.empty?}
Or you could use a regular expression to match what you want to keep instead of splitting on the delimiter:
",1,2,,3,4,,".scan(/[^,]+/)

",1,2,,3,4,,".split(/,/).reject(&:empty?)
",1,2,,3,,,4,,".squeeze(",").sub(/^,*|,*$/,"").split(",")

String#split(pattern) behaves as desired when pattern is a single space (ruby-doc).
",1,2,,3,4,,".gsub(',', ' ').split(' ')

Related

Methods to concatenate strings on separate lines

This produces newlines:
%(https://api.foursquare.com/v2/venues/search
?ll=80.914207,%2030.328466&radius=200
&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
&intent=browse)
This produces spaces:
"https://api.foursquare.com/v2/venues/search
?ll=80.914207,%2030.328466&radius=200
&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
&intent=browse"
This produces one string:
"https://api.foursquare.com/v2/venues/search"\
"?ll=80.914207,%2030.328466&radius=200"\
"&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259"\
"&intent=browse"
When I want to separate one string on multiple lines to read it better on screen, is it preferred to use the escape character?
My IDE complains that I should use single quoted strings rather than double quoted strings since there is no interpolation.
Normally you'd put something like this on one line, readability be damned, because the alternatives are going to be problematic. There's no way of declaring a string with whitespace ignored, but you can do this:
url = %w[ https://api.foursquare.com/v2/venues/search
?ll=80.914207,%2030.328466&radius=200
&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
&intent=browse
].join
Where you explicitly remove the whitespace.
I'd actually suggest avoiding this whole mess by properly composing this URI:
uri = url("https://api.foursquare.com/v2/venues/search",
ll: [ 80.914207,30.328466 ],
radius: 200,
v: 20161201,
m: 'foursquare',
categoryId: '4d4b7105d754a06374d81259',
intent: 'browse'
)
Where you have some kind of helper function that properly encodes that using URI or other tools. By keeping your parameters as data, not as encoded strings, for as long as possible you make it easier to spot bugs as well as make last-second changes to them.
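One possible shape for such a helper, as a sketch only: the name url and the rule that array values are joined with commas are assumptions made for illustration, and URI.encode_www_form from the standard library does the encoding.

```ruby
require 'uri'

# Hypothetical helper matching the call above; not part of any library.
def url(base, **params)
  # Assumption: array values (such as ll: [lat, lng]) are joined with commas.
  flat = params.transform_values { |v| v.is_a?(Array) ? v.join(',') : v }
  "#{base}?#{URI.encode_www_form(flat)}"
end

puts url("https://example.com/search", ll: [1.5, 2.5], radius: 200)
# prints https://example.com/search?ll=1.5%2C2.5&radius=200
```

Note that the comma is percent-encoded as %2C, which is the kind of detail that is easy to get wrong when the query string is typed by hand.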
The answer by @tadman definitely suggests the proper way to do it; I’ll post another approach just for the sake of diversity:
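query = "https://api.foursquare.com/v2/venues/search" \
"?ll=80.914207,%2030.328466&radius=200" \
"&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259" \
"&intent=browse"
Yes, without any visible concatenation operator: four quoted strings in a row, with a trailing backslash continuing each line. Without the backslashes every line would be parsed as a separate statement (which is also what happens if the lines are entered one by one in irb/pry), and query would hold only the first string. Adjacent string literals are joined by the parser, so this is the most efficient way to concatenate strings in Ruby, with no intermediate result produced.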
Contrived example to test in pry/irb:
value = "a" "b" "c" "d"

In ruby, get substring based on substring match

In Ruby, I have a string like this:
"name":"jucy","id":123,"property":"abc"
I would like to get '123' from the id.
What is an easy way to get it?
I don't want to build JSON and parse it; that could work, but it seems like too much for this case.
Load the JSON parser and parse it.
Yes, you thought that would be too much work. It isn't. Why? An extensive JSON library comes with Ruby. The library is probably already loaded. It's very easy to use. It's very fast. It's very flexible, you'll have the whole data structure to work with.
And, most importantly, writing your own parser for balanced delimiters (i.e. quotes) is either a lot of work to get right, or too simple, so that it misses plenty of edge cases like spaces or escapes. This answer and this answer are good examples of that.
The only caveat is your string isn't quite valid JSON, it needs the hash delimiters around it.
require 'json'
almost_json = '"name":"jucy","id":123,"property":"abc"'
my_hash = JSON.parse('{' + almost_json + '}')
puts my_hash["id"]
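As a sketch of the kind of edge case meant here, take a made-up variant of the string with a space after each colon and comma. A pattern that expects no space after the colon stops matching, while the parser does not care:

```ruby
require 'json'

# Hypothetical variant of the input with extra spaces.
spaced = '"name":"jucy", "id": 123, "property":"abc"'

by_regex = spaced[/"id":(\d+),/, 1]         # the space after the colon breaks the pattern
by_json  = JSON.parse("{#{spaced}}")["id"]  # whitespace between tokens is fine in JSON

p by_regex  # => nil
p by_json   # => 123
```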
'"name":"jucy","id":123,"property":"abc"'[/"id":(\d+),/, 1].to_i
Here are a few ways other than converting the JSON string to a hash.
s = '"name":"jucy","id":123,"property":"abc"'
Regex with a capture group
s[/\"id\":\s*(\d+)/,1] #=> "123"
Regex with a positive lookbehind
s[/(?<=\"id\":)\d+/] #=> "123"
This requires the number of spaces between the colon and first digit to be fixed. I have assumed zero spaces.
Regex with \K (forget everything matched so far)
s[/\"id\":\s*\K\d+/] #=> "123"
Find index where id: begins, convert to an integer and back to a string
s[s.index('"id":')+5..-1].to_i.to_s #=> "123"
Look for digits...
...if, as in your example, the substring is comprised of the only digits in the string:
s[/\d+/] #=> "123"

Regex can this be achieved

Am I too ambitious, or is there a way to do this:
add a string if it is not present,
and
remove the same string if it is present?
Do all of this using a regex and avoid the if-else statement.
Here is an example.
I have the string
"admin,artist,location_manager,event_manager"
So can the substring location_manager be added or removed according to the above conditions?
Basically I'm looking to avoid the if-else statement and do all of this purely in regex.
"admin,artist,location_manager,event_manager".test(/some_regex/)
some_regex would remove location_manager from the string if present, else it would add it.
Am I being overambitious?
You will need to use some sort of logic.
str += ',location_manager' unless str.gsub!(/location_manager,/,'')
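I'm assuming that if it's not present you append it to the end of the string. Note that the pattern only matches location_manager followed by a comma, so it won't remove it when it is the last item in the string.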
Regex will not actually add or remove anything in any language that I am aware of. It is simply used to match. You must use some other language construct (a regex based replacement function for example) to achieve this functionality. It would probably help to mention your specific language so as to get help from those users.
Here's one kinda off-the-wall solution. It doesn't use regexes, but it also doesn't use any if/else statements either. It's more academic than production-worthy.
Assumptions: Your string is a comma-separated list of titles, and that these are a unique set (no duplicates), and that order doesn't matter:
require 'set'
str = "admin,artist,location_manager,event_manager"
titles = Set.new(str.split(','))
#=> #<Set: {"admin", "artist", "location_manager", "event_manager"}>
titles_to_toggle = ["location_manager"]
#=> ["location_manager"]
titles ^= titles_to_toggle
#=> #<Set: {"admin", "artist", "event_manager"}>
titles ^= titles_to_toggle
#=> #<Set: {"location_manager", "admin", "artist", "event_manager"}>
titles.to_a.join(",")
#=> "location_manager,admin,artist,event_manager"
All this assumes that you're using a string as a kind of set. If so, you should probably just use a set. If not, and you actually need string-manipulation functions to operate on it, there's probably no way around using if-else or a variant, such as the ternary operator, unless, or Bergi's answer.
Also worth noting regarding regex as a solution: Make sure you consider the edge cases. If 'location_manager' is in the middle of the string, will you remove the extraneous comma? Will you handle removing commas correctly if it's at the beginning or the end of the string? Will you correctly add commas when it's added? For these reasons treating a set as a set or array instead of a string makes more sense.
No. Regex can only match/test whether "a string" is present (or not). Then, the function you've used can do something based on that result, for example replace can remove a match.
Yet, you want to do two actions (each can be done with regex), remove if present and add if not. You can't execute them sequentially, because they overlap - you need to execute either the one or the other. This is where if-else structures (or ternary operators) come into play, and they are required if there is no library/native function that contains them to do exactly this job. I doubt there is one in Ruby.
If you want to avoid the if-else statement (for one-liners or expressions), you can use the ternary operator. Or you can use a lambda expression returning the correct value:
# kind of pseudo code
string.replace(/location,?|$/, function($0) return $0 ? "" : ",location" )
This matches the string "location" (with optional comma) or the string end, and replaces that with nothing if a match was found or the string ",location" otherwise. I'm sure you can adapt this to Ruby.
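One way this idea might be adapted to Ruby, as a sketch: String#sub with a block plays the role of the replacement function. The method name toggle_role is made up for illustration, and the pattern is stricter than the pseudo code so that the item is removed cleanly whether it sits at the start, middle, or end of the list.

```ruby
# Toggle `role` in a comma-separated list: remove it if present, append it otherwise.
# Hypothetical helper, written for this example only.
def toggle_role(str, role)
  # First alternative: the role as a whole item, with its leading comma if it has one.
  # Second alternative: the end of the string, reached only when the role is absent.
  pattern = /(?:\A|,)#{Regexp.escape(role)}(?=,|\z)|\z/
  # An empty match means we hit the end of the string, so append; otherwise remove.
  # delete_prefix drops the stray comma left when the role was (or becomes) the first item.
  str.sub(pattern) { |m| m.empty? ? ",#{role}" : "" }.delete_prefix(",")
end

puts toggle_role("admin,artist,location_manager,event_manager", "location_manager")
# prints admin,artist,event_manager
puts toggle_role("admin,artist,event_manager", "location_manager")
# prints admin,artist,event_manager,location_manager
```

The ternary inside the block is still a conditional, so this moves the if-else rather than eliminating it.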
Removing something that matches a pattern is really easy:
(admin,?|artist,?|location_manager,?|event_manager,?)
Then choose the string to replace the match (in your case an empty string) and pass everything to the replace method.
The other operation you suggested is more difficult to achieve with regex only. Maybe someone knows a better answer.

Regular expression to strip everything but words

I'm helpless with regular expressions, so please help me with this problem.
Basically, I am downloading web pages and RSS feeds and want to strip everything except plain words: no periods, commas, ifs, ands, or buts. I also have a list of the most common words used in English that I want to strip too, but I think I know how to do that without a regular expression, which would be far too long.
How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.
This works quite well thanks to Pavel .split(/[^[:alpha:]]/).uniq!
I think that what would fit you best is splitting the string into words. In this case, String#split is the better option. It accepts a regexp that matches the separators, and the text between the matches becomes the elements of the resulting array.
In your case, the separator should be "some non-alphabetic characters". The alphabetic character class is denoted by [:alpha:]. So here's an example of what you need:
irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]
You may further filter the result by intersecting the resulting array with an array that contains only English words:
irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]
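Going the other way, dropping common words instead of keeping known ones, could be sketched with array difference. The sample text and the tiny stop-word list are made up for illustration, and scan is used here so that no empty strings appear in the result:

```ruby
text = "The cat, and the dog; they ran. Fast!"
stop_words = %w[the and they]   # hypothetical list of common words

# Collect runs of alphabetic characters, then remove the common words.
words = text.downcase.scan(/[[:alpha:]]+/) - stop_words

p words  # => ["cat", "dog", "ran", "fast"]
```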
Try \b\w+\b to match whole words (with \w* the pattern can also match the empty string at a word boundary).

Very odd issue with Ruby and regex

I am getting completely different results from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
The testers work fine, but in Ruby...
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com ,which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
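A short sketch of that behaviour with the string from the question. The first call adds the outer group; the second, as an alternative, makes the inner group non-capturing so that scan returns plain strings:

```ruby
str = 'Show more results from software.informer.com'

# Outer group added: the first capture of each result is the whole match.
with_outer = str.scan(/(([a-zA-Z0-9\-]*\.)*\w{1,4})$/)

# No capturing groups at all: scan returns the matched text directly.
without_groups = str.scan(/(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/)

p with_outer      # => [["software.informer.com", "informer."]]
p without_groups  # => ["software.informer.com"]
```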
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670
