Ruby: Remove whitespace chars at the beginning of a string - ruby

Edit: I solved this by using strip! to remove leading and trailing whitespaces as I show in this video. Then, I followed up by restoring the white space at the end of each string the array by iterating through and adding whitespace. This problem varies from the "dupe" as my intent is to keep the whitespace at the end. However, strip! will remove both the leading and trailing whitespace if that is your intent. (I would have made this an answer, but as this is incorrectly marked as a dupe, I could only edit my original question to include this.)
I have an array of words where I am trying to remove any whitespace that may exist at the beginning of the word instead of at the end. rstrip! just takes care of the end of a string. I want whitespaces removed from the beginning of a string.
example_array = ['peanut', ' butter', 'sammiches']
desired_output = ['peanut', 'butter', 'sammiches']
As you can see, not all elements in the array have the whitespace problem, so I can't just delete the first character as I would if all elements started with a single whitespace char.
Full code:
words = params[:word].gsub("\n", ",").delete("\r").split(",")
words.delete_if {|x| x == ""}
words.each do |e|
e.lstrip!
end
Sample text that a user may enter on the form:
Corn on the cob,
Fibonacci,
StackOverflow
Chat, Meta, About
Badges
Tags,,
Unanswered
Ask Question

String#lstrip (or String#lstrip!) is what you're after.
desired_output = example_array.map(&:lstrip)
More comments about your code:
delete_if {|x| x == ""} can be replaced with delete_if(&:empty?)
Except you want reject! because delete_if will only return a different array, rather than modify the existing one.
words.each {|e| e.lstrip!} can be replaced with words.each(&:lstrip!)
delete("\r") should be redundant if you're reading a windows-style text document on a Windows machine, or a Unix-style document on a Unix machine
split(",") can be replaced with split(", ") or split(/, */) (or /, ?/ if there should be at most one space)
So now it looks like:
words = params[:word].gsub("\n", ",").split(/, ?/)
words.reject!(&:empty?)
words.each(&:lstrip!)
I'd be able to give more advice if you had the sample text available.
Edit: Ok, here goes:
temp_array = text.split("\n").map do |line|
fields = line.split(/, */)
non_empty_fields = fields.reject(&:empty?)
end
temp_array.flatten(1)
The methods used are String#split, Enumerable#map, Enumerable#reject and Array#flatten.
Ruby also has libraries for parsing comma seperated files, but I think they're a little different between 1.8 and 1.9.

> ' string '.lstrip.chop
=> "string"
Strips both white spaces...

Related

Erase word from ruby that contains a certain char

As the tittle suggests, I'd like to get some chars and check if the string as any of them. If I suppose, for example, "!" to be forbidden, then string.replace("",word_with_!). How can I check for forbidden chars if forbidden_chars is an array?
forbidden_chars = ["!",",",...]
check ARRAY (it is the string split into an array) for forbidden chars
erase all words with forbidden chars
Could anyone help me please? I just consider searching for the words with the cards and retrieving index as mandatory in the answer please. Thank you very much :)
string = 'I like my coffee hot, with no sugar!'
forbidden_chars = ['!', ',']
forbidden_chars_pattern = forbidden_chars.map(&Regexp.method(:escape)).join('|')
string.gsub /\S*(#{forbidden_chars_pattern})\S*/, ''
# => "I like my coffee with no "
The idea is to match as many non-white space characters as possible \S*, followed by any of the forbidden characters (!|,), followed by as many non-white space characters as possible again.
The reason we need the Regexp.escape is for the cases when a forbidden character has special regex meaning (like .).
string = 'I like my coffee strong, with no cream or sugar!'
verboten = '!,'
string.split.select { |s| s.count(verboten).zero? }.join ' '
#=> "I like my coffee with no cream or"
Note this does not preserve the spacing between "I" and "like" but if there are no extra spaces in string it returns a string that has no extra spaces.

Ruby Delete From Array On Criteria

I'm just learning Ruby and have been tackling small code projects to accelerate the process.
What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete the words from the array that are less than 5 characters long. Then where the stdout is at the bottom, I'm intending to use the array. My code currently works, but is very very slow since it has to read the entire file, then individually check each element and delete the appropriate ones. This seems like it's doing too much work.
goal = File.read('big.txt').split(/\s/).map do |word|
word.scan(/[[:alpha:]]+/).uniq
end
goal.each { |word|
if word.length < 5
goal.delete(word)
end
}
puts goal.sample
Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.
You might want to change your regex instead to catch only words longer than 5 characters to begin with:
goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
word.scan(/[[:alpha:]]{6,}/).uniq
end
Further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:
goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{6,}\b/).each do |w|
goal << w
end
In this case, use the delete_if method
goal => your array
goal.delete_if{|w|w.length < 5}
This will return a new array with the words of length lower than 5 deleted.
Hope this helps.
I really don't understand what a lot of the stuff you are doing in the first loop is for.
You take every chunk of text separated by white space, and map it to a unique value in an array generated by chunking together groups of letter characters, and plug that into an array.
This is way too complicated for what you want. Try this:
goal = File.readlines('big.txt').select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5
end
This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:
goal = File.readlines('big.txt').select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5 &&
! word.upcase.include? 'Q'
end
This assumes that each word in your dictionary is on its own line. You could go back to splitting it on white space, but it makes me wonder if the file you are reading in is written, human-readable text; a.k.a, it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace will not work.
Another note - map is the wrong array function to use. It modifies the values in one array and creates another out of those values. You want to select certain values from an array, but not modify them. The Array#select method is what you want.
Also, feel free to modify the Regex back to using the :alpha: tag if you are expecting non-standard letter characters.
Edit: Second version
goal = /([a-z][a-z']{4,})/gi.match(File.readlines('big.txt').join(" "))[1..-1]
Explanation: Load a file, and join all the lines in the file together with a space. Capture all occurences of a group of letters, at least 5 long and possibly containing but not starting with a '. Put all those occurences into an array. the [1..-1] discards "full match" returned by the MatchData object, which would be all the words appended together.
This works well, and it's only one line for your whole task, but it'll match
sugar'
in
I'd like some 'sugar', if you know what I mean
Like above, if your word can't contain q or Q, you could change the regex to
/[a-pr-z][a-pr-z']{4,})[ .'",]/i
And an idea - do another select on goal, removing all those entries that end with a '. This overcomes the limitations of my Regex

What is the best way to delimit a csv files thats contain commas and double quotes?

Lets say I have the following string and I want the below output without requiring csv.
this, "what I need", to, do, "i, want, this", to, work
this
what i need
to
do
i, want, this
to
work
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
"([^"]+)"|[^, ]+
The left side of the alternation | matches complete "quotes" and captures the contents to Group1. The right side matches characters that are neither commas nor spaces, and we know they are the right ones because they were not matched by the expression on the left.
Option 2: Allowing Multiple Words
In your input, all tokens are single words, but if you also want the regex to work for my cat scratches, "what I need", your dog barks, use this:
"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*
The only difference is the addition of (?:[ ]*[^, ]+)* which optionally adds spaces + characters, zero or more times.
This program shows how to use the regex (see the results at the bottom of the online demo):
subject = 'this, "what I need", to, do, "i, want, this", to, work'
regex = /"([^"]+)"|[^, ]+/
# put Group 1 captures in an array
mymatches = []
subject.scan(regex) {|m|
$1.nil? ? mymatches << $& : mymatches << $1
}
mymatches.each { |x| puts x }
Output
this
what I need
to
do
i, want, this
to
work
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

Regex: Substring the second last value between two slashes of a url string

I have a string like this:
http://www.example.com/value/1234/different-value
How can I extract the 1234?
Note: There may be a slash at the end:
http://www.example.com/value/1234/different-value
http://www.example.com/value/1234/different-value/
/([^/]+)(?=/[^/]+/?$)
should work. You might need to format it differently according to the language you're using. For example, in Ruby, it's
if subject =~ /\/([^\/]+)(?=\/[^\/]+\/?\Z)/
match = $~[1]
else
match = ""
end
Use Slice for Positional Extraction
If you always want to extract the 4th element (including the scheme) from a URI, and are confident that your data is regular, you can use Array#slice as follows.
'http://www.example.com/value/1234/different-value'.split('/').slice 4
#=> "1234"
'http://www.example.com/value/1234/different-value/'.split('/').slice 4
#=> "1234"
This will work reliably whether there's a trailing slash or not, whether or not you have more than 4 elements after the split, and whether or not that fourth element is always strictly numeric. It works because it's based on the element's position within the path, rather than on the contents of the element. However, you will end up with nil if you attempt to parse a URI with fewer elements such as http://www.example.com/1234/.
Use Scan/Match for Pattern Extraction
Alternatively, if you know that the element you're looking for is always the only one composed entirely of digits, you can use String#match with look-arounds to extract just the numeric portion of the string.
'http://www.example.com/value/1234/different-value'.match %r{(?<=/)\d+(?=/)}
#=> #<MatchData "1234">
$&
#=> "1234"
The look-behind and look-ahead assertions are needed to anchor the expression to a path. Without them, you'll match things like w3.example.com too. This solution is a better approach if the position of the target element may change, and if you can guarantee that your element of interest will be the only one that matches the anchored regex.
If there will be more than one match (e.g. http://www.example.com/1234/5678/) then you might want to use String#scan instead to select the first or last match. This is one of those "know your data" things; if you have irregular data, then regular expressions aren't always the best choice.
Javascript:
var myregexp = /:\/\/.*?\/.*?\/(\d+)/;
var match = myregexp.exec(subject);
if (match != null) {
result = match[1];
}
Works with your examples... But I am sure it will fail in general...
Ruby edit:
if subject =~ /:\/\/.*?\/.*?\/(.+?)\//
match = $~[1]
It does work.
I think this is a little simpler than the accepted answer, because it doesn't use any positive lookahead (?=), but rather simply makes the last slash optional via the ? character:
^.+\/(.+)\/.+\/?$
In Ruby:
STDIN.read.split("\n").each do |nextline|
if nextline =~ /^.+\/(.+)\/.+\/?$/
printf("matched %s in %s\n", $~[1], nextline);
else
puts "no match"
end
end
Live Demo
Let's break down what's happening:
^: start of the line
.+\/: match anything (greedily) up to a slash
Since we're going to later match at least 1, at most 2 more slashes, this slash will be either the second last slash (as in http://www.example.com/value/1234/different-value) or the third last slash as in (http://www.example.com/value/1234/different-value/)
Up to this point we've matched http://www.example.com/value/ (due to greediness)
(.+)\/: Our capturing group for 1234 indicated by the parenthesis. It's anything followed by another slash.
Since the previous match matched up to the second or third last slash, this will match up to the last slash or second last slash, respectively
.+: match anything. This would be after our 1234, so we're assuming there are characters after 1234/ (different-value)
\/?: optionally match another slash (the slash after different-value)
$: match the end of the line
Note that in a url, you probably won't have spaces. I used the . character because it's easily distinguished, but perhaps you might use \S instead to match non-spaces.
Also, you might use \A instead of ^ to match start of string (instead of after line break) and \Z instead of $ to match end of string (instead of at line break)

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through of bunch of java based html template files to replace a period (.) with an underscore if it's contained within a specific tag. I use ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern, if I get a match, I create a new string using replaceString = currentMatch.gsub(/./, '_'), then create another replacement as whole string by newReplaceRegex = Regexp.escape(currentMatch) and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString) Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Text text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] then the second tag in matchData1, or, what I'm really doing because I don't know how many matches I'll get within any given line is matchData.to_a.each. And in this case, matchData has two captures, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use the on String#scan instead of the Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't loose points for using more lines on your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like it's behavior is a bug really, and certainly not very useful in general if you can't get it provide a simple means of accessing all the matches. Just my $.02
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexpt but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').

Resources