Separating a sentence into words using regexp - ruby

I'm trying to do the above. For example:
"This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all."
should split into individual words:
This
is
a
sentence
I'm
and so on.
I'm just struggling to write the regexp. I know this would be pretty easy using a delimiter or two, but I'm trying to learn more about regexps.

Split your input on runs of one or more whitespace characters, either with an explicit pattern or by calling split with no argument:
> "This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all.".split(/\s+/)
=> ["This", "is", "a", "sentence", "I'm", "currently", "writing,", "potentially", "with", "punctuation", "dotted", "in:", "item1,", "item2,", "item3.", "That", "is", "all."]
> "This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all.".split()
=> ["This", "is", "a", "sentence", "I'm", "currently", "writing,", "potentially", "with", "punctuation", "dotted", "in:", "item1,", "item2,", "item3.", "That", "is", "all."]
Or match runs of one or more non-whitespace characters:
> "This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all.".scan(/\S+/)
=> ["This", "is", "a", "sentence", "I'm", "currently", "writing,", "potentially", "with", "punctuation", "dotted", "in:", "item1,", "item2,", "item3.", "That", "is", "all."]
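Both approaches above leave punctuation attached to the words ("writing,", "in:"). If the goal is bare words, one option is to scan for runs of word characters and apostrophes instead. This is only a sketch: the character class [\w'] is an assumption about what counts as a word here, and it treats digits and underscores as word characters too.

```ruby
# Sketch: extract words without surrounding punctuation.
# The class [\w'] (letters, digits, underscore, apostrophe) is an assumption
# about what a "word" is; adjust it for hyphens, accents, and so on.
sentence = "This is a sentence I'm currently writing, potentially with " \
           "punctuation dotted in: item1, item2, item3. That is all."

words = sentence.scan(/[\w']+/)

puts words.first(6).inspect
# prints ["This", "is", "a", "sentence", "I'm", "currently"]
puts words.last(3).inspect
# prints ["That", "is", "all"]
```

The apostrophe is included so that "I'm" stays in one piece; with plain /\w+/ it would be split into "I" and "m".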

Use split:
2.0.0-p481 :001 > a="This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all."
=> "This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all."
2.0.0-p481 :002 > a.split
=> ["This", "is", "a", "sentence", "I'm", "currently", "writing,", "potentially", "with", "punctuation", "dotted", "in:", "item1,", "item2,", "item3.", "That", "is", "all."]
Or use a loop to print each word on its own line:
2.0.0-p481 :036 > a="This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all."
=> "This is a sentence I'm currently writing, potentially with punctuation dotted in: item1, item2, item3. That is all."
2.0.0-p481 :037 > a.split.each { |i| puts i }
This
is
a
sentence
I'm
currently
writing,
potentially
with
punctuation
dotted
in:
item1,
item2,
item3.
That
is
all.
=> ["This", "is", "a", "sentence", "I'm", "currently", "writing,", "potentially", "with", "punctuation", "dotted", "in:", "item1,", "item2,", "item3.", "That", "is", "all."]

Related

Ruby string scan returns different results for different string

irb(main):161:0> "Ready for your my next session?".scan(/[A-Za-z]+|\d+|. /)
=> ["Ready", "for", "your", "my", "next", "session"]
=> ["Ready", "for", "your", "my", "next", "session", "?"] #==> EXPECTED
irb(main):162:0> "yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(/[A-Za-z]+|\d+|. /)
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a", "m", ". ", "okay"]
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a",".", "m", ".", "``", "okay", "''"] #==> EXPECTED
I am trying to use scan(/[A-Za-z]+|\d+|. /) to tokenize the string, including the punctuation, even when the string contains an escaped quote (\").
But it behaves differently depending on the structure of the string. How can I correct this?
r = /
(?: # begin a non-capture group
\"? # optionally (?) match a double-quote
\p{alpha}+ # match one or more letters
\"? # optionally (?) match a double-quote
) # end non-capture group
| # or
\d+ # match one or more digits
| # or
[.,?!:;] # match a punctuation mark
/x # free-spacing regex definition mode
"yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(r)
#=> ["yo", "mr", ".", "menon", "how", "are", "you", "?", "call", "at", "9",
# "a", ".", "m", ".", "\"okay\""]
puts "\"okay\""
# "okay"
Without free-spacing mode, the regular expression is conventionally written:
/(?:\"?\p{alpha}+\"?)|\d+|[.,?!:;]/
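As for why the pattern in the question behaves differently on different strings: its last alternative (a dot followed by a space) matches a character only when a space comes right after it, so punctuation at the very end of the string, or directly before another non-space character, is never matched. A minimal sketch of one alternative, assuming every non-whitespace character that is not part of a letter run or digit run should become its own token (the constant names are hypothetical):

```ruby
# ORIGINAL is the pattern from the question; TOKENS is a hypothetical variant
# that replaces ". " with \S (any single non-whitespace character).
ORIGINAL = /[A-Za-z]+|\d+|. /
TOKENS   = /[A-Za-z]+|\d+|\S/

text = "Ready for your my next session?"

p text.scan(ORIGINAL)
# prints ["Ready", "for", "your", "my", "next", "session"]
p text.scan(TOKENS)
# prints ["Ready", "for", "your", "my", "next", "session", "?"]

quoted = "call at 9 a.m. \"okay\""
p quoted.scan(TOKENS)
# prints ["call", "at", "9", "a", ".", "m", ".", "\"", "okay", "\""]
```

Unlike the pattern in the answer above, this variant emits each double quote as a separate token rather than keeping it attached to the word.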

Method to front capitalized words

I am trying to move capitalized words to the front of the sentence. I expect to get this:
capsort(["a", "This", "test.", "Is"])
#=> ["This", "Is", "a", "test."]
capsort(["to", "return", "I" , "something", "Want", "It", "like", "this."])
#=> ["I", "Want", "It", "to", "return", "something", "like", "this."]
The key is maintaining the word order.
I feel like I'm very close.
def capsort(words)
array_cap = []
array_lowcase = []
words.each { |x| x.start_with? ~/[A-Z]/ ? array_cap.push(x) : array_lowcase.push(x) }
words= array_cap << array_lowcase
end
I'm also curious to see what other elegant solutions there might be.
The question was changed radically, making my earlier answer completely wrong. Now, the answer is:
def capsort(strings)
strings.partition(&/\p{Upper}/.method(:match)).flatten
end
capsort(["a", "This", "test.", "Is"])
# => ["This", "Is", "a", "test."]
My earlier answer was:
def capsort(strings)
strings.sort
end
capsort(["a", "This", "test.", "Is"])
# => ["Is", "This", "a", "test."]
'Z' < 'a' # => true; uppercase letters sort before lowercase ones, so there's nothing more to be done.
def capsort(words)
words.partition{|s| s =~ /\A[A-Z]/}.flatten
end
capsort(["a", "This", "test.", "Is"])
# => ["This", "Is", "a", "test."]
capsort(["to", "return", "I" , "something", "Want", "It", "like", "this."])
# => ["I", "Want", "It", "to", "return", "something", "like", "this."]
def capsort(words)
caps = words.select{ |x| x =~ /^[A-Z]/ }
lows = words.select{ |x| x !~ /^[A-Z]/ }
caps.concat(lows)
end
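Note that the patterns in these answers are not quite equivalent: /\p{Upper}/ matches an uppercase letter anywhere in the string, /^[A-Z]/ matches at the start of any line, and /\A[A-Z]/ matches only at the start of the string. A small sketch of the difference, using a made-up word list and hypothetical method names (String#match? assumes Ruby 2.4 or later):

```ruby
# Hypothetical helpers comparing an unanchored and an anchored pattern.
def capsort_anywhere(words)
  # true for any word containing an uppercase letter, e.g. "iPhone"
  words.partition { |w| w.match?(/\p{Upper}/) }.flatten
end

def capsort_anchored(words)
  # true only for words whose first character is an uppercase letter
  words.partition { |w| w.match?(/\A\p{Upper}/) }.flatten
end

words = ["an", "iPhone", "Is", "here"]
p capsort_anywhere(words)  # prints ["iPhone", "Is", "an", "here"]
p capsort_anchored(words)  # prints ["Is", "an", "iPhone", "here"]
```

In both cases partition keeps the original relative order within each group, which is what preserves the word order the question asks for.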

Applescript: number list of words according to their occurrence in the sequence

So I have this list of words:
{"It", "was", "the", "best", "of", "times", "it", "was", "the", "worst", "of", "times", "it", "was", "the", "age", "of", "wisdom", "it", "was", "the", "age", "of", "foolishness", "it", "was", "the", "epoch", "of", "belief"}
I'd like each occurrence of a word to be numbered according to the number of times it has occurred in the sequence:
{"It", 1}, {"was", 1}, {"the", 1}, {"best", 1}, {"of", 1}, {"times", 1}, {"it", 2}, {"was", 2}, {"the", 2}, {"worst", 1}, {"of", 2}, {"times", 2}, {"it", 3}, {"was", 3}, {"the", 3}, {"age", 1}, {"of", 3}, {"wisdom", 1}, {"it", 4}, {"was", 4}, {"the", 4}, etc.
Any help would be greatly appreciated.
AppleScript is not the right tool for this job, so while the following solution works, it is sloooow.
(For better performance, you'd need full hash-table functionality, which AppleScript doesn't provide. AppleScript's record class is severely handicapped by the inability to specify keys dynamically, at runtime, via variables. Third-party solutions exist, though, e.g., http://www.latenightsw.com/freeware/list-record-tools/.)
# Helper handler: Given a search word, returns the number of times the word
# already occurs in the specified list (as the first sub-item of each list item).
on countOccurrences(searchWrd, lst)
local counter
set counter to 0
repeat with wordNumPair in lst
if item 1 of wordNumPair is searchWrd then
set counter to counter + 1
end if
end repeat
return counter
end countOccurrences
# Define the input list.
set inList to {"It", "was", "the", "best", "of", "times", "it", "was", "the", "worst", "of", "times", "it", "was", "the", "age", "of", "wisdom", "it", "was", "the", "age", "of", "foolishness", "it", "was", "the", "epoch", "of", "belief"}
# Initialize the output list.
set outList to {}
# Loop over the input list and build the output list incrementally.
repeat with wrd in inList
# Note that `contents of` returns the string content of the list item
# (dereferences the object specifier that `repeat with` returns).
set outList to outList & {{contents of wrd, 1 + (my countOccurrences(contents of wrd, outList))}}
end repeat
# outList now contains the desired result.
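For comparison, the hash-table approach mentioned above looks like this in a language that has one. The sketch below is Ruby, not AppleScript; the method name is hypothetical, and words are compared case-insensitively on the assumption that "It" and "it" should share a counter, as in the expected output.

```ruby
# Hypothetical helper: pairs each word with its running occurrence count.
# A Hash with a default of 0 gives constant-time lookups, so the list is
# traversed once instead of re-scanning the output for every word.
def number_occurrences(words)
  counts = Hash.new(0)
  words.map do |word|
    key = word.downcase          # assumption: "It" and "it" count together
    counts[key] += 1
    [word, counts[key]]
  end
end

sample = %w[It was the best of times it was the worst of times]
p number_occurrences(sample).last(3)
# prints [["worst", 1], ["of", 2], ["times", 2]]
```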
Another way of doing it, which reports the total count for each distinct word rather than numbering each occurrence:
# Define the input list.
set inList to {"It", "was", "the", "best", "of", "times", "it", "was", "the", "worst", "of", "times", "it", "was", "the", "age", "of", "wisdom", "it", "was", "the", "age", "of", "foolishness", "it", "was", "the", "epoch", "of", "belief"}
set AppleScript's text item delimiters to linefeed
set inListLF to inList as text
set AppleScript's text item delimiters to {""}
set usedList to {}
set resultList to {}
repeat with aWord in inList
set aWord to contents of aWord
considering case
if aWord is not in usedList then
set end of usedList to aWord
set end of resultList to aWord & " - " & (do shell script "echo " & quoted form of inListLF & " | grep -o " & quoted form of aWord & " | wc -l")
end if
end considering
end repeat
set AppleScript's text item delimiters to linefeed
set resultList to resultList as text
set AppleScript's text item delimiters to {""}
return resultList

Ruby string split into words ignoring all special characters: Simpler query

I need a query to be split into words wherever a non-word character occurs. For example:
query = "I am a great, boy's and I like! to have: a lot-of-fun and #do$$nice&acti*vities+enjoy good ?times."
Should output:
["I", "am", "a", "great", "", "boy", "s", "and", "I", "like", "", "to", "have", "", "a", "lot", "of", "fun", "and", "", "do", "", "nice", "acti", "vities", "enjoy", "good", "", "times"]
This does the trick but is there a simpler way?
query.split(/[ ,'!:\\#\\$\\&\\*+?.-]/)
query.split(/\W+/)
# => ["I", "am", "a", "great", "boy", "s", "and", "I", "like", "to", "have", "a", "lot", "of", "fun", "and", "do", "nice", "acti", "vities", "enjoy", "good", "times"]
query.scan(/\w+/)
# => ["I", "am", "a", "great", "boy", "s", "and", "I", "like", "to", "have", "a", "lot", "of", "fun", "and", "do", "nice", "acti", "vities", "enjoy", "good", "times"]
This is different from the expected output in that it does not include empty strings.
I am adding this answer because @sawa's did not exactly reproduce the desired output:
#Split using any single non-word character:
query.split(/\W/) #=> ["I", "am", "a", "great", "", "boy", "s", "and", "I", "like", "", "to", "have", "", "a", "lot", "of", "fun", "and", "", "do", "", "nice", "acti", "vities", "enjoy", "good", "", "times"]
Now, if you do not want the empty strings in the result, just use sawa's answer.
The pattern above produces many empty strings if the string contains runs of multiple spaces, because each extra space is matched separately and creates a new split point. To avoid that, we can add an alternative:
# Split using any number of spaces or a single non-word character:
query.split(/\s+|\W/)
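The difference between these patterns is easiest to see on short inputs. A small sketch with made-up strings (note that split keeps a leading empty string when the input starts with a separator, whereas scan does not):

```ruby
# Made-up inputs illustrating how each pattern treats adjacent separators.
p "great, boy's".split(/\W/)      # prints ["great", "", "boy", "s"]
p "great, boy's".split(/\W+/)     # prints ["great", "boy", "s"]

# Two spaces in a row: \W alone yields an empty string, \s+|\W does not.
p "good  times".split(/\W/)       # prints ["good", "", "times"]
p "good  times".split(/\s+|\W/)   # prints ["good", "times"]

# A leading separator: split keeps a leading empty string, scan does not.
p "#do nice".split(/\W+/)         # prints ["", "do", "nice"]
p "#do nice".scan(/\w+/)          # prints ["do", "nice"]
```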

Why doesn't e.upcase return upper-case words

I wanted to upper-case an array but got this behaviour:
=> ["this", "set", "of", "words", "is", "in", "a", "certain", "order"]
for this:
%w[this set of words is in a certain order].each {|e| e.upcase}
Why were the words NOT upper-cased?
(Ignoring the actual ordering for now, despite what the words say, while I resolve this issue.)
String#upcase returns a new string value, it doesn't modify the receiver. Use String#upcase! to get the behavior you want, or use map to produce a new array of the upcased values.
%w[this set of words is in a certain order].each { |e| e.upcase! }
up_words = %w[this set of words is in a certain order].map(&:upcase)
irb> %w[this set of words is in a certain order].map {|e| e.upcase}
=> ["THIS", "SET", "OF", "WORDS", "IS", "IN", "A", "CERTAIN", "ORDER"]
each throws away the block's results; map collects them in a new array for you.
You are not mutating the input array. Although each element is upcased while iterating, the result is discarded and the original array is returned unchanged. Use upcase! instead:
# Modifies the array in place and returns the modified version:
>> %w[this set of words is in a certain order].each {|e| e.upcase!}
=> ["THIS", "SET", "OF", "WORDS", "IS", "IN", "A", "CERTAIN", "ORDER"]
# Assign it to a variable to get the up-cased array:
up_cased = %w[this set of words is in a certain order].each {|e| e.upcase!}
up_cased
# => ["THIS", "SET", "OF", "WORDS", "IS", "IN", "A", "CERTAIN", "ORDER"]
If you were to print them inside the each block, you would see them upcased, but the original, unmutated array is still returned.
# Calls upcase for proof, but the original array is still returned:
>> %w[this set of words is in a certain order].each {|e| puts e.upcase}
THIS
SET
OF
WORDS
IS
IN
A
CERTAIN
ORDER
=> ["this", "set", "of", "words", "is", "in", "a", "certain", "order"]
It is a little easier to see if you operate on a variable:
arr = %w[this set of words is in a certain order]
# upcase, but don't modify original
arr.each {|e| e.upcase}
arr.inspect
# ["this", "set", "of", "words", "is", "in", "a", "certain", "order"]
# now modify in place with upcase!
arr.each {|e| e.upcase!}
arr.inspect
# ["THIS", "SET", "OF", "WORDS", "IS", "IN", "A", "CERTAIN", "ORDER"]
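A third option, not mentioned in the answers above, is Array#map!, which stores new strings back into the same array without mutating the original string objects. A short sketch contrasting the three methods (variable names are illustrative):

```ruby
words = %w[this set of words]

# each returns its receiver; the block's return values are discarded.
returned = words.each { |w| w.upcase }
p returned.equal?(words)   # prints true
p words                    # prints ["this", "set", "of", "words"]

# map builds a new array from the block's return values.
upcased = words.map(&:upcase)
p upcased                  # prints ["THIS", "SET", "OF", "WORDS"]
p words                    # prints ["this", "set", "of", "words"]

# map! stores the block's return values back into the same array.
words.map!(&:upcase)
p words                    # prints ["THIS", "SET", "OF", "WORDS"]
```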
