I'm trying to take the string "xxxyyyzzz" and split it up into an array that groups the same letters. So I want the output to be ["xxx","yyy","zzz"]. I'm not sure why this code keeps on looping. Any suggestions?
def split_up(str)
  i = 1
  result = []
  array = str.split("")
  until array == []
    if array[i] == array[i-1]
      i += 1
    else
      result << array.shift(i).join("")
    end
    i = 1
  end
  result
end
puts split_up("xxxyyyzzz")
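The loop never ends because the until condition is never met. You increment i when successive characters match, but at the end of every iteration you reset i to 1, so i never gets past 2 and the else branch (the only place the array shrinks) is never reached.
If you edit this section and add this line:
until array == []
  puts i # new line
Then you'll see that i is always 1, and the code keeps printing 1 forever.
Move the i = 1 line into the else branch, right after the shift, and you'll get the result you want. (Simply deleting the line happens to work for "xxxyyyzzz", but only because every group has the same length.)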
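As a minimal sketch of that fix, here is the question's method with the reset applied only after a group has been shifted off the array. The second input is an assumed extra case, added to show groups of unequal length:

```ruby
def split_up(str)
  i = 1
  result = []
  array = str.split("")
  until array.empty?
    if array[i] == array[i - 1]
      i += 1                             # still inside the same run of letters
    else
      result << array.shift(i).join("")  # remove the finished run from the front
      i = 1                              # restart the counter only after a shift
    end
  end
  result
end

p split_up("xxxyyyzzz") #=> ["xxx", "yyy", "zzz"]
p split_up("xxyyyz")    #=> ["xx", "yyy", "z"]
```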
Also, you may be interested in reading about the Ruby string scan method, and pattern matching and capture groups, and using look-ahead and look-behind zero-length assertions, which can match boundaries.
Here is how I would personally accomplish splitting a string at letter boundaries:
"xxxyyyzzz".scan(/(.)(\1*)/).map{|a,b| a+b }
=> ["xxx", "yyy", "zzz"]
The scan method is doing this:
. matches any character e.g. "x", and the parentheses capture this.
\1* matches the previous capture any number of times, e.g. "xx", and the parentheses capture this.
Thus $1 holds the first character "x" and $2 holds all the repeats "xx".
The map block concatenates the first character and its repeats, giving "xxx".
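To make the intermediate step visible, this short sketch prints what scan returns before map joins each pair. The chunk_while line is a regex-free alternative that is not part of the answer above; it is shown only for comparison and assumes Ruby 2.3 or later.

```ruby
pairs = "xxxyyyzzz".scan(/(.)(\1*)/)
p pairs                       #=> [["x", "xx"], ["y", "yy"], ["z", "zz"]]
p pairs.map { |a, b| a + b }  #=> ["xxx", "yyy", "zzz"]

# Alternative without a regex: group neighbouring characters that are equal.
groups = "xxxyyyzzz".each_char.chunk_while { |a, b| a == b }.map(&:join)
p groups                      #=> ["xxx", "yyy", "zzz"]
```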
As mentioned above, this can be solved using scan like this:
def split_up(string)
  repeat_alphabets = /(\w)(\1*)/
  string.scan(repeat_alphabets).map do |match|
    match[0] << match[1]
  end
end
Explanation:
The regular expression matches runs of a repeated word character, but because it contains two capture groups, scan returns each match as a pair: the first character and the remaining repeats.
match[0] << match[1] joins the pair to form the required string.
map collects these strings into an array, which is returned because it is the value of the last expression in the method.
Using the file oliver.txt, write a method called count_paragraphs that counts the number of paragraphs in the text.
In oliver.txt the paragraph delimiter consists of two or more consecutive newline characters, like this: \n\n, \n\n\n, or even \n\n\n\n.
Your method should return either the number of paragraphs or nil.
I have this code but it doesn't work:
def count_paragraphs(some_file)
  file_content = open(some_file).read()
  count = 0
  file_content_split = file_content.split('')
  file_content_split.each_index do |index|
    count += 1 if file_content_split[index] == "\n" && file_content_split[index + 1] == "\n"
  end
  return count
end
# test code
p count_paragraphs("oliver.txt")
It's much easier to either count it directly:
file_content.split(/\n\n+/).count
or count the separators and add one:
file_content.scan(/\n\n+/).count + 1
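One caveat with both one-liners is that delimiters at the very start or end of the text are counted too. The sketch below uses a hypothetical helper, count_paragraphs_in, that takes a string rather than a file name and ignores blank chunks; the sample strings are made up for illustration.

```ruby
# Hypothetical helper: counts paragraphs in a string, skipping blank chunks.
def count_paragraphs_in(text)
  text.split(/\n{2,}/).count { |chunk| !chunk.strip.empty? }
end

p count_paragraphs_in("one\n\ntwo\n\n\nthree\n")  #=> 3
p count_paragraphs_in("\n\none\n\ntwo")           #=> 2
# The plain split counts the empty chunk before the leading delimiter:
p "\n\none\n\ntwo".split(/\n\n+/).count           #=> 3
```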
To determine the number of paragraphs there is no need to construct an array and determine its size. One can instead operate on the string directly by creating an enumerator and counting the number of elements it will generate (after some cleaning of the file contents). This can be done with an unconventional (but highly useful) form of the method String#gsub.
Code
def count_paragraphs(fname)
  (File.read(fname).gsub(/ +$/,'') << "\n\n").gsub(/\S\n{2,}/).count
end
Examples
First let us construct a text file.
str =<<BITTER_END
 

Now is the time
for all good
Rubiest to take
a break.
 
 
Oh, happy
day.

One for all,
all for one.

 
Amen!
BITTER_END
# " \n\nNow is the time\nfor all good\nRubiest to take\na break.\n \n \nOh, happy\nday.\n\nOne for all,\nall for one.\n\n \nAmen!\n"
Note the embedded spaces.
FNAME = 'temp'
File.write(FNAME, str)
#=> 128
Now test the method with this file.
count_paragraphs(FNAME)
#=> 4
One more:
count_paragraphs('oliver.txt')
#=> 61
Explanation
The first step is to deal with ill-formed text by removing spaces immediately preceding newlines:
File.read(fname).gsub(/ +$/,'')
#=> "\n\nNow is the time\nfor all good\nRubiest to take\na break.\n\n\nOh, happy\nday.\n\nOne for all,\nall for one.\n\n\nAmen!\n"
Next, two newlines are appended so we can identify all paragraphs, including the last, as containing a non-whitespace character followed by two or more newlines.[1]
Note that files containing only spaces and newlines are found to contain zero paragraphs.
If the file is known to contain no ill-formed text, the operative line of the method can be simplified to:
(File.read(fname) << "\n\n").gsub(/\S\n{2,}/).count
See Enumerable#count and IO.read. (As File.superclass #=> IO, read is also available as a class method of File, and seems to be more commonly invoked on that class than on IO.)
Note that String#gsub called with a pattern but without a replacement or a block returns an enumerator (to which Enumerable#count is applied).
Aside: I believe this form of gsub would be more widely used if it merely had a separate name, such as pattern_match. Calling it gsub seems a misnomer, as it has nothing to do with "substitution", "global" or otherwise.
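A small illustration of that block-less form of gsub, using a made-up sample string: the enumerator yields each match in turn and leaves the original string untouched.

```ruby
text = "one\n\ntwo\n\n\nthree\n\n"
enum = text.gsub(/\S\n{2,}/)   # no replacement, no block

p enum.class  #=> Enumerator
p enum.to_a   #=> ["e\n\n", "o\n\n\n", "e\n\n"]
p enum.count  #=> 3
p text        # unchanged: no substitution took place
```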
[1] I revised my original answer to deal with ill-formed text, and in doing so borrowed @Kimmo's idea of requiring matches to include a non-whitespace character.
How about a loop that remembers the previous character and whether it is currently inside or outside a paragraph?
def count_paragraphs(some_file)
  paragraphs = 0
  in_paragraph = false
  previous_char = ""
  File.open(some_file).each_char do |char|
    if !in_paragraph && char != "\n"
      paragraphs += 1
      in_paragraph = true
    elsif in_paragraph && char == "\n" && previous_char == "\n"
      in_paragraph = false
    end
    previous_char = char
  end
  paragraphs
rescue
  nil
end
This solution does not build any temporary arrays of the full content so you could parse a huge file without it being read into memory. Also, there are no regular expressions.
The rescue was added because of the requirement "Your method should return either the number of paragraphs or nil", which does not clearly define when nil should be returned. Here nil is returned if any exception is raised, for example when the file isn't found or can't be read; the exception is caught by the rescue.
You don't need an explicit return in Ruby. The return value of the last statement will be used as the method's return value.
I am trying to do this test, and there are a bunch of solutions online and here, but I first want to figure out why my solution is wrong even though it seems to print the right results when I enter certain strings:
Here is what they are asking:
Write a method that takes in a string. Return the longest word in the
string. You may assume that the string contains only letters and
spaces.
You may use the String split method to aid you in your quest.
Here is my solution, where I thought I could turn the string into an array, sort it by length in descending order, and then return the first element of that array, like this:
def longest_word(sentence)
  sentence = sentence.split
  sentence.sort_by! { |longest| -longest.length }
  return sentence[0]
end
That obviously doesn't seem to work, since their test gives me false for everything. Here is the test:
puts("\nTests for #longest_word")
puts("===============================================")
puts(
  'longest_word("short longest") == "longest": ' +
  (longest_word('short longest') == 'longest').to_s
)
puts(
  'longest_word("one") == "one": ' +
  (longest_word('one') == 'one').to_s
)
puts(
  'longest_word("abc def abcde") == "abcde": ' +
  (longest_word('abc def abcde') == 'abcde').to_s
)
puts("===============================================")
So the question is: why? Can I just fix my code, or is the idea all wrong and I need to do it completely differently?
str = "Which word in this string is longest?"
r = /[[:alpha:]]+/
str.scan(r).max_by(&:length)
#=> "longest"
This regular expression reads, "match one or more letters". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched.
To deal with words that are hyphenated or contain single quotes, the following is an imperfect modification[1]:
str = "Who said that chicken is finger-licken' good?"
r = /[[[:alpha:]]'-]+/
str.scan(r).max_by(&:length)
#=> "finger-licken'"
This regular expression reads, "match one or more characters that are a letter, apostrophe or hyphen". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched.
[1] I've successfully used "finger-licken'" in Scrabble.
I'd write it something like:
str = "Write a method that takes in a string"
str.split.sort_by(&:length).last # => "string"
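The same idea can be wrapped in the method the exercise asks for. This is a sketch, not the exercise's reference solution: it uses max_by instead of sorting, which avoids relying on the order sort_by gives to words of equal length (sort_by is not guaranteed to be stable).

```ruby
def longest_word(sentence)
  # split with no argument breaks on runs of whitespace
  sentence.split.max_by(&:length)
end

p longest_word("short longest")  #=> "longest"
p longest_word("one")            #=> "one"
p longest_word("abc def abcde")  #=> "abcde"
p longest_word("")               #=> nil
```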
I am trying to call the first duplicate character in my string in Ruby.
I have defined an input string using gets.
How do I call the first duplicate character in the string?
This is my code so far.
string = "#{gets}"
print string
How do I call a character from this string?
Edit 1:
This is the code I have now. My output is "no duplicates" printed 26 times. I think my if statement is written incorrectly.
string = "abcade"
puts string
for i in ('a'..'z')
  if string =~ /(.)\1/
    puts string.chars.group_by{|c| c}.find{|el| el[1].size >1}[0]
  else
    puts "no duplicates"
  end
end
My second puts statement works but with the for and if loops, it returns no duplicates 26 times whatever the string is.
The following returns the index of the first character that is immediately followed by a copy of itself (or nil if there is none):
the_string =~ /(.)\1/
Example:
'1234556' =~ /(.)\1/
=> 4
To get the duplicate character itself, use $1:
$1
=> "5"
Example usage in an if statement:
if my_string =~ /(.)\1/
  # found duplicate; potentially do something with $1
else
  # there is no match
end
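As a sketch of the same check without the $1 global, the hypothetical method first_adjacent_duplicate below uses String#match and reads the capture from the returned MatchData. It also shows why the question's loop prints "no duplicates": "abcade" contains no two adjacent equal characters, and the loop repeats the same test 26 times without using i.

```ruby
def first_adjacent_duplicate(string)
  match = string.match(/(.)\1/)  # nil when no character is immediately repeated
  match && match[1]
end

p first_adjacent_duplicate("1234556")  #=> "5"
p first_adjacent_duplicate("abcade")   #=> nil
```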
s.chars.map { |c| [c, s.count(c)] }.drop_while{|i| i[1] <= 1}.first[0]
With the refined form from Cary Swoveland:
s.each_char.find { |c| s.count(c) > 1 }
The method below might be useful for finding a repeated word in a string; it returns the first word that has the highest count:
def firstRepeatedWord(string)
  h_data = Hash.new(0)
  string.split(" ").each{|x| h_data[x] +=1}
  h_data.key(h_data.values.max)
end
I believe the question can be interpreted in either of two ways (neither involving the first pair of adjacent characters that are the same) and offer solutions to each.
Find the first character in the string that is preceded by the same character
I don't believe we can use a regex for this (but would love to be proved wrong). I would use the method suggested in a comment by @DaveNewton:
require 'set'
def first_repeat_char(str)
  str.each_char.with_object(Set.new) { |c,s| return c unless s.add?(c) }
  nil
end
first_repeat_char("abcdebf") #=> "b"
first_repeat_char("abcdcbe") #=> "c"
first_repeat_char("abcdefg") #=> nil
Find the first character in the string that appears more than once
r = /
    (.)  # match any character in capture group #1
    .*   # match any character zero or more times
    ?    # do the preceding lazily
    \K   # forget everything matched so far
    \1   # match the contents of capture group 1
    /x
"abcdebf"[r] #=> "b"
"abccdeb"[r] #=> "b"
"abcdefg"[r] #=> nil
This regex is fine, but produces the warning, "regular expression has redundant nested repeat operator '*'". You can disregard the warning or suppress it by doing something clunky, like:
r = /([^#{0.chr}]).*?\K\1/
where ([^#{0.chr}]) means "match any character other than 0.chr in capture group 1".
Note that a positive lookbehind cannot be used here, as they cannot contain variable-length matches (i.e., .*).
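The two interpretations give different answers on the same input, which this sketch makes explicit. The name first_char_with_duplicate and its index-based approach are assumptions for illustration, not the regex used above; the safe-navigation operator needs Ruby 2.3 or later.

```ruby
require 'set'

# Interpretation 1: the first character that has already been seen earlier.
def first_repeat_char(str)
  seen = Set.new
  str.each_char.find { |c| !seen.add?(c) }  # add? returns nil for a repeat
end

# Interpretation 2: the first character that occurs again somewhere later.
def first_char_with_duplicate(str)
  str.each_char.with_index.find { |c, i| str.index(c, i + 1) }&.first
end

p first_repeat_char("abcdcbe")          #=> "c"
p first_char_with_duplicate("abcdcbe")  #=> "b"
p first_repeat_char("abcdefg")          #=> nil
p first_char_with_duplicate("abcdefg")  #=> nil
```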
You could probably make your string an array and use detect. This should return the first char where the count is > 1.
string.split("").detect {|x| string.count(x) > 1}
I'd use a positive lookahead with the String#[] method:
"abcccddde"[/(.)(?=\1)/] #=> "c"
As a variant:
str = "abcdeff"
p str.chars.group_by{|c| c}.find{|el| el[1].size > 1}[0]
prints "f"
How do I get the first word from each line? Thanks to help from someone on Stack Overflow, I am working with the code below:
File.open("pastie.rb", "r") do |file|
  while (line = file.gets)
    next if (line[0,1] == " ")
    labwords = line.split.first
    print labwords.join(' ')
  end
end
It extracts the first word from each line, but it has problems with spaces. I need help adjusting it. I need to use the first method, but I don't know how to use it.
If you want the first word from each line from a file:
first_words = File.read(file_name).lines.map { |l| l.split(/\s+/).first }
It's pretty simple. Let's break it apart:
File.read(file_name)
Reads the entire contents of the file and returns it as a string.
.lines
Splits a string by newline characters (\n) and returns an array of strings. Each string represents a "line."
.map { |l| ... }
Array#map calls the provided block passing in each item and taking the return value of the block to build a new array. Once Array#map finishes it returns the array containing new values. This allows you to transform the values. In the sample block here |l| is the block params portion meaning we're taking one argument and we'll reference it as l.
|l| l.split(/\s+/).first
This is the body of the block; I've included the block params here too for completeness. Here we split the line by /\s+/. This is a regular expression: \s means any whitespace character (such as \t, \n and space) and the + following it means one or more, so \s+ matches one or more whitespace characters, and it will try to match as many consecutive whitespace characters as possible. Passing this to String#split returns an array of the substrings that occur between the separators. Our separator is one or more whitespace characters, so we get everything between runs of whitespace. Given the string "A list of words" we get ["A", "list", "of", "words"] after the split call. Finally, we call .first, which returns the first element of the array (in this case, the first word).
Now, in Ruby, the evaluated value of the last expression in a block is automatically returned so our first word is returned and given that this block is passed to map we should get an array of the first words from a file. To demonstrate, let's take the input (assuming our file contains):
This is line one
And line two here
Don't forget about line three
Line four is very board
Line five is the best
It all ends with line six
Running this through the line above we get:
["This", "And", "Don't", "Line", "Line", "It"]
Which is the first word from each line.
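One detail matters for the "problems with spaces" mentioned in the question: with a regex separator, a line that starts with whitespace produces an empty first field, whereas split with no argument skips leading whitespace. A small sketch on a made-up string (not the question's file):

```ruby
text = "This is line one\n  And line two here\n\nLine four is last\n"

# With a regex separator, leading whitespace yields an empty first field.
p "  And line two here".split(/\s+/).first  #=> ""
# With no argument, split ignores leading whitespace.
p "  And line two here".split.first         #=> "And"

# Blank lines give nil, which compact removes.
first_words = text.lines.map { |line| line.split.first }.compact
p first_words  #=> ["This", "And", "Line"]
```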
Consider this:
def first_words_from_file(file_name)
  # readlines keeps the trailing "\n", so strip before testing for blank lines
  lines = File.readlines(file_name).reject { |line| line.strip.empty? }
  lines.map do |line|
    line.split.first
  end
end
puts first_words_from_file('pastie.rb')
I am looking to extract all Methionine residues to the end from a sequence.
In the below sequence:
MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
Original nucleotide sequence:
atgtttgaaatcgaagaacatatgaaggattcacaggtggaatacataattggccttcataatatcccattattgaatgcaactatttcagtgaagtgcacaggatttcaaagaactatgaatatgcaaggttgtgctaataaatttatgcaaagacattatgagaatcccctgacgggg
I want to extract from the sequence any M residue to the end, and obtain the following:
- MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MNMQGCANKFMQRHYENPLTG
- MQGCANKFMQRHYENPLTG
- MQRHYENPLTG
With the data I am working with there are cases where there are a lot more "M" residues in the sequence.
The script I currently have is below. This script translates the genomic data first and then works with the amino acid sequences. This does the first two extractions but nothing further.
I have tried to repeat the same scan method after the second scan (See the commented part in the script below) but this just gives me an error:
private method scan called for #<Array:0x7f80884c84b0> No Method Error
I understand I need to make a loop of some kind and have tried, but all in vain. I have also tried matching but I haven't been able to do so - I think that you cannot match overlapping characters a single match method but then again I'm only a beginner...
So here is the script I'm using:
#!/usr/bin/env ruby
require "bio"
def extract_open_reading_frames(input)
  file_output = File.new("./output.aa", "w")
  input.each_entry do |entry|
    i = 1
    entry.naseq.translate(1).scan(/M\w*/i) do |orf1|
      file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf1}"
      i = i + 1
      orf1.scan(/.(M\w*)/i) do |orf2|
        file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf2}"
        i = i + 1
        # orf2.scan(/.(M\w*)/i) do |orf3|
        #   file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf3}"
        #   i = i + 1
        # end
      end
    end
  end
  file_output.close
end
biofastafile = Bio::FlatFile.new(Bio::FastaFormat, ARGF)
extract_open_reading_frames(biofastafile)
The script has to be in Ruby since this is part of a much longer script that is in Ruby.
You can do:
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
str.scan(/(?=(M.*))./).flatten
#=> ["MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MNMQGCANKFMQRHYENPLTG", "MQGCANKFMQRHYENPLTG", "MQRHYENPLTG"]
This works by capturing, inside a lookahead, everything from each M to the end of the string, while the trailing . consumes only one character, so the next search starts just one position further on.
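For comparison, here is a sketch of the same extraction without a regex, next to the lookahead version. The helper name suffixes_from and the short sample sequence are made up for illustration.

```ruby
# Hypothetical helper: every suffix of the sequence that starts at the residue.
def suffixes_from(sequence, residue = "M")
  sequence.each_char.with_index
          .select { |char, _| char == residue }
          .map { |_, i| sequence[i..-1] }
end

sample = "MKAMQMT"  # assumed toy sequence
p sample.scan(/(?=(M.*))./).flatten  #=> ["MKAMQMT", "MQMT", "MT"]
p suffixes_from(sample)              #=> ["MKAMQMT", "MQMT", "MT"]
```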
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
pos = 0
while pos < str.size
  if md = str.match(/M.*/, pos)
    puts md[0]
    pos = md.offset(0)[0] + 1
  else
    break
  end
end
--output:--
MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MNMQGCANKFMQRHYENPLTG
MQGCANKFMQRHYENPLTG
MQRHYENPLTG
md -- stands for the MatchData object.
match() -- returns nil if there is no match; the second argument is the start position of the search.
md[0] -- is the whole match (md[1] would be the first parenthesized group, etc.).
md.offset(n) -- returns an array containing the beginning and ending position in the string of md[n].
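These pieces can be tried in isolation on a short made-up string. The sketch shows one step of the loop: the match, its offsets, and the next search starting one character after the previous match began.

```ruby
sample = "AAMKDM"  # assumed toy string

md = sample.match(/M.*/, 0)
p md[0]         #=> "MKDM"
p md.offset(0)  #=> [2, 6]

# Restart one position after the start of the previous match.
next_md = sample.match(/M.*/, md.offset(0)[0] + 1)
p next_md[0]    #=> "M"

p "AAA".match(/M.*/, 0)  #=> nil
```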
Running the program on the string "MMMM" produces the output:
MMMM
MMM
MM
M
I have also tried matching but I haven't been able to do so - I think
that you cannot match overlapping characters a single match method but
then again I'm only a beginner...
Yes, that's true. String#scan will not find overlapping matches. After scan finds a match, the search continues from the end of the match. Perl has some ways to make regexes back-up, I don't know whether Ruby has those.
Edit:
For Ruby 1.8.7:
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
pos = 0
while true
  str = str[pos..-1]
  if md = str.match(/M.*/)
    puts md[0]
    pos = md.offset(0)[0] + 1
  else
    break
  end
end