I have a file like this:
some content
some oterh
*********************
useful1 text
useful3 text
*********************
some other content
How do I get the content of the file between the two lines of stars into an array? For example, on processing the above file, the array should look like this:
a = ["useful1 text", "useful3 text"]
A really hacky solution is to split the content on the lines of stars, grab the middle part, and then split that on newlines, too:
content = File.read('test_doc.txt')
content.split(/^\*+$/)[1].split("\n").reject(&:empty?)
# => ["useful1 text", "useful3 text"]
f = File.open('test_doc.txt', 'r')
content = []
f.each_line do |line|
  content << line.rstrip unless !!(line =~ /^\*(\*)*\*$/)
end
f.close
The regex pattern /^\*(\*)*\*$/ matches lines consisting only of asterisks (at least two). !!(line =~ /^\*(\*)*\*$/) always returns a boolean value, so if the pattern does not match, the line is added to the array.
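For example:
!!("*********************" =~ /^\*(\*)*\*$/)  #=> true
!!("useful1 text" =~ /^\*(\*)*\*$/)           #=> false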
What about this:
def values_between(array, separator)
  array.slice array.index(separator)+1..array.rindex(separator)-1
end
filepath = '/tmp/test.txt'
lines = %w(trash trash separator content content separator trash)
separator = "separator\n"

File.write filepath, lines.join("\n")
values_between File.readlines(filepath), separator
#=> ["content\n", "content\n"]
I'd do it like this:
lines = []
File.foreach('./test.txt') do |li|
  lines << li if (li[/^\*{5}/] ... li[/^\*{5}/])
end
lines[1..-2].map(&:strip).select{ |l| l > '' }
# => ["useful1 text", "useful3 text"]
/^\*{5}/ means "a line that starts with at least five '*' characters".
... is one of two uses of .. and ... and, in this use, is commonly called a "flip-flop" operator. It isn't used often in Ruby because most people don't seem to understand it. It's sometimes mistaken for the Range delimiters .. and ....
In this use, Ruby watches for the first test, li[/^\*{5}/], to return true. Once it does, .. or ... will return true until the second condition returns true. In this case we're looking for the same delimiter, so the same test will work, li[/^\*{5}/], and this is where the difference between the two versions, .. and ..., comes into play.
.. will toggle back to false immediately, whereas ... will wait to look at the next line, which avoids the problem of the first test seeing a delimiter and then the second test seeing the same line and triggering.
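For example, here's a minimal sketch using a simplified "***" delimiter that shows the difference between the two forms:
lines = ["***", "a", "***", "b"]

two_dot = []
lines.each { |l| two_dot << l if (l == "***") .. (l == "***") }
two_dot   # => ["***", "***"]        (the closing test fires on the same line)

three_dot = []
lines.each { |l| three_dot << l if (l == "***") ... (l == "***") }
three_dot # => ["***", "a", "***"]   (the closing test waits for the next line)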
That lets the test append the matching lines to lines, which, prior to the [1..-2].map(&:strip).select{ |l| l > '' } cleanup, looks like:
# => ["*********************\n",
# "\n",
# "useful1 text\n",
# "\n",
# "useful3 text\n",
# "\n",
# "*********************\n"]
lines[1..-2].map(&:strip).select{ |l| l > '' } cleans that up: slicing the array with [1..-2] removes the first and last elements (the delimiter lines); strip removes leading and trailing whitespace, getting rid of the trailing newlines and leaving empty strings for the blank lines plus strings containing the desired text; and select{ |l| l > '' } keeps only the lines that are greater than an empty string, i.e., are not empty.
See "When would a Ruby flip-flop be useful?" and its related questions, and "What is a flip-flop operator?" for more information and some background. (Perl programmers use .. and ... often, for just this purpose.)
One warning though: If the file has multiple blocks delimited this way, you'll get the contents of them all. The code I wrote doesn't know how to stop until the end-of-file is reached, so you'll have to figure out how to handle that situation if it could occur.
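If that could occur, one rough sketch (dropping the flip-flop and tracking the delimiters by hand) would be to stop reading at the closing delimiter:
lines = []
inside = false
File.foreach('./test.txt') do |li|
  if li[/^\*{5}/]
    break if inside   # second delimiter reached, stop reading
    inside = true     # first delimiter reached, start collecting
    next
  end
  lines << li if inside
end
lines.map(&:strip).reject(&:empty?)
# => ["useful1 text", "useful3 text"]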
Using the file oliver.txt, write a method called count_paragraphs that counts the number of paragraphs in the text.
In oliver.txt the paragraph delimiter consists of two or more consecutive newline characters, like this: \n\n, \n\n\n, or even \n\n\n\n.
Your method should return either the number of paragraphs or nil.
I have this code but it doesn't work:
def count_paragraphs(some_file)
  file_content = open(some_file).read()
  count = 0
  file_content_split = file_content.split('')
  file_content_split.each_index do |index|
    count += 1 if file_content_split[index] == "\n" && file_content_split[index + 1] == "\n"
  end
  return count
end
# test code
p count_paragraphs("oliver.txt")
It's much easier to either count it directly:
file_content.split(/\n\n+/).count
or count the separators and add one:
file_content.scan(/\n\n+/).count + 1
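For instance, on a small string:
file_content = "first paragraph\nstill first\n\nsecond paragraph\n\n\nthird"
file_content.split(/\n\n+/).count     #=> 3
file_content.scan(/\n\n+/).count + 1  #=> 3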
To determine the number of paragraphs there is no need to construct an array and determine its size. One can instead operate on the string directly by creating an enumerator and counting the number of elements it will generate (after some cleaning of the file contents). This can be done with an unconventional (but highly useful) form of the method String#gsub.
Code
def count_paragraphs(fname)
  (File.read(fname).gsub(/ +$/,'') << "\n\n").gsub(/\S\n{2,}/).count
end
Examples
First let us construct a text file.
str = <<BITTER_END
 

Now is the time
for all good
Rubiest to take
a break.
 
 
Oh, happy
day.

One for all,
all for one.

 
Amen!
BITTER_END
# " \n\nNow is the time\nfor all good\nRubiest to take\na break.\n \n \nOh, happy\nday.\n\nOne for all,\nall for one.\n\n \nAmen!\n"
Note the embedded spaces.
FNAME = 'temp'
File.write(FNAME, str)
#=> 128
Now test the method with this file.
count_paragraphs(FNAME)
#=> 4
One more:
count_paragraphs('oliver.txt')
#=> 61
Explanation
The first step is to deal with ill-formed text by removing spaces immediately preceding newlines:
File.read(fname).gsub(/ +$/,'')
#=> "\n\nNow is the time\nfor all good\nRubiest to take\na break.\n\n\nOh, happy\nday.\n\nOne for all,\nall for one.\n\n\nAmen!\n"
Next, two newlines are appended so we can identify all paragraphs, including the last, as containing a non-whitespace character followed by two or more newlines.[1]
Note that files containing only spaces and newlines are found to contain zero paragraphs.
If the file is known to contain no ill-formed text, the operative line of the method can be simplified to:
(File.read(fname) << "\n\n").gsub(/\S\n{2,}/).count
See Enumerable#count and IO#read. (As File.superclass #=> IO, read is also available on the class File, and it seems to be more commonly invoked on that class than on IO.)
Note that String#gsub called with only a pattern (no replacement argument and no block) returns an enumerator, to which Enumerable#count is applied.
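For instance:
"a  b\n\nc".gsub(/\s+/)        #=> #<Enumerator: ...>
"a  b\n\nc".gsub(/\s+/).to_a   #=> ["  ", "\n\n"]
"a  b\n\nc".gsub(/\s+/).count  #=> 2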
Aside: I believe this form of gsub would be more widely used if it merely had a separate name, such as pattern_match. Calling it gsub seems a misnomer, as it has nothing to do with "substitution", "global" or otherwise.
[1] I revised my original answer to deal with ill-formed text, and in doing so borrowed @Kimmo's idea of requiring matches to include a non-whitespace character.
How about a loop that memoizes the previous character and a state of being in or outside of a paragraph?
def count_paragraphs(some_file)
  paragraphs = 0
  in_paragraph = false
  previous_char = ""

  File.open(some_file).each_char do |char|
    if !in_paragraph && char != "\n"
      paragraphs += 1
      in_paragraph = true
    elsif in_paragraph && char == "\n" && previous_char == "\n"
      in_paragraph = false
    end
    previous_char = char
  end

  paragraphs
rescue
  nil
end
This solution does not build any temporary arrays of the full content so you could parse a huge file without it being read into memory. Also, there are no regular expressions.
The rescue was added because the requirement "Your function should return either the number of paragraphs or nil" did not give a clear definition of when nil should be returned. In this case it will be returned if any exception happens, for example if the file isn't found or can't be read, which raises an exception that is caught by the rescue.
You don't need an explicit return in Ruby. The return value of the last statement will be used as the method's return value.
I am trying to read in a text file and iterate through every line. If the line contains "_u" then I want to copy that word in that line.
For example:
typedef struct {
reg 1;
reg 2;
} buffer_u;
I want to copy the word buffer_u.
This is what I have so far (everything up to how to copy the word in the string):
f_in = File.open( h_file )
text = f_in.read
text.each_line do |line|
  if line.include? "_u"
    # copy word
    # add to output file
  end
end
Thanks in advance for your help!
Don't make it harder than it has to be. If you want to scan a body of text for words that match a criteria, do just that:
text = "
word_u1
something
_u1 foo
bar _u2
another word_u2
typedef struct {
reg 1;
reg 2;
} buffer_u;
"
text.scan(/\w+/).select{ |w| w['_u'] }
# => ["word_u1", "_u1", "_u2", "word_u2", "buffer_u"]
Regexes are useful, but the more complex ("smarter") they are, the slower they run unless you are very careful to anchor them, as anchors give the engine hints on where to look. Without those, the engine tries a number of things to determine exactly what you want, and that can really bog down the processing.
I recommend instead simply grabbing the words in the text:
scan(/\w+/)
Then filtering out the ones that match:
select{ |w| w['_u'] }
Using select with a simple sub-string search w['_u'] is extremely fast.
It could probably run faster using split() instead of scan(/\w+/) but you'll have to deal with cleaning up non-word characters.
Note: \w means [a-zA-Z0-9_], so what we generally call a "word" character is actually closer to a "variable name" character in most languages, since natural-language words generally don't include digits or _.
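For the split-based variant mentioned above, a rough sketch (the cleanup regex here is just one way to strip surrounding punctuation):
text.split.map { |w| w.gsub(/\A\W+|\W+\z/, '') }.select{ |w| w['_u'] }
# => ["word_u1", "_u1", "_u2", "word_u2", "buffer_u"]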
You can probably reduce your code to:
File.read( h_file ).scan(/\w+/).select{ |w| w['_u'] }
That will return an array of matching words.
Caveat: Using read has scalability issues. If you're concerned about the size of the file being read (which you always should be) then use foreach and iterate over the file line-by-line. You will probably see no change in processing speed.
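A rough sketch of that line-by-line form, reusing the h_file name from the question:
matches = []
File.foreach(h_file) do |line|
  matches.concat(line.scan(/\w+/).select{ |w| w['_u'] })
end
matches  # => the matching words, collected without slurping the whole file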
You can try something like this:
words = []
File.open( h_file ) { |file| file.each_line { |line|
  words << line.scan(/\w+/).find { |a| a =~ /_u/ }
}}
words.compact!
# => ["buffer_u"]
puts words
# buffer_u
This regex should catch a word ending with _u
(\w*_u)(?!\w)
The matching group will match a word ending with _u that is not followed by letters, digits, or underscores.
If you want _u to appear anywhere in a word use
(\w*_u\w*)
This will return all such words in the file, even if there are two or more in a line:
r = /
    \w*   # match >= 0 word characters
    _u    # match the string '_u'
    \w*   # match >= 0 word characters
    /x    # extended mode
File.read(fname).scan r
For example:
str = "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
fname = 'temp'
File.write(fname, str)
#=> 63
Confirm the file contents:
File.read(fname)
#=> "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
Extract strings:
File.read(fname).scan r
#=> ["Cat_u", "dog_u", "pig_u_o", "cow_u"]
It's not difficult to modify this code to return at most one string per line. Simply read the file into an array of lines (or read a line at a time) and execute s = line[r]; arr << s if s for each line, where r is the above regex.
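A quick sketch of that per-line variant, using the same regex r and file as above:
arr = []
File.foreach(fname) do |line|
  s = line[r]
  arr << s if s
end
arr #=> ["Cat_u", "dog_u", "pig_u_o"]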
How do I get the first word from each line? Thanks to help from someone on Stack Overflow, I am working with the code below:
File.open("pastie.rb", "r") do |file|
while (line = file.gets)
next if (line[0,1] == " ")
labwords = line.split.first
print labwords.join(' ')
end
end
It extracts the first word from each line, but it has problems with spaces. I need help adjusting it. I need to use the first method, but I don't know how to use it.
If you want the first word from each line from a file:
first_words = File.read(file_name).lines.map { |l| l.split(/\s+/).first }
It's pretty simple. Let's break it apart:
File.read(file_name)
Reads the entire contents of the file and returns it as a string.
.lines
Splits a string by newline characters (\n) and returns an array of strings. Each string represents a "line."
.map { |l| ... }
Array#map calls the provided block passing in each item and taking the return value of the block to build a new array. Once Array#map finishes it returns the array containing new values. This allows you to transform the values. In the sample block here |l| is the block params portion meaning we're taking one argument and we'll reference it as l.
|l| l.split(/\s+/).first
This is the block internals; I've gone ahead and included the block params here too for completeness. Here we split the line by /\s+/. This is a regular expression: the \s means any whitespace character (space, tab, newline, and so on) and the + following it means one or more, so \s+ means one or more whitespace characters; of course, it will try to match as many consecutive whitespace characters as possible. Passing this to String#split returns an array of the substrings that occur between the given separator. Our separator was one or more whitespace characters, so we get everything between whitespace. If we had the string "A list of words" we'd get ["A", "list", "of", "words"] after the split call. It's very useful. Finally, we call .first, which returns the first element of an array (in this case, the first word).
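For example:
"A list of words".split(/\s+/)        #=> ["A", "list", "of", "words"]
"A list of words".split(/\s+/).first  #=> "A"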
Now, in Ruby, the evaluated value of the last expression in a block is automatically returned so our first word is returned and given that this block is passed to map we should get an array of the first words from a file. To demonstrate, let's take the input (assuming our file contains):
This is line one
And line two here
Don't forget about line three
Line four is very board
Line five is the best
It all ends with line six
Running this through the line above we get:
["This", "And", "Don't", "Line", "Line", "It"]
Which is the first word from each line.
Consider this:
def first_words_from_file(file_name)
  lines = File.readlines(file_name).reject { |l| l.strip.empty? }
  lines.map do |line|
    line.split.first
  end
end
puts first_words_from_file('pastie.rb')
I want to append </tag> to each line where it's missing:
text = '<tag>line 1</tag>
<tag>line2 # no closing tag, append
<tag>line3 # no closing tag, append
line4</tag> # no opening tag, but has a closing tag, so ignore
<tag>line5</tag>'
I tried to create a regular expression to match this, but I know it's wrong:
text.gsub! /.*?(<\/tag>)Z/, '</tag>'
How can I create a regular expression to conditionally append each line?
Here you go:
text.gsub!(%r{(?<!</tag>)$}, "</tag>")
Explanation:
$ means end of line and \z means end of string. \Z means something similar, with complications.
(?<! ... ) creates a negative lookbehind: the position matches only if it is not preceded by </tag>.
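Applied to a stripped-down version of your example (without the inline comments), that gives:
text = "<tag>line 1</tag>\n<tag>line2\n<tag>line3\nline4</tag>\n<tag>line5</tag>"
text.gsub!(%r{(?<!</tag>)$}, "</tag>")
puts text
# <tag>line 1</tag>
# <tag>line2</tag>
# <tag>line3</tag>
# line4</tag>
# <tag>line5</tag>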
Given the example provided, I'd just do something like this:
text.split(/<\/?tag>/).
  reject {|t| t.strip.length == 0 }.
  map {|t| "<tag>%s</tag>" % t.strip }.
  join("\n")
You're basically treating both <tag> and </tag> as record delimiters, so you can just split on them, reject any blank records, then construct a new combined string from the extracted values. This works nicely when you can't count on newlines being record delimiters and will generally be tolerant of missing tags.
If you're insistent on a pure regex solution, though, and your data format will always match the given format (one record per line), you can use a negative lookbehind:
text.strip.gsub(/(?<!<\/tag>)(\n|$)/, "</tag>\\1")
One that could work is:
/<tag>[^\n ]+[^>][\s]*(\n)/
This will return all the newline chars without a ">" before them.
Replace each match with "</tag>\n", i.e.
text.gsub!( /<tag>[^\n ]+[^>][\s]*(\n)/ , "</tag>\n")
For more polishing, try http://rubular.com/
text = '<tag>line 1</tag>
<tag>line2
<tag>line3
line4</tag>
<tag>line5</tag>'
result = ""
text.each_line do |line|
  line.rstrip!
  line << "</tag>" if not line.end_with?("</tag>")
  result << line << "\n"
end
puts result
--output:--
<tag>line 1</tag>
<tag>line2</tag>
<tag>line3</tag>
line4</tag>
<tag>line5</tag>
What is the best way to validate a gets input against a very long word list (a list of all the English words available)?
I am currently playing with readlines to manipulate the text, but before there's any manipulation, I would like to first validate the entry against the list.
The simplest way, but by no means the fastest, is to simply search against the word list each time. If the word list is in an array:
if word_list.index word
  #manipulate word
end
If, however, you had the word list as a separate file (with each word on a separate line), then we'll use File#foreach to find it:
if File.foreach("word.list") {|x| break x if x.chomp == word}
  #manipulate word
end
Note that foreach does not strip off the trailing newline character(s), so we get rid of them with String#chomp.
Here's a simple example using a Set, though Mark Johnson is right,
a bloom filter would be more efficient.
require 'set'
WORD_RE = /\w+/
# Read in the default dictionary (from /usr/share/dict/words),
# and put all the words into a set
WORDS = Set.new(File.read('/usr/share/dict/words').scan(WORD_RE))
# read the input line by line
STDIN.each_line do |line|
  # find all the words in the line that aren't contained in our dictionary
  unrecognized = line.scan(WORD_RE).find_all { |term| not WORDS.include? term }

  # if none were found, the line is valid
  if unrecognized.empty?
    puts "line is valid"
  else # otherwise, the line contains some words not in our dictionary
    puts "line is invalid, could not recognize #{unrecognized.inspect}"
  end
end
Are you reading the list from a file? Can't you have it all in memory? Maybe a finger tree could help you.
If not, there's not much more to it than reading a chunk of data from the file and grepping into it.
Read the word list into memory, and for each word, make an entry into a hash table:
def init_word_tester
  @words = {}
  File.foreach("word.list") {|word|
    @words[word.chomp] = 1
  }
end
now you can just check every word against your hash:
def test_word word
  return @words[word]
end
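A quick usage sketch (assuming word.list sits in the working directory and holds one word per line):
init_word_tester
word = gets.chomp
if test_word(word)
  # manipulate word
else
  puts "#{word} is not in the list"
end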