How do I get the first word from each line? Thanks to help from someone on Stack Overflow, I am working with the code below:
File.open("pastie.rb", "r") do |file|
  while (line = file.gets)
    next if (line[0,1] == " ")
    labwords = line.split.first
    print labwords.join(' ')
  end
end
It extracts the first word from each line, but it has problems with spaces. I need help adjusting it. I need to use the first method, but I don't know how to use it.
If you want the first word from each line from a file:
first_words = File.read(file_name).lines.map { |l| l.split(/\s+/).first }
It's pretty simple. Let's break it apart:
File.read(file_name)
Reads the entire contents of the file and returns it as a string.
.lines
Splits a string by newline characters (\n) and returns an array of strings. Each string represents a "line."
.map { |l| ... }
Array#map calls the provided block passing in each item and taking the return value of the block to build a new array. Once Array#map finishes it returns the array containing new values. This allows you to transform the values. In the sample block here |l| is the block params portion meaning we're taking one argument and we'll reference it as l.
|l| l.split(/\s+/).first
This is the body of the block; I've included the block parameters again for completeness. Here we split the line on /\s+/. This is a regular expression: \s matches any whitespace character (such as \t, \n, and space), and the + following it means "one or more," so \s+ matches a run of one or more whitespace characters, as long a run as possible. Passing this to String#split returns an array of the substrings that occur between matches of the separator. Our separator is one or more whitespace characters, so we get everything between runs of whitespace. Given the string "A list of words", the split call returns ["A", "list", "of", "words"]. Finally, we call .first, which returns the first element of an array (in this case, the first word).
Now, in Ruby, the evaluated value of the last expression in a block is automatically returned so our first word is returned and given that this block is passed to map we should get an array of the first words from a file. To demonstrate, let's take the input (assuming our file contains):
This is line one
And line two here
Don't forget about line three
Line four is very board
Line five is the best
It all ends with line six
Running this through the line above we get:
["This", "And", "Don't", "Line", "Line", "It"]
Which is the first word from each line.
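One caveat with the line above: splitting on /\s+/ keeps an empty leading field when a line starts with whitespace, so the "first word" of an indented line comes back as "". The sketch below is one way around that, assuming blank lines should simply be skipped; the helper name first_words is made up for illustration.

```ruby
# Hypothetical helper: first word of every non-blank line in a string.
def first_words(text)
  # String#split with no argument ignores leading whitespace, so an
  # indented line still yields its first real word. Blank lines yield
  # nil from `first`, and `compact` drops those nils.
  text.each_line.map { |line| line.split.first }.compact
end

sample = "This is line one\n  And an indented line\n\nLine four\n"
words = first_words(sample)
p words  # => ["This", "And", "Line"]

# For comparison, splitting on /\s+/ keeps an empty leading field:
leading = "  And an indented line".split(/\s+/).first
p leading  # => ""
```

To run it against a file, pass File.read(file_name) as the argument.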
Consider this:
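def first_words_from_file(file_name)
  # Lines from readlines keep their trailing newline, so strip before
  # testing for emptiness; otherwise blank lines are never rejected.
  lines = File.readlines(file_name).reject { |line| line.strip.empty? }
  lines.map do |line|
    line.split.first
  end
end
puts first_words_from_file('pastie.rb')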
I need to take a file name and an integer N, and return the first N unique words in the file given. Let us say that input.txt has this content:
I like pancakes in my breakfast. Also, I like pancakes in my dinner.
The output of running this with N = 13 could be
I
like
pancakes
in
my
breakfast.
Also,
dinner.
I know how to open the file and read line by line, but beyond that, I don't know how to take the unique words out of the lines.
Let's first create a test file.
str =<<END
We like pancakes for breakfast,
but we know others like waffles.
END
FName = 'temp'
File.write(FName, str)
#=> 65 (characters written)
We need to return an array containing the first nbr_unique unique words from the file named fname, so let's write a method that will do that.
def unique_words(fname, nbr_unique)
  <code needed here>
end
You need to add unique words to an array that will be returned by this method, so let's begin by creating an empty array and then return that array at the end of the method.
def unique_words(fname, nbr_unique)
  arr = []
  <code needed here>
  arr
end
You know how to read a file line-by-line, so let's do that, using the class method IO::foreach (see note 1).
def unique_words(fname, nbr_unique)
  arr = []
  File.foreach(fname) do |line|
    <code need here to process line>
  end
  arr
end
The block variable line equals "We like pancakes for breakfast,\n" after the first line is read. Firstly, the newline character needs to be removed. Examine the methods of the class String to see if one can be used to do that.
The second line contains the word "we". I assume "We" and "we" are not to be regarded as unique words. This is usually handled by converting all characters of a string to either all lowercase or all uppercase. You can do this to each line or to each word (after words have been extracted from a line). Again, look for a suitable method in the class String for doing this.
Next you need to extract words from each line. Once again, look for a String method for doing that.
Next we need to determine if, say, "like" (or "LIKE") is to be added to the array arr. Look at the instance methods for the class Array for a suitable method. If it is added we need to see if arr now contains nbr_unique words. If it does we don't need to read any more lines of the file, so we need to break out of foreach's block (perhaps use the keyword break).
There's one more thing we need to take care of. The first line contains "breakfast,", the second, "waffles.". We obviously don't want the words returned to contain punctuation. There are two ways to do that. The first is to remove the punctuation, the second is to accept only letters.
Given a string that contains punctuation (a line or a word) we can create a second string that equals the original string with the punctuation removed. One way to do that is to use the method String#tr. Suppose the string is "breakfast,". Then
"breakfast,".tr(".,?!;:'", "") #=> "breakfast"
To accept only letters we can remove everything else, using any of the following regular expressions (all return "breakfast"):
"breakfast,".gsub(/[^a-zA-Z]+/, "")
"breakfast,".gsub(/[^a-z]+/i, "")
"breakfast,".gsub(/[^[:alpha:]]+/, "")
"breakfast,".gsub(/\P{L}+/, "")
The first two work with ASCII characters only. The third (a POSIX bracket expression) and the fourth (the \p{} construct, negated here as \P{}) work with Unicode (search within Regexp).
Note that it is more efficient to remove punctuation from a line before words are extracted.
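To make the ASCII versus Unicode distinction concrete, here is a small sketch. The sample word "café," is my own choice, not from the question; it is written with a \u escape so the snippet does not depend on the source file's encoding.

```ruby
word = "caf\u00e9,"  # "café," : one non-ASCII letter plus a comma

# ASCII-only class: the accented letter is treated as a non-letter.
ascii_only = word.gsub(/[^a-zA-Z]+/, "")

# POSIX bracket expression and Unicode property: both keep the "é".
posix    = word.gsub(/[^[:alpha:]]+/, "")
property = word.gsub(/\P{L}+/, "")

p ascii_only  # => "caf"
p posix       # => "café"
p property    # => "café"
```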
Extra credit: use Enumerator#with_object
Whenever you see an object (here arr) initialized to be empty, manipulated, and then returned at the end of a method, you should consider using the method Enumerator#with_object or (more commonly) Enumerable#each_with_object. Both of these return the object referred to in the method name.
The method IO::foreach returns an enumerator (an instance of the class Enumerator) when it does not have a block (see doc). We therefore could write
def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object([]) do |line, arr|
    <code need here to process line>
  end
end
We have eliminated two lines (arr = [] and arr), but have also confined arr's scope to the block. This is not a big deal but is the Ruby way.
More extra credit: use methods of the class Set
Suppose we wrote the following.
require 'set'
def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object(Set.new) do |line, set|
    <code need here to process line>
  end.to_a
end
When we extract the word "we" from the second line we need to check whether it should be added to the set. Since sets hold unique elements, we can simply try to add it. Here the attempt fails, because the set already contains that word from the first line of the file (assuming words have been downcased). A handy method for this is Set#add?:
set.add?("we")
#=> nil
Here the method returns nil, meaning the set already contains that word. It also tells us that we don't need to check if the set now contains nbr_unique words. Had we been able to add the word to the set, set (with the added word) would be returned.
The block returns the value of set (a set). The method Set#to_a converts that set to an array, which is returned by the method.
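A quick way to see both return values of Set#add? side by side is the snippet below. It is only an illustration; the variable names are mine.

```ruby
require 'set'

seen = Set.new

first_try  = seen.add?("we")  # word is new: returns the set itself
second_try = seen.add?("we")  # word already present: returns nil

puts "added"     if first_try
puts "duplicate" if second_try.nil?
puts seen.size   # the set still holds a single element
```

Because the return value is truthy only when the word was actually added, it can drive the "do we have enough words yet?" check directly.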
1 Notice that I've invoked the class method IO::foreach by writing File.foreach(fname)... in the code above. This is permissible because File is a subclass of IO (File.superclass #=> IO). I could have instead written IO.foreach(fname)..., but it is more common to use File as the receiver.
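For readers who want to check their own attempt, here is one possible way to fill in the placeholders, offered as a sketch rather than the intended answer. It assumes that words are runs of Unicode letters (so "don't" would split into "don" and "t") and that case should be ignored. The test file is written to a temporary location with Tempfile, which is my addition.

```ruby
require 'set'
require 'tempfile'

# Sketch: return the first nbr_unique distinct words in the file fname.
def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object(Set.new) do |line, set|
    # Lowercase the line, then treat each run of letters as a word.
    line.downcase.scan(/\p{L}+/) do |word|
      return set.to_a if set.size >= nbr_unique  # enough words: stop reading
      set.add(word)                              # duplicates are ignored
    end
  end.to_a
end

first_six = nil
all_words = nil
Tempfile.create('words') do |f|
  f.write("We like pancakes for breakfast,\nbut we know others like waffles.\n")
  f.flush
  first_six = unique_words(f.path, 6)
  all_words = unique_words(f.path, 100)  # more than the file contains
end

p first_six  # => ["we", "like", "pancakes", "for", "breakfast", "but"]
p all_words.size  # => 9
```

A Set remembers insertion order, so the words come back in the order they first appear in the file.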
I need to be able to take the last line of a string and put it in its own string. More importantly, I then need to remove from the original string the last line that has non-whitespace characters.
Consider a string like the following (line breaks written as \n):
str = "Hello\nThere\nWorld!\n\n"
First, use String#strip to remove trailing whitespace, and use String#split to break the string into an array where each element represents one line of the string.
str = str.strip.split("\n")
#=> ["Hello", "There", "World!"]
You can then extract the last line from the last element in the array using Array#pop.
last_line = str.pop
#=> "World!"
Finally, use Array#join to re-assemble the array.
str = str.join("\n")
#=> "Hello\nThere"
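The same steps can also be done without building an intermediate array. The sketch below uses String#rstrip and String#rpartition instead; this is an alternative to the approach above, not a correction of it, and the method name split_last_line is made up.

```ruby
# Hypothetical helper: returns [everything_before, last_non_blank_line].
def split_last_line(str)
  # rstrip drops trailing whitespace and blank lines; rpartition then
  # splits around the LAST newline that remains.
  head, _newline, last = str.rstrip.rpartition("\n")
  [head, last]
end

rest, last_line = split_last_line("Hello\nThere\nWorld!\n\n")
p last_line  # => "World!"
p rest       # => "Hello\nThere"
```

If the string has only one line, rpartition finds no newline and the first element comes back as an empty string.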
I have a file like this:
some content
some oterh
*********************
useful1 text
useful3 text
*********************
some other content
How do I get the content of the file between the two lines of stars into an array? For example, after processing the above file the array should look like this:
a = ["useful1 text", "useful3 text"]
A really hacky solution is to split the content on the lines of stars, grab the middle part, and then split that into lines, too:
content.split(/^\*+$/)[1].lines.map(&:strip).reject(&:empty?)
# => ["useful1 text", "useful3 text"]
f = File.open('test_doc.txt', 'r')
content = []
f.each_line do |line|
  content << line.rstrip unless !!(line =~ /^\*(\*)*\*$/)
end
f.close
The regex pattern /^\*(\*)*\*$/ matches lines that consist only of asterisks (at least two of them). The !! in !!(line =~ /^\*(\*)*\*$/) converts the result to a boolean. So if the pattern does not match, the line is added to the array. Note that this keeps every line that is not a row of asterisks, including the lines outside the delimiters.
What about this:
def values_between(array, separator)
  array.slice(array.index(separator) + 1..array.rindex(separator) - 1)
end
filepath = '/tmp/test.txt'
lines = %w(trash trash separator content content separator trash)
separator = "separator\n"
File.write filepath, lines.join("\n")
values_between File.readlines(filepath), separator
#=> ["content\n", "content\n"]
I'd do it like this:
lines = []
File.foreach('./test.txt') do |li|
  lines << li if (li[/^\*{5}/] ... li[/^\*{5}/])
end
lines[1..-2].map(&:strip).select{ |l| l > '' }
# => ["useful1 text", "useful3 text"]
/^\*{5}/ means "a line that starts with at least five '*' characters."
The ... here is one of the two uses of .. and ...; in this use it is commonly called the "flip-flop" operator. It isn't used often in Ruby because most people don't seem to understand it. It's sometimes mistaken for the Range delimiters .. and ....
In this use, Ruby watches for the first test, li[/^\*{5}/], to return true. Once it does, .. or ... will return true until the second condition returns true. In this case we're looking for the same delimiter, so the same test, li[/^\*{5}/], will work, and this is where the difference between the two versions, .. and ..., comes into play.
.. will toggle back to false immediately, whereas ... will wait to look at the next line, which avoids the problem of the first test seeing a delimiter and then the second test seeing the same line and triggering.
That lets the test assign to lines, which, prior to the [1..-2].map(&:strip).select{ |l| l > '' } looks like:
# => ["*********************\n",
# "\n",
# "useful1 text\n",
# "\n",
# "useful3 text\n",
# "\n",
# "*********************\n"]
[1..-2].map(&:strip).select{ |l| l > '' } cleans that up. Slicing the array with [1..-2] removes the first and last elements. strip removes leading and trailing whitespace, which gets rid of the trailing newlines and leaves empty strings and strings containing the desired text. select{ |l| l > '' } keeps the strings that are greater than the empty string, i.e., are not empty.
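The difference between the two-dot and three-dot flip-flops is easiest to see on a tiny in-memory example. The array below is made-up data standing in for the file's lines, with "***" as the delimiter.

```ruby
lines = ["a", "***", "x", "y", "***", "b"]

# Two dots: the end test runs on the same element that switched the
# flip-flop on, so each delimiter opens and closes its own range.
two_dots = []
lines.each { |l| two_dots << l if (l == "***")..(l == "***") }

# Three dots: the end test waits for the next element, so the range
# stays open until the second delimiter is seen.
three_dots = []
lines.each { |l| three_dots << l if (l == "***")...(l == "***") }

p two_dots    # => ["***", "***"]
p three_dots  # => ["***", "x", "y", "***"]
```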
See "When would a Ruby flip-flop be useful?" and its related questions, and "What is a flip-flop operator?" for more information and some background. (Perl programmers use .. and ... often, for just this purpose.)
One warning though: If the file has multiple blocks delimited this way, you'll get the contents of them all. The code I wrote doesn't know how to stop until the end-of-file is reached, so you'll have to figure out how to handle that situation if it could occur.
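If only the first delimited block is wanted, one option is to drop the flip-flop and track the state by hand, stopping at the closing delimiter. This is a sketch under my own assumptions: the method name, the default delimiter pattern (a line made only of asterisks), and the sample data are all hypothetical.

```ruby
# Hypothetical helper: non-blank lines of the FIRST block enclosed by
# delimiter lines; stops reading at the closing delimiter.
def first_delimited_block(lines, delimiter = /\A\*+\s*\z/)
  inside = false
  block = []
  lines.each do |line|
    if line.match?(delimiter)
      break if inside  # second delimiter: the block is complete
      inside = true    # first delimiter: start collecting
      next
    end
    stripped = line.strip
    block << stripped if inside && !stripped.empty?
  end
  block
end

sample = ["some content\n", "*****\n", "useful1 text\n", "\n",
          "useful3 text\n", "*****\n", "other\n", "*****\n",
          "ignored\n", "*****\n"]
result = first_delimited_block(sample)
p result  # => ["useful1 text", "useful3 text"]
```

Passing File.foreach(path) instead of an array would read the file lazily and stop after the closing delimiter.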
I want to append </tag> to each line where it's missing:
text = '<tag>line 1</tag>
<tag>line2 # no closing tag, append
<tag>line3 # no closing tag, append
line4</tag> # no opening tag, but has a closing tag, so ignore
<tag>line5</tag>'
I tried to create a regular expression to match this but I know it's wrong:
text.gsub! /.*?(<\/tag>)\Z/, '</tag>'
How can I create a regular expression to conditionally append each line?
Here you go:
text.gsub!(%r{(?<!</tag>)$}, "</tag>")
Explanation:
$ means end of line and \z means end of string. \Z also means end of string, except that it will match just before a trailing newline.
(?<! ... ) is a negative lookbehind: the $ matches only where the text immediately before it is not </tag>.
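Here is the substitution applied to a cut-down version of the question's text, as a sketch. The sample string is mine, the comments from the question are left out, and the string has no trailing newline.

```ruby
text = "<tag>line 1</tag>\n<tag>line2\nline4</tag>\n<tag>line5</tag>"

# $ matches at the end of every line; the lookbehind (?<!</tag>)
# rejects the positions that already follow a closing tag.
fixed = text.gsub(%r{(?<!</tag>)$}, "</tag>")

puts fixed
# <tag>line 1</tag>
# <tag>line2</tag>
# line4</tag>
# <tag>line5</tag>
```

Using %r{...} instead of /.../ avoids having to escape the forward slash in </tag>.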
Given the example provided, I'd just do something like this:
text.split(/<\/?tag>/).
  reject {|t| t.strip.length == 0 }.
  map {|t| "<tag>%s</tag>" % t.strip }.
  join("\n")
You're basically treating either <tag> or </tag> as record delimiters, so you can just split on them, reject any blank records, then construct a new combined string from the extracted values. This works nicely when you can't count on newlines being record delimiters and will generally be tolerant of missing tags.
If you're insistent on a pure regex solution, though, and your data format will always match the given format (one record per line), you can use a negative lookbehind:
text.strip.gsub(/(?<!<\/tag>)(\n|$)/, "</tag>\\1")
One that could work is:
/(<tag>[^\n ]+[^>])[\s]*\n/
This matches each <tag> line whose last non-whitespace character is not ">", and captures the line's content.
Replace the match with the captured content followed by "</tag>\n", i.e.
text.gsub!(/(<tag>[^\n ]+[^>])[\s]*\n/, "\\1</tag>\n")
For more polishing, try http://rubular.com/
text = '<tag>line 1</tag>
<tag>line2
<tag>line3
line4</tag>
<tag>line5</tag>'
result = ""
text.each_line do |line|
  line.rstrip!
  line << "</tag>" unless line.end_with?("</tag>")
  result << line << "\n"
end
puts result
--output:--
<tag>line 1</tag>
<tag>line2</tag>
<tag>line3</tag>
line4</tag>
<tag>line5</tag>
What is the best way to validate a gets input against a very long word list (a list of all the English words available)?
I am currently playing with readlines to manipulate the text, but before there's any manipulation, I would like to first validate the entry against the list.
The simplest way, but by no means the fastest, is to simply search against the word list each time. If the word list is in an array:
if word_list.include?(word)
  # manipulate word
end
If, however, you have the word list in a separate file (with each word on a separate line), then we can use File.foreach to find it:
if File.foreach("word.list") { |x| break x if x.chomp == word }
  # manipulate word
end
Note that foreach does not strip off the trailing newline character(s), so we get rid of them with String#chomp.
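A variation on the same idea, offered as a sketch: called without a block, File.foreach returns an enumerator, and Enumerable#any? stops reading at the first match and returns a plain true or false. The method name listed? and the temporary word list are my own.

```ruby
require 'tempfile'

# Hypothetical helper: true if word appears on its own line in the file.
def listed?(path, word)
  # any? stops at the first line whose chomped text equals the word.
  File.foreach(path).any? { |line| line.chomp == word }
end

found = nil
missing = nil
Tempfile.create('wordlist') do |f|
  f.write("apple\nbanana\ncherry\n")
  f.flush
  found   = listed?(f.path, "banana")
  missing = listed?(f.path, "durian")
end

p found    # => true
p missing  # => false
```

This still rescans the file on every lookup, so it only suits occasional checks.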
Here's a simple example using a Set, though Mark Johnson is right, a bloom filter would be more efficient.
require 'set'
WORD_RE = /\w+/
# Read in the default dictionary (from /usr/share/dict/words),
# and put all the words into a set
WORDS = Set.new(File.read('/usr/share/dict/words').scan(WORD_RE))
# read the input line by line
STDIN.each_line do |line|
  # find all the words in the line that aren't contained in our dictionary
  unrecognized = line.scan(WORD_RE).find_all { |term| not WORDS.include? term }
  # if none were found, the line is valid
  if unrecognized.empty?
    puts "line is valid"
  else # otherwise, the line contains some words not in our dictionary
    puts "line is invalid, could not recognize #{unrecognized.inspect}"
  end
end
Are you reading the list from a file?
Can't you hold it all in memory?
If you can, a finger tree may help you.
If not, there's not much more to do than read a chunk of data from the file and grep through it.
Read the word list into memory, and for each word, make an entry into a hash table:
def init_word_tester
  @words = {}
  File.foreach("word.list") {|word|
    @words[word.chomp] = 1
  }
end
Now you can just check every word against your hash:
def test_word word
  return @words[word]
end