Difference between 2 ways working with pipes using ARGF? - ruby

Using ARGF I can create Ruby programs that respect pipelines. Suppose, I to constantly read new entries:
$ tail -f log/test.log | my_prog
I can do this using:
ARGF.each_line do |line|
...
end
Also, I found another way:
while input = ARGF.gets
input.each_line do |line|
...
end
end
Looks like, both variants do the same thing or there is a difference between them? If so, what is it?
Thanks in advance.

As Stefan mentioned, you did a little mistake in second case. Proper way of using "ARGF.gets" approach in your case will look like:
while input = ARGF.gets
# input here represents a line
end
If you rewrite the second example as above, you will not have difference in behavior.
Actual difference you may notice between ARGF#gets and ARGF#each_line is in semantics: each_line accepts block or returns enumerator and gets returns a next line if it is available.
Another option is to use Kernel#gets. Beware it's behavior may differ from ARGF#gets in some cases, especially if you change a separator:
A separator of nil reads the entire contents, and a zero-length separator reads the input one paragraph at a time, where paragraphs are divided by two consecutive newlines.
But for reading (and then printing) constantly from stdin you may use it as follows:
print while gets

Related

How to delete specific lines in text file?

Suppose, I have an input.txt file with the following text:
First line
Second line
Third line
Fourth line
I want to delete, for example, the second and fourth lines to get this:
First line
Third line
So far, I've managed to delete only one the second line using this code
require 'fileutils'
File.open('output.txt', 'w') do |out_file|
File.foreach('input.txt') do |line|
out_file.puts line unless line =~ /Second/
end
end
FileUtils.mv('output.txt', 'input.txt')
What is the right way to delete multiple lines in text file in Ruby?
Deleting lines cleanly and efficiently from a text file is "difficult" in the general case, but can be simple if you can constrain the problem somewhat.
Here are some questions from SO that have asked a similar question:
How do I remove lines of data in the middle of a text file with Ruby
Deleting a specific line in a text file?
Deleting a line in a text file
Delete a line of information from a text file
There are numerous others, as well.
In your case, if your input file is relatively small, you can easily afford to use the approach that you're using. Really, the only thing that would need to change to meet your criteria is to modify your input file loop and condition to this:
File.open('output.txt', 'w') do |out_file|
File.foreach('input.txt').with_index do |line,line_number|
out_file.puts line if line_number.even? # <== line numbers start at 0
end
end
The changes are to capture the line number, using the with_index method, which can be used due to the fact that File#foreach returns an Enumerator when called without a block; the block now applies to with_index, and gains the line number as a second block argument. Simply using the line number in your comparison gives you the criteria that you specified.
This approach will scale, even for somewhat large files, whereas solutions that read the entire file into memory have a fairly low upper limit on file size. With this solution, you're more constrained by available disk space and speed at which you can read/write the file; for instance, doing this to space-limited online storage may not work as well as you'd like. Writing to local disk or thumb drive, assuming that you have space available, should be no problem at all.
Use File.readlines to get an array of the lines in your input file.
input_lines = File.readlines('input.txt')
Then select only those with an even index.
output_lines = input_lines.select.with_index { |_, i| i.even? }
Finally, write those in your output file.
File.open('output.txt', 'w') do |f|
output_lines.each do |line|
f.write line
end
end

What does $/ mean in Ruby?

I was reading about Ruby serialization (http://www.skorks.com/2010/04/serializing-and-deserializing-objects-with-ruby/) and came across the following code. What does $/ mean? I assume $ refers to an object?
array = []
$/="\n\n"
File.open("/home/alan/tmp/blah.yaml", "r").each do |object|
array << YAML::load(object)
end
$/ is a pre-defined variable. It's used as the input record separator, and has a default value of "\n".
Functions like gets uses $/ to determine how to separate the input. For example:
$/="\n\n"
str = gets
puts str
So you have to enter ENTER twice to end the input for str.
Reference: Pre-defined variables
This code is trying to read each object into an array element, so you need to tell it where one ends and the next begins. The line $/="\n\n" is setting what ruby uses to to break apart your file into.
$/ is known as the "input record separator" and is the value used to split up your file when you are reading it in. By default this value is set to new line, so when you read in a file, each line will be put into an array. What setting this value, you are telling ruby that one new line is not the end of a break, instead use the string given.
For example, if I have a comma separated file, I can write $/="," then if I do something like your code on a file like this:
foo, bar, magic, space
I would create an array directly, without having to split again:
["foo", " bar", " magic", " space"]
So your line will look for two newline characters, and split on each group of two instead of on every newline. You will only get two newline characters following each other when one line is empty. So this line tells Ruby, when reading files, break on empty lines instead of every line.
I found in this page something probably interesting:
http://www.zenspider.com/Languages/Ruby/QuickRef.html#18
$/ # The input record separator (eg #gets). Defaults to newline.
The $ means it is a global variable.
This one is however special as it is used by Ruby. Ruby uses that variable as a input record separator
For a full list with the special global variables see:
http://www.rubyist.net/~slagell/ruby/globalvars.html

Line count in csv doesn't match

I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
It the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.

Why does Ruby's 'gets' includes the closing newline?

I never need the ending newline I get from gets. Half of the time I forget to chomp it and it is a pain in the....
Why is it there?
Like puts (which sounds similar), it is designed to work with lines, using the \n character.
gets takes an optional argument that is used for "splitting" the input (or "just reading till it arrives). It defaults to the special global variable $/, which contains a \n by default.
gets is a pretty generic method for readings streams and includes this separator. If it would not do it, parts of the stream content would be lost.
var = gets.chomp
This puts it all on one line for you.
If you look at the documentation of IO#gets, you'll notice that the method takes an optional parameter sep which defaults to $/ (the input record separator). You can decide to split input on other things than newlines, e.g. paragraphs ("a zero-length separator reads the input a paragraph at a time (two successive newlines in the input separate paragraphs)"):
>> gets('')
dsfasdf
fasfds
dsafadsf #=> "dsfasdf\nfasfds\n\n"
From a performance perspective, the better question would be "why should I get rid of it?". It's not a big cost, but under the hood you have to pay to chomp the string being returned. While you may never have had a case where you need it, you've surely had plenty of cases where you don't care -- gets s; puts stuff() if s =~ /y/i, etc. In those cases, you'll see a (tiny, tiny) performance improvement by not chomping.
How I auto-detect line endings:
# file open in binary mode
line_ending = 13.chr + 10.chr
check = file.read(1000)
case check
when /\r\n/
# already set
when /\n/
line_ending = 10.chr
when /\r/
line_ending = 13.chr
end
file.rewind
while !file.eof?
line = file.gets(line_ending).chomp
...
end

Finding lines in a text file matching a regular expression

Can anyone explain how I could use regular expressions in Ruby to only return the matches of a string.
For example, if the code reads in a .txt file with a series of names in it:
John Smith
James Jones
David Brown
Tom Davidson
etc etc
..and the word to match is typed in as being 'ohn', it would then just return 'John Smith', but none of the other names.
Note: Instead of using File.each_line, use IO.foreach in modern Rubies instead. For instance:
[1] pry(main)> IO.foreach('./.bashrc') do |l|
[1] pry(main)* puts l
[1] pry(main)* end
export PATH=~/bin:$PATH
export EDITOR='vi'
export VISUAL=$EDITOR
Progress happens and things change.
Here are some different ways to get where you're going.
Notice first I'm using a more idiomatic way of writing the code for reading lines from a file. Ruby's IO and File libraries make it very easy to open, read and close the file in a nice neat package.
File.each_line('file.txt') do |li|
puts li if (li['ohn'])
end
That looks for 'ohn' anywhere in the line, but doesn't bother with a regular expression.
File.each_line('file.txt') do |li|
puts li if (li[/ohn/])
end
That looks for the same string, only it uses a regex to get there. Functionally it's the same as the first example.
File.each_line('file.txt') do |li|
puts li if (li[/ohn\b/])
end
This is a bit smarter way of looking for names that end with 'ohn'. It uses regex but also specifies that the pattern has to occur at the end of a word. \b means "word-boundary".
Also, when reading files, it's important to always think ahead about whether the file being read could ever exceed the RAM available to your app. It's easy to read an entire file into memory in one pass, then process it from RAM, but you can cripple or kill your app or machine if you exceed the physical RAM available to you.
Do you know if the code shown by the other answers is in fact loading the entire file into RAM or is somehow optimized by streaming from the readlines function to the select function?
From the IO#readlines documentation:
Reads the entire file specified by name as individual lines, and returns those lines in an array. Lines are separated by sep.
An additional consideration is memory allocation during a large, bulk read. Even if you have sufficient RAM, you can run into situations where a language chokes as it reads in the data, finds it hasn't allocated enough memory to the variable, and has to pause as it grabs more. That cycle repeats until the entire file is loaded.
I became sensitive to this many years ago when I was loading a very big data file into a Perl app on HP's biggest mini, that I managed. The app would pause for a couple seconds periodically and I couldn't figure out why. I dropped into the debugger and couldn't find the problem. Finally, by tracing the run using old-school print statements I isolated the pauses to a file "slurp". I had plenty of RAM, and plenty of processing power, but Perl wasn't allocating enough memory. I switched to reading line by line and the app flew through its processing. Ruby, like Perl, has good I/O, and can read a big file very quickly when it's reading line-by-line. I have never found a good reason for slurping a text file, except when it's possible to have content I want spread across several lines, but that is not a common occurrence.
Maybe I'm not understanding the problem fully, but you could do something like:
File.readlines("path/to/file.txt").select { |line| line =~ /ohn/ }
to get an array of all the lines that match your criteria.
query = 'ohn'
names = File.readlines('names.txt')
matches = names.select { |name| name[/#{query}/i] }
#=> ["John Smith"]
Remove the i at the end of the regex if you wish the query to be case sensitive.
Old question, but Array#grep can also be used to search a list of strings
File.readlines("names.txt").grep /#{query}/i

Resources