Ruby each_line reads line break too?

I'm trying to read data from a text file and join it with a post string. When there's only one line in the file, it works fine, but with 2 lines my request fails. Is each_line reading the line break? How can I correct it?
File.open('sfzh.txt', 'r') { |f|
  f.each_line { |row|
    send(row)
  }
}
I did work around the issue with split and an extra delimiter, but it just looks ugly.

Yes, each_line includes line breaks. But you can strip them easily using chomp:
File.foreach('test1.rb') do |line|
  send line.chomp
end
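To see what the question is running into: each yielded line keeps its terminator, and chomp strips it.
"one\ntwo\n".each_line { |row| p row }
# prints "one\n" then "two\n" -- the newline is part of each string
p "one\n".chomp
# => "one"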

Another way is to map strip onto each line as it is returned. To read a file line by line, stripping whitespace, and do something with each line, you can do the following:
File.readlines("path to file").map(&:strip).each do |line|
  # do something with line
end
Note that readlines pulls the whole file into memory at once.

Related

Ruby remove blank lines from a file which include spaces

I am trying to remove blank lines from a file.
My current code is:
def remove_blank_lines_from_file(file)
  File.write(file, File.read(file).gsub(/^$\n/, ''))
end
The above code removes only the empty lines, but I also want to remove lines that contain nothing but spaces.
How could I do it?
Since you load the whole file into memory anyway, this might be easier to read:
File.write(file, File.readlines(file).reject { |s| s.strip.empty? }.join)
It simply rejects the lines that contain only whitespace.
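If you would rather keep the gsub approach from the question, a small regex tweak covers whitespace-only lines as well (a sketch, assuming "blank" means spaces and tabs only):
def remove_blank_lines_from_file(file)
  # [ \t]* also matches lines made up of nothing but spaces or tabs
  File.write(file, File.read(file).gsub(/^[ \t]*\n/, ''))
end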

Remove line break if line does not start with KEYWORD

I have a flat file with lines that look like
KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....
How do I go about removing the line break so that
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
turns into
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
This is in a HP-UNIX environment and I can move the file to another system (windows box with powershell and ruby installed).
I don't know what tools you are using, but you can use a regex to match every \n (or maybe \r) that isn't followed by KEYWORD, and replace it with a space.
Regex: \r(?!KEYWORD) (with the global modifier)
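In Ruby, the same idea would look something like this (a sketch; the file names and Unix \n line endings are assumptions):
text = File.read('data.txt')
# Replace any newline that is not followed by KEYWORD (and is not the
# file's final newline) with a single space.
File.write('data_fixed.txt', text.gsub(/\n(?!KEYWORD|\z)/, ' '))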
Ruby's Array has a nice method called slice_before that it inherits from Enumerable, which comes to the rescue here:
require 'pp'
text = 'KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....'
pp text.split("\n").slice_before(/^KEYWORD/).map{ |a| a.join(' ') }
=> ["KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING",
"KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING",
"KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE",
"KEYWORD|....."]
This code just splits your text on line breaks, then uses slice_before to break the resulting array into sub-arrays, one for each block of text starting with /^KEYWORD/. Then it walks through the resulting sub-arrays, joining them with a single space. Any line that wasn't pre-split will be left alone. Ones that were broken are rejoined.
For real use you'd probably want to replace pp with a regular puts.
As for moving the code to Windows with Ruby, why? Install Ruby on HP-Unix and run it there. It's a more natural fit.
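Applied to the file itself rather than an inline string, the same approach might look like this (a sketch; the input path is an assumption, and the chomp: option to readlines needs Ruby 2.4+):
puts File.readlines('data.txt', chomp: true)
        .slice_before(/^KEYWORD/)
        .map { |lines| lines.join(' ') }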
This short awk one-liner should do the job:
awk 'NR>1 && /^KEYWORD/{print ""}{printf "%s",$0}END{print ""}' file
(The NR>1 guard avoids a leading blank line, printf "%s" keeps % characters in the data from being treated as format specifiers, and the END block restores the final newline.)
This might work for you (GNU sed):
sed ':a;$!{N;/\n.*|/!{s/\n/ /;ba}};P;D' file
Keep two lines in the pattern space and, if the second line doesn't contain a |, replace the newline with a space; repeat until it does or the end of the file is reached.
This assumes the last field is the field that overflows, otherwise use the KEYWORD such:
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D' file
PowerShell way:
[System.IO.File]::ReadAllText( "c:\myfile.txt" ) -replace "`r`n(?!KEYWORD)", ' '
You can use sed or awk (preferred) for this:
sed -n 's|\r||g;$!{1{x;d};H};${H;x;s|\n\(KEYWORD\)|\r\1|g;s|\n||g;s|\r|\n|g;p}' file.txt
awk 'BEGIN{ORS="";}NR==1{print;next;}/^KEYWORD/{print"\n";print;next;}{print;}' file.txt
Note: write each command (sed, awk) on a single line.

CSV.read Illegal quoting in line x

I am using ruby CSV.read with massive data. From time to time the library encounters poorly formatted lines, for instance:
"Illegal quoting in line 53657."
It would be easier to ignore the line and skip it than to go through each CSV and fix the formatting. How can I do this?
I had this problem in a line like 123,456,a"b"c
The problem is the CSV parser is expecting ", if they appear, to entirely surround the comma-delimited text.
The solution: use a quote character besides " that I was sure would not appear in my data:
CSV.read(filename, :quote_char => "|")
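For example, with the sample line above (a sketch):
require 'csv'

begin
  CSV.parse_line('123,456,a"b"c')       # bare quotes inside an unquoted field
rescue CSV::MalformedCSVError => e
  puts e.message                        # "Illegal quoting in line 1."
end
CSV.parse_line('123,456,a"b"c', quote_char: '|')
# => ["123", "456", "a\"b\"c"]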
The liberal_parsing option is available starting in Ruby 2.4 for cases like this. From the documentation:
When set to a true value, CSV will attempt to parse input not conformant with RFC 4180, such as double quotes in unquoted fields.
To enable it, pass it as an option to the CSV read/parse/new methods:
CSV.read(filename, liberal_parsing: true)
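Against the same sample line as above (assuming Ruby 2.4+):
require 'csv'
CSV.parse_line('123,456,a"b"c', liberal_parsing: true)
# => ["123", "456", "a\"b\"c"]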
Don't let CSV both read and parse the file.
Just read the file yourself and hand each line to CSV.parse_line, and then rescue any exceptions it throws.
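A minimal sketch of that approach (filename is a placeholder, and it assumes records never contain embedded newlines):
require 'csv'

rows = []
skipped = []
File.foreach(filename) do |line|
  begin
    rows << CSV.parse_line(line)
  rescue CSV::MalformedCSVError
    skipped << line   # keep the offending line around for inspection
  end
end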
Try disabling quoting by forcing a quote character that can never occur in the data, such as the NUL character "\x00":
require 'csv'
CSV.foreach(file, headers: :first_row, quote_char: "\x00") do |line|
  p line
end
Apparently this error can also be caused by unprintable BOM characters. This thread suggests using a file mode to force a conversion, which is what finally worked for me.
require 'csv'
CSV.open(@filename, 'r:bom|utf-8') do |csv|
  # do something
end

How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
This file in particular has 642,868 lines:
$ wc -l ../nginx.log
642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside of the f.each_line from above is simply matching a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.
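For what it's worth, named captures would let the MatchData build that hash directly, skipping convert_to_hash (a sketch; the capture names are illustrative, and named_captures needs Ruby 2.4+):
@regex = /(?<ip>\d{1,3}(?:\.\d{1,3}){3}) - (?<user>.*?) \[(?<time>[^\]]+)\] "(?<method>GET|POST|PUT|DELETE) (?<path>\S+) HTTP\/1\.1" (?<status>\d+) (?<bytes>\d+) "-" "(?<agent>.*)"/

if (match = line.match(@regex))
  @results << match.named_captures   # => {"ip" => "...", "method" => "...", ...}
else
  @skipped << line
end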
I just did a test on a 600,000 line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but the line parsing. Can you paste your parse code also?
This blog post includes several approaches to parsing large log files. Maybe that's an inspiration. Also have a look at the file-tail gem.
If you are using bash (or similar) you might be able to optimize like this:
In input.rb, put only the per-line code; the loop itself comes from the -n flag:
# $_ holds the current line here
# do stuff with $_
then in bash:
cat nginx.log | ruby -n input.rb
The -n flag tells Ruby to assume a 'while gets(); ... end' loop around your script, which might cause it to do something special to optimize.
You might also want to look into a prewritten solution to the problem, as that will be faster.

Ruby and Netbeans problem

I'm reading a file line by line in a simple program, and when I print the lines to the screen the last line can't be seen in the output window of the NetBeans 6.5.1 IDE on Windows XP; but when I run the program directly from the command line as "ruby main.rb" there is no problem (i.e. the last line can be seen). I'm using Ruby 1.8.6. Here is the entire code:
File.open("songs.txt","r") do |file|
file.each do |line|
print line
end
end
This will work better if you use puts, which will append a newline terminator if there is not already one at the end of the line, forcing a buffer flush.
I've never run across this before myself, but my guess would be that your final line doesn't have a trailing line break, so the Netbeans console isn't flushing the line. Try adding $stdout.flush at the end of the program and see what happens.
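A sketch of where that flush would go:
File.open("songs.txt", "r") do |file|
  file.each do |line|
    print line
  end
end
$stdout.flush   # push the final, unterminated line out to the console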
By the way, you can simplify this code slightly by rewriting it using foreach:
File.foreach("songs.txt") do |line|
  print line
end
