The code below comes from the documentation for the Ruby gem rroc. I desperately need to calculate the AUC for my AI project, but I have virtually no knowledge of Ruby file I/O, not having had occasion to learn it. The documentation says rroc expects an n by 2 array, but the first line of code below suggests that the data is in a csv file and will be formatted into my_data for ROC to calculate the AUC.
I have tried every conceivable combination of csv data and arrays, both as files for the first line to read and as direct input to the line calculating the AUC. At best the code runs without error but gives a useless output of 0. My hope is that with a fuller understanding of what that line does, I could either fix the problem or give up on the gem, since a previous version of this gem was shown to be obsolete and this one is 8 years old. I took the data from the article referenced by the gem author and am pretty sure it's not the problem, but then,...
So, to refine the question: from that statement, can we tell what kind of data should be in 'some_data.csv'? And what will be done to it to make my_data?
require 'rroc'
my_data = open('some_data.csv').readlines.collect { |l| l.strip.split(",").map(&:to_f) }
auc = ROC.auc(my_data)
puts auc
Below I've copied the output for two runs, the first with array data read in, the second with csv values (each in separate files). I added a line to read out the input file just to be sure.
RoyiMac:ruby $ ruby PDaucT.rb
[[90, 1], [80, 1], [70,-1], [60,1], [55,1], [54,1], [53,-1], [52,-1], [51,1], [50,-1], [40,1], [39,-1], [38,1], [37,-1], [36,-1], [35,-1], [34,1], [33,-1], [30,1], [10,-1]]
0.0
RoyiMac:ruby $ ruby PDaucT.rb
90,1,80,1,70,-1,60,1,55,1,54,1,53,-1,52,-1,51,1,50,-1,40,1,39,-1,38,1,37,-1,36,-1,35,-1,34,1,33,-1,30,1,10,-1
0.0
The explanation of the code:
open('some_data.csv')   # open the some_data.csv file
  .readlines            # returns an array with each element being a line
  .collect { |l|        # for each line do the following transformation
    l.strip             # remove leading and trailing whitespace characters
     .split(',')        # split the line on the "," character (returning an array)
     .map(&:to_f)       # call .to_f on each element in the array, converting it to a float value
  }
map/collect are aliases of each other.
However, like tadman already said in the comments, you're better off using the csv standard library. The same can be achieved with:
require 'csv'
my_data = CSV.read('some_data.csv', converters: :float)
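# should output (with one "score,label" pair per line in the file; the :float converter turns every numeric field into a Float)
#=> [[90.0, 1.0], [80.0, 1.0], [70.0, -1.0], [60.0, 1.0], [55.0, 1.0], [54.0, 1.0], [53.0, -1.0], [52.0, -1.0], [51.0, 1.0], [50.0, -1.0], [40.0, 1.0], [39.0, -1.0], [38.0, 1.0], [37.0, -1.0], [36.0, -1.0], [35.0, -1.0], [34.0, 1.0], [33.0, -1.0], [30.0, 1.0], [10.0, -1.0]]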
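To see why both runs in the question produce an unusable input, it can help to apply the same strip/split/to_f chain to in-memory text. The sketch below is only an illustration: parse_rows is a hypothetical helper name, and the three-pair sample is a shortened version of the question's data. It assumes, as the question reports from the rroc documentation, that an n by 2 array is what is wanted.

```ruby
# Same transformation as the line in the question, applied to a string
# instead of a file so the result is easy to inspect.
# parse_rows is a hypothetical helper, not part of rroc.
def parse_rows(text)
  text.lines.map { |l| l.strip.split(",").map(&:to_f) }
end

# One "score,label" pair per line gives an n-by-2 array.
p parse_rows("90,1\n80,1\n70,-1\n")
# [[90.0, 1.0], [80.0, 1.0], [70.0, -1.0]]

# Everything on a single line gives one row holding every value.
p parse_rows("90,1,80,1,70,-1\n")
# [[90.0, 1.0, 80.0, 1.0, 70.0, -1.0]]

# A pasted Ruby array literal is not CSV: fields such as "[[90" or " [80"
# do not start with a number, so to_f returns 0.0 for them.
p parse_rows("[[90, 1], [80, 1], [70,-1]]\n")
# [[0.0, 1.0, 0.0, 1.0, 0.0, -1.0]]
```

Only the first layout yields one row per observation, so a file with one pair per line is the shape this code expects.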
This is my code:
File.open(file_name) do |file|
file.each_line do |line|;
if line =~ (/SAPK/) || (line =~ /^\t/ and tabs == true) || (line =~ /^ / and spaces == true)
file = File.open("./1.log", "a"); puts "found a line #{line}"; file.write("#{line}".lstrip!)
end
end
end
File.open("./2.log", "a") { |file| file.puts File.readlines("./1.log").uniq }
I want to write every line that matches a specific string, or that starts with a tab or a space, to the file 1.log. The matching lines may have a space or tab at the beginning, so I strip it.
I then want to get the unique lines in 1.log and write them to 2.log.
It would be great if someone could go over the code and tell me if something is not correct.
When using files in Ruby, what is the difference between the w+ and a modes?
I know:
w+ - Create an empty file for both reading and writing.
a - Append to a file. The file is created if it does not exist.
But both options seem to append to the file. I thought w+ should behave like > instead of >>, so I guess w+ also behaves like >>?
Thanks !!
There's a lot of confusion in this code and it isn't helped by your habit of jamming things on a single line for no reason. Do try and keep your code clean, as the functionality should be obvious. There's also a lot of quirky anti-patterns like stringifying strings and testing booleans vs booleans which you should really avoid doing.
One thing you'll want to do is employ Tempfile for those situations where you need an intermediate file.
Here's a reworked version that's cleaned up:
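require 'tempfile'

Tempfile.open("matches") do |temp|
  File.open(file_name) do |input|
    input.each_line do |line|
      if line.match(/SAPK/) || (line.match(/^\t/) && tabs) || (line.match(/^ /) && spaces)
        puts "found a line #{line}"
        # lstrip, not lstrip!: the bang version returns nil when nothing was stripped
        temp.write(line.lstrip)
      end
    end
  end

  File.open("./2.log", "w+") do |file|
    # Rewind the temporary file to read data back
    temp.rewind
    # puts writes each array element as its own line; write would dump the array's inspect string
    file.puts(temp.readlines.uniq)
  end
end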
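Now, a and w+ are not interchangeable. w+ truncates an existing file to zero length (or creates it) and opens it for both reading and writing, so it behaves like > in the shell. a opens the file for writing only and puts every write at the end, like >>. The code you posted opens both files with "a", so it appends on every run. Pick the mode that matches what you mean and use it consistently or your code will be confusing.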
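A quick way to check how the two modes behave is to write to a scratch file with each and read the result back. This is a minimal sketch: mode_demo and the demo.log file name are made up for illustration, and the file lives in a temporary directory that is removed afterwards.

```ruby
require 'tmpdir'

# Returns the file contents after an "a" open and after a "w+" open.
# mode_demo is a hypothetical helper for illustration only.
def mode_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, "demo.log")              # scratch file

    File.open(path, "w")  { |f| f.puts "first" }
    File.open(path, "a")  { |f| f.puts "second" }  # added after "first"
    after_append = File.read(path)

    File.open(path, "w+") { |f| f.puts "third" }   # old contents discarded
    after_truncate = File.read(path)

    [after_append, after_truncate]
  end
end

after_append, after_truncate = mode_demo
p after_append     # "first\nsecond\n"
p after_truncate   # "third\n"
```

Opening with "a" keeps what was already in the file, while "w+" starts from an empty file every time.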
My criticism over things like x == true is because something that narrowly specific usually means that x could take on a multitude of values and true is one particular case we're trying to handle, something that implies that we should be aware it might be false and many other things. It's a red herring and will only invite questions.
The new Mac OS update moved the system Ruby up to 2.0, which is great, but now I'm seeing errors in a lot of my scripts that I don't know how to fix. Specifically, I had code that called for files using mdfind and then read them, like this:
files = %x{mdfind -onlyin /Users/Username/Dropbox/Tasks 'kMDItemContentModificationDate >= "$time.today(-1)"'}
files.each do |file|
Now I'm getting an error that says
undefined method `each' for #<String:0x007f83521865c8> (NoMethodError)
It seems as if each now needs a qualifier. I tried each_line but that yielded additional errors down the line. Is there a simple replacement for this that I'm overlooking?
Ruby 1.8 used to have String#each which was doing implicit splitting.
each(separator=$/) {|substr| block } => str
Splits str using the supplied parameter as the record separator ($/ by default), passing each substring in turn to the supplied block. If a zero-length record separator is supplied, the string is split into paragraphs delimited by multiple successive newlines.
Explicit splitting should work in modern rubies, I believe.
files.split($/).each do |file|
Here $/ is the input record separator, which is a newline by default. You can also pass the character explicitly, since your script is not portable anyway.
files.split("\n").each do |file|
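Update
Or you can just use each_line, which replaced the now-removed String#each. Unlike split, each_line keeps the trailing newline on every substring, so chomp each one before using it as a file name.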
files.each_line do |file|
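The trailing newline is easy to overlook, so here is a small sketch of the difference. SAMPLE_OUTPUT stands in for the newline-separated output of a command such as mdfind; the paths in it and the paths_from helper are made up for illustration.

```ruby
# Stand-in for command output; the paths are invented for this example.
SAMPLE_OUTPUT = "/tmp/a.txt\n/tmp/b.txt\n"

# Hypothetical helper: one clean path per line of output.
def paths_from(output)
  output.each_line.map(&:chomp)
end

p SAMPLE_OUTPUT.each_line.to_a  # ["/tmp/a.txt\n", "/tmp/b.txt\n"]
p paths_from(SAMPLE_OUTPUT)     # ["/tmp/a.txt", "/tmp/b.txt"]
```

Passing an unchomped string to something like File.read would look for a file whose name ends in a newline, which could explain errors further down a script.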
I'm currently opening a file taken at runtime via ARGV:
File.open(ARGV[0]) do |f|
f.each_line do |line|
Once a match is found I print output to the user.
if line.match(/(strcpy)/i)
puts "[!] strcpy does not check for buffer overflows when copying to destination."
puts "[!] Consider using strncpy or strlcpy (warning, strncpy is easily misused)."
puts " #{line}"
end
I want to know how to print out the line number for the matching line in the (ARGV[0]) file.
Using print __LINE__ shows the line number from the Ruby script. I've tried many different variations of print __LINE__ with different string interpolations of #{line} with no success. Is there a way I can print out the line number from the file?
When Ruby's IO class opens a file, it sets the $. global variable to 0. For each line that is read that variable is incremented. So, to know what line has been read simply use $..
Look in the English module for $. or $INPUT_LINE_NUMBER.
We can also use the lineno method that is part of the IO class. I find that a bit more convoluted because we need an IO stream object to call it on, while $. works without one.
I'd write the loop more simply:
File.foreach(ARGV[0]) do |line|
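Put together, the loop with line numbers might look like the sketch below. The matching_lines helper and the three-line sample file are assumptions made for the example; in the real script the path would come from ARGV[0].

```ruby
require 'tempfile'

# Collects [line_number, text] pairs for every line matching pattern.
# matching_lines is a hypothetical helper name.
def matching_lines(path, pattern)
  hits = []
  File.foreach(path) do |line|
    # $. holds the number of the line that was just read
    hits << [$., line.chomp] if line =~ pattern
  end
  hits
end

# Build a small sample file so the example is self-contained.
Tempfile.create("sample") do |f|
  f.write("int x;\nstrcpy(dst, src);\nreturn 0;\n")
  f.flush
  matching_lines(f.path, /strcpy/i).each do |number, text|
    puts "#{number}: #{text}"   # prints "2: strcpy(dst, src);"
  end
end
```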
Something to think about is, if you're on a *nix system, you can use the OS's built-in grep or fgrep tool to greatly speed up your processing. The "grep" family of applications are highly optimized for doing what you want, and can find all occurrences, only the first, can use regular expressions or fixed strings, and can easily be called using Ruby's %x or backtick operators.
puts `grep -inm1 abacus /usr/share/dict/words`
Which outputs:
34:abacus
-inm1 means "ignore character-case", "output line numbers", "stop after the first occurrence"
I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
f.each_line do |line|
# do stuff here to line
end
end
This file in particular has 642,868 lines:
$ wc -l nginx.log /code/src/myimport
642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside of the f.each_line from above is simply matching a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.
I just did a test on a 600,000 line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but the line parsing. Can you paste your parse code also?
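Wherever the time goes, one way to keep memory flat is to hand each parsed entry to a block as it is read, rather than collecting every line in arrays. The sketch below assumes a simplified access-log pattern with named groups; LOG_LINE and each_entry are made-up names, and the pattern is not the regex from the question.

```ruby
# Simplified, hypothetical pattern for an access-log line (an assumption,
# not the asker's regex). Named groups become the hash keys.
LOG_LINE = /\A(?<ip>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>[A-Z]+) (?<path>\S+) [^"]*" (?<status>\d{3}) (?<bytes>\d+)/

# Yields one hash per matching line; non-matching lines are skipped.
# Nothing is accumulated, so memory use does not grow with file size.
def each_entry(path)
  return enum_for(:each_entry, path) unless block_given?

  File.foreach(path) do |line|
    match = LOG_LINE.match(line)
    yield match.named_captures if match
  end
end
```

A caller could then write each_entry("nginx.log") { |entry| ... } and store or count only what it needs from each entry.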
This blogpost includes several approaches to parsing large log files. Maybe that's an inspiration. Also have a look at the file-tail gem.
If you are using bash (or similar) you might be able to optimize like this:
In input.rb:
while x = gets
# Parse
end
then in bash:
cat nginx.log | ruby -n input.rb
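The -n flag tells ruby to assume a 'while gets(); ... end' loop around your script, with the current line available in $_. Use one loop or the other: with -n the script body should contain only the per-line parsing, and if you keep the explicit while loop you should drop -n, otherwise the two loops compete for the same input lines. I would not count on -n to make the loop faster; it mainly saves typing.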
You might also want to look into a prewritten solution to the problem, as that will be faster.