How to compare data in two CSV files - ruby

I have two CSV files which have the same structure and ideally should have the same data.
I want to compare the data in them using Ruby and wanted to know if we already have a Ruby function for the same.

If you want to check whether the files are identical, you can simply use identical?, which is an alias for compare_file:
require 'fileutils'
FileUtils.identical?('file1.csv', 'file2.csv')
If you want to see the differences you might want to use diffy:
gem install diffy
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files')
It produces diff-like output which can be nicely formatted as HTML:
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files').to_s(:html_simple)

As Summea commented, look at the CSV class.
Then use:
require 'csv'

# Each file is read as an array of rows, where each row is an array of fields
# (so an array of arrays).
file1_lines = CSV.read("file1.csv")
file2_lines = CSV.read("file2.csv")

for i in 0...file1_lines.size
  if file1_lines[i] == file2_lines[i]
    puts "Same #{file1_lines[i]}"
  else
    puts "#{file1_lines[i]} != #{file2_lines[i]}"
  end
end
Note that using for in Ruby is quite rare; you would normally iterate with each on a collection, but here there are two collections to walk in parallel.
Also, keep in mind that one of the files may be longer than the other (a sketch handling that case follows below), but this should get you started.
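As a rough sketch of how you might handle files of different lengths (reusing the same placeholder file names as above):
require 'csv'

file1_lines = CSV.read("file1.csv")
file2_lines = CSV.read("file2.csv")

# Iterate up to the longer of the two files; missing rows come back as nil
# and therefore show up as differences.
[file1_lines.size, file2_lines.size].max.times do |i|
  if file1_lines[i] == file2_lines[i]
    puts "Line #{i + 1}: same"
  else
    puts "Line #{i + 1}: #{file1_lines[i].inspect} != #{file2_lines[i].inspect}"
  end
end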

Related

Having a CSV file and letting a user edit

In ruby, if I have a CSV like this:
make,model,color,doors,email
dodge,charger,black,4,practice1#whatever.com
ford,focus,blue,5,practice2#whatever.com
nissan,350z,black,2,practice3#whatever.com
mazda,miata,white,2,practice4#whatever.com
honda,civid,brown,4,practice5#whatever.com
corvette,stingray,red,2,practice6#whatever.com
ford,fiesta,blue,5,practice7#whatever.com
bmw,m4,black,2,practice8#whatever.com
audi,a5,blue,2,practice9#whatever.com
subaru,brz,black,2,practice10#whatever.com
lexus,rc,black,2,practice11#whatever.com
I want to allow a user to enter an email and then edit any one of the fields listed. For example, if a user enters the email "practice11#whatever.com", it will output "lexus,rc,black,2,practice11#whatever.com". From there the program should prompt the user to choose which field to edit ("make,model,color,doors,email") and then let them change whatever is there. Say they choose "color": they can then change the color of the "practice11#whatever.com" line from "black" to "blue". I believe this can be done using a hash and key-value pairs, but I am not sure how to make the editing part work.
this is my current code:
require "csv"
csv = CSV.read('cars.csv', headers: true)
demo = gets.chomp
print csv.find {|row| row['email'] == demo}
All it does at the moment is take in the CSV file, allow the user to enter an email, and output that specific line.
So - your question is a bit vague and involves a number of implied questions, such as "how do I write code that can ask for different options and act accordingly" - so it might help if you clarify exactly what you are trying to ask.
From the looks of it, you seem most interested in understanding how to modify the CSV table and how to get info about the CSV fields/table/data, etc.
And for this, you have two friends: the Ruby 'p' method and the docs.
The 'p' method allows you to inspect objects. "p someObject" is the same as calling 'puts someObject.inspect' - and it's very handy, as is "puts someObject.class" to find out what type of object you're dealing with.
In this case, you can change the last line of your code a bit to get some info:
puts csv.class
got = csv.find {|row| row['email'] == demo}
p got
And suddenly we learn we are dealing with a CSV::Table.
This is not surprising; let's head over to the docs. I don't know what version of Ruby you're using, but 2.6.1 is current enough to have the info we need and is plenty old at this point, so you probably have access to it:
https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html
This tells us that if we do the CSV.read using headers:
"If headers specified, reading methods return an instance of CSV::Table, consisting of CSV::Row."
So now we know we have a CSV::Table, which is much like an array/list but with some convenience methods (such as the 'find' that you are using).
And a CSV::Row is basically a hash that maintains its order and is, as expected, keyed according to the headers.
So we can do:
p got.fields
p got['model']
got['model'] = 'edsel'
p got['model']
p got.fields
And not surprisingly, the CSV::Table has a 'to_s' method that lets us print out the CSV:
puts csv.to_s
You can probably take it from here.
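If it helps, here is a rough sketch of how the whole flow might fit together (the prompts are placeholders, and writing back to 'cars.csv' overwrites the original file):
require 'csv'

csv = CSV.read('cars.csv', headers: true)

print 'Email: '
email = gets.chomp
row = csv.find { |r| r['email'] == email }

if row
  print "Field to edit (#{csv.headers.join(',')}): "
  field = gets.chomp
  print 'New value: '
  row[field] = gets.chomp

  # CSV::Table#to_s serializes the whole table, headers included,
  # so we can simply write it back out.
  File.write('cars.csv', csv.to_s)
else
  puts 'No row found for that email.'
end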

How to check for multiple words inside a folder

I have words in a text file called words.txt, and I need to check whether any of those words appear in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
puts filepath
puts File.readlines(filepath).any?{ |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory, your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union; however, it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
(?-mix: will cause you problems so don't do that. Instead use:
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use then it becomes a simpler matter to search. Ruby has the Find class, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/

Dir['Source/**/*'].select { |f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read, and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above; a more scalable sketch using Find and foreach follows below.
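As a rough illustration of the scalable approach described above, combining Find with line-by-line foreach (the 'Source' directory and 'words.txt' names are carried over from the question):
require 'find'

word_array = File.readlines('words.txt').map(&:chomp).reject(&:empty?)
word_re = /\b(?:#{Regexp.union(word_array).source})\b/

Find.find('Source') do |path|
  next unless File.file?(path)

  # Read line by line so huge files never have to fit in memory.
  hit = File.foreach(path).any? { |line| line =~ word_re }
  puts "#{path}: #{hit}"
end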
This recursively searches the directory for any of the words contained in words.txt:
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/

Dir['Source/**/*.{cpp,txt,html}'].select { |f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).grep(re).any?
end

Find out if CSV file contains empty field in Ruby?

Using Ruby 1.9.3, I want to read in a CSV file with headers and scan every single field to see if it is left empty and does not contain a value, like foo,,bar,foofoo,barbar (the second field).
My approach is as follows:
require 'csv'
#read csv file line by line
CSV.foreach(filename,headers:true) do |row|
#loop through each element within the current row
for i in (0..row.length-1)
#check for empty fields
if !row[i]
puts "empty field"
end
end
end
Well, this works, but when processing a file with ~18 million fields this is quite slow, and I have many such files. Is there any faster and more elegant way to do this?
Using grep
Edit: Having my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file with empty fields:
File.new(filename).grep(/(^,|,(,|$))/)
It's about 10 times faster. If you need access to the fields, you can use CSV.parse:
require 'csv'

File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end
Using a native CSV parser
Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.
You might wonder if there is a faster CSV library for Ruby. There is indeed a gem called FasterCSV, but Ruby 1.9 adopted it as its built-in CSV library, so it probably won't get much faster using Ruby alone.
There is a Ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:
require 'excelsior'

Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
I tested this code with a file like yours (72 MB, ~30k rows of ~2.5k fields each) and it is about twice as fast; however, it segfaults after a few lines, so the gem might not be stable.
Using CSV
As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:
require 'csv'

CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
This won't improve the speed though.
Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field
I'm afraid that you still would go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.
You can check all columns at once using Enumerable#any?
CSV.foreach(filename, headers: true) do |row|
  puts "empty field" if row.fields.any?(&:nil?)
end
I think the grep solution will still be faster. Shelling out to the Linux grep command would be the fastest, along these lines:
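A minimal sketch of that shell-out, assuming a Unix-like system with grep available (the file name is a placeholder):
filename = 'big.csv'

# -E enables extended regexes; the pattern matches a leading comma,
# a double comma, or a trailing comma, i.e. an empty field.
lines_with_empty_fields = `grep -E '(^,|,,|,$)' #{filename}`.lines

puts "found #{lines_with_empty_fields.size} rows with empty fields"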

Extract JSON values from remote api with Ruby

I'm trying to grab some data from last.fm and use it in a simple Sinatra app. I've worked out how to open the document, but I'm having issues extracting the data in Ruby. Here is the start of the API data; I'd like to grab the name:
{"similarartists":{"artist":[{"name":"Sonny & Cher"}]}
This is just an extract of the response. I'm using this in my .rb file:
require 'json'
require 'open-uri'
data = JSON.parse(open("http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=editors&api_key=xxx&format=json").read)
puts data["similarartists"]["artist"]["name"]
It doesn't seem to be working; I get "can't convert String into Integer (TypeError)" on Ruby 1.9.3, but the name in the JSON isn't an integer. If I just put the following:
puts data["similarartists"]["artist"]
It returns the whole thing, but I want to grab inside of that and get the name.
"name"=>"Interpol"
I don't understand why it would complain about integers when the name is a string? Hope someone can help me!
Based on the comments thread, the issue is a misunderstanding of the structure of the data returned from the API call.
The exact issue was that the structure has an array of artists under the artist key, so to get at the name you need to do:
data['similarartists']['artist'][0]['name']
Note though that you should only do that if you are sure there will be only one artist. The nature of the return data suggests that won't always be the case, so depending on your use you might be better off pulling all the names, doing something like:
data['similarartists']['artist'].map {|a| a['name']}.join(',')
That will join all of the artist names together comma separated.
In the future, you can track this kind of issue down by looking at the full structure of the returned data and making sure you see the correct structure. The docs on the API may be of some help here too.
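For instance, a quick way to eyeball the structure is to pretty-print the parsed hash (this is just an inspection aid using the data variable from the code above, not part of the fix):
require 'pp'

# Pretty-print the parsed response so the nesting (hash -> hash -> array -> hash)
# is obvious at a glance.
pp data

# Or, for indented JSON output:
puts JSON.pretty_generate(data)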
You also might check if someone has made a gem for accessing the API. Often a gem will up-level some of this raw output and give you a nice object to work with. I suggest searching GitHub for a last.fm gem.
The problem is that you are trying to access an Array with the index "name"; Ruby tries to convert this to an Integer and fails, which results in the error message you are seeing.
If you check data["similarartists"]["artist"].class you will see that it returns Array. So basically, the JSON.parse() call created an Array of Hashes as the value of data["similarartists"]["artist"]. To access all of the artist names you can simply iterate through this array:
require 'json'
require 'open-uri'

data = JSON.parse(open("http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=editors&api_key=29da5a0e01ca2d1524cac596d5462d67&format=json").read)

# iterate through the Array of returned artists and print their names
data["similarartists"]["artist"].each do |artist|
  puts artist["name"]
end
# output
# Interpol
# White Lies
# The Cinematics
# Smith & Burrows
# The National
# Julian Plenti
# She Wants Revenge
# etc ...
If you only want the first entry for Interpol you can just use index [0]:
puts data["similarartists"]["artist"][0]["name"]

Ruby parse comma separated text file

I need some help with a Ruby script I can call from the console. The script needs to parse a simple .txt file with comma separated values.
value 1, value2, value3, etc...
The values needs to be added to the database.
Any suggestions?
array = File.read("csv_file.txt").split(",").map(&:strip)
This gives you the values in an array, which you can then store in the database. If you need more functionality, you can make use of the FasterCSV gem.
Ruby 1.9.2 has a very good CSV library which is useful for this stuff: http://www.ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
On earlier versions of Ruby you could use http://fastercsv.rubyforge.org/ (which essentially became CSV in 1.9.2)
You could do it manually by reading the file into a string and using .split(',') but I'd go with one of the libraries above.
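As a rough sketch of the library route (the file name comes from the answer above; save_to_database is a hypothetical placeholder for whatever persistence layer you actually use):
require 'csv'

# A hypothetical stand-in for your real persistence code
# (ActiveRecord, Sequel, a raw SQL insert, etc.).
def save_to_database(value)
  puts "saving #{value.inspect}"
end

CSV.foreach('csv_file.txt') do |row|
  row.each { |value| save_to_database(value.to_s.strip) }
end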
Quick and dirty solution:
result = []

File.open("<path-to-file>", "r") do |handle|
  handle.each_line do |line|
    result << line.split(",").map(&:strip)
  end
end # the file is closed automatically when the block ends

result.flatten!
result # => big array of values
Now you can iterate the result array and save the values to the database.
This simple file iteration doesn't take care of ordering or special fields, because they weren't mentioned in the question.
Something easy to get you started:
IO.readlines("csv_file.txt", '').each do |line|
  values = line.split(",").collect(&:strip)
  # do something with the values?
end
Hope this helps.
