What is the best way to parse a large CSV file in Ruby? My CSV file is almost 1 GB, and I want to filter the data in it according to some conditions.
You don't specifically say, but I think most people commenting feel this is likely to be a homework question. If so, you should read "How do I ask and answer homework questions?". If not, read "How do I ask a good question?".
As G4143 stated in the comments, Ruby has an excellent CSV class which should fit your needs.
Here are a couple of quick examples using foreach, which the documentation describes as the primary method for reading CSV files. It reads one line at a time from the file, so it should work well with large files. Below is a basic example of how you might filter out a subset of CSV records with it, but I would encourage you to read the CSV class documentation and follow up with more specific questions, showing what you have tried so far, if you have trouble.
The basic idea is to start with an empty array, use foreach to get each row, and, if that row meets your filtering criteria, add it to the initially empty filtered-results array.
test.csv:
a, b, c
1,2,3
4,5,6
require 'csv'
filtered = []
CSV.foreach("test.csv") do |row|
filtered << row if row[0] == "1"
end
filtered
=> [["1", "2", "3"]]
In the case where the first line of the file is a "header" you can pass in an option to treat it as such:
require 'csv'
filtered = []
CSV.foreach("test.csv", :headers => true) do |row|
filtered << row if row["a"] == "1"
end
filtered
=> [#<CSV::Row "a":"1" " b":"2" " c":"3">]
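If the goal is to write the rows that pass the filter into a new CSV rather than keep them all in an array, you can stream them straight into an output file as you go. This is only a rough sketch; filtered.csv and the row["a"] == "1" condition are placeholders for whatever you actually need:

require 'csv'

# Stream matching rows from the large input straight into a new file,
# so nothing but the current row is held in memory.
CSV.open("filtered.csv", "w") do |out|
  CSV.foreach("test.csv", headers: true) do |row|
    out << row.fields if row["a"] == "1"   # replace with your real condition
  end
end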
Related
I'm opening a CSV using Ruby:
CSV.foreach(file_name, "r+") do |row|
next if row[0] == 'id'
update_row! row
end
and I don't really care about the headers row.
I don't like next if row[0] == 'id' inside the loop. Is there any way to tell CSV to skip the headers row and just iterate through the rows with data?
I assume the provided CSVs always have a header row.
There are a few ways you could handle this. The simplest method would be to pass the {headers: true} option to your loop:
CSV.foreach(file_name, headers: true) do |row|
update_row! row
end
Notice how there is no mode specified. This is because, according to the documentation, CSV::foreach takes only the file and an options hash as its arguments (as opposed to, say, CSV::open, which does allow you to specify a mode).
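For comparison, if you did want to control the mode, a roughly equivalent CSV::open version might look like this (only a sketch, reusing the file_name and update_row! from the question):

require 'csv'

# CSV.open accepts an IO mode as well as the usual options.
CSV.open(file_name, "r", headers: true) do |csv|
  csv.each do |row|
    update_row! row
  end
end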
Alternatively, you could read the data into an array (rather than using foreach), and shift the array before iterating over it:
my_csv= CSV.read(filename)
my_csv.shift
my_csv.each do |row|
update_row! row
end
According to the Ruby docs:
options = {:headers=>true}
CSV.foreach(file_name, options) ...
should suffice.
A simple thing to do that works when reading files line-by-line is:
CSV.foreach(file_name, "r+") do |row|
next if $. == 1
update_row! row
end
$. is a global variable in Ruby that contains the line number of the file being read.
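If you would rather avoid the $. global entirely, one alternative (a sketch, assuming the same file_name and update_row! helper from the question) is to take the enumerator returned by foreach, pair each row with an index, and skip the first one:

require 'csv'

CSV.foreach(file_name).with_index do |row, index|
  next if index.zero?   # skip the header row
  update_row! row
end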
What exactly does CSV do? (I know it's a gem.)
What is SLICE_SIZE? (Line 13)
Why is csv inside straight braces? (Line 16) CSV do |csv|
Can you explain lines 17, 18, and 19 as a whole? I am completely lost on those.
Code:
require 'twitter'
require 'csv'
def twitter_client
  @twitter_client ||= Twitter::REST::Client.new do |config|
    config.consumer_key = ""
    config.consumer_secret = ""
    config.access_token = ""
    config.access_token_secret = ""
  end
end
SLICE_SIZE = 100
def fetch_all_friends(twitter_username)
  CSV do |csv|
    twitter_client.follower_ids(twitter_username).each_slice(SLICE_SIZE).with_index do |slice, i|
      twitter_client.users(slice).each_with_index do |f, j|
        csv << [i * SLICE_SIZE + j + 1, f.name, f.screen_name]
      end
    end
  end
end
CSV is a class that implements handling of CSV data, as its documentation describes.
each_slice is a method of Enumerable which takes only so many elements from the source collection per iteration. This is done to reduce the memory requirements of the computation or, possibly, to delay fetching more data until the current chunk is processed. The SLICE_SIZE value is how many elements to take per chunk.
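As a quick illustration of what each_slice and with_index do on their own (plain Ruby, unrelated to the Twitter API):

# each_slice groups elements into chunks; with_index numbers the chunks.
(1..7).each_slice(3).with_index do |slice, i|
  puts "chunk #{i}: #{slice.inspect}"
end
# chunk 0: [1, 2, 3]
# chunk 1: [4, 5, 6]
# chunk 2: [7]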
CSV do |csv| initializes a csv object and passes it down to the block as a parameter. This is a way to organize code so that e.g. initialization is separate from business logic of the block. The block is delimited by do and end keywords.
The next two lines are actually a single statement:
twitter_client.follower_ids(twitter_username).each_slice(SLICE_SIZE).with_index do |slice, i|
It takes a collection of follower ids of twitter_username from the Twitter API, takes elements from there in chunks of SLICE_SIZE and passes each chunk and its index into another block. Contents of this block are executed as many times as there are chunks of SLICE_SIZE in the follower_ids.
The next line
twitter_client.users(slice).each_with_index do |f, j|
works only on the elements of the current chunk. It takes each element of the chunk and passes it, along with its index within the chunk, to yet another block.
Up to this point it's more of collection handling than actual business logic.
The most inner statement
csv << [i * SLICE_SIZE + j + 1, f.name, f.screen_name]
creates an array of three fields: an index of some sort, the Twitter-supplied name and screen_name. This array represents a CSV line. The << operator pushes that array into the csv object which was mentioned before. The csv object adds it to what it has collected by this time.
When this code finishes, the csv object will have written out all the data received from the Twitter API (by default to standard output), ready to be redirected or saved to disk.
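If you wanted the rows written to a file rather than to the default output stream, a small variation would be to open a file explicitly; this is only a sketch, with a placeholder file name and made-up values:

require 'csv'

CSV.open("followers.csv", "w") do |csv|
  csv << ["index", "name", "screen_name"]   # header row
  csv << [1, "Example Name", "example_handle"]
end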
There used to be a separate gem, FasterCSV, that you needed to install, but that is no longer the case: you just require the standard CSV library.
It is usually used to either read or create CSV files.
It is a bit hard to answer your question when there are no line numbers. You can learn more about Ruby's CSV library at: http://ruby-doc.org/stdlib-2.2.2/libdoc/csv/rdoc/CSV.html
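For reference, the two most common uses look roughly like this (a minimal sketch with placeholder file names):

require 'csv'

# Reading a CSV file into an array of rows
rows = CSV.read("input.csv", headers: true)

# Creating a CSV file
CSV.open("output.csv", "w") do |csv|
  csv << ["id", "name"]
  csv << [1, "example"]
end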
I want to write a Kiba ETL script which has a CSV source and a CSV destination, with a list of transformation rules, among which the second transformer is an aggregation performing an operation such as select name, sum(euro) group by name.
Kiba ETL Script file
source CsvSource, 'users.csv', col_sep: ';', headers: true, header_converters: :symbol
transform VerifyFieldsPresence, [:name, :euro]
transform AggregateFields, { sum: :euro, group_by: :name }
transform RenameField, from: :euro, to: :total_amount
destination CsvDestination, 'result.csv', [:name, :total_amount]
users.csv
date;euro;name
7/3/2015;10;Jack
7/3/2015;85;Jill
8/3/2015;6;Jack
8/3/2015;12;Jill
9/3/2015;99;Mack
result.csv (expected result)
total_amount;name
16;Jack
97;Jill
99;Mack
ETL transformers execute one after another on a single row at a time, but my second transformer's behavior depends on the entire collection of rows, which I can't access inside the class that is passed to the transform method.
transform AggregateFields, { sum: :euro, group_by: :name }
Is there any way this behavior can be achieved using the Kiba gem?
Thank you in advance.
EDIT: it's 2020, and Kiba ETL v3 includes a much better way to do this. Check out this article https://thibautbarrere.com/2020/03/05/new-in-kiba-etl-v3 for all the relevant information.
Kiba author here! You can achieve that in many different ways, depending mainly on the data size and your actual needs. Here are a couple of possibilities.
Aggregating using a variable in your Kiba script
require 'awesome_print'
transform do |r|
r[:amount] = BigDecimal.new(r[:amount])
r
end
total_amounts = Hash.new(0)
transform do |r|
total_amounts[r[:name]] += r[:amount]
r
end
post_process do
# pretty print here, but you could save to a CSV too
ap total_amounts
end
This is the simplest way, yet this is quite flexible.
It will keep your aggregates in memory though, so this may or may not be good enough, depending on your scenario. Note that Kiba is currently single-threaded (but "Kiba Pro" will be multi-threaded), so for now there is no need to add a lock or use a thread-safe structure for the aggregate.
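As hinted in the comment above, the post_process block could just as well write the totals to a CSV instead of pretty-printing them; a minimal sketch (reusing the result.csv name and ; separator from the question) could be:

require 'csv'

post_process do
  CSV.open('result.csv', 'w', col_sep: ';') do |csv|
    csv << ['total_amount', 'name']
    total_amounts.each do |name, total|
      csv << [total.to_s('F'), name]   # format the BigDecimal however you need
    end
  end
end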
Calling TextQL from post_process blocks
Another quick and easy way to aggregate is to generate a non-aggregated CSV file first, then leverage TextQL to actually do the aggregation, like this:
destination CsvDestination, 'non-aggregated-output.csv', [:name, :amount]
post_process do
query = <<SQL
select
name,
/* apparently sqlite has reduced precision, round to 2 for now */
round(sum(amount), 2) as total_amount
from tbl group by name
SQL
textql('non-aggregated-output.csv', query, 'aggregated-output.csv')
end
With the following helpers defined:
def system!(cmd)
  raise "Failed to run command #{cmd}" unless system(cmd)
end
def textql(source_file, query, output_file)
system! "cat #{source_file} | textql -header -output-header=true -sql \"#{query}\" > #{output_file}"
# this one uses csvfix to pretty print the table
system! "cat #{output_file} | csvfix ascii_table"
end
Be careful with the precision though when doing computations.
Writing an in-memory aggregating destination
A useful trick that can work here is to wrap a given destination with a class that does the aggregation. Here is how it could look:
class InMemoryAggregate
  def initialize(sum:, group_by:, destination:)
    @aggregate = Hash.new(0)
    @sum = sum
    @group_by = group_by
    # this relies a bit on the internals of Kiba, but not too much
    @destination = destination.shift.new(*destination)
  end

  def write(row)
    # do not write, but count here instead
    @aggregate[row[@group_by]] += row[@sum]
  end

  def close
    # use close to actually do the writing
    @aggregate.each do |k, v|
      # reformat BigDecimal additions here
      value = '%0.2f' % v
      @destination.write(@group_by => k, @sum => value)
    end
    @destination.close
  end
end
which you can use this way:
# convert your string into an actual number
transform do |r|
r[:amount] = BigDecimal.new(r[:amount])
r
end
destination CsvDestination, 'non-aggregated.csv', [:name, :amount]
destination InMemoryAggregate,
sum: :amount, group_by: :name,
destination: [
CsvDestination, 'aggregated.csv', [:name, :amount]
]
post_process do
system!("cat aggregated.csv | csvfix ascii_table")
end
The nice thing about this version is that you can reuse your aggregator with different destinations (like a database one, or anything else).
Note though that this will keep all the aggregates in memory, like the first version.
Inserting into a store with aggregating capabilities
Another way (especially useful if you have very large volumes) is to send the resulting data into something that will be able to aggregate the data for you. It could be a regular SQL database, Redis, or anything more fancy, which you would then be able to query as needed.
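As one illustration (not part of Kiba itself), a destination that pushes running totals into Redis might look roughly like this; it assumes the redis gem is available, and the 'totals' key name is arbitrary:

require 'redis'

class RedisAggregateDestination
  def initialize(sum:, group_by:, key: 'totals')
    @redis = Redis.new
    @sum = sum
    @group_by = group_by
    @key = key
  end

  def write(row)
    # HINCRBYFLOAT accumulates the per-name total inside Redis itself
    @redis.hincrbyfloat(@key, row[@group_by], row[@sum].to_f)
  end

  def close
    # nothing to flush; totals can be read back later with HGETALL
  end
end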
So as I said, the implementation will largely depend on your actual needs. Hope you will find something that works for you here!
I have a CSV file like:
123,hat,19.99
321,cap,13.99
I have this code:
products_file = File.open('text.txt')
while ! products_file.eof?
line = products_file.gets.chomp
puts line.inspect
products[ line[0].to_i] = [line[1], line[2].to_f]
end
products_file.close
which is reading the file. While it's not at the end of the file, it reads each line. I don't need the line.inspect in there, but it stores the file in an array inside of my products hash.
Now I want to pull the min and max value from the hash.
My code so far is:
read_file = File.open('text.txt', "r+").read
read_file.(?) |line|
products[ products.length] = gets.chomp.to_f
products.min_by { |x| x.size }
smallest = products
puts "Your highest priced product is #{smallest}"
Right now I don't have anything after read_file.(?) |line|, so I get an error. I tried using min or max, but neither worked.
Without using CSV
If I understand your question correctly, you don't have to use CSV class methods: just read the file (less header) into an array and determine the min and max as follows:
arr = ["123,hat,19.99", "321,cap,13.99",
"222,shoes,33.41", "255,shirt,19.95"]
arr.map { |s| s.split(',').last.to_f }.minmax
#=> [13.99, 33.41]
or
arr.map { |s| s[/\d+\.\d+$/].to_f }.minmax
#=> [13.99, 33.41]
If you want the associated records:
arr.minmax_by { |s| s.split(',').last.to_f }
#=> ["321,cap,13.99", "222,shoes,33.41"]
With CSV
If you wish to use CSV to read the file into an array:
arr = [["123", "hat", "19.99"],
["321", "cap", "13.99"],
["222", "shoes", "33.41"],
["255", "shirt", "19.95"]]
then
arr.map(&:last).minmax
# => ["13.99", "33.41"]
or
arr.minmax_by(&:last)
#=> [["321", "cap", "13.99"],
# ["222", "shoes", "33.41"]]
if you want the records. Note that in the CSV examples I didn't convert the last field to a float, assuming that all records have two decimal digits.
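If that assumption doesn't hold (prices with a varying number of digits before the decimal point), converting to floats first is safer; for example, with the same arr:

arr.map { |row| row.last.to_f }.minmax
#=> [13.99, 33.41]

arr.minmax_by { |row| row.last.to_f }
#=> [["321", "cap", "13.99"], ["222", "shoes", "33.41"]]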
You should use the built-in CSV class as such:
require 'csv'
data = CSV.read("text.txt")
data.sort!{ |row1, row2| row1[2].to_f <=> row2[2].to_f }
least_expensive = data.first
most_expensive = data.last
The Array#sort! method modifies data in place, so it is sorted based on the condition in the block for later usage. As you can see, the block sorts based on the values in each row at index 2 - in your case, the prices. One caveat: comparing the prices as strings only gives the same order as numeric comparison when every price has the same number of digits before the decimal point, so the to_f conversion shown above is the safer general choice; note, though, that to_f stops working if you have leading non-digit characters (e.g. $), in which case you would need to strip those first.
Then you can grab the most and least expensive, or the 5 most expensive, or whatever, at your leisure.
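Alternatively, if you only need the single cheapest and most expensive rows, a full sort isn't strictly required; a minmax_by pass over the same data (a small sketch using the data variable above) would do:

least_expensive, most_expensive = data.minmax_by { |row| row[2].to_f }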
What's the most efficient way to iterate through an entire table using Datamapper?
If I do this, does Datamapper try to pull the entire result set into memory before performing the iteration? Assume, for the sake of argument, that I have millions of records and that this is infeasible:
Author.all.each do |a|
puts a.title
end
Is there a way that I can tell Datamapper to load the results in chunks? Is it smart enough to know to do this automatically?
Thanks, Nicolas, I actually came up with a similar solution. I've accepted your answer since it makes use of Datamapper's dm-pagination system, but I'm wondering if this would do equally as well (or worse):
while authors = Author.slice(offset, CHUNK) do
authors.each do |a|
# do something with a
end
offset += CHUNK
end
Datamapper will run just one SQL query for the example above, so it will have to keep the whole result set in memory.
I think you should use some sort of pagination if your collection is big.
Using dm-pagination you could do something like:
PAGE_SIZE = 20
pager = Author.page(:per_page => PAGE_SIZE).pager # This will run a count query
(1..pager.total_pages).each do |page_number|
Author.page(:per_page => PAGE_SIZE, :page => page_number).each do |a|
puts a.title
end
end
You can play around with different values for PAGE_SIZE to find a good trade-off between the number of SQL queries and memory usage.
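If this pattern comes up in more than one place, it could be wrapped in a small helper that yields each record; this is just a sketch built directly on the dm-pagination example above:

PAGE_SIZE = 20

# Yields every Author one at a time, fetching them page by page.
def each_author_in_pages
  pager = Author.page(:per_page => PAGE_SIZE).pager
  (1..pager.total_pages).each do |page_number|
    Author.page(:per_page => PAGE_SIZE, :page => page_number).each do |author|
      yield author
    end
  end
end

each_author_in_pages { |a| puts a.title }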
What you want is the dm-chunked_query plugin: (example from the docs)
require 'dm-chunked_query'
MyModel.each_chunk(20) do |chunk|
chunk.each do |resource|
# ...
end
end
This will allow you to iterate over all the records in the model, in chunks of 20 records at a time.
EDIT: the example above had an extra #each after #each_chunk, and it was unnecessary. The gem author updated the README example, and I changed the above code to match.