CSV only returns strings. I need to keep value types - ruby

I'm trying to parse a CSV file, grab every row, and upload it to Postgres. The problem is that CSV.foreach returns every value as a string, and Postgres won't accept string values in double columns.
Is there an easy way to keep the value types? Or am I going to have to go column by column and convert the strings into doubles and date formats?
require 'csv'

CSV.foreach("C:\\test\\file.csv") do |row|
  print row
end
All I need is for the values to keep their types and not be returned as strings. I don't know if this is possible with CSV. I have it working just fine when using the spreadsheet gem to parse .xls files.

CSVs do not natively have types; a CSV contains simple comma-separated text. When you view a CSV, you are seeing everything there is in the file. In an Excel file, by contrast, there is a lot of hidden metadata that tracks the type of each cell.
When you #foreach through a CSV, each row is given as an array of string values. A row might look something like
[ "2.33", "4", "Hello" ]
with each value given as a string. You may think of "2.33" as a float/double, but CSV parsers only know to think of it as a string.
You can convert strings to other types using Ruby's type conversion functions, assuming each column contains only one type (which, since you're using an SQL database, is a pretty safe assumption).
You could write something like this to convert the values in each row to specific types. This example converts the first column to a float (which should work with Postgres' double precision type), the second column to an integer, and the third column to a string.
require 'csv'

CSV.foreach("C:\\test\\file.csv") do |row|
  p [ row[0].to_f, row[1].to_i, row[2].to_s ]
end
Given the sample row from above, this would print an array like
[2.33, 4, "Hello"]
You should be able to use these converted values in whatever else you're doing with Postgres.
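If you are talking to Postgres through the pg gem, a parameterized insert will accept these converted values directly. Here is a minimal sketch, assuming a hypothetical measurements table with reading, count, and label columns (the connection details are assumptions too):

require 'csv'
require 'pg'

conn = PG.connect(dbname: 'test') # connection details are assumptions

CSV.foreach("C:\\test\\file.csv") do |row|
  # exec_params sends the values as query parameters,
  # so no string interpolation or manual quoting is needed
  conn.exec_params(
    'INSERT INTO measurements (reading, count, label) VALUES ($1, $2, $3)',
    [row[0].to_f, row[1].to_i, row[2].to_s]
  )
end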

require 'csv'

CSV.foreach("test.txt", converters: :all) do |row|
  print row
end
This should convert numerics and datetimes. For integers and floats this works perfectly, but I was not able to get an actual conversion to DateTime going.
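The built-in :date_time converter is fairly strict about which fields it will even attempt to parse, so formats like 29-Jun-2012 can fall through as strings. One workaround is to pass a custom converter alongside :numeric; this is a minimal sketch assuming a dd-Mon-yyyy format in the file:

require 'csv'
require 'date'

# convert fields that look like 29-Jun-2012; leave everything else untouched
date_converter = lambda do |field|
  Date.strptime(field, '%d-%b-%Y') rescue field
end

CSV.foreach("test.txt", converters: [:numeric, date_converter]) do |row|
  p row
end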

Related

Phoenix/Ecto - integer values in a Postgres array column not being retrieved

I was trying to build schemas for an existing set of tables using Ecto 2.1, in a Phoenix 1.3.0 app.
Example:
defmodule Book do
  use Ecto.Schema

  schema "books" do
    field :title, :string
    field :owner_ids, {:array, :integer}
    field :borrower_ids, {:array, :integer}
    field :published, :boolean
  end
end
On the console, when I do Book |> first |> Repo.one, I see the owner_ids are printed properly as ["29"], but borrower_ids shows '$'. I verified using psql that borrower_ids for that row does have a list of values, exactly like the owner_ids column.
All other columns in the table print just fine. Anything I am missing here?
Update: Rails/ActiveRecord 5.1.4 was able to retrieve this table and row just fine.
'$' is a list containing the number 36:
iex> [36]
'$'
In a nutshell, every time Elixir sees a list of integers representing ASCII characters, it prints them between single quotes, because that's how Erlang strings are represented (also called charlists).
The i helper in IEx is very useful in those situations. When you see a value that you don't understand, you can use it to ask for more information:
iex(2)> i '$'
Term
'$'
Data type
List
Description
This is a list of integers that is printed as a sequence of characters
delimited by single quotes because all the integers in it represent valid
ASCII characters. Conventionally, such lists of integers are referred to
as "charlists" (more precisely, a charlist is a list of Unicode codepoints,
and ASCII is a subset of Unicode).
Raw representation
[36]
Reference modules
List

Importing CSV data to update existing records with Rails

I'm having a bit of trouble getting a CSV into my application, which I'd like to use to update existing records and create new ones. My CSV data has only two headers, Date and Total. I've created an import method in my model which creates everything, but if I upload the CSV again it won't update existing records, it just creates duplicates.
Here is my method. As you can see, I'm finding the record by the Date heading using find_by, creating a new record if that returns nothing, and updating it with the data from the current row. But that doesn't seem to happen; I just get duplicate rows.
def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    entry = find_by(Date: row["Date"]) || new
    entry.update row.to_hash
    entry.save!
  end
end
I hope I'm understanding this correctly. As we found in the comment thread for the question, the date was being persisted to the database in yyyy-mm-dd format, while the date being read in from the CSV file was in dd-mm-yyyy format.
Doing a find_by with the raw string never returned results, as the format differed from that used in the database.
Date.parse will convert the string read from the CSV file into a true Date object which can be successfully compared against the date stored in the database.
So, rather than:
entry = find_by(Date: row["Date"]) || new
Use:
entry = find_by(Date: Date.parse(row["Date"])) || new
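Putting it together, the whole import method might look like this. A sketch, assuming the model really does have Date and Total columns as in the question; note that update already persists the record, so the separate save! is redundant:

def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    date = Date.parse(row["Date"])
    entry = find_by(Date: date) || new
    # assign the parsed Date rather than the raw string, so the same
    # format mismatch can't happen on write
    entry.update(Date: date, Total: row["Total"])
  end
end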

Parsing large txt files in ruby taking a lot of time?

Below is the code to download a txt file of approx 9000 lines from the internet and populate the database. I have tried a lot, but it takes more than 7 minutes. I am using Win 7 64-bit and Ruby 1.9.3. Is there a way to do it faster?
require 'open-uri'
require 'dbi'

dbh = DBI.connect("DBI:Mysql:mfmodel:localhost", "root", "")
#file = open('http://www.amfiindia.com/spages/NAV0.txt')
file = File.open('test.txt', 'r')
lines = file.lines
2.times { lines.next }

curSubType = ''
curType = ''
curCompName = ''

lines.each do |line|
  line.strip!
  if line[-1] == ')'
    curType, curSubType = line.split('(')
    curSubType.chop!
  elsif line[-4..-1] == 'Fund'
    curCompName = line.split(" Mutual Fund")[0]
  elsif line == ''
    next
  else
    sCode, isin_div, isin_re, sName, nav, rePrice, salePrice, date = line.split(';')
    sCode = Integer(sCode)
    sth = dbh.prepare "call mfmodel.populate(?,?,?,?,?,?,?)"
    sth.execute curCompName, curSubType, curType, sCode, isin_div, isin_re, sName
  end
end

dbh.do "commit"
dbh.disconnect
file.close
106799;-;-;HDFC ARBITRAGE FUND RETAIL PLAN DIVIDEND OPTION;10.352;10.3;10.352;29-Jun-2012
This is the format of the data to be inserted into the table. There are 8000 such lines; how can I combine them all and call the procedure just once? Also, does MySQL support arrays and iteration to do such a thing inside the routine? Please give your suggestions. Thanks.
EDIT
I have to make insertions into the tables depending on whether the rows already exist or not, and I also need to make conditional comparisons before inserting. I definitely can't write plain SQL statements for all of that, so I wrote SQL stored procedures. Now I have a list @the_data; how do I pass that to the procedure and then iterate through it all on the MySQL side? Any ideas?
insert into mfmodel.company_masters (company_name) values
#{@the_data.map {|str| "('#{str[0]}')"}.join(',')}
This makes 100 insertions, but 35 of them are redundant, so I need to search the table for existing entries before doing an insertion.
Any ideas? Thanks.
From your comment, it looks like you are spending all your time executing DB queries. On a recent Ruby project, I also had to optimize some slow code which was importing data from CSV files into the database. I got about a 500x performance increase by importing all the data by using a single bulk INSERT query, rather than 1 query for each row of the CSV file. I accumulated all the data in an array, and then built a single SQL query using string interpolation and Array#join.
From your comments, it seems that you may not know how to build and execute dynamic SQL for a bulk INSERT. First get your data in a nested array, with the fields to be inserted in a known order. Just for an example, imagine we have data like this:
some_data = [['106799', 'HDFC FUND'], ['112933', 'SOME OTHER FUND']]
You seem to be using Rails and MySQL, so the dynamic SQL will have to use MySQL syntax. To build and execute the INSERT, you can do something like:
ActiveRecord::Base.connection.execute(<<SQL)
INSERT INTO some_table (a_column, another_column)
VALUES #{some_data.map { |num,str| "(#{num},'#{str}')" }.join(',')};
SQL
You said that you need to insert data into 2 different tables. That's not a problem; just accumulate the data for each table in a different array, and execute 2 dynamic queries, perhaps inside a transaction. 2 queries will be much faster than 9000.
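For example, with ActiveRecord the two statements can be wrapped like this (companies_sql and funds_sql are hypothetical names for the two query strings built as shown above):

# a minimal sketch: run both bulk INSERTs atomically,
# so a failure in the second rolls back the first
ActiveRecord::Base.transaction do
  ActiveRecord::Base.connection.execute(companies_sql)
  ActiveRecord::Base.connection.execute(funds_sql)
end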
Again, you said in the comments that you may need to update some records rather than inserting. That was also the case in the "CSV import" case which I mentioned above. The solution is only slightly more complicated:
# sometimes code speaks more eloquently than prose
require 'set'

already_imported = Set.new
MyModel.select("unique_column_which_also_appears_in_imported_files").each do |x|
  already_imported << x.unique_column_which_also_appears_in_imported_files
end

to_insert, to_update = [], []
imported_data.each do |row|
  # for the following line, don't let different data types
  # (like String vs. Numeric) get ya
  # if you need to convert the imported data to match correctly against what's
  # already in the DB, do it!
  if already_imported.include? row[index_of_unique_column]
    to_update << row
  else
    to_insert << row
  end
end
Then you must build a dynamic INSERT and a dynamic UPDATE for each table involved. Google for UPDATE syntax if you need it, and go wild with all your favorite string processing functions!
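As an alternative to a separate UPDATE: if the table has a unique index on the matching column, MySQL's INSERT ... ON DUPLICATE KEY UPDATE can insert and update in one statement, so you don't need to split the rows at all. A sketch reusing the hypothetical some_table from above, assuming a_column carries the unique index:

ActiveRecord::Base.connection.execute(<<SQL)
INSERT INTO some_table (a_column, another_column)
VALUES #{(to_insert + to_update).map { |num,str| "(#{num},'#{str}')" }.join(',')}
ON DUPLICATE KEY UPDATE another_column = VALUES(another_column);
SQL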
Going back to the sample code above, note the difference between numeric and string fields. If it is possible that the strings may contain single quotes, you will have to make sure that all the single quotes are escaped. The behavior of String#gsub may surprise you when you try to do this: it assigns a special meaning to \'. The best way I have found so far to escape single quotes is: string.gsub("'") { "\\'" }. Perhaps other posters know a better way.
If you are inserting dates, make sure they are converted to MySQL's date syntax.
Yes, I know that "roll-your-own" SQL sanitization is very iffy. There may even be security bugs with the above approach; if so, I hope my better-informed peers will set me straight. But the performance gains are just too great to ignore. Again, if this can be done using a prepared query with placeholders, and you know how, please post!
Looking at your code, it looks like you are inserting the data using a stored procedure (mfmodel.populate). Even if you do want to use a stored procedure for this, why do you have dbh.prepare in the loop? You should be able to move that line outside of lines.each.
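A sketch of that change against the code from the question, with the statement prepared once and reused for every row:

# prepare once, outside the loop
sth = dbh.prepare "call mfmodel.populate(?,?,?,?,?,?,?)"
lines.each do |line|
  # ... same parsing as before ...
  sth.execute curCompName, curSubType, curType, sCode, isin_div, isin_re, sName
end
sth.finish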
You might want to try exporting the data as csv and loading it with 'load data infile... replace'. It seems cleaner/easier than trying to construct bulk insert queries.
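With the sample data above, that might look something like the following. The table name and file path are my assumptions, and REPLACE requires a unique key on the table:

dbh.do(<<'SQL')
LOAD DATA LOCAL INFILE '/tmp/nav.txt'
REPLACE INTO TABLE mfmodel.fund_navs
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
SQL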

Ruby csv: no rows found in single column import file

I'm happily using the Ruby 1.9.3 CSV library to import CSV files (csv/rdoc)
But when the file has only a single column, no data rows are found, even though it can find the header field.
require 'csv'

csv = CSV.new(File.open(import_dir + "#{table}.csv"), :headers => true, :col_sep => ';')
csv.each do |row|
  # never reached for a single-column file
end
each doesn't yield any rows for a single-column file. This code is working fine for all other files.
The file is simply:
name
sample account
The code finds the header "name" but sees no data rows. I tried quoting the value and adding extra rows. If I add a second column before or after, the data rows can be seen.
Any ideas?
This turned out to be caused by a bug in the app code; it had nothing to do with the Ruby CSV library.

Parsing Excel and Google Docs Spreadsheet with column headers or titles

I have Excel or Google Docs spreadsheet files I need to parse. The columns may be ordered differently from file to file, but there is always a title in the first row.
I'm looking for a way to use column titles in referencing cells when reading a file with Roo or another similarly purposed gem.
Here's what I'd like to be able to do
Get the value in the 4th row for the column titled Widget Count, no matter what the column's position is:
thisWidgetCount = cell[4,'Widget Count'];
I realize I could just walk the columns and build a hash of title to column number, but it seems like this is likely something someone has already wrapped into a lib.
You can extend Roo, or just write a helper:
def tcell(roo, line, column_title)
  # find the column whose header cell (row 1) matches the given title
  column = roo.first_column.upto(roo.last_column).map { |col| roo.cell(1, col) }.index(column_title) + 1
  roo.cell(line, column)
end

oo = Openoffice.new("simple_spreadsheet.ods")
this_widget_count = tcell(oo, 3, 'Widget Count')
And of course it would be better to preload all the captions into an Array or Hash, because this version re-reads the title row on every lookup, which is a bad idea for performance.
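A minimal sketch of that idea (the helper name header_index is my own):

# read the title row once and map each caption to its column number
def header_index(roo)
  (roo.first_column..roo.last_column).each_with_object({}) do |col, h|
    h[roo.cell(1, col)] = col
  end
end

headers = header_index(oo)
this_widget_count = oo.cell(3, headers['Widget Count'])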
