Parsing large txt files in Ruby taking a lot of time? - ruby

Below is code that downloads a txt file (approximately 9000 lines) from the internet and populates the database. I have tried a lot, but it takes more than 7 minutes. I am using Windows 7 64-bit and Ruby 1.9.3. Is there a way to do it faster?
require 'open-uri'
require 'dbi'

dbh = DBI.connect("DBI:Mysql:mfmodel:localhost","root","")
#file = open('http://www.amfiindia.com/spages/NAV0.txt')
file = File.open('test.txt','r')
lines = file.lines
2.times { lines.next }
curSubType = ''
curType = ''
curCompName = ''
lines.each do |line|
  line.strip!
  if line[-1] == ')'
    curType,curSubType = line.split('(')
    curSubType.chop!
  elsif line[-4..-1] == 'Fund'
    curCompName = line.split(" Mutual Fund")[0]
  elsif line == ''
    next
  else
    sCode,isin_div,isin_re,sName,nav,rePrice,salePrice,date = line.split(';')
    sCode = Integer(sCode)
    sth = dbh.prepare "call mfmodel.populate(?,?,?,?,?,?,?)"
    sth.execute curCompName,curSubType,curType,sCode,isin_div,isin_re,sName
  end
end
dbh.do "commit"
dbh.disconnect
file.close
106799;-;-;HDFC ARBITRAGE FUND RETAIL PLAN DIVIDEND OPTION;10.352;10.3;10.352;29-Jun-2012
This is the format of the data to be inserted into the table. There are about 8,000 such lines; how can I combine them all and call the procedure just once? Also, does MySQL support arrays and iteration inside a routine to do such a thing? Please give your suggestions. Thanks.
EDIT
I have to insert into the tables depending on whether the rows already exist or not, and I also need to make some conditional comparisons before inserting. I definitely can't write plain SQL statements for these, so I wrote SQL stored procedures. Now I have a list @the_data; how do I pass that to the procedure and then iterate through it all on the MySQL side? Any ideas?
insert into mfmodel.company_masters (company_name) values
#{@the_data.map {|str| "('#{str[0]}')"}.join(',')}
This makes 100 insertions, but 35 of them are redundant, so I need to check the table for existing entries before doing an insertion.
Any ideas? Thanks.

From your comment, it looks like you are spending all your time executing DB queries. On a recent Ruby project, I also had to optimize some slow code which was importing data from CSV files into the database. I got about a 500x performance increase by importing all the data with a single bulk INSERT query, rather than one query for each row of the CSV file. I accumulated all the data in an array, and then built a single SQL query using string interpolation and Array#join.
From your comments, it seems that you may not know how to build and execute dynamic SQL for a bulk INSERT. First get your data in a nested array, with the fields to be inserted in a known order. Just for an example, imagine we have data like this:
some_data = [['106799', 'HDFC FUND'], ['112933', 'SOME OTHER FUND']]
You seem to be using Rails and MySQL, so the dynamic SQL will have to use MySQL syntax. To build and execute the INSERT, you can do something like:
ActiveRecord::Base.connection.execute(<<SQL)
INSERT INTO some_table (a_column, another_column)
VALUES #{some_data.map { |num,str| "(#{num},'#{str}')" }.join(',')};
SQL
You said that you need to insert data into 2 different tables. That's not a problem; just accumulate the data for each table in a different array, and execute 2 dynamic queries, perhaps inside a transaction. 2 queries will be much faster than 9000.
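For example, here is a rough sketch of the two bulk queries wrapped in a transaction. The table names, column names, and the company_rows/fund_rows arrays are all made-up placeholders; the arrays would be accumulated while parsing the file:
ActiveRecord::Base.transaction do
  # hypothetical: company_rows is an array of company names,
  # fund_rows is an array of [scheme_code, scheme_name] pairs
  ActiveRecord::Base.connection.execute(<<-SQL)
    INSERT INTO company_masters (company_name)
    VALUES #{company_rows.map { |name| "('#{name}')" }.join(',')};
  SQL
  ActiveRecord::Base.connection.execute(<<-SQL)
    INSERT INTO fund_masters (scheme_code, scheme_name)
    VALUES #{fund_rows.map { |code, name| "(#{code},'#{name}')" }.join(',')};
  SQL
end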
Again, you said in the comments that you may need to update some records rather than inserting. That was also the case in the "CSV import" case which I mentioned above. The solution is only slightly more complicated:
# sometimes code speaks more eloquently than prose
require 'set'
already_imported = Set.new
MyModel.select("unique_column_which_also_appears_in_imported_files").each do |x|
  already_imported << x.unique_column_which_also_appears_in_imported_files
end

to_insert, to_update = [], []
imported_data.each do |row|
  # for the following line, don't let different data types
  # (like String vs. Numeric) get ya
  # if you need to convert the imported data to match correctly against what's
  # already in the DB, do it!
  if already_imported.include? row[index_of_unique_column]
    to_update << row
  else
    to_insert << row
  end
end
Then you must build a dynamic INSERT and a dynamic UPDATE for each table involved. Google for UPDATE syntax if you need it, and go wild with all your favorite string processing functions!
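If the unique column happens to be a primary key or carry a UNIQUE index, one alternative I have used for the update side in MySQL (not necessarily what your stored procedure needs) is a multi-row INSERT ... ON DUPLICATE KEY UPDATE. A sketch with invented column names:
# assumes each row in to_update is [scheme_code, scheme_name],
# and scheme_code has a UNIQUE index
ActiveRecord::Base.connection.execute(<<-SQL)
  INSERT INTO fund_masters (scheme_code, scheme_name)
  VALUES #{to_update.map { |code, name| "(#{code},'#{name}')" }.join(',')}
  ON DUPLICATE KEY UPDATE scheme_name = VALUES(scheme_name);
SQL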
Going back to the sample code above, note the difference between numeric and string fields. If it is possible that the strings may contain single quotes, you will have to make sure that all the single quotes are escaped. The behavior of String#gsub may surprise you when you try to do this: it assigns a special meaning to \'. The best way I have found so far to escape single quotes is: string.gsub("'") { "\\'" }. Perhaps other posters know a better way.
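A quick illustration of the difference, using Ruby's own semantics for \' in a replacement string:
s = "O'Brien"
s.gsub("'", "\\'")    # => "OBrienBrien" -- \' in the replacement means "text after the match"
s.gsub("'") { "\\'" } # => "O\\'Brien"   -- the block's return value is used literally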
If you are inserting dates, make sure they are converted to MySQL's date syntax.
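For instance, dates in the file look like 29-Jun-2012; something along these lines converts them to the YYYY-MM-DD format MySQL expects:
require 'date'
mysql_date = Date.strptime('29-Jun-2012', '%d-%b-%Y').strftime('%Y-%m-%d')
# => "2012-06-29"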
Yes, I know that "roll-your-own" SQL sanitization is very iffy. There may even be security bugs with the above approach; if so, I hope my better-informed peers will set me straight. But the performance gains are just too great to ignore. Again, if this can be done using a prepared query with placeholders, and you know how, please post!
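For what it's worth, one possibility I have not benchmarked is to build the multi-row VALUES list out of ? placeholders and let DBI (which the question already uses) do the quoting; whether it stays fast likely depends on how the driver emulates prepared statements:
# reusing some_data from the example above
placeholders = (["(?,?)"] * some_data.size).join(',')
sth = dbh.prepare("INSERT INTO some_table (a_column, another_column) VALUES #{placeholders}")
sth.execute(*some_data.flatten)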
Looking at your code, it looks like you are inserting the data using a stored procedure (mfmodel.populate). Even if you do want to use a stored procedure for this, why do you have dbh.prepare in the loop? You should be able to move that line outside of lines.each.
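In other words, something along these lines (sketched against the question's own DBI code):
sth = dbh.prepare("call mfmodel.populate(?,?,?,?,?,?,?)")  # prepare once, outside the loop
lines.each do |line|
  # ... same parsing as before ...
  sth.execute curCompName, curSubType, curType, sCode, isin_div, isin_re, sName
end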

You might want to try exporting the data as csv and loading it with 'load data infile... replace'. It seems cleaner/easier than trying to construct bulk insert queries.
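Roughly like this, run through the same DBI handle. The file path, table name, and column list below are placeholders, and you may or may not need LOCAL depending on where the file sits relative to the MySQL server:
dbh.do(<<-SQL)
  LOAD DATA LOCAL INFILE 'nav_data.csv'
  REPLACE INTO TABLE mfmodel.fund_masters
  FIELDS TERMINATED BY ','
  (scheme_code, isin_div, isin_re, scheme_name, nav, repurchase_price, sale_price, nav_date)
SQL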

Related

Accessing SQL array element from Libreoffice Basic

I have a PostgreSQL database that contains program data. In LibreOffice Calc, I have Basic macros that interact with the PostgreSQL database and use Calc as the user client. One of the PostgreSQL tables has an array, and I can't index into that array directly from Basic.
Here is the table setup, as shown in pgAdmin:
sq_num integer,
year_start integer,
id serial NOT NULL,
"roleArray" text[]
Say I want to SELECT roleArray[50]. My every attempt to do this out of Basic results in the entire array being passed. I can certainly split the array myself and get the element I'm after, but I was using SQL arrays to help automate this stuff.
My Basic code uses a Libreoffice Base file for the connection to the postgresql database. Going to the Base file, I cannot create a query that will select an individual element and not return the entire array UNLESS I select the button "Run SQL command directly" and run this query:
SELECT "roleArray"['50'] FROM myTableThatHasArrays
Then I get element 50 from every record as intended.
I believe there is a bug report that describes this, where the Base command parser can't handle indexing an array. My question is what is the best method to overcome this?
The best scenario is to be able to index an element in the SQL array directly from Basic.
It sounds like you used XRow.getString, which (sensibly enough) retrieves the array as a single large string. Instead, use XRow.getArray and then XArray.getArray. Here is a working example:
sSQL = "SELECT id, ""roleArray""[2] FROM mytablethathasarrays;"
oResult = oStatement.executeQuery(sSQL)
s = ""
Do While oResult.next()
    sql_array = oResult.getArray(2)
    basic_array = sql_array.getArray(Null)
    s = s & oResult.getInt(1) & " " & basic_array(1) & CHR$(10)
Loop
MsgBox s

ActiveRecord#create does not allow to set the id

I need to batch insert more than 100,000 records.
The id will not be created by the DB and I have to use a given UUID.
Doing this in a loop using mymodel.new, assigning the ID, and then saving the record works, but is way too slow (approx. 20 min).
When I create an array 'records' and use mymodel.create(records) I run into the 'cannot mass assign id' problem.
I've tried all solutions I could find:
'attr_accessible :id, ...' on the model - works for everything but id.
(re)defining 'def self.attributes_protected_by_default; [] end' - no effect.
One piece of advice was to use 'create' with ':without_protection => true', but create does not take more than one argument.
So none of these solutions helped.
What else can I do?
Finally, I found a solution which might not be elegant in a Rails way but it solves my performance problem:
At first I tried what @Albin suggested, only to find that create(records) is not much faster (still > 15 min).
My solution now is:
Create a temporary CSV file
db_tmp = File.open("tmp_file", "w")
records = ""
@data_records.each do |row|
  records << "#{row['id']},#{row['field_1']},#{row['field_2']}, ... \n"
end
db_tmp.write(records)
db_tmp.close
Execute the SQL with a LOAD DATA command
sql = "load data infile 'tmp_file' into table my_table
fields optionally enclosed by '\"' terminated by ','
(id,field_1,field_2, ... )"
ActiveRecord::Base.connection.execute(sql)
The whole process now lasts less than 1 (!) minute, including getting the data over the network and parsing the original json message into a hash.
I'm aware that this does not clarify how create could be tricked into allowing ID assignment but the performance problem is solved.
Another point is that my solution bypasses any validation defined for the model. This is not a problem because in this case I know I can rely on integrity of the data I'm receiving - and if there's a problem load would fail and execute would raise an exception.

SQLITE3 strings in where clauses seem confused

I'm wondering if anyone has any clarification on the difference between the following statements using sqlite3 gem with ruby 1.9.x:
#db.execute("INSERT INTO table(a,b,c) VALUES (?,?,?)",
some_int, other_int, some_string)
and
#db.execute("INSERT INTO table(a,b,c) VALUES (#{some_int},"+
+"#{some_int}, #{some_string})")
My problem is: When I use the first method for insertion, I can't query for the "c" column using the following statement:
SELECT * FROM table WHERE c='some magic value'
I can use this:
"SELECT * FROM table WHERE c=?", "some magic value"
but what I really want to use is
"SELECT * FROM table WHERE c IN ('#{options.join("','")}')"
And this doesn't work with rows inserted using the first method.
Does anyone know what the difference is at the database level that is preventing the IN from working properly?
I figured this out quite a while ago, but forgot to come back and point it out, in case someone finds this question at another time.
The difference turns out to be blobs. Apparently when you use the first form above (the substitution method using (?,?)), SQLite3 stores the data as blobs. However, if you construct an ordinary SQL statement, the value is inserted as a regular string, and the two aren't equivalent.
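One way to check what actually got stored is to ask SQLite for each value's storage class; a small sketch (the database file and table name are made up):
require 'sqlite3'
db = SQLite3::Database.new('example.db')  # hypothetical database file
db.execute("SELECT c, typeof(c) FROM some_table").each do |value, type|
  puts "#{value.inspect} stored as #{type}"  # prints 'text' or 'blob'
end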
Inserting with a raw query did not work for me, but raw queries do work when fetching data. When using SQLite in a mobile app the query did not work, but the same raw query ran fine in SQLite Browser.

How do I query for when a field doesn't begin with a letter?

I'm tasked with adding an option to our search, which will return results where a given field doesn't begin with a letter of the alphabet. (The .StartsWith(letter) part wasn't so hard).
But I'm rather unsure how to get the results that don't fall within the A-Z set, and I'm also hoping it generates moderately efficient SQL underneath.
Any help appreciated - thanks.
In C# use the following construct, assuming db as a data context:
var query = from row in db.SomeTable
            where !System.Data.Linq.SqlClient.SqlMethods.Like(row.SomeField, "[A-Z]%")
            select row;
This is only supported in LINQ to SQL queries. All rules of the T-SQL LIKE operator apply.
You could also use a less efficient solution:
var query = from row in db.SomeTable
            where row.SomeField[0] < 'A' || row.SomeField[0] > 'Z'
            select row;
This gets translated into SUBSTRING, CAST, and UNICODE constructs.
Finally, you could use VB, where there appears to be native support for the Like method.
Though SQL provides the ability to check a range of characters in a LIKE statement using bracket notation ([a-f]% for example), I haven't seen a LINQ to SQL construct that supports this directly.
A couple of thoughts:
First, if the result set is relatively small, you could do a .ToList() and filter in memory after the fact.
Alternatively, if you have the ability to change the data model, you could set up additional fields or tables to help index the data and improve the search.
--EDIT--
Made changes per Ruslan's comment below.
Well, I have no idea if this will work because I have never tried it and don't have a compiler nearby, but the first thing I would try is:
var query = from x in db.SomeTable
            where x.SomeField != null &&
                  x.SomeField.Length >= 1 &&
                  x.SomeField.Substring(0, 1).All(c => !Char.IsLetter(c))
            select x;
The possibility exists that LINQ to SQL fails to convert this to SQL.

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
            select new { FirstCode = record.Substring(0, 2),
                         OtherCode = record.Substring(18, 4) };
For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.
I don't think there's a better way out of the box.
One could define a flat-file LINQ provider, which could make the whole thing much simpler, but as far as I know, no one has done so yet.
