I need to batch insert more than 100,000 records.
The id will not be created by the DB; I have to use a given UUID.
Doing this in a loop with mymodel.new, assigning the ID and then saving the record, works but is far too slow (approx. 20 minutes).
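Roughly, the loop looks like this (model and field names simplified):

data_records.each do |row|
  rec = MyModel.new
  rec.id      = row['id']        # the UUID that is supplied to me
  rec.field_1 = row['field_1']
  rec.save!                      # one INSERT per record -> roughly 20 minutes
end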
When I create an array 'records' and use mymodel.create(records) I run into the 'cannot mass assign id' problem.
I've tried all solutions I could find:
'attr_accessible :id, ...' for the model - works for every attribute except id.
(re)defining 'def self.attributes_protected_by_default; []; end' - no effect
one suggestion was to use 'create' with ':without_protection => true', but for me create did not accept more than one argument.
So none of these solutions helped.
What else can I do?
Finally, I found a solution that may not be elegant in the Rails way, but it solves my performance problem:
At first I tried what @Albin suggested, only to find that create(records) is not much faster (still > 15 minutes).
My solution now is:
Create a temporary CSV file
db_tmp = File.open("tmp_file", "w")
records = ""
@data_records.each do |row|
  # build one CSV line per record; the id is the externally supplied UUID
  records << "#{row['id']},#{row['field_1']},#{row['field_2']}, ... \n"
end
db_tmp.write(records)
db_tmp.close
Execute SQL with a LOAD DATA command
sql = "load data infile 'tmp_file' into table my_table
fields optionally enclosed by '\"' terminated by ','
(id,field_1,field_2, ... )"
ActiveRecord::Base.connection.execute(sql)
The whole process now takes less than one (!) minute, including getting the data over the network and parsing the original JSON message into a hash.
I'm aware that this does not clarify how create could be tricked into allowing ID assignment, but the performance problem is solved.
Another point is that my solution bypasses any validations defined on the model. This is not a problem here, because I know I can rely on the integrity of the data I'm receiving - and if there were a problem, the load would fail and execute would raise an exception.
I've been tasked with altering the company's Event Export from our PlayFab environment into Azure. Initially we set it up to export all events, but after looking at the data there is some exported data that we don't want for legal reasons. I was exploring the "Use custom query" method and was trying to build a query that returns all data except the columns I want to exclude. The problem is that these columns are nested. I've tried using the project-away operator to exclude one column for now, but when I run the query below
['events.all']
| project-away EventData.ColumnToExclude
| limit 100
I get this error
I'm assuming it's because project-away does not support nested columns. Is there an easy way to exclude the column without having to flatten the data or list all my columns (our developers might create new events without notice, so that won't work)?
UPDATE 1:
I've found that project-away is the syntax for removing a column from a table, but what I needed was a way to remove a key from a JSON/dynamic object, so bag_remove_keys() is the correct approach:
['events.all']
| project EventData=bag_remove_keys(EventData, dynamic(['Key1', 'Key2', '$.Key3.SubKey1']))
But now I am facing another issue. When I use the '$.' notation for subkeys I get the below error
Query execution has resulted in error (0x8000FFFF): Partial query failure: Catastrophic failure (message: 'PropertyBagEncoder::Add: non-contiguous entries: ', details: '').
[0]Kusto.Data.Exceptions.KustoDataStreamException: Query execution has resulted in error (0x8000FFFF): Partial query failure: Catastrophic failure (message: 'PropertyBagEncoder::Add: non-contiguous entries: ', details: '').
Timestamp=2022-01-31T13:54:56.5237026Z
If I don't list any subkeys I don't get this issue, and I can't understand why.
UPDATE 2:
I found that bag_remove_keys() has a bug. With the query below I get the error described in UPDATE 1:
datatable(d:dynamic)
[
dynamic(
{
"test1": "val",
"test2": {},
"test3": "val"
}
)
]
| extend d1=bag_remove_keys(d, dynamic(['$.SomeKey.Sub1', '$.SomeKey.Sub2']))
However, if I move the "test2" key to the end, I don't get an error, but d1 will not show the "test2" key in the output.
Also, if one of the keys passed to bag_remove_keys() matches a key from the input, e.g. | extend d1=bag_remove_keys(d, dynamic(['$.SomeKey.Sub1', '$.SomeKey.Sub2', 'test1'])), then again it will not error but will remove "test2" from the output.
Thanks for reporting it Andrei, it is a bug and we are working on a fix.
Update: the fix has been checked in and will be deployed within two weeks; please open a support ticket if you need it earlier.
I have the following piece of code:
coll_name = "#{some_name}_#{differentiator}"
coll_object = @database[coll_name]
idExist = coll_object.find({"adSet.name" => NAME}).first()
if idExist.nil?
  docId = coll_object.insert(request)
else
  docId = idExist["_id"]
end
return docId
differentiator can be the same or different between iterations of the loop from which this code is called, so each time there can be a new collection or the same collection. Now, if the same collection is received, there might already be an object with name = NAME, and in that case no insert should be carried out. However, I have observed that documents with the same NAME are getting inserted. Can anybody help out with this problem?
The explanation for this behavior could be a race condition: the duplicate is inserted by another thread/process between lines 3 and 5 of your code. Two threads try to create the same name at the same time, the database tells both that the name doesn't exist yet, and when those replies arrive, both insert the document.
To prevent this from happening, create a unique index on the name field. This will prevent MongoDB from inserting two documents with the same name. Once you do this, you can remove the existence check before inserting: just try to insert the document, then call getLastError to find out whether it worked. If it didn't, retrieve the existing document with an additional query.
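A rough sketch of that (the method names assume the legacy mongo gem that the snippet above appears to use, and it rescues the driver error from an acknowledged write instead of calling getLastError by hand):

# create the unique index once, e.g. at startup
coll_object.ensure_index([["adSet.name", Mongo::ASCENDING]], :unique => true)

begin
  docId = coll_object.insert(request, :w => 1)   # acknowledged write (:safe => true on older driver versions)
rescue Mongo::OperationFailure
  # duplicate key: another thread/process got there first, so fetch the existing document
  docId = coll_object.find({"adSet.name" => NAME}).first["_id"]
end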
Below is the code to download a txt file (approx. 9000 lines) from the internet and populate the database. I have tried a lot, but it takes a long time - more than 7 minutes. I am using Windows 7 64-bit and Ruby 1.9.3. Is there a way to do it faster?
require 'open-uri'
require 'dbi'
dbh = DBI.connect("DBI:Mysql:mfmodel:localhost","root","")
#file = open('http://www.amfiindia.com/spages/NAV0.txt')
file = File.open('test.txt','r')
lines = file.lines
2.times { lines.next }
curSubType = ''
curType = ''
curCompName = ''
lines.each do |line|
  line.strip!
  if line[-1] == ')'
    curType,curSubType = line.split('(')
    curSubType.chop!
  elsif line[-4..-1] == 'Fund'
    curCompName = line.split(" Mutual Fund")[0]
  elsif line == ''
    next
  else
    sCode,isin_div,isin_re,sName,nav,rePrice,salePrice,date = line.split(';')
    sCode = Integer(sCode)
    sth = dbh.prepare "call mfmodel.populate(?,?,?,?,?,?,?)"
    sth.execute curCompName,curSubType,curType,sCode,isin_div,isin_re,sName
  end
end
dbh.do "commit"
dbh.disconnect
file.close
106799;-;-;HDFC ARBITRAGE FUND RETAIL PLAN DIVIDEND OPTION;10.352;10.3;10.352;29-Jun-2012
This is the format of the data to be inserted into the table. There are 8000 such lines; how can I combine them all and call the procedure just once? Also, does MySQL support arrays and iteration to do such a thing inside the routine? Please give your suggestions. Thanks.
EDIT
I have to make insertions into the tables depending on whether the rows already exist or not, and I also need to make conditional comparisons before inserting. I can't express these as plain SQL statements, so I wrote stored procedures. Now I have a list @the_data; how do I pass that to the procedure and then iterate through it all on the MySQL side? Any ideas?
insert into mfmodel.company_masters (company_name) values
#{@the_data.map {|str| "('#{str[0]}')"}.join(',')}
This makes 100 insertions, but 35 of them are redundant, so I need to search the table for existing entries before doing an insertion.
Any ideas? Thanks.
From your comment, it looks like you are spending all your time executing DB queries. On a recent Ruby project, I also had to optimize some slow code which was importing data from CSV files into the database. I got about a 500x performance increase by importing all the data with a single bulk INSERT query, rather than one query for each row of the CSV file. I accumulated all the data in an array, and then built a single SQL query using string interpolation and Array#join.
From your comments, it seems that you may not know how to build and execute dynamic SQL for a bulk INSERT. First get your data in a nested array, with the fields to be inserted in a known order. Just for an example, imagine we have data like this:
some_data = [['106799', 'HDFC FUND'], ['112933', 'SOME OTHER FUND']]
You seem to be using Rails and MySQL, so the dynamic SQL will have to use MySQL syntax. To build and execute the INSERT, you can do something like:
ActiveRecord::Base.connection.execute(<<SQL)
INSERT INTO some_table (a_column, another_column)
VALUES #{some_data.map { |num,str| "(#{num},'#{str}')" }.join(',')};
SQL
You said that you need to insert data into 2 different tables. That's not a problem; just accumulate the data for each table in a different array, and execute 2 dynamic queries, perhaps inside a transaction. 2 queries will be much faster than 9000.
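For example, a minimal sketch (companies_sql and schemes_sql stand for the two dynamic INSERT strings built as shown above):

# both bulk INSERTs succeed or fail together
ActiveRecord::Base.transaction do
  ActiveRecord::Base.connection.execute(companies_sql)
  ActiveRecord::Base.connection.execute(schemes_sql)
end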
Again, you said in the comments that you may need to update some records rather than inserting. That was also the case in the "CSV import" case which I mentioned above. The solution is only slightly more complicated:
# sometimes code speaks more eloquently than prose
require 'set'
already_imported = Set.new
MyModel.select("unique_column_which_also_appears_in_imported_files").each do |x|
  already_imported << x.unique_column_which_also_appears_in_imported_files
end
to_insert,to_update = [],[]
imported_data.each do |row|
  # for the following line, don't let different data types
  # (like String vs. Numeric) get ya
  # if you need to convert the imported data to match correctly against what's
  # already in the DB, do it!
  if already_imported.include? row[index_of_unique_column]
    to_update << row
  else
    to_insert << row
  end
end
Then you must build a dynamic INSERT and a dynamic UPDATE for each table involved. Google for UPDATE syntax if you need it, and go wild with all your favorite string processing functions!
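If the table has a suitable unique key, one MySQL-specific shortcut for the update half is INSERT ... ON DUPLICATE KEY UPDATE. A sketch with illustrative table and column names (this assumes a unique key on scheme_code):

# to_update holds [scheme_code, nav] pairs accumulated above
ActiveRecord::Base.connection.execute(<<SQL)
INSERT INTO some_table (scheme_code, nav)
VALUES #{to_update.map { |code, nav| "(#{code},#{nav})" }.join(',')}
ON DUPLICATE KEY UPDATE nav = VALUES(nav);
SQL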
Going back to the sample code above, note the difference between numeric and string fields. If it is possible that the strings contain single quotes, you will have to make sure that all the single quotes are escaped. The behavior of String#gsub may surprise you when you try to do this: it assigns a special meaning to \'. The best way I have found so far to escape single quotes is: string.gsub("'") { "\\'" }. Perhaps other posters know a better way.
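A quick illustration of why the block form is needed:

# in a replacement *string*, \' means "the text after the match" (the post-match)
"O'Brien".gsub("'", "\\'")      # => "OBrienBrien" (not what we want)
"O'Brien".gsub("'") { "\\'" }   # => "O\\'Brien"   (a literal backslash before the quote)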
If you are inserting dates, make sure they are converted to MySQL's date syntax.
Yes, I know that "roll-your-own" SQL sanitization is very iffy. There may even be security bugs with the above approach; if so, I hope my better-informed peers will set me straight. But the performance gains are just too great to ignore. Again, if this can be done using a prepared query with placeholders, and you know how, please post!
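One safer option I do know of: ActiveRecord's connection can quote string values for you, which avoids hand-rolling the escaping. A sketch, reusing the some_data example from above:

values = some_data.map do |num, str|
  "(#{num.to_i},#{ActiveRecord::Base.connection.quote(str)})"   # quote adds the surrounding '' and escapes
end.join(',')
# then interpolate `values` into the INSERT statement as before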
Looking at your code, it seems you are inserting the data using a stored procedure (mfmodel.populate). Even if you do want to use a stored procedure for this, why is dbh.prepare inside the loop? You should be able to move that line outside of lines.each.
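A sketch of just that change, reusing your DBI handle:

sth = dbh.prepare "call mfmodel.populate(?,?,?,?,?,?,?)"   # prepare once, outside the loop
lines.each do |line|
  # ... same parsing as in your code ...
  sth.execute curCompName, curSubType, curType, sCode, isin_div, isin_re, sName
end
sth.finish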
You might want to try exporting the data as csv and loading it with 'load data infile... replace'. It seems cleaner/easier than trying to construct bulk insert queries.
I'm wondering if anyone has any clarification on the difference between the following statements, using the sqlite3 gem with Ruby 1.9.x:
#db.execute("INSERT INTO table(a,b,c) VALUES (?,?,?)",
some_int, other_int, some_string)
and
#db.execute("INSERT INTO table(a,b,c) VALUES (#{some_int},"+
+"#{some_int}, #{some_string})")
My problem is: When I use the first method for insertion, I can't query for the "c" column using the following statement:
SELECT * FROM table WHERE c='some magic value'
I can use this:
"SELECT * FROM table WHERE c=?", "some magic value"
but what I really want to use is
"SELECT * FROM table WHERE c IN ('#{options.join("','")}')"
And this doesn't work with the first style of insert (the one using ? placeholders).
Does anyone know what the difference is at the database level that is preventing the IN from working properly?
I figured this out quite a while ago, but forgot to come back and point it out, in case someone finds this question at another time.
The difference turns out to be blobs. Apparently, when you use the first form above (the substitution method using (?,?)), SQLite3 uses blobs to enter the data. However, if you construct an ordinary SQL statement, the value is inserted as a regular string, and the two aren't equivalent.
Inserting with a raw query did not work for me, although a raw query does work when fetching data.
When SQLite is used in a mobile app the raw query does not work, but the same raw query works when you run it in SQLite Browser.
I am writing a simple program to insert rows into a table, but when I started writing the program a doubt came up: my program will sometimes receive duplicate input, and in that case I have to notify the user that the row already exists.
Which of the following approaches is better for achieving this?
Directly perform the INSERT statement; if the row is a duplicate I get a primary key violation error and notify the user, otherwise it is inserted. One query to perform.
First search for the primary key value. If a value is found, prompt the user; otherwise perform the insert operation. For a non-duplicate row this approach takes 2 queries.
Please let me know the trade-offs between these approaches. Which one is best to follow?
Regards,
Sunny.
I would choose the 2nd approach.
The first one would cause an exception to be thrown, which is known to be very expensive...
The 2nd approach would use a SELECT count(*) FROM mytable WHERE key = userinput, which will be very fast, followed by the INSERT statement; for both you can use the same DB connection object (assuming OO ;) ).
Using prepared statements will pre-optimize the queries, and I think that will make the 2nd approach much better and more flexible than the first one.
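A rough sketch of approach 2 with prepared statements, written in Ruby/DBI purely as an assumption (adapt to your own language and driver; table, column and variable names are illustrative):

check_sth  = dbh.prepare("SELECT COUNT(*) FROM mytable WHERE id = ?")
insert_sth = dbh.prepare("INSERT INTO mytable (id, name) VALUES (?, ?)")

check_sth.execute(user_id)
if check_sth.fetch[0].to_i > 0
  puts "A row with this key already exists"   # notify the user
else
  insert_sth.execute(user_id, user_name)
end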
EDIT: depending on your DBMS you can also use an IF NOT EXISTS clause.
EDIT2: I think Java would throw a SQLException no matter what went wrong, i.e. with the 1st approach you wouldn't be able to differentiate between a duplicate entry and an unavailable database without parsing the error message - which is again a point in favor of SELECT+INSERT (or IF NOT EXISTS).