Best DBI method to select 30,000 rows into an array - Oracle

I am new to Perl and getting to grips with it.
This is a two-part question.
Question 1:
The Perl DBI has several methods to fetch data from the database tables.
I am returning one column of data from an Oracle DB into an array, with each row's value in its own element; we are talking about roughly 30,000 rows of data.
Here is my code:
@arr_oracle_rs = @{ $dbh->selectcol_arrayref($oracle_select) };
I was wondering if this is the fastest way to do this, or should I use another DBI method?
It's hard to tell, because of network latency, current DB load, etc., which method is the quickest and most efficient way to get the data into an array.
I am just wondering if people with the knowledge can tell me whether I am using the correct method for such a task.
https://metacpan.org/pod/DBI#Database-Handle-Methods
selectrow_array()
selectrow_arrayref()
selectrow_hashref()
selectall_arrayref()
selectall_hashref()
selectcol_arrayref()
Question 2:
What is the best method to determine if the select query above ran successfully?

I'm assuming you are using DBD::Oracle.
Don't try to second-guess what is happening under the hood. For a start, by default DBD::Oracle fetches multiple rows in one go (see https://metacpan.org/pod/DBD::Oracle#RowCacheSize).
Secondly, in your example you already have an array and you copy it to another array, which is a waste of time and memory: selectcol_arrayref returns a reference to an array, and you dereference it and then copy it into @arr_oracle_rs. Just use the array ref returned.
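For instance, a minimal sketch reusing $dbh and $oracle_select from the question:
my $arr_oracle_rs = $dbh->selectcol_arrayref($oracle_select);
print "fetched ", scalar @$arr_oracle_rs, " values\n";
for my $value (@$arr_oracle_rs) {
    # work with $value here; no second 30,000-element array is built
}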
Thirdly, we cannot say what is quickest, since you've not told us how you are going to work with the returned array. Depending on what you are doing with it, it may actually be quicker to bind the column, repeatedly call fetch, and do whatever you need per row (requires less memory and no repeated creation of scalars), or it may be quicker to get all the rows in one go (requires more memory and lots of scalar creation).
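A rough sketch of that bind-and-fetch style, assuming the same $dbh and the single-column statement in $oracle_select (any other names are made up here):
my $sth = $dbh->prepare($oracle_select);
$sth->execute;
my $value;
$sth->bind_col(1, \$value);        # DBI writes column 1 of each row into $value
while ($sth->fetch) {
    # process $value for this row; nothing is accumulated in memory
}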
I haven't actually looked at how selectcol_arrayref works, but as it has to pick the first column from each row it /might/ be just as well to use selectall_arrayref, if you end up using a selectall method at all.
As in all these things you'll have to benchmark your solutions yourself.
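If you want a starting point for that benchmarking, Perl's core Benchmark module can compare the two styles directly. A sketch, reusing $dbh and $oracle_select; the iteration count of 10 is arbitrary, and with a remote Oracle instance the timings will mostly reflect network and fetch I/O, as the question itself notes:
use Benchmark qw(cmpthese);

cmpthese(10, {
    selectcol  => sub { my $rows = $dbh->selectcol_arrayref($oracle_select) },
    bind_fetch => sub {
        my $sth = $dbh->prepare_cached($oracle_select);
        $sth->execute;
        $sth->bind_col(1, \my $value);
        1 while $sth->fetch;
    },
});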
As mpapec said, RaiseError is your friend.
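Concretely, for question 2: connect with RaiseError enabled and wrap the select in an eval; if no exception is thrown, the query ran successfully. A sketch, with the connection details and the query as obvious placeholders:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:MYDB', 'username', 'password',
                       { RaiseError => 1, PrintError => 0 });

my $oracle_select = 'SELECT some_col FROM some_table';   # placeholder for the real query

my $rows = eval { $dbh->selectcol_arrayref($oracle_select) };
if ($@) {
    die "select failed: $@";    # or log and recover, as appropriate
}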

Related

Efficient sqlite query based on list of primary keys

For querying an sqlite table based on a list of IDs (i.e. distinct primary keys), I am using the following statement (example based on the Chinook Database):
SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)
However, my actual list of IDs might become rather large (>1000). Thus, I was wondering if this approach using the IN statement is the most efficient or if there is a better/optimized way to query an sqlite table based on a list of primary keys.
If the number of elements in the IN is large enough, SQLite constructs a temporary index for them. This is likely to be more efficient than creating a temporary table manually.
The length of the IN list is limited only by the maximum length of an SQL statement, and by memory.
Because the statement you wrote does not include any instructions to SQLite about how to find the rows you want, the concept of "optimizing" doesn't really exist -- there's nothing to optimize. The job of planning the best algorithm to retrieve the data belongs to the SQLite query optimizer.
Some databases do have idiosyncrasies in their query optimizers which can lead to performance issues but I wouldn't expect SQLite to have any trouble finding the correct algorithm for this simple query, even with lots of values in the IN list. I would only worry about trying to guide the query optimizer to another execution plan if and when you find that there's a performance problem.
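If you do want to see what the optimizer decided, EXPLAIN QUERY PLAN will show you. A sketch using Perl/DBI (the language of the parent question) with DBD::SQLite against the example table; the database file name is made up:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=chinook.db', '', '', { RaiseError => 1 });

my $plan = $dbh->selectall_arrayref(
    'EXPLAIN QUERY PLAN SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)'
);
print "$_->[-1]\n" for @$plan;   # the last column holds the human-readable plan,
                                 # e.g. a SEARCH on the integer primary key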
SQLite Optimizer Overview
IN (expression-list) does use an index if available.
Beyond that, I can't glean any guarantees from it, so the following is subject to a performance measurement.
Axis 1: how to pass the expression-list
Hardcode the values as a string. Overhead for int-to-string conversion and string-to-int parsing.
Bind parameters (i.e. the statement is ... WHERE CustomerID in (?,?,?,?,?,?,?,?,?,?....), which is easier to build from a predefined string than hardcoded values; see the sketch after this list). Prevents the int → string → int conversion, but the default limit for the number of parameters is 999. This can be raised via SQLITE_LIMIT_VARIABLE_NUMBER, but might lead to excessive allocations.
Temporary table. Possibly less efficient than either of the above once the statement is prepared, but that doesn't help if most of the time is spent preparing the statement rather than executing it.
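A sketch of the bind-parameter variant, in Perl/DBI (the language of the parent question) with DBD::SQLite; the database file name is made up:
use strict;
use warnings;
use DBI;

my @ids = (1, 2, 3, 8, 20, 35);
my $dbh = DBI->connect('dbi:SQLite:dbname=chinook.db', '', '', { RaiseError => 1 });

# one '?' per ID, so the values are bound as integers rather than parsed from text
my $placeholders = join ',', ('?') x @ids;
my $rows = $dbh->selectall_arrayref(
    "SELECT * FROM Customer WHERE CustomerId IN ($placeholders)",
    { Slice => {} },
    @ids,
);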
Axis 2: Statement optimization
If the same expression-list is used in multiple queries against changing CustomerIDs, one of the following may help:
reusing a prepared statement with hardcoded values (i.e. don't pass 1001 parameters)
create a temporary table for the CustomerIDs with index (so the index is created once, not on the fly for every query)
If the expression-list is different with every query, it is probably best to let SQLite do its job. The following might be an improvement:
create a temp table for the expression-list
bulk-insert expression-list elements using union all
use a sub query
(from my experience with SQLite, I'd expect it to be on par or slightly worse)
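For completeness, a rough sketch of the temp-table route, reusing $dbh and @ids from the earlier sketch (the temp table name wanted_ids is made up). Whether this beats a plain IN list is exactly the kind of thing to measure:
$dbh->do('CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)');

my $insert = $dbh->prepare('INSERT INTO wanted_ids (id) VALUES (?)');
$dbh->begin_work;                      # one transaction keeps the inserts cheap
$insert->execute($_) for @ids;
$dbh->commit;

my $rows = $dbh->selectall_arrayref(
    'SELECT c.* FROM Customer c JOIN wanted_ids w ON w.id = c.CustomerId',
    { Slice => {} },
);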
Axis 3: Ask Richard
The sqlite mailing list (yeah I know, that technology is even older than rotary phones!) is pretty active, with often excellent advice, including from the author of SQLite. There's a 90% chance someone will dismiss you with "Measure before asking such a question!", and a 10% chance someone gives you detailed insight.

ActiveRecord in batches? after_commit produces O(n) trouble

I'm looking for a good idiomatic rails pattern or gem to handle the problem of inefficient after_commit model callbacks. We want to stick with a callback to guarantee data integrity but we'd like it to run once whether it's for one record or for a whole batch of records wrapped in a transaction.
Here's a use-case:
A Portfolio has many positions.
On Position there is an after_commit hook to re-calculate numbers in reference to its sibling positions in the portfolio.
That's fine for directly editing one position.
However...
We now have an importer for bringing in lots of positions spanning many portfolios in one big INSERT. So each invocation of this callback queries all siblings, and it's invoked once for each sibling - reads are O(n**2) instead of O(n), and writes are O(n) where they should be O(1).
'Why not just put the callback on the parent portfolio?' Because the parent doesn't necessarily get touched during a relevant update. We can't risk the kind of inconsistent state that could result from leaving a gap like that.
Is there anything out there which can leverage the fact that we're committing all the records at once in a transaction? In principle it shouldn't be too hard to figure out which records changed.
A nice interface would be something like after_batch_commit which might provide a light object with all the changed data or at least the ids of affected rows.
There are lots of unrelated parts of our app that are asking for a solution like this.
One solution could be inserting them all in one SQL statement then validating them afterwards.
Possible ways of inserting them in a single statement are suggested in this post:
INSERT multiple records using ruby on rails active record
Or you could even build the sql to insert all the records in one trip to the database.
The code could look something like this:
max_id = Position.maximum(:id)
Position.insert_many(data) # not actual code
faulty_positions = Position.where("id > ?", max_id).reject(&:valid?)
remove_and_or_log_faulty_positions(faulty_positions)
This way you only have to touch the database three times per N entries in your data. If the data sets are large, it might be good to do it in batches, as you mention.

Oracle PL/SQL: choosing the update/merge column dynamically

I have a table with data relating to several moments in time that I have to keep updated. To save space and time, however, each row in my table refers to a given day, and the hourly and quarter-hourly data for that day are scattered throughout the several columns in that same row. When updating the data for a particular moment in time I therefore must choose the column that has to be updated through some programming logic in my PL/SQL procedures and functions.
Is there a way to dynamically choose the column or columns involved in an update/merge operation without having to assemble the query string anew every time? Performance is a concern and the throughput must be high, so I can't do anything that would perform poorly.
Edit: I am aware of the normalization issues. However, I would still like to know a good way of choosing the columns to be updated/merged dynamically and programmatically.
The only way to dynamically choose which column or columns to use in a DML statement is to use dynamic SQL. And the only way to use dynamic SQL is to generate a SQL statement that can then be prepared and executed. Of course, you can assemble the string in a more or less efficient manner, you can potentially parse the statement once and execute it multiple times, etc., in order to minimize the expense of using dynamic SQL. But getting dynamic SQL to perform close to what you'd get with static SQL requires quite a bit more work.
I'd echo Ben's point-- it doesn't appear that you are saving time by structuring your table this way. You'll likely get much better performance by normalizing the table properly. I'm not sure what space you believe you are saving but I would tend to doubt that denormalizing your table structure is going to save you much if anything in terms of space.
One way to do what is required is to create a package with all possible updates (there aren't that many, as I'll only update one field at a given time) and then choose which query to use depending on my internal logic. This would, however, lead to a big if/else or switch/case-like statement. Is there a way to achieve similar results with better performance?

Best-performing method for associating arbitrary key/value pairs with a table row in a Postgres DB?

I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
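To make that concrete, a small sketch of hstore in use; since the parent question here is Perl, it's shown through DBD::Pg, but the SQL is the interesting part. The table and column names (items, attrs) are invented, and the hstore contrib module is assumed to be installed in the database already:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'secret', { RaiseError => 1 });

$dbh->do('CREATE TABLE items (id serial PRIMARY KEY, attrs hstore)');
$dbh->do('CREATE INDEX items_attrs_idx ON items USING gin (attrs)');

$dbh->do(q{INSERT INTO items (attrs) VALUES ('color => "red", size => "XL"')});

# exist() is the function form of the hstore ? operator (a literal ? would be
# taken as a DBI placeholder); -> extracts a single value for a key.
my $rows = $dbh->selectall_arrayref(
    q{SELECT id, attrs -> 'color' AS color FROM items WHERE exist(attrs, 'color')},
    { Slice => {} },
);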
hstore is generally a good solution for that, but personally I prefer to use plain key:value tables: one table with definitions, another table with values, a relation to bind each value to its definition, and a relation to bind each value to the particular record in the other table.
Why am I against hstore? Because it's like the registry pattern, often mentioned as an example of an anti-pattern. You can put anything in there, it's hard to easily validate whether it's still needed, and when loading a whole row (especially through an ORM) the whole hstore is loaded, which can contain a lot of junk and very little of use. Not to mention that the hstore data type has to be converted into your language's types and back again when saving, so you get some type-conversion overhead.
So I'm actually trying to convert all the hstores at the company I'm working for into simple key:value tables. It's not that hard a task, though, because the structures kept in hstore here are huge (or at least big), and reading/writing such an object creates a huge overhead of function calls. That makes even a simple query like "select * from base_product where id = 1;" make the server sweat and hurts performance badly. I want to point out that the performance issue is not because of the DB, but because Python has to convert the results received from Postgres several times, while key:value tables don't require such conversion.
Since you do not control the data, do not try to overcomplicate this.
create table sometable_attributes (
  sometable_id int not null references sometable(sometable_id),
  attribute_key varchar(50) not null check (length(attribute_key) > 0),
  attribute_value varchar(5000) not null,
  primary key (sometable_id, attribute_key)
);
This is like EAV, but without the attribute_keys table, which has no added value if you do not control what will be there.
For speed you should periodically do "cluster sometable_attributes using sometable_attributes_idx", so all attributes for one row will be physically close.
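And the read path is a single indexed lookup; a sketch in Perl/DBI (the language of the parent question), assuming $dbh is a DBD::Pg handle and $sometable_id holds the row of interest:
my $pairs = $dbh->selectall_arrayref(
    'SELECT attribute_key, attribute_value
       FROM sometable_attributes
      WHERE sometable_id = ?',
    undef, $sometable_id,
);
my %attrs = map { $_->[0] => $_->[1] } @$pairs;   # key => value for this row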

How can I sort by a transformable attribute in an NSFetchedResultsController?

I'm using NSValueTransformers to encrypt attributes (strings, dates, etc.) in my Core Data model, but I'm pretty sure it's interfering with the sorting in my NSFetchedResultsController.
Does anyone know if there's a way to get around this? I suppose it depends on how the sort is performed; if it's always only performed directly on the database, then I'm probably out of luck. If it sorts on the objects themselves, then perhaps there's a way to activate the transformation before the sort occurs.
I'm guessing it's directly on the database, though, since the sort would be key in grabbing subsets of the collection, which is the main benefit of NSFetchedResultsController anyway.
Note: I should add that there's some strange behavior here... the collection doesn't sort in the first session (the session where the objects are created), but it does sort in subsequent sessions (where the objects already exist and are just being retrieved). So perhaps sorting does work with transformables, but maybe there is a caveat in that they have to be saved first or something like that (?)
If you are sorting within the NSFetchedResultsController then it is against the store (i.e. database). However, you can perform a "secondary" sort against the results when they are in memory and therefore decrypted by calling -sortedArrayUsingDescriptors:
update
I believe your inconsistent behavior is probably based on what is already in memory vs. what is being read directly from disk.
