I have a hypothetical situation that I'd like to solve, but I can't find the ideal answer. Suppose you have a huge data set that could be returned from a query: how do you paginate it so that the impact on memory is minimal? Use the datoms API, iterating over the datoms and filtering one by one? Use the index-range API, where I would have to do the same thing as with the datoms API (iterate over the items and filter one by one)? Or perform an initial query that returns only ids, and then paginate those ids so that they can be used in another query to retrieve the entire data set?
In SQL you usually can define a pagination in the query itself:
SELECT col1, col2, ...
FROM ...
WHERE ...
ORDER BY ... -- this is a MUST: there must be an ORDER BY clause
-- the paging comes here
OFFSET 10 ROWS -- skip 10 rows
FETCH NEXT 10 ROWS ONLY; -- take 10 rows
There are many things to consider.
First, at the time of writing, the Datalog implementation that ships with Datomic is eager, and does not spill to disk, which means the result set of a Datalog query must fit in memory.
This does not mean that Datalog is incompatible with a large result, because you can have each Datalog query deal only with a small part of the data. For instance, you can use Datalog to compute the 'logical' part of the query (what entities to return), and the Entity API or the Pull API to (lazily) compute the 'content' part of the query (what attributes to return for each entity). Given that an Entity Id is just a Java Long (8 bytes), this can save you one or two orders of magnitude of memory footprint. Example using the Entity API:
(defn export-customers
  [db search-criteria]
  (->>
    ;; logical part - Datalog-based, eager
    (d/q '[:find [?customer ...] :in % $ ?search-criteria :where
           (customer-matches-criteria ?search-criteria ?customer)]
      (my-rules) db search-criteria)
    ;; content part - Entity API based, lazy
    (map (fn [eid]
           (let [customer (d/entity db eid)]
             (select-keys customer
               [:customer/id
                :customer/email
                :customer/firstName
                :customer/lastName
                :customer/subscription-time]))))
    ))
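The same split (an eager list of ids, then content fetched lazily one page at a time) can be sketched outside Datomic. This is a minimal Python illustration, not Datomic code: `paginate_ids`, `export_pages`, and the in-memory `CUSTOMERS` store are hypothetical names standing in for the id query and the per-entity lookup.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def paginate_ids(ids: Iterable[int], page_size: int) -> Iterator[List[int]]:
    """Yield successive pages of ids; only one page is materialized at a time."""
    it = iter(ids)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page


def export_pages(
    ids: Iterable[int],
    fetch: Callable[[int], Dict],
    page_size: int,
) -> Iterator[List[Dict]]:
    """Lazily turn each page of ids into a page of full records."""
    for page in paginate_ids(ids, page_size):
        yield [fetch(eid) for eid in page]


# Hypothetical stand-in for the database: entity id -> attributes.
CUSTOMERS = {
    eid: {"customer/id": eid, "customer/email": f"user{eid}@example.com"}
    for eid in range(1, 8)
}

pages = export_pages(sorted(CUSTOMERS), CUSTOMERS.__getitem__, page_size=3)
first_page = next(pages)  # only the first three records have been fetched so far
```

Only the id list has to fit in memory; each page of full records can be discarded before the next one is requested.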
You can complement this approach by eagerly storing the whole result in a secondary blob store, and then poll against that for pagination.
If your query logic is not too complex, you could also imagine not using Datalog at all, e.g. by using raw index access (e.g. using the Datoms API or the Index Range API) in a lazy way.
Finally, you should consider that maybe Datomic is not the right fit for servicing your analytical queries. Because change detection is trivial with Datomic, it's fairly easy to stream derived data to secondary stores that will be better equipped to compute analytical queries (e.g. ElasticSearch, Google BigQuery, PostgreSQL, etc.)
Have you seen this page: http://docs.datomic.com/query.html#memory-usage
It seems to say that all intermediate results must fit into memory. I assume this applies to the final result as well.
You might try asking over at: https://forum.datomic.com/
Side note: When Datomic returns an entity, it is a form of "lazy map" whose contents are not fully visible until you explicitly make it concrete, such as:
(let [plain-map (into {} entity-map) ]
(println plain-map))
I have a Spring Boot project where I would like to execute a specific query in a database from x different threads while preventing different threads from reading the same database entries. So far I was able to run the query in multiple threads but had no luck finding a way to "split" the read load. My code so far is as follows:
@Async
@Transactional
public CompletableFuture<List<Book>> scanDatabase() {
    final List<Book> books = booksRepository.findAllBooks();
    return CompletableFuture.completedFuture(books);
}
Any ideas on how I should approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random you can add a condition to your where clause like ... and some_value % :N = :i with :N being a parameter for the number of threads and :i being the index of the specific thread (0 based).
If you don't have a numeric field you can create one by using a hash function and apply it on some other field in order to turn it into something numeric. See your database specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be used in the condition.
You could query the number of rows in a first query and then query the right Slice using Spring Data's pagination feature.
And many more variants.
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that, you need to mark the records to be processed by a thread before actually selecting them, for example by marking them in an extra field or by using a FOR UPDATE clause in your query.
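The modulo-based split from the first variant can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` rather than Spring; the `book` table, its columns, and `fetch_partition` are assumptions made for the example, and it presumes the table does not change while the partitions are being read.

```python
import sqlite3


def fetch_partition(conn, n_workers, worker_index):
    """Return the rows assigned to one worker: those with id % n_workers == worker_index."""
    return conn.execute(
        "SELECT id, title FROM book WHERE id % ? = ? ORDER BY id",
        (n_workers, worker_index),
    ).fetchall()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO book (id, title) VALUES (?, ?)",
    [(i, f"Book {i}") for i in range(1, 11)],
)

N = 3  # number of workers
partitions = [fetch_partition(conn, N, i) for i in range(N)]
```

Every row lands in exactly one partition, so no two workers read the same entry as long as each uses a distinct index from 0 to N - 1.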
And finally there is the question if this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of threads for processing.
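A minimal sketch of that alternative, again in Python with `sqlite3` instead of Spring: one query reads the rows, and a thread pool does the per-row work. The `book` table and the `process` function are hypothetical placeholders for the real processing step.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def process(row):
    """Placeholder for the real per-row work; it never touches the database."""
    book_id, title = row
    return book_id, title.upper()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO book (id, title) VALUES (?, ?)",
    [(i, f"Book {i}") for i in range(1, 6)],
)

# A single query; the cursor is iterated by the calling thread only.
cursor = conn.execute("SELECT id, title FROM book ORDER BY id")

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map reads every row from the cursor up front and
    # returns the results in input order.
    results = list(pool.map(process, cursor))
```

Because only one thread talks to the database, there is nothing to partition and no risk of two workers receiving the same row.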
You also might want to take a look at Spring Batch which might be helpful with processing large amounts of data.
Given the following SQL, the ManufacturerIdUpperCase is the partition key, and a lower cased value is passed as a hint to direct Cosmos to the correct partition. The "boat.OwnerIdUpperCase" is an indexed property. Will Cosmos use the ownerId to narrow the scan to the subset of documents for this owner, or does the use of the other two UPPER calls require a full collection scan?
SELECT * FROM boat
WHERE boat.ManufacturerIdUpperCase = @ManufacturerId
AND UPPER(boat.Owner.Type) = UPPER(@OwnerType)
AND boat.OwnerIdUppererCase = @BoatOwnerId
AND UPPER(boat.BoatType) = UPPER(@BoatType)
I'm trying to decide if I need to maintain a lowercase copy of every property included in the various WHERE clauses, or, if I can do this for one of the remaining UPPER conversions on an indexed property that will reduce the scope of the dataset such that a scan is only required on the resulting subset, not the entire partition?
I've read the old posts like the one below, and run the SQL in the sandbox as proposed. In the simple scenario, I am seeing the same result as the author. However, my work scenario is more complex as described above.
DocumentDB: Performance impact of built-in string functions (like UPPER)
Victor, welcome to StackOverflow! I am from the Cosmos DB engineering team.
In this particular query, since all the filter predicates are intersections (ANDs), and not unions (ORs), Cosmos DB will narrow down the set of documents to evaluate and will not do a full scan. Please ensure that all the 4 fields (/ManufacturerIdUpperCase, /Owner/Type, /OwnerIdUppererCase, /BoatType) are indexed (added as part of "includedPaths" in the indexingPolicy).
For querying an sqlite table based on a list of IDs (i.e. distinct primary keys) I am using following statement (example based on the Chinook Database):
SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)
However, my actual list of IDs might become rather large (>1000). Thus, I was wondering if this approach using the IN statement is the most efficient or if there is a better/optimized way to query an sqlite table based on a list of primary keys.
If the number of elements in the IN is large enough, SQLite constructs a temporary index for them. This is likely to be more efficient than creating a temporary table manually.
The length of the IN list is limited only by the maximum length of an SQL statement, and by memory.
Because the statement you wrote does not include any instructions to SQLite about how to find the rows you want the concept of "optimizing" doesn't really exist -- there's nothing to optimize. The job of planning the best algorithm to retrieve the data belongs to the SQLite query optimizer.
Some databases do have idiosyncrasies in their query optimizers which can lead to performance issues but I wouldn't expect SQLite to have any trouble finding the correct algorithm for this simple query, even with lots of values in the IN list. I would only worry about trying to guide the query optimizer to another execution plan if and when you find that there's a performance problem.
SQLite Optimizer Overview
IN (expression-list) does use an index if available.
Beyond that, I can't glean any guarantees from it, so the following is subject to a performance measurement.
Axis 1: how to pass the expression-list
Hardcode as string. Overhead for int-to-string conversion and string-to-int parsing
bind parameters (i.e. the statement is ... WHERE CustomerID in (?,?,?,?,?,?,?,?,?,?....), which is easier to build from a predefined string than hardcoded values). Prevents int → string → int conversion, but the default limit for number of parameters is 999. This can be increased by SQLITE_LIMIT_VARIABLE_NUMBER, but might lead to excessive allocations.
Temporary table. Possibly less efficient than any of the above methods after the statement is prepared, but that doesn't help if most time is spent preparing the statement
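The bind-parameter option can be sketched as below, splitting the id list into chunks so that no statement exceeds the parameter limit. This is a minimal Python `sqlite3` sketch, not a benchmarked recommendation: `fetch_by_ids`, the trimmed `Customer` schema, and the sample rows are assumptions, and 999 is simply the default limit named above (a given SQLite build may allow more).

```python
import sqlite3

MAX_PARAMS = 999  # the default parameter limit mentioned above; builds may differ


def fetch_by_ids(conn, ids, chunk_size=MAX_PARAMS):
    """Fetch rows by primary key using IN (?, ?, ...) with at most chunk_size parameters per query."""
    ids = list(ids)
    rows = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT CustomerId, FirstName FROM Customer "
            f"WHERE CustomerId IN ({placeholders})"
        )
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows


# Hypothetical sample data loosely modelled on the Chinook Customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, FirstName TEXT)")
conn.executemany(
    "INSERT INTO Customer (CustomerId, FirstName) VALUES (?, ?)",
    [(i, f"Name{i}") for i in range(1, 3001)],
)

wanted = range(1, 2501)  # 2500 ids, so three chunked queries
found = fetch_by_ids(conn, wanted)
```

Only the placeholders are interpolated into the SQL text; the ids themselves are always passed as bound values.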
Axis 2: Statement optimization
If the same expression-list is used in multiple queries against changing CustomerIDs, one of the following may help:
reusing a prepared statement with hardcoded values (i.e. don't pass 1001 parameters)
create a temporary table for the CustomerIDs with index (so the index is created once, not on the fly for every query)
If the expression-list is different with every query, it is probably best to let SQLite do its job. The following might be an improvement
create a temp table for the expression-list
bulk-insert expression-list elements using union all
use a sub query
(from my experience with SQLite, I'd expect it to be on par or slightly worse)
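The temp-table variant can be sketched like this, for comparison when measuring. It is a minimal Python `sqlite3` sketch under assumptions: `fetch_via_temp_table` and the `wanted_ids` table are hypothetical names, and `executemany` is swapped in for the UNION ALL bulk insert mentioned above.

```python
import sqlite3


def fetch_via_temp_table(conn, ids):
    """Load the ids into an indexed temp table, then select with a sub query."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM wanted_ids")  # start clean for every expression-list
    conn.executemany(
        "INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)",
        ((i,) for i in ids),
    )
    return conn.execute(
        "SELECT CustomerId, FirstName FROM Customer "
        "WHERE CustomerId IN (SELECT id FROM wanted_ids) "
        "ORDER BY CustomerId"
    ).fetchall()


# Hypothetical sample data loosely modelled on the Chinook Customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, FirstName TEXT)")
conn.executemany(
    "INSERT INTO Customer (CustomerId, FirstName) VALUES (?, ?)",
    [(i, f"Name{i}") for i in range(1, 60)],
)

rows = fetch_via_temp_table(conn, [1, 2, 3, 8, 20, 35])
```

The statement text stays constant regardless of how many ids are loaded, so there is no parameter limit to work around.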
Axis 3: Ask Richard
the sqlite mailing list (yeah I know, that technology is even older than rotary phones!) is pretty active with often excellent advice, including from the author of SQLite. 90% chance someone will dismiss you as "Measure before asking such a question!", 10% chance someone gives you detailed insight.
Title is probably not very clear so let me explain.
I want to process an in-process join (Node.js) on 2 tables*, Session and SessionAction. (1-N)
Since these tables are rather big (millions of records both) my idea was to get slices based on an orderBy sessionId (which they both share), and sort of lock-step walk through both tables in batches.
This however proves to be awfully slow. I'm using pseudo code as follows for both the tables to get the batches:
table('x').orderBy({index: "sessionId"}).filter(row.sessionId > start && row.sessionId < y)
It seems that even though I'm essentially filtering on an attribute sessionId which has got an index, the query planner is not smart enough to see this and every query does a complete table scan to do the orderBy before filtering afterwards (or so it seems)
Of course, this is incredibly wasteful but I don't see another option. E.g.:
Order after filter is not supported by Rethink.
Getting a slice of the ordered table doesn't work either, since slice-enumeration (i.e.: the xth until the yth record), for lack of a better word, doesn't add up between the 2 tables.
Questions:
Is my approach indeed expected to be slow, due to having to do a table scan at each iteration/batch?
If so, how could I design my queries to get it working faster?
*) It's too involved to do it using Rethink Reql only.
filter is never indexed in RethinkDB. (In general a particular command will only use a secondary index if you pass index as one of its optional arguments.) You can write that query like this to avoid scanning over the whole table:
r.table('x').orderBy({index: 'sessionId'}).between(start, y, {index: 'sessionId'})
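Once both tables can be read in sessionId order, the lock-step walk itself is a merge join. The following is a minimal Python sketch of that idea, not RethinkDB or Node.js code: `merge_join` and the sample records are hypothetical, and it assumes each sessionId appears at most once among the sessions and that both inputs are already sorted by sessionId.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(sessions, actions):
    """Yield (session, [its actions]) by walking two sessionId-sorted streams in lock-step."""
    action_groups = groupby(actions, key=itemgetter("sessionId"))
    current = next(action_groups, None)  # (sessionId, iterator over its actions) or None
    for session in sessions:
        sid = session["sessionId"]
        # Skip action groups that belong to no session (smaller sessionId).
        while current is not None and current[0] < sid:
            current = next(action_groups, None)
        if current is not None and current[0] == sid:
            matched = list(current[1])  # consume the group before advancing
            current = next(action_groups, None)
        else:
            matched = []
        yield session, matched


# Hypothetical sample data, already sorted by sessionId.
sessions = [{"sessionId": 1}, {"sessionId": 2}, {"sessionId": 4}]
actions = [
    {"sessionId": 1, "type": "open"},
    {"sessionId": 1, "type": "click"},
    {"sessionId": 3, "type": "orphan"},
    {"sessionId": 4, "type": "open"},
]

joined = [(s["sessionId"], [a["type"] for a in acts])
          for s, acts in merge_join(sessions, actions)]
```

Both inputs can be generators that pull one batch at a time from the database, so memory use is bounded by the batch size plus the actions of a single session.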
I am writing a stored procedure to perform a dynamic search that spans 10+ database tables. With millions of records in each table and a dynamic set of search parameters*, I am having some trouble optimizing the procedure.
Is there a "best practice" for building these kinds of queries? E.g. use strings to build a dynamic query, use a huge list of IF THEN .. ELSE statements, etc.? Can anyone provide a simple example or point me to some literature that will help? Here's some pseudocode for the stored procedure I am developing, which accepts a collection of parameters and a ref cursor.
v_query = "SELECT .....";
v_name = ... -- retrieve "name" parameter from collection
if v_name is not null then
v_query := v_query || ' AND table.Name = ' || v_name;
end if;
open search_cursor for v_query;
...
*By "dynamic set of search parameters," I mean that I pass in a collection of parameters. I figured this would be easier than making the caller pass in 20 parameters if they only want to search on one.
There are problems with using the static query approach; also be very careful about using the CURSOR_SHARING=FORCE option - it can really raise hell with your system if you haven't done a coverage test to ensure that all your other queries will work the way you want.
Problems with static queries:
The (x is null or x = col) predicates tend to kill any chance of using indexes. Since the query plan is computed at the time query is parsed the first time, the indexes you use will be based on the values for the first run of the query; later runs, which may not constrain on the same columns, will still use the same indexes.
Having one static statement with substitution variables will prevent the optimizer from making an intelligent choice about which index to use based on the data distribution. In a dynamic query (or in the first run of a query with bind variables), Oracle will see how selective your constraint is; a highly selective constraint will become a prime candidate for index use. For example, if your table had a row for every person in the U.S., STATE='Alaska' will be much more likely to use the index on STATE than STATE='California'.
Of course, in both these cases, if the dynamic columns in your WHERE clause are not indexed anyway, it doesn't matter, although I'd be surprised if that were the case in a database the size you're talking about.
Also, consider the real cost of all that hard parsing. Yes, hard parses serialize system resources, which makes them expensive, but only in the context of high volume queries. By their nature, ad-hoc queries do not get run very often. The cost you pay for all the hard parses you incur in an entire day will likely be hundreds of times less than the cost of a single query that uses the wrong indexes.
In the past, I've implemented these systems pretty much like you've done here - a base query portion, then iterating over a constraint list and adding WHERE clause predicates. I don't think it's hard for someone to maintain or understand, especially if you're talking about constraints that don't involve adding a lot of subqueries or extra tables to the FROM clause.
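That pattern (a base query plus one predicate per supplied constraint) can be sketched as follows. This is a minimal Python `sqlite3` illustration rather than PL/SQL: `build_search`, the `person` table, and the column whitelist are assumptions. It also uses bind parameters for the values, which avoids injection but gives up the per-value plan selection that literals allow, so treat that choice as a trade-off and not as a recommendation.

```python
import sqlite3

# Hypothetical whitelist: only these column names may appear in the SQL text.
ALLOWED_COLUMNS = {"name", "state"}


def build_search(criteria):
    """Build (sql, params) from a dict of column -> value; None means 'not constrained'."""
    sql = "SELECT id, name, state FROM person WHERE 1 = 1"
    params = []
    for column, value in criteria.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported search column: {column}")
        if value is None:
            continue  # the caller did not constrain this column
        sql += f" AND {column} = ?"
        params.append(value)
    return sql, params


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO person (id, name, state) VALUES (?, ?, ?)",
    [(1, "Smith", "Alaska"), (2, "Smith", "California"), (3, "Jones", "Alaska")],
)

sql, params = build_search({"name": "Smith", "state": None})
matches = conn.execute(sql + " ORDER BY id", params).fetchall()
```

Each distinct combination of constrained columns produces its own statement text, so the database can plan each shape separately.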
One thing to consider: If this system is primarily an offline one (in other words, not constantly being updated or inserted into - populated by periodic loads of bulk data), you may want to look into using BITMAP indexes. Bitmap indexes differ from regular b-tree indexes in that multiple indexes on a single table can be used simultaneously, and bitmap indexes are much, much smaller on disk than b-trees. They work very well for applications like this - where you will have a variety of constraints that can't be defined at design time. You will only want to put bitmap indexes on columns that have relatively few distinct values - say, one value constitutes no less than 1/1000 of the table - so don't use bitmaps on unique columns.
However, the downside is that bitmap indexes will noticeably degrade the performance of inserts and updates. The best practice for bitmaps is to use them in data warehouse applications, and they are dropped prior to loads and recreated afterwards.
Except in very particular cases, I don't think it is advisable (or even possible) to try to generate an optimized query. My advice is not to use dynamic SQL if you can avoid it: it is hard to read, hard to debug, hard to optimize, and hard to maintain.
First, write a generic query that will work with any parameter sent to your procedure. According to your example, that would give something like:
SELECT * FROM table WHERE ((v_name IS NULL) OR (table.Name=v_name));
As you see, you could easily add other parameters to this query without using dynamic SQL. This query is much easier to read and debug. Ask your DBA for optimization tips.
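For comparison, the generic-query approach can be exercised like this. It is a minimal Python `sqlite3` sketch, not Oracle code: the `person` table, `GENERIC_QUERY`, and `search` are hypothetical, and named parameters stand in for the procedure's variables.

```python
import sqlite3

# One static statement; a NULL parameter switches its predicate off.
GENERIC_QUERY = """
SELECT id, name, state
FROM person
WHERE (:name IS NULL OR name = :name)
  AND (:state IS NULL OR state = :state)
ORDER BY id
"""


def search(conn, name=None, state=None):
    """Run the same statement for any combination of supplied parameters."""
    return conn.execute(GENERIC_QUERY, {"name": name, "state": state}).fetchall()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO person (id, name, state) VALUES (?, ?, ?)",
    [(1, "Smith", "Alaska"), (2, "Smith", "California"), (3, "Jones", "Alaska")],
)

by_name = search(conn, name="Smith")
by_both = search(conn, name="Smith", state="Alaska")
everyone = search(conn)
```

The statement text never changes, which is what makes it easy to read and debug; whether the optimizer can still use indexes for it is a separate question to measure on the real database.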
Then, if you have a particular set of parameters that you know are often passed together, you could write a particular query for this set that you could specifically optimize. Pseudocode:
IF particular_set
THEN
/* Specific query */
ELSE
/* Generic query */
END IF;
The difficult part is to try not to have too many specific queries here, or you could fall into a maintenance hell.
We've had a similar requirement for one of our clients. They have half a dozen tables with millions of rows, and they wanted ad hoc search capability on most of the columns.
The solution was a separate package for each table, which would take the search criteria and construct the SQL to run the search. We took advantage of the old system that was being replaced, to discover what the most common types of searches the users were doing, and made sure that those searches ran the best, by tuning the queries that were being generated (supported by the strategic use of indexes). Because each package was only responsible for queries against one table, it could have specific code designed to work with that table (including the odd hint, in a few rare cases).
One question/problem that you'll need to address is, do you hard-code the criteria (e.g. WHERE SURNAME='SMITH') or use bind variables? Using bind variables reduces hard parsing, which reduces load on the database server; however it can be impractical to use bind variables when the SQL is dynamically generated. The way we ended up going was to set CURSOR_SHARING=FORCE (which has its own disadvantages) which was a reasonable compromise in our case.
Read http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6711305251199