How to organize work pool based on PostgreSQL table? - performance

Suppose I have a big table in PostgreSQL (more than 500 GB) that serves as a work pool. I also have a number of worker processes that fetch work items from it.
What is the most efficient way to implement a manager that returns the next row from the
'work pool' table in response to worker requests? Maybe some kind of cursor, iterator, or something else?
UPD: I forgot one key thing: the table is constant. No INSERT or UPDATE operations are allowed. We only read from it.
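Because the table never changes, one possible approach is for the manager to hand out non-overlapping key ranges instead of individual rows, so each worker can read its own range with a plain range query. The following is only a minimal sketch under assumptions that are not in the question: the table is assumed to have an integer key named `id`, and the class name `RangeDispatcher`, the table name `work_pool`, and the batch size are hypothetical.

```python
import threading
from typing import Optional, Tuple


class RangeDispatcher:
    """Hands out non-overlapping inclusive (low, high) key ranges to workers.

    Assumes a read-only table with an integer key whose minimum and maximum
    were read once up front (for example with SELECT min(id), max(id)).
    """

    def __init__(self, min_id: int, max_id: int, batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._next = min_id
        self._max = max_id
        self._batch = batch_size
        self._lock = threading.Lock()  # makes next_range() safe across threads

    def next_range(self) -> Optional[Tuple[int, int]]:
        """Return the next inclusive (low, high) range, or None when exhausted."""
        with self._lock:
            if self._next > self._max:
                return None
            low = self._next
            high = min(low + self._batch - 1, self._max)
            self._next = high + 1
            return low, high


# A worker would then run something like (hypothetical table name):
#   SELECT * FROM work_pool WHERE id BETWEEN %s AND %s ORDER BY id
dispatcher = RangeDispatcher(min_id=1, max_id=25, batch_size=10)
ranges = []
while (r := dispatcher.next_range()) is not None:
    ranges.append(r)
print(ranges)
```

Handing out ranges keeps the manager's state down to a single counter and avoids holding a long-lived cursor open; if the key has large gaps, some ranges will simply contain fewer rows.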

PgQ may or may not be suitable for this problem. It covers similar problem areas, so have a look.

I wanted to be redirected to this and this. Thanks to http://habrahabr.ru/qa/22030/ and users ToSHiC and strib.

Related

Dropping a table partition avoiding the error ORA-00054

I need your opinion in this situation. I’ll try to explain the scenario. I have a Windows service that stores data in an Oracle database periodically. The table where this data is being stored is partitioned by date (Interval-Date Range Partitioning). The database also has a dbms_scheduler job that, among other operations, truncates and drops older partitions.
This approach has been working for some time, but recently I had an ORA-00054 error. After some investigation, the error was reproduced with the following steps:
1. Open one sqlplus session, disable auto-commit, and insert data into the partitioned table without committing the changes;
2. Open another sqlplus session and truncate/drop an old partition (DDL operations are automatically committed, if I’m not mistaken). We will then get the ORA-00054 error.
There are some constraints worth mentioning:
I don’t have DBA access to the database;
This is a legacy application and a complete refactoring isn’t feasible;
So, in your opinion, is there any way of dropping these old partitions without the risk of running into an ORA-00054 error and without the intervention of the DBA? I can just delete the data, but the number of empty partitions will grow every day.
Many thanks in advance.
This error means somebody (or something) is working with the data in the partition you are trying to drop. That is, the lock is granted at the partition level. If nobody was using the partition your job could drop it.
Now you say this is a legacy app and you don't want to, or can't, refactor it. Fair enough. But there is clearly something not right if you have a process which is zapping data that some other process is using. I don't agree with @tbone's suggestion of just looping until the lock is released: you can't just get rid of data which somebody is using without establishing why they are still working with data that they apparently should not be using.
So, the first step is to find out what the locking session is doing. Why are they still amending this data your background job wants to retire? Here's a script which will help you establish which session has the lock.
Except that you "don't have DBA access to the database". Hmmm, that's a curly one. Basically this is not a problem which can be resolved without DBA access.
It seems like you have several issues to deal with. Unfortunately for you, they are political and architectural rather than technical, and there's not much we can do to help you further.
How about wrapping the truncate or drop in pl/sql that tries the operation in a loop, waiting x seconds between tries, for a max num of tries. Then use dbms_scheduler to call that procedure/function.
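The control flow of that retry idea can be sketched as follows. This is an illustration only, written in Python rather than PL/SQL; `ResourceBusy` is a hypothetical stand-in for ORA-00054, and `drop_partition` is a fake operation that fails twice before succeeding. It does not address why another session holds the lock in the first place.

```python
import time
from typing import Callable


class ResourceBusy(Exception):
    """Hypothetical stand-in for ORA-00054 (resource busy)."""


def retry_ddl(operation: Callable[[], None], max_tries: int,
              wait_seconds: float,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run operation, retrying on ResourceBusy.

    Returns the number of attempts used; re-raises ResourceBusy if the
    last allowed attempt still fails.
    """
    for attempt in range(1, max_tries + 1):
        try:
            operation()
            return attempt
        except ResourceBusy:
            if attempt == max_tries:
                raise
            sleep(wait_seconds)  # wait before the next try
    raise ValueError("max_tries must be at least 1")


# Fake operation for demonstration: locked on the first two attempts.
calls = {"n": 0}


def drop_partition() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ResourceBusy("partition is locked")


attempts = retry_ddl(drop_partition, max_tries=5, wait_seconds=0.0)
print(attempts)
```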
Maybe this can help. It seems to be the same issue as the one you describe.
(ignore the comic sans, if you can) :)

One data store. Multiple processes. Will this SQL prevent race conditions?

I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
   SET status = 'processed'
 WHERE id = (
     SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
 ) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which table row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process who has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that thread rather than making your database do it, and you are assured that only one process accesses the database at any time.
The situation you are describing is called "Non-repeatable read". There are two ways to solve this.
The preferred way would be to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be canceled, its changes will be discarded, and it will return an error. That transaction will have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this for Ruby; you may have to emit that SQL explicitly.
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.

Can I substitute savepoints for starting new transactions in Oracle?

Right now the process that we're using for inserting sets of records is something like this:
(and note that "set of records" means something like a person's record along with their addresses, phone numbers, or any other joined tables).
1. Start a transaction.
2. Insert a set of records that are related.
3. Commit if everything was successful, roll back otherwise.
4. Go back to step 1 for the next set of records.
Should we be doing something more like this?
1. Start a transaction at the beginning of the script.
2. Start a savepoint for each set of records.
3. Insert a set of related records.
4. Roll back to the savepoint if there is an error, go on if everything is successful.
5. Commit the transaction at the end of the script.
After having some issues with ORA-01555 and reading a few Ask Tom articles (like this one), I'm thinking about trying out the second process. Of course, as Tom points out, starting a new transaction is something that should be defined by business needs. Is the second process worth trying out, or is it a bad idea?
A transaction should be a meaningful Unit Of Work. But what constitutes a Unit Of Work depends upon context. In an OLTP system a Unit Of Work would be a single Person, along with their address information, etc. But it sounds as if you are implementing some form of batch processing, which is loading lots of Persons.
If you are having problems with ORA-1555 it is almost certainly because you have a long-running query supplying data which is being updated by other transactions. Committing inside your loop contributes to the cyclical use of UNDO segments, and so will tend to increase the likelihood that the segments you are relying on to provide read consistency will have been reused. So, not doing that is probably a good idea.
Whether using SAVEPOINTs is the solution is a different matter. I'm not sure what advantage that would give you in your situation. As you are working with Oracle10g perhaps you should consider using bulk DML error logging instead.
Alternatively you might wish to rewrite the driving query so that it works with smaller chunks of data. Without knowing more about the specifics of your process I can't give specific advice. But in general, instead of opening one cursor for 10000 records it might be better to open it twenty times for 500 rows a pop. The other thing to consider is whether the insertion process can be made more efficient, say by using bulk collection and FORALL.
Some thoughts...
Seems to me one of the points of the asktom link was to size your rollback/undo appropriately to avoid the 1555's. Is there some reason this is not possible? As he points out, it's far cheaper to buy disk than it is to write/maintain code to handle getting around rollback limitations (although I had to do a double-take after reading the $250 pricetag for a 36Gb drive - that thread started in 2002! Good illustration of Moore's Law!)
This link (Burleson) shows one possible issue with savepoints.
Is your transaction in actuality steps 2,3, and 5 in your second scenario? If so, that's what I'd do - commit each transaction. Sounds a bit to me like scenario 1 is a collection of transactions rolled into one?

How does Facebook do it?

Have you ever noticed how Facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list and the list of users who “liked this” and comparing them is efficient at all! Do they keep track of this in the database? That would make the database huge.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes they definitely keep it in their database as they definitely have more than 1 server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index lookup is O(log n): about 20 operations in the worst case (log2 of 1,000,000 is roughly 20) to find what you need.
For 2 million, you need only 21 operations in the worst case.
Every time you double the number of users to go through, you need only 1 more operation (in the worst case) with an O(log n) index.
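The arithmetic above can be checked with a few lines of Python. This is a simplified model that treats the index as a binary search over sorted keys; real database indexes are typically B-trees with a much higher fan-out, so the actual number of page reads would be smaller. The function name is hypothetical.

```python
import math


def worst_case_steps(n_rows: int) -> int:
    """Approximate worst-case halving steps for a binary search over n_rows keys."""
    return math.ceil(math.log2(n_rows))


for n in (1_000_000, 2_000_000, 4_000_000):
    print(n, worst_case_steps(n))
```

Each doubling of the row count adds exactly one step in this model, which is the point the answer is making.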
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger (which is executed automatically as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status; it will then be executed implicitly (automatically) every time someone likes your status.
This makes the operation very easy, and Facebook doesn't have to manually update the database or store a huge database for this. Also, this approach is faster.
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at the time of page load to count how many friends and others 'liked this'; when someone 'likes' something, that is recorded on the object, and when retrieving the object one can compare with the current user's friend list. If someone updates their profile, makes a comment, etc., it sends notification objects to friends, which are pre-cached in their feeds. This cuts down tremendously on database work at the expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them is your friend. Voila.
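The matching step described here amounts to a set intersection. A minimal sketch, assuming both the list of likers and the viewer's friend list are available as in-memory sets of user identifiers (the function name and the sample data are made up for illustration, and nothing here is claimed to be how Facebook actually implements it):

```python
from typing import Set


def like_summary(likers: Set[str], friends: Set[str]) -> str:
    """Split the likers of an entry into the viewer's friends and everyone else."""
    friend_likes = len(likers & friends)       # likers who are also friends
    other_likes = len(likers) - friend_likes   # all remaining likers
    return f"{friend_likes} friends and {other_likes} others liked this"


# Sample data: 36 users liked the entry; three of them are the viewer's friends.
likers = {f"user{i}" for i in range(36)}
friends = {"user0", "user1", "user2", "user99"}
print(like_summary(likers, friends))
```

The intersection costs roughly time proportional to the smaller of the two sets, so it stays cheap as long as one side (usually the friend list) is small.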
A lot of this is explained by the Director of Engineering of Facebook in this QCon presentation:
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.....

Exclusive table (read) lock on Oracle 10g?

Is there a way to exclusively lock a table for reading in Oracle (10g) ? I am not very familiar with Oracle, so I asked the DBA and he said it's impossible to lock a table for reading in Oracle?
I am actually looking for something like the SQL Server (TABLOCKX HOLDLOCK) hints.
EDIT:
In response to some of the answers: the reason I need to lock a table for reading is to implement a queue that can be read by multiple clients, but it should be impossible for 2 clients to read the same record. So what actually happens is:
Lock table
Read next item in queue
Remove item from the queue
Remove table lock
Maybe there's another way of doing this (more efficiently)?
If you just want to prevent any other session from modifying the data you can issue
LOCK TABLE whatever IN EXCLUSIVE MODE
/
This blocks other sessions from updating the data, but we cannot block other people from reading it.
Note that in Oracle such table locking is rarely required, because Oracle operates a policy of read consistency. Which means if we run a query that takes fifteen minutes to run the last row returned will be consistent with the first row; in other words, if the result set had been sorted in reverse order we would still see exactly the same rows.
edit
If you want to implement a queue (without actually using Oracle's built-in Advanced Queueing functionality) then SELECT ... FOR UPDATE is the way to go. This construct allows one session to select and lock one or more rows. Other sessions can update the unlocked rows. However, implementing a genuine queue is quite cumbersome, unless you are using 11g. It is only in the latest version that Oracle have supported the SKIP LOCKED clause. Find out more.
1. Lock table
2. Read next item in queue
3. Remove item from the queue
4. Remove table lock
Under this model a lot of sessions are going to be doing nothing but waiting for the lock, which seems a waste. Advanced Queuing would be a better solution.
If you want a 'roll-your-own' solution, you can look into SKIP LOCKED. It wasn't documented until 11g, but it is present in 10g. In this algorithm you would do
1. SELECT item FROM queue WHERE ... FOR UPDATE SKIP LOCKED
2. Process item
3. Delete the item from the queue
4. COMMIT
That would allow multiple processes to consume items off the queue.
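To illustrate why skipping locked rows lets several consumers make progress without waiting on each other, here is a small in-memory analogy in Python. It is not Oracle code and makes no claim about how the database implements SKIP LOCKED: each queue row gets its own lock, a consumer takes the first row whose lock it can acquire without blocking, and deleting the row releases the lock (standing in for delete plus COMMIT). All names are hypothetical.

```python
import threading
from typing import Dict, Iterable, Optional, Tuple


class SkipLockedQueue:
    """In-memory analogy of SELECT ... FOR UPDATE SKIP LOCKED."""

    def __init__(self, items: Iterable[str]) -> None:
        self._rows: Dict[int, str] = dict(enumerate(items, start=1))
        self._locks: Dict[int, threading.Lock] = {
            row_id: threading.Lock() for row_id in self._rows
        }
        self._guard = threading.Lock()  # protects the two dictionaries

    def claim(self) -> Optional[Tuple[int, str]]:
        """Lock and return the first row not locked by someone else."""
        with self._guard:
            for row_id, item in self._rows.items():
                # Non-blocking acquire: a locked row is skipped, not waited on.
                if self._locks[row_id].acquire(blocking=False):
                    return row_id, item
        return None

    def delete_and_commit(self, row_id: int) -> None:
        """Remove a claimed row and release its lock."""
        with self._guard:
            del self._rows[row_id]
            lock = self._locks.pop(row_id)
        lock.release()


queue = SkipLockedQueue(["a", "b", "c"])
first = queue.claim()       # takes row 1
second = queue.claim()      # row 1 is locked, so row 2 is returned
queue.delete_and_commit(first[0])
third = queue.claim()       # rows 1 (gone) and 2 (locked) are skipped
print(first, second, third)
```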
The TABLOCKX and HOLDLOCK hints you mentioned appear to be used for writes, not reads (based on http://www.tek-tips.com/faqs.cfm?fid=3141). If that's what you're after, would a SELECT FOR UPDATE fit your need?
UPDATE: Based on your update, SELECT FOR UPDATE should work, assuming all clients use it.
UPDATE 2: You may not be in a position to do anything about it right now, but this sort of problem is actually an ideal fit for something other than a relational database, such as AMQP.
If you mean, lock a table so that no other session can read from the table, then no, you can't. Why would you want to do that anyway?
