I'm trying to figure out whether SubSonic's AddMany() method is faster than a simple foreach loop. I poked around a bit on the SubSonic site but didn't see much on performance stats.
What I currently have (.ForEach() just has some validation in it; other than that it works just like foreach (...) { do stuff }):
records.ForEach(record =>
{
    newRepository.Add(record);
    recordsProcessed++;
    if (cleanUp) oldRepository.Delete<T>(record);
});
Which would change to:
newRepository.AddMany(records);
if (cleanUp) oldRepository.DeleteMany<T>(records);
Notice that with this method I lose the count of how many records I've processed, which isn't critical... but it would be nice to be able to show the user how many records were moved with this tool.
So my questions boil down to: Would AddMany() be noticeably faster to use? And is there any way to get a count of the number of records actually copied over? If it succeeds can I assume all the records were processed? If one record fails, does the whole process fail?
Thanks in advance.
Just to clarify, AddMany() generates individual queries per row and submits them via a batch; DeleteMany() generates a single query. Please consult the source code and the generated SQL when you want to know what happens to your queries.
Your first approach is slow: 2*N queries. However, if you submit the queries using a batch it would be faster.
Your second approach is faster: N+1 queries. You can find how many will be added simply by enumerating 'records'.
If there is a risk of exceeding capacity limits on the size of a batch, then submit 50 or 100 at a time with little penalty.
Your final question depends on transactions. If the whole operation is one transaction, it will commit or abort as one. Otherwise, each query stands alone. Your choice.
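The batch-of-50-or-100 idea is language-agnostic; here is a minimal Ruby sketch for brevity (the repository object and its `add_many` method are hypothetical stand-ins for SubSonic's C# API), which also shows that chunking lets you keep the processed count:

```ruby
# Submit records in chunks; `repository` and `add_many` are hypothetical
# stand-ins for the real data-access API.
def add_in_batches(repository, records, batch_size = 100)
  processed = 0
  records.each_slice(batch_size) do |batch|
    repository.add_many(batch) # one batched submit per chunk
    processed += batch.size    # the count survives the batching
  end
  processed
end
```

And since the count only depends on the input, `records.size` already tells you how many rows will be moved before any query runs.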
Related
I have a database filled with rows and multiple threads that access these rows, feed some of their data into a function, produce an output, and then fill the row's missing columns with that output.
Here's the issue: Each row has an unprocessed flag which is, by default, true. So each thread is looking for rows with this flag. But each thread is getting the SAME row, it turns out...because the row is being marked as processed after the thread's job is complete, which may happen after a few seconds.
One way I avoided this was to add a currently_processed flag to each row, defaulting to false; when a thread picks up a row it sets the flag to true, and when the thread is done it sets it back to false. The problem is that this requires some sort of locking, so no other thread can do anything until the flag is set. I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process.
If it helps, the code is in Ruby, but this problem is language-agnostic. Here's the code to demonstrate the type of threading I'm using; nothing special, just low-level threads like almost all languages have:
threads = 3.times.map do
  Thread.new do
    row = get_database_row
    result = do_some_processing(row)
    insert_results_into_row(result)
  end
end
threads.each(&:join)
The "real" answer here is that you need a database transaction. When one thread gets that row, the database needs to know that this row is currently up for processing.
You can't resolve this within your application! When two threads look at the same row at the same time, they could both try to write that flag... and yes, it will end up set to "currently processed"; then both threads will update the row's data and write it back. Maybe that isn't a problem if every processing path produces the same final result; but if not, all kinds of data-integrity problems will arise.
So the real answer is that you step back and look how your specific database is designed in order to deal with such things.
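To make the "the database has to arbitrate" point concrete, here is an in-memory Ruby sketch of the claim-then-process pattern. A Mutex stands in for the atomicity the database would provide (for example, a single UPDATE that flips the flag and returns the row in one statement); in production the lock lives in the database, not in application code:

```ruby
# claim_row atomically finds an unclaimed, unprocessed row and marks it
# claimed in one step, so no two workers can ever receive the same row.
ROWS = Array.new(10) { |i| { id: i, claimed: false, processed: false } }
CLAIM_LOCK = Mutex.new # stand-in for the database's own atomicity

def claim_row
  CLAIM_LOCK.synchronize do
    row = ROWS.find { |r| !r[:claimed] && !r[:processed] }
    row[:claimed] = true if row
    row
  end
end

workers = 3.times.map do
  Thread.new do
    while (row = claim_row)
      row[:result] = row[:id] * 2 # stand-in for do_some_processing
      row[:processed] = true
    end
  end
end
workers.each(&:join)
```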
I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process.
There are some ways to do this:
1) One common dispatcher for all threads. It reads all rows and puts them into a shared queue, from which the worker threads take rows.
2) Go deeper into the DB and find out whether it supports something like Oracle's SELECT ... FOR UPDATE SKIP LOCKED syntax, and use it. For Oracle you need to use this syntax in a cursor, which makes for a somewhat cumbersome interaction, but at least it can work this way.
3) Partition the input by, say, the index of the worker thread. So the 1st worker out of 3 will only process rows 1, 4, 7, etc.; the 2nd worker will only process rows 2, 5, 8, etc.
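Option 1) can be sketched with Ruby's thread-safe Queue: the dispatcher reads the rows once, and workers only ever pull from the queue, so no row is handed out twice (the rows and the processing step are stand-ins for the real database work):

```ruby
# One dispatcher fills a thread-safe queue; workers pull from it.
# Queue#pop is atomic, so no two workers ever receive the same row.
rows = (1..9).to_a           # stand-in for unprocessed DB rows
queue = Queue.new
rows.each { |r| queue << r }
3.times { queue << :done }   # one "poison pill" per worker

results = Queue.new
workers = 3.times.map do
  Thread.new do
    while (row = queue.pop) != :done
      results << row * 2     # stand-in for do_some_processing
    end
  end
end
workers.each(&:join)
```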
I am performing an import of data wrapped in a CMSTransactionScope.
What would be the most efficient and practical way to import data in parallel and roll back on any error? The problem I see is that, with it being parallel, I don't know whether I can have the inserted objects be part of the transaction when they are a part of a new thread.
Is there any way to do this or should it be handled differently?
If you're running the code in parallel in order to achieve better performance and you are basically inserting rows one by one then it's unlikely that it'll perform any better than it would while running in a single thread.
In this case I'd recommend using one thread in combination with CMSTransactionScope, and potentially ConnectionHelper.BulkInsert.
Anyway, if you still want to run your queries in parallel then you need to implement some kind of synchronization (locking, for instance) to ensure that all statements are executed before the code hits CMSTransactionScope.Commit() (this basically means a performance loss). Otherwise, queries would get executed in separate transactions. Moreover, you have to make sure that the CMSTransactionScope object always gets instantiated with the same IDataConnection (this should happen by default when you don't pass a connection to the constructor).
The second approach seems error-prone to me, and I'd rather look at different ways of optimizing the code (using async, etc.).
I'm currently implementing EventSourcing for my Go Actor lib.
The problem I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is, however, very slow: writing about 10,000 events in order takes about 5 seconds on my setup.
If I instead write those 10,000 events asynchronously, using goroutines, I can write all of the data in less than one second.
But then the writes happen in a nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling, AFAIK.
What are my options here?
Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible without a change to the client (I'm not sure), but the key here is to take all the MutationToken responses and then issue the query with the highest sequence number per vbucket across all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
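The merge rule described above is simply "keep the highest sequence for each vbucket". A sketch in Ruby, treating each token as a plain vbucket-to-sequence Hash (an assumption made for illustration; the real client type differs, but the merge rule is the same):

```ruby
# Merge mutation tokens by keeping the highest sequence per vbucket.
# Each token is modeled here as a Hash of vbucket id => sequence number.
def merge_tokens(tokens)
  tokens.each_with_object(Hash.new(0)) do |token, merged|
    token.each { |vbucket, seq| merged[vbucket] = [merged[vbucket], seq].max }
  end
end
```

For example, merging { 1 => 5, 2 => 3 } with { 1 => 2, 3 => 7 } yields { 1 => 5, 2 => 3, 3 => 7 }: vbucket 1 keeps its higher sequence, and the others pass through.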
In our system there is a (quite common) case where a user's action can trigger an operation that involves setting/removing labels onto/from nodes and relationships, amounting to a total of hundreds of thousands of entities (remove label A from 100K nodes, set label B on 80K nodes, set property [x,y,z] on 20K nodes, and so on). Of course, I can't squeeze them all into one transaction, and, thanks to the fact that these nodes can easily be separated into a large number of subsets, I perform the actions inside some number of separate transactions, which, of course, breaks all the ACIDity, but satisfies us in terms of performance. If I, however, try to nest those transactions into a single large one to rule them all, that top-level transaction tries to track all the internal transactions' updates to the DB, which, of course, results in extremely poor performance.
What can you recommend to solve this problem?
My config (well, its relevant parts):
"org.neo4j.server.database.mode" : "HA",
"use_memory_mapped_buffers" : "true",
"neostore.nodestore.db.mapped_memory" : "450M",
"neostore.relationshipstore.db.mapped_memory" : "450M",
"neostore.propertystore.db.mapped_memory" : "450M",
"neostore.propertystore.db.strings.mapped_memory" : "300M",
"neostore.propertystore.db.arrays.mapped_memory" : "50M",
"cache_type" : "hpc",
"dense_node_threshold" : "15",
"query_cache_size" : "150"
Any hints and clues are much appreciated :)
You are right that modifying hundreds of thousands of entities as a result of a user action in the same transaction isn't going to be performant. Nested transactions in Neo4j are just "placebo" transactions, as you correctly point out.
I would start by thinking about alternative strategies to achieve your goal (which I know nothing about) without needing to update so many entities.
If an alternative isn't possible, I would ask whether it is ok for the updates to happen a short time after the user action. If the answer is yes, then I would store a message about the user action in a persistent queue, which I would process asynchronously. That way, the user call returns quickly and the update happens eventually.
Finally, if it is acceptable for the time between the user action and the large update to be even longer, I would consider an "agent" that continuously crawls the graph and updates the labels of the entities it encounters, as opposed to transaction-driven updates. Have a look at GraphAware NodeRank for inspiration.
I've got a list of Foo IDs. I need to call a stored procedure for each ID.
e.g.
Guid[] siteIds = ...; // typically contains 100 to 300 elements
foreach (var id in siteIds)
{
    db.MySproc(id); // Executes some stored procedure.
}
Each call is pretty independent of the others, so this shouldn't be contentious in the database.
My question: would it be beneficial to parallelize this using Parallel.ForEach? Or is database IO going to be a bottleneck, and more threads would just result in more contention?
I would measure it myself, however, it's difficult to measure this on my test environment where the data and load is much smaller than our real web server.
Out of curiosity, why do you want to optimize this with Parallel.ForEach and spawn threads / open connections / pass data / get a response for every item, instead of writing a simple sproc that works with a list of IDs instead of a single ID?
At first glance, that should give you a much more noticeable improvement.
I would think that the Parallel.ForEach would work, assuming that your DB server can handle the ~150-300 concurrent operations.
The only way to know for sure is to measure both.
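A sketch of "measure both", in Ruby for brevity since the real environment isn't available here; `sproc_stub` is a hypothetical stand-in for `db.MySproc` that just sleeps to simulate one network-plus-DB round trip:

```ruby
require 'benchmark'

# Hypothetical stand-in for db.MySproc: one round trip to the database.
def sproc_stub(id)
  sleep 0.01
  id
end

ids = (1..20).to_a

# One call at a time, as in the original foreach loop.
sequential = Benchmark.realtime do
  ids.each { |id| sproc_stub(id) }
end

# Four worker threads with five IDs each, roughly what Parallel.ForEach does.
threaded = Benchmark.realtime do
  ids.each_slice(5).map do |chunk|
    Thread.new { chunk.each { |id| sproc_stub(id) } }
  end.each(&:join)
end
```

With a fixed sleep, the threaded version wins easily; against a real server, lock and connection contention can change that, which is exactly why measuring against production-like data matters.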