Can there be any problems deleting records when iterating over them in Tarantool?

Consider this:
for _, t in box.space.session.index.user:pairs({uid}) do
    local session_id = t[F.session.id]  -- F is presumably a map of field names to field numbers
    box.space.session:delete({session_id})
end
Can there be any performance or correctness problems?
Tarantool 1.9

It's totally fine. However, be advised of the following:
- if you don't use transactions, you may or may not walk over records that are being inserted while you're iterating over the space;
- adding transactions would boost performance;
- but transactions that touch more than a few thousand records may freeze Tarantool for too long, so the final decision depends on your typical load.
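If you do add transactions, a batched variant might look like the following minimal sketch (the batch size is an arbitrary choice to tune for your load, and F.session.id is assumed to resolve to a tuple field number as in the question). Collecting the keys first avoids committing under an open iterator:

-- collect the keys up front, then delete in transactional batches
local keys = {}
for _, t in box.space.session.index.user:pairs({uid}) do
    table.insert(keys, t[F.session.id])
end

local BATCH = 1000  -- illustrative; keep well below "several thousand"
for i = 1, #keys, BATCH do
    box.begin()
    for j = i, math.min(i + BATCH - 1, #keys) do
        box.space.session:delete({keys[j]})
    end
    box.commit()
end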

Related

Find Common Elements in two Big Data Sets in a Reasonable Time

I have two Spark dataframes, DFa and DFb, with the same schema ('country', 'id', 'price', 'name').
DFa has around 610 million rows,
DFb has around 3,000 million rows.
Now I want to find all rows from DFa and DFb that have the same id, where an id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
It's a SQL inner join, but when I run it with Spark SQL, a task gets killed because the container uses too much memory and hits a Java heap space error. My cluster has limited resources, so tuning the YARN and Spark configuration is not a feasible option.
Is there any other solution? A non-Spark solution is also acceptable if the runtime is reasonable.
More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?
First compute 64-bit hashes of your ids. The comparison will be a lot faster on the hashes than on the string ids.
My basic idea is:
Build a hash table from DFa.
As you compute the hashes for DFb, look each one up in the table. If there's nothing there, drop the entry (no match). If you get a hit, compare the actual ids to make sure you don't get a false positive.
The complexity is O(N). Not knowing how many overlaps you expect, this is the best you can do, since you might have to output everything because it all matches.
The naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and a flat hash table), but you can do better.
Since we already have the hash, we only need to know whether it exists, so one bit per slot is enough to mark presence. That reduces memory usage a lot (64x less per entry, although you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it yourself.
But there's something even more compact: a Bloom filter. It introduces some more false positives, but we had to double-check anyway because we didn't trust the hash, so that's not a big downside. The best part is that ready-made libraries for it are already available.
So everything together looks like this:
Compute hashes from DFa and build a Bloom filter.
Compute hashes from DFb and check them against the Bloom filter. If you get a match, look the id up in DFa to make sure it's a real match, and add it to the result.
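A minimal plain-Python sketch of that pipeline, assuming dfa_rows and dfb_rows are iterables of dicts with an 'id' key (all names and sizing constants here are illustrative, not tuned values):

import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive several independent bit positions from one strong hash
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build the filter from the smaller side (DFa)...
bf = BloomFilter(size_bits=8 * 10**9)  # ~1 GB of bits for ~610M ids
for row in dfa_rows:
    bf.add(row["id"])

# ...then reduce DFb to the candidate rows that *might* match.
candidates = [row for row in dfb_rows if bf.might_contain(row["id"])]
# The candidate set should be small enough for an exact verification
# join against DFa on the real ids (this removes the false positives).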
This is a typical use case in any big data environment. You can use a map-side join, where the smaller table is cached and broadcast to all the executors.
You can read more about broadcast joins here:
Broadcast-Joins
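In PySpark that amounts to something like the following sketch (the paths are placeholders; whether the smaller dataframe actually fits in executor memory is the deciding factor):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()
dfa = spark.read.parquet("path/to/dfa")  # the smaller side
dfb = spark.read.parquet("path/to/dfb")

# Ship DFa to every executor so the join needs no shuffle of DFb.
result = dfb.join(broadcast(dfa), on="id", how="inner")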

INSERT: when should I not be using the APPEND hint?

I'm trying to insert batches of data in an Oracle table, with an INSERT statement, namely:
INSERT INTO t1 SELECT * FROM all_objects;
I've come across the APPEND hint, which seems to increase performance in some cases.
Are there situations where it might decrease performance and it should not be used?
Thanks
The APPEND hint does a direct-path insert, the same direct-path mechanism SQL*Loader uses when you specify it. For large datasets, you should see dramatic improvements.
The main caveat to be aware of: one of the reasons it is so fast is that it inserts all new rows past the high water mark (HWM). This means that if you are frequently deleting rows and re-inserting, a conventional insert could potentially be better than a direct path, because it will reclaim the freed space from the deleted rows.
If, for example, you had a table with 5 million rows where you did a delete from followed by a direct path insert, after a few iterations you would notice things slow to a crawl. The insert itself would continue to be nice and speedy, but your queries against the table will gradually get worse.
The only way I know of to reset the HWM is to truncate the table. If you plan to use direct path on a table with minimal dead rows, or if you are going to somehow reset the HWM, then I think in most cases it will be fine -- preferable, in fact, if you are inserting large amounts of data.
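For reference, here is the direct-path variant of the statement from the question, with a truncate as one way to reset the HWM (illustrative only; truncating obviously discards the existing rows):

TRUNCATE TABLE t1;  -- resets the high water mark
INSERT /*+ APPEND */ INTO t1 SELECT * FROM all_objects;
COMMIT;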
Here's a nice article that explains the details of the differences:
https://sshailesh.wordpress.com/2009/05/03/conventional-path-load-and-direct-path-load-simple-to-use-in-complex-situations/
A final parting shot -- with all Oracle hints, know everything you can before you use them. Using them haphazardly can be hazardous to your health.
I think performance may be decreased in the special case where your SELECT retrieves only one row or a small number of rows.
In that case I would not use the APPEND hint. The OracleBase article describes the impact of the APPEND hint very well, and also provides a link to the manual page.
There are 3 different situations:
The APPEND hint will have no effect because it is silently ignored. This happens if a trigger or a referential constraint is defined on the table, and under some other circumstances.
The APPEND hint will raise an error message, or a statement following the hinted statement will raise an error message. Here you have two possibilities: either remove the APPEND hint, or split the transaction into two or more separate transactions.
The APPEND hint will work. Here you will get better performance (except if you have only a small number of rows to insert, as stated at the beginning). But you will also need more space: the insert uses new extents for the data rather than filling the free space in existing extents, and in a parallel insert each process uses its own extents. This may result in a lot of unused space and be a drawback in some situations.
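The second situation can be reproduced like this: reading the table in the same transaction after a direct-path insert fails until you commit.

INSERT /*+ APPEND */ INTO t1 SELECT * FROM all_objects;
SELECT COUNT(*) FROM t1;
-- ORA-12838: cannot read/modify an object after modifying it in parallel
COMMIT;
SELECT COUNT(*) FROM t1;  -- works after the commit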
It might negatively affect performance if you are using it to insert small sets of data.
That's because it allocates new space every time instead of reusing free space, so using it for many small inserts can fragment your table, which may result in performance issues.
The hint is a good idea for large inserts scheduled for times when usage is low.

Time series in Cassandra when measures can go "back in time"

This is related to "cassandra time series modeling when time can go backward", but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures (
    key text,
    measure_time timestamp,
    value int,
    PRIMARY KEY (key, measure_time)
) WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to keep the data arranged in decreasing timestamp order. This makes range-based queries very efficient: for a given key they translate into sequential disk reads, which are intrinsically fast.
Many times I have seen the suggestion to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that. It seems to me a very common pattern, and you can't use it if:
1) your user wants to query on the actual time the measure was taken, not the time the measure was written;
2) you use multiple writer threads.
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear. I couldn't find any explanation about this on the DataStax site, but I admit I'm quite a novice with Cassandra. Any pointers appreciated.
Sure. Once written, an SSTable file is immutable. Your timestamp=9 row will end up in another SSTable, and Cassandra will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The compaction process may later merge those two SSTables into a single new one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
Also try to avoid very wide rows/partitions, which is what you will get if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
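A common way to bound partition width is to add a time bucket to the partition key. Here is a hedged sketch (the day column and its text encoding are illustrative choices, not part of the original answer):

CREATE TABLE measures (
    key text,
    day text,            -- e.g. '2016-03-01': one partition per key per day
    measure_time timestamp,
    value int,
    PRIMARY KEY ((key, day), measure_time)
) WITH CLUSTERING ORDER BY (measure_time DESC);

-- queries then always specify the bucket:
-- SELECT * FROM measures WHERE key = ? AND day = '2016-03-01';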

Sphinxsearch - Switching from 'regular' index to distributed index - Does this alter sorting in any way?

We recently did some research on how to speed things up a bit in sphinxsearch.
We found a great way to speed things up is to use a distributed index.
We ran real-life tests, and found that queries execute somewhere between 35-40% faster when a distributed index is used.
What I mean by distributed is basically our regular index split up into 4 parts (the box hosting this index has 4 cores), by adding AND id % 4 = 0 / 1 / 2 / 3 into each source, one residue per part of the index.
FYI, id is our primary key / auto increment.
So instead of having one huge index, this splits it up into 4.
And then we just use an index with type = distributed + local .... local .... local .... local .... as a 'put all the parts together' index; see the sketch below.
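In sphinx.conf terms, the combined index looks something like this (index and source names are illustrative):

index parts_all
{
    type  = distributed
    local = parts_0
    local = parts_1
    local = parts_2
    local = parts_3
}

# each part's source differs only in its residue class, e.g.
# sql_query = SELECT id, ... FROM docs WHERE id % 4 = 0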
We did some quick testing and the same results come back... only 35-40% faster :)
So, before we implement this site wide, we would like to know:
Does switching to a distributed index like the one mentioned above impact sorting in any way?
We ask this because we use sphinx for a number of SEO related items, and we NEED to keep the order of the results the same.
I should also mention that queries, all query options, etc. stay the same. Any and all changes were done on the daemon end.
Thanks!
Sorting should be unaffected. You suffer a bigger performance hit when combining distributed indexes with high offsets, but the first few pages will be fine.
As far as I know, the gotchas are grouping/clustering and kill-lists. If you're not using those, there should be nothing to worry about.

TSql, building indexes before or after data input

A performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow fast searching. Currently I set the indexes up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first and then build the indexes?
I'd temper af's answer by saying that "index first, insert after" would probably be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting them in the natural order of that index. The reason is that for each insert, the data rows themselves would have to be reordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indices are in place causes the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indices afterwards, especially with that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
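A hedged T-SQL sketch of the "insert first, index after" approach (the table, index, and file names are illustrative):

-- drop the nonclustered indexes before the bulk load...
DROP INDEX IX_BigTable_Col1 ON dbo.BigTable;

-- ...load the data (TABLOCK helps enable minimal logging)...
BULK INSERT dbo.BigTable
FROM 'C:\data\bigtable.dat'
WITH (TABLOCK);

-- ...then rebuild the indexes in one pass over the data.
CREATE NONCLUSTERED INDEX IX_BigTable_Col1 ON dbo.BigTable (Col1);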
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.
