Cassandra Timing out because of TTL expiration - amazon-ec2

I'm using DataStax Community v2.1.2-1 (AMI v2.5) with the preinstalled default settings, plus the read timeout increased to 10 seconds. Here is the issue:
create table simplenotification_ttl (
user_id varchar,
real_time timestamp,
insert_time timeuuid,
read boolean,
msg varchar, PRIMARY KEY (user_id, real_time, insert_time));
Insert Query:
insert into simplenotification_ttl (user_id, real_time, insert_time, read)
values ('test_3',14401440123, now(),false) using TTL 800;
For the same user_id 'test_3' I inserted 33,000 tuples. [This problem does not happen with 24,000 tuples.]
Gradually I see:
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
15681
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
12737
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
**errors={}, last_host=127.0.0.1**
I have reproduced this many times, even on different tables. Once this happens, even if I insert a row with the same user_id and retrieve it with LIMIT 1, the query times out.
I need TTL to work properly, i.e. the count should be 0 after the expected time. How can I solve this issue?
Thanks
[My other node related setup is using m3.large with 2 nodes EC2Snitch]

You're running into a problem where the number of tombstones (deleted values) scanned by the read exceeds a threshold, and the query then times out.
You can see this if you turn on tracing and then try your select statement, for example:
cqlsh> tracing on;
cqlsh> select count(*) from test.simple;
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------+--------------+--------------+----------------
...snip...
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 23:36:59,324 | 172.31.0.85 | 123932
Scanned 1 rows and matched 1 | 23:36:59,325 | 172.31.0.85 | 124575
Timed out; received 0 of 1 responses for range 2 of 4 | 23:37:09,200 | 172.31.13.33 | 10002216
You're kind of running into an anti-pattern for Cassandra where data is stored for just a short time before being deleted. There are a few options for handling this better, including revisiting your data model if needed. Here are some resources:
The cassandra.yaml configuration file - See section on tombstone settings
Cassandra anti-patterns: Queues and queue-like datasets
About deletes
For your sample problem, I tried lowering the gc_grace_seconds setting to 300 (5 minutes). That makes tombstones eligible for cleanup far sooner than the default 10 days, but it may or may not be appropriate for your application. Read up on the implications of deletes and adjust as needed.
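The interaction between TTL expiry, gc_grace_seconds, and the tombstone failure threshold can be sketched with a toy model. This is not Cassandra's implementation: `Cell`, `scan_partition`, and `TombstoneOverwhelmingError` are hypothetical names, and the model assumes an expired cell is purged as soon as gc_grace_seconds has elapsed, whereas a real node only drops it when a compaction runs after that point.

```python
from dataclasses import dataclass


class TombstoneOverwhelmingError(Exception):
    """Raised by this toy model when a read scans too many tombstones."""


@dataclass
class Cell:
    written_at: int  # seconds since an arbitrary epoch
    ttl: int         # seconds the cell stays live


def scan_partition(cells, now, gc_grace_seconds, failure_threshold):
    """Return (live, tombstones) for one read, aborting past the threshold.

    Assumption: a cell past its TTL counts as a tombstone until
    gc_grace_seconds have also elapsed, after which it is treated as purged.
    """
    live = 0
    tombstones = 0
    for cell in cells:
        expires_at = cell.written_at + cell.ttl
        if now < expires_at:
            live += 1
        elif now < expires_at + gc_grace_seconds:
            tombstones += 1
            if tombstones > failure_threshold:
                raise TombstoneOverwhelmingError(
                    f"scanned over {failure_threshold} tombstones"
                )
        # Otherwise the cell is assumed purged and costs the read nothing.
    return live, tombstones


# 33,000 cells written over about 330 seconds, each with a TTL of 800 seconds.
cells = [Cell(written_at=i // 100, ttl=800) for i in range(33_000)]
print(scan_partition(cells, now=2_000, gc_grace_seconds=864_000,
                     failure_threshold=100_000))
print(scan_partition(cells, now=2_000, gc_grace_seconds=300,
                     failure_threshold=100_000))
```

With the 10-day default (864,000 seconds) every expired cell is still scanned as a tombstone; with 300 seconds the same read finds nothing left to scan.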

Related

Delta Detection in Oracle table with 2 billion records using composite key

Initial Load on Day 1
id | key | fkid
---+-----+-----
1  | 0   | 100
1  | 1   | 200
2  | 0   | 300
Load on Day 2
id | key | fkid
---+-----+-----
1  | 0   | 100
1  | 1   | 200
2  | 0   | 300
3  | 1   | 400
4  | 0   | 500
Need to find delta records
Load on Day 2
id | key | fkid
---+-----+-----
3  | 1   | 400
4  | 0   | 500
Problem Statement
Need to find delta records in minimum time with following facts
1: I have to process around 2 billion records initially from a table as mentioned below
2: Also need to find delta with minimal time so that I can process it quickly
Questions :
1: Will it be a time consuming process to identify delta especially during production downtime ?
2: How long should it take to identify the delta in a table with 3 numeric columns, of which id & key form a composite key?
Solution tried :
1: Use a full join and extract the delta with NVL, but this looks costly.
SELECT
nvl(node1.id, node2.id) id,
nvl(node1.key, node2.key) key,
nvl(node1.fkid, node2.fkid) fkid
FROM
TABLE_DAY_1 node1
FULL JOIN TABLE_DAY_2 node2 ON node2.id = node1.id
WHERE
node2.id IS NULL
OR node1.id IS NULL;
You need two separate statements to handle this: one to detect new and changed rows, and a separate one to detect deleted rows.
While it is cumbersome to write, the fastest comparison is field-by-field. Here node1 is the newer load and node2 is the previous one, joined on the composite key (id, key):
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node2) */ *
FROM table_day_2 node1,
table_day_1 node2
WHERE node1.id = node2.id(+)
AND node1.key = node2.key(+)
AND (node2.id IS NULL -- new rows
OR node1.col1 <> node2.col1 -- changed val on non-nullable col
OR NVL(node1.col3,' ') <> NVL(node2.col3,' ') -- changed val on nullable string
OR NVL(node1.col4,-1) <> NVL(node2.col4,-1) -- changed val on nullable numeric, etc..
)
Then for deleted rows:
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node2) */ node2.id
FROM table_day_2 node1,
table_day_1 node2
WHERE node1.id(+) = node2.id
AND node1.key(+) = node2.key
AND node1.id IS NULL -- deleted rows
You will want to make sure Oracle does a full table scan. If you have lots of CPUs and parallel query is enabled on your database, make sure the query uses parallel query (hence the hint). And you want a hash join between them. Work with your DBA to ensure you have enough temporary space to pull this off, and enough PGA to at least handle this with a single pass workarea rather than multipass.
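The logic of the two statements can be illustrated outside the database with a small in-memory sketch. It is only a model of what a hash join on the composite key does, not a substitute for running the comparison in Oracle; `detect_delta` is a hypothetical helper, and the rows are the sample data from the question.

```python
def detect_delta(old_rows, new_rows):
    """Return (added, changed, deleted) rows, matching on the (id, key) pair."""
    # Build one hash table per load, keyed by the composite key.
    old_by_key = {(row[0], row[1]): row for row in old_rows}
    new_by_key = {(row[0], row[1]): row for row in new_rows}

    added = [row for key, row in new_by_key.items() if key not in old_by_key]
    changed = [row for key, row in new_by_key.items()
               if key in old_by_key and old_by_key[key] != row]
    deleted = [row for key, row in old_by_key.items() if key not in new_by_key]
    return added, changed, deleted


# Columns are (id, key, fkid), as in the question's tables.
day_1 = [(1, 0, 100), (1, 1, 200), (2, 0, 300)]
day_2 = [(1, 0, 100), (1, 1, 200), (2, 0, 300), (3, 1, 400), (4, 0, 500)]

added, changed, deleted = detect_delta(day_1, day_2)
print(added)    # [(3, 1, 400), (4, 0, 500)]
print(changed)  # []
print(deleted)  # []
```

Each load is read once and probed once, which is the same access pattern the full scan plus hash join hints ask Oracle for.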

Reduce resource consumption in ClickHouse

The table
CREATE TABLE events
(
site_id UInt64,
name String
-- other columns
)
ENGINE = CollapsingMergeTree(sign_flag)
PARTITION BY site_id
ORDER BY (name)
SETTINGS index_granularity = 8192;
The query
SELECT 'wtf',
*
FROM events
WHERE site_id = 1 AND
name = 'some_name'
LIMIT 100000;
The log
SELECT formatReadableSize(read_bytes) AS read_bytes,
formatReadableSize(memory_usage) AS memory_usage,
formatReadableQuantity(read_rows) AS read_rows,
query_duration_ms / 1000 AS query_duration_sec,
query
FROM system.query_log
WHERE query LIKE '%wtf%'
ORDER BY
event_time DESC
LIMIT 100;
+------------+--------------+--------------+--------------------+
| read_bytes | memory_usage | read_rows | query_duration_sec |
+------------+--------------+--------------+--------------------+
| 578.41 MiB | 131.95 MiB | 1.01 million | 10.773 |
+------------+--------------+--------------+--------------------+
The numbers in the log look very large to me.
How can I optimize this, or am I missing something in the server config?
Consider defining a different primary key; for this query, ORDER BY (name, site_id).
Choosing the primary key is an important part of the design. To choose the right one, you need to consider the full picture of your use cases.
See for more details:
ClickHouse: Selecting the Primary Key
StackOverflow #62556274.
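Why the sort key matters can be sketched with a toy model of a sparse index, where one mark is kept per granule of sorted rows. This is an illustration of the general idea only, not ClickHouse's implementation; `rows_scanned` is a hypothetical function and the data is made up.

```python
from bisect import bisect_left, bisect_right


def rows_scanned(sorted_keys, wanted, granularity):
    """Estimate rows read for `key = wanted` with one index mark per granule."""
    marks = sorted_keys[::granularity]  # first key of every granule
    # Start one granule before the first mark >= wanted, because a run of
    # `wanted` values can begin in the middle of the preceding granule.
    first = max(bisect_left(marks, wanted) - 1, 0)
    # Granules whose first key is already greater than `wanted` are skipped.
    last = bisect_right(marks, wanted)
    start = first * granularity
    stop = min(last * granularity, len(sorted_keys))
    return stop - start


keys = ["a"] * 10 + ["b"] * 5 + ["c"] * 10  # 25 rows, sorted by key
print(rows_scanned(keys, "b", granularity=4))  # 8
print(len(keys))                               # 25
```

When the filter columns lead the sort key, only the granules around the matching range are read (8 rows here); a filter on a column that is not in the sort key would have to read all 25.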

Cassandra adding row vs. adding columns performance

I want to store time series logs from many different devices in Cassandra.
I have 2 strategies:
The first one, add a column for each new event
---------------------------------------------------------------
device1 | 2016-4-3, "visit /" | 2016-4-4, "exit /" | ...
----------------------------------------------------------------
device2 | 2016-4-3, "visit /home" | 2016-4-4, "exit /home" | ...
----------------------------------------------------------------
the second one, add a row for each new event just like sql
--------------------------------
device1 | 2016-4-3 | "visit /" |
--------------------------------
device1 | 2016-4-4 | "exit /" |
--------------------------------
.... | ... | ....
Which one will give better insert performance?
This is actually a confusion over how Cassandra works. In Cassandra we think about data modeling as "partitions" and "rows".
A partition contains many logical groupings of columns we call a "row". The ordering of rows within a Partition is based on a Clustering Key which is a set of columns in that row.
In IoT use cases this typically plays out as a partition representing a single device, with the rows within the partition representing events emitted by that device. The clustering key is set to the emission time (more often a TIMEUUID for the event). This builds up partitions that look like
DeviceID -> [TimeUUID_1, (DataA, DataB, DataC) ], [TimeUUID_2, (DataA, DataB, DataC) ] ...
This partition would have been described by a schema like
CREATE TABLE timeseries (
DeviceID UUID,
EventTime TIMEUUID,
DataA Text,
DataB Text,
DataC Text,
PRIMARY KEY (DeviceID, EventTime)
);
For more examples see time series data-modeling, which details a few different styles of modeling time series data based on these concepts.
You are trying to model on a non-existent problem. You should only model based on your queries.
A typical (reverse) time-series model is:
CREATE TABLE mytable(
device int,
ts timestamp,
event text,
PRIMARY KEY (device , ts)
) WITH CLUSTERING ORDER BY (ts DESC);
where you can easily (and efficiently) retrieve all the events for a particular device with
SELECT * FROM mytable WHERE device = ?;
and you can further restrict your results to a specific time window with
SELECT * FROM mytable WHERE device = ? AND ts >= ? AND ts <= ?;
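The partition and clustering-key layout described above can be modeled in a few lines. This is a conceptual sketch, not how Cassandra stores data on disk; `TimeSeriesTable` and its methods are hypothetical names, and timestamps are plain integers for brevity.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TimeSeriesTable:
    """Toy model: one partition per device, rows kept sorted by timestamp."""

    def __init__(self):
        # device -> (sorted timestamps, events in the same order)
        self._partitions = defaultdict(lambda: ([], []))

    def insert(self, device, ts, event):
        timestamps, events = self._partitions[device]
        position = bisect_right(timestamps, ts)
        timestamps.insert(position, ts)
        events.insert(position, event)

    def select(self, device, start=None, end=None):
        """Rows for one device with start <= ts <= end, newest first."""
        if device not in self._partitions:
            return []
        timestamps, events = self._partitions[device]
        lo = 0 if start is None else bisect_left(timestamps, start)
        hi = len(timestamps) if end is None else bisect_right(timestamps, end)
        rows = list(zip(timestamps[lo:hi], events[lo:hi]))
        return rows[::-1]  # mirrors CLUSTERING ORDER BY (ts DESC)


table = TimeSeriesTable()
table.insert(1, 3, "exit /")
table.insert(1, 1, "visit /")
table.insert(1, 2, "visit /home")
print(table.select(1))
print(table.select(1, start=2, end=3))
```

Every new event is just another row appended to its device's partition, and both queries touch a single partition, which is why the "row versus column" distinction in the question does not change insert cost.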

Constant-time index for string column on Oracle database

I have an orders table. The table belongs to a multi-tenant application, so there are orders from several merchants in the same table. The table stores hundreds of millions of records. There are two relevant columns for this question:
MerchantID, an integer storing the merchant's unique ID
TransactionID, a string identifying the transaction
I want to know whether there is an efficient index to do the following:
Enforce a unique constraint on Transaction ID for each Merchant ID. The constraint should be enforced in constant time.
Do constant time queries involving exact matches on both columns (for instance, SELECT * FROM <table> WHERE TransactionID = 'ff089f89feaac87b98a' AND MerchantID = 24)
Further info:
I am using Oracle 11g. Maybe this Oracle article is relevant to my question?
I cannot change the column's data type.
Constant time means an index that performs in O(1) time complexity, like a hash map.
Hash clusters can provide O(1) access time, but not O(1) constraint enforcement time. However, in practice the constant access time of a hash cluster is worse than the O(log N) access time of a regular b-tree index. Also, clusters are more difficult to configure and do not scale well for some operations.
Create Hash Cluster
drop table orders_cluster;
drop cluster cluster1;
create cluster cluster1
(
MerchantID number,
TransactionID varchar2(20)
)
single table hashkeys 10000; --This number is important, choose wisely!
create table orders_cluster
(
id number,
MerchantID number,
TransactionID varchar2(20)
) cluster cluster1(merchantid, transactionid);
--Add 1 million rows. 20 seconds.
begin
for i in 1 .. 10 loop
insert into orders_cluster
select rownum + i * 100000, mod(level, 100)+ i * 100000, level
from dual connect by level <= 100000;
commit;
end loop;
end;
/
create unique index orders_cluster_idx on orders_cluster(merchantid, transactionid);
begin
dbms_stats.gather_table_stats(user, 'ORDERS_CLUSTER');
end;
/
Create Regular Table (For Comparison)
drop table orders_table;
create table orders_table
(
id number,
MerchantID number,
TransactionID varchar2(20)
) nologging;
--Add 1 million rows. 2 seconds.
begin
for i in 1 .. 10 loop
insert into orders_table
select rownum + i * 100000, mod(level, 100)+ i * 100000, level
from dual connect by level <= 100000;
commit;
end loop;
end;
/
create unique index orders_table_idx on orders_table(merchantid, transactionid);
begin
dbms_stats.gather_table_stats(user, 'ORDERS_TABLE');
end;
/
Trace Example
SQL*Plus Autotrace is a quick way to find the explain plan and track I/O activity per statement. The number of I/O requests is labeled as "consistent gets" and is a decent way of measuring the amount of work done. This code demonstrates how the numbers were generated for other sections. The queries often need to be run more than once to warm things up.
SQL> set autotrace on;
SQL> select * from orders_cluster where merchantid = 100001 and transactionid = '2';
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 621801084
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 16 | 1 (0)| 00:00:01 |
|* 1 | TABLE ACCESS HASH| ORDERS_CLUSTER | 1 | 16 | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("MERCHANTID"=100001 AND "TRANSACTIONID"='2')
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
31 consistent gets
0 physical reads
0 redo size
485 bytes sent via SQL*Net to client
540 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
0 rows processed
SQL>
Find Optimal Hashkeys, Trade-Offs
For optimal read performance all the hash collisions should fit in one block (all Oracle I/O is done per block, usually 8K). Getting the ideal storage right is tricky and requires knowing the hash algorithm, storage size (not the same as the block size), and number of hash keys (the buckets). Oracle has a default algorithm and size so it is possible to focus on only one attribute, the number of hash keys.
More hash keys leads to fewer collisions. This is good for TABLE ACCESS HASH performance as there is only one block to read. Below are the number of consistent gets for different hashkey sizes. For comparison an index access is also included. With enough hashkeys the number of blocks decreases to the optimal number, 1.
Consistent gets for transactionid = 1, 20, 300, 4000, and 50000:
Method         | 1 | 20 | 300 | 4000 | 50000
---------------+---+----+-----+------+------
Index          | 4 | 3  | 3   | 3    | 3
Hashkeys 100   | 1 | 31 | 31  | 31   | 31
Hashkeys 1000  | 1 | 3  | 4   | 4    | 4
Hashkeys 10000 | 1 | 1  | 1   | 1    | 1
More hash keys also lead to more buckets, more wasted space, and a slower TABLE ACCESS FULL operation.
Table type     | Space
---------------+------
Heap table     | 24 MB
Hashkeys 100   | 26 MB
Hashkeys 1000  | 30 MB
Hashkeys 10000 | 81 MB
To reproduce my results, use a sample query like select * from orders_cluster where merchantid = 100001 and transactionid = '1'; and change the last value to 1, 20, 300, 4000, and 50000.
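A back-of-the-envelope model shows why more hash keys mean fewer blocks per lookup. It assumes rows spread evenly across buckets and that about 330 rows fit in one block; that figure is a hypothetical value picked for illustration, not something measured in this test, and `blocks_per_bucket` is a made-up helper.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division."""
    return -(-numerator // denominator)


def blocks_per_bucket(row_count, hashkeys, rows_per_block):
    """Blocks a lookup must read if rows are spread evenly over the buckets."""
    rows_per_bucket = ceil_div(row_count, hashkeys)
    return ceil_div(rows_per_bucket, rows_per_block)


ROWS = 1_000_000
ROWS_PER_BLOCK = 330  # assumption for illustration only

for hashkeys in (100, 1_000, 10_000):
    print(hashkeys, blocks_per_bucket(ROWS, hashkeys, ROWS_PER_BLOCK))
# 100 31
# 1000 4
# 10000 1
```

Under those assumptions the model lands in the same range as the consistent gets listed above, which is consistent with the idea that each lookup reads every block holding its bucket's collisions.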
Performance Comparison
Consistent gets are predictable and easy to measure, but at the end of the day only the wall clock time matters. Surprisingly, the index access with 4 times more consistent gets is still faster than the optimal hash cluster scenario.
--3.5 seconds for b-tree access.
declare
v_count number;
begin
for i in 1 .. 100000 loop
select count(*)
into v_count
from orders_table
where merchantid = 100000 and transactionid = '1';
end loop;
end;
/
--3.8 seconds for hash cluster access.
declare
v_count number;
begin
for i in 1 .. 100000 loop
select count(*)
into v_count
from orders_cluster
where merchantid = 100000 and transactionid = '1';
end loop;
end;
/
I also tried the test with variable predicates but the results were similar.
Does it Scale?
No, hash clusters do not scale. Despite the O(1) time complexity of TABLE ACCESS HASH, and the O(log n) time complexity of INDEX UNIQUE SCAN, hash clusters never seem to outperform b-tree indexes.
I tried the above sample code with 10 million rows. The hash cluster was painfully slow to load, and still under-performed the index on SELECT performance. I tried to scale it up to 100 million rows but the insert was going to take 11 days.
The good news is that b*trees scale well. Adding 100 million rows to the above example only requires 3 levels in the index. I looked at all DBA_INDEXES for a large database environment (hundreds of databases and a petabyte of data) - the worst index had only 7 levels. And that was a pathological index on VARCHAR2(4000) columns. In most cases your b-tree indexes will stay shallow regardless of the table size.
In this case, O(log n) beats O(1).
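How slowly index depth grows can be checked with simple arithmetic. The sketch below assumes every index block holds 500 entries; that fanout is a hypothetical round number, since the real value depends on key size and block size, and `btree_levels` is a made-up helper.

```python
def btree_levels(row_count, fanout):
    """Smallest number of levels whose total capacity covers row_count."""
    levels = 1
    capacity = fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


FANOUT = 500  # assumed entries per index block, for illustration only

for rows in (1_000_000, 100_000_000, 1_000_000_000_000):
    print(rows, btree_levels(rows, FANOUT))
# 1000000 3
# 100000000 3
# 1000000000000 5
```

Going from a million rows to a trillion adds only two levels under this assumption, so the log n term behaves almost like a small constant in practice.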
But WHY?
Poor hash cluster performance is perhaps a victim of Oracle's attempt to simplify things and hide the kind of details necessary to make a hash cluster work well. Clusters are difficult to set up and use properly and would rarely provide a significant benefit anyway. Oracle has not put a lot of effort into them in the past few decades.
The commenters are correct that a simple b-tree index is best. But it's not obvious why that should be true and it's good to think about the algorithms used in the database.

Function-based Index using Substr and Instr

I have created a query doing this in ORACLE:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
Results after running:
539 rows selected. Elapsed: 00:00:00.22
I need to improve the performance of this, so I created a function-based index like so:
CREATE INDEX INDX_FIRSTWRD ON FILM(SUBSTR(title,1,INSTR(title,' ',1,1)));
After running the same query at the top of this post, I still get the same performance:
539 rows selected. Elapsed: 00:00:00.22
Is the index not being applied or overwritten or am I doing something wrong?
Thanks for any help you could provide. :)
EDIT:
Execution Plan:
----------------------------------------------------------
Plan hash value: 2033354507
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 20000 | 2968K| 138 (2)| 00:00:02 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 20000 | 2968K| 138 (2)| 00:00:02 |
| 3 | TABLE ACCESS FULL| FILM | 20000 | 2968K| 136 (0)| 00:00:02 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(COUNT(*)>=20)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
471 consistent gets
0 physical reads
0 redo size
14030 bytes sent via SQL*Net to client
908 bytes received via SQL*Net from client
37 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
539 rows processed
The problem is that the value you're using for the index may be null - if there is no space in the title (i.e. it's a one-word title like "Jaws") then your substr evaluates to null. That probably isn't what you want, incidentally - you probably want the end position to be conditional on whether there is a space at all, but that's beyond the scope of the question. (And even if you correct that logic, Oracle may still not be able to trust that the result can't be null, even if the underlying column is not nullable). Edit: see below for more on using nvl to handle single-word titles.
Since nulls aren't included in indexes, the single-title rows won't be indexed. But you're asking for all rows, and Oracle knows the index doesn't hold all rows, so it can't use the index to fulfil the query - even if you add a hint telling it to, it has to ignore that hint.
The only time the index will be used is if you include a filter that references the indexed value too, and explicitly or implicitly exclude nulls, e.g.:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
WHERE SUBSTR(title,1,INSTR(title,' ',1,1)) IS NOT NULL
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
(which also probably isn't what you actually want).
SQL Fiddle for queries with and without a filter, and with and without an index hint. (Click the 'execution plan' link against each result section to see whether it's doing a full table scan or a full index scan).
And another Fiddle showing that the index can't be used even with the filter if the filter still allows null values, again since they are not in the index.
Since SylvainLeroux brought it up, Oracle isn't quite clever enough to know the computed value can't be null if you coalesce it, even if the underlying column is not-null (as a function-based index or as a virtual column). Possibly because there could be a lot of branches to evaluate. But it is clever enough if you use the simpler and proprietary nvl instead:
CREATE INDEX INDX_FIRSTWRD
ON FILM(NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title));
SELECT NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title) AS first_word,
COUNT(*) AS word_count
FROM FILM
GROUP BY NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title)
HAVING COUNT(*) >= 20;
But only if title is defined as not-null. And coalesce does work if the virtual column is also declared not-null (thanks Sylvain).
SQL Fiddle with a function-based index and another with a virtual column.
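The null behaviour is easy to see by mimicking the two expressions outside the database. In this sketch `None` stands in for SQL NULL, and `first_word` and `first_word_nvl` are hypothetical helpers that follow the Oracle semantics described above (INSTR returns 0 when nothing is found, and SUBSTR with a length of 0 yields NULL).

```python
def first_word(title):
    """Mimic SUBSTR(title, 1, INSTR(title, ' ', 1, 1)); None means NULL."""
    position = title.find(" ") + 1  # INSTR is 1-based and returns 0 if absent
    if position == 0:
        return None                 # SUBSTR with length 0 yields NULL
    return title[:position]         # note: keeps the trailing space


def first_word_nvl(title):
    """Mimic NVL(SUBSTR(...), title): fall back to the whole title."""
    result = first_word(title)
    return title if result is None else result


titles = ["Star Wars", "Jaws", "Star Trek"]

# Entries that would appear in an index on each expression.
print([first_word(t) for t in titles if first_word(t) is not None])
print([first_word_nvl(t) for t in titles])
```

The first list has only two entries, so an index on that expression cannot answer a query over all three rows; the NVL version has one entry per row.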
539 rows selected. Elapsed: 00:00:00.22
Do you really think you need to tune a query which returns 539 rows in less than a second? 220 milliseconds, precisely! Think about it.
In your case, I think CBO does the best possible thing. And that is the reason it doesn't use the index. Because, to read every row from the table, using the index is an overhead. It needs to read the index and then do a table access by rowid. Probably, in your small table, it could read the entire table with less IO to fetch the data.
If the table is small enough to fit in a single block, then a full table scan requires just one I/O to fetch the required data from that block.
You can try to check the explain plan by hinting the query to use the index and see if anything really improves. Remember, you are trying unnecessarily to improve the performance of a query which executes in less than a second!
