How do I fetch sorted logs in Cassandra?

In my application we store logs in Cassandra. Users can see the logs after providing a start date and an end date. We fetch the data based on these dates and have implemented pagination as well, such that the end date of page one becomes the start date for page two.
Table:
CREATE TABLE audit_trail (
account_id bigint,
user_id bigint,
time timestamp,
category int,
ip_address text,
action_description text,
additional_data map<text,text>,
source int,
source_detail varchar,
PRIMARY KEY ( (account_id), time )
) WITH CLUSTERING ORDER BY (time DESC);
Problem:
The logs we get are not sorted but scattered. For example, when querying for logs of day 1 to day 10, we might get logs in the order day 10, 8, 9, 2, 1, or in any other order.
Aim:
To get the logs in sorted order, such that logs from day 1 are shown at the top, then day 2, and so on.
No data shuffling: if the order of rows changes between reads, data we have already seen on page 1 might show up again on page 2.
Data throughput is large, usually around 1000 logs per hour.

WITH CLUSTERING ORDER BY (time DESC);
Adding this clause at the end of the table definition solved the problem for me.
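Note that with time DESC as the clustering order, each partition is stored newest-first. If the aim is to show day 1 first, the read order can still be reversed per partition at query time. A minimal sketch, assuming the account id and the page's date range are bound by the application:
SELECT time, category, ip_address, action_description
FROM audit_trail
WHERE account_id = ?      -- the partition key must be restricted
  AND time >= ?           -- page start, e.g. day 1
  AND time < ?            -- page end (exclusive)
ORDER BY time ASC;        -- reverses the DESC clustering order within the partition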

Related

Postgres primary key 'less than' operation is slow

Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
Assume there are 100 million rows of random data in this table.
SELECT age FROM company WHERE id = 2855265
executes in less than a millisecond.
SELECT age FROM company WHERE id < 353
returns fewer than 50 rows and also executes in less than a millisecond.
Both queries use the index. But the following query uses a full table scan and takes about 3 seconds:
SELECT age FROM company WHERE id < 2855265
It returns fewer than 500 rows.
How can I speed up a query that selects on the primary key with a less-than condition?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
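For the slow query from the question that would be:
EXPLAIN (ANALYZE, BUFFERS)
SELECT age FROM company WHERE id < 2855265;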
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other, less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
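A minimal sketch of the statistics change, assuming the target for the id column is raised from the default of 100 to 1000 (the exact value is a judgment call):
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;  -- raise the per-column statistics target
ANALYZE company;                                          -- gather fresh statistics with the new target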
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store amount as *Cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to run VACUUM ANALYZE company; to update the planner statistics.

make the optimizer use all columns of an index

We have a few tables storing temporal data that have a natural primary key consisting of 3 columns. Example: maximum temperature for a given day. This is the composite primary key index (in this order):
id number(10): the id of the timeserie.
day date: the day for which this data was reported
kill_at timestamp: the last timestamp before this data was deleted or updated.
Simplified logic: when we make a forecast at 10:00am, the last entry found for this id/day combination has its kill_at changed to 9:59am, and the newly calculated value is stored with a kill_at timestamp of '31.12.2999'.
typical queries on this table are:
1) where id=? and day=? and kill_at=?
2) where id=? and day between (? and ?) and kill_at=?
3) where id=? and day between (? and ?)
4) where id=?
There are plenty of timeseries that we do not forecast. That means we get one value when it is measured and it never changes. But there are some timeseries that we forecast 200-300 times. So for one id/day combination there are 200+ entries with different values for kill_at.
The primary key (id, day, kill_at) is currently the only (unique) index on this table. But when I run query 2 (exact id plus a day range), the optimizer decides to use only the first column of the index.
ID OPERATION OPTIONS OBJECT_NAME OPTIMIZER SEARCH_COLUMNS
0 SELECT STATEMENT ALL_ROWS 0
1 FILTER 0
2 TABLE ACCESS BY INDEX ROWID DPD 0
3 INDEX RANGE SCAN DPD_PK 1
This really hurts us for those timeseries that have been updated 200+ times.
Now I was looking for a way to force the optimizer to use all 3 columns of our index, but I can't find a hint for that. Is there one?
Or are there any other suggestions on how to speed up my query? We try to reduce the peak durations; the average durations are of lesser concern.
What confuses me:
The above execution plan is what I see in dba_hist_sql_plan. It is the only execution plan for this statement. But when I let my client show the explain plan, search_columns is sometimes 1 and sometimes 3. However, it is never 3 when our application runs this statement.
We actually found the cause of this problem. We're using JPA/JDBC and the JDBC date types weren't modeled correctly. While the Oracle DATE type has second precision, somebody (I now hate him) made the "day" attribute in our entity of type java.sql.Timestamp (although it holds only a day without a time component).
The effect is that Oracle will need to cast (use a function on) each entry in the table to make it a Timestamp before it can compare with the Timestamp query parameter. That way the index cannot be used properly.
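Besides fixing the entity mapping, a possible SQL-side workaround (only a sketch under assumptions, not part of the original fix; the bind names are illustrative) is to cast the Timestamp binds down to DATE so the comparison happens on the column's native type and all three index columns can be used:
SELECT *
FROM   dpd
WHERE  id      = :id
AND    day     BETWEEN CAST(:day_from AS DATE) AND CAST(:day_to AS DATE)  -- DATE compared with DATE
AND    kill_at = :kill_at;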

Hadoop partitioning. How do you efficiently design a Hive/Impala table?

How do you efficiently design a Hive/Impala table considering the following facts?
The table receives about 100 million rows of tool data every day. The date on which the data is received is stored in a column in the table, along with the tool id.
Each tool has about 500 runs per day, identified by a run id column. Each run id holds approximately 1 MB of data.
The default block size is 64 MB.
The table can be searched by date, tool id and run id in this order.
If you are doing analytics on this data then a solid choice with Impala is using the Parquet format. What has worked well for our users is to partition the date by year, month, day based on a date value on the record.
So for example: CREATE TABLE foo (tool_id int, eff_dt timestamp) PARTITIONED BY (year int, month int, day int) STORED AS PARQUET
When loading the data into this table we use something like this to create dynamic partitions:
INSERT INTO foo partition (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
Then you train your users that, if they want the best performance, they should add YEAR, MONTH, and DAY to their WHERE clause so that it hits the partitions. Then have them add eff_dt to the SELECT list so they have a date value in the format they like to see in their final results.
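For example, a query pruned to a single day's partition might look like this (the values are placeholders):
SELECT tool_id, eff_dt
FROM foo
WHERE year = 2016 AND month = 4 AND day = 2  -- partition pruning
  AND tool_id = 42;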
In CDH, Parquet is storing by default data in 256MB chunks (which is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html
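If you want a different target, the option can be set per session in impala-shell before the INSERT, for example (128m is just an illustration):
SET PARQUET_FILE_SIZE=128m;  -- applies to subsequent INSERT ... SELECT statements in this session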

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written:
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that this is not possible given the table I created, but what are the alternatives and best practices to solve it?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system; it requires you to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan. Indeed, as long as you don't fetch all records, the ordering cannot be achieved, unless you use a Cassandra clustering column to order at insertion time. But in that case all posts are stored in the same partition, and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
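As a rough sketch of the materialized view variant (assuming Cassandra 3.0+; the view name is made up):
CREATE MATERIALIZED VIEW posts_by_content AS
    SELECT * FROM posts
    WHERE username IS NOT NULL AND creation IS NOT NULL AND content IS NOT NULL
    PRIMARY KEY ((username), content, creation);
SELECT * FROM posts_by_content WHERE username = 'luke'; then returns the posts already ordered by content within that partition.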
Question 1:
I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year, year-month, or year-month-day, depending on your use case (or even finer time intervals).
The basic idea is that you bucket changes into whatever granularity suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the posts from yesterday or the past couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert rows, deriving the partition key (PK) columns creation_year, creation_month, creation_day from the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function that generates a timeuuid. You would probably want to generate this in the application instead, parse out the yyyy-mm-dd parts for the PK, and then insert the timeuuid into the clustering column.
As a usage example for this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another CQL function; it creates the smallest possible timeuuid for the given time, which guarantees that you get all of the changes from that time onwards.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
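A rough sketch of such a logged batch, assuming the posts and posts_by_creation tables above and placeholder values:
BEGIN BATCH
    INSERT INTO posts (username, creation, content)
    VALUES ('luke', '2016-04-02 17:30:00+0000', 'hello world');
    INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
    VALUES (2016, 4, 2, now(), 'luke', 'hello world');
    -- a third INSERT would target whatever denormalized table serves the content-ordered query
APPLY BATCH;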
The solution is to create additional tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need a special column to group on, for example by month and year, e.g. PRIMARY KEY ((year, month), timestamp). This way Cassandra will have better read performance because it doesn't need to scan the whole cluster to get all the data, and it also saves data transfer between nodes.
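A sketch of what such a table could look like (the name and exact bucketing are illustrative):
CREATE TABLE posts_by_month (
    year int,
    month int,
    creation timestamp,
    username varchar,
    content varchar,
    PRIMARY KEY ((year, month), creation)
) WITH CLUSTERING ORDER BY (creation ASC);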
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All columns may be the same as in your first table, but with a different primary key, because you cannot order by a column that is not a clustering column.
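For example (again only a sketch; the table name is made up, and creation is kept in the clustering key so two posts with the same content don't overwrite each other):
CREATE TABLE posts_by_username_content (
    username varchar,
    content varchar,
    creation timestamp,
    PRIMARY KEY ((username), content, creation)
);
SELECT * FROM posts_by_username_content WHERE username = 'luke'; then comes back already ordered by content.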

Deleting very large table records where id not in another table

I have one table, values, that has 80 million records, and another table, values_history, that has 250 million records.
I want to filter the values_history table and keep only the rows whose id is present in the values table.
delete from values_history where id not in (select id from values);
This query takes such a long time that I have to abort the process.
Please suggest some ideas to speed up the process.
Can I delete the records in batches, say 1,000,000 at a time?
I extracted the required records and inserted them into a temp table; this took 2 hours. After that I dropped the data from the main table and then inserted the extracted data back into it. The whole process took around 4 hours, which is fine for me. I had dropped the foreign keys and all other constraints before that.
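A rough sketch of that keep-and-swap approach (generic SQL; exact syntax depends on your RDBMS, the table name values may need quoting if it is a reserved word, and constraints/foreign keys have to be dropped or disabled first as noted):
-- 1. keep only the rows whose id exists in the values table
CREATE TABLE values_history_keep AS
    SELECT vh.*
    FROM values_history vh
    WHERE EXISTS (SELECT 1 FROM values v WHERE v.id = vh.id);
-- 2. empty the big table and reload the kept rows
TRUNCATE TABLE values_history;
INSERT INTO values_history SELECT * FROM values_history_keep;
DROP TABLE values_history_keep;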
