I am testing CrateDB with a data set of 80 million events sent from a web app, both as a normalized, relational solution, and also as a denormalized, single database solution.
I imported all 80 million denormalized events into a table, and ran the following aggregation query:
select productName, SUM(elapsed)/60 as total_minutes from denormalized
where country_code = 'NL' AND eventType = 'mediaPlay'
group by productName
order by total_minutes desc
limit 1000;
and the query took .009 seconds. Wowza! CrateDB is blazing fast!
Then I imported the sessionwide docs into one table called "sessions", and all the individual event docs in each session into another table called "events", and ran the following query:
select e.productName, SUM(e.elapsed)/60 as total_minutes from sessions s
join events e ON e.sessionGroup = s.sessionGroup
where s.country_code = 'NL' AND e.eventType = 'mediaPlay'
group by e.productName
order by total_minutes desc
limit 1000;
which took 21 seconds.
My question is, is there any way to get faster relational performance, maybe by creating indexes, or changing the query somehow?
Tangential thought:
We have been using Elasticsearch for analytics, obviously denormalizing the data, and it's plenty fast, but CrateDB seems to offer everything Elasticsearch does (fast queries on denormalized data, clustering, dynamic schema, full text search), plus the additional advantages of:
better SQL support
the option to deploy relational solutions on small data sets (wonderful to standardize on one DB, no context-switching or ramp up for developers who know SQL).
What CrateDB version are you using? If it is < 3.0, then upgrading will probably boost the join query a lot; see https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three/.
We are working with a Vertica 8.1 table containing 500 columns and 100 000 rows.
The following query will take around 1.5 seconds to execute, even when using the vsql client straight on one of the Vertica cluster nodes (to eliminate any network latency issue) :
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? If not, why does this happen?
Thanks in advance
It sounds like your query is running against the (default) superprojection, which includes all columns of the table. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
I was running Vertica 8.1.0-1. The issue appears to have been a Vertica bug in the query planning phase that degraded performance. It was fixed in versions >= 8.1.1:
[https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm]
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.
For the purpose of this question, let's pretend I have the following table:
Transaction:
Id
ProductId
ProductName
City
State
Country
UnitCost
SellAmount
NumberOfTimesPurchased
Profit (NumberOfTimesPurchased * (SellAmount - UnitCost))
Basically, a single de-normalized table with a million plus rows in it. It is important to note that only two columns will ever be updated: Profit and NumberOfTimesPurchased. When a sale is made, NumberOfTimesPurchased is updated and the new profit amount is recalculated.
Now, I need to do some minimal reporting on this table, which consists of queries that aggregate and group. As an example:
SELECT
City, AVG(UnitCost), AVG(SellAmount),
SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
Transaction
GROUP BY
City
SELECT
State, AVG(UnitCost), AVG(SellAmount), SUM(NumberOfTimesPurchased),
AVG(Profit)
FROM
Transaction
GROUP BY
State
SELECT
Country, AVG(UnitCost), AVG(SellAmount), SUM(NumberOfTimesPurchased),
AVG(Profit)
FROM
Transaction
GROUP BY
Country
SELECT
ProductId, ProductName, AVG(UnitCost), AVG(SellAmount),
SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
Transaction
GROUP BY
ProductId, ProductName
These queries are quick: ~1 second. However, I've noticed that under load, performance significantly drops (from 1 second up to a minute when there are 20+ concurrent requests), and I'm guessing the reason is that each query performs a full table scan.
I've attempted to use indexed views for each query, however my update statement performance takes a beating since each view needs to be rebuilt. On the same note, I've attempted to create covering indexes for each query, but again my update statement performance is not acceptable.
Assuming full table scans are the culprit, do I have any realistic options to get the query time down while keeping update performance at acceptable levels?
Note that I cannot use column store indexes (I'm using the cheaper version of Azure SQL Database). I'd also like to stay away from any sort of roll-up implementation, as I need the data available immediately.
Finally - the example above is not a completely accurate representation of my table. I have 20 or so different columns that can be 'grouped', and 6 columns that can be updated. No inserts or deletes.
Because there are no WHERE clauses on your queries, the database engine can do nothing but a table scan (or a clustered index scan, which is really the same thing). If there were covering indexes containing all the columns from your query, then the engine would prefer those. If your real queries have WHERE clauses, then appropriate indexing with those columns as the leading columns of the index might help.
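To make the covering-index point concrete, here is a minimal sketch using SQLite through Python's sqlite3 module as a stand-in. It is not Azure SQL Database, and the table name txn, the index name ix_city_cover and the sample rows are all made up for illustration. It only shows that the planner switches from a full table scan to a covering index once one exists; it says nothing about the update cost the question is worried about.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn ("
    " Id INTEGER PRIMARY KEY, City TEXT, UnitCost REAL, SellAmount REAL,"
    " NumberOfTimesPurchased INTEGER, Profit REAL)"
)
# Hypothetical sample rows; the real table has a million plus rows.
rows = [
    (i, f"City{i % 5}", 1.0 + i % 3, 2.0 + i % 7, i % 4, 0.5 * i)
    for i in range(1000)
]
conn.executemany("INSERT INTO txn VALUES (?, ?, ?, ?, ?, ?)", rows)

QUERY = (
    "SELECT City, AVG(UnitCost), AVG(SellAmount),"
    " SUM(NumberOfTimesPurchased), AVG(Profit) FROM txn GROUP BY City"
)

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(QUERY)  # no usable index yet: full table scan

# Covering index: the GROUP BY column first, then every column the query reads.
conn.execute(
    "CREATE INDEX ix_city_cover ON txn"
    " (City, UnitCost, SellAmount, NumberOfTimesPurchased, Profit)"
)
plan_after = plan(QUERY)  # planner can now answer from the index alone

print(plan_before)
print(plan_after)
```

Every UPDATE of NumberOfTimesPurchased or Profit also has to maintain an index like this, which is the trade-off described in the question.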
But I think your problem lies elsewhere. As far as concurrency goes you haven't put enough money in the meter. According to the main service tiers doc, the Basic tier for Azure SQL Database is for:
"... supporting typically one single active operation at a given time. Examples include databases used for development or testing, or small-scale infrequently used applications."
Therefore you might want to think about splashing out for Premium edition to support both your concurrency requirement and columnstore indexes, which are perfectly suited to this type of query. Just for fun, I created a test-rig based on AdventureWorksDW2012 to try and recreate your problem which is here. Query performance was atrocious (> 20 secs). I'd be surprised if you weren't getting DTU warnings on your portal:
An upgrade to Standard (S0-S2) did boost performance so you should experiment. You could look at scaling up for busy query times and down when not required.
This table also looks a bit like a fact table, so you might want to consider refactoring this as a fact / dimensional model then use Azure Analysis Services on top to bring that sub-second performance.
Coincidentally there is a feedback item you can vote for to bring columnstore to Standard tier:
https://feedback.azure.com/forums/217321-sql-database/suggestions/6878001-make-sql-column-store-feature-available-for-standa
Recent comments suggest it is "in the work queue" as of May 2017.
Recently I used an Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have the same structure and almost the same number of records. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
I did the same for trip_data_12.
Then I used the following select statement to count the number of taxis per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It took 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: the query ran for more than 2 hours.
What's more, it never finished. I don't know why.
Today I asked a classmate, and he told me to clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I run the same select statement against SYS.TRIP_DATA_11, I get the same poor performance as with SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fill -74.2593 to -73.7011 ? It might be a lot more than say -71.00 to -68.59 even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
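As a small illustration of why the width of a range says little about how many rows it matches, here is a sketch with made-up longitudes: a dense cluster inside the narrow range and a sparse scattering inside the broad one. The data and the helper rows_in_range are hypothetical and exist only to show the effect of skew.

```python
# Hypothetical longitudes: heavily clustered in a narrow band, sparse elsewhere.
longitudes = [-74.0 + 0.0001 * i for i in range(5000)]  # dense cluster
longitudes += [-71.0 + 0.01 * i for i in range(200)]    # sparse values

def rows_in_range(values, low, high):
    """Count how many values fall inside the closed range [low, high]."""
    return sum(1 for v in values if low <= v <= high)

# Narrow range from the query versus the broader range mentioned above.
narrow = rows_in_range(longitudes, -74.2593, -73.7011)
broad = rows_in_range(longitudes, -71.00, -68.59)

print("narrow range width:", round(-73.7011 - (-74.2593), 4), "rows:", narrow)
print("broad range width:", round(-68.59 - (-71.00), 4), "rows:", broad)
```

With this distribution the narrow range matches far more rows than the broad one, so an index range scan over it does correspondingly more work.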
As a first step learn how to use EXPLAIN PLAN. To get better plans, gather statistics on your tables and their indexes, using the DBMS_STATS package.
One tip. Oracle generally uses only one index to access a table. So it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from an index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add medallion to the index too). Experiment with the column order. As latitude has a narrower range than longitude probably that should go first; maybe drop-off value should appear before pick-up. You want an index in which the greatest number of related records are clustered together.
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin's right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc. in that.
I'm using a SQLite database (1GB+) in which there's only one table of stacked financial tick time series. I'm using RSQLite package to connect to the database. My situation is very simple: select a subset of data from one date to the other, e.g. from 20140101 to 20140930. Therefore the query is simple too:
SELECT * FROM DATA WHERE YMD BETWEEN 20140101 AND 20140930
The data is stored on an SSD, but reading still seems slow. Are there any ways or tips to boost the performance of such a simple subsetting query?
The database has 14 columns and on automatic index.
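One thing worth checking is whether the YMD column is indexed at all. The sketch below uses Python's sqlite3 module rather than RSQLite, with a tiny made-up DATA table (only two of the 14 columns, and a hypothetical index name idx_data_ymd), and assumes YMD is not already indexed. It compares the query plan before and after creating the index.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
# Two columns stand in for the real 14-column table (assumption).
conn.execute("CREATE TABLE DATA (YMD INTEGER, PRICE REAL)")

# Hypothetical ticks: one row per day for 2013 and 2014, YMD stored as an integer.
start = date(2013, 1, 1)
rows = []
for offset in range(730):
    day = start + timedelta(days=offset)
    rows.append((int(day.strftime("%Y%m%d")), 100.0 + offset))
conn.executemany("INSERT INTO DATA VALUES (?, ?)", rows)

QUERY = "SELECT * FROM DATA WHERE YMD BETWEEN 20140101 AND 20140930"

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(QUERY)             # full scan of DATA
count_before = len(conn.execute(QUERY).fetchall())

conn.execute("CREATE INDEX idx_data_ymd ON DATA (YMD)")

plan_after = plan(QUERY)              # range search on the new index
count_after = len(conn.execute(QUERY).fetchall())

print(plan_before)
print(plan_after)
```

The same CREATE INDEX statement can be sent from R. An index helps most when the date range selects a small part of the table; if the range covers most of the rows, the time is dominated by reading and transferring the data itself.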
We have a large table, with several indices (say, I1-I5).
The usage pattern is as follows:
Application A: all select queries 100% use indices I1-I4 (assume that they are designed well enough that they will never use I5).
Application B: has only one select query (fairly frequently run), which contains 6 fields and for which a fifth index I5 was created as a covered index.
The first 2 fields of the covered index are date, and a security ID.
The table contains rows for ~100 dates (in date order, enforced by a clustered index I1), and tens of thousands of security identifiers.
Question: does the order of columns in the covered index affect the performance of the select query in Application B?
I.e., would the query performance change if we switched around the first two fields of the index (date and security ID)?
Would the query performance change if we switch around one of the last fields?
I am assuming that the logical IOs would remain un-affected by any order of fields in the covered index (though I'm not 100% sure).
But will there be other performance effects? (Optimizer speed, caching, etc...)
The question is version-generic, but if it matters, we use Sybase 12.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
It depends. If you have a WHERE clause such as the following, you will get better performance out of an index on (security_ID, date_column) than the converse:
WHERE date_column BETWEEN DATE '2009-01-01' AND DATE '2009-08-31'
AND security_ID = 373239
If you have a WHERE clause such as the following, you will get better performance out of an index on (date_column, security_ID) than the converse:
WHERE date_column = DATE '2009-09-01'
AND security_ID > 499231
If you have a WHERE clause such as the following, it really won't matter very much which column appears first:
WHERE date_column = DATE '2009-09-13'
AND security_ID = 211930
We'd need to know about the selectivity and conditions on the other columns in the index to know if there are other ways of organizing your index to gain more performance.
Just like your question is version generic, my answer is DBMS-generic.
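The first case can be checked on any engine that exposes its query plan. The sketch below uses SQLite via Python's sqlite3 module as a stand-in (not Sybase), with hypothetical table and index names and dates stored as text. It builds the two column orders on otherwise identical empty tables and prints the plan for the range-on-date, equality-on-security query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two identical tables, each with one two-column index in a different order.
# Table and index names are hypothetical.
layouts = (
    ("t_sec_first", "security_ID, date_column"),
    ("t_date_first", "date_column, security_ID"),
)
for name, cols in layouts:
    conn.execute(
        f"CREATE TABLE {name} (date_column TEXT, security_ID INTEGER, price REAL)"
    )
    conn.execute(f"CREATE INDEX ix_{name} ON {name} ({cols})")

# Range condition on the date, equality condition on the security ID.
WHERE = (
    "WHERE date_column BETWEEN '2009-01-01' AND '2009-08-31'"
    " AND security_ID = 373239"
)

def plan(table):
    """Return the EXPLAIN QUERY PLAN detail text for the query on one table."""
    sql = f"EXPLAIN QUERY PLAN SELECT * FROM {table} {WHERE}"
    return " | ".join(row[3] for row in conn.execute(sql))

plan_sec_first = plan("t_sec_first")
plan_date_first = plan("t_date_first")

print(plan_sec_first)   # index seek can use both columns
print(plan_date_first)  # index seek cannot use the security_ID equality
```

With the equality column leading, the engine can jump to one security and then walk a contiguous date range. With the date leading, it has to walk every index entry in the date range and filter on security ID afterwards.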
"Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult."
The problem is not the size of the table. Millions of rows is nothing for Sybase.
The problem is an absence of a test system.