Oracle: Efficient query over date interval

I have tables representing events with start and end time (stored as DATE, TIMESTAMP, or TIMESTAMP WITH TIME ZONE), e.g.
MY_TABLE:
DATA START END
A 1/11/2012 10:00 1/11/2012 12:00
B 2/12/2012 08:00 2/12/2012 16:00
And it frequently happens to have queries on such tables over time intervals, e.g.
SELECT data
FROM my_table
WHERE start BETWEEN t1 AND t2; -- usually we either use start or end time for every row.
Where t1 and t2 are DATE/TIMESTAMP values such that t1 <= t2.
Since these queries are going to be run on large tables, is there a better, that is, more efficient, way of performing queries like the one above?
Currently I don't know whether any table with the structure above has an index on either of the time columns, but it would hardly be a problem to add them. They do have indexes on other, non-time-related columns.
I remember having read somewhere that, using analytic functions (like partition), this kind of query could be made much more efficient than simply using BETWEEN .. AND. Unfortunately I can't find the link anymore, and I don't know analytic functions; I have only read a few short introductions here and there on the net.
Since I have little time to investigate, I'd like to ask whether you can confirm my theory and point me to an example related to my problem.
It goes without saying that I'm not asking for a quick answer to my problem, something to copy & paste, just a hint to tell whether I'm looking in the right direction.
TIA
EDIT:
@jonearles: I'd agree with the first statement, but I'd like to know whether the use of analytic functions isn't actually able to provide a more efficient query.
For the latter, yes, I meant the PARTITION BY clause. It occurs to me that this is a silly clarification, since analytic functions are expected to be used with a PARTITION BY clause.
I apologize for the confusion; as I said before, I haven't looked much into the subject.

Your query is probably fine. The simplest way to write a predicate is usually the best. That's what Oracle expects, and is most likely what Oracle is optimized for.
You probably want to look into creating objects to improve the access methods. Specifically indexes (if you're selecting a small amount of data) and partitions (if you're selecting a large amount of data).
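For instance, a minimal sketch of the index approach (the column names start_time/end_time below are stand-ins for the START/END columns shown above):
-- A plain B-tree index on the start time lets the BETWEEN predicate
-- use an index range scan instead of a full table scan.
CREATE INDEX my_table_start_ix ON my_table (start_time);

SELECT data
FROM   my_table
WHERE  start_time BETWEEN TIMESTAMP '2012-01-11 00:00:00'
                      AND TIMESTAMP '2012-01-11 23:59:59';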
"Partition" can mean at least three different things in Oracle. Perhaps you've confused the "partition by" clause in analytic functions with partitioning?

Related

Oracle partitioning recommendations

Due to being locked down by Corona, I don't have easy access to my more knowledgeable colleagues, so I'm hoping for a few possible recommendations here.
We do quarterly and yearly "freezes" of a number of statistical entities with a large number (1-200) of columns. Everyone then uses these "frozen" versions as a common basis for all statistical releases in Denmark. Currently, we simply create a new table for each version.
There's a demand to test if we can consolidate these several hundred tables to 26 entity-based tables to make programming against them easier, while not harming performance too much.
A "freeze" is approximately 1 million rows and consists of: Year + Period + Type + Version.
For example:
2018_21_P_V1 = Preliminary Data for 2018 first quarter version 1
2019_41_F_V2 = Final Data for 2019 yearly version 2
I am simply not very experienced in the world of partitions. My initial thought was to partition on Year + Period and subpartition on Type + Version, but I am no longer sure this is the right approach, nor do I have a clear picture of which partitioning type would solve the problem best.
I am hoping someone can recommend an approach as it would help me tremendously and save me a lot of time "brute force" testing a lot of different combinations.
Based on the situation you explained, I highly recommend that you use partitioning. No doubt about it.
It's highly effective and easy to use. You can read the Oracle documentation about partitioning, or search the web, to understand how to get started.
In general, when you partition a table, Oracle treats each partition as a separate table, so don't worry about the speed of fetching data.
The most important step is to choose the best field(s) to base your partitions on. I have used dates stored as NUMBER/INT values, e.g. 20190506 for a daily basis or 201907 for a monthly basis. You should design and test it.
The next step is to decide about subpartitions. In some cases you don't really need them; it depends on your data structure and what you expect from the data. What do you want to do with the data? Which fields are the most important (used in WHERE clauses, etc.)?
Then create index(es) for each partition. Very important.
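As an illustration only (all table, column, and partition names below are invented), a range-partitioned freeze table with a local index might look like this:
-- Numeric YEAR||PERIOD key (e.g. 201821) used as the range partition key.
CREATE TABLE freeze_data
(
  year_period  NUMBER(6)   NOT NULL,  -- e.g. 201821
  freeze_type  VARCHAR2(1) NOT NULL,  -- P = preliminary, F = final
  version      NUMBER(2)   NOT NULL,
  entity_id    NUMBER      NOT NULL
  -- ... the remaining statistical columns
)
PARTITION BY RANGE (year_period)
(
  PARTITION p2018 VALUES LESS THAN (201900),
  PARTITION p2019 VALUES LESS THAN (202000),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

-- A LOCAL index is split into one index segment per partition.
CREATE INDEX freeze_data_typ_ver_ix
  ON freeze_data (year_period, freeze_type, version) LOCAL;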
Another important point is that using partitions may require some changes in the way you code in PL/SQL. For example, you cannot use two or more partitions in a single query at the same time; you have to select and fetch data from the different partitions one by one.
And don't worry about 1 million records. I have used partitioning for tables far larger than this, and it works fine.
Good luck.

Big Data Analytics Choice of Technology

I have been asked to assess the possible choice of technology we need to use for the problem described below. Possible options are Hadoop, Hive, and Pig. I do not have much experience with any of those, so it would help if you could point out a good source to read. I Google and find tons of references, but it is hard to find a step-by-step explanation or comparison.
Here is the task I need to solve.
Users enter sentences into the system. Sentences are broken into words and stored in a Cassandra column family. Each row is a single word (the key), and the column names are the timestamps at which the record was entered, with no column values.
I need to be able to query the database and extract N words that are taken from the following breakdown:
a_1% must be the top words from period T1 from now into the past
a_2% must be the top words from period T2 from now into the past
a_3% must be the top words from period T3 from now into the past
a_n% must be the top words from period T_n from now into the past
a_1+a_2+...a_n = 100%
and T1, T2, etc are arbitrary time intervals.
Any suggestion for a choice of technology I should use for this task would be greatly appreciated. We are using Cassandra and are quite familiar with it; now we need to decide which analytical tool to put on top of it.
Links or specifics would be quite appreciated.
If you have the data partitioned (by time intervals) in Hive, finding such 'top words' combinations could be achieved with one query in Hive. The HiveQL syntax might also help with additional analytics in the future, especially for people who know SQL. The question is how to integrate Cassandra with Hadoop; I hope someone can say something about that. GL!
EDIT: There is a nice chapter about integrating Cassandra and Hive.
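To give a rough idea of the kind of Hive query meant here (the table and column names are assumptions, presuming the Cassandra data has been exposed to Hive as a words table partitioned by time):
-- Top words for one period T_n; repeat per period and take the
-- corresponding a_n% share of each result.
SELECT word, COUNT(*) AS occurrences
FROM   words
WHERE  entered_at >= '2012-05-01 00:00:00'   -- "now" minus T_n
GROUP BY word
ORDER BY occurrences DESC
LIMIT 100;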
The term Big Data is not unknown to most tech people, although there is some confusion about it in everyone's mind. Explained from a layman's point of view, it means a large volume of structured as well as unstructured data. A natural question after hearing this definition is: where does this large amount of data come from? The answer is that we produce data whenever we communicate with friends, make digital transactions, or shop online.
What solutions is Big Data providing that seemed impossible even a few years ago?
We already know that information, photographs, text, voice, and video data are the basis of big data, and big data is now involved in many projects helping mankind.

Asking for opinions : One sequence for all tables

Here's another one I've been thinking about lately.
We have concluded in earlier discussions: 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that Hibernate by default creates one sequence for all tables. At first I was puzzled by this: why would you do that? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables share the same primary key values, accidentally linking a parent to a table that is not its child gives no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only one thread, or only a few, request new values, there will be no locking issues. But a bad implementation could cause congestion.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
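For reference, the database side of this is nothing more than a single shared sequence (Hibernate's default one is named hibernate_sequence); the table names below are just examples:
CREATE SEQUENCE hibernate_sequence START WITH 1 INCREMENT BY 1 CACHE 100;

-- Every table draws its primary key from the same sequence,
-- so no two rows in the whole schema share an ID.
INSERT INTO customer (id, name)
VALUES (hibernate_sequence.NEXTVAL, 'ACME');

INSERT INTO orders (id, customer_id, order_date)
VALUES (hibernate_sequence.NEXTVAL, :cust_id, SYSDATE);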
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much slower.
I see little value in having only one sequence for all tables. The arguments given so far do not convince me.
There are a couple of disadvantages to using a single sequence:
Reduced concurrency. Handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem.
Oracle has special code when maintaining B-tree indexes to detect monotonically increasing values and balance the tree appropriately.
The CBO might have an easier time estimating range queries on the index (if you ever did this) if most values were filled in.
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, making each id column a uuid, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless massive numbers of records are vying for the same DB resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails. Though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored

Oracle Hierarchical Query Performance

We're looking at using Oracle hierarchical queries to model potentially very large tree structures (potentially infinitely wide, with a depth of 30+). My understanding is that hierarchical queries provide a way to write recursively joining SQL, but that they do not provide any real performance enhancement over manually writing an equivalent query... is this the case? What sort of experiences have people had, performance-wise, with Oracle hierarchical queries?
Well, the short answer is that without the hierarchical extension (CONNECT BY) you couldn't write a recursive query at all; you could only programmatically issue many queries that were recursively linked.
The rule of thumb with everything database-related, especially Oracle, is that if you can get your result in a single query, it will almost always be faster than doing it programmatically.
My experiences have been with much smaller sets, so I can't speak for how well hierarchical queries will perform on large sets.
When doing these tree retrievals, you typically have these options
Query everything and assemble the tree on the client side.
Perform one query for each level of the tree, building on what you know that you need from the previous query results
Use the built-in stuff Oracle provides (START WITH, CONNECT BY PRIOR).
Doing it all in the database will reduce unnecessary round trips or wasteful queries that pull too much data.
Try partitioning the data within your hierarchical table and then limiting the query to the relevant partition.
CREATE TABLE loopy
(
  key      NUMBER,
  key_hier NUMBER,         -- parent key
  info     VARCHAR2(100),
  part     NUMBER
)
PARTITION BY RANGE (part)
(
  PARTITION low  VALUES LESS THAN (1000),
  PARTITION mid  VALUES LESS THAN (10000),
  PARTITION high VALUES LESS THAN (MAXVALUE)
);

SELECT info
FROM   loopy PARTITION (mid)
START WITH key = <some value>
CONNECT BY PRIOR key = key_hier;
The interesting problem now becomes your partitioning strategy. Oracle provides several options.
I've seen that using CONNECT BY can be slow, but compared to what? There isn't really another option except building a result set using recursive PL/SQL calls (slower) or doing it on the client side.
You could try separating your data into a mapping table (the hierarchy definition) and a lookup table (the display data) and then joining them back together. I guess I wouldn't expect much of a gain, assuming you are getting the hierarchy data from indexed fields, but it's worth a try.
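A sketch of that idea, with invented table names (node_map holds only the parent/child keys, node_detail the wide display columns):
-- Walk the narrow mapping table, then join the display data on at the end.
SELECT d.info
FROM  (SELECT node_id
       FROM   node_map
       START WITH node_id = :root_id
       CONNECT BY PRIOR node_id = parent_id) h
JOIN   node_detail d ON d.node_id = h.node_id;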
Have you tried it using CONNECT BY yet? I'm a big fan of trying different variations.

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that this may have a detrimental effect on performance, and thus one of the indices needs to have its first two fields switched.
I would prefer to avoid the change if it is not necessary, since they didn't back up the assertion with any facts or reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which, if any, of the indexes to use when there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes - updating five indexes takes noticeably longer than updating just one, plus you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of the indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table, which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean, you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation: having five indices with the leading columns all the same would not cause a "performance problem", beyond that which you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query, but today it is using index_4?".
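(As an aside, you don't have to guess which index the optimiser picked; in Sybase ASE the showplan output reports the chosen index. With a stand-in table and query:)
set showplan on
go
select * from my_big_table where insert_date >= '2010-01-01'
go
set showplan off
go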
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with the recent versions of Sybase, but in general, with all SQL servers,
the main (and almost only) performance impact indexes have is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regards to SELECT queries, having "too many" indexes may have a minor performance impact, for example by having more pages competing for the cache. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, assuming a generally monotonic progression of the date value, is a positive thing (with regard to CRUD operations), for it will keep the need for splitting/balancing the index pages to a minimum (since most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it, to assess performance issues in a more systematic fashion, could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide, or the queries-per-second rate is high, etc.)

Resources