Does setting "NOT NULL" on a column in postgresql increase performance? - performance

I know this is a good idea in MySQL. If I recall correctly, in MySQL it allows indexes to work more efficiently.

Setting NOT NULL has no effect per se on performance. A few cycles for the check - irrelevant.
But you can improve performance by actually using NULLs instead of dummy values. Depending on data types, you can save a lot of disk space and RAM, thereby speeding up ... everything.
The null bitmap is only allocated if there are any NULL values in the row. It's one bit for every column in the row (NULL or not). For tables up to 8 columns the null bitmap is effectively completely free, using a spare byte between tuple header and row data. After that, space is allocated in multiples of MAXALIGN (typically 8 bytes, covering 64 columns). The difference is lost to padding. So you pay the full (low!) price for the first NULL value in each row. Additional NULL values can only save space.
The minimum storage requirement for any non-null value is 1 byte (boolean, "char", ...) or typically much more, plus (possibly) padding for alignment. Read up on data types or check the gory details in the system table pg_type.
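If you want to see the effect for yourself, here is a minimal sketch (PostgreSQL; the table and column names are made up): pg_column_size() reports the size of a whole row, and the row that stores a NULL instead of a dummy value comes out smaller.

CREATE TEMP TABLE demo (id int, note text);
INSERT INTO demo VALUES (1, 'n/a'), (2, NULL);             -- dummy value vs. NULL
SELECT id, pg_column_size(demo.*) AS row_bytes FROM demo;  -- the NULL row reports fewer bytes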
More about null storage:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
The manual.

It's always a good idea to keep columns from being NULL if you can avoid it, because the semantics of using them are so messy; see What is the deal with NULLs? for a good discussion of how they can get you into trouble.
In versions of PostgreSQL up to 8.2, the software didn't know how to do comparisons on the most common index type (the B-tree) in a way that would include finding NULL values. In the relevant bit of documentation on index types, you can see that described as "but note that IS NULL is not equivalent to = and is not indexable". The effective downside is that if you specify a query that requires including NULL values, the planner might not be able to satisfy it using the obvious index for that case. As a simple example, if you have an ORDER BY that could be accelerated with an index, but your query needs to return NULL values too, the optimizer can't use that index because the result would be missing the NULL data--and therefore be incomplete and useless. The optimizer knows this and will do an unindexed scan of the table instead, which can be very expensive.
PostgreSQL improved this in 8.3, "an IS NULL condition on an index column can be used with a B-tree index". So the situations where you can be burned by trying to index something with NULL values have been reduced. But since NULL semantics are still really painful and you might run into a situation where even the 8.3 planner doesn't do what you expect because of them, you should still use NOT NULL whenever possible to lower your chances of running into a badly optimized query.
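To make that concrete, here is a small sketch (hypothetical table and index; behaviour as described above for 8.3 and later):

CREATE TABLE orders (id serial PRIMARY KEY, shipped_at timestamptz);  -- shipped_at is nullable
CREATE INDEX orders_shipped_idx ON orders (shipped_at);

-- 8.3+: the B-tree index can satisfy IS NULL; before 8.3 this forced a sequential scan
EXPLAIN SELECT * FROM orders WHERE shipped_at IS NULL;

-- an ordered query that must also return the NULL rows can use the index too
-- (a default ASC index stores NULLs last)
EXPLAIN SELECT * FROM orders ORDER BY shipped_at NULLS LAST LIMIT 10;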

No, as long as you don't actually store NULLs in the table, the indexes will look exactly the same (and be equally efficient).
Setting the column to NOT NULL has a lot of other advantages though, so you should always set it to that when you don't plan to store NULLs in it :-)
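For completeness, adding the constraint to an existing column is a one-liner (PostgreSQL; hypothetical names):

ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;  -- fails if any NULLs are already present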

Related

Does adding an explicit column of SHA-256 hash for a CLOB field improve the search (exact match) performance on that CLOB field

We have a requirement to implement a table (probably an Oracle or MS SQL Server table) as follows:
One column stores a string value; the length of this string is highly variable, typically from several bytes to 500 megabytes (occasionally beyond 1 gigabyte).
Based on the above, we decided to use a CLOB type in the database. (Using the file system is not an option for us.)
The table is very large, up to several million records.
One of the most frequent and important operations against this table is searching records by this CLOB column, where the search string needs to EXACTLY match the CLOB column value.
My question is: besides adding an index on the CLOB column, do we need to do any particular optimisation to improve the search performance?
One of my team members suggested adding an extra column holding the SHA-256 hash of the CLOB column and searching by this hash value instead of the CLOB column. In his opinion, the rationale is that hash values have a fixed length rather than a variable one, so indexing on them makes searching faster.
However, I don't think this makes a big difference: if adding an explicit hash improved search performance, the database should be intelligent enough to do it on its own, storing the hash value in some hidden place in the database system. Why should we developers bother to do it explicitly? On the other hand, the hash value can theoretically produce collisions, although they are rare.
The only benefit I can imagine is when the client does a search whose keyword is very large: you can reduce the network round trip by hashing this large value down to a small one, so the network transfer is faster.
So, any database gurus, please shed light on this question. Many thanks!
Regular indexes don't work on CLOB columns. Instead you would need to create an Oracle Text index, which is primarily for full-text searching of key words/phrases rather than exact matching of the entire value.
In contrast, by computing a hash of the column data you can create an index on the hash value, since it's short enough to fit in a standard VARCHAR2 or RAW column. Such a hash can significantly reduce your search space when trying to find exact matches.
As for your concern over hash collisions: while not unfounded, it can be mitigated. First off, hash collisions are relatively rare, and when they do occur the documents are unlikely to be very similar, so a direct text comparison can be used whenever a collision is detected. Alternatively, because of the way hash functions work--small changes to the original document result in significant changes in the hash value, and the same change to different documents affects the hash value differently--you could compute a secondary hash of a subset (or superset) of the original text as a collision-avoidance mechanism.
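A rough sketch of the approach (assuming Oracle 12c or later, since DBMS_CRYPTO.HASH_SH256 is used; all table and column names here are made up):

ALTER TABLE docs ADD (doc_hash RAW(32));   -- SHA-256 is 32 bytes

-- keep the hash in sync with the CLOB (could equally be done in application code)
CREATE OR REPLACE TRIGGER docs_hash_trg
BEFORE INSERT OR UPDATE OF doc_text ON docs
FOR EACH ROW
BEGIN
  :NEW.doc_hash := DBMS_CRYPTO.HASH(:NEW.doc_text, DBMS_CRYPTO.HASH_SH256);
END;
/

CREATE INDEX docs_hash_idx ON docs (doc_hash);

-- exact-match search: narrow by hash via the index, then verify to rule out collisions
SELECT *
FROM   docs
WHERE  doc_hash = DBMS_CRYPTO.HASH(:search_clob, DBMS_CRYPTO.HASH_SH256)
AND    DBMS_LOB.COMPARE(doc_text, :search_clob) = 0;

The final DBMS_LOB.COMPARE call is what guards against the (rare) collision case discussed above.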

When may it make sense to not have an index on a table and why?

For Oracle, and relative to application tuning, when might it make sense not to have an index on a table, and why?
There is a cost associated to having an index:
it takes up disk space
it slows down updates (index needs to be updated as well)
it makes query planning more complex (slightly slower, but more importantly increased potential for bad decisions)
These costs are supposed to be offset by the benefit of more efficient query processing (faster, fewer I/O).
If the index is not used enough to justify the cost, then having the index will be negative.
In particular, if the cardinality of your data is low (think flags like 'Y' and 'N'), indexes won't help much: if the number of distinct values in an index is low, the optimizer will probably choose not to use it. An interesting aside is that if the indexed column contains NULLs, queries that filter on actual values can be much faster, because NULLs aren't indexed; only the actual (non-NULL) values are in that particular index, so most of the rows in the table never need to be evaluated. In the "is null" case, the index will never be used--if you have a query with a "where" clause like "where mytable.mycolumn is null", abandon all indexes ye who enter here.
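A quick illustration of that last point (Oracle; made-up names):

CREATE INDEX emp_term_idx ON employees (terminated_flag);   -- a 'Y'/'N'/NULL column

-- rows where every indexed column is NULL are not stored in a B-tree index,
-- so this predicate cannot be answered from emp_term_idx
SELECT * FROM employees WHERE terminated_flag IS NULL;

-- this one can use the index, and the index only contains the non-NULL rows
SELECT * FROM employees WHERE terminated_flag = 'Y';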
If a table has very little data (small number of rows) then it doesn't serve you to use an index. An index makes it quick to search on a specific attribute and if the application you are working with doesn't need a fast lookup then using an index does very little for you.

how to get around when normal index or bitmap index isn't useful

The columns in the where clause are not selective. They are all in one single table. In addition, the expressions used are NOT EQUAL, OR, IS NULL, and IS NOT NULL. The primary key is on the customer ID. I am not sure how to deal with this kind of data. Are there different indexing methods that can be created on the table, or other ways to solve the problem? I guess partitioning won't be helpful either, since it would break the table into one major section holding most of the data. Any thoughts or workarounds will be useful.
I'm putting below the data for reference and sample queries for ease of understanding.
sample query
colA = 'Marketable' OR colA is null
NORMAL index: gets ignored due to the OR and NULL operators. Moreover, the queried data covers more than 95% of the data in the table.
BITMAP index: gets ignored due to more than 96% data coverage.
sample query
colB = '7' OR colB = '6' OR colB = '5'
NORMAL or BITMAP: both not useful due to the large data selection. The optimizer goes with a full table scan using the primary key cust_id.
sample query
colC <> 'SPECIAL SEGMENT' OR colC is null (since the values can change, no specific value is passed)
combination sample query
NOT (colB = '6' OR colB = '3') AND
(colC <> 'SPECIAL SEGMENT' OR colC is null)
Full table scans are not evil. Index access is not always more efficient.
If you want to return the majority of the data in a table, you want to use a full table scan since that's the most efficient way of accessing large fractions of the data in the table. Indexes are great when you want to access relatively small fractions of the data in the table. But if you want most of the data, doing millions of index accesses is not going to be more efficient. In your first example, you want to return 9.2 million rows from a 9.3 million row table. A full table scan is the plan you want-- that's the most efficient way to retrieve 99% of the rows in the table. Anything else is going to be less efficient. You could, I suppose, potentially partition the table on A leading to full partition scans of the two large partitions. That's only going to cut, say 1% of the work your query needs to do, though, and may have negative impacts on other queries on that table.
Now, I'm always a bit suspicious about queries that want to return 99% of the rows in a table in the first place. It would make no sense to have such a query in an OLTP system, for example, because no human is going to page through 9.2 million rows of data. It wouldn't make sense to have that sort of query if the goal is to replicate data because it would almost certainly be more efficient to just replicate incremental changes rather than the entire data set every time. It might make sense to read almost all the rows if the goal is to perform some aggregations. But if this is something that happens enough to care about optimizing the analysis, you'd be better off looking at ways of pre-aggregating the data using materialized views and dimensions so that you can read and aggregate the data once and then just read your pre-aggregated values at runtime.
If you do really need to read all that data, you may also want to look into parallel query. If there are relatively few readers, it is more efficient to let Oracle do the full scan in parallel so that your session can utilize more of the available hardware. Of course, that means that you can have fewer simultaneous sessions since more hardware for you means less for others, so that's a trade-off you need to understand. If you're building an ETL process where there will only be a couple sessions loading data at any point, parallel query can provide substantial performance improvements.
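For instance (purely a sketch; big_table and the parallel degree are made up, colA/colB are the columns from the question):

-- let Oracle do the 99% read as a parallel full scan
SELECT /*+ FULL(t) PARALLEL(t 8) */ *
FROM   big_table t
WHERE  t.colA = 'Marketable' OR t.colA IS NULL;

-- or pre-aggregate once and read the small result at runtime
CREATE MATERIALIZED VIEW big_table_summary
BUILD IMMEDIATE
REFRESH ON DEMAND
AS
SELECT colA, colB, COUNT(*) AS cnt
FROM   big_table
GROUP  BY colA, colB;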

Bad performance when writing log data to Cassandra with timeuuid as a column name

Following the pointers in an eBay tech blog and a DataStax developers blog, I modeled some event log data in Cassandra 1.2. As a partition key, I use "yymmddhh|bucket", where bucket is a number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int,
           rId int, created timeuuid, data map<text, text>,
           PRIMARY KEY ((yymmddhh, bucket), created) );
(rId identifies the resource that fired the event.)
(the map holds key/value pairs derived from a JSON document; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but I can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable on a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain at around <0.005 s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to do thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post the link due to reputation limitations, sorry, but it's easy to find) to say that each key creates an additional column -- or does it create one new column per "map"?? That would be... hard to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
    yymmddhh VARCHAR,
    bucket INT,
    created TIMEUUID,
    rId INT,
    key VARCHAR,
    value VARCHAR,
    PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
);
(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume this is correct.)
This has two drawbacks over using a MAP: it requires you to reassemble the map when you query the data (you get back one row per map entry), and it uses a little more space since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
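For example, with the table above (the values here are made up), each map entry becomes its own row and the whole "map" is read back with a single partition query:

-- one event = one timeuuid, reused for all of its key/value rows
-- (generate it client-side; a literal is shown here for brevity)
INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13082117', 42, 50554d6e-29bb-11e5-b345-feff819cdc9f, 1, 'status', 'ok');

INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13082117', 42, 50554d6e-29bb-11e5-b345-feff819cdc9f, 1, 'source', 'web');

-- reassemble the "map" client-side (or narrow further with created/rId)
SELECT created, rId, key, value
FROM transactions
WHERE yymmddhh = '13082117' AND bucket = 42;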
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket+timeuuid as the column name, where each bucket has 20 (or some fixed number of) records; the buckets can be managed using a counter column family.

multicolumn index column order

I've been told and read everywhere (but no one dared to explain why) that when composing an index on multiple columns I should put the most selective column first, for performance reasons.
Why is that?
Is it a myth?
I should put the most selective column first
According to Tom, column selectivity has no performance impact for queries that use all the columns in the index (it does affect Oracle's ability to compress the index).
it is not the first thing, it is not the most important thing. sure, it is something to consider but it is relatively far down there in the grand scheme of things.
In certain strange, very peculiar and abnormal cases (like the above with really utterly skewed data), the selectivity could easily matter HOWEVER, they are
a) pretty rare
b) truly dependent on the values used at runtime, as all skewed queries are
so in general, look at the questions you have, try to minimize the indexes you need based on that.
The number of distinct values in a column in a concatenated index is not relevant when considering the position in the index.
However, these considerations should come second when deciding on index column order. More important is to ensure that the index can be useful to many queries, so the column order has to reflect the use of those columns (or the lack thereof) in the where clauses of your queries (for the reason illustrated by AndreKR).
HOW YOU USE the index -- that is what is relevant when deciding.
All other things being equal, I would still put the most selective column first. It just feels right...
Update: Another quote from Tom (thanks to milan for finding it).
In Oracle 5 (yes, version 5!), there was an argument for placing the most selective columns first in an index.
Since then, it is not true that putting the most discriminating entries first in the index will make the index smaller or more efficient. It seems like it will, but it will not. With index key compression, there is a compelling argument to go the other way since it can make the index smaller. However, it should be driven by how you use the index, as previously stated.
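For example, with key compression the repetitive (less selective) column goes first so its values can be factored out (hypothetical names):

CREATE INDEX orders_status_cust_idx ON orders (status, customer_id) COMPRESS 1;  -- compress the leading, low-cardinality column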
You can omit columns from right to left when using an index, i.e. when you have an index on col_a, col_b you can use it in WHERE col_a = x but you cannot use it in WHERE col_b = x.
Imagine a telephone book that is sorted by first name and then by last name.
At least in Europe and the US, first names have a much lower selectivity than last names, so looking up the first name wouldn't narrow the result set much, and there would still be many pages to check for the correct last name.
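In SQL terms (hypothetical table t, mirroring the col_a/col_b example above):

CREATE INDEX t_ab_idx ON t (col_a, col_b);

SELECT * FROM t WHERE col_a = 1;                 -- can use t_ab_idx
SELECT * FROM t WHERE col_a = 1 AND col_b = 2;   -- can use t_ab_idx
SELECT * FROM t WHERE col_b = 2;                 -- cannot use it efficiently (leading column missing)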
The ordering of the columns in the index should be determined by your queries, not by any selectivity considerations. If you have an index on (a,b,c), and most of your single-column queries are against column c, followed by a, then put them in the order c,a,b in the index definition for the best efficiency. Oracle prefers to use the leading edge of the index for the query, but it can use other columns in the index in a less efficient access path known as a skip scan.
The more selective your index is, the faster the search.
Simply imagine a phone book: you can usually find someone quickly by last name. But if a lot of people share the same last name, you will spend more time looking for the right person, checking the first name every time.
So you should put the most selective columns first to avoid this problem as much as possible.
Additionally, you should then make sure that your queries actually make use of these selective columns.
