varchar(1) or smallint to store a status in Postgres - performance

I'll store a status from 0 to 7 and I want to know which is the better field type to use, considering performance and space in a Postgres database: varchar(1) or smallint.
By the way, is there any difference between defining a field as varchar(1) or varchar(100), still talking about performance and space?

In my opinion you're fighting the wrong battle. You're worried about the performance impact of storing an integer instead of a single-character field, which to my mind is short-sighted. The actual performance impact of an integer vs. a single character is trivial, and I doubt it can be measured meaningfully. In my experience it's more important to reduce the cognitive load on the developers and users of the system, and thus it's better to use character fields long enough to contain a reasonable description of the status instead of numeric values or single-character abbreviations. Not having to remember what 1, 2, 'A', or 'X' mean is very helpful. Instead of these abbreviated values I suggest using easy-to-understand values such as 'READY', 'ACTIVE', 'PROCESSED', 'CANCELLED', etc.
As to the second part of the question - not really. There might be some small amount of extra time to move the longer string, but it's trivial unless you're talking about millions of values.
Best of luck.

While I agree with Bob Jarvis that this is really premature optimisation, I'll try to focus on the question as asked.
You're neglecting the most important choices. Your choices include:
smallint
enum
"char"
character and character varying
You could use an enumerated type. This is only really OK so long as you expect never to remove valid values, since PostgreSQL currently doesn't support deleting values from enum types.
Alternately, you could use the "char" data type. Yes, the quotes matter. It's a single character, like the C data type char. Without quotes char turns into character(1) at parse time.
varchar and character aren't really ideal for this because they're variable-width types with header overheads etc.
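To make the options concrete, here is a minimal sketch (table and type names are hypothetical) of the three viable choices in PostgreSQL:

-- 1. smallint: 2 bytes; the 0..7 range can be enforced with a CHECK constraint
CREATE TABLE t_smallint (status smallint NOT NULL CHECK (status BETWEEN 0 AND 7));

-- 2. enum: readable labels, but values cannot currently be dropped from the type
CREATE TYPE status_t AS ENUM ('ready', 'active', 'processed', 'cancelled');
CREATE TABLE t_enum (status status_t NOT NULL);

-- 3. "char": a single byte, like the C char type (the quotes matter)
CREATE TABLE t_char (status "char" NOT NULL);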
By the way, is there any difference to set a field varchar(1) or varchar(100), still talking about performance and space?
No. This is answered (many times) in other questions.

Related

How do I not normalize continuous data (INTS, FLOATS, DATETIME, ....)?

According to my understanding - and correct me if I'm wrong - "normalization" is the process of removing redundant data from the database design.
However, when I was trying to learn about database optimizing/tuning for performance, I found that Mr. Rick James recommends against normalizing continuous values such as INTs, FLOATs, DATETIMEs, ...:
"Normalize, but don't over-normalize." In particular, do not normalize
datetimes or floats or other "continuous" values.
source
Sure purists say normalize time. That is a big mistake. Generally,
"continuous" values should not be normalized because you generally
want to do range queries on them. If it is normalized, performance
will be orders of magnitude worse.
Normalization has several purposes; they don't really apply here:
Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings
To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.
In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still
sucks.
source
Do not normalize "continuous" values -- dates, floats, etc --
especially if you will do range queries.
source.
I tried to understand this point but I couldn't. Can someone please explain it to me and give me an example of a worst case where applying this rule would improve performance?
Note: I could have asked him in a comment or something, but I wanted to document and highlight this point on its own, because I believe it is a very important note that affects almost my entire database's performance.
The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?
Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:
-- single table
SELECT ...
FROM t
WHERE x = ...
AND date BETWEEN ... AND ...; -- `date` is of datatype DATE/DATETIME/etc
-- extra table
SELECT ...
FROM t
JOIN Dates AS d ON t.date_id = d.date_id
WHERE t.x = ...
AND d.date BETWEEN ... AND ...; -- Range test is now in the other table
Moving the range test to a JOINed table causes the slowdown.
The first query is quite optimizable via
INDEX(x, date)
In the second query, the Optimizer will (for MySQL at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth to the other table to handle the rest of the WHERE. (Other engines have other techniques, but there is still a significant cost.)
DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).
Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index will almost always lead to a more efficient query for the single-table approach.
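To illustrate, here is a rough sketch (hypothetical tables, MySQL syntax as in the queries above) of the two designs being contrasted:

-- Single-table design: the date lives in the row, so one composite index
-- covers both the equality test and the range test.
CREATE TABLE t (
    id   INT PRIMARY KEY,
    x    INT NOT NULL,
    date DATE NOT NULL,
    INDEX idx_x_date (x, date)
);

-- "Normalized" design: the date hides behind a surrogate key, so the range
-- test must run against the joined table and the composite index above is impossible.
CREATE TABLE Dates (
    date_id INT PRIMARY KEY,
    date    DATE NOT NULL
);
CREATE TABLE t_normalized (
    id      INT PRIMARY KEY,
    x       INT NOT NULL,
    date_id INT NOT NULL   -- a 4-byte INT replacing a 3-byte DATE
);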

Does adding an explicit column with the SHA-256 hash of a CLOB field improve the search (exact match) performance on that CLOB field?

We have a requirement to implement a table (probably an Oracle or MS SQL Server table) as follows:
One column stores a string value; the length of this string is highly variable, typically from several bytes to 500 megabytes (occasionally beyond 1 gigabyte).
Based on the above, we decided to use the CLOB type in the database (using the file system is not an option for us).
The table is very large, up to several million records.
One of the most frequent and important operations against this table is searching for records by this CLOB column, where the search string needs to EXACTLY match the CLOB column value.
My question is: besides adding an index on the CLOB column, do we need to do any particular optimisation to improve the search performance?
One of my team members suggested adding an extra column holding the SHA-256 hash of the CLOB column and searching by this hash value instead of the CLOB column. In his opinion, the rationale is that hash values have a fixed length rather than a variable one, so indexing them makes the search faster.
However, I don't think this makes a big difference, because if adding an explicit hash improved search performance, the database should be intelligent enough to do it on its own, likely storing such a hash value somewhere internally. Why should we developers bother to do it explicitly? On the other hand, a hash value can theoretically produce collisions, although they are rare.
The only benefit I can imagine is that when the database client searches with a very large keyword, you can reduce the network round trip by hashing this large value down to a small one, so the network transfer is faster.
So, any database gurus, please shed some light on this question. Many thanks!
Regular indexes don't work on CLOB columns. Instead you would need to create an Oracle Text index, which is primarily for full text searching of key words/phrases, rather than full text matching.
In contrast, by computing a hash of the column data, you can then create an index on the hash value, since it's short enough to fit in a standard VARCHAR2 or RAW column. Such a hash can significantly reduce your search space when trying to find exact matches.
As for your concern over hash collisions: while not unfounded, it can be mitigated. First off, hash collisions are relatively rare, but when they do occur the documents are unlikely to be very similar, so a direct text comparison can be used where a collision is detected. Alternatively, due to the way hash functions work - where small changes to the original document result in significant changes to the hash value, and where the same change to different documents affects the hash value differently - you could compute a secondary hash of a subset (or superset) of the original text to act as a collision-avoidance mechanism.
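As a rough sketch of that approach in Oracle (names are hypothetical; assumes Oracle 12c or later, where DBMS_CRYPTO.HASH_SH256 is available and you have EXECUTE on DBMS_CRYPTO):

-- Add a fixed-length hash column and index it
ALTER TABLE documents ADD (doc_hash RAW(32));
CREATE INDEX documents_hash_idx ON documents (doc_hash);

-- Populate it (PL/SQL, since DBMS_CRYPTO package constants cannot be
-- referenced directly in a plain SQL statement)
DECLARE
  l_alg PLS_INTEGER := DBMS_CRYPTO.HASH_SH256;  -- SHA-256
BEGIN
  UPDATE documents
     SET doc_hash = DBMS_CRYPTO.HASH(doc_text, l_alg);
END;
/

-- Search by the hash (computed client-side or in PL/SQL and bound as RAW),
-- then confirm with a direct LOB comparison to guard against the rare collision
SELECT *
  FROM documents
 WHERE doc_hash = :search_hash
   AND DBMS_LOB.COMPARE(doc_text, :search_clob) = 0;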

NoSql: Enums vs Strings

Just curious how others deal with enums & nosql? Is it better to store an attribute as an enum value or a string? Does this affect the size or performance of the database in some cases? For example, just think of, let's say, a pro sports player... his sport type could be Football, Hockey, Baseball, Basketball, etc... string vs enum, what do you all think?
You should be using enums in your code - strong typing helps avoid a lot of mistakes - and converting them to strings or numbers for storage.
Strings do require significantly more storage space - "Basketball" is 10-20 bytes depending on encoding, while storing it as the number 4 needs only 1 byte. However, there are very few cases where this will actually matter - if you have a million records, it is still less than a 20 MB difference in total database size. Strings are easier to work with and less likely to fail silently if the enumeration changes, so use strings.
Strings are also slower than numbers for most operations, including conversion to enum on load. However, the difference is orders of magnitude less than the time taken to retrieve anything at all from the database, so doesn't matter.
Strings are better from a portability perspective, and Enum is not supported by popular DBMSs like MSSQL Server and many others.
You can have application-level logic to validate input against an array of allowed values and just store the value as a string.
EDIT:
My preference changed to strings, as CakePHP (where I do web apps) no longer supports Enum, for portability reasons.

Advice on DB design Best Practices/Standard - Oracle

I'm designing the DB for a new app, which is something I've done a thousand times, but on this occasion I suddenly started wondering about some aspects I've never stopped to consider before. Is there some standard/recommendation for the following things?
What's the recommended data type for storing currencies (no financial operations, just displaying)?
Recommended size for storing phone numbers (internationals)
Recommended minimum size for storing first names / last names (minimum meaning smallest maximum recommended size)
Recommended minimum size for storing comment blocks (minimum meaning smallest maximum recommended size as well)
I'm aware that every application has its own particular requirements to consider, but I feel that there must be something more specific than gut feeling and common sense.
Help, as always, will be deeply appreciated.
What's the recommended data type for storing currencies?
This depends on what kind of currency, and to what degree of accuracy.
If it's cents and dollars, rounded to the nearest cent, it's NUMBER(12,2) which allows you to store amounts between -9,999,999,999.99 and 9,999,999,999.99 - which for most currencies should be enough.
If you need to store intermediate results from, say, interest rate calculations, you may need more precision, e.g. NUMBER(15,5).
If you're talking Zimbabwean dollars, perhaps you should choose the maximum NUMBER instead :)
Recommended size for storing phone numbers (internationals)
VARCHAR2(30) should be sufficient. If it's too long your users will enter all sorts of rubbish data in there.
Recommended minimum size for storing first names / last names /
Recommended minimum size for storing comment blocks
These don't apply since you're in Oracle - use VARCHAR2, so you don't have to worry about minimum size. All you need to specify is the maximum size.
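Pulled together, a rough sketch (hypothetical table and column names) of those recommendations might look like this:

CREATE TABLE customer (
    id          NUMBER              PRIMARY KEY,
    first_name  VARCHAR2(100 CHAR),  -- character (not byte) semantics for multi-byte names
    last_name   VARCHAR2(100 CHAR),
    phone       VARCHAR2(30),        -- enough for international numbers
    balance     NUMBER(12,2),        -- cents-and-dollars accuracy
    comments    CLOB                 -- large free text; no size limit to worry about
);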
Currencies:
NUMBER(15,2), really depends on how big the numbers are that you expect to run into.
Phone numbers:
VARCHAR2(30), please don't hurt me if it should be larger - I can't remember the length per se, just that VARCHAR allows flexibility for formatting.
I don't see the point of looking at the minimum size if using VARCHAR2. The concerns for the physical model revolve around how much space the database will consume over time, assuming fields are maxed out.
Comment blocks:
Maximum of VARCHAR2(4000)
EDIFACT generally uses 35 as the size of a Name field and I'd copy that (and document that as a basis). Newer stuff tends to be defined in XML and doesn't normally go into field length definitions.
Alternatively the Canadian post office recommends no more than 40 characters per address line.
Note, that is characters and not bytes. Sizing should take into account multi-byte characters, but obviously not all names will be the maximum length. I've used ten characters per name as a broad approximation for sizing estimates but that could vary a lot between countries, ethnicities etc.
I know you were asking minimum size for comment blocks, but for large free-text areas you ought to consider using a CLOB value. Oracle is pretty smart about how these things are handled, how the data is stored, etc. You NEVER have to worry about size. In addition, you can usually pretend that they are VARCHAR2 columns for easy manipulation.

Does setting "NOT NULL" on a column in postgresql increase performance?

I know this is a good idea in MySQL. If I recall correctly, in MySQL it allows indexes to work more efficiently.
Setting NOT NULL has no effect per se on performance. A few cycles for the check - irrelevant.
But you can improve performance by actually using NULLs instead of dummy values. Depending on data types, you can save a lot of disk space and RAM, thereby speeding up ... everything.
The null bitmap is only allocated if there are any NULL values in the row. It's one bit for every column in the row (NULL or not). For tables up to 8 columns the null bitmap is effectively completely free, using a spare byte between tuple header and row data. After that, space is allocated in multiples of MAXALIGN (typically 8 bytes, covering 64 columns). The difference is lost to padding. So you pay the full (low!) price for the first NULL value in each row. Additional NULL values can only save space.
The minimum storage requirement for any non-null value is 1 byte (boolean, "char", ...) or typically much more, plus (possibly) padding for alignment. Read up on data types or check the gory details in the system table pg_type.
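As a quick way to see the effect, pg_column_size() reports the stored size of a value, so a sketch like this (hypothetical values) shows that a NULL adds no data bytes while a dummy value does:

SELECT pg_column_size(ROW(1::int, NULL::text)) AS row_with_null,
       pg_column_size(ROW(1::int, 'n/a'::text)) AS row_with_dummy;
-- The dummy 'n/a' costs extra bytes in every row; the NULL costs only its bit
-- in the null bitmap (effectively free for tables up to 8 columns, as above).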
More about null storage:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
The manual.
It's always a good idea to keep columns from being NULL if you can avoid it, because the semantics of using them are so messy; see What is the deal with NULLs? for a good discussion of how they can get you into trouble.
In versions of PostgreSQL up to 8.2, the software didn't know how to do comparisons on the most common index type (the b-tree) in a way that would include finding NULL values. In the relevant bit of documentation on index types, you can see that described as "but note that IS NULL is not equivalent to = and is not indexable". The effective downside is that if you specify a query that requires including NULL values, the planner might not be able to satisfy it using the obvious index for that case. As a simple example, if you have an ORDER BY that could be accelerated with an index, but your query needs to return NULL values too, the optimizer can't use that index because the result would be missing any NULL data - and therefore be incomplete and useless. The optimizer knows this, and will instead do an unindexed scan of the table, which can be very expensive.
PostgreSQL improved this in 8.3, "an IS NULL condition on an index column can be used with a B-tree index". So the situations where you can be burned by trying to index something with NULL values have been reduced. But since NULL semantics are still really painful and you might run into a situation where even the 8.3 planner doesn't do what you expect because of them, you should still use NOT NULL whenever possible to lower your chances of running into a badly optimized query.
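A minimal sketch (hypothetical table) of the behaviour described above, on PostgreSQL 8.3 or later:

CREATE TABLE orders (
    id         serial PRIMARY KEY,
    shipped_at timestamptz          -- NULL means "not shipped yet"
);
CREATE INDEX orders_shipped_at_idx ON orders (shipped_at);

-- On 8.3+ the planner is able to use the plain B-tree index for the IS NULL
-- predicate (given enough rows to make the index attractive); on 8.2 and
-- earlier it had to fall back to a sequential scan.
EXPLAIN SELECT * FROM orders WHERE shipped_at IS NULL;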
No, as long as you don't actually store NULLs in the table the indexes will look exactly the same (and equally efficient).
Setting the column to NOT NULL has a lot of other advantages though, so you should always set it to that when you don't plan to store NULLs in it :-)
