For SQL queries in MySQL and SQLite, which is more efficient for comparisons: text or int? (performance)

I need to store 2 columns in an SQLite db: a date (YYYY-MM-DD) and a time (HH-MM-SS).
Since SQLite can only store date/time values as TEXT, REAL, or INTEGER, which data type should I use?
Is it more efficient to store them as TEXT or INTEGER? I will be doing frequent lookups against these columns,
so my queries will have a WHERE condition on the date and time columns.
I want to make the WHERE comparison efficient.

The primary bottleneck when using databases is I/O.
If you care about performance, use the datatype which requires the least amount of I/O.
An INTEGER value needs less storage than a REAL one, which needs less than TEXT.
Whether this makes a noticeable difference is another question; you have to measure.
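For illustration, here is a minimal SQLite sketch of the INTEGER approach (table and column names are hypothetical); a composite index keeps the WHERE comparisons cheap:

-- date stored as YYYYMMDD, time as seconds since midnight, both INTEGER
CREATE TABLE events (
    event_date INTEGER,   -- e.g. 20240115 for 2024-01-15
    event_time INTEGER    -- e.g. 34200 for 09:30:00
);
CREATE INDEX idx_events_date_time ON events (event_date, event_time);

-- the range lookup uses the index; integer comparison avoids string collation
SELECT * FROM events
WHERE event_date BETWEEN 20240101 AND 20240131
  AND event_time >= 34200;

Note that TEXT in fixed-width formats like YYYY-MM-DD / HH-MM-SS also sorts and compares correctly, so both approaches work; the INTEGER form simply stores and compares fewer bytes.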


Why should I not normalize continuous data (INTs, FLOATs, DATETIMEs, ...)?

According to my understanding - and correct me if I'm wrong - "normalization" is the process of removing redundant data from the database design.
However, when I was trying to learn about database optimization/tuning for performance, I found that Mr. Rick James recommends against normalizing continuous values such as INTs, FLOATs, DATETIMEs, ...
"Normalize, but don't over-normalize." In particular, do not normalize
datetimes or floats or other "continuous" values.
source
Sure, purists say to normalize time. That is a big mistake. Generally,
"continuous" values should not be normalized because you generally
want to do range queries on them. If such a value is normalized, performance
will be orders of magnitude worse.
Normalization has several purposes; they don't really apply here:
Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings.
To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.
In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still sucks.
source
Do not normalize "continuous" values -- dates, floats, etc --
especially if you will do range queries.
source.
I tried to understand this point but I couldn't. Can someone please explain it to me and give me an example of a worst case where applying this rule improves performance?
Note: I could have asked him in a comment, but I wanted to document and highlight this point on its own, because I believe it is a very important note that affects almost my entire database's performance.
The comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?
Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:
-- single table
SELECT ...
FROM t
WHERE x = ...
AND date BETWEEN ... AND ...; -- `date` is of datatype DATE/DATETIME/etc
-- extra table
SELECT ...
FROM t
JOIN Dates AS d ON t.date_id = d.date_id
WHERE t.x = ...
AND d.date BETWEEN ... AND ...; -- Range test is now in the other table
Moving the range test to a JOINed table causes the slowdown.
The first query is quite optimizable via
INDEX(x, date)
In the second query, the Optimizer will (for MySQL, at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth with the other table to handle the rest of the WHERE. (Other engines have other techniques, but there is still a significant cost.)
DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).
Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index will almost always lead to a more efficient query for the single-table approach.
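To make the size point concrete, here is a hedged MySQL sketch of the two designs (all names hypothetical):

-- single-table approach: the date lives in the row
CREATE TABLE t (
    x INT,
    date DATE,           -- 3 bytes in MySQL
    INDEX (x, date)      -- the composite index that optimizes the first query
);

-- surrogate-key approach: the date moves to a second table
CREATE TABLE t2 (
    x INT,
    date_id INT          -- 4 bytes: larger than the DATE it replaced
);
CREATE TABLE Dates (
    date_id INT PRIMARY KEY,
    date DATE
);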

Does adding an explicit SHA-256 hash column for a CLOB field improve exact-match search performance on that field?

We have a requirement to implement a table (probably an Oracle or MS SQL Server table) as follows:
One column stores a string value whose length is highly variable, typically from several bytes to 500 megabytes (occasionally beyond 1 gigabyte).
Based on the above, we decided to use the CLOB type in the DB (storing the values as files on the file system is not an option for us).
The table is very large, up to several million records.
One of the most frequent and important operations against this table is searching records by this CLOB column, where the search string must EXACTLY match the CLOB column value.
My question is: besides adding an index on the CLOB column, do we need to do any particular optimisation to improve search performance?
One of my team members suggested adding an extra column holding the SHA-256 hash of the CLOB column and searching by that hash value instead of the CLOB column. In his opinion, the rationale is that hash values have a fixed length rather than a variable one, so indexing them makes searches faster.
However, I don't think this makes a big difference: if adding an explicit hash improved search performance, the database should be intelligent enough to do it on its own, perhaps storing the hash value in some hidden place in the DB system. Why should we developers bother to do it explicitly? On the other hand, the hash can theoretically produce collisions, although they are rare.
The only benefit I can imagine is when the database client searches with a very large keyword: hashing the large value down to a small one reduces the amount of data sent over the network, so the round trip is faster.
So, any database gurus, please shed light on this question. Many thanks!
Regular indexes don't work on CLOB columns. Instead you would need to create an Oracle Text index, which is primarily for full-text searching of keywords/phrases rather than exact matching of the full text.
In contrast, by computing a hash of the column data, you can create an index on the hash value, since it is short enough to fit in a standard VARCHAR2 or RAW column. Such a hash can significantly narrow your search space when looking for exact matches.
Further, your concern over hash collisions, while not unfounded, can be mitigated. First off, hash collisions are relatively rare, and when they do occur the colliding documents are unlikely to be very similar, so a direct text comparison can be used whenever a collision is detected. Alternatively, because of the way hash functions work -- small changes to the original document produce large changes in the hash value, and the same change to different documents affects the hash differently -- you could compute a secondary hash of a subset (or superset) of the original text as a collision-avoidance mechanism.
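As a hedged Oracle sketch of that approach (table, column, and bind names are hypothetical): store the SHA-256 in a RAW(32) column populated by the application on insert, index it, and verify the full text only on a hash hit:

ALTER TABLE documents ADD (content_hash RAW(32));
CREATE INDEX idx_documents_hash ON documents (content_hash);

-- :h is the SHA-256 of the search string, computed client-side;
-- DBMS_LOB.COMPARE returns 0 on equality and rules out the rare collision
SELECT doc_id
FROM documents
WHERE content_hash = :h
  AND DBMS_LOB.COMPARE(content, :needle) = 0;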

Time series in Cassandra when measures can go "back in time"

This is related to Cassandra time series modeling when time can go backward, but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
    key text,
    measure_time timestamp,
    value int,
    PRIMARY KEY (key, measure_time))
WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to keep the data arranged in decreasing timestamp order. This enables very efficient range-based queries which, for a given key, translate into sequential disk reads (which are intrinsically fast).
I have often seen suggestions to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that. It seems to me a very common pattern; you can't use it if:
1) your users want to query on the actual time the measure was taken, not the time the measure was written;
2) you use multiple writer threads.
So, I want to understand what happens if I write data in an unordered fashion (with respect to the measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
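For instance, a quick sketch of that test (key and values hypothetical):

INSERT INTO measures (key, measure_time, value)
VALUES ('sensor-1', '2020-01-01 00:00:10', 10);
INSERT INTO measures (key, measure_time, value)
VALUES ('sensor-1', '2020-01-01 00:00:09', 9);  -- written later, but earlier in time

SELECT * FROM measures WHERE key = 'sensor-1';
-- rows come back in measure_time DESC order despite the out-of-order writes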
But what happens "under the hood"? In my opinion, it is impossible that the data stays ordered on disk, since at some point the data has to be flushed to disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Is the data re-arranged on disk? At what cost?
I hope I was clear. I couldn't find any explanation of this on the DataStax site, but I admit I'm quite a novice with Cassandra. Any pointers appreciated.
Sure. Once written, an SSTable file is immutable. Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The compaction process may later merge those two SSTables into a single new one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
And try to avoid very wide rows/partitions, which is what you will get if you have a lot of measurements (i.e. many measure_time values) for a single key.
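If wide partitions do become a problem, one common variation (a sketch assuming daily buckets fit your query pattern; the new table and column names are hypothetical) is to add a time bucket to the partition key so each partition holds at most one day of measurements:

CREATE TABLE measures_by_day (
    key text,
    day date,                       -- bucket: caps the partition size
    measure_time timestamp,
    value int,
    PRIMARY KEY ((key, day), measure_time))
WITH CLUSTERING ORDER BY (measure_time DESC);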

What is the fastest way to store a huge amount of unique strings?

I wonder what the best way is to store a huge number of strings and check for duplicates.
We have to think about our priorities:
duplicate check speed
insert time for new strings
storage space on hard disk
random access time
What is the best solution when our targets are fast duplicate checking and fast insertion of new strings (random access and storage space don't matter)?
I am thinking about an SQL database, but which DB is best for this?
If we use an SQL DB like MySQL, which storage engine would be best? (Of course, we have to exclude MEMORY because of the data volume.)
Use a hash function on the input string; the resulting hash becomes the primary key/id of the record.
Then check whether the DB already has that hash/id/primary key (a sketch follows the list below):
If it doesn't: this is a new string; add a new record containing the string, with the hash as its id.
If it does: check whether the string in the stored record is the same as the input string.
If the string is the same: it is a duplicate.
If the string is different: this is a collision. Use a collision resolution scheme to resolve it. (A couple of examples below.)
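Here is a minimal MySQL/InnoDB sketch of that flow (table and bind names hypothetical; the hash is computed client-side):

CREATE TABLE strings (
    hash BINARY(32) PRIMARY KEY,  -- e.g. SHA-256 of the string
    content LONGTEXT              -- NULL doubles as the collision marker below
) ENGINE=InnoDB;

-- duplicate check: :h is the hash of the input string
SELECT content FROM strings WHERE hash = :h;
-- no row                    -> new string: INSERT INTO strings (hash, content) VALUES (:h, :s);
-- row and content = input   -> duplicate
-- row and content <> input  -> collision: apply a resolution scheme (see below)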
You will have to consider which hash function/scheme/strength to use based on speed, the expected number of strings, and your hash-collision requirements/guarantees.
A couple of ways to resolve collisions:
Use a 2nd hash function to come up with a new hash in the same table.
Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with a wider domain) on a secondary "collision" table, as sketched below. On lookup, if the string is marked as collided (e.g. NULL), do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure this second table has no further collisions.
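Continuing the sketch above, a hedged illustration of the second scheme (the NULL marker and the secondary hash are illustrative choices):

CREATE TABLE collisions (
    hash2 BINARY(64) PRIMARY KEY,  -- stronger/wider secondary hash, e.g. SHA-512
    content LONGTEXT NOT NULL
) ENGINE=InnoDB;

-- on collision: UPDATE strings SET content = NULL WHERE hash = :h;   (mark it)
--               INSERT INTO collisions (hash2, content) VALUES (:h2, :s);
-- on lookup:    if strings.content IS NULL, compute the stronger hash :h2
--               and run SELECT content FROM collisions WHERE hash2 = :h2;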
Of course, depending on how persistent this needs to be and how much memory/how many strings you expect, you could do this without a database, directly in memory, which would be a lot faster.
You may want to consider a NoSQL solution:
Redis. Some of the use cases solved using Redis:
http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
(Josiah L. Carlson is the author of Redis in Action)
http://www.paperplanes.de/2010/2/16/a_collection_of_redis_use_cases.html
memcached. Some comparisons between memcached and Redis:
http://www.quora.com/What-are-the-differences-between-memcached-and-redis
Is memcached a dinosaur in comparison to Redis?
http://coder.cl/2011/06/concurrency-in-redis-and-memcache/
Membase/Couchbase, which counts OMGPOP's Draw Something among its success stories. Comparison between Redis and Membase:
What is the major difference between Redis and Membase?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Some questions:
how large is the set of strings?
will the application be read heavy or write heavy? or both?
how often would you like data to be persisted to disk?
is there a N most recent strings requirement?
Hope this helps.
Generate suffix trees to store the strings. Ukkonen's algorithm, as described in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to create a suffix tree. There are a number of ways to store the suffix tree, but once it is generated, the lookup time is very low.

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first and then build the indexes?
I'd temper af's answer by saying that "index first, insert after" would probably be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting the records in the natural order of that index. The reason is that for each insert, the data rows themselves would have to be re-ordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid means one row may be added at the top of the data, causing all data in the current page (and maybe data in lower pages too) to be shuffled along, while the next row is added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indexes are in place forces the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indexes afterwards, especially with this much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
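For concreteness, a hedged T-SQL sketch of the "insert first, index after" variant (table, index, and file names are hypothetical; one of the four indexes shown):

-- drop the search index before the import
DROP INDEX ix_bigtable_col1 ON dbo.BigTable;

-- bulk load; TABLOCK allows minimal logging under the right recovery model
BULK INSERT dbo.BigTable
FROM 'C:\import\data.dat'
WITH (TABLOCK);

-- rebuild the index in one pass over the loaded data
CREATE INDEX ix_bigtable_col1 ON dbo.BigTable (col1);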
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.
