SQL query to search faster or using hash table - algorithm

If I am looking for a record in the database, is writing a sql query to search the database directly faster OR is reading the entire data from the database into a hashtable and then searching in O(1) time faster?
This question is for experienced programmers who have faced such issues in the past.

If you know the primary key of the row or the column you are searching on is indexed, then doing the retrieval" using SQL will be much faster. Especially if your table does not fit into memory.

Making direct sql query to database would obviously be much faster, than first reading all the records into a hash table and searching from it. This will not only save your time in loading all the records firstly into a hash table and then searching through them. 2ndly it will also save lots of memory, that your hash tables will consume.
I have experienced this kind of situations. Hope this helps you!

Sql Server Database is more faster and better than Hash-table.
one important reason behind.
Hash table reads the data once from secondary storage and then loaded into memory.
now, it is easy to identify that what will happen?
By Storing data in a huge manner, system will be slow. it will difficult to manipulate and retrieve the records.....

Despite, DBMS is being considered well convenient environment as compare to hash table. if you are trying to get results with few thousands of records then you do not have need to create index. it depends on need. Thus, it is much easy to get answers from remote machine with three tier applications. it takes care about row count, IO Speed etc.

If the SQL table is not indexed you'd have to benchmark to find your answer. Since there are lots of factors such as the row count, IO speed, network speed (if database is on a remove machine), it is hard to just give an answer to the question
On the other hand, indexing the table is a better choice. Just, leave the DBMS's job to DBMS.

Related

Does it matter if I don't stress test Oracle 11g with random data

We need to stress test our Oracle database with about 5 million row inserts. According to our DBA, the only columns that need to be different are the Primary or foreign key...all other columns can be the same. He said if we do that, then Oracle will not do any sort of caching when inserting the data.
I just want to make sure that he is right and that by doing this, the stress testing results would be nearly as accurate as using random data. Thank you for your help.
In a very narrow set of circumstances, the DBA is correct. If ALL your queries are lookups based upon primary and foreign keys, then they may be right. In the past when the rule-based optimizer was king, then the data didn't matter so much. Record counts, yes, but not really the data.
In the real world, though, this is not the case. Do you have any other indexes? Then the data matters. Do you join against things other than primary/foreign keys? Then the data matters. Are your strings all 1 byte or null? I doubt it, and the size of these variable-length fields may affect the amount of IO. Basically, for any non-trivial schema in a non-trivial application, having "realistic" data can be significant. The Oracle optimizer takes into account a large variety of statistics when determining how to perform a query.
Are you REALLY only doing inserts in this load test? That's kinda silly. 5 million records is chump change by modern standards. Desktops do that in seconds, typically. Even simple applications will perform some select to do a lookup, or get a set of records based upon a non-key value.
You seem to be smart enough to evaluate the DBA's statement. If you can get him to put that in writing, sign off on it, and have the responsibility fall on him when his idea of a load test doesn't work as expected, then that's great. It sounds like you're the one responsible for this test, though.
If I were in your shoes, I would want to load test with the most accurate data possible. Copying from a production system or known test set of data is a much better option than "random" and light-years better than "nulls except for the primary key" approach.

Why is Solr so much faster than Postgres?

I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of a read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is made using a binary search, so the cost of a single-term lookup is O(log(t)) where t is the number of terms. On the contrary, using the index of a standard RDBMS costs O(log(d)) where d is the number of documents. When many documents share the same value for some field, this can be a big win.
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
But Solr can only do this because it doesn't have all the constraints that a RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
lineno serial PRIMARY KEY,
linetext text NOT NULL,
tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The queries for only one argument only increased in performance by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.
Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required from an RDMS. So it (or rather lucene) concentrates on purely indexing data.
As you've no doubt discovered, Solr enables the ability to both search and retrieve data from it's index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the guardian solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
This biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.
Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.
Please read this and this.
Solr (Lucene) creates an inverted index which is where retrieving data gets quite faster. I read that PostgreSQL also has similar facility but not sure if you had used that.
The performance differences that you observed can also be accounted to "what is being searched for ?", "what are the user queries ?"

WP7 Sterling very slow storing lots of data

I'm trying to store 46,000 objects in Sterling and it's taking 3 minutes.
Yes I know it's a lot but this is data provided by the customer and could end up being a lot more.
I'm guessing each time I save a new object it is looking up the key to see if the object has already been stored.
Is there any way to bypass this and tell sterling to just insert?
Any other ideas?
Without knowing anything about your data structure it's a bit difficult to recommend ways in which you could improve performance, however:
The fewer indexes you create for your data tables the fewer indexes there are to create when your data is persisted. You should look carefully at which indexes you need for your data read scenarios.
The more data relationships there are, the more metadata there is to create at write time. You may be able to simplify the data structures and combine classes.
Sheer volume of data sounds like your biggest problem. I've experienced similar problems before with trying to persist large volumes of GPS data. The problem there is that I was trying to write a lot of relatively small amounts of data related to a single piece of data in another table. I managed to resolve this by consolidating the GPS data into a single string and persisting it as a field with the main record. This offloaded a lot of the read/write time into a significantly smaller amount of time for rehydrating the data when it was actually needed.
I woudl definitely recommend reachign out to Jeremy and the Sterling team via the CodePlex site if none of the above help.
Have you considered keeping most data server-side, and presenting client with only a window into that data, something like 20 or 50 rows at a time?
EDIT: since the answer in no, I'd turn off the database table index while the operation is going, or use SQL bulk copy.

Wanted: DB for fast read operations to be accessed from ruby apps

Basically it's a financial database, with both daily and intraday data (date,symbol,open,high,low,close,vol,openinterest) -- very simple structure. Updates are just once a day. A typical query would be: date and close price of MSFT for all dates in DB. I was thinking that there's got to be something out there that's been optimized for lots of reads and not many writes, as opposed to a general-purpose RDBMS like MySQL. I searched rubyforge.org, and I didn't see anything that specifically addressed this (as far as I could tell).
MS SQL Server can be optimized like this with the fairly simple:
ALTER DATABASE myDatabase
SET READ_COMMITTED_SNAPSHOT ON
SQL Server will automatically cache your data in memory if it is being used heavily for reads.
You can always use a RAMdisk for your MySQL installation if your database footprint is small enough. One way to make your tables small enough to fit is to create them as MyISAM ARCHIVE tables. While they are very compact, compressed, they can only be appended to or read from, but not updated. (http://dev.mysql.com/tech-resources/articles/storage-engine.html)
Generally a properly indexed and well organized MySQL table is really fast, especially when using MyISAM, and even more so when loaded from memory. They key is in denormalizing the data as heavily as you can optimizing for your particular read scenarios.
For example, having a stock_id, date, price tuple is going to be fairly slow to sort and retrieve. If you have, instead, stock_id and a column with some serialized data, the retrieval time will be very quick.
Another solution that is likely faster is to push all the data into an alternative DBMS like Toyko Cabinet or something similar, especially if your data fits neatly into a key/value store.
Look at MySQL, but run the database from memory instead of disk. Depends on the size of your dataset and your budget, but you could then update memory from disk once a day, and have a very, very fast read time afterwards.
The best-known (to me at least!) time series database is Fame but it's expensive and I strongly doubt that there's anything like, say, an ActiveRecord implementation for it. Unless it's changed a lot in the 10 or so years since I last touched it, it isn't SQL-friendly at all.
With a fairly tightly-focused application, you can take a more flexible view of your data. For example, consider what is the information that you're actually looking to store? Is it the atomic price/hi/lo/close/vol/whatever, or is it more appropriately a time series of such values? If you always want to view the series, store a series per row, not a value.
Throwing a few ideas out here...
How might it look if you stored a year or a month of a single value for a single stock in one row? Maybe as an XML string, or JSON or something more terse of your own devising. Compressed CSV, perhaps? That ought to fit a month's values into a 255-character column. (Use something like Huffman coding to do the encoding, perhaps - a single dictionary ought to work for all instances of such similar data).
You can still hold a horizontal view as well: with the extremely low update rate you'll have (should only be data fixes, I'd guess) you can probably stand to build that stuff.
There's an obvious downside to this: you'll have a bunch of extra work to do.
I don't have any personal experience, but MogoDB claims to offer relational-style flexibility with key-value performance.
As mentioned elsewhere key-value database might be worth looking at: Tokyo Cabinet, CouchDB or one of the others again, perhaps, with concatenated value for the time series.

Oracle Hierarchical Query Performance

We're looking at using Oracle Hierarchical queries to model potentially very large tree structures (potentially infinitely wide, and depth of 30+). My understanding is that hierarchal queries provide a method to write recursively joining SQL but they it does not provide any real performance enhancements over if you were to manually write an equivalent query... is this the case? What sort of experiences have people had, performance wise, with using oracle hierarchical queries?
Well the short answer is that without the hierarchical extension (connect by) you couldn't write a recursive query. You could programmitically issue many queries which were recurisively linked.
The rule of thumb with everything database is, especially oracle, is that if you can issue your result in a single query it will almost always be faster than doing it programatically.
My experiences have been with much smaller sets, so I can't speak for how well heirarchical queries will perform for large sets.
When doing these tree retrievals, you typically have these options
Query everything and assemble the tree on the client side.
Perform one query for each level of the tree, building on what you know that you need from the previous query results
Use the built in stuff Oracle provides (START WITH,CONNECT BY PRIOR).
Doing it all in the database will reduce unnecessary round trips or wasteful queries that pull too much data.
Try partitioning the data within you hierarchical table and then limiting the partition included in the query.
CREATE TABLE
loopy
(key NUMBER, key_hier number, info VARCHAR2, part NUMBER)
PARTITION BY
RANGE (part)
(
PARTITION low VALUES LESS THAN (1000),
PARTITION mid VALUES LESS THAN (10000),
PARTITION high VALUES LESS THAN (MAXVALUE)
);
SELECT
info
FROM
loopy PARTITION(mid)
CONNECT BY
key = key_hier
START WITH
key = <some value>;
The interesting problem now becomes your partitioning strategy. Oracle provides several options.
I've seen that using connect by can be slow but compared to what? There isn't really another option except building a result set using recursive PL/SQL calls (slower) or doing it on your client side.
You could try separating your data into a mapping (hierarchy definition) and lookup tables (the display data) and then joining them back together. I guess I wouldn't expect much of a gain assuming you are getting the hierarchy data from indexed fields but its worth a try.
Have you tried it using the connect by yet? I'm a big fan of trying different variations.

Resources