Elasticsearch: Approximate quantitative difference between insert and update performance

I have read many posts here and on the internet on inserts vs. updates, but I was unable to find any quantitative statements, even order-of-magnitude ones. Let's assume I do bulk update operations on 50% of my document fields: can I expect the performance, compared to bulk index operations carrying data for all fields, to be 80%, 50%, 20%, 10%, or 1% of an insert? Just a rough number from experience would be very helpful.
Disclaimer: I understand that inserts are preferable in terms of performance, but there is often a difficult trade-off between access/query performance and complexity on the one hand and insert performance on the other, especially if you have data that you want to query in one place but whose individual components have different lifecycles. So in my case I would probably accept a certain, even significant, write performance hit to keep all the other properties of my ES index ideal.

I think there is a particular reason this is not commonly discussed.
An update (effectively an add plus a delete) does not remove the old document instantly. Instead, the old copy is merely flagged as deleted. Because of this, the cost of adding a document and the cost of updating one are not very different.
However, once enough documents are flagged as deleted, the Lucene segments are merged and the deleted documents are physically removed. Until that happens, the stale copies keep piling up, since they do not free any space.
So the performance indicator that really matters is search performance, as it is affected both in terms of results and in terms of time complexity.
More on document merging here.
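For anyone who wants a rough number of their own, a quick benchmark is easy to put together. The sketch below is Python and assumes the official elasticsearch client against a local cluster; the index name and field names are made up for illustration. It times a bulk index pass against a bulk partial-update pass (touching roughly half the fields) over the same documents, and the ratio of the two timings on your own hardware and mappings is exactly the number the question asks for.
# Rough benchmark sketch: bulk index vs. bulk partial update.
# Assumes the official `elasticsearch` Python client and a local cluster;
# the index name and field names are hypothetical.
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "perf-test"
N = 10_000

def bulk_index():
    actions = (
        {"_op_type": "index", "_index": INDEX, "_id": i,
         "_source": {"field_a": i, "field_b": f"text {i}",
                     "field_c": i % 7, "field_d": True}}
        for i in range(N)
    )
    helpers.bulk(es, actions)

def bulk_update():
    # Partial update touching about half of the fields.
    actions = (
        {"_op_type": "update", "_index": INDEX, "_id": i,
         "doc": {"field_a": i + 1, "field_b": f"text {i + 1}"}}
        for i in range(N)
    )
    helpers.bulk(es, actions)

for label, fn in [("bulk index", bulk_index), ("bulk update", bulk_update)]:
    start = time.perf_counter()
    fn()
    es.indices.refresh(index=INDEX)
    print(f"{label}: {time.perf_counter() - start:.2f}s for {N} docs")
The numbers vary widely with mapping complexity, refresh interval and merge activity, which is why measuring against your own mappings is worth the effort.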

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around, I found that this might be a case where one could store the boolean value via a parent-child relation, as this would effectively make it a separate document and thus not force re-indexing of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes. Since Elasticsearch is internally a write-once (append-only) system, every update effectively creates a new document and marks the old copy as stale, to be garbage-collected later.
Parent-child (aka join) or nested fields can help with this but they come with a significant performance hit of their own so your search performance probably will suffer.
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is: all use cases are different, and you should carefully benchmark every feasible option.
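If you do want to experiment with the join approach, a minimal sketch looks roughly like the following. It uses a recent (8.x-style) Python client; the index name, relation names and field names are all invented, and older clients pass the same payloads as a body= dict instead of keyword arguments. The idea is that the frequently flipped boolean lives on a tiny child document, so changing it only re-indexes the child.
# Sketch: keep a frequently-changed boolean on a small child document.
# Index name, relation names and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="listings", mappings={
    "properties": {
        "title": {"type": "text"},
        "active": {"type": "boolean"},        # present only on the child
        "doc_relation": {
            "type": "join",
            "relations": {"listing": "flag"}  # parent -> child
        },
    }
})

# Parent document: the large, rarely-changing part.
es.index(index="listings", id="1",
         document={"title": "Some listing", "doc_relation": "listing"})

# Child document: the tiny, frequently-changing flag.
# Children must be routed to the parent's shard.
es.index(index="listings", id="1-flag", routing="1",
         document={"active": True,
                   "doc_relation": {"name": "flag", "parent": "1"}})

# Flipping the boolean now only re-indexes the small child document.
es.update(index="listings", id="1-flag", routing="1", doc={"active": False})
Queries then need has_child/has_parent clauses to combine the two halves, which is where the extra heap usage and query-time cost mentioned above come from; as the answer says, only a benchmark on your own data will tell you whether that trade is worth it.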

CouchDB: Is it more efficient to use includeDocs, or return doc from view?

I'm new to CouchDB. We're going to have millions of documents in our database. I am wondering: is it more efficient to return the 'doc' object from the view, or return just the 'doc.id', and use '&include_docs=true'?
I'm guessing that returning 'doc.id' from the view will take up a lot less disk space for the view index, but might require an added call to the database to get the whole document. In this case, it's a decision between more speed (returning 'doc') or decreased disk space usage (returning 'doc.id').
Is this a correct assumption?
From the CouchDB wiki: https://wiki.apache.org/couchdb/HTTP_view_API
Note: include_docs will cause a single document lookup per returned view result row. This adds significant strain on the storage system if you are under high load or return a lot of rows per request. If you are concerned about this, you can emit the full doc in each row; this will increase view index time and space requirements, but will make view reads optimally fast.
So I'd say you're correct in your assumptions. The next step to consider is: will you actually use the view to fetch all the matching documents, or will you only look at a few matching documents at a time? This matters because CouchDB will build the entire view and maintain it through updates even if you only ever look at a small section of it.
One other thing to consider is how large the documents are. If the documents are small there will be little difference in emitting them, but if they are large the difference will be vast.
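To make the two options concrete, here is a small sketch against CouchDB's HTTP API using Python's requests library; the database, design document and view names are made up. The "slim" view emits only a null value and relies on include_docs=true, while the "fat" view copies the whole document into the index.
# Sketch of the two view styles discussed above.
# Database, design document and view names are hypothetical.
import requests

COUCH = "http://localhost:5984"
DB = "mydb"

# Design document with two map functions (JavaScript, stored as strings):
# "slim" emits only the key; "fat" copies the whole doc into the index.
requests.put(f"{COUCH}/{DB}/_design/example", json={
    "views": {
        "slim": {"map": "function (doc) { emit(doc.type, null); }"},
        "fat":  {"map": "function (doc) { emit(doc.type, doc); }"},
    }
})

# Option 1: small view index, one extra document lookup per returned row.
slim = requests.get(f"{COUCH}/{DB}/_design/example/_view/slim",
                    params={"key": '"invoice"', "include_docs": "true"}).json()
docs_from_slim = [row["doc"] for row in slim["rows"]]

# Option 2: larger view index on disk, but each row value already is the doc.
fat = requests.get(f"{COUCH}/{DB}/_design/example/_view/fat",
                   params={"key": '"invoice"'}).json()
docs_from_fat = [row["value"] for row in fat["rows"]]
Running both against documents of your actual size, as suggested above, is the quickest way to see which side of the speed-versus-disk-space trade-off wins for you.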

I need ideas/prior research for a persistent interator

I need some help thinking about an algorithm.
I have a collection of documents, potentially numbering in the millions. These documents are also indexed in MySQL. In extreme cases, this index needs to be rebuilt. Given the large number of documents, the reindexing needs to happen from most recent to least recent. More importantly, the reindexing needs to pick up again at the same point after a computer reboot (or the equivalent). And given that indexing a million documents can take a long time, new documents might be added during the reindexing.
This same collection could be mirrored to another server. I would like to have an auditor that would make sure that all documents exist on the mirror.
In both cases users will be accessing the system, so I can't tie up too many resources. For the first case, I would very much like to get an ETA for when it will finish.
I feel these are the same problem. But I can't get my head around how to do it efficiently and cleverly.
The brute force approach would be to have a list of the millions of documents + timestamp they were last checked/indexed. I would then pull the "next" one out of the list, check/index it, update the timestamp when done.
This seems wasteful.
What's more, given that a document might be added to the system but the list not adequately updated, we'd have to have an auditor that would make sure all documents are in the list. Which is the basic problem we are trying to solve.
I've seen such an auditor described in multiple situations, such as large nosql setups. There must be description of clever ways of solving this.
I would go, as usually turns out to be the case with efficiency problems, for a segmented index.
You can probably divide the whole DB into smaller DBs, index those, then index the indices themselves, and only re-index the segments that have changed.
For the entries added while re-indexing is underway, just keep them in a new, temporary DB and merge that DB into the big one when the re-index is finished.
You can apply this approach recursively to the smaller segments. You would have to analyse the trade-off of how many segmentation levels give you the fastest re-index time.
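A complementary building block for the "pick up again after a reboot" requirement is a checkpointed cursor: walk the documents in a stable order (most recent first, by last-modified timestamp and id) and persist the position after every batch. The sketch below is Python with SQLite standing in for the database, and the table and column names are invented; with MySQL the idea is identical.
# Sketch of a resumable (persistent) iterator over a document table.
# Table and column names are hypothetical; the checkpoint survives restarts.
import sqlite3

conn = sqlite3.connect("documents.db")
BATCH = 500

def load_checkpoint():
    conn.execute("CREATE TABLE IF NOT EXISTS reindex_checkpoint "
                 "(id INTEGER PRIMARY KEY CHECK (id = 1), "
                 " last_modified TEXT, last_doc_id INTEGER)")
    return conn.execute("SELECT last_modified, last_doc_id "
                        "FROM reindex_checkpoint WHERE id = 1").fetchone()

def save_checkpoint(last_modified, last_doc_id):
    conn.execute("INSERT OR REPLACE INTO reindex_checkpoint VALUES (1, ?, ?)",
                 (last_modified, last_doc_id))
    conn.commit()

def reindex_document(doc_id):
    pass  # placeholder for the real per-document work

def resumable_scan():
    checkpoint = load_checkpoint()  # None on the very first run
    while True:
        if checkpoint is None:
            rows = conn.execute(
                "SELECT doc_id, modified FROM documents "
                "ORDER BY modified DESC, doc_id DESC LIMIT ?",
                (BATCH,)).fetchall()
        else:
            # Keyset pagination strictly 'after' the saved position, so a
            # restart never repeats a batch and never loses its place.
            rows = conn.execute(
                "SELECT doc_id, modified FROM documents "
                "WHERE modified < ? OR (modified = ? AND doc_id < ?) "
                "ORDER BY modified DESC, doc_id DESC LIMIT ?",
                (checkpoint[0], checkpoint[0], checkpoint[1], BATCH)).fetchall()
        if not rows:
            break
        for doc_id, _modified in rows:
            reindex_document(doc_id)
        checkpoint = (rows[-1][1], rows[-1][0])
        save_checkpoint(*checkpoint)
Because progress is just a (timestamp, id) pair, the same cursor also gives a cheap ETA: count how many rows remain past the checkpoint and divide by the observed batches-per-second. The mirror-auditing case can reuse the same walk, comparing each batch of ids against the mirror instead of reindexing.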

Why is Solr so much faster than Postgres?

I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
First, Solr doesn't use B-trees. A Lucene index (Lucene being the library underlying Solr) is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. By contrast, a lookup in a standard RDBMS index costs O(log(d)), where d is the number of documents. When many documents share the same value for some field, this can be a big win.
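As a toy illustration of why the lookup cost depends on the number of distinct terms rather than the number of documents, here is a hypothetical miniature term dictionary with postings lists in Python (this is, of course, nothing like Lucene's actual segment format):
# Toy inverted index: a sorted term dictionary searched with bisection,
# mapping each term to the ids of the documents containing it.
# Purely illustrative; real Lucene segments are far more sophisticated.
from bisect import bisect_left

terms = ["honda", "mazda", "toyota"]                    # sorted, t entries
postings = {"honda": [3, 9], "mazda": [1, 4, 7, 8], "toyota": [2, 5]}

def lookup(term):
    i = bisect_left(terms, term)                        # O(log t)
    if i < len(terms) and terms[i] == term:
        return postings[term]                           # matching doc ids
    return []

print(lookup("mazda"))  # [1, 4, 7, 8]; the search cost does not grow with
                        # the number of documents that share the value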
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
But Solr can only do this because it doesn't have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
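For the vehicle-listing use case in the question, the corresponding Solr request is just a stack of filter queries over those fields. Here is a minimal sketch using Python's requests library against a hypothetical "vehicles" core with invented field names:
# Sketch of the multi-range query from the question against Solr's HTTP API.
# Core name and field names are hypothetical.
import requests

resp = requests.get("http://localhost:8983/solr/vehicles/select", params={
    "q": "*:*",
    "fq": [                      # each filter query is cached and intersected
        "mileage:[* TO 50000]",
        "price:[5000 TO 10000]",
        "make:Mazda",
    ],
    "rows": 20,
    "wt": "json",
})
for doc in resp.json()["response"]["docs"]:
    print(doc.get("make"), doc.get("price"), doc.get("mileage"))
Each fq clause hits the per-field structures described above, and the matching document sets are intersected as bit sets, which is a large part of why this style of multi-range query is so cheap for Lucene.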
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
lineno serial PRIMARY KEY,
linetext text NOT NULL,
tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the earlier timings, but based on the date it would probably have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The single-argument queries only improved by about 2%, but searching for lines with both "ladies" and "gentlemen" roughly doubled in speed, to 0.053 ms (i.e., 53 microseconds), when using the GIN (inverted) index.
Solr is designed primarily for searching data, not for storing it. This enables it to discard much of the functionality required of an RDBMS, so it (or rather Lucene) concentrates purely on indexing data.
As you've no doubt discovered, Solr lets you both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the guardian solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data that needs to be searched.
Another possible reason is that databases generally suffer from internal fragmentation; they need to perform too many semi-random I/O operations on large requests.
What that means is that, given the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to recover is widely scattered, the result takes a long time to assemble, and that seems to be what happens in databases.
Please read this and this.
Solr (Lucene) creates an inverted index, which is why retrieving data is so much faster. I have read that PostgreSQL also has a similar facility, but I'm not sure whether you used it.
The performance differences that you observed can also be attributed to "what is being searched for?" and "what are the user queries?"

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have a detrimental effect on performance, and thus that one of the indices needs to have its first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes: updating five indexes takes noticeably longer than updating just one, and you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of those indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean: you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation: having five indices with the same leading columns would not cause a "performance problem" beyond the one you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query, but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with the recent version of Sybase, but in general with all SQL servers,
the main (and almost only) performance impact indexes have is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table itself (or the clustered index) to be updated, as well as all the indexes.
With regard to SELECT queries, having "too many" indexes may have a minor performance impact, for example by introducing competition for cache space among hard disk pages. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, assuming a generally monotonic progression of the date value, is a positive thing with regard to CRUD operations, since it will keep the need for splitting/rebalancing the index tables to a minimum (most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation to assess performance issues in a more systematic fashion could probably be done relatively safely and easily without interfering much with production (unless the 10k or so records are very wide, or the queries-per-second rate is high, etc.).

Resources