MongoDB Index definition strategy - performance

I have a MongoDB-based database with roughly 100K to 500K text documents inside, and the collection keeps growing. The system should support queries by different fields of the documents, e.g. title, category, importance, etc.
The system is a near real-time system, which receives new documents every 5-10 minutes.
In order to boost query performance, is it a good idea to define a separate index for each frequently queried field (field types: short text, numeric, date) of the document? Or are there other best practices for boosting query performance in MongoDB?

You should create indexes based on the queries you are trying to serve.
It is a good idea to have different indexes for the different fields you query at different times.
But keep in mind that indexes occupy RAM: the more indexes you create, the more memory they will use. Also consider the field order within a compound index for better search performance.
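As a rough pymongo sketch of the per-field approach (the field names title, category, and importance come from the question; the connection string and database/collection names are assumptions):

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["mydb"]["documents"]                 # hypothetical db/collection names

# One single-field index per frequently queried field.
coll.create_index([("title", ASCENDING)])
coll.create_index([("category", ASCENDING)])
coll.create_index([("importance", DESCENDING)])

# Inspect what already exists before adding more.
print(coll.index_information())
```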
When developing your indexing strategy you should have a deep understanding of your application’s queries. Before you build indexes, map out the types of queries you will run so that you can build indexes that reference those fields. Indexes come with a performance cost, but are more than worth the cost for frequent queries on large data sets. Consider the relative frequency of each query in the application and whether the query justifies an index.
The best overall strategy for designing indexes is to profile a variety of index configurations with data sets similar to the ones you’ll be running in production to see which configurations perform best. Inspect the current indexes created for your collections to ensure they are supporting your current and planned queries. If an index is no longer used, drop the index.
Some of the Strategies to choose while creating:
Create Indexes to Support Your Queries
An index supports a query when the index contains all the fields scanned by the query. Creating indexes that support your queries results in greatly increased query performance.
Use Indexes to Sort Query Results
To support efficient queries, use the strategies here when you specify the sequential order and sort order of index fields.
Ensure Indexes Fit in RAM
When your index fits in RAM, the system can avoid reading the index from disk and you get the fastest processing.
Create Queries that Ensure Selectivity
Selectivity is the ability of a query to narrow results using the index. Selectivity allows MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
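To make these strategies concrete, here is a hedged pymongo sketch: a compound index that supports both an equality filter and a sort, plus a check of the total index size against available RAM (the database, collection, and field names are assumptions):

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient()["mydb"]     # hypothetical database
coll = db["documents"]         # hypothetical collection

# Supports queries like: find({"category": ...}).sort("publishedAt", -1)
coll.create_index([("category", ASCENDING), ("publishedAt", DESCENDING)])

# "Ensure Indexes Fit in RAM": compare total index size with the memory you can spare.
stats = db.command("collstats", "documents")
print("total index size (bytes):", stats["totalIndexSize"])
```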

Related

Do I need to split order data into multiple time based index in Elasticsearch?

I am planning to use Elasticsearch to store user order data. There could be 20 million orders per year in my system, and 20 million orders take up roughly 10 GB.
My question is whether I should create one index to hold all the order data. I have read in the ES docs that it is best to keep around 20 GB of data per primary shard. If I create one index with 5 primary shards, does that mean I can safely store 100 GB (200 million orders) in this index?
Another approach is to create an index per year, for example order-2020, order-2021, order-2022, etc., with fewer primary shards per index. I understand this pattern helps if I want to apply a retention period to my order data, but apart from that, what other benefits does it offer?
From a query performance perspective, which approach is better?
In terms of search speed and aggregation accuracy, a multi-index, multi-shard setup will inevitably have some overhead, but for data manageability it is recommended to split the data by year. You can use an alias to tie the yearly indexes together; the loss in query performance is much smaller than the loss in aggregation accuracy.
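A minimal sketch of the per-year layout with the 8.x-style Python client (the order-YYYY names come from the question; the shard counts, alias name, and field names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

for year in (2020, 2021, 2022):
    es.indices.create(
        index=f"order-{year}",
        settings={"number_of_shards": 1, "number_of_replicas": 1},  # assumed sizing
        aliases={"orders": {}},  # one alias spans all the yearly indexes
    )

# Search the whole history through the alias...
es.search(index="orders", query={"match_all": {}})

# ...or hit a single year directly when the time range is known.
es.search(index="order-2022", query={"match": {"status": "paid"}})  # hypothetical field
```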

Does a TimescaleDB index work the same as in PostgreSQL?

I am testing a PostgreSQL extension named TimescaleDB for time-series data.
If I read the PostgreSQL documentation right, a query such as
WHERE x = 'somestring' AND timestamp BETWEEN 't1' AND 't2'
will work best with an index on (x, timestamp), and running EXPLAIN on that SQL query shows that it does.
When I try the same query on a TimescaleDB hypertable that contains the same data but has no (x, timestamp) index, the performance is about the same (if not better). After creating the (x, timestamp) index, the performance does not improve.
I understand that the hypertable has a built-in timestamp index. So I should use a different strategy for adding indexes to the table, for example an index on just (x). Is that right?
A few things about how TimescaleDB handles queries:
The primary way that time-based queries get improved performance is through chunk exclusion. Data is partitioned by time into chunks so that when a query for a particular time range is executed, the planner can ignore chunks that have data outside of that time range. Indexes are then applied for the chunks that are being searched.
If you are searching a time range that includes all chunks, chunk exclusion does not apply, and so you get query times closer to standard PostgreSQL.
If your query matches a large number of the rows in the chunks being scanned, the query planner may choose a sequential scan instead of an index scan to save on I/O operations (https://github.com/timescale/timescaledb/issues/317).
There is nothing inherently special about the built-in indexes; you can drop them after hypertable creation or turn them off when running create_hypertable (see the Timescale API docs).
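As an illustration of the points above, here is a hedged psycopg2 sketch that sets up a hypertable, adds an (x, timestamp) index, and asks the planner how it handles a bounded time range (the table and column names follow the question's example; everything else is an assumption):

```python
import psycopg2

conn = psycopg2.connect("dbname=tsdb")  # assumed connection string
cur = conn.cursor()

# Hypertable partitioned on the time column; chunks provide the time-range pruning.
cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        x TEXT NOT NULL,
        "timestamp" TIMESTAMPTZ NOT NULL,
        value DOUBLE PRECISION
    );
""")
cur.execute("SELECT create_hypertable('readings', 'timestamp', if_not_exists => TRUE);")

# Since the time dimension is already chunked, the index you add yourself is the
# one on the non-time column (or a composite such as (x, timestamp DESC)).
cur.execute('CREATE INDEX IF NOT EXISTS readings_x_ts ON readings (x, "timestamp" DESC);')

# Check whether chunk exclusion kicks in for a bounded time range.
cur.execute("""
    EXPLAIN
    SELECT * FROM readings
    WHERE x = 'somestring' AND "timestamp" BETWEEN '2021-01-01' AND '2021-01-02';
""")
print("\n".join(row[0] for row in cur.fetchall()))
conn.commit()
```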

In Elasticsearch, should I create a single index with multiple types, or multiple indexes with a single type?

I am new to Elasticsearch. I am using Elasticsearch for big data.
There are no join queries in my application, so which structure is best for my application?
I have been working with Elasticsearch for the past few days and would like to share my experience/learnings.
1) If you are moving from a relational DB like MySQL or SQL Server to ES, you need to maintain the relations among your data. Declare the primary key in the different types or indexes, on the basis of which you can build your Query DSL queries.
2) If you are dealing with millions of documents every day, you need to design accordingly. Some people prefer a duration-based structure, split per day, week, or month; it totally depends on your use case. For a large data set (~1 TB) you need to distribute your data across multiple indexes and shards.
3) If you have a small data set, the default settings (5 shards, 1 replica) will work too. Performance is better when the data set in each shard is small.
4) JOIN queries can be expensive in Elasticsearch, and performing them frequently can put pressure on your heap. So I would suggest preparing your data set with pre-cooked data (the result you would get from a join query in a relational DB) and one document per unique ID; a sketch is shown after this list. Refer to the Elasticsearch documentation to see how joins can be performed.
5) There are some points you need to take care of while designing your index:
Don't treat Elasticsearch like a database
Know your use case BEFORE you jump in
Organize your data wisely
Make smart use of replicas
Base your capacity plans on experiment
6) A wrong architecture can force a reindex, which is costly and involves downtime. Check out this article to learn about index design and best practices.
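A hedged sketch of the 'pre-cooked data' idea from point 4, using the 8.x-style Python client to index a denormalized document that already contains the fields a relational join would have produced (all index, id, and field names here are made up for illustration):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Instead of joining orders with customers at query time, store the joined
# result as one flat document keyed by a unique id.
order_with_customer = {
    "order_id": 42,
    "amount": 99.5,
    "customer": {"id": 7, "name": "Alice", "segment": "retail"},
}
es.index(index="orders_denormalized", id="order-42", document=order_with_customer)

# The "join" is now a plain filter on the embedded fields.
es.search(index="orders_denormalized",
          query={"match": {"customer.segment": "retail"}})
```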

Elasticsearch - implications of splitting documents into separate indexes

Let's say I have 100,000 documents from different customer groups, which are formatted the same with the same type of information.
Documents from individual customer groups get refreshed at different times of the day. I've been recommended to give each customer group its own index, so that when an individual customer's index is refreshed locally I can create a new index for that customer and delete the old one.
What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:
Will it increase my server HDD requirements?
Will it increase my server RAM requirements?
Will Elasticsearch searches be slower when querying all the indexes through the alias?
Thank you for any help or advice.
Every index has some overhead on all levels but it's usually small. For 100,000 documents I would question the need for splitting unless these documents are very large. In general each added index will:
Require some amount of RAM for insert buffers and other per-index related tasks
Have its own merge overhead on disk relative to a single larger index
Provide some latency increase at query time due to result merging if a query spans multiple indexes
There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.
I would advise you to build a solution that uses the minimum number of shards as possible. That probably means one (or at least only a few) index(es).
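If you do split per customer group, the rebuild-and-swap step described in the question can be made atomic with an alias. A hedged sketch with the 8.x-style Python client (the index and alias names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Rebuild customer A's data into a fresh index, then repoint the search alias in
# one call so queries against "all_customers" never see a half-built index.
es.indices.create(index="customer_a_v2")
# ... bulk-load the refreshed documents into customer_a_v2 here ...

es.indices.update_aliases(actions=[
    {"remove": {"index": "customer_a_v1", "alias": "all_customers"}},
    {"add": {"index": "customer_a_v2", "alias": "all_customers"}},
])
es.indices.delete(index="customer_a_v1")
```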

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seem to be some documents which point to 'rolling indexes' (ref1, ref2, ref3) whereby I may create a single index (each index named tweets1 -> N) on the fly. When one index fills up, I can simply add a new machine with a new index, add it to the same cluster, and add it to the alias used for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?
A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard while still getting the query performance you want, decide how many shards per index you want to have, multiply those numbers, and cap each index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
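For reference, the index-template trick mentioned above can look roughly like this with a recent Python client (the original answer predates composable templates, but the idea is the same; the file_* pattern mirrors the snippet, and the settings are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Any write to a not-yet-existing file_N index picks up these settings automatically,
# so rolling over is just a matter of computing a new index name.
es.indices.put_index_template(
    name="file_rolling",
    index_patterns=["file_*"],
    template={
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},  # assumed sizing
        "aliases": {"files": {}},  # a search alias spanning every rolled index
    },
)
```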
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need enough nodes in your cluster to parallelize the query so that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then Elasticsearch can perform id-based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >= 1.4.0) it will already use predictable ids (flake ids). This also speeds up indexing. Random ids create a worst-case scenario.
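To illustrate the id-based lookup point, a hedged sketch in the spirit of the snippet above (the per-index cap and the file_N naming mirror the earlier code; the values are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint
docs_per_index = 1_000_000                   # assumed cap per rolled index

# With sequential ids the owning index is computable, so a lookup is a single
# shard-routed GET instead of a search across every index in the cluster.
doc_id = 123_456
doc = es.get(index=f"file_{doc_id // docs_per_index}", id=str(doc_id))
```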
If your queries are going to be time-based then, under the document-count scheme, every query will have to search the entire set of indices. For time-based queries you want to roll your indices over based on some amount of time (e.g. each day or month, depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range.
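A small sketch of the time-based narrowing described above: derive the monthly index names from the requested range and query only those (the tweets_YYYY_MM naming follows the answer; the endpoint and field name are assumptions):

```python
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def monthly_indices(start: date, end: date, prefix: str = "tweets") -> list[str]:
    """Index names like tweets_2015_01 covering every month in [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

indices = monthly_indices(date(2015, 1, 10), date(2015, 3, 5))
# -> ['tweets_2015_01', 'tweets_2015_02', 'tweets_2015_03']

es.search(index=",".join(indices),
          query={"range": {"created_at": {"gte": "2015-01-10", "lte": "2015-03-05"}}})
```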
