I'm planning a strategy for querying millions of docs in date and user directions.
Option 1 - indexing by user. routing by date.
Option 2 - indexing by date. routing by user.
What are the differences or advantages when using routing or indexing?
One of the design patterns that Shay Banon # Elasticsearch recommends is: index by time range, route by user and use aliasing.
Create an index for each day (or a date range) and route documents on user field, so you could 'retire' older logs and you don't need queries to execute on all shards:
$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
"mappings" : {
"user_log" : {
"_routing": {
"required": true,
"path": "user"
},
"properties" : {
"user" : { "type" : "string" },
"log_time": { "type": "date" }
}
}
}
}'
Create an alias to filter and route on users, so you could query for documents of user_foo:
$ curl -XPOST localhost:9200/_aliases -d '{
"actions": [{
"add": {
"alias": "user_foo",
"filter": {"term": {"user": "foo"}},
"routing": "foo"
}
}]
}'
Create aliases for time windows, so you could query for documents this_week:
$ curl -XPOST localhost:9200/_aliases -d '{
"actions": [{
"add": {
"index": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
"alias": "this_week"
},
"remove": {
"index": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
"alias": "this_week"
}
}]
}'
Some of the advantages of this approach:
if you search using aliases for users, you hit only shards where the users' data resides
if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
no performance implications over allocation of shards
you could 'retire' older logs by simply closing (when you close indices, they consume practically no resources) or deleting an entire index (deleting an index is simpler than deleting documents within an index)
Indexing is the process of parsing
[Tokenized, filtered] the document that you indexed[Inverted Index]. It's like appendix of an text book.
When the indexed data exceeds one server limit. instead of upgrading server configurations, add another server and share data with them. This process is called as sharding.
If we search it will search in all shards and perform map reduce and return results.If we group similar data together and search some data in specific data means it reduce processing power and increase speed.
Routing is used to store group of data in particular shards.To select a field for routing. The field should be present in all docs,field should not contains different values.
Note:Routing should be used in multiple shards environment[not in single node]. If we use routing in single node .There is no use of it.
Let's define the terms first.
Indexing, in the context of Elasticsearch, can mean many things:
indexing a document: writing a new document to Elasticsearch
indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client
Judging by the context, "indexing by user" and "indexing by date" refers to having one index per user or one index per date interval (e.g. day).
Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its _id. The hash falls into the hash range of one of the shards ==> that's where the document goes.
You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).
Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"
To answer, the strategy should account for:
how indexing indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster
In practice, I've seen the following designs:
one index per fixed time interval. You'll see this with logs (e.g.
Logstash writes to daily indices by default)
one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.
I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals.
BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)
Related
We are using elasticsearch for the following usecase.
Elasticsearch Version : 5.1.1
Note: We are using AWS managed ElasticSearch
We have a multi-tenanted system where in each tenant stores data for multiple things and number of tenants will increase day by day.
exa: Each tenant will have following information.
1] tickets
2] sw_inventory
3] hw_inventory
Current indexing stratergy is as follows:
indexname:
tenant_id (GUID) exa: tenant_xx1234xx-5b6x-4982-889a-667a758499c8
types:
1] tickets
2] sw_inventory
3] hw_inventory
Issues we are facing:
1] Conflicts for mappings of common fields exa: (id,name,userId) in types ( tickets,sw_inventory,hw_inventory )
2] As the number of tenants are increasing number of indices can reach upto 1000 or 2000 also.
Will it be a good idea if we reverse the indexing stratergy ?
exa:
index names :
1] tickets
2] sw_inventory
3] hw_inventory
types:
tenant_tenant_id1
tenant_tenant_id2
tenant_tenant_id3
tenant_tenant_id4
So there will be only 3 huge indices with N number of types as tenants.
So the question in this case is which solution is better?
1] Many small indices and 3 types
OR
2] 3 huge indices and many types
Regards
I suggest a different approach: https://www.elastic.co/guide/en/elasticsearch/guide/master/faking-it.html
Meaning custom routing where each document has a tenant_id or similar (something that is unique to each tenant) and use that both for routing and for defining an alias for each tenant. Then, when querying documents only for a specific tenant, you use the alias.
You are going to use one index and one type this way. Depending on the size of the index, you consider the existing index size and number of nodes and try to come up with a number of shards in such way that they are split evenly more or less on all data holding nodes and, also, following your tests the performance is acceptable. IF, in the future, the index grows too large and shards become too large to keep the same performance, consider creating a new index with more primary shards and reindex everything in that new one. It's not an approach unheard of or not used or not recommended.
1000-2000 aliases is nothing in terms of capability of being handled. If you have close to 10 nodes, or more than 10, I also do recommend dedicated master nodes with something like 4-6GB heap size and at least 4CPU cores.
Neither approach would work. As others have mentioned, both approaches cost performance and would prevent you from upgrading.
Consider having one index and type for each set of data, e.g. sw_inventory and then having a field within the mapping that differentiates between each tenant. You can then utilize document level security in a security plugin like X-Pack or Search Guard to prevent one tenant from seeing another's records (if required).
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type which means that doc_type (_type) is deprecated.
Full explanation you can find here but in summary there are two solutions:
Index per document type
This approach has two benefits:
Data is more likely to be dense and so benefit from compression techniques used in Lucene.
The term statistics used for scoring in full text search are more likely to be accurate because all documents in the same index represent a single entity.
Custom type field
Of course, there is a limit to how many primary shards can exist in a cluster so you may not want to waste an entire shard for a collection of only a few thousand documents. In this case, you can implement your own custom type field which will work in a similar way to the old _type.
PUT twitter
{
"mappings": {
"_doc": {
"properties": {
"type": { "type": "keyword" },
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" },
"content": { "type": "text" },
"tweeted_at": { "type": "date" }
}
}
}
}
You use older version of Elastic but the same logic can apply and it would be easer for you to move to newer version when you decide to do that so I think that you should go with separate index structure or in other words 3 huge indices and many types but types as a field in mapping not as _type.
I think both strategies have pros and cons:
Multiple Indexes:
Pros:
- Tenant data is isolated from the others and no query would return results from more than one.
- If total of documents is a very big number, different smaller indexes could give a better performance
Cons: Harder to manage. If each index has few documents you may be wasting a lot of resources.
EDITED: Avoid multiple types in the same index as per comments o performance and deprecation of the feature
I have an elasticsearch index, my_index, with millions of documents, with key my_uuid. On top of that index I have several filtered aliases of the following form (showing only my_alias as retrieved by GET my_index/_alias/my_alias):
{
"my_index": {
"aliases": {
"my_alias": {
"filter": {
"terms": {
"my_uuid": [
"0944581b-9bf2-49e1-9bd0-4313d2398cf6",
"b6327e90-86f6-42eb-8fde-772397b8e926",
thousands of rows...
]
}
}
}
}
}
}
My understanding is that the filter will be cached transparently for me, without having to do any configuration. The thing is I am experiencing very slow searches, when going through the alias, which suggests that 1. the filter is not cached, or 2. it is wrongly written.
Indicative numbers:
GET my_index/_search -> 50ms
GET my_alias/_search -> 8000ms
I can provide further information on the cluster scale, and size of data if anyone considers this relevant.
I am using elasticsearch 2.4.1. I am getting the right results, it is just the performance that concerns me.
Matching each document with a 4MB list of uids is definetly not the way to go. Try to imagine how many CPU cycles it requires. 8s is quite fast.
I would duplicate the subset of data in another index.
If you need to immediately reflect changes, you will have to manage the subset index by hand :
when you delete a uuid from the list, you delete the corresponding documents
when you add a uuid, you copy the corresponding documents (reindex api with a query is your friend)
when you insert a document, you have to check if the document should be added in subset index too
when you delete a document, delete it in both indices
Force the document id so they are the same in both indices. Beware of refresh time if you store the uuid list in elasticsearch index.
If updating the subset with new uuid is not time critical, you can just run the reindex every day or every hour.
Is there any way to update all data in elasticsearch.
In below example, update done for external '1'.
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"doc": { "name": "Jane Doe", "age": 20 }
}'
Similarly, I need to update all my data in external. Is there any way or query to updating all data.
Updating all documents in an index means that all documents will be deleted and new ones will be indexed. Which means lots of "marked-as-deleted" documents.
When you run a query ES will automatically filter out those "marked-as-deleted" documents, which will have an impact on the response time of the query. How much impact it depends on the data, use case and query.
Also, if you update all documents, unless you run a _force_merge there will be segments (especially the larger ones) that will still have "marked-as-deleted" documents and those segments are hard to be automatically merged by Lucene/Elasticsearch.
My suggestion, if your indexing process is not too complex (like getting the data from a relational database and process it before indexing into ES, for example), is to drop the index completely and index fresh data. It might be more effective than updating all the documents.
I have 65000 document(pdf,docx,txt,..etc) index in elastic-search using mapper-attachment. now I want to search content in that stored document using following query:
"from" : 0, "size" : 50,
"query": {
"match": {
"my_attachment.content": req.params.name
}
}
but it will take 20-30 seconds for results. It is very slow response. so what i have to do for quick response? any idea?
here is mapping:
"my_attachment": {
"type": "attachment",
"fields": {
"content": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
}
}
}
Since your machine has 4 CPUs and the index 5 shards, I'd suggest switching to 4 primary shards, which means you need to reindex. The reason for this approach is that at any given time one execution of the query will use 4 cores. And for one of the shards the query needs to wait. To have an equal distribution of load at query time, use 4 primary shards (=number of CPU cores) so that when you run the query there will not be too much contention at CPU level.
Also, by providing the output of curl localhost:9200/your_documents_index/_stats I saw that the "fetch" part (retrieving the documents from the shards) is taking 4.2 seconds per operation on average. This is likely the result of having very large documents or of retrieving a lot of documents. size: 50 is not a big number, but combined with large documents it will make the query to return the results in a longer time.
The content field (the one with the actual document in it) has store: true and if you want this for highlighting, the documentation says
In order to perform highlighting, the actual content of the field is required. If the field in question is stored (has store set to true in the mapping) it will be used, otherwise, the actual _source will be loaded and the relevant field will be extracted from it.
So if you didn't disable _source for the index, then that will be used and storing the content is not necessary. Also there is no magic for having a faster fetch, it's strictly related to how large your documents are and how many you want to retrieve. Not using store: true might slighly improve the time.
From nodes stats (curl -XGET "http://localhost:9200/_nodes/stats") there was no indication that the node has memory or CPU problems, so everything boils down to my previous suggestions.
I'm having issue with scoring: when I run the same query multiple times, each documents are not scored the same way. I found out that the problem is well known, it's the bouncing result issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all the nodes are using ES 2.3 and we're heavily using nested document - the example query doesn't use them, for simplicity.
I tried to resolve it by using the preference search parameter, with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
"query": {
"term": {
"has_account": {
"value": "twitter"
}
}
}
}
I end up having the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents are coming from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_search_then_fetch, which should generate the scoring based on the whole index, across all shards, but I still get different scoring for each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them have "fixed" the problem... I'll need some investigations on why it went out of sync
Edit 21/10/2016
Regarding the "preference" option not being taken into account, it's linked to the AWS zone awareness: if the preferred replica is in another zone than the client node, then the preference will be ignored.
The differences between the replicas are "normal" if you delete (or update) documents, from my understanding the deleted document count will vary between the replicas, since they're not necessarily merging segments at the same time.