Could you please help me with the following question:
When creating metrics and indexes in Splunk, do you have to create a single index per metric, or can they have a many-to-many relationship?
Just like a single event index can hold many types of events, so can a single metrics index hold many types of metrics.
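For illustration, a minimal sketch (index and metric names are made up): a metrics index is declared once in indexes.conf, and any number of distinct metrics can be written to and queried from it.

    # indexes.conf -- one metrics index holding many metric types
    [my_metrics]
    homePath   = $SPLUNK_DB/my_metrics/db
    coldPath   = $SPLUNK_DB/my_metrics/colddb
    thawedPath = $SPLUNK_DB/my_metrics/thaweddb
    datatype   = metric

Querying two different metrics out of the same index might then look something like:

    | mstats avg(_value) WHERE index=my_metrics
        AND (metric_name="cpu.user.pct" OR metric_name="mem.used.pct")
        BY metric_name span=1m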
If this is not the answer you seek then please clarify the question.
I have a scenario with a large number of groups, and each group has numerous documents with a few standard fields and a few dynamic fields. So I created an alias for every group on a single index. Now, as these groups grow, the dynamic fields keep increasing, and I just hit the default 1000-field limit.
I plan to create an index per group to solve the too-many-fields issue, but then I will have a too-many-indices problem.
Please let me know if someone knows a better way to handle this problem.
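(For context: the limit being hit here is Elasticsearch's index.mapping.total_fields.limit setting, which defaults to 1000 and can be raised per index. A stopgap sketch with the Python client, assuming elasticsearch-py 8.x and a hypothetical index name; raising the limit only buys headroom, it does not fix an unbounded mapping.)

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Raise the per-index mapping field limit (default 1000).
    es.indices.put_settings(
        index="group-index",  # hypothetical
        settings={"index.mapping.total_fields.limit": 2000},
    )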
I was looking through Elasticsearch and noticed that you can create an index and bulk-add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that keeping everything under one index would make it rough to query. Each row holds 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
"In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?"
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now is the time to try them and gauge their speed and the relevance of the results.
see whether the queries are fast enough and whether their results are relevant enough. If not, change the index mappings or the queries you're using, and if necessary add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis; you may not need to do the work of reading the files and bulk-indexing yourself.
if it's log or log-like data and you're mostly interested in recent results, it can be a lot faster to split up the data by date and time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try the keyword field type where appropriate in your mappings. It disables analysis on the field, which rules out full-text search within it but makes exact string matches fast. It's useful for fields like "tags" or a "status" field with values like ["draft", "review", "published"]. A sketch of the last two points follows this list.
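A minimal sketch with the Python client (index, file, and field names are made up; assumes elasticsearch-py 8.x): create a date-suffixed index whose mapping declares keyword fields, then stream the flat files in with the bulk helper.

    from datetime import date
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    # Date-suffixed name, e.g. "logs-2019-08-11"; Index Lifecycle
    # Management can automate this kind of rollover for you.
    index_name = f"logs-{date.today():%Y-%m-%d}"

    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "status": {"type": "keyword"},   # exact matches only
                "tags": {"type": "keyword"},
                "message": {"type": "text"},     # analyzed, full-text search
            }
        },
    )

    def actions(path):
        # Assumes one tab-separated record per line, up to three columns.
        with open(path) as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                yield {"_index": index_name,
                       "_source": dict(zip(("status", "tags", "message"), cols))}

    # streaming_bulk keeps memory flat even for hundreds of millions of rows.
    for ok, item in helpers.streaming_bulk(es, actions("data.tsv")):
        if not ok:
            print("failed:", item)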
Good luck!
We have a SaaS product where companies create accounts and populate their own private data. We are thinking about using Elasticsearch to let each customer search all of their own data in our system.
As an example, we would have a free-text search where the user can type anything and the API returns multiple different types of objects. E.g. they type "John" and the API returns user objects where the first name or the email contains "John"; it might also return a team object where the team name matches (e.g. "John's Team"), etc.
So my questions are:
1. Is Elasticsearch a sensible choice for what we want to do, from a concept perspective?
2. If we did use Elasticsearch, what would be the best way to index the data so we can search all of a particular customer's data? Does each customer get its own index?
3. Are there any hints on how to keep Elasticsearch in sync with the data in the database (DynamoDB)? If we index a customer's data and then update it as it changes, is it sensible to also reindex on a schedule?
Thanks!
I will try to provide general answers from my own experience with splitting customer data in Elasticsearch:
1. If you want to search through a lot of data really fast, ES is always a really good solution. It comes at the cost of a secondary data store that you will have to keep in sync with your database.
2. You can't have different data types in one index, so you would either create one index per data type and customer (careful: indices come with an overhead, so avoid creating too many with little data in them), or create one index per data type and add a property to your data, e.g. a customer number, that you can then filter on.
3. You will have to denormalize your data as much as possible to benefit from Elasticsearch.
4. As mentioned in 1, you will need to keep both in sync, and there are plenty of ways to do that. As an example, we use an event-driven approach to push critical updates into Elasticsearch as soon as possible (careful: it's not SQL, so you will always have some concurrency issues when you need read and write safety). For data that is not highly critical we use jobs that update it regularly. When you index a document with the same id, it gets completely replaced.
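To illustrate that last point, a minimal sketch with the Python client (all names are made up; assumes elasticsearch-py 8.x): indexing under a stable id is idempotent, so an event handler and a scheduled reconciliation job can both safely re-push the same record.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Re-indexing under the same id fully replaces the stored document,
    # so replays from an event stream or a nightly sync job are harmless.
    es.index(
        index="customer-data",         # hypothetical shared index
        id="dynamodb-item-123",        # stable id derived from the source row
        document={
            "customer_id": "cust-42",  # filter field for per-customer search
            "type": "user",
            "name": "John Smith",
        },
    )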
Hope this helps; feel free to ask questions.
We are planning to introduce Elasticsearch (on AWS) for our multi-tenant application. We have the options below:
1. Using one index per tenant
2. Using one type per tenant
3. All tenants share one index with custom routing
As per this blog, https://www.elastic.co/blog/found-multi-tenancy, the first option can cause memory issues, but it is not clear about the other options.
It seems that with the third option there is no data segregation, so I am not sure about security.
I believe the second option would be better, as the data would be segregated.
Help me identify the best option to proceed with Elasticsearch for multi-tenancy.
Please note that we will be leveraging AWS infrastructure.
We are considering the same question right now, and the following set of articles by Elasticsearch was very helpful.
Start here: https://www.elastic.co/guide/en/elasticsearch/guide/current/scale.html
And read through each subsequent article until you hit this one: https://www.elastic.co/guide/en/elasticsearch/guide/current/finite-scale.html
The following two were very eye-opening for me:
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/one-big-user.html
The basic takeaway:
Alias per customer
Shard routing
Now you can have dedicated indexes for big customers and shared indexes for little customers, and they all appear to be separate indices.
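A sketch of that pattern with the Python client (index, alias, and field names are made up; assumes elasticsearch-py 8.x): a filtered alias plus a routing value makes one shared index look and act like a per-customer index.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One shared physical index; each tenant gets a filtered, routed alias.
    es.indices.update_aliases(actions=[
        {
            "add": {
                "index": "shared-tenants",
                "alias": "customer-42",
                "filter": {"term": {"customer_id": "42"}},
                "routing": "42",  # all of this tenant's docs live on one shard
            }
        }
    ])

    # Searches against the alias only see that customer's documents and
    # only touch the shard the routing value maps to.
    es.search(index="customer-42", query={"match": {"name": "John"}})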
This link is too important not to be mentioned here:
http://www.bigeng.io/elasticsearch-scaling-multitenant/
Good architecture dilemmas, and great performance analysis / reasoning.
tl;dr: they built index groups around shard allocation filtering to segregate load across the nodes in the cluster.
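For reference, shard allocation filtering pins indices to groups of nodes via node attributes; a hedged sketch (attribute and index names are made up; assumes elasticsearch-py 8.x):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Each node declares a custom attribute in elasticsearch.yml, e.g.:
    #   node.attr.index_group: group_a
    # An index can then be restricted to nodes carrying that attribute:
    es.indices.put_settings(
        index="tenants-group-a",  # hypothetical
        settings={"index.routing.allocation.require.index_group": "group_a"},
    )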
To sum up the accepted answer and the other articles:
1) Use a shared index with custom routing via an alias.
1.1) Special case: a big client can have a dedicated index, but only if needed.
The following article covers many of the use cases in detail:
https://www.elastic.co/blog/found-multi-tenancy
The following is the conclusion on how you can do it (link source: the accepted answer):
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html
The overhead of adding indexes is well documented, but I have not been able to find good information on when to use multiple indexes with regard to the various document types being indexed.
Here is a generic example to illustrate the question:
Say we have the following entities
Products (Name, ProductID, ProductCategoryID, List-of-Stores)
Product Categories (Name, ProductCategoryID)
Stores (Name, StoreID)
Should I dump these three different types of documents into a single index, each with the appropriate Elasticsearch type?
I am having difficulty establishing where to draw the line between one and multiple indexes.
What if we add an unrelated entity, "Webpages"? Definitely a separate index?
A very interesting video explaining elasticsearch "Data Design Patterns" by Shay Banon:
http://vimeo.com/44716955
This exact question is answered at 13:40, where different data flows are examined through the concepts of type, filter, and routing.
Regards
I was recently modeling an Elasticsearch backend from scratch, and from my point of view the best option is to put all related document types in the same index.
I read that some people had problems with too many concurrent indexes (one index per type); it's better for performance and robustness to unify related types in the same index.
Besides, if the types are in the same index you can use the "_parent" field to create hierarchical models, which gives you interesting search features such as "has_child" and "has_parent", and of course you don't have to duplicate data in your model.
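For what it's worth, the "_parent" mapping described above belongs to older Elasticsearch versions; from 6.x on, the same modeling uses a join field within a single index. A rough sketch (names are made up; assumes elasticsearch-py 8.x):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # A "join" field declares the parent/child relation inside one index,
    # replacing the old "_parent" mapping.
    es.indices.create(
        index="catalog",
        mappings={
            "properties": {
                "name": {"type": "text"},
                "relation": {"type": "join",
                             "relations": {"category": "product"}},
            }
        },
    )

    # has_child: find parent (category) docs with a matching child product.
    es.search(
        index="catalog",
        query={
            "has_child": {
                "type": "product",
                "query": {"match": {"name": "camera"}},
            }
        },
    )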