I use Elasticsearch to store system vulnerabilities. Right now my typical entry is
{
  _id: somerandomid
  _source: {
    "ip": "10.10.10.10",
    "vuln_name": "v1",
    "vuln_type": 1
  }
}
This approach has the advantage of simplifying queries ("number of machines with a vuln of type 1" -> an aggregation; "number of vulnerabilities" -> a match_all search and the associated total value; ...).
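For illustration, a sketch of the first example under this scheme (the index name vulns is made up here, and ip is assumed to be mapped as keyword or ip so it can be aggregated on):

POST vulns/_search
{
  "size": 0,
  "query": { "term": { "vuln_type": 1 } },
  "aggs": {
    "machines": { "cardinality": { "field": "ip" } }
  }
}

The cardinality aggregation counts distinct machines, while hits.total gives the number of matching vulnerability entries.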
It also has drawbacks, in particular:
the information is heavily duplicated: everything about one host is copied into each of its vulnerability entries
there are as many documents as vulnerabilities, not hosts (50x more on average)
the natural container is the "host", not the "vulnerability": a host can be updated, deleted, etc. more easily.
I am therefore considering changing the scheme to a "host"-based one:
{
  _id: machine1
  _source: {
    "ip": "10.10.10.10",
    "vuln": [
      { "name": "v1", "type": 1 },
      { "name": "v2", "type": 1 }
    ]
  }
}
The problem I am running into is that I still fundamentally query vulnerabilities and do not know how to "explode" them in a query.
Specifically (I believe my problem will gravitate around this family of queries), how can I query:
the total number of vulnerabilities of type 1 (not the hosts - there can be several vulns of type 1 per host, and the basic query retrieves entries, which are hosts)
the same as above, but with some filtering on, say, the vulnerability name (number of vulnerabilities of type 1 with "Microsoft" in the name) - the filtering is on a feature of the vulnerability, not of the host
Just to give you a simple overview: in Elasticsearch you have two ways to manage nested data, Nested Objects and Inner Objects, and behind the scenes they are completely different.
The nested type is a specialized version of the object datatype that allows arrays of objects to be indexed and queried independently of each other.
Nested docs are stored in the same Lucene block as each other, which helps read/query performance.
Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
"Cross referencing" nested documents is impossible
Best suited for data that does not change frequently
An Inner Object is an object embedded inside the parent document:
Easy, fast, performant
Only applicable when one-to-one relationships are maintained
No need for special queries
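To get the nested behaviour described above for the host-based scheme from the question, the vuln array must be declared as nested explicitly, since arrays of objects default to inner objects. A minimal sketch using the question's field names (the index name hosts is made up; this is the typeless 7.x+ mapping syntax, older versions need a mapping type):

PUT hosts
{
  "mappings": {
    "properties": {
      "ip": { "type": "ip" },
      "vuln": {
        "type": "nested",
        "properties": {
          "name": { "type": "text" },
          "type": { "type": "integer" }
        }
      }
    }
  }
}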
Please have a look at the following link for further information on the difference between Inner Objects and Nested Objects:
https://www.elastic.co/blog/managing-relations-inside-elasticsearch
In order to query and aggregate (to get the total number), have a look at the following links:
Query: https://www.elastic.co/guide/en/elasticsearch/guide/master/nested-objects.html
Aggregations: https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html
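Applied to the original question, a hedged sketch (same made-up hosts index and nested mapping as above): a nested aggregation steps down into the vuln documents, so doc_count counts vulnerabilities rather than hosts, and further filters can target vulnerability fields:

POST hosts/_search
{
  "size": 0,
  "aggs": {
    "vulns": {
      "nested": { "path": "vuln" },
      "aggs": {
        "type_1": {
          "filter": { "term": { "vuln.type": 1 } },
          "aggs": {
            "name_microsoft": {
              "filter": { "match": { "vuln.name": "Microsoft" } }
            }
          }
        }
      }
    }
  }
}

Here type_1.doc_count is the total number of type-1 vulnerabilities across all hosts, and name_microsoft.doc_count is the subset with "Microsoft" in the name.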
I was working on a project that needed to index a bunch of Products and their Variants into ElasticSearch. Variants have the same schema as Products in the DB. So naturally, I started by designing a mapping that is exactly the same as the Product schema and indexing products and variants as their own documents.
But later, when I accidentally tried to index variants as nested objects inside products, the indexing process was 3x-5x faster (tested several times locally with 1000 products & 5 variants, 2000 products & 10 variants, and 25000 products & 5 variants). The mapping looks something like the below:
id: keyword
name: text
sku: keyword
price: long
color: keyword
...
variants: [
{
id: keyword
name: text
sku: keyword
price: long
color: keyword
...
}
]
So the question is why that is. Since the data size would be the same, I would expect a nested mapping to cause a longer indexing time due to having 2x the fields. Also, I'm using the _bulk API to index products with their variants in each API call, so the request count is the same.
Thanks in advance for any suggestions on why this is.
PS: I'm running ElasticSearch 6.7 locally
Just trying to answer the question, "why indexing time is different."
Nested documents are indexed differently. Internally nested documents are indexed as separate documents - but indexed as a single block within Lucene.
Suppose your document contains two variants in the nested data structure. In that case, the total number of documents indexed will be 3 (1 parent doc + 2 variants as separate docs) - internally via Lucene's addDocuments() call. This guarantees the documents are indexed in a single block and are available to query using the nested query (the nested query joins these documents at runtime).
This results in different indexing behavior. In your case it got faster, but if you have, say, thousands of variants per product, too many nested structures can give you indexing problems. There are limits in place to avoid this.
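For example, one such safety valve (available since roughly the 6.5 line, so it should apply to the 6.7 setup in the question) caps the number of nested objects a single document may contain; it is an index setting:

PUT products
{
  "settings": {
    "index.mapping.nested_objects.limit": 10000
  }
}

10000 is, as far as I recall, the default; a document exceeding the limit is rejected at index time rather than being allowed to degrade the cluster.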
I am new to Elasticsearch (you will figure that out after reading the question!) and I need help in designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for companies in Russell 2000 Index. To define an index for these companies, I have the following mapping -
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "text" },
        "name": { "type": "text" },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
As the CEO of a company changes, I want to update the end date of the existing document and add a new document with the new start date.
Here,
(1) For such a dataset, what is an ideal ID scheme? Since I want to keep multiple documents, should I consider a (company_id + date) combination as the ID?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example common words like "the" will be dropped and word endings like "ing" will be removed. Depending on how you want to search for names in your index you may also want to switch that to keyword. Also note that you have the option of indexing a field twice using BOTH keyword and text if you need to support both search methods.
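A sketch of that last option, with ticker indexed both ways (the sub-field name text is arbitrary):

"ticker": {
  "type": "keyword",
  "fields": {
    "text": { "type": "text" }
  }
}

Exact terms queries then target ticker, while analyzed full-text queries target ticker.text.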
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get way more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). Also, I find that loading/updating data into ES is slower than an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then adding a new field that combines the company ID and date would probably work fine.
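If you do end up supplying your own IDs, a sketch of that combined scheme (the index name companies, the ID format, and all values are made up for illustration):

PUT companies/company/ACME-2017-06-01
{
  "ticker": "ACME",
  "name": "Acme Corp",
  "CEO": "J. Smith",
  "CEO_start_date": "2017-06-01"
}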
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.
So I have the following problem, which I have been trying to solve for the last two days. I have a Python script which parses logs and inserts data into Elasticsearch, dynamically creating indices via the bulk function.
The problem is that my mapping has one property of "type": "nested", something like a "users" field. As soon as I add "type": "nested" to this property, I can't query the objects from Kibana or create any visualization (because nested objects are separate documents, if I'm not making mistakes). The first thing I tried was adding an additional "include_in_parent": true parameter to the users field, but as a result I got "wrong" queries (i.e. running something like +users.name: 'test' +users.age: 30 would return ANY document which has those two field values somewhere, not documents where they refer to ONE user object). The visualizations were obviously wrong too.
The second solution I found was adding a parent-child relationship. But this could potentially be a waste of time, as I don't know whether it will result in correct queries. So I'm asking: would that be a reasonable solution to my problem?
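For reference, the mapping the question describes would look roughly like this in the 2.x-era syntax, where include_in_parent was a nested-type option (field names taken from the question):

"users": {
  "type": "nested",
  "include_in_parent": true,
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  }
}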
Found out that Kibana doesn't support nested objects.
But ppadovani made this fork which supports this feature.
https://github.com/homeaway/kibana/tree/nestedSupport-4.5.4
I store log data in Elasticsearch and my records, among other data, contain lists of values. At first I represented these lists with regular arrays in Elasticsearch, but I soon realised that the flattening, in combination with Lucene's inverted index, made average aggregations on a list such as [1,1,1,1,5] come out completely wrong, since the inverted index only contained [1,5]. Clearly avg([1,5]) is different from avg([1,1,1,1,5]).
Seeking out solutions I turned to nested documents, which do not flatten the data.
I now have my nested documents in Elasticsearch looking something along the lines of:
"nested_documents": [
{ "list1": 1, "list2": 2},
{ "list1": 3, "list2": 4}
]
Using the nested aggregation I am able to do aggregations such as:
"aggs": {
"nested_aggregation": {
"nested": {
"path": "nested_documents"
},
"aggs": {
"average_of_list1": {
"avg": {
"field": "nested_documents.list1"
}
}
}
}
This now gives me the correct result over the entire data set. However, I have another requirement as well.
I would like to achieve things like max(avg(nested_documents.list1)), i.e. the maximum, across documents, of the per-document average of a nested field. I imagined I could use a script to achieve this, but I can't find a way to access the nested documents in scripts. I did achieve the desired result using a script and _source, but this was way too slow to be used in production on my datasets.
The only simple (and fast) solution I can imagine is to calculate the averages before storage and store them alongside the actual lists, but that doesn't feel right.
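For what it's worth, that fallback would at least be trivial to query: with a precomputed list1_avg field (name made up) stored on each document, the max-of-averages reduces to an ordinary max aggregation:

"aggs": {
  "max_of_avg_list1": {
    "max": { "field": "list1_avg" }
  }
}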
Aggregating over aggregation results is not yet supported in Elasticsearch. Apparently there is a concept called reducers that is being developed for 2.0. I would suggest having a look at scripted metric aggregations. Basically, you can create your own aggregation by controlling the collection and computation aspects yourself using scripts.
Have a look at the following question for an example of this aggregation: Elasticsearch: Possible to process aggregation results?
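To sketch the shape such a scripted metric could take for the max-of-averages here (Groovy-era, pre-2.0 syntax; it reads the nested values from _source, which the asker already found slow, so this is an untested illustration of the mechanism rather than a production answer):

"aggs": {
  "max_of_avg": {
    "scripted_metric": {
      "init_script": "_agg.max = null",
      "map_script": "sum = 0; vals = _source.nested_documents; for (v in vals) { sum += v.list1 }; avg = sum / vals.size(); if (_agg.max == null || avg > _agg.max) { _agg.max = avg }",
      "combine_script": "return _agg.max",
      "reduce_script": "m = null; for (a in _aggs) { if (a != null && (m == null || a > m)) { m = a } }; return m"
    }
  }
}

Each shard tracks its maximum per-document average in the map/combine phases, and the reduce phase takes the maximum across shards.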
Referring to this question here:
I am working on a similar site using MongoDB as my main database. As you can imagine, each user object has a lot of fields that need to be searchable, say for example mood, city, age, sex, smoker, drinker, etc.
Now, apart from the problem that there cannot be more than 64 indexes per collection, is it wise to assign an index to every one of my fields?
There might be another viable way of doing it: tags (refer to this other question). If I set the index on an array of predetermined tags and then text-search over them, would that be better, since I would be using only ONE index? What do you think? E.g.:
{
name: "john",
tags: ["happy", "new-york", "smoke0", "drink1"]
}
MongoDB doesn't (yet) support index intersection, so the rule is: one index per query. Some of your query parameters have extremely low selectivity, the extreme example being the boolean ones, and indexing those will usually slow things down rather than speed them up.
As a simple approximation, you could create a compound index that starts with the highest-selectivity fields, for instance { "city": 1, "age": 1, "mood": 1, ... }. However, you will then always have to include a city constraint: if you query for {age, mood} alone, the above index can't be used.
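In shell terms (collection name users assumed; ensureIndex was the spelling of the era, later renamed createIndex):

db.users.ensureIndex({ "city": 1, "age": 1, "mood": 1 })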
If you can narrow down your result set to a reasonable size using indexes, a scan within that set won't be a performance hog. More precisely, if you say limit(100) and MongoDB has to scan 200 items to fill up those 100, it won't be critical.
The danger lies in very narrow searches across the database - if you have to perform a scan on the entire dataset to find the only unhappy, drinking non-smoker older than 95, things get ugly.
If you want to allow very fine grained searches, a dedicated search database such as SolR might be a better option.
EDIT: The tags suggestion looks a bit like using the crowbar to me -- maybe the key/value multikey index recommended in the MongoDB FAQ is a cleaner solution:
{
  _id: ObjectId(...),
  attrib: [
    { k: "mood", v: "happy" },
    { k: "city", v: "new york" },
    { k: "smoker", v: false },
    { k: "drinker", v: true }
  ]
}
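The matching index and a typical query would then look like this (collection name users assumed; $elemMatch keeps the k and v constraints bound to the same array element):

db.users.ensureIndex({ "attrib.k": 1, "attrib.v": 1 })
db.users.find({ attrib: { $elemMatch: { k: "mood", v: "happy" } } })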
However, YMMV and 'clean' and 'fast' often don't point in the same direction, so the tags approach might not be bad at all.