Creating an inverted index of a numeric field in Elasticsearch - elasticsearch

I have a dataset of about 20 million records, with the following structure:
{"id": "123",
"cites":[
{"id":"234", "date":"2018-05-04"},
{"id":"456","date":"2018-02-01"}]
}
and I would like to make an index where I can see the list of the ids that cite an article, something like
{"id":"234", "cited_by":[{"id":"123"},{"id":"188"}]}
I understand this is technically an inverted index. It can be static, so it only needs to be computed once. The only documentation I have found about inverted indices covers terms and their frequencies in text, which is a very different use case.
I looked into using aggregations, but the number of distinct ids is so large that it runs out of buckets, and I am not sure 20 million buckets are possible or a good idea.
How could I generate this index? Is it possible in Elasticsearch, or would I need to write an external script that does this in batches?
Thank you so much!

Elasticsearch can handle this case without any problem.
A request to create the index:
PUT /city_index
{
"mappings": {
"citydata": {
"dynamic": "false",
"properties": {
"id": {
"type": "keyword"
},
"cited_by": {
"type": "object",
"properties": {
"id": {
"type": "keyword"
}
}
}
}
}
}
}
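The mapping above only defines where the inverted data would live; the inversion itself still has to be computed somewhere. Below is a minimal sketch of doing that in an external Python script, assuming the records can be read as dictionaries shaped like the example in the question. The function names invert_citations and to_cited_by_docs are hypothetical, and the second sample record is made up for illustration. At 20 million records the in-memory dictionary would probably have to give way to batched partial updates.

```python
from collections import defaultdict


def invert_citations(records):
    """Map each cited id to the sorted list of ids that cite it."""
    cited_by = defaultdict(set)
    for record in records:
        for cite in record.get("cites", []):
            cited_by[cite["id"]].add(record["id"])
    return {cited: sorted(citing) for cited, citing in cited_by.items()}


def to_cited_by_docs(inverted):
    """Shape the inverted map like the documents the question asks for."""
    return [
        {"id": cited, "cited_by": [{"id": c} for c in citing]}
        for cited, citing in sorted(inverted.items())
    ]


# Sample input: the first record is from the question, the second is invented.
records = [
    {"id": "123", "cites": [{"id": "234", "date": "2018-05-04"},
                            {"id": "456", "date": "2018-02-01"}]},
    {"id": "188", "cites": [{"id": "234", "date": "2018-03-10"}]},
]
docs = to_cited_by_docs(invert_citations(records))
print(docs[0])
```

Each resulting dictionary could then be indexed with the cited id as its document id, so that re-running the script overwrites rather than duplicates.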

Related

Elasticsearch Index Design based on very big nested array (could have more than 300,000 records)

We have the following index schema:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"data": {
"properties": {
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
},
"type": "nested"
}
}
}
}
Id is a unique identifier. The data field is an array that could hold 300,000 objects or more. Is this a sensible and correct way to index this kind of data? Or should we change our design and use a schema like the following:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
}
}
}
In this design we can't use Id as the document id, because Id would repeat: with 300,000 FieldName and FieldValue pairs for one Id, that Id would appear 300,000 times. The challenge is then to generate our own custom id by some mechanism, because we need to handle both inserts and updates.
In the first approach a single document would be very large, since it could contain an array of 300,000 objects or more.
In the second approach we would have too many documents. We currently have 75370611530 FieldName and FieldValue pairs. How should we handle this kind of data? Which approach would be better? What should the shard size be for this index?
I noticed that the current mapping is not nested. I assume it would need to be nested, since the query seems to be "find value for key = key1".
If 300K objects per document are expected, the nested design may not be a good idea. The default limit on nested objects per document (index.mapping.nested_objects.limit) is 10,000. In addition to possibly slow queries, this approach is likely to cause indexing problems.
I doubt that indexing 75 billion documents for this purpose is worthwhile given the resources required, though it is feasible and will work.
Maybe consider an RDBMS?
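For the flat design, the custom id mentioned in the question could be derived from the data itself, so that inserts and updates of the same pair always land on the same document. The sketch below assumes that the combination of Id and FieldName is unique, which the question does not state. The helper names make_doc_id and flatten are hypothetical.

```python
import hashlib


def make_doc_id(parent_id, field_name):
    """Derive a stable document id from the parent Id and the FieldName."""
    # A separator that is unlikely to occur in field names keeps
    # (1, "23x") and (12, "3x") from colliding.
    key = f"{parent_id}\x1f{field_name}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()


def flatten(parent_id, title, data):
    """Yield one (doc_id, document) pair per FieldName/FieldValue entry."""
    for entry in data:
        doc = {
            "Id": parent_id,
            "title": title,
            "FieldName": entry["FieldName"],
            "FieldValue": entry["FieldValue"],
        }
        yield make_doc_id(parent_id, entry["FieldName"]), doc


# Invented sample data, for illustration only.
pairs = list(flatten(7, "example", [
    {"FieldName": "color", "FieldValue": "red"},
    {"FieldName": "size", "FieldValue": "large"},
]))
```

Because the id is a pure function of its inputs, re-indexing an updated FieldValue overwrites the existing document instead of creating a new one.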

Elasticsearch: Is it possible to reference properties in mappings?

I have a simple mapping in elasticsearch-6, like this.
{
"mappings": {
"_doc": {
"properties": {
"#timestamp": {
"type": "date"
},
"fields": {
"properties": {
"meta": {
"properties": {
"task": {
"properties": {
"field1": {
"type": "keyword"
},
"field2": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
}
}
Now I have to add another property to it - tasks which is just an array of the task property already defined.
Is there a way to reference the properties of task so that I don't have to duplicate all the properties? Something like:
{
"fields": {
"properties": {
"meta": {
"properties": {
"tasks": {
"type": "nested",
"properties": "fields.properties.meta.properties.task"
},
"task": {
...
}
}
}
}
}
}
You can already use your task field as an array of task objects; you just cannot query the objects independently. If that is your goal (as I assume from your second example), I would set the "nested" data type directly in the mapping of the task field. In that case, yes, you'll need to reindex.
I can't imagine a use case where you would need the same array of objects duplicated in two fields, with one nested and the other not.
EDIT
Below, some considerations/suggestions based on the discussion in the comments:
A field can hold either one value or an array of values. In your case, the task field can hold either one task object or an array of task objects. You only need to set the "nested" datatype for task if you plan to query its objects independently (and, of course, if there is more than one).
I would suggest designing your documents so that duplicated information is avoided in the first place. Duplicated information makes your documents bigger and more complex to process, leading to greater storage requirements and slower queries.
If it's not possible to redesign your document mapping, you might check whether alias datatypes can help you avoid some repetition.
Likewise, you might check whether dynamic templates can help you avoid some repetition.
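Since the mapping is sent to Elasticsearch as fully expanded JSON, another way to avoid writing the same properties twice is to generate the request body in client code. This is a minimal sketch that assumes the mapping is created from a Python script. The names TASK_PROPERTIES and build_mapping are hypothetical, and the field names are copied from the question.

```python
import copy

# Defined once; the field names come from the mapping in the question.
TASK_PROPERTIES = {
    "field1": {"type": "keyword"},
    "field2": {"type": "keyword"},
}


def build_mapping():
    """Build the index mapping, reusing TASK_PROPERTIES for task and tasks."""
    meta_properties = {
        "task": {"properties": copy.deepcopy(TASK_PROPERTIES)},
        "tasks": {"type": "nested",
                  "properties": copy.deepcopy(TASK_PROPERTIES)},
    }
    return {
        "mappings": {
            "_doc": {
                "properties": {
                    "#timestamp": {"type": "date"},
                    "fields": {
                        "properties": {"meta": {"properties": meta_properties}}
                    },
                }
            }
        }
    }


mapping = build_mapping()
```

The deep copies keep the two sub-mappings independent, so a later change to one of them in code does not silently alter the other.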

Multiple mappings for one index in Elasticsearch 6

I have an index that contains documents of different types (not talking about _type here) and each document has a field document_type that states their type. Is it possible to define mappings for each type of document within this index?
Is it possible to define mappings for each type of document within this index?
No, not if you intend to use the same field name with different types. For instance, a field named id that is a string in one document type and an integer in another won't work.
Having different document_type basically indicates different domains. What you could do is to group information under each respective domain or type. For instance, an employee and project, both have an id and name, but different types in this example. Some call that nesting.
An example index mapping:
PUT example
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"employee": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
}
}
}
},
"project": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "keyword",
"ignore_above": 32
}
}
}
}
}
}
}
You can then index a document in which the same field names carry different types under each domain:
PUT example/doc/1
{
"employee": {
"id": 4711,
"name": "John Doe"
},
"project": {
"id": "Project X",
"name": "Firebrand"
}
}
Others would argue for storing employee and project in separate indices. Whether that fits depends on your scenario, but it is also a reasonable approach: it lets both domains evolve separately from each other.
Having separate employee and project indices gives you an advantage regarding maintenance. For querying, some would argue that you can group them with an alias. In the above example that doesn't make sense, since the field types are different: a search for the name over an analysed text field behaves differently from one over a keyword. Querying across them makes sense only if the fields have the same type.
No, if you want to use a single index, you would need to define a single mapping that combines the fields of each document type.
A better way might be to define separate indices on the same cluster for each document type. You can then create a single index alias that aliases to both of those indices if you want to be able to query across document types. Be sure that all fields that exist in both documents have the same data type in both mappings.
Having a single field name with more than one mapping type in the same index is not possible. Two options I can think of:
1. Separate the different doc types to separate indices.
2. Use different field names for different doc types, so that each name can have a different mapping. You can also use nesting, like: type_a.my_field and type_b.my_field, both in the same index.
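Before putting several indices behind one alias, it can help to check which shared field names disagree on type. The sketch below is one hedged way to do that on the "properties" part of two mappings. The function names are hypothetical, and multi-fields (the "fields" key) are deliberately ignored to keep it short.

```python
def flatten_types(properties, prefix=""):
    """Return {dotted.field.path: type} for a mapping's "properties" dict."""
    types = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "type" in spec:
            types[path] = spec["type"]
        # Object and nested fields carry their own sub-properties.
        types.update(flatten_types(spec.get("properties", {}), f"{path}."))
    return types


def conflicting_fields(props_a, props_b):
    """List fields present in both mappings whose types differ."""
    a, b = flatten_types(props_a), flatten_types(props_b)
    return sorted(f for f in a.keys() & b.keys() if a[f] != b[f])


# Field types taken from the employee/project example earlier in the thread.
employee = {"id": {"type": "integer"}, "name": {"type": "text"}}
project = {"id": {"type": "keyword"}, "name": {"type": "keyword"}}
print(conflicting_fields(employee, project))  # ['id', 'name']
```

An empty result suggests the two mappings can be queried together through an alias without type surprises; any listed field would need renaming or a common type first.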

elastic search 6 - use one or two type in one index?

Should I create two indices, or something else?
I have one entity goods and one entity shop. Should I create two indices, or two types in one index, in Elasticsearch 6?
I tried a mapping with two types, but it throws an exception.
What can I do?
In Elasticsearch 6 you cannot create more than one doc type per index. Earlier, an index company could have doc types such as employee, infra, building, etc., but now this throws an error.
In future versions doc types will be removed completely, so you will only have to deal with indices.
An index in Elasticsearch is like a table in a relational database: every document you store is a row, and the fields of that document are the columns.
Without seeing the data and knowing what you want to accomplish it is hard to suggest how you should plan the schema, but this information can help you decide.
You can use one of these two options:
1) Index per document type
2) Custom type field
For option 2:
PUT twitter
{
"mappings": {
"goods": {
"properties": {
"field1": { "type": "text" },
"field2": { "type": "keyword" }
}
},
"shop": {
"properties": {
"field1": { "type": "text" },
"field2": { "type": "date" }
}
}
}
}
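As a hedged illustration of how a custom type field might be used from client code: the sketch below assumes an index with a single mapping type whose documents carry a keyword field named type that records whether each one is a goods or a shop document. That field name and the helpers with_type and search_by_type are assumptions, not something the answer specifies.

```python
def with_type(doc_type, source):
    """Return a copy of the document tagged with a custom "type" field."""
    tagged = dict(source)
    tagged["type"] = doc_type
    return tagged


def search_by_type(doc_type, query):
    """Wrap a query so that it only matches documents of one custom type."""
    return {
        "query": {
            "bool": {
                "must": [query],
                # Filter context: restricts results without affecting scores.
                "filter": [{"term": {"type": doc_type}}],
            }
        }
    }


# Invented sample document and query, for illustration only.
goods_doc = with_type("goods", {"field1": "red bicycle"})
body = search_by_type("goods", {"match": {"field1": "bicycle"}})
```

With this layout, fields whose types differ between goods and shop would still need distinct names, since a field name can have only one type within an index.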

Elasticsearch: better to have more values or more fields?

Suppose you have an index with documents describing vehicles.
The index needs to deal with two different types of vehicle: motorcycle and car.
Which of the following mappings is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and leaner.
The drawback is that when I want to query only for "car", I need to add that condition to the query.
With the first mapping I can access the relevant field directly, without adding overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is more likely to be faster. One reason, as you already know, is that there is one less filter to apply; the other is the increased selectivity of the queries (e.g. fewer documents for a given value of vehicle.car.model than for vehicle.model).
Another option would be to create two distinct indexes, car and motorcycle, possibly with the same index template.
In Elasticsearch, a query is processed by a single thread per shard. That means that if you split your index in two and query both in a single request, the two parts are searched in parallel.
So when you need to query only cars or only motorcycles, it is faster simply because the indexes are smaller. And when you query both cars and motorcycles, it could also be faster because more threads are used.
EDIT: one drawback of the latter option you should know about: the internal Lucene term dictionary is duplicated, so if the values in cars and motorcycles are largely identical, the list of indexed terms doubles.
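To make the difference between the two mappings in the question concrete, here is a minimal sketch of the query body each one would need when searching for a car model. The function names are hypothetical, and the bodies are plain dictionaries that follow the nested query form of the query DSL.

```python
def query_separate_fields(kind, model):
    """First mapping: the vehicle kind is part of the field path."""
    return {
        "query": {
            "nested": {
                "path": "vehicle",
                "query": {"match": {f"vehicle.{kind}.model": model}},
            }
        }
    }


def query_shared_fields(kind, model):
    """Second mapping: the kind is an extra condition in the nested query."""
    return {
        "query": {
            "nested": {
                "path": "vehicle",
                "query": {
                    "bool": {
                        "must": [{"match": {"vehicle.model": model}}],
                        "filter": [{"match": {"vehicle.vehicle_type": kind}}],
                    }
                },
            }
        }
    }


first = query_separate_fields("car", "roadster")
second = query_shared_fields("car", "roadster")
```

The second body carries the additional filter clause that the answer refers to, while the first encodes the same restriction in the field name.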
