How to index questions and answers in Elasticsearch - elasticsearch

I am working on a project to index a website's questions and answers in Elasticsearch (version 6) for search.
My first idea was to create two indexes, one for questions and one for answers, as shown below.
questions mapping:
{"mappings": {
"question": {
"properties": {
"title":{
"type":"text"
},
"question": {
"type": "text"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
answers mapping:
{"mappings": {
"answer": {
"properties": {
"answer":{
"type":"text"
},
"answerId": {
"type": "keyword"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
I used a multi_match query together with terms and top_hits aggregations to search the indexed Q&As (referred question). I used this approach to remove duplicates from the search results, because both a question and its answers can match the same search, and I only want one entry per question in the results. The problem I am facing is paginating the results: as far as I can tell, Elasticsearch can only paginate hits, not aggregations.
Then I thought of saving both the question and its answers in one document, with the answers in a JSON array. The problem with this approach is that there is no clean way to add, remove, or update a specific answer in a given question document. The only way I found was a Groovy script (referred question), which is deprecated in Elasticsearch v6 AFAIK.
Is there a better, cleaner way to design this?
Thanks.

Parent-Child Relationship
Use the parent-child relationship. It is similar to the nested model, and allows association of one entity with another. You can associate one document type with another, in a one-to-many relationship.
More information here: https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html
Child documents can be added, changed, or deleted without affecting the parent or the other children. You can paginate over the parent documents using the Scroll API.
Child documents can be retrieved using a has_parent query.
The trade-off: you do not have to take care of duplicates and pagination problems, but parent-child queries can be 5 to 10 times slower than the equivalent nested query.
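Your mapping can look like the following. Note that an index created in Elasticsearch 6 can hold only one mapping type and the _parent field has been replaced by the join field, so questions and answers share one type here and are linked through a join field (named qa_relation in this example):
PUT /my-index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "text" },
        "question": { "type": "text" },
        "answer": { "type": "text" },
        "questionId": { "type": "keyword" },
        "answerId": { "type": "keyword" },
        "qa_relation": {
          "type": "join",
          "relations": { "question": "answer" }
        }
      }
    }
  }
}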
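To illustrate the pagination point, here is a minimal sketch of a search body that returns one hit per question, whether the match is in the question itself or in one of its answers. It is written as a plain Python dict so it can be sent with any HTTP client. The function name and the page/page_size parameters are made up for this example; the field names and the "answer" child type come from the mappings above.

```python
import json


def build_question_search(text, page=0, page_size=10):
    """Return a search body that yields one hit per question document.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    return {
        # Hits are parent (question) documents, so from/size paginates questions.
        "from": page * page_size,
        "size": page_size,
        "query": {
            "bool": {
                "should": [
                    # Match on the question's own fields.
                    {"multi_match": {"query": text, "fields": ["title", "question"]}},
                    # Match on any answer, but return the parent question.
                    {
                        "has_child": {
                            "type": "answer",
                            "score_mode": "max",
                            "query": {"match": {"answer": text}},
                            # Also return up to 3 matching answers per question.
                            "inner_hits": {"size": 3},
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_question_search("pagination", page=2), indent=2))
```

Because every hit is a question, duplicates disappear without a terms/top_hits aggregation, and ordinary from/size paging applies.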

Related

Partial update into large document

I'm facing a performance problem. My application is a chat application.
I designed the index mapping with nested objects, as shown below.
{
"conversation_id-v1": {
"mappings": {
"stream": {
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
"languages": ["en", "ko", "ja"]
}
}
},
"comments": {
"type": "nested",
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
"languages": ["en", "ko", "ja"]
}
}
}
}
}
}
}
}
}
}
** The actual mapping has many more fields.
A document has around 4,000 nested objects. When I upsert data into a document, CPU usage peaks at 100%, and disk write I/O peaks as well. The input rate is around 1,000/s.
How can I tune this to improve performance?
Hardware
3x 2vCPUs 13GB on GCP
4000 nested fields sounds like a lot - if I were you, I would look long and hard at your mapping design to be very certain you actually need that many nested fields.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields ("I'd like to search comments in nested and come with their parent stream for display.") makes me think that you may be mixing two concerns here.
ElasticSearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment along with whatever else you want from some other source that is better than ElasticSearch at joining data.
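The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the stream_id field name and both helper functions are hypothetical, and the response shape is the standard hits.hits list that a search returns.

```python
def to_comment_doc(stream_id, comment):
    """Turn one comment into its own small document with a parent reference.

    `stream_id` is a hypothetical field name for the reference to the stream.
    """
    return {
        "id": comment["id"],
        "message": comment["message"],
        "stream_id": stream_id,
    }


def stream_ids_from_hits(response):
    """Collect the distinct parent stream ids from a search response, in hit order."""
    seen = set()
    ordered = []
    for hit in response["hits"]["hits"]:
        stream_id = hit["_source"]["stream_id"]
        if stream_id not in seen:
            seen.add(stream_id)
            ordered.append(stream_id)
    return ordered
```

With this shape, adding a comment indexes one small document instead of reindexing a stream together with its thousands of hidden nested documents, and the ids returned by stream_ids_from_hits can be looked up in the primary data store for display.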

Multiple Paths in Nested Queries

I'm cross-posting this from the elasticsearch forums (https://discuss.elastic.co/t/multiple-paths-in-nested-query/96851/1)
Below is an example, but first I’ll tell you about my use case, because I’m not sure if this is a good approach. I’m trying to automatically index a large collection of typed data. What this means is I’m trying to generate mappings and queries on those mappings all automatically based on information about my data. A lot of my data is relational, and I’m interested in being able to search across the relations, thus I’m also interested in using Nested data types.
However, the issue is that many of these types have on the order of 10 relations, and I’ve got a feeling it’s not a good idea to pass 10 identical copies of a nested query to elasticsearch just to query 10 different nested paths the same way. Thus, I’m wondering if it’s possible to instead pass multiple paths into a single query? Better yet, if it’s possible to search over all fields in the current document and in all its nested documents and their fields in a single query. I’m aware of object fields, and they’re not a good fit because I want to retrieve some data of matched nested documents.
In this example, I create an index with multiple nested types and some of its own types, upload a document, and attempt to query the document and all its nested documents, but fail. Is there some way to do this without duplicating the query for each nested document, or is that actually a performant way to do this? Thanks
PUT /my_index
{
"mappings": {
"type1" : {
"properties" : {
"obj1" : {
"type" : "nested",
"properties": {
"name": {
"type":"text"
},
"number": {
"type":"text"
}
}
},
"obj2" : {
"type" : "nested",
"properties": {
"color": {
"type":"text"
},
"food": {
"type":"text"
}
}
},
"lul":{
"type": "text"
},
"pucci":{
"type": "text"
}
}
}
}
}
PUT /my_index/type1/1
{
"obj1": [
{ "name":"liar", "number":"deer dog"},
{ "name":"one two three", "number":"you can call on me"},
{ "name":"ricky gervais", "number":"user 123"}
],
"obj2": [
{ "color":"red green blue", "food":"meatball and spaghetti"},
{ "color":"orange", "food":"pineapple, fish, goat"},
{ "color":"none", "food":"none"}
],
"lul": "lul its me user123",
"field": "one dog"
}
POST /my_index/_search
{
"query": {
"nested": {
"path": ["obj1", "obj2"],
"query": {
"query_string": {
"query": "ricky",
"all_fields": true
}
}
}
}
}
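As far as I know, the nested query accepts a single path string, so the request above is rejected. Since the queries are being generated automatically anyway, one option is to generate one nested clause per path and combine them in a bool/should. The sketch below does that with plain Python dicts. The function name is hypothetical, and restricting query_string to the "<path>.*" fields is an assumption about what each clause should search.

```python
def build_multi_path_nested_query(text, paths):
    """Build one nested clause per path and OR them together.

    Hypothetical helper: each clause searches only the fields under its path.
    """
    clauses = [
        {
            "nested": {
                "path": path,
                "query": {
                    "query_string": {"query": text, "fields": [path + ".*"]}
                },
                # Return the nested documents that actually matched.
                "inner_hits": {},
            }
        }
        for path in paths
    ]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}
```

For the example index this would be called as build_multi_path_nested_query("ricky", ["obj1", "obj2"]). Top-level fields such as lul are not covered by the nested clauses and would need an additional non-nested clause in the same should list.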

Using both term and match query on same text field?

I have an index with a text field.
"state": {
"type": "text"
}
Now suppose there are two documents.
"state": "vail"
and
"state": "eagle vail"
For one of my requirements,
- I need to do a term level query, such that if I type "vail", the search results should only return states with "vail" and not "eagle vail".
But another requirement for different search on the same index,
- I need to do a match query for full text search, such that if I type "vail", "eagle vail" should display as well.
So my question is: how can I do both a term-level query and a full-text search on this field? For the term-level query, I would have to map it as "keyword" so that it isn't analyzed.
You can use the "multi-field" feature to achieve this. Here is a mapping:
{
"mappings": {
"my_type": {
"properties": {
"state": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
In this case state will act as text field (tokenized) whereas state.raw will be keyword (single-token). When indexing a document you should only set state. state.raw will be created automatically.
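With that mapping, the two requirements map onto two different query bodies, sketched here as plain Python dicts. The function names are made up for this example; state and state.raw come from the mapping above.

```python
def exact_state_query(value):
    """Term-level query against the keyword sub-field: whole-value match only."""
    return {"query": {"term": {"state.raw": value}}}


def full_text_state_query(value):
    """Full-text query against the analyzed text field: matches individual tokens."""
    return {"query": {"match": {"state": value}}}
```

exact_state_query("vail") should match only the document whose state is exactly "vail", while full_text_state_query("vail") should also match "eagle vail". Note that the keyword sub-field is not lowercased, so the term query is case-sensitive.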

Elasticsearch - aggregating multi level hierarchy

I am facing a problem with providing aggregated search result of documents with multi level hierarchy. Simplified documents structure looks like this:
Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)
Every level of document is mapped to its parent by the attribute "parentDocumentId".
I have prepared simple query, which works just fine for hierarchy with just 2 levels:
POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
"query": {
"multi_match" : {
"query": "hunter",
"fields": [ "title", "text", "labels" ]
}
},
"aggregations": {
"my_agg": {
"terms": {
"field": "parentDocumentId"
}
}
}
}
This query searches the text of the pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents (aggregated by parentDocumentId). However, these buckets represent only the "Magazine issues" which contain these pages.
Response:
{
"took": 54,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 44,
"max_score": 0,
"hits": []
},
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 5,
"doc_count": 43
},
{
"key": 0,
"doc_count": 1
}
]
}
}
}
What I need is to be able to aggregate search results at the highest possible level; in this particular case, that means aggregating at the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should definitely be done in Elasticsearch (for performance and other reasons).
Does anybody have experience with similar aggregation? Is elasticsearch aggregations the right approach to use?
Every idea is welcome.
Thanks
Peter
Update:
Our mapping looks like this:
{
"my_index": {
"mappings": {
"document": {
"properties": {
"dateIssued": {
"type": "date",
"format": "dateOptionalTime"
},
"documentId": {
"type": "long"
},
"filter": {
"properties": {
"geo_bounding_box": {
"properties": {
"issuedLocation": {
"properties": {
"bottom_right": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
},
"top_left": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
}
}
}
}
}
}
},
"issuedLocation": {
"type": "geo_point"
},
"labels": {
"type": "string"
},
"locationLinks": {
"type": "geo_point"
},
"parentDocumentId": {
"type": "long"
},
"query": {
"properties": {
"match_all": {
"type": "object"
}
}
},
"storedLocation": {
"type": "geo_point"
},
"text": {
"type": "string"
},
"title": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press. That means that sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the page level.
To distinguish the type of document there is an attribute "type".
When indexing top levels (these contain especially book meta-data) we leave the "text" attribute empty, always specifying the parent of document using the parentDocumentId. The top level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only text attribute and parentDocumentId for indexed document.
The link used is very similar to classic one-to-many mapping (magazine has many years, has many issues, has many pages).
You could also say, that we have flattened the nested documents in elasticsearch, but the reason for this is, that there are multiple document types, that can have different level of their hierarchy.
You need to rethink your data modelling. In essence, you need a join over your data and moreover the join needs to be over an arbitrarily deep hierarchy. That is a problem even in relational databases let alone in a fulltext search engine like Elasticsearch.
Elasticsearch does support a couple of joins. You could use nested documents - a single document with all the subdocs nested. That's clearly not ideal in your case.
You could use the parent-child relationship feature which lets you index your (sub-)docs separately always referring to their parent. Underneath, that feature uses Lucene's blockjoin. However, to aggregate over a hierarchy, you would have to explicitly specify the join - listing all the intermediate steps. You want to always aggregate by the top-most available doc but that could be a different level each time (once a magazine, another time a magazine collection or perhaps a publisher).
I would consider indexing each doc with a field pointing to the top-most document. Then you can easily aggregate by that field. It would mean precomputing a part of the complex aggregation you want to do but it would result in fast aggregations and updates also wouldn't be very painful. It all depends on the source of your data, how you imagine that it will change, what updates and other queries you'll need to do.
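A minimal sketch of that precomputation, assuming the parent links are available as a dict at indexing time: walk the parentDocumentId chain up to the top-most document (the one whose parent is 0), store the result in an extra field, and aggregate on it. The field name rootDocumentId and both function names are hypothetical.

```python
def find_root_id(doc_id, parent_of):
    """Follow parentDocumentId links until reaching a top-level document.

    `parent_of` maps documentId -> parentDocumentId; top-level documents map to 0.
    """
    seen = set()
    current = doc_id
    while parent_of.get(current, 0) != 0:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at document {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def root_aggregation(text):
    """Search body that buckets matches by the precomputed top-most document."""
    return {
        "size": 0,  # only the buckets are needed, not the hits
        "query": {
            "multi_match": {"query": text, "fields": ["title", "text", "labels"]}
        },
        "aggs": {"by_root": {"terms": {"field": "rootDocumentId"}}},
    }
```

For a chain title(1) -> year(2) -> issue(3) -> page(4), find_root_id(4, {1: 0, 2: 1, 3: 2, 4: 3}) returns 1, so every page lands in the bucket of its magazine title regardless of how deep the hierarchy is.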
This blog post could help to guide you a bit too: https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Is it possible to sort nested documents in ElasticSearch?

Let's say I have the following mapping:
"site": {
"properties": {
"title": { "type": "string" },
"description": { "type": "string" },
"category": { "type": "string" },
"tags": { "type": "array" },
"point": { "type": "geo_point" },
"localities": {
"type": "nested",
"properties": {
"title": { "type": "string" },
"description": { "type": "string" },
"point": { "type": "geo_point" }
}
}
}
}
I'm then doing a "_geo_distance" sort on the parent document and am able to sort the documents on "site.point". However, I would also like the nested localities to be sorted by "_geo_distance" inside the parent document.
Is this possible? If so, how?
Unfortunately, no (at least not yet).
A query in ElasticSearch just identifies which documents match the query, and how well they match.
To understand what nested documents are useful for, consider this example:
{
"title": "My post",
"body": "Text in my body...",
"followers": [
{
"name": "Joe",
"status": "active"
},
{
"name": "Mary",
"status": "pending"
}
]
}
The above JSON, once indexed in ES, is functionally equivalent to the following. Note how the followers field has been flattened:
{
"title": "My post",
"body": "Text in my body...",
"followers.name": ["Joe","Mary"],
"followers.status": ["active","pending"]
}
A search for: followers with status == active and name == Mary would match this document... incorrectly.
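The false match can be reproduced in a few lines of plain Python. This is only a simulation of the flattening described above, not Elasticsearch code, and the helper names are made up for the illustration.

```python
def flatten_object_array(field, items):
    """Mimic how an object (non-nested) array is indexed: one value list per sub-field."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def flat_match(flat, conditions):
    """True if each condition value appears somewhere in its flattened list."""
    return all(value in flat.get(key, []) for key, value in conditions.items())


def nested_match(items, conditions):
    """True only if one single item satisfies all conditions together."""
    return any(
        all(item.get(key) == value for key, value in conditions.items())
        for item in items
    )


followers = [
    {"name": "Joe", "status": "active"},
    {"name": "Mary", "status": "pending"},
]
flat = flatten_object_array("followers", followers)
```

Here flat_match(flat, {"followers.name": "Mary", "followers.status": "active"}) is True, because "Mary" and "active" both occur somewhere in the flattened lists, while nested_match(followers, {"name": "Mary", "status": "active"}) is False, because no single follower has both values.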
Nested fields allow us to work around this limitation. If the followers field is declared to be of type nested instead of type object then its contents are created as a separate (invisible) sub-document internally. That means that we can use a nested query or nested filter to query these nested documents as individual docs.
However, the output from the nested query/filter clauses only tells us if the main doc matches, and how well it matches. It doesn't even tell us which of the nested docs matched. To figure that out, we'd have to write code in our application to check each of the nested docs against our search criteria.
There are a few open issues requesting the addition of these features, but it is not an easy problem to solve.
The only way to achieve what you want is to index your sub-docs as separate documents, and to query and sort them independently. It may be useful to establish a parent-child relationship between the main doc and these separate sub-docs (see parent-type mapping, the Parent & Child section of the index api docs, and the top-children and has-child queries).
Also, an ES user has mailed the list about a new has_parent filter that they are currently working on in a fork. However, this is not available in the main ES repo yet.
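As a rough sketch of the "separate documents" approach: if each locality is indexed as its own document carrying a reference to its site, the localities of one site can be fetched and sorted by distance in a second request. The body is a plain Python dict. The siteId field (assumed to be indexed without analysis) and the function name are hypothetical, while point is the geo_point field from the mapping in the question.

```python
def build_locality_sort(site_id, lat, lon):
    """Search body: localities of one site, nearest to (lat, lon) first.

    Assumes each locality document has a hypothetical `siteId` reference field.
    """
    return {
        "query": {"term": {"siteId": site_id}},
        "sort": [
            {
                "_geo_distance": {
                    "point": {"lat": lat, "lon": lon},
                    "order": "asc",
                    "unit": "km",
                }
            }
        ],
    }
```

The application would first run the usual site search sorted on site.point, then issue this query for each site it displays.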

Resources