Hierarchical faceting with Lucene/Solr/Elasticsearch where document can have multiple parents - elasticsearch

I'm currently evaluating whether to use elasticsearch or solr in a project and moving through the cases that need to be implemented. I found one case on which I couldn't find any documentation which felt a bit strange to me since the case seemed to be quite common to me. The categories are user supplied so I don't know them in advance. Consider the following part of a taxonomy with documents that can have multiple categories:
Root (3)
Books (2)
Sci-fi (1)
DocumentA
Fantasy (2)
DocumentA
DocumentC
Movies (1)
Action (1)
DocumentB
Games (1)
Adventure
DocumentB
In this case DocumentB could be an entry for e.g. Indiana Jones. Normal term hierarchies can be implemented using the path hierarchy tokenizer in solr/elastic, so DocumentC would have 'Root/Books/Fantasy' as category with a path split on '/'.
DocumentB however would need to have two paths ('Root/Movies/Action' and 'Root/Games/Adventure'). I thought about dynamically adding one category_n field per path for the document in elastic with the path hierarchy tokenizer and then do the category search on all the category_* fields, but i don't know if that would be the right approach, especially considering that the document count for the facets is not simple because the count of a parent node is not the sum of its children (documents can be in multiple child categories and should not be counted more than once).
What would be a good way to implement this in solr/elastic?
Cheers

I ended up using ES and had a category field in which I put every path to the node. So 'Root/Movies/Action' and 'Root/Games/Adventure'. Then I used a path hierarchy tokenizer splitting on / with this field.
ES supports putting multiple paths in that field and searching them. I then used an aggregation with bucketing on the categories, that yielded exactly what I wanted, documents are not counted multiple times if the occure more than one time in a branch.

Related

Elastic Search and Search Ranking Models

I am new to Elastic Search. I would like to know if the following steps are how typically people use ES to build a search engine.
Use Elastic Search to get a list of qualified documents/results based on a user's input.
Build and use a search ranking model to sort this list.
Use this sorted list as the output of the search engine to the user.
I would probably add a few steps
Think about your information model.
What kinds of documents are you indexing?
What are the important fields and what field types are they?
What fields should be shown in the search result?
All this becomes part of your mapping
Index documents
Are the underlying data changing or can you index it just once?
How are you detecting new docuemtns/deletes/updates?
This will be included in your connetors, that can be set up in multiple ways, for example using the Documents API
A bit of trial and error to sort out your ranking model
Depending on your use case, the default ranking may be enough.
have a look at the Search API to try out different ranking.
Use the search result list to present the results to the end user

How do I query nodes that are missing a child of a specific type?

I'm new to graphql, and trying to understand how I might fill this use case.
I have thousands of nodes of a specific type/schema.
Some of these nodes have children, some of them don't.
I'd like to query all the nodes, and return only the ones that don't have children.
This might get more specific in the future, where I'd like to query only nodes that don't have children of a specific type.
Is that even possible?
I've seen plenty of query examples that show how to select children nodes, or nested nodes + fields, or nodes with specific values. It's an easy thing with SQL, I'm just having trouble understanding how it's done with graphql.
Thoughts?
As Daniel Rearden said, there is no built in way in GraphQL to filter or sort the results of a query. We have a few filters in our Gentics Mesh GraphQL API, but it is currently not possible to create a filter involving another list of items (children in your case).
I've added your case to the issue in Github. https://github.com/gentics/mesh/issues/27

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string, to match parent entries. Parent-child relations are just described through fields containing each entry's ancestors ids.
The question is the following: If both search and parent search string are provided, is there a way to know before executing the queries, which query should be executed first, in order to provide results faster and to be more performant?
For example, it could be that a specific parent search results in only 2 docs/parent entries, and then we can fetch all children matching the search string. In that case we should execute firstly the parent query and then the entry query.
One option would be to get the count of both queries and then execute first the one with the smallest count, but isn't this solution worse, since the queries are going to be executed twice? Once for the count and once for the actual query.
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and children entries have the exact same structure and reside in the same index. We cannot have different indices since the pare-child relation can be 10 times nested, so an entry can be both a parent and a child. An entry looks more or less like:
{
id: "e32452365321",
name: "name",
type: "type",
ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you, to upgrade your Elasticsearch version, if possible. There happened a lot since 1.7 and to be honest, I can't tell if all of what's written in the following article is valid for such an old version (probably it isn't).
But to your actual question: Hopefully I am understanding you correctly, but you try to estimate how costly a query for Elasticsearch is? Well, you don't have to. If you provide all 'queries' in one nested query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating score does take time. So if sorting is not based on the elasticsearch _score, you want to use boolean filter queries. This would also apply, if you want to sort only by _score of parent matches, then you could put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self referencial Parent-Child relations are unfortunately not supported by ElasticSearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-joins.
So yes, in general, you want to send the second query with the least possible amount of ids/terms. While getting counts for both queries is not as bad as you might think, because the results are most likely still cached, does it actually help? Because if you're going from child to parent, you would have to count the ancestors (field values), and not the actual document count.
I would argue, that the most expensive operation is very often fetching result source from disk. So whichever way you go, you probably should only fetch what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or, fetch only the ancestors field of child matches, and use an id filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience in comparing speed of those approaches. My guess would be, that an id filter might be faster in general. But that's just a guess...

parent-child documents relation retrieval

could you help me with little problem regarding parent-child documents relation?
Considering JSON, I have objects, each of them contains an array of sub-objects. Sub-objects contain some text fields.
I need to maintain full-text-search on these objects and construct snippets. I need highlighting for building snippets.
If I use nested objects, highlighting does not deal with them.
Therefore, I use Parent-Child relationships.
Now I need to retrieve Parent-documents, which children match the query_string. Furthermore, I need to get highlighted fields of matched children and associate each one(each child) with corresponding parent to construct snippets in my application.
Is it possible to accomplish my goal in one query?
I think that you should consider using the children aggregation. With that you can retrieve children items within their parents. It's aggregation so you are not able to get the whole document (just id) but with that you retrieve the relationship... Then with another query you can get document details quickly.
Link here : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html
And more details : https://www.elastic.co/guide/en/elasticsearch/guide/current/children-agg.html

How to implement Tag search?

I've designed a news hub system which read Rss links and stores whole news in the database. Now I want to implement a search system using tags. Each news has it's own tags. There are lots of algorithms to implement this but I don't know what is the most common to have the best performance. Currently I'm using Elastic search database and I use multiple keyword search. Which one of these are the best?
1- to store tags in a list or a string with a separator and search among them?
2- work like a relational system and have a table of tags, and a table of news tags to have a record for each news tag. and 5 records for 5 tags of one news
3- another algorithm which I don't know
Seems like you want something like the inverted index
This is an index, that for each term (hashtag in your case) holds a list of document ids which contain this hashtag.
For example, if you have 3 documents: d1,d2,d3 with the hash tags:
d1: #tag1, #tag2
d2: #tag3
d3: tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
It is fairly easy using the inverted index to find all documents that contain a certain term (hashtag in your case), by simply going over the list the is attached to this term.
This datastructure is also very efficient for union (or queries) and intersection (and queries).
This DS is very popular for information retrieval for full text search and also is often used in semi-structured search.
For more information, you can read about Information Retrieval in general. Mannings Introduction to Information Retrieval represents this Data structure in the book's first chapter.
ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could go with storing them in the news article or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or a comma separated string. Go with the list as that is more idiomatic and ElasticSearch can handle json objects (you would actually analyze the string and turn it into a list of token anyways).

Resources