Is it possible to make elasticsearch highlights linkable? - elasticsearch

I'm successfully using ES for indexing documents and higlighting searched text. But now I have a new requirement - make all yellow highlights linkable, i.e user have to be able to dive into the page with selected occurence.
I haven't implemented page preview of document yet but I'm sure that there exists some software which gets page number or bytes offset and returns docx or pdf page as image. So, I want elastic to return index of occurence (most likely, byte offset from the beginning). After that I probably may use indexToImage soft for showing occurence page to user. Even if such software does not exist I may open RandomAccessFile and read occurence page and somehow show it to user. But anyway I need occurence index. is it possible to get it from elastic?
My search request looks like:
http://localhost:9200/mongofilesindex/_search?pretty&source={
"_source": ["filename",
"metadata"],
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*test*"
}
}
}
},
"highlight": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fields": {
"content": {
"fragment_size": 200,
"number_of_fragments": 10
}
}
}
}&size=10&from=0
Of course, I may use ES just for extracting matching documents and after that manually apply KMP in input stream which works in linear time. But I want something better than linear because I know that suffix automatas and other complex data structures may return occurences in O(search_string_len+occurences_count) which is much more better than O(doc_len).
I'm sure that elastic uses such cool data structures and probably I'm missing some API for getting occurences indices.

Related

Position aware search results in Elasticsearch automcompletion

I want to implement address autocompletion using Elasticsearch.
The current approach I am investigating is based on search_as_you_type field type.
Consider this two addresses:
3543JN Carl Zellerhof 8 Utrecht (3543JN is postcode)
1234JN The Street 3543 Utrecht
It is important to prioritize some address parts over others, for instance, postcode should have more weight than number, eg when a user types 3543 - the first address should be first in search results.
I see two solutions here:
Combine address into one string and give weight based on position within the combined string
Do search on multiple fields (then weight can be adjusted per field, but it seems more complex to me, how to ensure the same address part is not matched several times?)
I am leaning more towards one-string solution, but this implementation gives the same weight for the 3543 search query.
Please advise how to implement this.
(It is also desirable to allow some fuzziness)
UPD:
seems adding postcode field to the multi_match fields gives me what I want. Are there any disadvantages of this approach?
the index
{
"mappings": {
"properties": {
"search": {
"type": "search_as_you_type"
}
}
}
}
the search query
{
"query": {
"multi_match": {
"query": "3543",
"type": "bool_prefix",
"fields": [
"search",
"search._2gram",
"search._3gram"
]
}
}
}

Most performant way to update a single document in Elasticsearch via an alias

I have an Elasticsearch setup with an alias that points to many indices. I need to update a single document, but I don't know which index it resides in.
There are two ways I can accomplish this as far as I can see:
_update_by_query:
POST my-alias/_update_by_query
{
"query": {
"terms": {
"_id": ["my-id-to-update"]
}
},
"script": {
"source": "ctx._source['Field'] = 'new value'"
}
}
read (which returns the specific index) then write:
GET my-alias/_search
{
"query": {
"terms": {
"_id": ["my-id-to-update"]
}
}
}
POST my-index-returned-from-the-get/_update/my-id-to-update
{
"doc": {
"Field": "new value"
}
}
Which method is more performant?
Which method is preferred?
Is there a better way than either of these two?
The performance of both approach will be the same with one difference that your first approach only need to send one request compare to second one with two request, so it would be better to use first approach as you will reduce the API calls by half.
Also in my opinion the first approach is much cleaner and fits more in concept of aliases of Elasticsearch because you are encapsulating exact index name from your application, as application doesn't need to have any clue about exact index-name your documents are in.
An important note about updating a document in Elasticsearch is documents in Elasticsearch don't get updated, it means the document will be flagged as deleted and new document will be created (this is due to Lucene implementation), then during process of Lucene segment merging the document will be actually deleted.
you can find a good blog post about segment merging here.

Return position and highlighting of search queries in Elasticsearch

I am using the official Elasticsearch-PHP client installed on a personal Debian server, and what I am trying to do involves indexing, searching and highlighting individual documents. i.e. each search result will only return one document - which will then be highlighted for "simple query string" searches. I am also using FVH (fast vector highlighting).
My question is similar to this one Position as result, instead of highlighting and the test code is basically the same so I won't repeat that here. However in my case I need both position and highlighting. I followed the link to the documentation about term vectors, but just like the other OP, my searches are not exact words per se. In some cases they are phrases. How would I approach this?
My use case is to search only one document (for each query), and present a summary of results with links which the user can click to go to the specific place in the document where that result came from. If I have the index / position I can simply use that against the full source of the document. I have checked the documentation to no avail.
You could try to install a specific plugin developed by wikimedia foundation called Experimental Highlighter -github here
You can install for elasticsearch 7.5 in this way - for other elasticsearch versions please refer to the github project page:
./bin/elasticsearch-plugin install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.5.1
And restart elasticsearch.
Inasmuch you need to retrieve also the positions - if for your use case the offsets can replace the positions please go on to the next paragraph - you should declare your field with termvector with the index option "with_position_offset_payloads" - doc here
PUT /my-index-000001
{ "mappings": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "fulltext_analyzer"
}
}
}
}
For other cases that don't need to retrieve also the position, it is faster and uses much less space to use the index option "offsets" - elastic doc here, plugin doc here:
PUT /my-index-000001
{ "mappings": {
"properties": {
"text": {
"type": "text",
"index_options": "offsets",
"analyzer" : "fulltext_analyzer"
}
}
}
}
Then you could query with the experimental highlighter and return only offset of the highlighter part:
{
"query": {
"match": {
"text": "hello world"
}
},
"highlight": {
"order": "score",
"fields": {
"text": {
"number_of_fragments": 10,
"fragment_size": 15,
"type": "experimental",
"options": {"return_offset": true}
}
}
}
}
In this way no text is returned from your query but only the start offset and the end offset - numbers that represent position. To retrieve your highlighted content you need to enter inside ['hits']['hits'][0]['_source']['text'] -text is your field name - and extract text from the field using your start offset point and the end offset point. You need to ensure to use the correct string encoding - UTF-8 - otherwise the offsets don't match text. According to the doc:
The return_offsets option changes the results from a highlighted
string to the offsets in the highlighted that would have been
highlighted. This is useful if you need to do client side sanity
checking on the highlighting. Instead of a marked up snippet you'll
get a result like 0:0-5,18-22:22. The outer numbers are the start and
end offset of the snippet. The pairs of numbers separated by the ,s
are the hits. The number before the - is the start offset and the
number after the - is the end offset. Multi-valued fields have a
single character worth of offset between them.
Let me know if that plugin could help!

How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts
users have geolocation attributes (lat/long)
posts simply have a body of text.
I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.
This is my current solution:
Index the posts and have a geolocation attribute in each
document. When a user changes location, run an elasticsearch batch
update on all that user's posts, and modify the geolocation attribute
on those documents.
Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.
Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?
Updating 2000 posts per minute is not a big deal either with the update by query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them in short intervals (e.g. 1 min), it might not be that scalable, indeed. Say if it takes 500 milliseconds to update all posts from a user, you'd start to lag behind at around 120 users.
Clearly, since the users' posts need to "follow" the user and don't keep the location the user had when she posted them, I would first query the users around a given location and get their IDs, and then run a second query on posts filtered by those user IDs and the matching body text.
It is perfectly OK to keep both of your indices simple and only update the location in a single user's document every minute. Those two queries I'm suggesting should be quite fast and you should not be worried of running them. People are often worried when they need to run two or more queries in order to find their results. Sometimes, trying to tie the documents to tight together is not the solution and simply running two queries over two indices is the key and works perfectly well.
The query to retrieve users would look similar to the first one below, where you only retrieve the _id property of the user. I'm making the assumption that your user documents have the id of the user as their ES doc _id, so you do not have to retrieve the _source at all (i.e. "_source": false) which is even faster and you can simply return the _id with response filtering:
POST /users/_search?filter_path=hits.hits._id
{
"size": 1000,
"_source": false,
"query": {
"bool": {
"filter": [
{
"geo_distance": {
"distance": "100m",
"location": {
"lat": 32.5362723,
"lon": -80.3654783
}
}
}
]
}
}
}
You'll get all the _id values of the users who are currently 100 meters around the desired geographic location. Then the next query consists of filtering the posts by those ids while matching their body text.
POST /posts/_search
{
"size": 50,
"query": {
"bool": {
"must": {
"match": {
"body": "some text"
}
},
"filter": [
{
"terms": {
"user_id": [ 1, 2, 3, 4 ]
}
}
]
}
}
}

ElasticSearch query referencing document

I read some time ago that there was a way to build a query that references another document in your index. At the time, this wasn't helpful to me, but I now have very large GIS areas that I need to query against and sending this data to ElasticSearch in the query body every time seems wasteful.
While my specific use-case relates to GIS, geo_shape, etc, it's a general issue that can be applied to other types of queries.
I have a document type areas that holds all of the predefined search areas (these are things like suburbs, states, etc) and entities that hold all of my search data, including a geo_point type field with lat/lon.
I need to be able to construct a geo_shape query for entities documents that references the mpoly attribute (which is a GeoShape type) on an areas document for it's shape coordinates.
Unfortunately, neither Google nor reading the ElasticSearch docs have proved useful in this case, because generally nested documents (related, but not what I'm looking for) is what people seem to be more interested in.
Finally found the answer myself while looking for something different. Unfortunately, the information about the GeoShape filter is not in the GeoShape query manual pages:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html#_pre_indexed_shape
{
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"geo_shape": {
"location": {
"indexed_shape": {
"id": "DEU",
"type": "countries",
"index": "shapes",
"path": "location"
}
}
}
}
}
}
If anyone has better information about how to do this generically, I will happily accept their answer instead.

Resources