how to include doc urls in result set - elasticsearch

In ElasticSearch, I am wondering how I can get back document urls as well in the search result set? Here is what I meant with some example.
Let's say I index a doc using the following curl command:
curl -XPUT 'http://localhost:9200/ads/offers/1234' -d '{
"name": "blah blah",
"Weight":0.0001,
...
}'
Then I run a search and I want to get the document URL itself in the result set. In the above case, the document URL is the following:
http://localhost:9200/ads/offers/1234.
How can I do that? Is there a special field name for this or do I have to create some kind field to store this explicitly?

elasticsearch search response contains all piece that are needed to build this URL on the client. The record for the URL in you example will look like this:
"hits" : [ {
"_index" : "ads",
"_type" : "offers",
"_id" : "1234",
...
If you really need to get this URL from elasticsearch you can use script field to combine these pieces together into a single field on the server side, although I cannot think of a legitimate scenario where it would be needed.

Related

Elastic Search pipeline search queries

I am looking for a way to pipeline multiple queries into Elastic search. My main problem is that when I receive the results I want to be able to know the which was the query that generated the result. In pseudo-code I would like to do something like following
query1="James Bond"
query2="Sean Connery"
query3="Charlie Chaplin"
pipeline=new ElasticSearchPipeline()
pipeline.add(query1);pipeline.add(query2);pipeline.add(query3)
pipeline.execute()
jamesBondResults=pipeline.getResultsForQuery(query1)
seanConneryResults=pipeline.getResultsForQuery(query2)
charleChaplinResults=pipeline.getResultsForQuery(query3)
The key feature is that I want to send avoid the overhead of sending multiple requests on the ES server, but still be able to treat the results as if I had sent the queries one by one.
The multi search API is exactly what you're looking for.
You can send many queries and the response will contain an array with the responses to each query in the same order:
curl -XPOST localhost:9200/_msearch -d '
{"index" : "test1"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{"index" : "test2",}
{"query" : {"match_all" : {}}}
'
The response array of the above multi search queries will contain two ES responses with the documents from the first and second queries.

can terms lookup mechanism query by other field but id?

here is elasticsearch official website about terms:
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/query-dsl-terms-query.html
As we can see, if we want to do terms lookup mechanism query, we should use command like this:
curl -XGET localhost:9200/tweets/_search -d '{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}'
But what if i want to do query by other field of users.
Assume that users has some other fields such as name and can i use terms lookup mechanism finding the tweets by giving users name but not id.
I have tried to use command like this:
curl -XGET localhost:9200/tweets/_search -d '{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"name" : "Jane",
"path" : "followers"
}
}
}
}'
but it occurs error.
Looking forward to your help. Thank you!
The terms lookup mechanism is basically a built-in optimization to not have to make two queries to JOIN two indices, i.e. one in index A to get the ids to lookup and a second to fetch the documents with those ids in index B.
In contrary to SQL, such a JOIN can only work on the id field since this is the only way to uniquely retrieve a document from Elasticsearch via a GET call, which is exactly what Elasticsearch will do in the terms lookup.
So to answer your question, the terms lookup mechanism will not work on any other field than the id field since the first document to be retrieved must be unique. In your case, ES would not know how to fetch the document for the user with name Jane since name is just a field present in the user document, but in no way a unique identifier for user Jane.
I think you did not understand exactly how this works. Terms lookup query works by reading values from a field of a document with the given id. In this case, you are trying to match the value of field user in tweets index with values of field followers in document with id "2" present in users index and user type.
If you want to read from any other field then simply mention that in "path".
What you mainly need to understand is that the lookup values are all fetched from a field of a single document and not multiple documents.

can terms lookup mechanism query be nested

I want to know can I nest a terms lookup mechanism query in anther terms lookup mechanism.
For instance:
curl -XPUT localhost:9200/users/user/2 -d '{
"tweets" : ["1", "3"]
}'
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
"uuid" : "1",
"comments":["1","2","3"]
}'
curl -XPUT localhost:9200/comments/comment/1 -d '{
"uuid" : "1"
}'
As you know, we can use a terms lookup mechanism query to get tweets which belong to the user:
curl -XGET localhost:9200/tweets/tweet/_search -d'{
"query" : {
"terms" : {
"uuid" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "tweets"
}
}
}
}'
But if i want to get comments, i must do anther query.
However my documents is so many, it is not a good method.
So i want to nest terms lookup query in order to get comments in only one query by user's id, can i?
I will so appreciate it, if you can give me some help. Thank you! :)
At the moment, this is not possible as far as I know, because you expect data from three different indices to be returned in one query, which would equate to a JOIN. The terms lookup query sort of implements JOINs between two indices "only" (which is already quite cool considering the fact that ES does not want to support JOINs in the first place).
One way out of this would be to refactor your data model to get rid of the comments index and use either parent/child and/or nested relationships within the tweet mapping type. Since a comment can only belong to a single tweet and there aren't usually hundreds of comments on a tweet (I'm pretty confortable with the idea that 99% of the time there are less than half a dozen comments per tweet, if any at all), you could add comments either as a child documents or as a nested document (my preference), instead of just storing their ids in the comments array. That way you'd get your comments right away with your existing query, without the need for a second query.
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
"uuid" : "1",
"comments":[{
"id": 1,
"content": "Nice tweet!"
},{
"id": 2,
"content": "Way to go!"
},{
"id": 3,
"content": "Sucks!"
}]
}'
Or you can wait for this pull request (#3278) (Terms Lookup by Query/Filter (aka. Join Filter)) to be merged, which will effectively allow to do what you're asking for, but that PR has been created more than 2 years ago and there still are conflicts to be resolved.

How to retrieve all the document ids from an elasticsearch index

How to retrieve all the document ids (the internal document '_id') from an Elasticsearch index? if I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read off the file system. My experience with size/from and scan/scroll has been disaster when dealing with querying resultsets in the millions. Just takes too long.
If you can use a tool like knapsack, you can export the index to the file system, and iterate through the directories. Each document is stored under it's own directory named after _id. No need to actually open files. Just iterate through the dir.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that amount of documents, you probably want to use the scan and scroll API.
Many client libraries have ready helpers to use the interface. For example, with elasticsearch-py you can do:
es = elasticsearch.Elasticsearch(eshost)
scroll = elasticsearch.helpers.scan(es, query='{"fields": "_id"}', index=idxname, scroll='10s')
for res in scroll:
print res['_id']
First you can issue a request to get the full count of records in the index.
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
"count" : 1408,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
Then you'll want to loop through the set using a combination of size and from parameters until you reach the total count. Passing an empty field parameter will return only the index and _id that you're interested in.
Find a good page size that you can consume without running out of memory and increment the from each iteration.
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
"_index" : "documents",
"_type" : "document",
"_id" : "1341",
"_score" : 1.0
},
...

Queries vs. Filters

I can't see any description of when I should use a query or a filter or some combination of the two. What is the difference between them? Can anyone please explain?
The difference is simple: filters are cached and don't influence the score, therefore faster than queries. Have a look here too. Let's say a query is usually something that the users type and pretty much unpredictable, while filters help users narrowing down the search results , for example using facets.
This is what official documentation says:
As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
An example (try it yourself)
Say index myindex contains three documents:
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hello world!" }'
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hello world! I am Sam." }'
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hi Stack Overflow!" }'
Query: How well a document matches the query
Query hello sam (using keyword must)
curl localhost:9200/myindex/_search?pretty -d '
{
"query": { "bool": { "must": { "match": { "msg": "hello sam" }}}}
}'
Document "Hello world! I am Sam." is assigned a higher score than "Hello world!", because the former matches both words in the query. Documents are scored.
"hits" : [
...
"_score" : 0.74487394,
"_source" : {
"name" : "Hello world! I am Sam."
}
...
"_score" : 0.22108285,
"_source" : {
"name" : "Hello world!"
}
...
Filter: Whether a document matches the query
Filter hello sam (using keyword filter)
curl localhost:9200/myindex/_search?pretty -d '
{
"query": { "bool": { "filter": { "match": { "msg": "hello sam" }}}}
}'
Documents that contain either hello or sam are returned. Documents are NOT scored.
"hits" : [
...
"_score" : 0.0,
"_source" : {
"name" : "Hello world!"
}
...
"_score" : 0.0,
"_source" : {
"name" : "Hello world! I am Sam."
}
...
Unless you need full text search or scoring, filters are preferred because frequently used filters will be cached automatically by Elasticsearch, to speed up performance. See Elasticsearch: Query and filter context.
Filters -> Does this document match? a binary yes or no answer
Queries -> Does this document match? How well does it match? uses scoring
Few more addition to the same.
A filter is applied first and then the query is processed over its results. To store the binary true/false match per document , something called a bitSet Array is used.
This BitSet array is in memory and this would be used from second time the filter is queried. This way , using bitset array data-structure , we are able to utilize the cached result.
One more point to note here , the filter cache is created only when the request is executed hence only from the second hit , we actually get the advantage of caching.
But then you can use warmer API , to outgrow this. When you register a query with filter against a warmer API , it will make sure that this is executed against a new segment whenever it comes live. Hence we will get consistent speed from the first execution itself.
Basically, a query is used when you want to perform a search on your documents with scoring.
And filters are used to narrow down the set of results obtained by using query. Filters are boolean.
For example say you have an index of restaurants something like zomato.
Now you want to search for restaurants that serve 'pizza', which is basically your search keyword.
So you will use query to find all the documents containing "pizza" and some results will obtained.
Say now you want list of restaurant that serves pizza and has rating of atleast 4.0.
So what you will have to do is use the keyword "pizza" in your query and apply the filter for rating as 4.0.
What happens is that filters are usually applied on the results obtained by querying your index.
Since version 2 of Elasticsearch, filters and queries have been merged and any query clause can be used as either a filter or a query (depending on the context). As with version 1, filters are cached and should be used if scoring does not matter.
Source: https://logz.io/blog/elasticsearch-queries/
Queries : calculate score; thus they’re able to return results sorted by relevance.
Filters : don’t calculate score, making them faster and easier to cache.

Resources