How can I boost certain fields over others in elasticsearch?

My goal is to apply the boost to the field "name" (see example below), but I have two problems when I search for "john":
the search also matches {name: "dany", message: "hi bob"}, where name is "dany", and
the search does not boost name over message (rows with name="john" should be at the top).
The gist is at https://gist.github.com/tomaspet262/5535774
(since Stack Overflow's form submit returned 'Your post appears to contain code that is not properly formatted as code', even though it was formatted properly).

I would suggest using query-time boosting instead of index-time boosting.
# DELETE
curl -XDELETE 'http://localhost:9200/test'
echo
# CREATE
curl -XPUT 'http://localhost:9200/test?pretty=1' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyz_1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}'
echo
# DEFINE
curl -XPUT 'http://localhost:9200/test/posts/_mapping?pretty=1' -d '{
  "posts": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "my_analyz_1"
      },
      "message": {
        "type": "string",
        "analyzer": "my_analyz_1"
      }
    }
  }
}'
echo
# INSERT
curl localhost:9200/test/posts/1 -d '{"name": "john", "message": "hi john"}'
curl localhost:9200/test/posts/2 -d '{"name": "bob", "message": "hi john, how are you?"}'
curl localhost:9200/test/posts/3 -d '{"name": "john", "message": "bob?"}'
curl localhost:9200/test/posts/4 -d '{"name": "dany", "message": "hi bob"}'
curl localhost:9200/test/posts/5 -d '{"name": "dany", "message": "hi john"}'
echo
# REFRESH
curl -XPOST localhost:9200/test/_refresh
echo
# SEARCH
curl "localhost:9200/test/posts/_search?pretty=1" -d '{
  "query": {
    "multi_match": {
      "query": "john",
      "fields": ["name^2", "message"]
    }
  }
}'

I'm not sure if this is relevant in this case, but when testing with such small amounts of data, I always use a single shard instead of the default settings, to rule out scoring issues caused by distributed frequency calculations.
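To rule that out here, the CREATE step above can be extended with shard settings (a sketch; the shard/replica values are just for local testing, and the analyzer is the same my_analyz_1 as above):

```shell
# Create the test index with a single shard and no replicas, so all
# documents share one set of term statistics and scores are comparable.
curl -XPUT 'http://localhost:9200/test?pretty=1' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_analyz_1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "asciifolding"]
        }
      }
    }
  }
}'
```

With one shard, IDF is computed over the whole corpus, so the name^2 boost orders the name="john" rows first even on a five-document dataset.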

Related

Don't make some fields searchable when using query_string or term/terms in Elasticsearch

Having this mapping:
curl -XPUT 'localhost:9200/testindex?pretty=true' -d '{
  "mappings": {
    "items": {
      "dynamic": "strict",
      "properties": {
        "title": { "type": "string" },
        "body":  { "type": "string" },
        "tags":  { "type": "string" }
      }
    }
  }
}'
I add two simple items:
curl -XPUT 'localhost:9200/testindex/items/1' -d '{
  "title": "This is a test title",
  "body" : "This is the body of the java",
  "tags" : "csharp"
}'
curl -XPUT 'localhost:9200/testindex/items/2' -d '{
  "title": "Another text title",
  "body": "My body is great and Im super handsome",
  "tags" : ["cplusplus", "python", "java"]
}'
If I search for the string java:
curl -XGET 'localhost:9200/testindex/items/_search?q=java&pretty=true'
... it will match both items. The first item matches on the body and the second one on the tags.
How can I avoid searching in some fields? In this example I don't want it to match on the tags field, but I want to keep tags indexed, as I use them for aggregations.
I know I can do it using this:
{
  "query": {
    "query_string": {
      "query": "java AND -tags:java"
    }
  },
  "_source": {
    "exclude": ["*.tags"]
  }
}
But is there any other more elegant way, like putting something in the mapping?
PS: My searches are always query_strings and term / terms and I'm using ES 2.3.2
You can specify the fields option if you only want to match against certain fields:
{
  "query_string": {
    "fields": ["body"],
    "query": "java"
  }
}
EDIT 1
You could use the "include_in_all": false param in the mapping (check the documentation). The query_string query defaults to the _all field, so if you add "include_in_all": false to every field you don't want to match on, the following query will only look in the body field:
{
  "query_string": {
    "query": "java"
  }
}
Does this help?
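As a sketch of that mapping change, adapted to the testindex mapping above (testindex2 is just a fresh hypothetical index name; here only tags is excluded from _all):

```shell
# Exclude "tags" from the _all field: a bare query_string query for
# "java" then no longer matches on tags, while the field stays indexed
# and remains usable for term queries and aggregations.
curl -XPUT 'localhost:9200/testindex2?pretty=true' -d '{
  "mappings": {
    "items": {
      "dynamic": "strict",
      "properties": {
        "title": { "type": "string" },
        "body":  { "type": "string" },
        "tags":  { "type": "string", "include_in_all": false }
      }
    }
  }
}'
```

Note that include_in_all typically needs to be set when the field is first mapped, so this is easiest to apply on a fresh index.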

Elasticsearch bash script not working but if I copy and paste in the terminal it works

I have created a bash script to test my index, but it gives different results than when I copy the curl commands directly into my terminal. Why do I get different results when I run the script with sh [file-name]?
#!/bin/sh
alias curl="curl -s"
echo "Delete index"
curl -X DELETE "localhost:9200/products?pretty"
echo
echo "Create index"
curl -X POST "localhost:9200/products?pretty" -d '
{
  "products": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "my_synonym": {
              "ignore_case": "true",
              "expand": "true",
              "type": "synonym",
              "synonyms": [ "pote, foundation" ]
            }
          },
          "analyzer": {
            "folding_analyzer": {
              "filter": [ "standard", "lowercase", "asciifolding", "my_synonym" ],
              "tokenizer": "standard"
            }
          }
        },
        "number_of_shards": "1",
        "number_of_replicas": "0"
      }
    },
    "mappings": {
      "product": {
        "dynamic": "false",
        "properties": {
          "brand_name": {
            "type": "string",
            "index_options": "offsets",
            "analyzer": "folding_analyzer"
          },
          "product_name": {
            "type": "string",
            "index_options": "offsets",
            "analyzer": "folding_analyzer"
          }
        }
      }
    }
  }
}
'
echo
echo "Info index"
curl -XGET 'localhost:9200/products/_settings,_mappings?pretty'
echo
echo "Test doc:"
curl -X POST "localhost:9200/products/product/1?pretty" -d '{
"product_name": "Foundation Brush",
"brand_name": "Bobbi Brown"
}'
echo
echo "Test doc:"
curl -X POST "localhost:9200/products/product/2?pretty" -d '{
"product_name": "Foundation Primer",
"brand_name": "Laura Mercier"
}'
echo
echo "Test doc:"
curl -X POST "localhost:9200/products/product/3?pretty" -d '{
"product_name": "Lock-It Tattoo Foundation",
"brand_name": "Kat Von D"
}'
echo
echo "Test doc:"
curl -X POST "localhost:9200/products/product/4?pretty" -d '{
"product_name": "Diorskin Airflash Spray Foundation",
"brand_name": "Dior"
}'
echo
echo "Test doc:"
curl -X POST "localhost:9200/products/product/5?pretty" -d '{
"product_name": "Diorskin Airflash Spray Lancôme",
"brand_name": "Dior"
}'
echo
echo "Info index"
curl -XGET 'localhost:9200/products/_settings,_mappings?pretty'
echo
echo "Search all"
curl -X GET "localhost:9200/products/_search?pretty" -d '{
"query": {
"match_all": {}
}
}'
echo
Indexed documents in Elasticsearch are not immediately available for searching; they only show up after a refresh operation takes place. By default this operation occurs automatically every second, so when you paste the commands one by one, it happens while you are fetching the settings and mappings.
When you run the bash script, there is simply no time for the refresh to take place, so you need to add an explicit refresh yourself after the last index command:
curl -XPOST 'http://localhost:9200/products/_refresh?pretty'
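Alternatively (a sketch, using the refresh query parameter that the index API supports), you can force the refresh on the last index request itself, so no separate call is needed:

```shell
# Index the last document and force a refresh of the shard that received
# it, making the pending documents on that shard visible to search.
curl -X POST "localhost:9200/products/product/5?pretty&refresh=true" -d '{
  "product_name": "Diorskin Airflash Spray Lancôme",
  "brand_name": "Dior"
}'
```

Since the products index is created with a single shard here, refreshing that one shard makes all five documents searchable. Forcing refreshes is fine in test scripts but should be avoided on every request in production, as it is expensive.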

How to perform wildcard search on a date field?

I've got a field containing values like 2011-10-20, with the mapping:
"joiningDate": { "type": "date", "format": "dateOptionalTime" }
The following query ends up in a SearchPhaseExecutionException:
"wildcard" : { "joiningDate" : "2011*" }
It seems ES (v1.1) doesn't offer much here. This post suggests the idea of scripting (the unaccepted answer says even more). I'll try that; I'm just asking if anyone has done it already.
Expectation
A search string 13 should match all documents where the joiningDate field has values:
2011-10-13
2013-01-11
2100-13-02
I'm not sure if I understand your needs correctly, but I would suggest using a range query for the date field.
The query below will return the results you want:
{
  "query": {
    "range": {
      "joiningDate": {
        "gt": "2011-01-01",
        "lt": "2012-01-01"
      }
    }
  }
}
I hope this could help you.
Edit (searching for dates containing "13" itself)
I suggest using the "multi field" functionality of Elasticsearch.
It means you can index the "joiningDate" field as two different field types at the same time.
Please see and try the example code below.
Create a index
curl -XPUT 'localhost:9200/blacksmith'
Define mapping in which the type of "joiningDate" field is "multi_field".
curl -XPUT 'localhost:9200/blacksmith/my_type/_mapping' -d '{
  "my_type": {
    "properties": {
      "joiningDate": {
        "type": "multi_field",
        "fields": {
          "joiningDate": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "verbatim": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}'
Indexing 4 documents (3 documents containing "13")
curl -s -XPOST 'localhost:9200/blacksmith/my_type/1' -d '{ "joiningDate": "2011-10-13" }'
curl -s -XPOST 'localhost:9200/blacksmith/my_type/2' -d '{ "joiningDate": "2013-01-11" }'
curl -s -XPOST 'localhost:9200/blacksmith/my_type/3' -d '{ "joiningDate": "2130-12-02" }'
curl -s -XPOST 'localhost:9200/blacksmith/my_type/4' -d '{ "joiningDate": "2014-12-02" }' # no 13
Try a wildcard query against the "joiningDate.verbatim" field, NOT the "joiningDate" field:
curl -XGET 'localhost:9200/blacksmith/my_type/_search?pretty' -d '{
  "query": {
    "wildcard": {
      "joiningDate.verbatim": {
        "wildcard": "*13*"
      }
    }
  }
}'

Elasticsearch index aliases (w/ routing) and parent/child docs

I'm trying to set up an index with the following characteristics:
The index houses data for many projects. Most work is project-specific, so I set up aliases for each project, using project_id as the routing field. (And as an associated term filter.)
The data in question have a parent/child structure. For simplicity, let's call the doc types "mama" and "baby."
So we create the index and aliases:
curl -XDELETE http://localhost:9200/famtest
curl -XPOST http://localhost:9200/famtest -d '{
  "mappings": {
    "mama": {
      "properties": {
        "project_id": { "type": "string", "index": "not_analyzed" }
      }
    },
    "baby": {
      "_parent": { "type": "mama" },
      "properties": {
        "project_id": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/_aliases" -d '{
  "actions": [
    { "add": {
        "alias": "family1",
        "index": "famtest",
        "routing": "100",
        "filter": { "term": { "project_id": "100" } }
      }
    }
  ]
}'
curl -XPOST "http://localhost:9200/_aliases" -d '{
  "actions": [
    { "add": {
        "alias": "family2",
        "index": "famtest",
        "routing": "200",
        "filter": { "term": { "project_id": "200" } }
      }
    }
  ]
}'
Now let's make some mamas:
curl -XPOST localhost:9200/family1/mama/1 -d '{ "name" : "Family 1 Mom", "project_id" : "100" }'
curl -XPOST localhost:9200/family2/mama/2 -d '{ "name" : "Family 2 Mom", "project_id" : "200" }'
These documents are now available via /familyX/_search. So now we want to add a baby:
curl -XPOST localhost:9200/family1/baby/1?parent=1 -d '{ "name": "Fam 1 Baby","project_id" : "100" }'
Unfortunately, ES doesn't like that:
{"error":"ElasticSearchIllegalArgumentException[Alias [family1] has index routing associated with it [100], and was provided with routing value [1], rejecting operation]","status":400}
So... any idea how to use alias routing and still set the parent id? If I understand this right, it shouldn't be a problem: all project operations ("family1", in this case) go through the alias, so parent and child docs will wind up on the same shard anyway. Is there some alternative way to set the parent id, without interfering with the routing?
Thanks. Please let me know if I can be more specific.
Interesting question! As you already know, the parent id is used for routing too, since children must be indexed on the same shard as their parent documents. What you're trying to do is fine: parent and children belong to the same family, and thus land on the same shard anyway, since you configured routing on the family alias.
But the parent id would effectively override the routing defined in the alias, which is not allowed, and that's why you get the error. In fact, if you try again providing the routing explicitly in your index request, it works:
curl -XPOST 'localhost:9200/family1/baby/1?parent=1&routing=100' -d '{ "name": "Fam 1 Baby","project_id" : "100" }'
I would file a GitHub issue with a curl recreation.
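Once the baby is indexed with explicit routing, a has_child query through the alias can be used to verify that the parent/child pair landed on the same shard (a sketch against the example above):

```shell
# Find mamas through the family1 alias that have a baby whose name
# contains "baby"; this only returns results if parent and child
# documents live on the same shard.
curl -XPOST 'localhost:9200/family1/mama/_search?pretty' -d '{
  "query": {
    "has_child": {
      "type": "baby",
      "query": { "match": { "name": "baby" } }
    }
  }
}'
```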

Elasticsearch ignore words breakers

I'm new to Elasticsearch and I've got a problem with querying.
I indexed strings like these:
my-super-string
my-other-string
my-little-string
These strings are slugs, so there are no spaces, only alphanumeric characters and hyphens. The mapping for the related field is only "type=string".
I'm using a query like this:
{ "query": { "query_string": { "query": "*"+<MY_QUERY>+"*", "rewrite": "top_terms_10" } } }
where "MY_QUERY" is also a slug, something like "my-super" for example.
When searching for "my" I get results.
When searching for "my-super" I get no results, and I'd like to get "my-super-string".
Can someone help me with this? Thanks!
I would suggest using match_phrase instead of a query string with leading and trailing wildcards. Even the standard analyzer should be able to split slugs into tokens correctly, so there is no need for wildcards.
curl -XPUT "localhost:9200/slugs/doc/1" -d '{"slug": "my-super-string"}'
echo
curl -XPUT "localhost:9200/slugs/doc/2" -d '{"slug": "my-other-string"}'
echo
curl -XPUT "localhost:9200/slugs/doc/3" -d '{"slug": "my-little-string"}'
echo
curl -XPOST "localhost:9200/slugs/_refresh"
echo
echo "Searching for my"
curl "localhost:9200/slugs/doc/_search?pretty=true&fields=slug" -d '{"query" : { "match_phrase": {"slug": "my"} } }'
echo
echo "Searching for my-super"
curl "localhost:9200/slugs/doc/_search?pretty=true&fields=slug" -d '{"query" : { "match_phrase": {"slug": "my-super"} } }'
echo
echo "Searching for my-other"
curl "localhost:9200/slugs/doc/_search?pretty=true&fields=slug" -d '{"query" : { "match_phrase": {"slug": "my-other"} } }'
echo
echo "Searching for string"
curl "localhost:9200/slugs/doc/_search?pretty=true&fields=slug" -d '{"query" : { "match_phrase": {"slug": "string"} } }'
Alternatively, you can create your own analyzer that splits slugs into tokens only on "-":
curl -XDELETE localhost:9200/slugs
curl -XPUT localhost:9200/slugs -d '{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "slug_analyzer": {
            "tokenizer": "slug_tokenizer",
            "filter": ["lowercase"]
          }
        },
        "tokenizer": {
          "slug_tokenizer": {
            "type": "pattern",
            "pattern": "-"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "slug": { "type": "string", "analyzer": "slug_analyzer" }
      }
    }
  }
}'
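To check what the custom analyzer produces, you can run the _analyze API against the index (a sketch, using the 1.x-style request with the text in the body):

```shell
# Show the tokens slug_analyzer emits for a slug; with the pattern
# tokenizer splitting on "-", the expected tokens are my, super, string.
curl "localhost:9200/slugs/_analyze?analyzer=slug_analyzer&pretty" -d 'my-super-string'
```

With this mapping, the same match_phrase searches as above should return "my-super-string" for the query "my-super".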
