Exact phrase match in Elasticsearch

I'm trying to achieve exact phrase search in Elastic, using my existing full-text index. When a user searches for, say, "Sanity Testing", the results should include all docs containing "Sanity Testing" (case-insensitive), but not "Sanity tested".
My mapping:
{
  "doc": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "term_vector": "with_positions_offsets",
            "analyzer": "o3analyzer",
            "store": true
          },
          "title": { "store": "yes" },
          "date": { "store": "yes" },
          "keywords": { "store": "yes" },
          "content_type": { "store": "yes" },
          "content_length": { "store": "yes" },
          "language": { "store": "yes" }
        }
      }
    }
  }
}
As I understand it, I could add another field with a "raw" analyzer, but I'm not sure that will work, since the search needs to be case-insensitive. I also don't want to rebuild the indexes: there are hundreds of machines with tons of documents already indexed, so that could take ages.
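If reindexing were on the table, I imagine the cleanest fix would be a sub-field analyzed without stemming, e.g. with the built-in standard analyzer (which lowercases but does not stem), and a match_phrase against it. A rough sketch, ignoring the attachment wrapper for brevity (the index, field, and sub-field names here are mine, and existing documents would only get the new sub-field after reindexing):
PUT /myindex/_mapping/doc
{
  "properties": {
    "file": {
      "type": "string",
      "analyzer": "o3analyzer",
      "fields": {
        "exact": {
          "type": "string",
          "analyzer": "standard"
        }
      }
    }
  }
}
A match_phrase on file.exact would then be case-insensitive but still reject "Sanity tested".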
Is there a way to run such a query without that? I'm currently searching with the following query:
{
  "query": {
    "match_phrase": {
      "file": "Sanity Testing"
    }
  }
}
and it brings back both "Sanity Testing" and "Sanity Tested".
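I suspect o3analyzer includes a stemmer, which would make "Testing" and "Tested" index as the same token. A quick check with the _analyze API should confirm this (the index name is mine; on older versions _analyze takes query parameters rather than a JSON body):
GET /myindex/_analyze
{
  "analyzer": "o3analyzer",
  "text": "Sanity Testing Sanity Tested"
}
If both phrases come back as identical token streams, no query against this field alone can tell them apart.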
Any help appreciated!

Elasticsearch : how to search multiple words in a copy_to field?

I am currently learning Elasticsearch and am stuck on the issue described below.
On an existing index (I don't know if it matters) I added this new mapping:
PUT user-index
{
  "mappings": {
    "properties": {
      "common_criteria": {   // new property which aggregates the others via copy_to
        "type": "text"
      },
      "name": {              // already existed before this mapping
        "type": "text",
        "copy_to": "common_criteria"
      },
      "username": {          // already existed before this mapping
        "type": "text",
        "copy_to": "common_criteria"
      },
      "phone": {             // already existed before this mapping
        "type": "text",
        "copy_to": "common_criteria"
      },
      "country": {           // already existed before this mapping
        "type": "text",
        "copy_to": "common_criteria"
      }
    }
  }
}
The goal is to search ONE or MORE values only on common_criteria.
Say that we have:
{
  "common_criteria": ["John Smith", "johny", "USA"]
}
What I would like to achieve is an exact-match search on multiple values of common_criteria:
We should get a result when searching for John Smith, for USA + John Smith, for johny + USA, for USA alone, for johny alone, and finally for John Smith + USA + johny (word order does not matter).
If we search with multiple words like John Smith + Germany or johny + England, we should get no result.
I am using Spring Data Elastic to build my query:
NativeSearchQueryBuilder nativeSearchQuery = new NativeSearchQueryBuilder();
BoolQueryBuilder booleanQuery = QueryBuilders.boolQuery();
String valueToSearch = "johny";
nativeSearchQuery.withQuery(booleanQuery.must(QueryBuilders.matchQuery("common_criteria", valueToSearch)
        .fuzziness(Fuzziness.AUTO)
        .operator(Operator.AND)));
Logging the request sent to Elastic, I see:
{
  "bool": {
    "must": {
      "match": {
        "common_criteria": {
          "query": "johny",
          "operator": "AND",
          "fuzziness": "AUTO",
          "prefix_length": 0,
          "max_expansions": 50,
          "fuzzy_transpositions": true,
          "lenient": false,
          "zero_terms_query": "NONE",
          "auto_generate_synonyms_phrase_query": true,
          "boost": 1.0
        }
      }
    },
    "adjust_pure_negative": true,
    "boost": 1.0
  }
}
With that request I get 0 results. I know the request is not correct because of the must/match condition, and maybe the common_criteria field is not well defined either.
Thanks in advance for your help and explanations.
EDIT: after trying the multi_match query.
Following #rabbitbr's suggestion I tried the multi_match query, but it does not seem to work either. This is an example of a request sent to Elastic (returning 0 results):
{
  "bool": {
    "must": {
      "multi_match": {
        "query": "John Smith USA",
        "fields": [
          "name^1.0",
          "username^1.0",
          "phone^1.0",
          "country^1.0"
        ],
        "type": "best_fields",
        "operator": "AND",
        "slop": 0,
        "fuzziness": "AUTO",
        "prefix_length": 0,
        "max_expansions": 50,
        "zero_terms_query": "NONE",
        "auto_generate_synonyms_phrase_query": true,
        "fuzzy_transpositions": true,
        "boost": 1.0
      }
    },
    "adjust_pure_negative": true,
    "boost": 1.0
  }
}
That request returns no results either.
I would try a multi_match query before creating a field that stores all the others in one place:
The multi_match query builds on the match query to allow multi-field queries.
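A minimal sketch of what I mean, assuming cross_fields so the AND operator requires every term to match in at least one field (note that fuzziness is not supported in cross_fields mode, so typo tolerance has to be dropped here):
GET user-index/_search
{
  "query": {
    "multi_match": {
      "query": "John Smith USA",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["name", "username", "phone", "country"]
    }
  }
}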

Add new suggestion context

I have this search context in my index mapping (index: region):
"place_suggest": {
"type" : "completion",
"analyzer" : "simple",
"preserve_separators" : true,
"preserve_position_increments" : true,
"max_input_length" : 50,
"contexts" : [
{
"name" : "place_type",
"type" : "CATEGORY",
"path" : "place_type"
}
]
}
And I want to add a new context to this mapping:
{
  "name": "restricted",
  "type": "CATEGORY",
  "path": "restricted"
}
I've tried using the Update Mapping API to add the new context, like this:
PUT region_test/_mapping
{
  "properties": {
    "place_suggest": {
      "contexts": [
        {
          "name": "restricted",
          "type": "CATEGORY",
          "path": "restricted"
        }
      ]
    }
  }
}
I'm using Kibana dev tools for running this query.
You will not be able to edit the field by simply adding the new context: you need to create a new mapping and reindex your data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#change-existing-mapping-parms
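A minimal sketch of that workflow, assuming the existing index is region and region_v2 is the hypothetical new index: create the new index with both contexts in the mapping, then copy the data across with the Reindex API.
PUT region_v2
{
  "mappings": {
    "properties": {
      "place_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "contexts": [
          { "name": "place_type", "type": "CATEGORY", "path": "place_type" },
          { "name": "restricted", "type": "CATEGORY", "path": "restricted" }
        ]
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "region" },
  "dest": { "index": "region_v2" }
}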

How to implement fuzzy field-centric (cross_fields) query on fields with multiple analysers?

Mapping:
{
  "articles": {
    "mappings": {
      "data": {
        "properties": {
          "author": {
            "type": "text",
            "analyzer": "standard"
          },
          "content": {
            "type": "text",
            "analyzer": "english"
          },
          "tags": {
            "type": "keyword"
          },
          "title": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}
Example data:
{
  "author": "John Smith",
  "title": "Hello world",
  "content": "This is some example article",
  "tags": ["programming", "life"]
}
So, as you can see, I have a mapping with different analysers on different fields. Now I want to search across those fields in the following way:
only documents matching all search keywords are returned (like multi_match with cross_fields as the type and "and" as the operator)
the query should be fuzzy, so it can tolerate some typos
different fields should have different boost values (e.g. title more important than content)
For example, the following query should match the above document:
programing worlds john examlpe
How can I do it? According to the documentation, fuzziness won't work with cross_fields, nor with fields that use different analysers.
One way of doing it would be to implement a custom _all field and copy all values there using copy_to, but with that approach I can't assign different weights or use different analysers.
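One workaround I can think of (a sketch, not an established pattern): split the query string into terms on the client, build one dis_max per term across the fields (each match clause keeps its own field analyzer, fuzziness, and boost), and put all of them under bool.must so every keyword has to match in some field:
GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "dis_max": {
            "queries": [
              { "match": { "title":   { "query": "programing", "fuzziness": "AUTO", "boost": 3 } } },
              { "match": { "content": { "query": "programing", "fuzziness": "AUTO" } } },
              { "match": { "author":  { "query": "programing", "fuzziness": "AUTO" } } }
            ]
          }
        },
        {
          "dis_max": {
            "queries": [
              { "match": { "title":   { "query": "worlds", "fuzziness": "AUTO", "boost": 3 } } },
              { "match": { "content": { "query": "worlds", "fuzziness": "AUTO" } } },
              { "match": { "author":  { "query": "worlds", "fuzziness": "AUTO" } } }
            ]
          }
        }
      ]
    }
  }
}
with one more dis_max clause each for john and examlpe.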

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all of these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string (e.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine").
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer that produces solely 3-term shingles. Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
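Put together, the settings could look roughly like this (a sketch loosely following the blog post; the stop-word filter from the post is omitted for brevity, and the index name is hypothetical):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "filter_shingle"]
        }
      }
    }
  }
}
You would then point the body field (or a sub-field of it) at analyzer_shingle in the mapping.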
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}

Elasticsearch synonym analyzer not working

EDIT: To add on to this, the synonyms seem to be working with basic query_string queries.
"query_string" : {
"default_field" : "location.region.name.raw",
"query" : "nh"
}
This returns all of the results for New Hampshire, but a "match" query for "nh" returns no results.
I'm trying to add synonyms to my location fields in my Elastic index, so that if I do a location search for "Mass," "Ma," or "Massachusetts" I'll get the same results each time. I added the synonyms filter to my settings and changed the mapping for locations. Here are my settings:
analysis":{
"analyzer":{
"synonyms":{
"filter":[
"lowercase",
"synonym_filter"
],
"tokenizer": "standard"
}
},
"filter":{
"synonym_filter":{
"type": "synonym",
"synonyms":[
"United States,US,USA,USA=>usa",
"Alabama,Al,Ala,Ala",
"Alaska,Ak,Alas,Alas",
"Arizona,Az,Ariz",
"Arkansas,Ar,Ark",
"California,Ca,Calif,Cal",
"Colorado,Co,Colo,Col",
"Connecticut,Ct,Conn",
"Deleware,De,Del",
"District of Columbia,Dc,Wash Dc,Washington Dc=>Dc",
"Florida,Fl,Fla,Flor",
"Georgia,Ga",
"Hawaii,Hi",
"Idaho,Id,Ida",
"Illinois,Il,Ill,Ills",
"Indiana,In,Ind",
"Iowa,Ia,Ioa",
"Kansas,Kans,Kan,Ks",
"Kentucky,Ky,Ken,Kent",
"Louisiana,La",
"Maine,Me",
"Maryland,Md",
"Massachusetts,Ma,Mass",
"Michigan,Mi,Mich",
"Minnesota,Mn,Minn",
"Mississippi,Ms,Miss",
"Missouri,Mo",
"Montana,Mt,Mont",
"Nebraska,Ne,Neb,Nebr",
"Nevada,Nv,Nev",
"New Hampshire,Nh=>Nh",
"New Jersey,Nj=>Nj",
"New Mexico,Nm,N Mex,New M=>Nm",
"New York,Ny=>Ny",
"North Carolina,Nc,N Car=>Nc",
"North Dakota,Nd,N Dak, NoDak=>Nd",
"Ohio,Oh,O",
"Oklahoma,Ok,Okla",
"Oregon,Or,Oreg,Ore",
"Pennsylvania,Pa,Penn,Penna",
"Rhode Island,Ri,Ri & PP,R Isl=>Ri",
"South Carolina,Sc,S Car=>Sc",
"South Dakota,Sd,S Dak,SoDak=>Sd",
"Tennessee,Te,Tenn",
"Texas,Tx,Tex",
"Utah,Ut",
"Vermont,Vt",
"Virginia,Va,Virg",
"Washington,Wa,Wash,Wn",
"West Virginia,Wv,W Va, W Virg=>Wv",
"Wisconsin,Wi,Wis,Wisc",
"Wyomin,Wi,Wyo"
]
}
}
And the mapping for the location.region field:
"region":{
"properties":{
"id":{"type": "long"},
"name":{
"type": "string",
"analyzer": "synonyms",
"fields":{"raw":{"type": "string", "index": "not_analyzed" }}
}
}
}
But the synonyms analyzer doesn't seem to be doing anything. This query for example:
"match" : {
"location.region.name" : {
"query" : "Massachusetts",
"type" : "phrase",
"analyzer" : "synonyms"
}
}
This returns hundreds of results, but if I replace "Massachusetts" with "Ma" or "Mass" I get 0 results. Why isn't it working?
The order of the filters is:
"filter": [
  "lowercase",
  "synonym_filter"
]
So Elasticsearch lowercases the tokens first, and by the time it executes the second step, synonym_filter, the lowercased tokens no longer match any of the entries you defined (which contain uppercase letters).
To solve the problem, I would define the synonyms in lowercase.
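For example, the entries would become (a sketch; the rest of the list is lowercased the same way):
"synonyms": [
  "massachusetts,ma,mass",
  "new hampshire,nh",
  ...
]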
You can also define your synonyms filter as case insensitive:
"filter":{
"synonym_filter":{
"type": "synonym",
"ignore_case" : "true",
"synonyms":[
...
]
}
}
