Elasticsearch on concatenation of multiple fields - elasticsearch

I have data where the phone number is stored in parts, so I modeled it as an object with separate properties. But now I want to search on the complete phone number.
"Phone":{
"type" : "object",
"properties" : {
"first" : {
"type” : "text"
},
"second": {
"type" : "text"
}
}
}
Now if I have three records, [{"first" : "123", "second" : "456"}, {"first" : "456", "second" : "123"}, {"first" : "412", "second" : "356"}], the search should treat them as the concatenated numbers "123456", "456123" and "412356", and the query "123" should return all three records.

Take a look at copy_to fields, or create an ingest pipeline that builds a single field out of those separate numbers and also enriches the JSON.
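For example, a minimal sketch of such an ingest pipeline using the set processor with a Mustache template (the pipeline name phone-concat and the target field Phone.full are made up for illustration):

PUT _ingest/pipeline/phone-concat
{
  "description" : "Concatenate the phone number parts into one searchable field",
  "processors" : [
    {
      "set" : {
        "field" : "Phone.full",
        "value" : "{{Phone.first}}{{Phone.second}}"
      }
    }
  ]
}

Index documents with ?pipeline=phone-concat and search on Phone.full. Note that matching "123" in the middle of "412356" would additionally require something like an ngram analyzer on that field, since a plain text field only matches whole tokens.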

Related

How to implement fuzzy field-centric (cross_fields) query on fields with multiple analysers?

Mapping:
{
  "articles" : {
    "mappings" : {
      "data" : {
        "properties" : {
          "author" : {
            "type" : "text",
            "analyzer" : "standard"
          },
          "content" : {
            "type" : "text",
            "analyzer" : "english"
          },
          "tags" : {
            "type" : "keyword"
          },
          "title" : {
            "type" : "text",
            "analyzer" : "english"
          }
        }
      }
    }
  }
}
Example data:
{
  "author": "John Smith",
  "title": "Hello world",
  "content": "This is some example article",
  "tags": ["programming", "life"]
}
So as you can see, I have a mapping with different analysers on different fields. Now I want to search across those fields in the following way:
only documents matching all search keywords are returned (like multi_match with type cross_fields and operator and)
query should be fuzzy so it can tolerate some typos
different fields should have different boost values (e.g. title more important than content)
For example, the following query should match the above document:
programing worlds john examlpe
How can I do it? According to the documentation, fuzziness won't work with cross_fields, nor across fields with different analysers.
One way of doing it would be to implement a custom _all field, copying all the values there using copy_to, but with this approach I can't assign different weights or use different analysers.
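For reference, the cross_fields query described above would look roughly like this (the ^ boosts are illustrative), and per the documentation fuzziness cannot be added to it:

{
  "query": {
    "multi_match": {
      "query": "programing worlds john examlpe",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["title^3", "content", "author"]
    }
  }
}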

How to find similar documents in Elasticsearch

My documents are made up of various fields. Now, given an input document, I want to find similar documents using the input document's fields. How can I achieve this?
{
  "query": {
    "more_like_this" : {
      "ids" : ["12345"],
      "fields" : ["field_1", "field_2"],
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
This will give you documents similar to the one with id 12345. Here you only need to specify the ids and the field names (like title, category, name, etc.), not their values.
Here is another query that works without ids; instead you specify the fields together with the values to match. Example: get documents with a title similar to:
elasticsearch is fast
{
  "query": {
    "more_like_this" : {
      "fields" : ["title"],
      "like" : "elasticsearch is fast",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
You can add more fields and their values.
You haven't mentioned the types of your fields. A general approach is to use a catch-all field (built with copy_to) together with the more_like_this query.
{
  "query": {
    "more_like_this" : {
      "fields" : ["first name", "last name", "address", "etc"],
      "like" : "your_query",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
Put everything in your_query. You can increase or decrease min_term_freq and max_query_terms to tune the matching.
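A minimal sketch of such a catch-all mapping, assuming recent (typeless) mapping syntax; the index and field names are made up for illustration:

PUT people
{
  "mappings": {
    "properties": {
      "first_name": { "type": "text", "copy_to": "all_text" },
      "last_name":  { "type": "text", "copy_to": "all_text" },
      "address":    { "type": "text", "copy_to": "all_text" },
      "all_text":   { "type": "text" }
    }
  }
}

You can then point the more_like_this fields at all_text instead of listing every field separately.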

Elasticsearch appends random strings to source data inside indexes

I am new to Elasticsearch and have a peculiar problem: I am using Elasticsearch with Kibana to store and visualize most of the events in my application. For example, to store a user login with a user_id of 123, I would write to the index user/login/123 with the following data:
{
  "details" : {
    "fname" : "John",
    "lname" : "Smith",
    "click" : "login-button",
    etc...
  },
  "ip_address" : "127.0.0.1",
  "browser_type" : "Chrome",
  "browser_version" : "17"
}
However, the problem I encountered is that some records show up with a random string inside the "details" object: see screenshot. Can anyone suggest what I am doing wrong and how I can fix the existing indexes?
Screenshot
I think you have something like this in your data:
{
  "details" : {
    "28d211adbf" : {
      "stats" : {
        "merge_field_count": 6,
        "unsubscribe_count_since_send": 3
      }
    },
    "555cd3bcba" : {
      "stats" : {
        "merge_field_count": 6,
        "unsubscribe_count_since_send": 3
      }
    }
  },
  "ip_address" : "127.0.0.1",
  "browser_type" : "Chrome",
  "browser_version" : "17"
}
Using arbitrary values like these random strings as field names is actually bad practice when indexing documents in Elasticsearch: every distinct key becomes a new field in the mapping.
Read about mapping explosion for more info about this:
https://www.elastic.co/blog/found-crash-elasticsearch#mapping-explosion
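If those keys really are arbitrary, one way to stop them from growing the mapping is to disable dynamic mapping on that object. A minimal sketch, reusing the index name from the question and assuming a recent, typeless Elasticsearch version:

PUT user
{
  "mappings": {
    "properties": {
      "details": {
        "type": "object",
        "dynamic": false
      }
    }
  }
}

The nested values are still kept in _source, but no new fields are created for them, so they are not searchable.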

Elasticsearch: Comparing two fields of the same document, where one of the fields is inside a nested document

Consider the following document:
{
  "group" : "fans",
  "preferredFanId" : 1,
  "user" : [
    {
      "fanId" : 1,
      "first" : "John",
      "last" : "Smith"
    },
    {
      "fanId" : 2,
      "first" : "Alice",
      "last" : "White"
    }
  ]
}
where "user" is a nested document. I want to get inner_hits (from 2.0.0-SNAPSHOT) where preferredFanId == user.fanId , and so I want only the John Smith record returned in the inner_hits.
Is it possible? I've tried several approaches like using "include_in_parent" or "_source", but nothing seems to work.

How to index a dump of html files to elasticsearch?

I am totally new to Elasticsearch, so my knowledge comes only from the Elasticsearch site, and I need some help.
My task is to index a large amount of raw data in HTML format into Elasticsearch. I have already crawled my data and stored it on disk (200,000 HTML files). My question is: what is the simplest way to index all the HTML files into Elasticsearch? Should I do it manually by making a PUT request to Elasticsearch for each document? For example:
curl -XPUT 'http://localhost:9200/registers/tomas/1' -d '{
  "user" : "tomasko",
  "post_date" : "2009-11-15T14:12:12",
  "field 1" : "field data",
  "field 2" : "field 2 data"
}'
My second question is whether I have to parse the HTML documents to extract the data for fields like field 1 in the example above.
And finally, after indexing, may I delete all the HTML documents? Thanks for all.
I'd look at the bulk API, which allows you to send more than one document in a single request, in order to speed up your indexing process. You can send batches of 10, 20 or more documents, depending on how big they are.
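A minimal sketch of the bulk format, reusing the index and type from the question (the body is newline-delimited JSON and must end with a newline; the second document is made up for illustration):

curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "index" : { "_index" : "registers", "_type" : "tomas", "_id" : "1" } }
{ "user" : "tomasko", "post_date" : "2009-11-15T14:12:12", "field 1" : "field data", "field 2" : "field 2 data" }
{ "index" : { "_index" : "registers", "_type" : "tomas", "_id" : "2" } }
{ "user" : "tomasko", "post_date" : "2009-11-16T10:00:00", "field 1" : "field data", "field 2" : "field 2 data" }
'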
Depending on what you want to index you might need to parse the html, unless you want to index the whole html as a single field (you might want to use the html strip char filter in that case to strip out the html tags from the indexed text).
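If you go the html_strip route, a minimal sketch of a custom analyzer using that char filter (the analyzer name html_text is made up):

PUT registers
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "html_text" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "char_filter" : [ "html_strip" ]
        }
      }
    }
  }
}

Any text field mapped with "analyzer": "html_text" will then have the HTML tags stripped before tokenization.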
After indexing I'd suggest making sure the mapping is correct and that you can find what you're looking for. You can always reindex using the _source special field that Elasticsearch stores under the hood, but if you already wrote your indexer code you might want to use it again to reindex when needed (with the same HTML documents, of course). In practice you never index your data just once, so be careful :) Even so, Elasticsearch always helps you out with the _source field; reindexing is just a matter of querying the existing index and indexing all its documents into another index.
@javanna's suggestion to look at the Bulk API will definitely lead you in the right direction. If you are using NEST, you can store all your objects in a list and then serialize them to JSON for indexing the content.
Specifically, if you want to strip the HTML tags out prior to indexing and store the content as-is, you can use the mapper attachment plugin: when you define the mapping, you can set the content_type to "html".
The mapper attachment is useful for many things, especially if you are handling multiple document types, but most notably, I believe just using it for the purpose of stripping out the HTML tags is sufficient (which you cannot do with the html_strip char filter).
Just a forewarning though: NONE of the HTML tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: you cannot specify multi-fields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
You'll want to end up with this mapping:
{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { //insert your own settings here },
    "warmers" : { }
  }
}
In NEST, I assemble the content as follows:
// Base64-encode the raw file bytes for the attachment field
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";

// Document is the POCO mapped to the "document" type above;
// RawContent holds the original (unstripped) HTML
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I have tested this in Sense; the results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}
