How to implement fuzzy field-centric (cross_fields) query on fields with multiple analysers? - elasticsearch

Mapping:
{
"articles" : {
"mappings" : {
"data" : {
"properties" : {
"author" : {
"type" : "text",
"analyzer" : "standard"
},
"content" : {
"type" : "text",
"analyzer" : "english"
},
"tags" : {
"type" : "keyword"
},
"title" : {
"type" : "text",
"analyzer" : "english"
}
}
}
}
}
}
Example data:
{
"author": "John Smith",
"title": "Hello world",
"content": "This is some example article",
"tags": ["programming", "life"]
}
So as you see I have mapping with different analysers on different fields. Now I want to search across those fields in a following way:
only documents matching all search keywords are returned (like multi_match with cross_fields as a type and and as operator)
query should be fuzzy so it can tolerate some typos
different fields should have different boost values (e.g. title more important than content)
For example following query should match above document:
programing worlds john examlpe
How can I do it? According to documentation fuzziness won't work with cross_fields nor fields with different analysers.
One way of doing it would be implementing custom _all fields and coping all values there using copy_to but with this approach I can't assign different weights nor use different analysers.

Related

How do I get the occurences and doc_count of every term of an ES Index?

I have an ES Index of the form
{
"adminfile" : {
"mappings" : {
"properties" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
With the field 'title' being the title of the string found in the field 'text'. The titles do not contain any spaces, while the texts are normal texts (sentences with spaces and dots etc).
I want to get all the terms in the index and their doc_count and/or frequency. I found this query in the ES doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
GET /adminfile/_search
{
"size": 10,
"aggs" : {
"text" : {
"terms" : {
"field" : "text.keyword",
"order" : { "_count" : "asc" },
"size": 10
}
}
}
}
This returns all the sources but the aggregation buckets are empty. If I change "text.keyword" to "title.keyword" in that command, it does work and return all the titles as keys.
Why does it not work on the text fields?
Is there a better command to use? I know that this:
GET /adminfile/_search
{
"query" : {
"match" : {"text" : "WordToSearch"}
},
"_source":false,
"aggregations": {
"keywords" : {
"significant_text" : {
"field" : "text",
"filter_duplicate_text": true,
"size": 100
}
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
works to get all occurences of wordToSearch in every document of the index, with the counts and frequency. Is there a way to ask this command to match every word of every doc?
EDIT: I have also tried changing the name of the text field to "contenu" in case ES didn't like have a field of name 'text' and of type 'text'. No effect.
Another option could be using https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but it _termvectors only works for one specific ID (or _mtermvectors for mutliple specific ID, not all the documents in any case)
EDIT2: I realised that the ignore_above could be a problem. I tried cutting all my texts to 200 chars as a test. The query now runs, except that it returns the entire text as a key instead of cutting it into words.
When you use the keyword version of the field, the content is kept as a single, large token. You're right in assuming that ignore_above is the source of your problem, since these tokens apparently will be longer than 256 characters in your data set.
If you instead aggregate across the tokenized field (the normal text field), instead of the keyword version, you'll get counts for each word (i.e. each token) as processed by the field.

What does "mappings" do in Elasticsearch?

I just started learning Elasticsearch. I am trying out to create index, adding data, deleting data, and search data.
I can also understand the settings of Elasticsearch.
When using "PUT" to use settings
{
"settings": {
"index.number_of_shards" : 1,
"index.number_of_replicas" : 0
}
}
When using "GET" to retrieve settings information
{
"dsm" : {
"settings" : {
"index" : {
"creation_date" : "1555487684262",
"number_of_shards" : "1",
"number_of_replicas" : "0",
"uuid" : "qsSr69OdTuugP2DUwrMh4g",
"version" : {
"created" : "7000099"
},
"provided_name" : "dsm"
}
}
}
}
However,
What does "mappings" do in Elasticsearch?
{
"kibana_sample_data_flights" : {
"aliases" : { },
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
}, // just part of data
The mapping document is a way of describing the structure of your data and defining the types eg boolean, text, keyword. These types are important as they determine how your fields are indexed and analysed.
Elasticsearch supports dynamic mapping, so effectively performs an automatic best guess of the appropriate types but you may wish to override these.
I found this to be a useful article to explain the mapping process:
https://www.elastic.co/blog/found-elasticsearch-mapping-introduction
Indexing is determined by the field type for example where the type is 'keyword' the search engine will be expecting an exact match, when the type is 'text' the search engine will be trying to determine how well the document matches the query term and in so doing so will be performing a 'full text search'.
So for example:
- A search for jump should also match jumped, jumps, jumping, and perhaps even leap.
This is a great article describing exact vs full text search and is where I took the jump example: https://www.elastic.co/guide/en/elasticsearch/guide/current/_exact_values_versus_full_text.html
Much of the power of elasticsearch is in the mapping and analysis.
Its the mapping of the index. This means it describes the data that is stored in this index. Take a deeper look here.

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an elastic search index. I need to get the top 100 most used three-word phrases being used in all these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
"mappings": {
"items": {
"properties": {
"body": {
"type": "string",
"term_vector": "with_positions_offsets_payloads",
"store" : true,
"analyzer" : "fulltext_analyzer"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
}
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
Elastic blog-post configuration but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
The, after applying the shingle analyzer to the field in question (as in the blog post), and reindexing your data, you should be able to issue a query returning a simple terms aggregation, on your body field to see the top one-hundred 3-word phrases.
{
"size" : 0,
"query" : {
"match_all" : {}
},
"aggs" : {
"three-word-phrases" : {
"terms" : {
"field" : "body",
"size" : 100
}
}
}
}

How to index dump of html files to elasticsearch?

I am totaly new in elastic so my knowledge is only from elasticsearch site and I need to help.
My task is to index large row data in html format into elastic search. I already crawled my data and stored it onto disk (200 000 html files). My question is what is the simplest way to index all html files into elasticsearch? Should I do it manualy by for each document to make put request to elastic? For example like:
curl -XPUT 'http://localhost:9200/registers/tomas/1' -d '{
"user" : "tomasko",
"post_date" : "2009-11-15T14:12:12",
"field 1" : "field data"
"field 2" : "field 2 data"
}'
And second question is if I have to parse HTML document to retrieve data for JSON field 1 like in example code over?
And finaly after indexing may I delete all HTML documents? Thanks for all.
I'd look at the bulk api that allows you to send more than document in a single request, in order to speed up your indexing process. You can send batch of 10, 20 or more documents, depending on how big they are.
Depending on what you want to index you might need to parse the html, unless you want to index the whole html as a single field (you might want to use the html strip char filter in that case to strip out the html tags from the indexed text).
After indexing I'd suggest to make sure the mapping is correct and you can find what you're looking for. You can always reindex using the _source special field that elasticsearch stores under the hood, but if you already wrote your indexer code you might want to use it again to reindex when needed (of course with the same html documents). In practice, you never index your data once... so be careful :) even though elasticsearch always helps you out with the _source field), it's just a matter of querying the existing index and reindex all its documents on another index.
#javanna's suggestion to look at the Bulk API will definitely lead you in the right direction. If you are using NEST, you can store all your objects in a list which you can then serialize JSON objects for indexing the content.
Specifically, if you want to strip the html tags out prior to indexing and storing the content as is, you can use the mapper attachment plugin - in which when you define the mapping, you can categorize the content_type to be "html."
The mapper attachment is useful for many things especially if you are handling multiple document types, but most notably - I believe just using this for the purpose of stripping out the html tags is sufficient enough (which you cannot do with the html_strip char filter).
Just a forewarning though - NONE of the html tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: You cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
You'll need to result in this mapping:
{
"html5-es" : {
"aliases" : { },
"mappings" : {
"document" : {
"properties" : {
"delete" : {
"type" : "boolean"
},
"file" : {
"type" : "attachment",
"fields" : {
"content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"author" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
},
"content_length" : {
"type" : "integer"
},
"language" : {
"type" : "string"
}
}
},
"hash_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"raw_content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "raw"
},
"title" : {
"type" : "string"
}
}
}
},
"settings" : { //insert your own settings here },
"warmers" : { }
}
}
Such that in NEST, I will assemble the content as such:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I have tested this in Sense - results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}

How should I query Elastic Search given my mapping and using keywords?

I have a very simple mapping which looks like this (I streamlined the example a bit):
{
"location" : {
"properties": {
"name": { "type": "string", "boost": 2.0, "analyzer": "snowball" },
"description": { "type": "string", "analyzer": "snowball" }
}
}
}
Now I index a lot of locations using some random values which are based on real English words.
I'd like to be able to search for locations that match any of the given keywords in either the name or the description field (name is more important, hence the boost I gave it). I tried a few different queries and they don't return any results.
{
"fields" : ["name", "description"],
"query" : {
"terms" : {
"name" : ["savage"],
"description" : ["savage"]
},
"from" : 0,
"size" : 500
}
}
Considering there are locations which have the word savaged in the description it should get me some results (savage is the stem of savaged). It yields 0 results using the above query. I've been using curl to query ES:
curl -XGET -d #query.json http://localhost:9200/myindex/locations/_search
If I use query string instead:
curl -XGET http://localhost:9200/fieldtripfinder/locations/_search?q=description:savage
I actually get one result (of course now it would be searching the description field only).
Basically I am looking for a query that will do a OR kind of search using multiple keywords and compare them to the values in both the name and the description field.
Snowball stems "savage" into "savag" that’s why term "savage" didn't return any results. However, when you specify "savage" on URL, it’s getting analyzed and you get results. Depending on what your intention is, you can either use correct stem ("savag") or analyze your terms by using "match" query instead of "terms":
{
"fields" : ["name", "description"],
"query" : {
"bool" : {
"should" : [
{"match" : {"name" : "savage"}},
{"match" : {"description" : "savage"}}
]
},
"from" : 0,
"size" : 500
}
}

Resources