Get most searched terms from indexes that are referenced by an alias - elasticsearch

The search is pointing to an alias.
The indexing process create a new index every 5 minutes.
Then the alias is updated, pointing to the new index.
The index is recreated to avoid sync problems that can occur if we update item by item when a change is made.
However, I need to keep track of the searched terms to produce a dashboard to list the most searched terms in a period. Or even using Kibana to show/extract it.
*The searched terms can be multi words, such as "white", "white summer night", etc. We are looking to rank the term, not the individual words.
I don't have experience with Elasticsearch and the searches that I have tried did not bring relevant solutions.
Thanks for the help!
{
"actions" : [
{ "remove" : { "index" : "catalog*", "alias" : "catalog-index" } },
{ "add" : { "index" : "catalog1234566", "alias" : "catalog-index" } }
]
}
Mappings:
{
"mappings":{
"properties":{
"created_at":{
"type":"integer"
},
"search_terms_key":{
"type":"keyword"
}
}
}
}
Query:
{
"query":{
"match_all":{
}
},
"aggs":{
"search_terms_key":{
"terms":{
"field":"search_terms_key",
"value_type":"string"
}
}
}
}

Log the search terms (or entire queries, if necessary), ingest those into Elasticsearch, then analyze them with Kibana. The index alias configuration is not relevant.
You should get the logs either directly from whatever connects to Elasticsearch, or from a proxy between it and Elasticsearch.
You could get Elasticsearch itself to log queries, but that's usually a bad idea in terms of performance.
Since it's the entire term you're after, be sure to use keyword mapping on the search term.
Once you have search terms ingested, use a terms aggregation to show the most popular searches.
[edit: make explicit that search terms need to be logged, not full DSL queries]

Related

Is it possible to check that specific data matches the query without loading it to the index?

Imagine that I have a specific data string and a specific query. The simple way to check that the query matches the data is to load the data into the Elastic index and run the online query. But can I do it without putting it into the index?
Maybe there are some open-source libraries that implement the Elastic search functionality offline, so I can call something like getScore(data, query)? Or it's possible to implement by using specific API endpoints?
Thanks in advance!
What you can do is to leverage the percolator type.
What this allows you to do is to store the query instead of the document and then test whether a document would match the stored query.
For instance, you first create an index with a field of type percolator that will contain your query (you also need to add in the mapping any field used by the query so ES knows what their types are):
PUT my_index
{
"mappings": {
"properties": {
"query": {
"type": "percolator"
},
"message": {
"type": "text"
}
}
}
}
Then you can index a real query, like this:
PUT my_index/_doc/match_value
{
"query" : {
"match" : {
"message" : "bonsai tree"
}
}
}
Finally, you can check using the percolate query if the query you've just stored would match
GET /my_index/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document" : {
"message" : "A new bonsai tree in the office"
}
}
}
}
So all you need to do is to only store the query (not the documents), and then you can use the percolate query to check if the documents would have been selected by the query you stored, without having to store the documents themselves.

Understanding boosting in ElasticSearch

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example
"query":
{
"bool":{
"should":[
{
"match":{
"title":{
"query":"cat",
"boost":2
}
}
},
{
"match":{
"content":{
"query":"cat",
"boost":2
}
}
}
]
}
}
Does that mean that the IDFs of the term "cat" is going to be boosted / multiplied by a factor of 2?
Also, what happens if instead of search for one word I have a sentence? Would that mean that the IDFs of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but that seems a little confusing.
The boost is used when query with multi query clauses, example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the above query, it means clause1 is three times important than clause3, clause2 is the twice important than clause2, It's not simply multiply 3, 2, because when calculate score, because there is normalized for scores.
also if you just query with one query clause with boost, it's not useful.
An usage scenario for using boost:
A set of page document set with title and content field.
You want to search title and content with some terms, and you think title is more important than content when search these documents. so you can set title query clause boost more than content. Such as if your query hit one document by title field, and one hit document by content field, and you want to hit title field's document prior to the content field document. so boost can help you do it.

Elasticsearch to wildcard search email addresses

I'm trying to use elasticsearch for a project I'm working on. I was wondering if someone could help steer me in the right direction. I'm using an index with 100+ million records.
I need to be able to search with a wildcard query like the following:
b*g#gmail.com
b*g#*.com
*gus#gmail.com
br*gu*#gmail.com
*g*#*
When I try using Wildcard and other searches, I don't get completely expected results.
What type of search with elasticsearch should I look into implementing? Is ElasticSearch even the right tool to be using? The source I'm pulling this out of is Mysql, so if not I may consider using Sphinx or Solr.
I assume that you have tried out the wildcard query as described here.
However, it has very different behaviour if your email is analyzed versus not analyzed. I would suggest you delete your index and change your mapping. e.g.
PUT /emails
{
"mappings": {
"email": {
"properties": {
"email": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Once you have this, you can just do the normal wildcard query or query_string. e.g.
GET emails/_search
{
"query": {
"wildcard": {
"email": {
"value": "s*com"
}
}
}
}
As an aside, when you just index email without setting it as not_analyzed, the default mapping actually splits up the email prefix from the domain and so that's why you don't get results for when you do s*#gmail.com. You would still get results for s* or *gmail.com but for your case, using not_analyzed works correctly. If you want to support case insensitivity, then you might want to look at a custom analyzer that uses the uax_url_email tokenizer as described here.

Is it possible to returned the analyzed fields in an ElasticSearch >2.0 search?

This question feels very similar to an old question posted here: Retrieve analyzed tokens from ElasticSearch documents, but to see if there are any changes I thought it would make sense to post it again for the latest version of ElasticSearch.
We are trying to search bodies of text in ElasticSearch with the search-query and field-mapping using the snowball stemmer built into ElasticSearch. The performance and results are great, but because we need to have the stemmed text-body for post-analysis we would like to have the search result return the actual stemmed tokens for the text-field per document in the search results.
The mapping for the field currently looks like:
"TitleEnglish": {
"type": "string",
"analyzer": "standard",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
},
"stemming": {
"type": "string",
"analyzer": "snowball"
}
}
}
and the search query is performed specifically on TitleEnglish.stemming. Ideally I would like it to return that field, but returning that does not return the analyzed field but the original field.
Does anybody know of any way to do this? We have looked at Term Vectors, but they only seem to be returnable for individual documents or a body of documents, not for a search result?
Or perhaps other solutions like Solr or Sphinx do offer this option?
To add some extra information. If we run the following query:
GET /_analyze?analyzer=snowball&text=Eight issue of Industrial Lorestan eliminate barriers to facilitate the Committees review of
It returns the stemmed words: eight, issu, industri, etc. This is exactly the result we would like back for each matching document for all of the words in the text (so not just the matches).
Unless I'm missing something evident, why not simply returning a terms aggregation on the TitleEnglish.stemming field?
{
"query": {...},
"aggs" : {
"stems" : {
"terms" : {
"field" : "TitleEnglish.stemming",
"size": 50
}
}
}
}
Adding that aggregation to your query, you'd get a breakdown of all the stemmed terms in the TitleEnglish.stemming sub-field from the documents that matched your query.

Elasticsearch doesn't return results

I am facing a strange issue in elasticsearch query. I don't know much about elasticsearch. My query is:
{
"query":
{
"bool":
{
"must":
[
{
"text":
{
"countryCode2":"DE"
}
}
],
"must_not":[],
"should":[]
}
},"from":0,"size":1,"sort":[],"facets":{}
}
The issues is for "DE". It is giving me results but for "BE" or "IN" it returns empty result.
You are indexing using the default mapping, which by default removes english stopwords. The country codes "IN", "BE", and many more are stopwords which don't even get indexed, therefore it's not possible to have matching documents, nor get back those country codes when faceting on that field.
The solution is to reindex after having submitted your own mapping for the country code field:
{
"your_type_name" : {
"country" : {
"type" : "string", "index" : "not_analyzed"
}
}
}
If you already tried to do this but nothing changed, the mapping didn't get submitted properly. I would suggest to double check that its json structure is correct and that you can actually get it back using the get mapping api.
As this is a common problem the defaults are probably going to change in the future to be less intrusive and avoid applying any language dependent text analysis.

Resources