display limited words from ES field - elasticsearch

I am using Mpdreamz/NEST to integrate Elasticsearch in C#. Is there any way to limit the number of words in the result string of a query?
For example, I have a field named 'Content' in ES and I need to display 30 words of 'Content' matching 'sensex' from my index.
Thanks in advance for any help.

You can't do this easily, even within Elasticsearch itself.
You have three options:
Force excerpts by using highlighting
Use script_fields to return the first 30 words
At index time, add another field that contains just the first 30 words
Even though the first two are possible with NEST, I would go for the third option, since it won't incur a performance penalty at query time.
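For illustration, a minimal sketch of option 1 using the raw query DSL (the index and field names are assumptions; note that fragment_size limits by characters rather than words, so it only approximates a 30-word excerpt):
GET my-index/_search
{
  "query": {
    "match": { "content": "sensex" }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 200,
        "number_of_fragments": 1
      }
    }
  }
}
NEST can express the same highlighting options through its fluent query syntax.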

Related

Keyword search using Elasticsearch

I am new to Elasticsearch and I am trying to implement text search functionality using it. I have over 100 documents, and every document has lines starting with timestamp notations.
Eg.
00:00:00 - 00:01:00 This is the first line
00:01:01 - 00:02:30 This is the second line
00:02:30 - 00:03:45 This is the third line
00:03:46 - 00:05:00 This is the fourth line
00:05:01 - 00:06:00 This is fifth line
...
And so on.
I am splitting each of these lines into different paragraphs and performing a text search over the documents.
Now, I want to search by keyword, where one or more keywords would be defined for, say, the lines between timestamps 00:00:00 - 00:05:00. Based on that keyword search, the entire data from 00:00:00 - 00:05:00 should be returned, i.e. all the lines between these timestamps.
Can you please help me understand how to achieve this functionality using Elasticsearch?
Thanks in advance!
As far as I understand, here is my opinion:
It is better to create one more field (of a date or timestamp type) in your schema and perform a range query on that field, because it will be used very frequently and your data is stored in a time-series manner.
[Not recommended] If your field type is "keyword" and you have stored the whole string there, then you need to use a wildcard query like '*yourstring*'. But this will return partial data only, and of course it is costly and slow, much like a LIKE query in SQL.
[Not recommended] If your field type is "text", then you need to check whether date-time terms were created or not. This also returns partial data only.
It is best to design your schema according to your search query. The first option will be better for you and will help you scale in the future.
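A minimal sketch of that first option, assuming each line is indexed as its own document with a start_time field of date type and a line text field (both names are assumptions):
GET my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "line": "your keywords" } },
        {
          "range": {
            "start_time": {
              "gte": "00:00:00",
              "lte": "00:05:00",
              "format": "HH:mm:ss"
            }
          }
        }
      ]
    }
  }
}
The bool query combines the keyword match with the time range, so only lines inside the window come back.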

Elasticsearch query on string representation of number

Good day:
I have an indexed field called amount, which is of string type. The value of amount can be either one or 1. Say we have amount=1 in an indexed document; if I search for one, Elasticsearch will not return the document unless I search for 1. Thoughts on how I can get this to work? I'm thinking a tokenizer is what's needed.
Thanks.
You probably don't want this for sevenmillionfourhundredfifteenthousandtwohundredfourteen and the like, but only for a small number of values.
At index time I would convert everything to a proper number and store it in a numeric field, which even allows sorting, if you need it. Apart from this, I would use synonyms at index and query time to map the number words to digit strings, in a general text field that is searched by default.
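A minimal sketch of that synonym approach (index and field names are assumptions, mapping syntax as in recent Elasticsearch versions; because the same analyzer runs at index and query time, both one and 1 end up as the token 1):
PUT amounts
{
  "settings": {
    "analysis": {
      "filter": {
        "number_words": {
          "type": "synonym",
          "synonyms": ["one => 1", "two => 2", "three => 3"]
        }
      },
      "analyzer": {
        "amount_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "number_words"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "amount": {
        "type": "text",
        "analyzer": "amount_analyzer"
      }
    }
  }
}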

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 words that come up most often.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full text field with a keyword datatype. That will enable you to run a terms aggregation on that field - doc here. You could also consider a significant terms aggregation - doc here - to avoid the presence of stopwords and common words. In ES 6.x you could also use the significant text aggregation - doc here - without creating the keyword field, but I have never tried it, so I don't know how it works. If instead you need to retrieve the frequency of the words for each document, you should use term vectors - doc here.
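A minimal sketch of the terms aggregation (index and field names are assumptions; a terms aggregation on a keyword field counts whole field values, so to count individual words the aggregated field must be tokenized, e.g. a text field with fielddata enabled):
GET articles/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "content",
        "size": 10
      }
    }
  }
}
In Kibana, the same aggregation can back a tag cloud or a simple bar chart visualization.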

analyzed field vs doc_values: true field

We have an Elasticsearch cluster that contains over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: an analysed field (option 1), or a list field of variable length with doc_values on (option 2)? OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but cannot index with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you currently don't search this field.
So I guess the search field does not have to have the same name: you can copy the field value into another field that is used only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case, as sketched below.
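A minimal sketch of such a search field, splitting on slashes with the pattern tokenizer so that user#site survives as one term (index and field names are assumptions, mapping syntax as in recent versions):
PUT urls
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_slash": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "analyzer": {
        "url_parts": {
          "type": "custom",
          "tokenizer": "url_slash",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url_search": {
        "type": "text",
        "analyzer": "url_parts"
      }
    }
  }
}
Single segments like part1 then match directly, and a match_phrase query covers multi-segment searches like part2/part3.ext, since its tokens are adjacent.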
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was set up as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: the analysed field showed a spike - although only a small one - on all Elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you no longer need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate to it.

elasticsearch: extract number from a field

I'm using elasticsearch and kibana for storing my logs.
Now what I want is to extract a number from a field and store it in a new field.
So for instance, having this:
accountExist execution time: 1046 ms
I would like to extract the number (1046) and see it in a new field in kibana.
Is it possible? How?
Thanks for the help
You'll need to do this before/during indexing.
Within Elasticsearch, you can get what you need during indexing:
Define a new analyzer using the Pattern Analyzer to wrap a regular expression (for your purposes, to capture consecutive digits in the string - good answer on this topic).
Create your new numeric field in the mapping to hold the extracted times.
Use copy_to to copy the log message from the input field to the new numeric field from (2) where the new analyzer will parse it.
The Analyze API can be helpful for testing purposes.
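For example, the pattern tokenizer can be tried out directly with the Analyze API (the pattern here is an assumption based on the example log line; group: 1 emits only the capture group as a token):
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "([0-9]+) ms",
    "group": 1
  },
  "text": "accountExist execution time: 1046 ms"
}
This should return a single token, 1046.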
While not performant, if you must avoid reindexing, you could use scripted fields in kibana.
Introduction here: https://www.elastic.co/blog/using-painless-kibana-scripted-fields
enable painless regex support by putting the following in your elasticsearch.yml:
script.painless.regex.enabled: true
restart elasticsearch
create a new scripted field in Kibana through Management -> Index Patterns -> Scripted Fields
select painless as the language and number as the type
create the actual script, for example:
def logMsg = params['_source']['log_message'];
// bail out with a sentinel value when the field is missing
if (logMsg == null) {
    return -10000;
}
// capture the milliseconds value from the log message
def m = /.*accountExist execution time: ([0-9]+) ms.*$/.matcher(logMsg);
if (m.matches()) {
    return Integer.parseInt(m.group(1));
} else {
    return -10000;
}
you must reload the page completely for the new fields to be executed; simply re-running a search on an open Discover tab will not pick up the new fields. (This almost made me quit trying to get this working -.-)
use the script in discover or visualizations
While I do understand that it is not performant to script fields for millions of log entries, my use case is a very specific log entry that is logged about 10 times a day in total, and I only use the resulting fields to create a visualization, or in analyses where I reduce the candidates through regular queries in advance.
It would be interesting to know whether it is possible to have those fields calculated only in situations where you need them (or where they make sense and are computable to begin with, i.e. to make the "return -10000" unnecessary). Currently they are applied to, and show up for, every log entry.
You can generate scripted fields inside of queries like this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html but that seems a bit too buried under the hood to maintain easily :/
