elasticsearch regex query to include forward slash

I have the following records in Elasticsearch:
{
"label": "/home and garden/home furnishings",
"score": "0.731174"
},
{
"label": "/travel/vacation rentals",
"score": "0.601932"
},
{
"label": "/travel/vacation rentals",
"score": "0.657443"
},
{
"label": "/home and garden/gardening and landscaping/yard and patio",
"score": "0.707792"
}
Now I want to make a query to get all taxonomy labels that start with "/travel", and I want the data only up to the third forward slash.
For example, if we take
"label": "/home and garden/gardening and landscaping/yard and patio"
then I want the data only up to
/home and garden/gardening and landscaping/
I tried some queries for a partial match, like:
{
"_source":["taxonomy"],
"from" : 0,
"size" : 100,
"query": {
"regexp":{
"taxonomy.label":{
"value":"/travel.*",
"boost":1.2
}
}
}
}
But it does not seem to accept the forward slashes: as soon as I add slashes, it stops returning any results. I want to know whether this is possible or not, and if it is, how do I proceed with this query?
Any help is appreciated.

You have two problems. The first is querying documents that start with "/travel". For that you can try the path_hierarchy tokenizer from Elasticsearch. You can read more here. This will allow you to store your path '/travel/vacation rentals' in the form of
/travel
/travel/vacation rentals
and after that you can run a direct match query against that field.
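A minimal sketch of such a mapping (assuming Elasticsearch 7.x syntax, an index called my-index and the taxonomy.label field from your query; adjust the names to your setup) could look like this:
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_tokenizer"
        }
      },
      "tokenizer": {
        "path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "taxonomy": {
        "properties": {
          "label": {
            "type": "text",
            "analyzer": "path_analyzer",
            "search_analyzer": "keyword"
          }
        }
      }
    }
  }
}
With that mapping, "/travel/vacation rentals" is indexed as the tokens /travel and /travel/vacation rentals, so a plain match query for "/travel" (analyzed as a single keyword at search time) returns every document whose label starts with /travel:
GET my-index/_search
{
  "query": {
    "match": {
      "taxonomy.label": "/travel"
    }
  }
}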
For the second part you can try scripts, but be careful with them, as enabling inline scripts can lead to unauthorized access if the cluster is exposed to the outside world.
You can stay on the safe side by using scripts kept in the dedicated scripts folder. You can read more on scripts and how to make them secure here.
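For illustration only, here is a script_fields sketch that trims each returned label to the third slash; it assumes Painless, the my-index/taxonomy.label names from the sketch above, and access to _source in script fields (which is slow on large result sets), so treat it as a starting point rather than a drop-in solution:
GET my-index/_search
{
  "_source": false,
  "query": {
    "match": {
      "taxonomy.label": "/travel"
    }
  },
  "script_fields": {
    "label_prefix": {
      "script": {
        "lang": "painless",
        "source": "String l = params['_source']['taxonomy']['label']; int third = l.indexOf(\"/\", l.indexOf(\"/\", l.indexOf(\"/\") + 1) + 1); return third > -1 ? l.substring(0, third + 1) : l;"
      }
    }
  }
}
For "/home and garden/gardening and landscaping/yard and patio" this returns "/home and garden/gardening and landscaping/", and labels with fewer than three slashes come back unchanged.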

Related

Elastic Search Rollup Jobs

Can I filter the documents in Elasticsearch before rolling them up, or can I define a filter query in a rollup job? If yes, how?
There's no way to filter data before rolling it up into a new rolled up index. However, you can achieve what you want by first defining a filtered alias and then rolling up on that alias.
Say, you want to roll up index test but only for customers 1, 2 and 3. You can create the following filtered alias:
POST /_aliases
{
"actions": [
{
"add": {
"index": "test",
"alias": "filtered-test",
"filter": { "terms": { "customer.id": [1, 2, 3] } }
}
}
]
}
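As a quick sanity check, you can verify that the alias only exposes documents from those customers before creating the rollup job:
GET filtered-test/_count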
And then you can roll up on the filtered-test alias instead of the test index and that will only roll up data from customers 1, 2 and 3:
PUT _rollup/job/sensor
{
"index_pattern": "filtered-test",
"rollup_index": "customer_rollup",
...
}
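For reference, a complete job body might look roughly like the sketch below; the cron schedule, page_size and the timestamp, customer.id and temperature fields are illustrative assumptions, and older versions use interval instead of fixed_interval in the date_histogram group, so check the rollup docs for your version:
PUT _rollup/job/sensor
{
  "index_pattern": "filtered-test",
  "rollup_index": "customer_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "fixed_interval": "60m"
    },
    "terms": {
      "fields": ["customer.id"]
    }
  },
  "metrics": [
    {
      "field": "temperature",
      "metrics": ["min", "max", "avg"]
    }
  ]
}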
PS: It is worth noting that you're not alone, but the Elastic folks specifically decided not to allow filtering in rollups for various reasons (you can read more in the issue I linked to). The issue has been reopened because there's a big refactor of the rollup feature going on. Stay tuned...

index letter by letter with elasticsearch in rails app

Does anyone know the best way to index data letter by letter in Elasticsearch? I have a Rails app that uses Elasticsearch as its search engine. The app contains a lot of content with articles, and in my implementation I can search through articles and return results perfectly. I have multiple filters such as ngram, edge gram, whitespace and so on. If the user types the exact full word, everything works fine. At this point I want to help the user: if he/she types a single letter, he/she should already get some results. I can handle this with a where query, but I want to do it with Elasticsearch.
I'm looking through Elasticsearch tutorials and best practices, but none of them help me; at best, Elasticsearch returns results only once the user has typed at least three or four letters.
I'm using an ngram filter, an edge gram filter, regex and so on, but none of them were useful.
You need to use a suggest query in Elasticsearch, as below:
{
"suggest": {
"auto-complete-suggest": {
"prefix": "Your prefix text",
"completion": {
"size" : 5,
"field": "text_completion"
}
}
}
}
and your mapping would be something like this:
{
"mappings": {
"properties": {
"text_completion": {
"type": "completion",
"analyzer": "Your analyzer"
}
}
}
}
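For illustration (the index name, document content and analyzer are assumptions), once that mapping is in place you index documents into the completion field and even a single-letter prefix returns suggestions:
PUT articles/_doc/1
{
  "text_completion": "elasticsearch autocomplete in rails"
}

POST articles/_search
{
  "suggest": {
    "auto-complete-suggest": {
      "prefix": "e",
      "completion": {
        "size": 5,
        "field": "text_completion"
      }
    }
  }
}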

Return position and highlighting of search queries in Elasticsearch

I am using the official Elasticsearch-PHP client installed on a personal Debian server, and what I am trying to do involves indexing, searching and highlighting individual documents. i.e. each search result will only return one document - which will then be highlighted for "simple query string" searches. I am also using FVH (fast vector highlighting).
My question is similar to this one Position as result, instead of highlighting and the test code is basically the same so I won't repeat that here. However in my case I need both position and highlighting. I followed the link to the documentation about term vectors, but just like the other OP, my searches are not exact words per se. In some cases they are phrases. How would I approach this?
My use case is to search only one document (for each query), and present a summary of results with links which the user can click to go to the specific place in the document where that result came from. If I have the index / position I can simply use that against the full source of the document. I have checked the documentation to no avail.
You could try installing a specific plugin developed by the Wikimedia Foundation called Experimental Highlighter (GitHub here).
You can install it for Elasticsearch 7.5 like this; for other Elasticsearch versions please refer to the GitHub project page:
./bin/elasticsearch-plugin install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.5.1
Then restart Elasticsearch.
Since you need to retrieve the positions as well (if the offsets can replace the positions for your use case, skip to the next paragraph), you should declare your field with term vectors using the index option "with_positions_offsets_payloads" (doc here):
PUT /my-index-000001
{ "mappings": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "fulltext_analyzer"
}
}
}
}
For cases that don't need the positions, it is faster and uses much less space to use the index option "offsets" (Elastic doc here, plugin doc here):
PUT /my-index-000001
{ "mappings": {
"properties": {
"text": {
"type": "text",
"index_options": "offsets",
"analyzer" : "fulltext_analyzer"
}
}
}
}
Then you can query with the experimental highlighter and return only the offsets of the highlighted parts:
{
"query": {
"match": {
"text": "hello world"
}
},
"highlight": {
"order": "score",
"fields": {
"text": {
"number_of_fragments": 10,
"fragment_size": 15,
"type": "experimental",
"options": {"return_offsets": true}
}
}
}
}
In this way, no text is returned from your query, only the start offset and the end offset, i.e. numbers that represent positions. To retrieve your highlighted content you need to go into ['hits']['hits'][0]['_source']['text'] (text is your field name) and extract the text from the field using the start offset and the end offset. You need to make sure you use the correct string encoding (UTF-8), otherwise the offsets won't match the text. According to the doc:
The return_offsets option changes the results from a highlighted
string to the offsets in the highlighted string that would have been
highlighted. This is useful if you need to do client side sanity
checking on the highlighting. Instead of a marked up snippet you'll
get a result like 0:0-5,18-22:22. The outer numbers are the start and
end offset of the snippet. The pairs of numbers separated by the ,s
are the hits. The number before the - is the start offset and the
number after the - is the end offset. Multi-valued fields have a
single character worth of offset between them.
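For illustration only (this fragment is an assumption based on the doc snippet above, not verified plugin output), the highlight section of a hit would then contain such offset strings instead of marked-up snippets:
"highlight": {
  "text": [
    "0:0-5,18-22:22"
  ]
}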
Let me know if that plugin could help!

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster setup with EFK stack for logging, as described here. I have never worked with one of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entry that contains it and not just the specific value. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
}
]
}
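To check that the pattern extracts what you expect before touching Fluentd, you can run the pipeline through the simulate API with the sample message from your question:
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}
The simulated document in the response should contain the extracted plz field.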
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
"index_patterns": ["project.foo*"],
"settings": {
"index.default_pipeline": "parse-plz"
}
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
},
{
"script": {
"lang": "painless",
"source": "ctx.location = params[ctx.plz];",
"params": {
"12345": {"lat": 42.36, "lon": 7.33}
}
}
}
]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345",
"location": {
"lat": 42.36,
"lon": 7.33
}
}
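One assumption worth spelling out: for a Kibana map visualization the location field generally has to be mapped as geo_point, so the index template from above would also carry a mapping along these lines (a sketch, adjust to your template and Elasticsearch version):
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  },
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}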
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

How to get name/confidence individually from classify_text?

Most of the other methods in the Language API, such as analyze_syntax, analyze_sentiment, etc., can return their constituent elements, like
sentiment.score
sentiment.magnitude
token.part_of_speech.tag
and so on, but I have not found a way to return name and confidence in isolation from classify_text. It doesn't look like it's possible, but that seems weird. Am I missing something? Thanks
The language.documents.classifyText method returns ClassificationCategory objects, which contain name and confidence. If you only want one of the fields you can filter by categories/name or categories/confidence. As an example I executed:
POST https://language.googleapis.com/v1/documents:classifyText?fields=categories%2Fname&key={YOUR_API_KEY}
{
"document": {
"content": "this is a test for a StackOverflow question. I get an error because I need more words in the document and I don't know what else to say",
"type": "PLAIN_TEXT"
}
}
Which returns:
{
"categories": [
{
"name": "/Science/Computer Science"
},
{
"name": "/Computers & Electronics/Programming"
},
{
"name": "/Jobs & Education"
}
]
}
Direct link to API explorer for interactive testing of my example (change content, filters, etc.)
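Along the same lines, if you only want the confidence values you just swap the filter in the fields parameter (same request body as above; response omitted here):
POST https://language.googleapis.com/v1/documents:classifyText?fields=categories%2Fconfidence&key={YOUR_API_KEY}
{
  "document": {
    "content": "this is a test for a StackOverflow question. I get an error because I need more words in the document and I don't know what else to say",
    "type": "PLAIN_TEXT"
  }
}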
