Searching documents indexed via ingest-attachment in Elasticsearch on the basis of attributes.
I'm new to Elasticsearch and want to index files along with attributes such as Author, Title, Subject, Category, Community, etc.
How far I've got:
I was able to create an attachment pipeline and ingest different docs with attributes into Elasticsearch. Here is what I did:
1) Created the pipeline with the following request:
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
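For reference, a request body like this is normally sent with a PUT to the ingest pipeline API; the pipeline name attachment below is an assumption, any name works:

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
```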
2) Uploaded an attachment with the following request body:
{
  "filename": "Presentations-Tips.ppt",
  "Author": "Jaspreet",
  "Category": "uploading ppt",
  "Subject": "testing ppt",
  "fileUrl": "",
  "attributes": {
    "attr11": "attr11value",
    "attr22": "attr22value",
    "attr33": "attr33value"
  },
  "data": "here_base64_string_of_file"
}
3) Then I was able to search freely across all the above attributes, as well as the file content:
{
  "query": {
    "query_string": {
      "query": "*test*"
    }
  }
}
What I want now:
I want to narrow the search down with filters, for example:
search on specific parameters, e.g. find all documents whose Author is "Rohan";
find all documents whose Author is "Rohan" and whose Category is "Education";
find all documents whose Author contains the letters "han" and whose Category contains the letters "Tech";
find all documents whose Author is "Rohan" and then run a full-text search for "progress" across all fields of that result set, i.e. first narrow down by author, then full-text search within those results.
Please help me with the proper query syntax and the call URL; for the full-text search above I used GET /my_index/_search.
After spending some time, I finally got the answer:
curl -X POST \
  http://localhost:9200/my_index/_search \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "progress"
          }
        },
        {
          "wildcard": {
            "Author": "Rohan"
          }
        },
        {
          "wildcard": {
            "Title": "q*"
          }
        }
      ]
    }
  }
}'
In the above, you can remove or add any object in the must array as needed.
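For the partial-match cases from the question (Author containing "han", Category containing "Tech"), the wildcard values need leading and trailing *. A sketch, with the caveat that wildcard queries run against the indexed terms, so case and analysis of the fields matter:

```json
POST /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "Author": "*han*" } },
        { "wildcard": { "Category": "*Tech*" } }
      ]
    }
  }
}
```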
I indexed several documents into my Elasticsearch cluster and queried it with some keywords and sentences; the query output displayed the entire documents in which the sentences or keywords were found.
I want the query to display just the paragraph where the sentence or keyword is found, and also to show the page number it was found on.
You can use the highlighting functionality together with source filtering, so that only the required field is shown and the remaining fields are hidden.
You can set _source to false so that only the highlighted field is returned. If you want to search on one field and highlight on another, set require_field_match to false. Please refer to the Elasticsearch docs for more reference.
GET /_search
{
  "_source": false,
  "query": {
    "match": { "content": "kimchy" }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "content": {}
    }
  }
}
I stored a document in Elasticsearch, one of whose fields is a regexp expression, and I would like to use a normal string to hit the document. Is that possible?
For example, in Elasticsearch there is a document like:
{
  "service_id": "service1",
  "regexp_url": "abc.*"
}
I then want to use the string "abcde" to hit that document. Is it possible?
The detailed use case is:
I have a website, and each page corresponds to a service; different pages can also correspond to the same service.
I have an ES index containing two fields: service_id and regexp_url (meaning that a page whose URL matches the URL regexp is mapped to that service). Now I open a page with URL "abcde", and given the document above, I want to hit that document to find the serviceId.
POST <INDEX>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": {
            "<YOUR_FIELD>": {
              "value": "<YOUR_REGEX_PATTERN>"
            }
          }
        }
      ]
    }
  }
}
More reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
I have added a document like this to my index
POST /analyzer3/books
{
"title": "The other day I went with my mom to the pool and had a lot of fun"
}
And then I do queries like this
GET /analyzer3/_analyze
{
  "analyzer": "english",
  "text": "\"The * day I went with my * to the\""
}
And it successfully returns the previously added document.
My idea is to have quotes so that the query becomes exact, but also wildcards that can replace any word. Google has this exact functionality, where you can search queries like this, for instance "I'm * the university" and it will return page results that contain texts like I'm studying in the university right now, etc.
However I want to know if there's another way to do this.
My main concern is that this doesn't seem to work with other languages like Japanese and Chinese. I've tried with many analyzers and tokenizers to no avail.
Any answer is appreciated.
Exact matches on tokenized fields are not that straightforward. Better to save your field as keyword if you have such requirements.
Additionally, the keyword data type supports the wildcard query, which can help with your wildcard searches.
So just create a keyword-type sub-field, then use the wildcard query on it.
Your search query will look something like below:
GET /_search
{
  "query": {
    "wildcard": {
      "title.keyword": "The * day I went with my * to the"
    }
  }
}
In the above query, it is assumed that the title field has a sub-field named keyword of data type keyword.
More on wildcard query can be found here.
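For reference, a mapping with such a keyword sub-field could look like this (index and field names are illustrative; older versions additionally need a type name inside mappings):

```json
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```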
If you still want to do exact searches on text data type, then read this
Elasticsearch doesn't have Google-like search out of the box, but you can build something similar.
Let's assume that when someone quotes a search text, what they want is a match phrase query: basically remove the \" and search for the remaining string as a phrase.
PUT test/_doc/1
{
"title": "The other day I went with my mom to the pool and had a lot of fun"
}
GET test/_search
{
"query": {
"match_phrase": {
"title": "The other day I went with my mom to the pool and had a lot of fun"
}
}
}
For the * it's getting a little more interesting. You could just make multiple phrase searches out of this and combine them. Example:
GET test/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"title": "The"
}
},
{
"match_phrase": {
"title": "day I went with my"
}
},
{
"match_phrase": {
"title": "to the"
}
}
]
}
}
}
Or you could use slop in the phrase search. All the terms in your search query have to be present (unless they are removed by the tokenizer or as stop words), but the matched phrase can contain additional words. Here each * can be replaced by one other word, so a slop of 2 in total. If you want more than one word in place of each *, you will need a higher slop:
GET test/_search
{
"query": {
"match_phrase": {
"title": {
"query": "The * day I went with my * to the",
"slop": 2
}
}
}
}
Another alternative might be shingles, but this is a more advanced concept and I would start off with the basics for now.
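Just as a pointer, a shingle-based index could be set up roughly like this (index and analyzer names are made up; the default shingle filter emits two-word shingles alongside the single tokens):

```json
PUT /test_shingles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "shingle_analyzer"
      }
    }
  }
}
```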
Is it possible to search within the results that I get from Elasticsearch?
To achieve that, I currently need to run and wait for two searches on Elasticsearch. The first search is
{ "match": { "title": "foo" } }
It takes 5 seconds and returns 500 docs, etc. Then a second search:
{
  "bool": {
    "must": [
      { "match": { "title": "foo" } },
      { "match": { "title": "bar" } }
    ]
  }
}
It takes another 5 seconds and returns 200 docs, and from Elasticsearch's perspective it has nothing to do with the first search.
Instead of doing it this way, I'd like to offer a "search further within the results" option to my users, so they can add more keywords and search within the result set returned by the first search.
So my scenario is: a user makes a first search with the keyword "foo" and gets 500 results on the webpage, then selects "search further within the results" to make a second search within those 500 results, hopefully getting refined results really quickly.
How can I achieve this? Thanks!
What you could do is use the ids query. Collect all document IDs from the first request, and then post them in a new bool query that includes an ids query in a must clause next to the original query. You can efficiently collect the IDs in the first request using the scroll API. Since the second request will return sorted results anyway, it makes no sense to sort in the first request, which speeds it up.
See:
Scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
IDS Query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
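Putting the two together, the second request could look roughly like this (the ids values are placeholders for the IDs collected from the first, scrolled request):

```json
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "ids": { "values": ["id1", "id2", "id3"] } },
        { "match": { "title": "bar" } }
      ]
    }
  }
}
```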
A post filter is a way to search within another search.
In your case:
GET _search
{
"query": {
"match": {
"title": "foo"
}
},
"post_filter": {
"match": {
"title": "bar"
}
}
}
post_filter will be executed on the query result.
GET candidates1/candidate/_search
{
  "fields": ["contactInfo.emails.main"],
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "contactInfo.emails.main": "com"
          }
        }
      ]
    }
  }
}

GET candidates1/candidate/_search
{
  "size": 5,
  "fields": ["contactInfo.emails.main"],
  "query": {
    "match": {
      "contactInfo.emails.main": "com"
    }
  }
}
Hi,
When I use the above query I get results like ['nraheem#dbtech1.com', 'arelysf456#gmai1.com', 'ron#rgb52.com'], but I do not get emails like ['pavann.aryasomayajulu#gmail.com', 'kumar#gmail.com', 'raj#yahoo.com'].
But when I query for "gmail.com", I do get the results that contain gmail.com.
So my question is: when I use "com" in the first query, I expect results that include gmail.com, since "com" is present in gmail.com. But that is not happening.
Note: we have almost 2 million email IDs, and most of them are gmail.com, yahoo.com or hotmail; only a few are of other types.
"contactInfo.emails.main" fields seem to be an analyzed field.
In elasticsearch all string fields are analyed using Standard Analyzer and are converted into tokens.You can see how your text is getting analyzed using analyze api. Email Ids mentioned by you ending in number before com are getting analyzed as nraheem , dbtech1 , com. Use following query to see the tokens.
curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer": "standard",
  "text": "nraheem#dbtech1.com"
}'
As you can see, a separate term com is created. If you analyze kumar#gmail.com instead, you get tokens like kumar and gmail.com; no separate com token is created in this case.
This is because the Standard Analyzer splits terms when it encounters certain special characters like # or ?, or numbers. You can create a custom analyzer to meet your requirement.
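As one illustration, a custom analyzer built on a pattern tokenizer that splits on every non-alphanumeric run would break kumar#gmail.com into kumar, gmail, and com, so a search for com would then match those emails too; the index and analyzer names below are made up:

```json
PUT /emails_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "email_parts_tokenizer": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "analyzer": {
        "email_parts": {
          "tokenizer": "email_parts_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```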
Hope this helps!!