Elasticsearch Query "must match" in log - elasticsearch

I have the following lines in my log that I would like to find with an Elasticsearch query:
2014-07-02 20:52:39 INFO home.helloworld: LOGGER/LOG:ID1234 has successfully been received, {"uuid"="abc123"}
2014-07-02 20:52:39 INFO home.helloworld: LOGGER/LOG:ID1234 has successfully been transferred, {"uuid"="abc123"}
2014-07-02 20:52:39 INFO home.byebyeworld: LOGGER/LOG:ID1234 has successfully been processed, {"uuid"="abc123"}
2014-07-02 20:52:39 INFO home.byebyeworld: LOGGER/LOG:ID1234 has exited, {"uuid"="abc123"}
2014-07-02 20:53:00 INFO home.helloworld: LOGGER/LOG:ID1234 has successfully been received, {"uuid"="def123"}
2014-07-02 20:53:00 INFO home.helloworld: LOGGER/LOG:ID1234 has successfully been transferred, {"uuid"="def123"}
2014-07-02 20:53:00 INFO home.byebyeworld: LOGGER/LOG:ID1234 has successfully been processed, {"uuid"="def123"}
2014-07-02 20:53:00 INFO home.byebyeworld: LOGGER/LOG:ID1234 has exited, {"uuid"="def123"}
Since each of the lines above is indexed as a single "message" in Elasticsearch, I have a hard time querying it with POST REST calls. I tried a "must match" query like the one below to get only line 1 of my log, but the results are not consistent; sometimes it returns multiple hits instead of just one:
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "match_phrase_prefix" : {"message" : "home.helloworld:"}},
            { "match_phrase_prefix" : {"message" : "LOGGER/LOG:ID1234"}},
            { "match" : {"message" : "received, {\"uuid\"=\"abc123\"}"}}
          ]
        }
      }
    }
  }
}
Am I doing anything wrong in the query above? I thought "must" is equivalent to AND, "match" is more like CONTAINS, and "match_phrase_prefix" is like STARTSWITH. Can someone please show me how to properly query a log full of lines like the above with different uuid values and return only the single hit? Originally I thought I had the query down: at first it returned just 1 hit, but then it returned 2, and then a lot more, which to me is not consistent. Thank you in advance!

The problem is with the third clause of your bool query. Let me give you a couple of queries that will work for you, and I'll explain why they do the job.
First Query
curl -XGET http://localhost:9200/my_logs/_search -d '
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "match_phrase_prefix" : {"message" : "home.helloworld:"}},
            { "match_phrase_prefix" : {"message" : "LOGGER/LOG:ID1234"}},
            { "match" : {
                "message" : {
                  "query": "received, {\"uuid\"=\"abc123\"",
                  "operator": "and"
                }
              }
            }
          ]
        }
      }
    }
  }
}'
Explanation
Let's make sure we are on the same page about indexing. By default, the indexer passes your data through the standard analysis chain: splitting on whitespace, stripping special characters, lower-casing, and so on. So the index just contains tokens along with their positions.
The match query, being a full-text query, takes your query text "received, {\"uuid\"=\"abc123\"" and passes it through the same analysis: split on whitespace, strip special characters, lower-case, etc. The result of this analysis would look similar to this (simplified): received, uuid, abc123.
The match query then combines those tokens against the message field using the default operator (which is or). So, as a logical expression, the last clause (the match query) looks like this: message:received OR message:uuid OR message:abc123.
This is why the first 4 log entries match. I was able to reproduce it.
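To see why OR semantics over analyzed tokens matches more lines than expected, here is a rough Python sketch. It is only an approximation of the standard analyzer and of match-query semantics, not Elasticsearch's actual implementation:

```python
import re

def analyze(text):
    """Very rough stand-in for the standard analyzer:
    lowercase, then split on any non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def match_or(message, query):
    """Default match semantics: OR across the query's analyzed tokens."""
    tokens = set(analyze(message))
    return any(q in tokens for q in analyze(query))

log = 'home.helloworld: LOGGER/LOG:ID1234 has successfully been transferred, {"uuid"="abc123"}'
query = 'received, {"uuid"="abc123"'

print(analyze(query))        # ['received', 'uuid', 'abc123']
print(match_or(log, query))  # True — 'uuid' and 'abc123' are present, even though 'received' is not
```

The "transferred" line matches even though it never contains "received", because two of the three OR'ed tokens are enough.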
Second Query
curl -XGET http://localhost:9200/my_logs/_search -d '
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "match_phrase_prefix" : {"message" : "home.helloworld:"}},
            { "match_phrase_prefix" : {"message" : "LOGGER/LOG:ID1234"}},
            { "match_phrase_prefix" : {"message" : "received, {\"uuid\"=\"abc123\""}}
          ]
        }
      }
    }
  }
}'
Explanation
This one is a little simpler. Remember: our indexing process left tokens and their positions in the index.
What the match phrase prefix query actually does: it takes the input query (let's use "received, {\"uuid\"=\"abc123\"" as an example) and runs exactly the same query-text analysis. Then it tries to find the tokens received, uuid, abc123 at neighboring positions in the index, in exactly that order: received -> uuid -> abc123 (almost).
The exception is the very last token, which in our case is abc123. To be precise, it turns the last token into a wildcard, i.e. received -> uuid -> abc123*.
To be a perfectionist, I'd add that received -> uuid -> abc123 (i.e. without the trailing wildcard) is what the match phrase query does. It also uses the positions in the index, i.e. it tries to match the 'phrase', not just separate tokens at arbitrary positions.
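The position-based matching described above can be sketched in Python. This is a simplified model of the analyzer and of phrase-prefix matching, not the real implementation:

```python
import re

def analyze(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def phrase_prefix_match(message, query):
    """Tokens must appear consecutively and in order;
    the last query token only needs to be a prefix."""
    msg, q = analyze(message), analyze(query)
    for i in range(len(msg) - len(q) + 1):
        window = msg[i:i + len(q)]
        if window[:-1] == q[:-1] and window[-1].startswith(q[-1]):
            return True
    return False

log = 'LOGGER/LOG:ID1234 has successfully been received, {"uuid"="abc123"}'
print(phrase_prefix_match(log, 'received, {"uuid"="abc12'))   # True  (abc12* matches abc123)
print(phrase_prefix_match(log, 'received, {"uuid"="def123'))  # False (last token differs)
```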

Related

How to limit elasticsearch to a list of documents each identified by a unique keyword

I have an elasticsearch document repository with ~15M documents.
Each document has a unique 11-character string field (it comes from a MongoDB) that identifies the document. This field is indexed as keyword.
I'm using C#.
When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).
My query text uses bool with must to supply a filter for the unique identifiers and additional clauses to actually search the documents. See example below.
To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on terms).
In this case, I have 262,144 documents to search (list comes, at run time, from a separate mongo DB query). So my code generates 4 query strings (see example below).
I run them concurrently.
Unfortunately, this search takes over 22 seconds to complete.
When I run the same search but drop the terms node (so it searches all the documents), a single such query completes the search in 1.8 seconds.
An incredible difference.
So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?
I want to be able to specify up to a few 100K of such unique ids.
Here's an example of my search specifying unique document identifiers:
{
  "_source" : "talentId",
  "from" : 0,
  "size" : 10000,
  "query" : {
    "bool" : {
      "must" : [
        {
          "bool" : {
            "must" : [ { "match_phrase" : { "freeText" : "java" } },
                       { "match_phrase" : { "freeText" : "unix" } },
                       { "match_phrase" : { "freeText" : "c#" } },
                       { "match_phrase" : { "freeText" : "cnn" } } ]
          }
        },
        {
          "bool" : {
            "filter" : {
              "bool" : {
                "should" : [
                  {
                    "terms" : {
                      "talentId" : [ "goGSXMWE1Qg", "GvTDYS6F1Qg",
                                     "-qa_N-aC1Qg", "iu299LCC1Qg",
                                     "0p7SpteI1Qg", ... 4,995 more ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
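The chunking described in the question (splitting the id list across several queries to stay under the terms limit) can be sketched like this. The 65,536 chunk size and field names follow the question; `build_queries` and the generated ids are hypothetical:

```python
def build_queries(ids, search_clauses, chunk_size=65536):
    """Split a large id list into chunks that fit the terms-query limit
    and build one bool query per chunk (sketch; field names are examples)."""
    queries = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        queries.append({
            "query": {
                "bool": {
                    "must": search_clauses,
                    "filter": {"terms": {"talentId": chunk}},
                }
            }
        })
    return queries

ids = [f"id{i:08d}" for i in range(262144)]       # placeholder ids
clauses = [{"match_phrase": {"freeText": "java"}}]
qs = build_queries(ids, clauses)
print(len(qs))  # 4
```

The four resulting query bodies would then be sent concurrently, as the question describes.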
#jarmod is right.
But if you don't want to completely redo your architecture, is there some other talent-related shared field you could query instead of thousands of talentIds? It could be one more simple match_phrase query.

Combining results of two queries

I'm using Kibana v6.1.1 and trying to put two different queries into one GET request in order to use the "must" or "should" terms more than once.
When I run this query under "Dev Tools" in the Kibana, it works.
When I try to apply this "double query" (without the GET line, of course) under "Discover" -> "Add a filter" -> "Edit filter" -> "Edit Query DSL", it doesn't accept the {} syntax for creating an 'OR' between the queries.
It is necessary that these two "must" terms stay separated but remain in the same filter.
GET _my_index/_search
{
  "query" : {
    "bool" : {
      "must" : [{
        ...
      }]
    }
  }
}
{}
{
  "query" : {
    "bool" : {
      "must" : [{
        ...
      }]
    }
  }
}
P.S.
Using the simple_query_string doesn't seem to solve the problem, and so far I couldn't find a way to combine these two queries.
I'm not sure what you actually want to achieve. Use the following if at least one of the should clauses has to match (there is an implicit minimum_should_match when there are no other conditions, but you can also set an explicit value for it):
{
  "query" : {
    "bool" : {
      "should" : [
        {
          ...
        },
        {
          ...
        }
      ]
    }
  }
}
If you want to run independent queries, use a multi search.
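A multi search sends newline-delimited header/body pairs to the `_msearch` endpoint. A minimal sketch of building such a payload (`msearch_body` is a hypothetical helper; the index name and field names are placeholders):

```python
import json

def msearch_body(queries, index="my_index"):
    """Build an NDJSON _msearch payload: one header line plus one query
    line per search, terminated by a trailing newline."""
    lines = []
    for q in queries:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(q))
    return "\n".join(lines) + "\n"

body = msearch_body([
    {"query": {"bool": {"must": [{"match": {"field_a": "foo"}}]}}},
    {"query": {"bool": {"must": [{"match": {"field_b": "bar"}}]}}},
])
print(body)
```

The response contains one result set per submitted query, so the two bool queries stay fully independent.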

escaping forward slashes in elasticsearch

I am doing a general search against Elasticsearch (1.7) and all is well, except my account numbers contain forward slashes. The account number field is not the id field and is "not_analyzed".
If I do a search on an account number, e.g. AC/1234/A01, then I get thousands of results, presumably because it is doing a regex search(?).
{
"query" : { "query_string" : {"query" : "AC/1234/A01"} }
}
I can get the result I want by doing an exact match search
{
"query" : { "query_string" : {"query" : "\"AC/1234/A01\""} }
}
This actually gives me the result I want and will probably fit the bill as a backup option (surrounding all 'single word' searches with quotes). However, I'm thinking that if they do a multiple-word search including the account number, I will be back to thousands of results; although I can't see the value of such a search, I would like to avoid it happening.
Essentially I have a java app querying elastic search and I would like to escape all forward slashes entered in the GUI.
My Googling has told me that
{
"query" : { "query_string" : {"query" : "AC\\/1234\\/A01"} }
}
ought to do this but it makes no difference, the query works but I still get thousands of results.
Could anyone point me in the right direction?
You should get what you want without escaping anything simply by specifying a keyword analyzer for the query string, like this:
{
  "query" : {
    "query_string" : {
      "query" : "AC\\/1234\\/A01",
      "analyzer": "keyword"     <---- add this line
    }
  }
}
If you don't do this, the standard analyzer is used (and will tokenize your query string) no matter what the type of your field is or whether it is not_analyzed.
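If you do want to escape on the client side (as the question's Java app intends), the idea is to backslash-escape query_string's reserved characters before building the query. A rough sketch in Python; the character list is an approximation, so check the reserved-characters list for your Elasticsearch version:

```python
import re

def escape_query_string(text):
    """Backslash-escape characters that have special meaning in the
    query_string syntax (approximate list, including the forward slash)."""
    return re.sub(r'([+\-=&|><!(){}\[\]^"~*?:\\/])', r'\\\1', text)

print(escape_query_string("AC/1234/A01"))  # AC\/1234\/A01
```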
Use this query as an example:
{
  "query": {
    "query_string": {
      "fields": [
        "account_number.keyword"
      ],
      "query": "AC\\/1234\\/A01",
      "analyzer": "keyword"
    }
  }
}
I use query_string because I want to give my users the possibility to do complex queries with OR and AND. Having the search break when a slash is used (e.g. when searching for a URL) is not helpful.
I worked around that issue by adding quotes when the search string contains a slash but no quotes:
// Quote the whole query if it contains a slash but is not already quoted
if (strpos($query, '/') !== false && strpos($query, '"') === false) {
    $query = '"' . $query . '"';
}

How to use Elasticsearch for advanced queries:

I'm using Elasticsearch. I'm already pretty deep into it, but I'm very confused about how to write advanced queries. There are queries, filters, etc., and I'm confused as to how to proceed.
I have a schema that looks like this:
photos: {
  people: [{person_id: 1, person_name: "john kealy"}],
  tags: [{tag_id: 1, tag_name: "other tag"}],
  by_line: "John D Kealy/My website.com",
  location: "Some Place Out West"
}
I need to be able to string together these queries dynamically, ALWAYS pulling in FULL MATCHES, e.g. I would like to search for
people.person_id: [1,2] (pulls in only photos with BOTH or more people)
tags.tag_id: [1,2,3] (pulls in only photos with all three or more tags)
by_line: "John D. Kealy/My Website.com" (the full name including the slash)
location: "some place out west"
I would like to write one query with all these items. I need to include the slash in "by_line"; I don't care about upper or lower case. I need the exact match "some place out west". What do I use here? Queries or filters / filtered?
General guidelines for bool filters/queries can be found here.
If you are constructing an "exact match" query, you can often use the term filter (or query).
If you are constructing a search that requires solid performance, a filtered query is often advisable, as filters run before the query, often improving performance.
As for your specific example, the filters below should work; wrap them around a match_all query or anything else you need (this is for a non-analyzed by_line field; the analyzed variant uses a query). This should give you an idea of how to construct future queries.
NOTE: This assumes that your by_line field is not analyzed. The double backslash escapes your slash delimiter; if you are using an analyzed field, you must use a match query.
Without analyzer on by_line
{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "terms" : {"people.person_id" : ["1", "2"]}},
            { "terms" : {"tags.tag_id" : ["1", "2", "3"]}},
            { "term" : {"by_line" : "John D. Kealy\\/My Website.com"}},
            { "term" : {"location" : "some place out west"}}
          ]
        }
      }
    }
  }
}
I will keep the above there for future readers; however, I see in your post history that you are using the standard analyzer, so your query should be structured as follows.
With analyzer on by_line
{
  "query" : {
    "filtered" : {
      "query": {
        "match": {
          "by_line": "John Kealy/BFA.com"
        }
      },
      "filter" : {
        "bool" : {
          "must" : [
            { "terms" : {"people.person_id" : ["1", "2"]}},
            { "terms" : {"tags.tag_id" : ["1", "2", "3"]}},
            { "term" : {"location" : "some place out west"}}
          ]
        }
      }
    }
  }
}

Elasticsearch bool search matching incorrectly

So I have an object with an Id field which is populated by a Guid. I'm doing an Elasticsearch query with a "must" clause to match a specific Id in that field. The issue is that Elasticsearch is returning a result which does not exactly match the Guid I'm providing. I have noticed that the Guid I'm providing and one of the results Elasticsearch returns share the same digits in one particular part of the Guid.
Here is my query source (I'm using the Elasticsearch head console):
{
  "query": {
    "bool": {
      "must": [
        {
          "text": {
            "couchbaseDocument.doc.Id": "5cd1cde9-1adc-4886-a463-7c8fa7966f26"
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "facets": {}
}
And it is returning two results. One with ID of
5cd1cde9-1adc-4886-a463-7c8fa7966f26
and the other with ID of
34de3d35-5a27-4886-95e8-a2d6dcf253c2
As you can see, they both share the same middle group, "-4886-". However, I would expect this query to return a record only on an exact match, not a partial match. What am I doing wrong here?
The query is (probably) correct.
What you're almost certainly seeing is the work of the Standard Analyzer, which is used by default at index time. This analyzer tokenizes the input (splits it into terms) on hyphens ('-'), among other characters. That's why a match is found.
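A quick way to convince yourself: simulate the hyphen splitting and intersect the two GUIDs' terms (a simplified model of the analyzer, not the real one):

```python
def analyze(guid):
    # Sketch of what the standard analyzer does to a GUID at index time:
    # splitting on the hyphens produces separate terms
    return guid.lower().split("-")

a = analyze("5cd1cde9-1adc-4886-a463-7c8fa7966f26")
b = analyze("34de3d35-5a27-4886-95e8-a2d6dcf253c2")
print(set(a) & set(b))  # {'4886'} — the shared term that makes the second doc match
```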
To remedy this, you want to set your couchbaseDocument.doc.Id field to not_analyzed
See: How to not-analyze in ElasticSearch? and the links from there into the official docs.
The mapping would be something like:
{
  "yourType" : {
    "properties" : {
      "couchbaseDocument.doc.Id" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}