_search CJK term in elasticsearch - elasticsearch

Does Elasticsearch not support querying CJK characters in the URL?
I need to query the term 北京 (Beijing in Chinese) on the field name in the index old_merge_result. The following query does not seem to work:
GET /old_merge_result/tempid/_search?q=name:北京
ES would return:
{
"statusCode": 400,
"error": "Bad Request",
"message": "child \"uri\" fails because [\"uri\" must be a valid uri]",
"validation": {
"source": "query",
"keys": [
"uri"
]
}
}
Instead, the following query returns exactly what I want:
GET /old_merge_result/tempid/_search
{
"query": {
"term": {
"name": {
"value": "北京"
}
}
}
}
So is there any way to query through the URL, like old_merge_result/tempid/_search?q=name:北京?

You need to use percent-encoding (URL-encoding) to pass CJK characters as query parameters.
For the above example it would be:
GET /old_merge_result/tempid/_search?q=name:%E5%8C%97%E4%BA%AC
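If you build the request URL in code, the standard library can do the encoding for you. A minimal Python sketch (the index, type, and field names are taken from the question; sending the request is left to your HTTP client):

```python
from urllib.parse import quote

# Percent-encode the CJK term so the URI is valid.
term = quote("北京")  # -> "%E5%8C%97%E4%BA%AC"
url = f"/old_merge_result/tempid/_search?q=name:{term}"
print(url)
```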

Related

Return documents starting from ID, sorted by timestamp

Say I have an index with the following documents:
{
"id": "8e8e3c0c-5d1d-4a3c-a78a-1bd2d206b39e",
"timestamp": "2022-10-18T00:00:02"
}
{
"id": "0ebeb7b1-dcd0-4b37-a70d-fa7377f07f8c",
"timestamp": "2022-10-18T00:00:03"
}
{
"id": "ea779299-1781-4465-b8a1-53f7b14fbe0c",
"timestamp": "2022-10-18T00:00:01"
}
{
"id": "3624a119-4830-4ec2-a840-f656c048fc5c",
"timestamp": "2022-10-18T00:00:04"
}
I need a search query that returns documents from a specified id, sorted by timestamp up to a limit (say 100). So given the id of 8e8e3c0c-5d1d-4a3c-a78a-1bd2d206b39e, the following documents will be returned (in this exact order, note that document with id ea779299-1781-4465-b8a1-53f7b14fbe0c is missing because its timestamp is earlier than the document I'm looking for):
{
"id": "8e8e3c0c-5d1d-4a3c-a78a-1bd2d206b39e",
"timestamp": "2022-10-18T00:00:02"
}
{
"id": "0ebeb7b1-dcd0-4b37-a70d-fa7377f07f8c",
"timestamp": "2022-10-18T00:00:03"
}
{
"id": "3624a119-4830-4ec2-a840-f656c048fc5c",
"timestamp": "2022-10-18T00:00:04"
}
I know how to do this in two queries by first getting the document by its id, and then another query to get all the documents "after" that document's timestamp, but I'm hoping there's a more efficient way to do this with a single query.
Note that the index is expected to have tens/hundreds of millions of documents, so performance concerns are a factor (I'm unsure what "work" ES is doing under the covers, such as sorting first and then visiting each document to check the id), although the cluster will be sized appropriately.
You can use the bool query below, which will give you your expected result. The match_all inside must returns all the documents, and the term inside the should clause boosts the document whose id matches.
If your id field is mapped as keyword, use id in the term query; if it is mapped as both text and keyword, use id.keyword.
{
"size": 100,
"sort": [
{
"_score": "desc"
},
{
"timestamp": {
"order": "asc"
}
}
],
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"should": [
{
"term": {
"id.keyword": {
"value": "8e8e3c0c-5d1d-4a3c-a78a-1bd2d206b39e"
}
}
}
]
}
}
}
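If you generate this request from application code, it can help to keep the body in one small helper. A minimal Python sketch (the helper name is mine; it only constructs the request body, and sending it with your Elasticsearch client is left out):

```python
def build_pinned_sort_query(doc_id: str, size: int = 100) -> dict:
    """Build the bool query from the answer above: match_all returns
    every document, and the should-term boosts the given id so it
    sorts first on _score, with the rest ordered by timestamp."""
    return {
        "size": size,
        "sort": [
            {"_score": "desc"},
            {"timestamp": {"order": "asc"}},
        ],
        "query": {
            "bool": {
                "must": [{"match_all": {}}],
                "should": [
                    {"term": {"id.keyword": {"value": doc_id}}}
                ],
            }
        },
    }

body = build_pinned_sort_query("8e8e3c0c-5d1d-4a3c-a78a-1bd2d206b39e")
```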

Exact match search on text field

I'm using Elasticsearch to search data. My data contains a text field, and when I run a match query on an input, it also returns documents where the input appears together with another string.
_mapping
"direction": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
Elastic Data
[
{
direction: "North"
},
{
direction: "North East"
}
]
Query
{
match: {
"direction" : {
query: "North",
operator : "and"
}
}
}
Result
[
{
direction: "North"
},
{
direction: "North East"
}
]
Expected Result
[
{
direction: "North"
}
]
Note: it should return only the exact match for direction.
You may want to look at Term Queries which are used on keyword datatype to perform exact match searches.
POST <your_index_name>/_search
{
"query": {
"term": {
"direction.keyword": {
"value": "North"
}
}
}
}
The reason you observe this is that you are querying a text field using a match query. The values of a text field are broken down into tokens, which are then stored in an inverted index. This process is called analysis. Text fields are not meant to be used for exact matches.
Also note that whatever words/tokens you mention in a match query also go through the analysis phase before the query is executed.
Hope it helps!
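To see why the match query hits both documents, it helps to mimic very roughly what the standard analyzer does. A simplified Python sketch (this is an approximation for illustration, not the real analyzer):

```python
import re

def analyze(text: str) -> list[str]:
    # Rough stand-in for the standard analyzer: lowercase the text
    # and split it into word tokens (real analysis is more involved).
    return re.findall(r"\w+", text.lower())

docs = ["North", "North East"]
query_tokens = analyze("North")  # -> ["north"]

# A match query (with operator "and") succeeds if every query token
# appears among the document's tokens.
matches = [d for d in docs if all(t in analyze(d) for t in query_tokens)]
# Both "North" and "North East" contain the token "north",
# which is why the match query returns both documents.
```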
Based on your mapping, you should not search on the field direction but on direction.keyword if you want an exact match. The field direction is of type text and gets analyzed; in your case, "North East" is broken into the words north and east.
Try this:
{ "query" : { "bool" : { "must": { "term": { "direction.keyword": "North" } } } } }

Filter out records with a wildcard

I am using ElasticSearch + Kibana to log errors. In the Kibana dashboard, I can filter out records by a certain field by clicking on the magnifying glass with the minus sign. It then generates the following query to exclude them:
{
"query": {
"match": {
"message": {
"query": "Invalid HTTP_HOST header: '12.34.567.89'. You may need to add '12.34.567.89' to ALLOWED_HOSTS.",
"type": "phrase"
}
}
}
}
Now I want to exclude these records for all possible IP addresses, so I need a wildcard (or regexp). I found the documentation about wildcards and regexps here. However, they do not resemble the syntax used above.
If I change the query above to the one from the documentation, it doesn't filter it at all. Example:
{
"query": {
"wildcard": {
"message": "Invalid HTTP_HOST header: *"
}
}
}
If I try to combine them, I get a parsing error: Discover: [parsing_exception] [match] unknown token [START_OBJECT] after [query], with { line=1 col=444 }. Example:
{
"query": {
"match": {
"message": {
"query": {
"wildcard": {
"message": "Invalid HTTP_HOST header: *"
}
},
"type": "phrase"
}
}
}
}
I have tried a few more combinations, but I can't get it to work. Any ideas?
Another possibility is to use the regexp query, like this, but depending on how much data you have, it's going to be CPU intensive:
POST _search
{
"query": {
"regexp": {
"message.keyword": {"value":"Invalid HTTP_HOST header: '<1-999>\\.<1-999>\\.<1-999>\\.<1-999>'\\. You may need to add '<1-999>\\.<1-999>\\.<1-999>\\.<1-999>' to ALLOWED_HOSTS\\.",
"flags": "ALL"}
}
}
}
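For reference, the pattern can be prototyped outside Elasticsearch with Python's re module. Standard regex syntax has no Lucene interval operator, so `\d{1,3}` stands in for `<1-999>` here (a rough equivalent, assumed for illustration; it also admits octets like 000):

```python
import re

# Standard-regex approximation of the Lucene pattern above.
pattern = (
    r"Invalid HTTP_HOST header: '\d{1,3}(\.\d{1,3}){3}'\. "
    r"You may need to add '\d{1,3}(\.\d{1,3}){3}' to ALLOWED_HOSTS\."
)

msg = ("Invalid HTTP_HOST header: '12.34.567.89'. "
       "You may need to add '12.34.567.89' to ALLOWED_HOSTS.")
assert re.fullmatch(pattern, msg) is not None
```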
You might be better off analyzing your data before indexing it and splitting it into better searchable parts.
This may sound weird, but it seems it is not working because of the upper-case text.
Try this:
{
"query": {
"wildcard": {
"message": "*http_host*"
}
}
}
Click on Add filter, then in the top right corner of the dialog box click Edit as Query DSL:
Case 1:
A case-sensitive search for the word http_host in the string.
The wildcard query supports only the ? and * operators.
{
"wildcard": {
"message.keyword": "*http_host*"
}
}
Case 2:
A case-insensitive search for the word http_host in the string.
{
"query": {
"multi_match": {
"query": "http_host",
"fields": [
"message"
],
"type": "best_fields"
}
}
}

Elasticsearch with AND query in DSL

This drives me crazy: I have no clue why Elasticsearch does not return anything.
I put values with this:
PUT /customer/person-test/1?pretty
{
"name": "John Doe",
"personId": 153,
"houseHoldId": 6191136,
"quarter": "2016_Q1"
}
PUT /customer/person-test/2?pretty
{
"name": "John Doe",
"personId": 153,
"houseHoldId": 6191136,
"quarter": "2016_Q2"
}
and when I query like this, it does not return anything:
GET /customer/person-test/_search
{
"query": {
"bool": {
"must" : [
{
"term": {
"name": "John Doe"
}
},
{
"term": {
"quarter": "2016_Q1"
}
}
]
}
}
}
I copied this query from A simple AND query with Elasticsearch.
I just want to get the person with "John Doe" AND "2016_Q1"; why does this not work?
You should use match instead of term:
GET /customer/person-test/_search
{
"query": {
"bool": {
"must" : [
{
"match": {
"name": "John Doe"
}
},
{
"match": {
"quarter": "2016_Q1"
}
}
]
}
}
}
Explanation
Why doesn’t the term query match my document ?
String fields can be of type text (treated as full text, like the body
of an email), or keyword (treated as exact values, like an email
address or a zip code). Exact values (like numbers, dates, and
keywords) have the exact value specified in the field added to the
inverted index in order to make them searchable.
However, text fields are analyzed. This means that their values are
first passed through an analyzer to produce a list of terms, which are
then added to the inverted index.
There are many ways to analyze text: the default standard analyzer
drops most punctuation, breaks up text into individual words, and
lower cases them. For instance, the standard analyzer would turn the
string “Quick Brown Fox!” into the terms [quick, brown, fox].
This analysis process makes it possible to search for individual words
within a big block of full text.
The term query looks for the exact term in the field’s inverted
index — it doesn’t know anything about the field’s analyzer. This
makes it useful for looking up values in keyword fields, or in numeric
or date fields. When querying full text fields, use the match query
instead, which understands how the field has been analyzed.
...
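The analysis step described in the quote above can be illustrated with a rough Python approximation (a simplification for illustration, not the real standard analyzer):

```python
import re

def standard_analyze(text: str) -> list[str]:
    # Rough approximation of the standard analyzer: lowercase and
    # keep runs of word characters (underscores stay inside tokens,
    # so "2016_Q1" remains one token, as in the term query below).
    return re.findall(r"\w+", text.lower())

print(standard_analyze("Quick Brown Fox!"))  # ['quick', 'brown', 'fox']
print(standard_analyze("John Doe"))          # ['john', 'doe']
print(standard_analyze("2016_Q1"))           # ['2016_q1']
# A term query for the exact value "John Doe" finds nothing, because
# the inverted index only contains the tokens "john" and "doe".
```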
It's not working because you are using the default standard analyzer for name and quarter.
You have two more options:
1) Change the mapping:
"name": {
"type": "string",
"index": "not_analyzed"
},
"quarter": {
"type": "string",
"index": "not_analyzed"
}
2) Lowercase your values, since by default the standard analyzer uses the Lower Case Token Filter:
{
"query": {
"bool": {
"must" : [
{
"term": {
"name": "john_doe"
}
},
{
"term": {
"quarter": "2016_q1"
}
}
]
}
}
}

Elastic Search Term Query Not Matching URL's

I am a beginner with Elasticsearch and I have been working on a POC since last week.
I have a URL field as part of my document, which contains URLs in the following format: "http://www.example.com/foo/navestelre-04-cop".
I cannot define a mapping for my whole object, as every object has different keys except the URL.
Here is how I am creating my Index :
POST
{
"settings" : {
"number_of_shards" : 5,
"mappings" : {
"properties" : {
"url" : { "type" : "string","index":"not_analyzed" }
}
}
}
}
I am keeping my URL field as not_analyzed, as I have learned that marking a field as not_analyzed prevents it from being tokenized, so I can look for an exact match on that field in a term query.
I have also tried using the whitespace analyzer, since the URL values do not contain any whitespace characters. But again I am unable to get a successful hit.
Below is my term query :
{
"query":{
"constant_score": {
"filter": {
"term": {
"url":"http://www.example.com/foo/navestelre-04-cop"
}
}
}
}
}
I am guessing the problem lies somewhere in the analyzers and tokenizers, but I am unable to find a solution. Any kind of help would be great to enhance my knowledge and help me reach a solution.
Thanks in Advance.
You have the right idea, but it looks like some small mistakes in your settings request are leading you astray. Here is the final index request:
POST /test
{
"settings": {
"number_of_shards" : 5
},
"mappings": {
"url_test": {
"properties": {
"url": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Notice the added url_test type in the mapping. This lets ES know that your mapping applies to this document type. Also, settings and mappings are different keys of the root object, so they have to be separated. Because your initial settings request was malformed, ES just ignored it and used the standard analyzer on your document, which is why your term query found nothing. See the ES Mapping docs.
We can index two documents to test with:
POST /test/url_test/1
{
"url":"http://www.example.com/foo/navestelre-04-cop"
}
POST /test/url_test/2
{
"url":"http://stackoverflow.com/questions/37326126/elastic-search-term-query-not-matching-urls"
}
And then execute your unmodified search query:
GET /test/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"url": "http://www.example.com/foo/navestelre-04-cop"
}
}
}
}
}
Yields this result:
"hits": [
{
"_index": "test",
"_type": "url_test",
"_id": "1",
"_score": 1,
"_source": {
"url": "http://www.example.com/foo/navestelre-04-cop"
}
}
]