Backslash search not working with Chinese characters in MarkLogic - xpath

I am using the following cts query for the search in MarkLogic:
cts:element-word-query(xs:QName('c:l10n'),'\*\漢\*',('wildcarded','case-insensitive','whitespace-sensitive'))
It does not return any results, although the database does contain data with "\漢" in it.
What I have already tried:
The query works fine with ASCII sequences like \r, \n or /r, /n.
It also gives the correct results if I use only \ or only 漢, but it always shows 0 results whenever I combine \ with any Chinese character.

It is possible that there is a tokenization bug here, but it is hard to tell.
What you (should) have here is a phrase query for "\", (wildcard word), "\", "漢", "\", and (wildcard word), in that order. It is punctuation-sensitive. Do you have an example of some content you think should match?
What does the query plan show you? What are your index settings?
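For reference, a minimal XQuery sketch for checking both of those (the c: namespace URI below is hypothetical; use the binding from your own code, and adjust the language code passed to cts:tokenize):

xquery version "1.0-ml";
declare namespace c = "http://example.com/l10n";  (: hypothetical namespace binding :)

(: 1. How are the search term and a sample value tokenized? :)
cts:tokenize("\漢", "zh"),
cts:tokenize("some \漢 content", "zh"),

(: 2. What does the query plan report for the failing query? :)
xdmp:plan(
  cts:search(fn:doc(),
    cts:element-word-query(xs:QName("c:l10n"), "\*\漢\*",
      ("wildcarded", "case-insensitive", "whitespace-sensitive"))))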

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack, indexing the constellation as __Foo__ and then changing my search query accordingly by adding the __ prefix and suffix to the selected constellation.
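For reference, a sketch of what that workaround looks like on the query side (index and field names here are only illustrative; the constellation is indexed as __And__):

curl -XGET 'http://localhost:9200/constellations/_search?pretty' -d '{
  "query": {
    "query_string": { "query": "constellation:__And__" }
  }
}'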

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example:
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" become separate tokens. The "A" token might then get dropped by the stop token filter (see the Standard Analyzer documentation). So without any custom analyzers, you will typically match any document that contains just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at both index and query time, so a simple match query will work just fine. But this is a very simplistic approach, and you might want to consider how '/' characters are actually used in your content and use slightly more complex regex filters, which are also not perfect solutions.
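A sketch of such an analyzer, assuming an index called docs and that replacing '/' with '_' is acceptable for your content (assign the analyzer to the relevant field in your mapping afterwards):

curl -XPUT 'http://localhost:9200/docs' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "slash_to_underscore": { "type": "mapping", "mappings": ["/ => _"] }
      },
      "analyzer": {
        "fulltext_with_slashes": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["slash_to_underscore"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

With this, "PDF/A" is indexed and queried as the single token pdf_a, so a plain match query on the field should find it.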
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Since it does not use the full query parser it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
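A sketch of that query (index and field names are placeholders):

curl -XGET 'http://localhost:9200/docs/_search?pretty' -d '{
  "query": {
    "simple_query_string": {
      "query": "pdf/a",
      "fields": ["body"],
      "default_operator": "and"
    }
  }
}'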

Search for a string that starts with a wildcard in ElasticSearch

I am building a kibana dashboard that displays information about X509 certificates. I would like to build a pie chart of certificates that contain a wildcard in their CN or SAN attributes, but I cannot find a query syntax that works.
To match a string like subject.cn: "*.example.net", I tried the following Kibana queries:
subject.cn:/\*./
subject.cn:/^\*./
subject.cn:\*\.
subject.cn:\*.
subject.cn:*.
Could someone point me to the proper syntax? Is this even something ES/Lucene supports?
Analysing *.example.net with the standard analyser will give you a single term of example.net - i.e. the asterisk and first "." have been stripped.
Using not_analyzed will store the complete field *.example.net (as expected!)
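You can check this with the _analyze API; here is a sketch using the older query-parameter form of the request (the exact format depends on your Elasticsearch version):

curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d '*.example.net'

The response should list a single token, example.net, confirming that the leading asterisk and dot are stripped.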
If the wildcard is always at the beginning of the CN name then using a simple prefix query will work (I've simplified the field name):
curl -XGET 'http://localhost:9200/mytest/certificates/_search?pretty' -d '{
  "query": {
    "prefix": { "cn.raw": "*" }
  }
}'
However if you want to search against different levels of the domain name you'll need to change the analyser you're using.
E.g. use the pattern analyser and define "." as your delimiter, or possibly create a custom analyzer that calls the path hierarchy tokenizer - it's going to depend on how users want to search your data.
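A sketch of what those two options might look like in the index settings (the analyzer and tokenizer names here are made up):

curl -XPUT 'http://localhost:9200/mytest' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "domain_levels": { "type": "path_hierarchy", "delimiter": "." }
      },
      "analyzer": {
        "split_on_dots": { "type": "pattern", "pattern": "\\." },
        "domain_hierarchy": { "type": "custom", "tokenizer": "domain_levels" }
      }
    }
  }
}'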
Thanks to Olly's answer, I was able to find a solution that works. Once the raw fields are defined, the trick is to escape the wildcard so it is treated as a literal character, and to surround it with unescaped wildcards to accept the surrounding characters:
ca:false AND (subject.cn.raw:*\** OR x509v3Extensions.subjectAlternativeName.raw:*\**)

Using AND, OR and NOT in Solr Query

I am trying a Solr query which is like this:
+field1:* AND (field2:1 OR field2:10) NOT(field3:value1 OR field3:value2)
But the field3 part of the query is not having any effect. It still returns records that have value1 or value2 in field3.
Why is this?
Try this
+field1:* +(field2:1 OR field2:10) -(field3:value1 OR field3:value2)
I think an AND / OR is missing between the last two blocks. It would then become something like:
+field1:* AND (field2:1 OR field2:10) AND NOT(field3:value1 OR field3:value2)
You need to URL-encode certain characters in the Solr query string, and the + (plus) symbol is one of them, as well as spaces, brackets, etc.
Things to encode are:
Space => +
+ => %2B
( => %28
) => %29
and so forth. You can see examples of encoded URLs on the Solr wiki:
https://wiki.apache.org/solr/SolrQuerySyntax
Try:
str_replace(array('+','(',')',' '), array('%2B','%28','%29','+'), '+field1:* +(field2:1 field2:10) -(field3:value1 field3:value2)');
This should give you:
%2Bfield1:*+%2B%28field2:1+field2:10%29+-%28field3:value1+field3:value2%29
If your default query parser operator is set to OR, then any space between fields will be interpreted as an OR operator.
The above result is far from clean and readable, but it is a correctly URL-encoded string, which is what Solr requires you to pass to it. You'll notice the difference as soon as you run it.
Why str_replace instead of urlencode? You can use urlencode, since it will encode the string correctly, but it may also encode some string components that don't need to be encoded.
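A small PHP sketch of that trade-off, using the same query string (the characters urlencode over-encodes here are ':' and '*'):

$q = '+field1:* +(field2:1 field2:10) -(field3:value1 field3:value2)';

// urlencode() also encodes ':' and '*', which don't need to be encoded:
echo urlencode($q) . "\n";
// %2Bfield1%3A%2A+%2B%28field2%3A1+field2%3A10%29+-%28field3%3Avalue1+field3%3Avalue2%29

// The targeted str_replace() only touches what actually needs encoding:
echo str_replace(array('+','(',')',' '), array('%2B','%28','%29','+'), $q) . "\n";
// %2Bfield1:*+%2B%28field2:1+field2:10%29+-%28field3:value1+field3:value2%29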

Zend Lucene fails all searches with special characters

If anyone knows a simple answer to this, I won't have to wade through creating an extra index with escaped strings and crying my eyes out while littering my pretty code.
Basically, the Lucene search we have running cannot handle any non-letter characters: spaces, percent signs, dots, dashes, slashes, you name it. This is highly infuriating, because I cannot run any search on items containing these characters, no matter whether I escape them or not.
I have two options: Kill these characters in a separate index and strip them from the names I'm searching or stop goddamn searching.
You can escape special characters using '/'. Lucene treats the following as special characters, and you will have to escape them to make the search work.
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
If you want to search for "2+3", the query should be "2/+3".
Use QueryParser.escape(String s) to escape the query string.
According to http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#-
the escape character is the backslash (\), not the forward slash.
And to answer Ankit, $ doesn't seem to need escaping, since it's not a special character.
Escaping the dash as suggested by Ralph doesn't make a difference for me (Zend Lucene). You'd think that when the word 'abc-def' is indexed and you search for 'abc-def', you'd somehow find that word, regardless of whether the dash is ignored at the indexing step or not; the same input should give the same result. The word seems to be indexed as two separate tokens, 'abc' and 'def', yet searching for 'abc-def' gives no results while 'abc def' does.
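If your version of Zend_Search_Lucene doesn't offer an escape helper equivalent to Lucene's QueryParser.escape(), a minimal PHP sketch of backslash-escaping the special characters listed above yourself might look like this (the helper name is made up):

// Hypothetical helper: backslash-escape the Lucene query-parser special characters.
function escapeLuceneQuery($query)
{
    // && and || are handled by escaping each individual & and | character.
    return preg_replace('/([+\-&|!(){}\[\]^"~*?:\\\\])/', '\\\\$1', $query);
}

echo escapeLuceneQuery('2+3'); // prints: 2\+3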
