Searching within text fields in CloudKit - nspredicate

How are people searching within a string (a substring) field using CloudKit?
For making predicates for use with CloudKit, from what I gather, you can only use BEGINSWITH and TOKENMATCHES, which search a single text field's prefix or all fields for an exact token match, respectively. CONTAINS only works on collections, despite these examples. I can't determine a way to find, for example, "roses" in the string "Red roses are pretty".
I was thinking of making a tokenized version of certain string fields; for example the following fields on a hypothetical record:
description: 'Red roses are pretty'
descriptionTokenized: ['Red', 'roses', 'are', 'pretty']
Testing this out makes CONTAINS somewhat useful when searching for distinct, space-separated substrings, but it's still not as good as SQL LIKE would be.
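The tokenized-field workaround can be sketched in a few lines — a hypothetical pre-save step, shown here in Python for illustration (the field names mirror the record above; lowercasing is an extra assumption, to make matching case-insensitive):

```python
def tokenize(text):
    # Lowercase word tokens for a CONTAINS-searchable companion field.
    # (Lowercasing is an assumption here, for case-insensitive matching.)
    return [word.lower() for word in text.split()]

record = {"description": "Red roses are pretty"}
record["descriptionTokenized"] = tokenize(record["description"])
# A CONTAINS predicate against descriptionTokenized can now match the
# whole word "roses" -- though still not the substring "rose".
```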

Related

Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

I wanted to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the user_dictionary file format for the tokenizer. It's explained in the Elastic docs at https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html as CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
So there is not much in the documentation about that.
Looking at the sample entry the doc shows, it looks like this:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー - Tokyo Sky Tree
東京 スカイツリー - Tokyo Sky Tree - I assume the space here denotes a token boundary, but why is only "Tokyo" a separate token, while "sky tree" is not split into "sky" and "tree"?
トウキョウ スカイツリー - Then we have the reading forms, again "Tokyo" and "sky tree". Again, why is it split that way? And can I specify more than one reading form of the text in this column (if there are any)?
And the last column is the part of speech, which is the bit I don't understand. カスタム名詞 means "custom noun". I assume I can define parts of speech such as verb, noun, etc., but what are the rules - should it follow some standard set of part-of-speech names? I have seen examples where it is specified simply as "noun" - 名詞 - but in this example it is a custom noun.
Does anyone have ideas or materials, especially around the part-of-speech field - such as what the available values are? Also, what impact does this field have on the overall tokenizer's capabilities?
Thanks
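To make the four CSV columns concrete, the sample line can be decomposed like this — a small Python sketch of the column layout only, not of Kuromoji's actual parser:

```python
def parse_user_dictionary_line(line):
    # Four comma-separated columns: surface text, space-separated tokens,
    # space-separated readings, and a part-of-speech tag.
    text, tokens, readings, pos = line.split(",")
    return {
        "text": text,
        "tokens": tokens.split(" "),
        "readings": readings.split(" "),
        "part_of_speech": pos,
    }

entry = parse_user_dictionary_line(
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
)
```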
Did you try to define "Tokyo Sky Tree" like this?
"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
I encounter another problem: Found duplicate term [東京スカイツリー] in user dictionary at line [1]

Parse.com search performance

I have a Parse class that I want to make searchable on a String attribute. According to Parse's blog, the most efficient way to do this is to use a list of lowercased words as your searchable String attribute. For example, instead of "My Search Term", use ["my", "search", "term"].
The problem is that I want to be able to search substrings as well. So an object with description "abc" should be returned for any of the following search strings: a, b, c, ab, bc, or abc.
I thought about further breaking down the ["my", "search", "term"] tags into more words, so "my" would become ["m", "y", "my"]. This list might become huge, though, for a slightly longer string (consider "search", for example).
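To see how fast that tag list grows, here is a plain Python sketch (independent of the Parse SDK) enumerating every substring of a single word:

```python
def all_substrings(word):
    # Every contiguous substring -- what a full substring-search
    # tag list would have to contain for this one word.
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)}

# A 6-letter word like "search" already needs 21 tags;
# an n-letter word needs up to n * (n + 1) / 2.
```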
Using query.matches would solve this, but it is not recommended for large datasets.
So what is the best way to approach this, considering that most of the tags will be a lot longer than just 2 characters?
You correctly identified the only two reasonable alternatives in Parse. And for nontrivial keywords, I think query.matches will be the only workable choice.
Consider relaxing the requirement that the user can search any substring of a string. Isn't it unreasonable to suppose that I might search for the word "requirement" with the string "ire"? If search is limited to matching prefixes, the query can use startsWith.

Is there a way to search fhir resources on a text search parameter using wildcards?

I'm trying to search for all Observations where "blood" is associated with the code using:
GET [base]/Observation?code:text=blood
It appears that the search is matching Observations where the associated text starts with "blood" but not matching on associated text that contains "blood".
Using the following, I get results with a Coding.display of "Systolic blood pressure" but I'd like to also get these Observations by searching using the text "blood".
GET [base]/Observation?code:text=sys
Is there a different modifier I should be using or wildcards I should use?
The servers seem to do as the spec requires: when using the modifier :text on a token search parameter (like code here), the spec says:
":text The search parameter is processed as a string that searches text associated with the code/value"
If we look at how a server is supposed to search a string, we find:
"By default, a field matches a string query if the value of the field equals or starts with the supplied parameter value, after both have been normalized by case and accent."
Now, if code had been a true string search parameter, we could have applied the :contains modifier. However, we cannot stack modifiers, so code:text:contains might seem logical, but it is not part of the current specification.
So, I am afraid that there is currently no "standard" way to do what you want.
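The default string-matching rule quoted above can be sketched in Python — an illustration of the spec's described behavior, not of any particular server's implementation:

```python
import unicodedata

def normalize(s):
    # Case- and accent-normalize, per the string-search rule quoted above.
    decomposed = unicodedata.normalize("NFD", s.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fhir_string_match(field_value, param):
    # Default FHIR string search: match if the field equals or
    # starts with the parameter after normalization.
    return normalize(field_value).startswith(normalize(param))

# This reproduces the behavior observed in the question:
# "sys" matches "Systolic blood pressure", but "blood" does not.
```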

Ignore elements in cts:search

I have some XML documents with a structure like this:
<root>
<intro>...</intro>
...
<body>
<p>..................
some text CO<sub>2</sub>
.................. </p>
</body>
</root>
Now I want to search for the phrase CO2, and I also want results of the above type included in the search results.
For this purpose, I am using this query:
cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)
But using this I am not able to get documents with CO<sub>2</sub>; I only get data with the plain phrase CO2.
If I change the search phrase to CO 2, then I get documents only with CO<sub>2</sub> and not with CO2.
I want to get combined results for both CO<sub>2</sub> and CO2.
So can I ignore <sub> by some means, or is there another way to address this problem?
The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
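The tokenization difference is easy to illustrate outside MarkLogic — a rough Python sketch, not MarkLogic's actual tokenizer:

```python
import re

def default_tokens(text):
    # Rough approximation of default word tokenization: letters and
    # digits cluster together, so "CO2" stays a single token.
    return re.findall(r"\w+", text)

def digits_as_symbols(text):
    # Sketch of a digits-as-'symbol' override: each digit becomes its
    # own token, splitting "CO2" into "CO" and "2" -- which can then
    # phrase-match the two tokens produced by CO<sub>2</sub>.
    return re.findall(r"[^\W\d_]+|\d", text)
```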
You probably don't want this to apply everywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents. Fields in general, and tokenization overrides in particular, come at a performance cost: the contents of a field are indexed completely separately, so the index is bigger, and the tokenization overrides mean we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).
It appears that you want to add a phrase-through configuration.
Example:
<p>to <b>be</b> or not to be</p>
A phrase-through on <b> would then be indexed as "to be or not to be"

Amazon Cloudsearch not searching with partial string

I'm testing Amazon CloudSearch for my web application and I'm running into some strange issues.
I have the following domain indexes: name, email, id.
For example, I have data such as: John Doe, john@example.com, 1
When I search for jo I get nothing. If I search for joh I still get nothing, but if I search for john then I get the above document as a hit. Why does it not match when I supply partial strings? I even put suggesters on name and email with fuzzy matching enabled. Is there something else I'm missing? I read the below on this:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-compound-queries.html
I'm doing the searches using boto as well as with the form on the AWS page.
What you're trying to do -- finding "john" by searching "jo" -- is called a prefix search.
You can accomplish this either by searching
(prefix field=name 'jo')
or
q=jo*
Note that if you use the q=jo* method of appending * to all your queries, you may want to do something like q=jo*|jo because john* will not match john.
This can seem a little confusing, but imagine if Google gave back results for prefix matches: if you searched for tort and got back a mess of results about tortoises and torture instead of tort (a legal term), you would be very confused (and frustrated).
A suggester is also a viable approach, but that's going to give you back suggestions (like john, jordan, and jostle) rather than results, which you would then need to search for; it does not return matching documents to you.
See "Searching for Prefixes in Amazon CloudSearch" at http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
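Since both query forms above are just strings, building them can be sketched like this — hypothetical helpers in plain Python; the actual request would go through boto or the domain's search endpoint:

```python
def prefix_query(field, term):
    # Structured-syntax prefix query, as shown in the answer above.
    return f"(prefix field={field} '{term}')"

def simple_prefix_query(term):
    # Simple-syntax variant: trailing wildcard OR'd with the bare term,
    # per the caveat about appending * to every query.
    return f"{term}*|{term}"
```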
Are your index field types "Text"? If they are just "Literal", they have to be an exact match.
I think you must have your name and email fields set as the literal type instead of the text type; otherwise a simple text search of 'jo' or 'joh' should have found the example document.
While using a prefix search may have solved your problem (and that makes sense if the fields are set as the literal type), the accepted answer isn't really correct. The notion that it works "like a Google search" isn't based on anything in the documentation; it actually contradicts the example the docs use, and in general it muddies up what's possible with the service. From the docs:
When you search text and text-array fields for individual terms, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, in the sample movie data, the title field is configured as a text field. If you search the title field for star, you will find all of the movies that contain star anywhere in the title field, such as star, star wars, and a star is born. This differs from searching literal fields, where the field value must be identical to the search string to be considered a match.
