Natural sorting with Cloudant Search Index

We have tested different analyzers on a search index, but none of them gives a natural sort order. All analyzers except the keyword analyzer tokenize the input string before the sort is applied. With the keyword analyzer we get ASCII ordering, which has a problem: strings that start with an uppercase letter are placed at the start of the list and lowercase strings at the end. This is because, in ASCII, lowercase letters have larger code values than uppercase letters.
Example:
Input Strings: steve, John, Dave, george
After sorting: Dave, John, george, steve
Expected Output: Dave, george, John, steve
Is there a way in Cloudant to achieve a natural/alphabetical sort order irrespective of case?
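The difference between the two orderings can be reproduced outside Cloudant. The following is a minimal Python sketch, not Cloudant code: it only illustrates that plain code-point sorting produces the order shown above, while sorting on a case-folded key produces the expected one. Whether the same effect can be achieved in a search index (for example, by sorting on a lowercased copy of the field) is an assumption that would need to be checked against the Cloudant documentation.

```python
names = ["steve", "John", "Dave", "george"]

# Plain code-point ordering: every uppercase letter sorts before any lowercase one.
ascii_order = sorted(names)

# Case-insensitive ordering: compare on a case-folded key, keep the original spelling.
natural_order = sorted(names, key=str.casefold)

print(ascii_order)    # ['Dave', 'John', 'george', 'steve']
print(natural_order)  # ['Dave', 'george', 'John', 'steve']
```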

Related

How do I select the best match for a string in multiple documents when the score is equal for both?

I have implemented an algorithm in Elm that compares a sentence (the user input) to multiple other sentences (the data). The user input and the data are split into words, and the comparison is done word by word. The algorithm marks the data sentence that shares the most words with the user input as the best match.
On the first pass, the first data sentence is taken as the best match. The algorithm then moves on to the second sentence and counts its matches. If that count is greater than the previous one, the second sentence becomes the best match; otherwise the previous one is kept.
If two sentences have an equal number of matches, I currently compare the sizes of the two sentences and select the smaller one as the best match.
There is no semantic meaning involved, so is selecting the smaller sentence the best way to break the tie, or are there better options? I have looked for scientific references but couldn't find any.
Edit:
To summarize: you want to compare one sentence to two other sentences based on word occurrences. If both sentences share the same number of words with the sentence you are comparing against, which one should be marked as the most similar? Which methods are used to determine this similarity?
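For reference, the approach described in the question can be sketched as follows. This is a Python illustration rather than the original Elm code; the function names are hypothetical, "size" is assumed to mean the number of words, and shared words are counted as distinct words.

```python
def words(sentence):
    # Lowercase and split on whitespace; duplicates are ignored.
    return set(sentence.lower().split())

def best_match(query, candidates):
    query_words = words(query)
    best, best_key = None, None
    for candidate in candidates:
        overlap = len(query_words & words(candidate))
        # More shared words wins; on a tie, the candidate with fewer words wins.
        key = (overlap, -len(candidate.split()))
        if best_key is None or key > best_key:
            best, best_key = candidate, key
    return best

data = [
    "the cat sat on a big red chair",
    "the cat sat on it",
    "a dog",
]
print(best_match("the cat sat on the mat", data))  # the cat sat on it
```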
Some factors you can add in to improve the comparison:
String similarity (e.g. Levenshtein, Jaro-Winkler, ...)
Account for sentence length by applying a linear or geometric penalty for the difference in length (either at the character or at the word level)
Clean the strings (remove stopwords, special characters, etc.)
Add the sequence (position) of words as a parameter, that is, which word comes before or after another word.
Use Sentence Embeddings for similarity to also capture some semantics (https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)
Finally, there will always be some sentences that have the same difference to your input, although they are different. That's OK, as long as they are actually similarly different to your input sentence.
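Two of these factors, string cleaning and a linear length penalty, could be combined with the word-overlap count roughly as below. This is a sketch under assumptions: the stopword set, the function names, and the penalty weight of 0.1 are illustrative choices, not recommended values.

```python
STOPWORDS = {"a", "an", "the", "of", "on", "in", "is"}  # illustrative, not a standard list

def tokens(sentence):
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in sentence.lower())
    return [w for w in cleaned.split() if w not in STOPWORDS]

def score(query, candidate, penalty=0.1):
    q, c = tokens(query), tokens(candidate)
    overlap = len(set(q) & set(c))
    # Linear penalty for every word of difference in (cleaned) sentence length.
    return overlap - penalty * abs(len(q) - len(c))

query = "The cat sat on the mat."
long_candidate = "the cat sat on a big red chair"
short_candidate = "The cat sat on it!"
print(score(query, long_candidate), score(query, short_candidate))
```

Here both candidates share two non-stopwords with the query, and the penalty ranks the one closer in length higher.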

String algorithm with hashing

We have a list of strings. Given a shorter subsequence with at most three dashes between the characters, I have to find the maximum number of strings it can match.
E.g.
1243, 3452, 2343, 124
1_4_
The answer is 2, as 1243 and 124 both match. Each dash can either be filled with any digit or left out.
Can anyone suggest some efficient hashing techniques?
Hashing wouldn't be a good approach for this problem. I suggest stringifying your numbers and then using a regex to match the characters based on their index in the string.
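A regex-based version might look like the sketch below. It assumes, based on the example, that each dash stands for either exactly one character or nothing; `count_matches` is a hypothetical helper name.

```python
import re

def count_matches(pattern, items, wildcard="_"):
    # Each wildcard becomes ".?", i.e. one arbitrary character or nothing at all;
    # every other character must match literally.
    regex = re.compile("".join(".?" if ch == wildcard else re.escape(ch) for ch in pattern))
    # str() stringifies numeric inputs; fullmatch requires the whole string to match.
    return sum(1 for item in items if regex.fullmatch(str(item)))

print(count_matches("1_4_", [1243, 3452, 2343, 124]))  # 2
```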

Comparing words with documents

As far as I know, doc2vec computes embeddings for both documents and words. Can I use a word vector and a document vector to estimate the similarity of a word to a document, or can I only compare documents against documents and words against words? Any remark would be helpful.

How to get the maximum-character-matching string from an array of strings

I have an array of strings, which is a list of correct standard-disease names. I have another array of strings that is also a list of diseases with some variation in spelling; sometimes they are misspelled in the second array.
I want to map each disease name in the second array to the first array. This is not 100% possible, but I want to suggest a correct mapping against each incorrect disease name. Does someone know an algorithm?
Have a look at Levenshtein distance.
It is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one word into another.
More discussions and implementation can be found at "Measure the distance between two strings with Ruby?".
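As an illustration, here is a minimal Python sketch of the distance and of using it to suggest the closest standard name. The helper `suggest` and the sample names are hypothetical; for large lists a dedicated library would likely be faster.

```python
def levenshtein(a, b):
    # Dynamic programming over two rows: previous[j] holds the distance
    # between the first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(name, standard_names):
    # Return the standard name with the smallest case-insensitive distance.
    return min(standard_names, key=lambda s: levenshtein(name.lower(), s.lower()))

print(suggest("diabetis", ["diabetes", "malaria", "asthma"]))  # diabetes
```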

grouping strings by similarity

I have an array of strings, not many (maybe a few hundred) but often long (a few hundred chars).
Those strings are, generally, nonsense and different from one another, but within a small group of them, maybe 5 out of 300, there is great similarity. In fact they are the same string; what differs is formatting, punctuation and a few words.
How can I work out that group of strings?
By the way, I'm writing in ruby, but if nothing else an algorithm in pseudocode would be fine.
thanks
Assuming that you are not worried about misspellings or other errors in each word, you could do the following:
Build an inverted index, which is basically a hash keyed by word, pointing to a list of pointers to the strings which contain that word (how you handle duplicate occurrences is up to you). To determine strings that are similar to a given query string, look up each query word in the index, and for each source string in the resulting lists, count how many of the lists it appears in. The strings with the highest counts are your best candidates for similarity, because they contain the most words in common.
Then you can compute the edit distance between the query string and each candidate, or whatever other metric you want. This way you avoid the O(n^2) complexity of comparing each string with every other string.
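The question asks for Ruby or pseudocode; the inverted-index step is sketched here in Python, which translates to Ruby hashes fairly directly. The function names are hypothetical, and words are split on whitespace only, so punctuation would need to be stripped first in real data.

```python
from collections import Counter, defaultdict

def build_index(strings):
    # Map each word to the set of positions of the strings that contain it.
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for word in s.lower().split():
            index[word].add(i)
    return index

def candidates(query, index, top=3):
    # Count, for every indexed string, how many distinct query words it contains.
    counts = Counter()
    for word in set(query.lower().split()):
        for i in index.get(word, ()):
            counts[i] += 1
    return counts.most_common(top)

strings = [
    "alpha beta gamma delta",
    "lorem ipsum dolor sit",
    "alpha beta gamma epsilon",
    "zeta eta theta iota",
]
index = build_index(strings)
print(candidates("alpha beta gamma delta", index, top=2))  # [(0, 4), (2, 3)]
```

Only the few top candidates then need the more expensive edit-distance comparison.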
