Highlighting a query word in a document - algorithm

I have a document and a query term. I want to find the query term in the document and pad each occurrence with a certain text marker.
For example:
Text: I solemnly swear that I am upto no good.
Query: swear
Output: I solemnly MATCHSTART swear MATCHEND that I am upto no good.
Assuming that I have multiple query words and a large document, how can I do this efficiently?
I did go over various links on the internet but couldn't find anything very conclusive or definite. Moreover, this is just a programming question and has nothing to do with search engine development or information retrieval.
Any help would be appreciated. Thanks.

If each of your queries is a single word (a substring that contains no spaces, tabs, newlines, etc.), and a very low probability of false positives is acceptable (that is, occasionally marking a word that is not in the query set), you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
First, load your query words into the Bloom filter. Then scan the document and test each word against the filter. If the result is positive, mark the word.
You can use my implementation of bloom filter: http://olegh.cc.st/src/bloom.c.txt
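The load-then-scan idea can be sketched in Python as follows. This is a minimal illustration, not the linked C implementation: the class `BloomFilter`, the function `highlight_with_bloom`, the filter size, and the double-hashing scheme built on SHA-256 are all assumptions made for the example.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter: k bit positions per item in an m-bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def highlight_with_bloom(text, query_words, start="MATCHSTART", end="MATCHEND"):
    bloom = BloomFilter()
    for word in query_words:
        bloom.add(word)

    def mark(match):
        word = match.group(0)
        # A positive answer may, rarely, be a false positive.
        if word in bloom:
            return f"{start} {word} {end}"
        return word

    # Scan the document once, testing every word against the filter.
    return re.sub(r"\w+", mark, text)


print(highlight_with_bloom("I solemnly swear that I am upto no good.", ["swear"]))
```

If false positives are not acceptable, a positive answer from the filter can be confirmed against an ordinary set of the query words before marking.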

In Python:
text = "I solemnly swear that I am up to no good."  # read in however you like
query = input("Query: ")
# The surrounding spaces keep "swears" or "sportswear" from matching.
text.replace(" " + query + " ", " MATCHSTART " + query + " MATCHEND ")
OUTPUT:
'I solemnly MATCHSTART swear MATCHEND that I am up to no good.'
You could also use regex, but that's slower, so I just used string concat to add whitespace to the beginning and end of the word (so as not to match "swears", "swearing", or "sportswear"). This is easily translatable to whatever language you prefer.
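For several query words, one possible variant is a single compiled regex with word boundaries, which also catches matches at the start or end of the text and next to punctuation (cases the space-padded `replace` misses). This is only a sketch; the function name `highlight` and its parameters are made up for the example.

```python
import re


def highlight(text, queries, start="MATCHSTART", end="MATCHEND"):
    # Longest first, so a longer query wins over a query that is its prefix.
    ordered = sorted(set(queries), key=len, reverse=True)
    if not ordered:
        return text
    # One alternation of escaped literals, anchored on word boundaries.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(q) for q in ordered) + r")\b"
    )
    return pattern.sub(lambda m: f"{start} {m.group(0)} {end}", text)


print(highlight("I solemnly swear that I am up to no good.", ["swear", "good"]))
# I solemnly MATCHSTART swear MATCHEND that I am up to no MATCHSTART good MATCHEND.
```

The document is scanned once regardless of how many query words there are, which matters more for a large document than the cost of a single regex match.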

Related

How to tokenize a sentence based on maximum number of words in Elasticsearch?

I have a string like "This is a beautiful day"
What tokenizer or what combination between tokenizer and token filter should I use to produce output that contains terms that have a maximum of 2 words? Ideally, the output should be:
"This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day"
So far, I have tried all the built-in tokenizers. The 'pattern' tokenizer seems to be the one I can use, but I don't know how to write a regex pattern for my case. Any help?
It seems that you're looking for the shingle token filter; it does exactly what you want.
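As @Oleksii said.
In your case max_shingle_size = 2 (which is the default). The single-word terms come from output_unigrams, which is enabled by default, so min_shingle_size can stay at its default of 2 (it cannot be set lower than 2).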
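To make the expected terms concrete, here is a plain-Python emulation of what a shingle filter with a maximum of two words, and single words kept, should emit. It is not Elasticsearch code; the function `shingles` is hypothetical and its parameter names only loosely mirror the filter's settings.

```python
def shingles(tokens, max_size=2, output_unigrams=True):
    """Emit word n-grams position by position, shortest first."""
    smallest = 1 if output_unigrams else 2
    out = []
    for i in range(len(tokens)):
        for size in range(smallest, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


print(", ".join(shingles("This is a beautiful day".split())))
# This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day
```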

Elasticsearch - parts of string with correct word arrangement

I'm wondering how to properly query this scenario:
Field values:
20182199
20182188
20182177
Query-strings (that should match all three):
2018 -> hit
0182 -> fail
821 -> fail
The other requirement is that if more than one word is present in the query string, the whole query string must match, not each word of the string separately.
That's why I chose a match phrase prefix query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html). The only thing it doesn't cover is hits inside a word. That's what I am now looking for :-)
I'd appreciate any help. Thank you!
I believe the Elasticsearch docs specifically cover your use case: you are looking to match what Elasticsearch refers to as ngrams.
Partial Matching - a quick introduction
Ngrams for Partial Matching - it's worth noting that Elasticsearch calls sequences of characters ngrams and sequences of tokens shingles (a slight difference from what you may be used to)
Wildcard and Regexp Queries - the same section on partial matching has notes on these queries that might suffice for you and may not require you to reindex/change analysis
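As a rough illustration of why character ngrams help here, the following Python sketch lists the ngrams of one field value and checks the three query strings against them. It only emulates the idea; `char_ngrams` is a hypothetical helper, and the gram sizes 3 and 4 are assumptions picked to fit the example queries.

```python
def char_ngrams(value, min_gram=3, max_gram=4):
    """Return every substring of value whose length is in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for i in range(len(value) - size + 1):
            grams.add(value[i:i + size])
    return grams


grams = char_ngrams("20182199")
for query in ("2018", "0182", "821"):
    # Each query is looked up as a whole term, not split into ngrams itself.
    print(query, query in grams)
```

All three queries are found because the inner substrings "0182" and "821" are indexed as terms of their own. The price is a larger index, since every value contributes many terms.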

How to search emoticon/emoji in elasticsearch?

I am trying to search for text containing emoticons/emoji in Elasticsearch. I have already indexed tweets in ES. Now I want to search for, say, tweets with smiling or sad faces. I tried the following:
1) I used the Unicode escape for the smile emoji, but it didn't work. No results were returned.
GET /myindex/twitter_stream/_search
{
"query": {
"match": {
"text": "\u1f603"
}
}
}
How do I set up emoji search in Elasticsearch? Do I have to encode raw tweets before ingesting them into Elasticsearch? What would the query be? Any experienced approaches? Thanks.
The specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and
finding emoji characters in the target. These are most useful when
they include the annotations as synonyms or hints. For example, when
someone searches for ⛽︎ on yelp.com, they see matches for “gas
station”. Conversely, searching for “gas pump” in a search engine
could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would
expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real Unicode character and expand it to its annotation in each language you aim to support.
This can be done with a synonym filter. But the Elasticsearch standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifier, clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
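The four steps can be emulated outside Elasticsearch to see what the resulting tokens would look like. The sketch below is only an approximation in Python: the synonym table `EMOJI_SYNONYMS` is a hypothetical two-entry sample, `analyze` is a made-up name, and it assumes emoji are separated from words by whitespace.

```python
import string

# Hypothetical annotations; a real setup would load them per language.
EMOJI_SYNONYMS = {
    "\U0001F603": ["smile", "happy"],
    "\u26FD": ["gas", "station", "fuel"],
}
# Variation selectors and skin-tone modifiers (U+1F3FB..U+1F3FF) to drop.
MODIFIERS = {"\uFE0E", "\uFE0F"} | {chr(c) for c in range(0x1F3FB, 0x1F400)}


def analyze(text):
    # 1. Remove emoji modifiers.
    cleaned = "".join(ch for ch in text if ch not in MODIFIERS)
    tokens = []
    # 2. Tokenize on whitespace.
    for raw in cleaned.split():
        # 3. Remove undesired (ASCII) punctuation around the token.
        token = raw.strip(string.punctuation).lower()
        if not token:
            continue
        tokens.append(token)
        # 4. Expand the emoji to its synonyms, keeping the emoji itself.
        tokens.extend(EMOJI_SYNONYMS.get(token, []))
    return tokens


print(analyze("Need \u26FD\uFE0E now."))
```

Because the emoji itself stays in the token stream, a search for either the character or one of its annotations can find the document.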
The way I have seen emoticons work is that a string is stored in place of their image counterparts when you store them in a database. For example, a smile is stored as :smile:. You can verify whether that is true in your case. If it is, you can add a custom tokenizer that does not split on colons, so that an exact match for the emoticons can be made. Then, while searching, you just need to convert the emoticon image in the search to the appropriate string and Elasticsearch will be able to find it. Hope it helps.

Ignore elements in cts:search

I have some XML documents with a structure like this:
<root>
  <intro>...</intro>
  ...
  <body>
    <p>..................
    some text CO<sub>2</sub>
    .................. </p>
  </body>
</root>
Now I want to search for the phrase CO2 and also get documents of the above type in the search results.
For this purpose, I am using this query:
cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)
But with this query I do not get documents containing CO<sub>2</sub>. I only get documents with the plain phrase CO2.
If I change the search phrase to CO 2, then I get only documents with CO<sub>2</sub> and not those with CO2.
I want both CO<sub>2</sub> and CO2 in the search results.
So can I ignore <sub> by any means, or is there any other way to handle this problem?
The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
You probably don't want this to apply everywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents. Fields in general and tokenization overrides in particular will come at a performance cost. The contents of a field are indexed completely separately, so the index is bigger, and the tokenization overrides mean that we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).
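The tokenization argument can be illustrated with a toy tokenizer. The Python sketch below is not MarkLogic code and does not reproduce its tokenizer; `tokenize` and its `digits_as_symbols` flag are hypothetical stand-ins for default tokenization and for a field whose overrides treat each digit as a symbol.

```python
import re


def tokenize(text, digits_as_symbols=False):
    # Stand-in for phrase-through: the <sub> tags are dropped, but the
    # element boundary still separates tokens.
    text = re.sub(r"</?sub>", " ", text)
    if digits_as_symbols:
        # Override: every digit becomes a token of its own.
        pattern = r"[A-Za-z]+|\d"
    else:
        # Default: letters and digits run together into one word token.
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)


for sample in ("CO2", "CO<sub>2</sub>"):
    print(sample, tokenize(sample), tokenize(sample, digits_as_symbols=True))
```

With the default rule the two spellings produce different token sequences, so neither query matches the other form. With the digit override both become the same two tokens, which is what lets a single field word query find both.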
It appears that you want to add a phrase-through configuration.
Example:
<p>to <b>be</b> or not to be</p>
A phrase-through on <b> would then be indexed as "to be or not to be"

Stemming and partial search using MongoDB 2.4

What is the correct way of doing full text search and partial searches in MongoDB?
E.g. the Norwegian word "sokk" (sock).
When searching for "sokk" I want to match on "sokker" (sock in plural), "sokk" and "sokkepose"
A search for "sokker" should match "sokk" and "sokker".
I get the wanted result by using this ruby snippet:
def self.search(q)
  result = []
  # Full text search first
  result << Ad.text_search(q).to_a
  # Then search for parts of the word
  result << Ad.any_of({ title: /.*#{q}.*/i }, { description: /.*#{q}.*/i} ).to_a
  result.flatten!
  result.uniq
end
Any suggestions? :)
Cheers,
Martin Stabenfeldt
Martin,
A few suggestions / recommendations / corrections:
Full Text Search in 2.4 is not production ready and should not be deployed in production without knowing the tradeoffs being made. You can find more details at - http://docs.mongodb.org/manual/tutorial/enable-text-search/
For Text Search to work, you need to provide the appropriate language for the document when adding it (or for specific fields in 2.6). This ensures that words are stemmed appropriately and stop words are excluded when that field is indexed.
Specify the language while searching a specific field so that the query is stemmed appropriately and stop words are removed for searching and ranking the results. You can find more details about both indexing and searching at http://docs.mongodb.org/manual/reference/command/text/ . You can also see the languages that are supported by the MongoDB FTS on that webpage.
Ideally you would not be using regular expressions while doing a full text search, but rather specify the words / strings that you are looking for along with the language.
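As a rough sketch of the last two points, this is how the document for the 2.4 `text` command could be assembled in Python with an explicit language. The helper `build_text_command` and the collection name "ads" are assumptions for illustration, and sending the command through a driver is left out.

```python
def build_text_command(collection, query, language="norwegian", limit=None):
    """Assemble the document for MongoDB 2.4's `text` command."""
    # The command name must be the first key; dicts keep insertion order.
    command = {"text": collection, "search": query, "language": language}
    if limit is not None:
        command["limit"] = limit
    return command


print(build_text_command("ads", "sokker"))
```

With the language set, stemming happens on the server for both the indexed text and the search string, so the query itself needs no regular expression for the stemmed forms.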
