How to improve detection of sentences in Sphinx? - full-text-search

Sphinx makes it possible to search for words within one sentence. For example, we have the following text:
Вася молодец, съел огурец, т.к. проголодался. Такие дела.
(roughly: "Vasya did well, he ate a cucumber because he got hungry. That's how it is."; here "т.к." abbreviates "так как", i.e. "because")
If I search
молодец SENTENCE огурец
I find this text. If I search
молодец SENTENCE проголодался
I can't find this text, because the dot in the abbreviation т.к. is treated as the end of a sentence.
As far as I can see, the set of sentence delimiters is hardcoded in Sphinx's sources.
My question is: how can I improve sentence detection? The best approach for me would be to use Yandex's Tomita parser or another NLP library with smarter sentence detection.

Split the text into sentences with Yandex's Tomita parser; this gives you the text split by "\n", one sentence per line.
Delete every ".", "!" and "?" except the last one in each sentence.
Build the Sphinx index from this preprocessed data.
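For illustration, a minimal Ruby sketch of the first two steps, assuming Tomita's output has already been saved with one sentence per line (the file names here are hypothetical):
# Hypothetical input: Tomita's output, one sentence per line.
sentences = File.readlines("sentences.txt", chomp: true).reject(&:empty?)

cleaned = sentences.map do |sentence|
  # Drop every ".", "!" and "?" except the final terminator, so that
  # abbreviations like "т.к." are no longer mistaken for sentence ends.
  body, terminator = sentence[0..-2], sentence[-1]
  body.delete(".!?") + terminator
end

# Hypothetical output file to feed to the Sphinx indexer.
File.write("preprocessed.txt", cleaned.join("\n"))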

Related

Regular expression to match hashtags in both English and Chinese

I'm trying to write a regex to extract hashtag content in both English and Chinese. Hashtags in Chinese are indicated differently from hashtags in English. Two hashtag symbols are used, and the content is put right in between them, such as #中国#. Also, spaces are not used in Chinese. An example is
我来自#中国#。
The corresponding sentence in English is
I'm from #China.
Is it possible to write a single regex to extract hashtags in these two languages? If so, how?
string = "我来自#中国#。 I'm from #China."
string.scan(/#\w+|#\p{Han}+#/)
=> ["#中国#", "#China"]

How to search emoticon/emoji in elasticsearch?

I am trying to search for text containing emoticons/emoji in Elasticsearch. I have already inserted tweets into ES, and now I want to search for tweets related to, for example, smiling or sad faces. I tried the following:
1) I used the Unicode escape for the smile emoji, but it didn't work; no results were returned.
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\u1f603"
    }
  }
}
How do I set up emoji search in Elasticsearch? Do I have to encode the raw tweets before ingesting them into Elasticsearch? What would the query look like? Any experienced approaches? Thanks.
The specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and finding emoji characters in the target. These are most useful when they include the annotations as synonyms or hints. For example, when someone searches for ⛽︎ on yelp.com, they see matches for “gas station”. Conversely, searching for “gas pump” in a search engine could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real Unicode character and expand it to its annotation in each language you aim to support.
This can be done with a synonym filter, but Elasticsearch's standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifiers and clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
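For illustration, a rough sketch of index settings along those lines, written as a Ruby hash; the analyzer and filter names and the synonym rules are placeholders, not taken from the linked article:
emoji_settings = {
  settings: {
    analysis: {
      filter: {
        # Placeholder synonym rules: expand each emoji to its annotation.
        emoji_synonyms: {
          type: "synonym",
          synonyms: [
            "⛽ => gas station, fuel",
            "😄 => smile, happy face"
          ]
        }
      },
      analyzer: {
        # The whitespace tokenizer keeps the emoji; the standard tokenizer would drop them.
        emoji_analyzer: {
          tokenizer: "whitespace",
          filter: ["lowercase", "emoji_synonyms"]
        }
      }
    }
  }
}
# Send this body when creating the index, e.g. with the elasticsearch gem:
#   client.indices.create(index: "tweets", body: emoji_settings)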
The way I have seen emoticons work is that a string is actually stored in place of their image counterparts when they are saved in a database. For example, a smile is stored as :smile:. You can verify whether that is the case for you. If so, you can add a custom tokenizer which does not split on colons, so that an exact match for the emoticons can be made. Then, while searching, you just need to convert the emoticon image in the search query into the appropriate string and Elasticsearch will be able to find it. Hope it helps.
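If your data does store emoticons as :smile:-style strings, the query-time conversion can be a simple lookup; the Ruby mapping below is hypothetical and only covers two emoji:
# Hypothetical mapping from emoji characters to the stored shortcode strings.
EMOJI_TO_SHORTCODE = {
  "😄" => ":smile:",
  "😢" => ":cry:"
}

def to_searchable(query)
  # Replace each known emoji in the user's query with its stored shortcode.
  query.gsub(Regexp.union(EMOJI_TO_SHORTCODE.keys), EMOJI_TO_SHORTCODE)
end

to_searchable("tweets with 😄")
# => "tweets with :smile:"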

Highlighting a query word in a document

I have a document and a query term. I want to:
find the query term in the document;
pad each occurrence of the query term with a certain text marker.
For example
Text: I solemnly swear that I am up to no good.
Query: swear
Output: I solemnly MATCHSTART swear MATCHEND that I am up to no good.
Assuming that I have multiple query words and a large document, how can I do this efficiently?
I did go over various links on the internet but couldn't find anything very conclusive or definite. Moreover, this is just a programming question and has nothing to do with search engine development or information retrieval.
Any help would be appreciated. Thanks.
If each of your queries is a single word (some substring that does not contain spaces, tabs, newlines, etc.), and a false positive with very low probability is acceptable (i.e. occasionally marking a word that is not in the query set), you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
First, load your query words into the Bloom filter, then scan the document and test each word against the filter. If the lookup is positive, mark that word.
You can use my implementation of a Bloom filter: http://olegh.cc.st/src/bloom.c.txt
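A minimal Ruby sketch of the scan-and-mark loop, using a Set as a stand-in for the Bloom filter (a real Bloom filter would use less memory but allow rare false positives):
require "set"

queries = Set.new(["swear"])
text = "I solemnly swear that I am up to no good."

marked = text.split(/(\s+)/).map do |token|
  # Strip punctuation before the lookup, but keep the original token in the output.
  word = token.gsub(/[[:punct:]]/, "")
  queries.include?(word) ? "MATCHSTART #{token} MATCHEND" : token
end.join
# => "I solemnly MATCHSTART swear MATCHEND that I am up to no good."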
In Python:
text = "I solemnly swear I am up to no good" #read in however you like.
query = input("Query: ")
text.replace(" "+query" "," MATCHSTART "+query+" MATCHEND ")
OUTPUT:
'I solemnly MATCHSTART swear MATCHEND that I am up to no good.'
You could also use a regex, but that's slower, so I just used string concatenation to add whitespace to the beginning and end of the word (so as not to match "swears", "swearing", or "sportswear"). This is easily translatable to whatever language you prefer.

Using an "uncommon" delimiter for creating arrays in Ruby on Rails

I am building an app in Ruby on Rails in which I am pulling in content from another file, and I wonder if there's any simple way to create a unique delimiter for separating string content, or whether there's another approach I should take.
Let's say I have a paragraph of text I'd like to pull in, and let's say I don't know what the text will contain.
What I would like to do is put some sort of delimiter at, let's say, 5 random points in the paragraph so that, later on, an array can be created in which content up to that delimiter can be separated out into an individual element.
For a bit of context, let's say I have a paragraph pulled in as a string:
Hello, this is a paragraph of text which will be delimited. Goodbye.
Now, let's say I add a delimiter at various points, as follows (I know how to do this in code):
Hello, this [DELIMITER] is a paragraph [DELIMITER] of text which [DELIMITER] will [DELIMITER] be delimited. Goodbye.
Again, I know how to do this, but let's say I'm able to use the above to create an array as follows:
my_array = ["Hello, this", "is a paragraph", "of text which", "will", "be delimited. Goodbye."]
I'm confident of achieving all of the above. The challenge I'm having is: what should my delimiter be?
Normally, commas are used as delimiters but, if the text already includes a comma, this will result in splits where I do not wish them to occur. In the example above, the comma between "Hello" and "this" would cause the "Hello, this" element to be split into "Hello" and "this", which is not what I want.
What I have thought of doing is using a random (hex) number generator to create a new delimiter each time the page is loaded, e.g. "Hello, this 023ABCDEF is a paragraph 023ABCDEF...", but I'm not sure this is the correct approach.
Is there a simpler solution?
Multipart MIME messages take (more or less) the approach of a GUID separator; it's adequate.
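A minimal Ruby sketch of that idea, using SecureRandom.uuid as the separator (where the boundaries go is up to your application; the three-word slices below are just a stand-in):
require "securerandom"

delimiter = "[[#{SecureRandom.uuid}]]"
text = "Hello, this is a paragraph of text which will be delimited. Goodbye."

# Insert the delimiter at the chosen boundaries (here, every three words).
marked = text.split(" ").each_slice(3).map { |words| words.join(" ") }.join(" #{delimiter} ")

marked.split(" #{delimiter} ")
# => ["Hello, this is", "a paragraph of", "text which will", "be delimited. Goodbye."]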
I view this as a different type of problem, though, closer to a text editor marking sections of text bold, or italic, etc. That can be handled via string parsing (a la Markdown, SO's formatting) or data structures.
The text editor approach is generally more flexible, and instead of a simple collection of strings, uses a collection (or tree) of structures that hold metadata about the section (type, formatting, whatever).
The best approach depends on your needs:
Are sections nestable?
Will this be rendered?
If so, do section "types" need specific rendering?
Are there section "types", or are they all the same?
Will the text in question be edited before, during, or after sectioning?
Etc.
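As a rough sketch of the data-structure route, each section could carry its text plus metadata; the field names here are made up:
# Hypothetical section record: text plus metadata about how to treat it.
Section = Struct.new(:text, :type, :formatting)

sections = [
  Section.new("Hello, this", :plain, nil),
  Section.new("is a paragraph", :emphasis, { bold: true })
]

# Rendering can then branch on the metadata instead of guessing from delimiters.
sections.map { |s| s.type == :emphasis ? "<strong>#{s.text}</strong>" : s.text }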

Ruby: Fast way to filter out keywords from text based on an array of words

I have a large text string and about 200 keywords that I want to filter out of the text.
There are numerous ways to do this, but I'm stuck on which way is best:
1) Use a loop with a gsub for each keyword
2) Use one massive regular expression
Any other ideas? What would you suggest?
A massive regex is faster as it's going to walk the text only once.
Also, if at the end you only need the words and not the text itself, you can turn the text into a Set of downcased words and then remove the words that are in the filter array. But this only works if you don't need the "text" to make sense afterwards (it's usually fine for tags or full-text search).
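A short Ruby sketch of both suggestions, with a made-up keyword list and sample text:
require "set"

keywords = %w[foo bar baz]   # placeholder keyword list
text = "Foo went to the bar, ignored the baz, and came home."

# One combined regex walks the text a single time (word boundaries, case-insensitive).
pattern = /\b(?:#{keywords.map { |k| Regexp.escape(k) }.join("|")})\b/i
filtered_text = text.gsub(pattern, "")

# Set variant: keep only the distinct downcased words that are not in the keyword list.
remaining_words = Set.new(text.downcase.scan(/[[:alnum:]]+/)) - keywords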
Create a hash with each valid keyword as key.
keywords = %w[foo bar baz]
keywords_hash = Hash[keywords.map{|k|[k,true]}]
Assuming all keywords are 3 letters or more, and consist of
alphanumeric characters or a dash, case is irrelevant,
and you only want each keyword present in the text returned once:
keywords_in_text = text.downcase.scan(/[[:alnum:][-]]{3,}/).select { |word|
keywords_hash.has_key? word
}.uniq
This should be reasonably efficient even when both the text to be searched and the list of valid keywords are very large.
