Search using Xapian Omega - with Wild Cards or Regular Expressions - full-text-search

We are confronting different search engines for our research
archives and having browsed the Xapian-Omega documentation, we
decided to try it out since the Omega option appears to be an
appropriate solution with several interesting search options.
We installed Xapian-Omega on a Linux Server (Deb 7) and tested
the setup with success. However we are unsure as to how one can
employ or perhaps even enable the use of Wild Cards or Regular
Expressions with Xapian-Omega.
We read that for Xapian one has to enable the Wild Card option
"QueryParser flags"
Could someone clarify this ?
ie. explain with or indicate a page with an example or two.
But we did not see much information regarding examples with Omega
CGI and although this latter runs well, wild card options
(such as * for the general wild card and ? as a single character),
do not seem to work as expected by default and they would be
useful, even though stemming and substrings etc may be functional.
Eg: It would be interesting to be able to employ standard simple
wild char searches with a certain precision such as :
medic* for medicine medical medicament
or with ? for single characters
Can Regexp be recognised with Omega ?
eg : sep[ae]r[ae]te(\w+)?
or searching for structured formats such as Email or Credit Card
Numbers or certain formula types in research papers etc.
In a note from Olly Betts long ago (Dev Mailing List) regarding
this one suggestion was to grep the index file but this would
defeat the RAD advantage of Omega.
Any examples of searches using Omega with Wild Cards or Regular
Expressions would be most appreciated ... even an indication of
a page where information regarding this theme is well presented
with examples illustrating how to develop advanced searches
using Xapian alone would be most welcome (PHP or Python perhaps).
(We are not concerned for the moment about the eventual
substantial increase in the size of the index size or in the
time to index the archive)

You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed them into the search. Where it becomes difficult is if the regex expands to a lot of terms (eg just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.

James Ayatt: Thank you for your answer and help, my apologies for this belated reply, a distraction with other work.
We had already seen the Omegascript page but it was not clear to us how to employ these options with the CGI interface. Also the use of * seems to be for trailing chars, is that correct ? ie not for internal groups of words eg: omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wild chars, sometimes represented by ? in certain search engines. Could you comment here ?
Regarding the use of regular expressions we had immagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses, there are of course many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extractiion of such text, perhaps together with some surrounding text could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.

Related

How to define aspell word delimiters?

Aspell considers words with underscores or dashes as two, e.g. cloud-based is spell checked as "cloud" and "based". Is there any way to specify the word delimiters so as to exclude dash and underscore?
If I understand the question correctly, Aspell cannot do exactly what you want (up to my knowledge). This has to do with conditional compound word treatment, which is on the Aspells TODO list.
On the same list it is mentioned that Hunspell does a better job with compound words, so it might be a viable alternative if you're not bound to Aspell.
OpenOffice uses Hunspell for spellchecking, so it is easy to find out whether it fits your requirements. It does, at least, work for the "cloud-based" example, and does NOT consider all hyphenated words unconditional compounds, i.e. "based-cloud" would not be considered a spelling error.
Aspell is unable to do what you want it to do at this point. The interface it uses for handling word with symbols in them is not sophisticated enough to handle such a case at this time. More information on this is listed here.
Sorry that this cannot be solved up to this point, unless you want to implement your own interface. I would recommend using Hunspell as Mikhail suggested.

How does spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (may be by mistake or may be when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is :
count no. of instances of each character and then scan dictionary to find a word with same no. of instances of each character (only with +-1 difference). But this will also return anagrams.
Is some kind of probabilistic model of any use here such as Markov etc. I don't understand Markov well enough to throw it around but just a very wild guess.
Any insights?
You're forgetting that google has a lot more information available to it then you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When you search something which is related to other searches performed earlied closed to yours and got more results, google shows suggest on them.
We are sure that it is not spell checking but it shows what other people queried the related keywords.

Captcha for Japanese and Chinese?

Normally I use Recaptcha for all captcha purposes, but now I'm building a website that is translated into Chinese and Japanese, among other languages. I'd like to make the captcha as accessible to those users as possible. Even if they can read and type English characters (which is not necessarily the case), often times even I as an English-speaker have had trouble figuring out what the word in Recaptcha has to be.
One good solution I've seen (from Google) is to use numbers instead of text. Are there other good solutions? Is there a reliable free captcha service out there such as Recaptcha that offers this option?
The Chinese and Japanese both use a keyboard with Latin characters on. The Chinese input their 1000s of characters via Pinyin (Romanized Chinese) and so they are very familiar with all the same letters that you and I are. Therefore, whatever you are using for English speaking people can also be used for them.
PS - I know this is an answer to an old post, but I'm hoping this answer will help anyone who comes here with the same question.
I have encountered the same problem in the past, I resolved the issue by using the following CAPTCHA which uses a numerical validation:
http://www.tipstricks.org/
However, this may not be the best solution for you, so here is an extensive list of different CAPTCHAs you might want to consider (most of them are text based, but some use alternative methods such as numerical expressions):
http://captcha.org/
Hope this helps

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. I.e. English capitalizes only the first word of the sentence and proper nouns. For the gramatical rules you could look for all of the words that have the first letter of the word capitalized and check it against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have a specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper—and not just the text from the page—I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only falll back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
I'd guess that Google takes a two step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do it's probably an address if they don't it probably isn't. If you can use a map database in your code that will probably make your life easier.
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions. Use an existing HTML parser, for example in Python I strongly recommend BeautifulSoup. Even if you use a regular expression to parse the HTML elements BeautifulSoup grabs.
If you do it with your own regexs, you not only have to worry about finding the data you require, you have to worry about things like invalid HTML, and lots of other very non-obvious problems you'll stumble over..
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of them time, writing one that will get it exactly right everytime is fiendishly hard. There are plenty of strange corner cases and in several cases there is no single unambiguous answer. Most web sites that I've seen to a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route your best bet is probably to check out the sourcecode of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess if a string is an address or not by an expression like "(number), (name) Street|Boulevard|Main", etc
You can consider looking into some firefox extensions which aim to map addresses found in text to see how they work
You can check this USA extraction example http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends upon your requirement.
for email and contact details regex is more than enough.
For addresses regex alone will not help. Think about NLP(NER) & POS tagging.
For finding people related information you cant do anything without NER.
If you need information like paragraphs get the contents by using tags.

Resources