Culture-independent stemmer/analyzer for Lucene.NET - internationalization

We're currently developing a full-text-search-enabled app, and Lucene.NET is our weapon of choice. The app is expected to be used by people from different countries, so Lucene.NET has to be able to search across Russian, English and other texts equally well.
Are there any universal and culture-independent stemmers and analyzers to suit our needs? I understand that eventually we'd have to use culture-specific ones, but we want to get up and running with this potentially quick and dirty approach.

Given that the spelling, grammar and character sets of English and Russian are significantly different, any stemmer which tried to do both would either be massively large or poorly performant (most likely both).
It would probably be much better to use a stemmer for each language, and pick which one to use based on either UI clues (what language is being used to query) or by explicit selection.
Having said that, it's unlikely that any Russian text will match an English search term correctly or vice-versa.
This sounds like a case where a little more business analysis would help more than code.

There is no such thing as a language-independent stemmer. In fact, whether stemming improves retrieval performance varies per language. The best you can do is language guessing on the documents and queries, then dispatch to the appropriate analyzer/stemmer.
Language guessing on short queries is hard, though (as in state-of-the-art, not quick 'n' dirty). If your queries are short, you might want to use a simple whitespace analyzer on the queries and not stem anything.
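To make the "guess, then dispatch" idea concrete, here is a minimal Python sketch. It is not Lucene.NET code: guess_language, whitespace_analyzer and analyze are hypothetical names, and the guess is a crude assumption that looks only at the dominant script (Cyrillic is treated as Russian, Latin as English). A real system would plug language-specific analyzers into the table and use a proper language identifier.

```python
import unicodedata


def guess_language(text):
    """Crude guess from the dominant script: 'ru' for Cyrillic, 'en' for Latin, else None."""
    counts = {"ru": 0, "en": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["ru"] += 1
        elif name.startswith("LATIN"):
            counts["en"] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def whitespace_analyzer(text):
    """Fallback analyzer: lowercase and split on whitespace, no stemming."""
    return text.lower().split()


def analyze(text, analyzers, fallback=whitespace_analyzer):
    """Pick the analyzer registered for the guessed language, else the fallback."""
    lang = guess_language(text)
    return lang, analyzers.get(lang, fallback)(text)


# Placeholder table: a real one would map each language to its own stemming analyzer.
ANALYZERS = {"ru": whitespace_analyzer, "en": whitespace_analyzer}
```

Calling analyze("Full text search", ANALYZERS) picks the "en" entry, while a query with no letters at all falls through to the plain whitespace analyzer, which matches the advice above for inputs that are hard to classify.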

Related

How can I efficiently find all people mentioned in some text, while tolerating spelling mistakes?

I have a list of names of millions of famous people (from Wikidata), and I need to create a system that efficiently finds all people mentioned in a fairly short text: it can be just one word (eg. "Einstein") to a few pages of text (eg. a Wikipedia page).
I need the system to be fairly tolerant to spelling mistakes (eg. Mikael Jackson instead of Michael Jackson), and short forms (eg. M. Jackson). In case of ambiguity, it should return all possible people (eg. "George Bush" should return both father and son, and possibly other homonyms).
This related question has a few interesting answers, including using the Aho-Corasick algorithm. There are libraries in many languages, including in Python. However, it does not seem to support fuzzy search (ie. tolerate misspellings).
I guess I could extend the vocabulary to include all the possible spellings of each name, but that would make the vocabulary too large, so I would rather avoid that if possible (moreover, I may want to extend this solution to more than just people at one point).
I took a quick look at Lucene/ElasticSearch but it does not seem to support this use case (unless I missed it).
Any ideas?
Elasticsearch has support for fuzzy matching: See documentation here.
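For intuition about what fuzzy matching means here, the following Python sketch matches a misspelled name against a small table using Levenshtein edit distance, the measure Elasticsearch's fuzziness setting is based on. This is not how Elasticsearch implements it: fuzzy_lookup, PEOPLE and the max_edits default are assumptions for illustration, and the linear scan would be far too slow for millions of names, which is where a real index comes in.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(token, names_by_key, max_edits=2):
    """Return every person whose key is within max_edits of token (linear scan)."""
    token = token.lower()
    hits = []
    for key, people in names_by_key.items():
        if levenshtein(token, key) <= max_edits:
            hits.extend(people)
    return hits


# Hypothetical sample data: one normalized key can map to several people.
PEOPLE = {
    "michael jackson": ["Michael Jackson"],
    "george bush": ["George H. W. Bush", "George W. Bush"],
    "albert einstein": ["Albert Einstein"],
}
```

Mapping each key to a list is what lets an ambiguous name such as "George Bush" return every candidate rather than a single winner.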

How to get informal synonyms (ie technology => tech)?

How can I get informal synonyms or abbreviations for a word? I tried using stemmers (like the Porter filter) and thesauruses, but they don't seem to recognize "informal" synonyms for a word. I guess my examples below are not really synonyms, but rather abbreviations.
Examples include:
Technology => Tech
Business => Biz
Applications => Apps
To the best of my knowledge, there is no such library available. The synonyms/abbreviations you mention in the question are a part of the evolutionary nature of any natural language. That is to say, hard-coding such a list will never ever give you a complete list of such equivalences.
The only good long (or even medium) term solution is to use appropriate NLP/ML paradigms to "learn" them. Such equivalences are highly context dependent. For example:
NLP == natural language processing OR neuro-linguistic programming (ambiguous acronym)
Ft. == foot OR featuring (ambiguous abbreviation)
A historical (and slightly philosophical) presentation of this context dependence is explained here. For a more day-to-day example, see this Wikipedia disambiguation page (this is the second example in the above list).
Basically, what I am trying to illustrate here is that there is no off-the-shelf tool/library for this because resolving synonymy (especially colloquial terms, abbreviations, etc.) is a difficult problem.
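A small Python sketch shows why a hard-coded table falls short. The entries are taken from the examples in this thread; ABBREVIATIONS and expand are hypothetical names. The table can list candidate expansions, but it has no way to choose between them without context, and it knows nothing about terms nobody has added yet.

```python
# Hand-maintained table: every short form maps to all expansions we know about.
ABBREVIATIONS = {
    "tech": ["technology"],
    "biz": ["business"],
    "apps": ["applications"],
    "nlp": ["natural language processing", "neuro-linguistic programming"],
    "ft.": ["foot", "featuring"],
}


def expand(term):
    """Return the known expansions of term, or the term itself if it is not listed."""
    return ABBREVIATIONS.get(term.lower(), [term])
```

Here expand("NLP") returns two candidates, and picking the right one is exactly the context-dependent part that a lookup table cannot do.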

Implementing data structures/algorithms in languages that already support them

Does it make sense to implement your own version of data structures and algorithms in your language of choice even if they are already supported, knowing that care has been taken in tuning them for the best possible performance?
Sometimes - yes. You might need to optimise the data structure for your specific case, or give it some specific extra functionality.
A Java example is Apache Lucene (a mature, widely used information retrieval library). Although the Map<S,T> interface and its implementations already exist, they are not good enough where performance matters, because a Map<Integer,Integer> boxes each int into an Integer. A more optimized IntToIntMap was developed for this purpose instead.
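To show the kind of specialization meant here, below is a rough Python sketch of an int-to-int map that stores keys and values in flat arrays of machine integers and resolves collisions by linear probing. It is not Lucene's implementation and only borrows the name. It assumes keys are non-negative and fit in 64 bits, and in Python it will not beat the built-in dict; it only illustrates the layout that avoids one boxed object per entry.

```python
from array import array


class IntToIntMap:
    """Open-addressing int -> int map backed by two flat arrays of 64-bit integers."""

    _EMPTY = -1            # assumption: keys are non-negative, so -1 marks a free slot
    _INITIAL_CAPACITY = 16  # kept a power of two so masking can replace modulo

    def __init__(self):
        self._keys = array("q", [self._EMPTY] * self._INITIAL_CAPACITY)
        self._values = array("q", [0] * self._INITIAL_CAPACITY)
        self._size = 0

    def _slot(self, key):
        # Linear probing: walk forward until we find the key or a free slot.
        mask = len(self._keys) - 1
        i = key & mask
        while self._keys[i] != self._EMPTY and self._keys[i] != key:
            i = (i + 1) & mask
        return i

    def put(self, key, value):
        if key < 0:
            raise ValueError("keys must be non-negative")
        # Keep the table at most half full so probing always finds a free slot.
        if (self._size + 1) * 2 > len(self._keys):
            self._grow()
        i = self._slot(key)
        if self._keys[i] == self._EMPTY:
            self._keys[i] = key
            self._size += 1
        self._values[i] = value

    def get(self, key, default=None):
        if key < 0:
            return default
        i = self._slot(key)
        return self._values[i] if self._keys[i] == key else default

    def _grow(self):
        # Double the capacity and re-insert every live entry.
        old_keys, old_values = self._keys, self._values
        self._keys = array("q", [self._EMPTY] * (len(old_keys) * 2))
        self._values = array("q", [0] * (len(old_values) * 2))
        self._size = 0
        for k, v in zip(old_keys, old_values):
            if k != self._EMPTY:
                self.put(k, v)
```

The trade-off is the one described above: the structure is tuned for one usage pattern (integer keys, no deletion) and gives up the generality of the standard map.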
The question contains a false assumption, that there's such a thing as "best possible performance".
If the already-existing code was tuned for best possible performance with your particular usage patterns, then it would be impossible for you to improve on it in respect of performance, and attempting to do so would be futile.
However, it wasn't tuned for best possible performance with your particular usage. Assuming it was tuned at all, it was designed to have good all-around performance on average, taken across a lot of possible usage patterns, some of which are irrelevant to you.
So, it is possible in principle that by implementing the code yourself, you can apply some tweak that helps you and (if the implementers considered that tweak at all) presumably hinders some other user somewhere else. But that's OK, they don't have to use your code. Maybe you like cuckoo hashing and they like linear probing.
Reasons that the implementers might not have considered the tweak include: they're less smart than you (rare, but it happens); the tweak hadn't been invented when they wrote the code and they aren't following the state of the art for that structure / algorithm; they have better things to do with their time and you don't. In those cases perhaps they'd accept a patch from you once you're finished.
There are also reasons other than performance that you might want a data structure very similar to one that your language supports, but with some particular behavior added or removed. If you can't implement that on top of the existing structure then you might well do it from scratch. Obviously it's a significant cost to do so, up front and in future support, but if it's worth it then you do it.
It may make sense when you are using a compiled language (like C or assembly).
When using an interpreted language you will probably have a performance loss, because the native structure parsers are already compiled, and won't waste time "interpreting" the new structure.
You will probably do it only when the native structure or algorithm lacks something you need.

Automate Finding Pertinent Methods in Large Project

I have tried to be disciplined about decomposing into small reusable methods when possible. As the project grows, I find myself re-implementing the exact same method.
I would like to know how to deal with this in an automated way. I am not looking for an IDE specific solution. Dependency on method names may not be sufficient. Unix and scripting are solutions that would be extremely beneficial. Answers such as "take care" etc. are not the solutions I am seeking.
I think the cheapest solution to implement might be to use Google Desktop. A more accurate solution would probably be much harder to implement - treat your code base as a collection of documents where the identifiers (or tokens in the identifiers) are words of the document, and then use document clustering techniques to find the closest matching code to a query. I'm aware of some research similar to that, but nothing close to out-of-the-box code that you could use. You might try looking on Google Code Search for something. I don't think they offer a desktop version, but you might be able to find some applicable code you can adapt.
Edit: And here's a list of somebody's favorite code search engines. I don't know whether any are adaptable for local use.
Edit2: Source Code Search Engine is mentioned in the comments following the list of code search engines. It appears to be a commercial product (with a free evaluation version) that is intended to be used for searching local code.
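The "code base as a collection of documents" idea can be prototyped in a few lines of Python. This is a minimal sketch, not the research systems mentioned: it swaps clustering for a plain cosine-similarity ranking over bags of identifier tokens, and identifier_tokens, cosine and rank_methods are hypothetical names.

```python
import math
import re
from collections import Counter


def identifier_tokens(source):
    """Split the identifiers found in source text into lowercase word tokens."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        # Break snake_case on underscores, then camelCase on case changes.
        for part in ident.split("_"):
            words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
            tokens.extend(w.lower() for w in words)
    return tokens


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_methods(query, methods):
    """methods maps a method name to its source text; return names, best match first."""
    q = Counter(identifier_tokens(query))
    scored = [(cosine(q, Counter(identifier_tokens(src))), name)
              for name, src in methods.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

Because the tokens come from inside identifiers, a query like "http response parse" can find a method named parseHttpResponse even though no identifier matches the query verbatim. Run from a script over the files in a repository, this stays independent of any IDE.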

Is this backwards naming convention a bad idea (ie. contrary to industry standards)?

I've always reversed names so that they naturally group in intellisense. I am wondering if this is a bad idea.
For example, I run a pet store and I have invoicing pages add, edit, delete, and store pages display, preview, edit. To get the URL for these, I would call the methods (in a suitable class like GlobalUrls.cs):
InvoicingAddUrl()
InvoicingEditUrl()
InvoicingDeleteUrl()
StoreDisplayUrl()
StorePreviewUrl()
StoreEditUrl()
This groups them nicely in intellisense. More logical naming would be:
AddInvoiceUrl()
EditInvoiceUrl()
DeleteInvoiceUrl()
DisplayStoreUrl()
PreviewStoreUrl()
EditStoreUrl()
Is it better (better being, more of an industry standard way) to group them for intellisense, or logically?
Grouping in Intellisense is just one factor in creating a naming scheme, but logically grouping by category rather than function is a common practice as well.
Most naming "conventions" dictate usage of characters, casing, underscores, etc. I think it is a matter of personal preference (company, team or otherwise) as to whether you use NounVerb or VerbNoun formatting for your method names.
Here are some resources:
Microsoft - General Naming Conventions
Wikibooks C# Programming/Naming
Akadia .NET Naming Conventions
Related questions:
Naming Conventions - Guidelines for Verbs, Nouns and English Grammar Usage
Do vs. Run vs. Execute vs. Perform verbs
Events - naming convention and style
Check out how the military names things. For example, MREs are Meals, Ready to Eat. They do this because of sort order, efficiency and not making mistakes. They are ready to ignore the standard naming conventions of the language (i.e., English) used outside of their organization because they are not impressed with the quality of operations outside of their organization. In the military, the quality of operations is literally a matter of life and death. Also, by doing things their own way they have a way of identifying who is inside and who is outside of the organization. Anyone unable or unwilling to learn the military way, which is different but not impossibly difficult, is not their first choice for recruitment or promotion.
So, if you are impressed with the standard quality of software out there, then by all means keep doing what everyone else is doing. But if you wish to do better than you have in the past, or better than your competitors, then I suggest looking at other fields for lessons learned the hard way, such as the military. Then make some choices for your organization that are not impossibly difficult but that work for you and your competitiveness. You can choose little-endian names (most significant information comes last) or the military-style big-endian names (most significant information comes first), or you can use the dominant style your competitors probably use, which is doing whatever you feel like whenever you feel like it.
Personally, I prefer big-endian Hungarian (Apps) naming, which was widely seen as superior when it first came out, but then lost favor because Hungarian (Sys) naming destroyed the advantage due to a mistranslation of the basic idea, and because of rampant abbreviations. The original intent was to start a name with what kind of a thing it is, then become increasingly specific until you end with a unique qualification. This is also the order that most array dimensions and object qualifiers are in, so in most languages big-endian naming flows into the larger scheme of the language.
You are on to something. Forward, march.
It's not intrinsically bad. It has the upside of being easier to identify the type while scanning, and groups the options together in Intellisense like you said. As long as you and everyone else on your team picks a way of doing things and stays consistent about it there shouldn't be any big problems.
Based on the methods listed, you might be able to refactor Invoicing and Store out into their own classes, which would be closer to the mythical "industry standard" way.
That said, whatever your programming team can agree on for naming convention should be fine. The important thing is to be consistent within the project.
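As a rough illustration of that refactoring, sketched in Python rather than C#: each area gets its own class, so the grouping comes from the type instead of from a reversed method name. The class names and URL paths are made up for the example.

```python
class InvoicingUrls:
    """URL helpers for the invoicing pages (paths are hypothetical)."""

    def add(self):
        return "/invoicing/add"

    def edit(self):
        return "/invoicing/edit"

    def delete(self):
        return "/invoicing/delete"


class StoreUrls:
    """URL helpers for the store pages (paths are hypothetical)."""

    def display(self):
        return "/store/display"

    def preview(self):
        return "/store/preview"

    def edit(self):
        return "/store/edit"


class GlobalUrls:
    """Single entry point that groups the helpers by area."""

    invoicing = InvoicingUrls()
    store = StoreUrls()
```

A call then reads GlobalUrls.invoicing.add(), and autocomplete on GlobalUrls.invoicing lists only the invoicing actions, which gives the grouping without bending the method names.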
I don't think it's a good idea to develop a coding standard around a tool (at least not as the first consideration). Even though most IDEs will have Intellisense these days, and most people will be using said IDEs, I think that first and foremost a coding standard should be about making the code legible and navigable on its own merits.
I would opt for most logical naming, personally. When I write code and I have some object I'm about to call a member function on, I'm usually thinking about what member function to call based on the action I'm about to do, because I already know the object I'm manipulating. So my first impulse would be to start typing "Add" if I wanted to add something, and see what Intellisense showed me. This is, of course, subjective.
I have never actually seen anybody use your alphabetical, Intellisense-driven grouping anywhere, except in code that was so horrid in other ways that it is not worth using as a basis for comparison.
That said, if it's your standard, do what you want -- consistency is the important part.
