We are comparing different search engines for our research archives and, having browsed the Xapian-Omega documentation, we decided to try it out, since Omega appears to be an appropriate solution with several interesting search options.
We installed Xapian-Omega on a Linux server (Debian 7) and tested the setup successfully. However, we are unsure how one can use, or perhaps even enable, wildcards or regular expressions with Xapian-Omega.
We read that for Xapian one has to enable the wildcard option via the QueryParser flags. Could someone clarify this, i.e. explain it or point to a page with an example or two?
We did not find much information or many examples regarding the Omega CGI, and although it runs well, wildcard options (such as * for the general wildcard and ? for a single character) do not seem to work as expected by default. They would be useful, even though stemming, substrings etc. may be functional.
E.g. it would be interesting to be able to employ standard simple wildcard searches with a certain precision, such as:
medic* for medicine, medical, medicament
or ? for single characters.
Can regexps be recognised by Omega?
E.g. sep[ae]r[ae]te(\w+)?
or searches for structured formats such as email addresses, credit card numbers or certain formula types in research papers, etc.
In an old note from Olly Betts on the dev mailing list regarding this, one suggestion was to grep the index file, but that would defeat the RAD advantage of Omega.
Any examples of searches using Omega with wildcards or regular expressions would be most appreciated ... even a pointer to a page where this topic is well presented, with examples illustrating how to develop advanced searches using Xapian alone, would be most welcome (PHP or Python perhaps).
(We are not concerned for the moment about the eventual substantial increase in the size of the index or in the time to index the archive.)
You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
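Since you mentioned PHP or Python: outside Omega, the same flag is used through the QueryParser in the Xapian bindings. A rough, untested sketch with the Python bindings (the database path is just a placeholder for wherever omindex wrote your index):

import xapian

db = xapian.Database("/path/to/omega/index/default")   # placeholder path
qp = xapian.QueryParser()
qp.set_database(db)                       # wildcard expansion needs the database
qp.set_stemmer(xapian.Stem("english"))
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

flags = xapian.QueryParser.FLAG_DEFAULT | xapian.QueryParser.FLAG_WILDCARD
query = qp.parse_query("medic*", flags)   # expands to every term starting with "medic"

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.rank + 1, match.document.get_data())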
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support it, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed the matches into the search. Where it becomes difficult is if the regex expands to a lot of terms (e.g. just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
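There's no API for this, but a rough sketch of that approach with the Python bindings, expanding the pattern against the term list under a constant prefix and ORing the matches together, might look like this (untested; the path and prefix are placeholders):

import re
import xapian

db = xapian.Database("/path/to/index")    # placeholder path

def regex_query(db, pattern, prefix):
    # Expand the regex into the terms it matches. Restricting the scan to a
    # constant prefix keeps it affordable; with no usable prefix you would be
    # walking the entire term list.
    rx = re.compile(pattern)
    terms = [item.term.decode("utf-8")
             for item in db.allterms(prefix)
             if rx.fullmatch(item.term.decode("utf-8"))]
    return xapian.Query(xapian.Query.OP_OR, terms)

query = regex_query(db, r"sep[ae]r[ae]te\w*", "sep")
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())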
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
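A similar sketch for the spelling-correction route with the Python bindings (untested; this only produces suggestions if the index was built with spelling data, e.g. via a TermGenerator with FLAG_SPELLING):

import xapian

db = xapian.Database("/path/to/index")    # placeholder path

qp = xapian.QueryParser()
qp.set_database(db)                       # needed for spelling suggestions
qp.set_stemmer(xapian.Stem("english"))
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

flags = (xapian.QueryParser.FLAG_DEFAULT |
         xapian.QueryParser.FLAG_SPELLING_CORRECTION)
query = qp.parse_query("separete", flags)

# Empty unless the database holds spelling data and a better spelling exists.
print(qp.get_corrected_query_string() or "no correction suggested")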
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.
James Ayatt: Thank you for your answer and help, and my apologies for this belated reply; I was distracted by other work.
We had already seen the Omegascript page, but it was not clear to us how to employ these options with the CGI interface. Also, the use of * seems to be for trailing characters only, is that correct? I.e. not for internal groups of characters, e.g. omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wildcard characters, sometimes represented by ? in certain search engines. Could you comment on this?
Regarding the use of regular expressions, we had imagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses; there are many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extraction of such text, perhaps together with some surrounding text, could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.
This may be a stupid question, but here goes.
I've seen several projects using a translation library (e.g. gettext) working with plain English placeholders. So for example:
_("Please enter your name");
instead of abstract placeholders (which has always been my instinctive preference)
_("error_please_enter_name");
I have seen various recommendations on SO to work with the former method, but I don't understand why. What I don't get is: what do you do if you need to change the English wording? Because if the actual text is used as the key for all existing translations, you would have to edit all the translations too and change each key. Or wouldn't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language is already existent: you don't need to translate it.
Translators have better context with a real sentence than vague placeholders.
The placeholders are just the keys; it's still possible to change the original language by creating a translation for it, because when the translation doesn't exist, it uses the placeholder as the translated text.
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always have the problem of homonyms. 'Open' can refer to the state of a window, but also to the verb that performs the action. In other languages these homonyms may not exist. That's why you should be able to add meaning to your placeholders. The best approach is to put this meaning in your text library. If this is not possible in the platform or framework you use, it might be a good idea to define a 'development language'. This language adds meaning to the text entries, for example 'action_open' and 'state_open'. You will of course have to put extra effort into translating this language to plain English (or the language you develop for). I have applied this philosophy in some large projects, and in the long run it saves some time (and headaches).
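For what it's worth, gettext already supports attaching that kind of meaning via message contexts (msgctxt). A minimal sketch with Python's gettext module (pgettext needs Python 3.8+; the domain and locale directory are placeholders):

import gettext

t = gettext.translation("myapp", localedir="locale",
                        languages=["de"], fallback=True)

# Same English word, two contexts: translators see both the context string
# and the real text, so homonyms like "Open" stay unambiguous.
window_state = t.pgettext("window state", "Open")
menu_action = t.pgettext("menu action", "Open")
print(window_state, menu_action)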
The best way in my opinion is keeping the meaning separate, so if you develop your own translation library, or the one you use supports it, you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
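A toy sketch of such a wrapper (everything here is hypothetical; a real implementation would delegate to gettext or whatever backend you use):

# Hypothetical in-memory catalogue keyed by the abstract meaning.
CATALOGUE = {
    "de": {"error_please_enter_name": "Bitte geben Sie Ihren Namen ein"},
}

def i18n(text, meaning, lang="de"):
    # Look up by the abstract key; fall back to the plain-English text.
    return CATALOGUE.get(lang, {}).get(meaning, text)

print(i18n("Please enter your name", "error_please_enter_name"))
print(i18n("Please enter your name", "prompt_please_enter_name", lang="fr"))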
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is best for scalability, but it does just require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
fr_FR
fr
Unlocalised. Source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided in step (1) or (2), you will at least get a proper, understandable sentence rather than random programmer garbage like “error_file_not_found”.
Plus, what do you do if it is a format string: “Sorry but the %s does not exist”? Worse still: “Written %s entries to %s, total size: %d”?
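A small illustration of both points with Python's gettext (the domain and directory names are placeholders): the lookup falls back from fr_FR to fr to the English msgids in the code, and named placeholders keep format strings reorderable for translators.

import gettext

# Tries fr_FR, then fr, then falls back to the English msgids in the source.
t = gettext.translation("myapp", localedir="locale",
                        languages=["fr_FR", "fr"], fallback=True)
_ = t.gettext

# Even untranslated, the fallback is a readable sentence, and named
# placeholders let translators reorder the arguments if their language needs it.
print(_("Sorry, but %(name)s does not exist") % {"name": "report.pdf"})
print(_("Written %(count)d entries to %(target)s, total size: %(size)d")
      % {"count": 42, "target": "out.db", "size": 1024})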
Quite an old question, but one additional reason I haven't seen in the answers yet:
You could end up with more placeholders than necessary, thus more work for translators and possibly inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (that the developer is most likely not aware of and shouldn't need to be). E.g. it could be used not as an error but as a prompt like a placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable; the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely according to a predefined wording scheme to not end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes (e.g. for contexts) can never be as precise as the text itself, I think (either too verbose or too general; beforehand you don't know, and afterwards it's difficult to change), and it is more work for the developer that's not worth it IMHO.
Does anybody agree/disagree?
The QA manager where I work just informed me there is a bug in my desktop app due to the sign-on prompt being "Operator Id" when it should be "Operator ID". Her argument being that "Id" refers to the ego portion of Freud's "psychic apparatus" and is not semantically correct.
Now, being an anal engineer (AE), I of course had to go and look up Id vs ID, and from my cursory investigations (Google) it seems ID is just as commonly used for Freud's ego as Id is.
So my reasoning would be that Id is a shortened version of "Identifier" and is more correct, or at least more commonly used, than ID, which would typically indicate a two-word abbreviation.
I could just change the UI, but then I wouldn't be holding up my profession as an AE, so I was wondering if there are any best practices or references for this sort of thing that I could use to support my argument? Keeping in mind that this question relates to the user interface, and not the source code, where abbreviations and casing are a whole different branch of philosophy.
According to Merriam-Webster, the abbreviation is "ID". If "Id" were a correct abbreviation, it would have to be written "Id." with the period.
Personally, I use "Id". The compiler doesn't care but my eyes do. Compare:
GetIDByWhatever <-- looks terrible
GetIdByWhatever <-- oh so pretty!
Aesthetics is more important than grammar when it comes to code, always. (Update: 4 years later, I don't stand by this statement anymore)
The 'D' doesn't stand for anything, so I've always considered it an abbreviation, not an acronym - and therefore I too use 'Id', not 'ID'.
I don't know about your QA's reasoning - words can have more than one meaning - this is not unusual in English :)
But it looks like the common usage is actually 'ID' (right or wrong :P), which is probably the format your users would expect.
As a UAUA (ultra-anal usability analyst), please use ID instead of Id.
Visually, it's more recognizable in English. Grammatically, "Id" is a word (rhymes with "squid") and the Freudian definition has been given above. We're never verbally asked to show "id", but "ID." I.D. is fine but passe, as the periods imply multiple words.
So.
Just use ID, okay?
OK.
It's interesting that so many feel "Id" should be the way to go.
I feel "ID" is appropriate because it hints at how we pronounce it -- I.D. Also, when I read "Id" in a running sentence, I sometimes have to come back and read it again just to ensure it's not a typo for "is" or "it".
As a technical writer, this is an issue that comes up for me quite regularly when reviewing other people's work, whether it be programmers, BAs or other writers. Typically, id refers to the ego, as others have said before me, and the accepted abbreviation for identification is ID. Just because plenty of people don't know or understand the rule doesn't mean that they are correct (sorry to be blunt). Mind you, the rules for punctuation and spelling are to a large degree almost as changeable as fashion!
However, what no-one seems to have asked is, does your company have a standard? At the end of the day if your company has a style guide and they have covered this topic in that guide, you should follow the guide. If it is not covered, then may I suggest that you raise the issue with the person that maintains the guide and include any stakeholders in the conversation. Consistency is key here. If the company you work for doesn't have a style guide, then perhaps it is time to start one!
Hope this helps...
The QA manager's line of reasoning is silly. Lots of English words have multiple meanings. "Lead", "lead", "lead" (metal, be at the front of, or a connector).
I would just try to be consistent with the capitalization used elsewhere in the app.
User interface and code are very different beasts...
"ID" is the correct answer for a user interface.
In code, consistency is your friend. Whether you like it or not, go with what is already there elsewhere in the code. If it's not there, then read up and make a decision, or get with the team and work out a way to go that everyone can agree to. Consistency makes life so much easier.
I prefer Id because, when used with other two-letter text, it doesn't become a single all-caps word.
Photovoltaic systems ... PVID (one word or two?) vs PvId (much clearer).
ID = Idaho! Id = Freud! Let the OCD begin!
There is a little OcD in all of us!
Anyway, the google style guide says this:
"ID: Not Id or id, except in string literals or enums. In some contexts, best to spell out as identifier or identification."
I'm going with that.
Microsoft is more vague, from what I could find.
As a short version of Identifier, I would use Id. Also, ID is freaky when you have functions like
getUserIDByName()
Multiple capitals in domain terms are quite problematic with CamelCase, as they can produce ambiguities and therefore inconsistency in your interfaces and naming.
How would you say it if you were reading out loud? I'd pronounce the two letters. ID is correct, analogous with similar abbreviations such as TV. (No dots, please, as the letters don't stand for anything.)
When I'm dealing with abbreviations like this, I like to format them in small block capitals, but that's just a personal taste. Capitals, anyway.
(But I probably would continue to use Id in the code itself.)
I think it depends on the way we say it. We don't pronounce it as the word "id", but as "I-D". Since it is spoken as two sounds, people capitalize the D to avoid it being read as a word; it's more like a symbol. I like "ID" more, just because it looks nicer.
ID is correct. I had been in the Id camp for years, simply because I believed it was an argument between abbreviations and acronyms, until one day I learned of a third type (a subtype, really) called initialisms, which are abbreviations or acronyms that are read one letter at a time rather than pronounced as a single word. Initialisms are written in all caps.
What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper, and not just the text from the page, I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
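A quick sketch of that low-hanging fruit with Beautiful Soup (the HTML here is made up; real pages are messier):

from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = """
<address>221B Baker Street, London NW1 6XE</address>
<div class="adr">
  <span class="street-address">1600 Amphitheatre Parkway</span>,
  <span class="locality">Mountain View</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The <address> element and the adr microformat class are the easy wins.
for tag in soup.find_all("address"):
    print("address tag:", tag.get_text(" ", strip=True))
for tag in soup.find_all(class_="adr"):
    print("adr microformat:", tag.get_text(" ", strip=True))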
I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.
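A rough sketch of that two-step idea (the pattern is deliberately loose and English-centric, and verify_with_geocoder is a hypothetical stand-in for whatever map or geocoding service you can reach):

import re

# Step 1: an over-matching pattern to collect candidate addresses.
CANDIDATE = re.compile(
    r"\d{1,5}\s+[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*"
    r"\s+(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b")

def verify_with_geocoder(candidate):
    # Hypothetical step 2: ask a map service whether the string resolves to
    # a real place; stubbed out here.
    return True

def find_addresses(text):
    candidates = (m.group(0) for m in CANDIDATE.finditer(text))
    return [c for c in candidates if verify_with_geocoder(c)]

print(find_addresses("Our office is at 10 Downing Street, second floor."))
# ['10 Downing Street']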
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions to parse the HTML itself. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup, even if you then use a regular expression on the contents of the elements BeautifulSoup grabs.
If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
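For example, for something well structured like email addresses, it's much safer to let the parser flatten the markup first and only run the regex over plain text:

import re
from bs4 import BeautifulSoup

html = "<p>Contact <b>jane.doe@example.org</b> or &lt;sales@example.org&gt;.</p>"

# The parser handles tags and entities; the regex only ever sees plain text.
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
print(EMAIL.findall(text))   # ['jane.doe@example.org', 'sales@example.org']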
What you're asking for is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route, your best bet is probably to check out the source code of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess whether a string is an address or not, using an expression like "(number), (name) Street|Boulevard|Main", etc.
You can also consider looking into some Firefox extensions which aim to map addresses found in text, to see how they work.
You can check this USA extraction example: http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends upon your requirements.
For email and contact details a regex is more than enough.
For addresses a regex alone will not help. Think about NLP (NER) and POS tagging.
For finding people-related information you can't do anything without NER.
If you need information like paragraphs, get the contents by using the tags.
Some of my colleagues use special comments on their bug fixes, for example:
// 2008-09-23 John Doe - bug 12345
// <short description>
Does this make sense?
Do you comment bug fixes in a special way?
Please let me know.
I don't put in comments like that; the source control system already maintains that history, and I am already able to view the history of a file.
I do put in comments that describe why something non-obvious is being done though. So if the bug fix makes the code less predictable and clear, then I explain why.
Over time these can accumulate and add clutter. It's better to make the code clear, add any comments for related gotchas that may not be obvious and keep the bug detail in the tracking system and repository.
I tend not to comment in the actual source because it can be difficult to keep up to date.
However I do put linking comments in my source control log and issue tracker. e.g. I might do something like this in Perforce:
[Bug-Id] Problem with xyz dialog.
Moved sizing code to abc and now
initialise later.
Then in my issue tracker I will do something like:
Fixed in changelist 1234.
Moved sizing code to abc and now
initialise later.
Because then a good historic marker is left. It also makes it easy if you want to know why a particular line of code is the way it is: you can just look at the file history, and once you've found the line of code, you can read my commit comment and clearly see which bug it was for and how I fixed it.
Only if the solution was particularly clever or hard to understand.
I usually add my name, my e-mail address and the date, along with a short description of what I changed. That's because, as a consultant, I often fix other people's code.
// Glenn F. Henriksen (<email#company.no) - 2008-09-23
// <Short description>
That way the code owners, or the people coming in after me, can figure out what happened and they can get in touch with me if they have to.
(yes, unfortunately, more often than not they have no source control... for internal stuff I use TFS tracking)
While this may seem like a good idea at the time, it quickly gets out of hand. Such information can be better captured using a good combination of source control system and bug tracker. Of course, if there's something tricky going on, a comment describing the situation would be helpful in any case, but not the date, name, or bug number.
The code base I'm currently working on at work is something like 20 years old and they seem to have added lots of comments like this years ago. Fortunately, they stopped doing it a few years after they converted everything to CVS in the late 90s. However, such comments are still littered throughout the code and the policy now is "remove them if you're working directly on that code, but otherwise leave them". They're often really hard to follow especially if the same code is added and removed several times (yes, it happens). They also don't contain the date, but contain the bug number which you'd have to go look up in an archaic system to find the date, so nobody does.
Comments like this are why Subversion lets you type a log entry on every commit. That's where you should put this stuff, not in the code.
I do it if the bug fix involves something that's not straightforward, but more often than not if the bugfix requires a long explanation I take it as a sign that the fix wasn't designed well. Occasionally I have to work around a public interface that can't change so this tends to be the source of these kinds of comments, for example:
// <date> [my name] - Bug xxxxx happens when the foo parameter is null, but
// some customers want the behavior. Jump through some hoops to find a default value.
In other cases the source control commit message is what I use to annotate the change.
Whilst I do tend to see some comments on bugs inside the code at work, my personal preference is linking a code commit to one bug. When I say one I really mean one bug. Afterwards you can always look at the changes made and know which bug these were applied to.
That style of commenting is extremely valuable in a multi-developer environment where there is a range of skills and/or business knowledge across the developers (e.g. everywhere).
To the experienced, knowledgeable developer the reason for a change may be obvious, but for newer developers that comment will make them think twice and do more investigation before messing with it. It also helps them learn more about how the system works.
Oh, and a note from experience about the "I just put that in the source control system" comments:
If it isn't in the source, it didn't happen.
I can't count the number of times the source history for projects has been lost due to inexperience with the source control software, improper branching models etc. There is
only one place the change history cannot be lost - and that's in the source file.
I usually put it there first, then cut 'n paste the same comment when I check it in.
No I don't, and I hate having graffiti like that litter the code. Bug numbers can be tracked in the commit message to the version control system, and by scripts to push relevant commit messages into the bug tracking system. I do not believe they belong in the source code, where future edits will just confuse things.
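As a sketch of that kind of glue (assuming git and some bug tracker with an HTTP API; the tracker call is left as a stub):

import re
import subprocess

# Post-commit hook sketch: find references like "bug 12345" or "#12345" in the
# latest commit message and hand them to the bug tracker.
BUG_REF = re.compile(r"(?:bug\s*|#)(\d+)", re.IGNORECASE)

message = subprocess.run(
    ["git", "log", "-1", "--pretty=%B"],
    capture_output=True, text=True, check=True).stdout

for bug_id in set(BUG_REF.findall(message)):
    # Replace this print with a call to your tracker's API.
    print("Would attach this commit to bug " + bug_id)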
Often a comment like that is more confusing, as you don't really have context as to what the original code looked like, or the original bad behavior.
In general, if your bug fix now makes the code run CORRECTLY, just simply leave it without comments. There is no need to comment correct code.
Sometimes the bug fix makes things look odd, or the bug fix is testing for something that is out of the ordinary. Then it might be appropriate to have a comment - usually the comment should refer back to the "bug number" from your bug database. For example, you might have a comment that says "Bug 123 - Account for odd behavior when the user is in 640 by 480 screen resolution".
If you add comments like that, then after a few years of maintaining the code you will have so many bug-fix comments that you won't be able to read the code.
But if you change something that looks right (but has a subtle bug) into something that is more complicated, it's nice to add a short comment explaining what you did, so that the next programmer to maintain this code doesn't change it back because he (or she) thinks you over-complicated things for no good reason.
No. I use subversion and always enter a description of my motivation for committing a change. I typically don't restate the solution in English, instead I summarize the changes made.
I have worked on a number of projects where they put comments in the code when bug fixes were made. Interestingly, and probably not coincidentally, these were projects which either didn't use any sort of source control tool or were mandated to follow this sort of convention by fiat from management.
Quite honestly, I don't really see the value in doing this for most situations. If I want to know what changed, I'll look at the subversion log and the diff.
Just my two cents.
If the code is corrected, the comment is useless and never interesting to anybody - just noise.
If the bug isn't solved, the comment is wrong, and then it makes sense. :) So only leave such comments if you didn't really solve the bug.
To locate one's specific comments we use a marker like DKBUGBUG, which means David Kelley's fix, so a reviewer can easily identify it. Of course we also add the date and the VSTS bug tracking number, etc., along with this.
Don't duplicate metadata that your VCS is going to keep for you. Dates and names should be added automatically by the VCS. Ticket numbers, the names of managers/users who requested the change, etc. should be in VCS comments, not the code.
Rather than this:
//$DATE $NAME $TICKET
//useful comment to the next poor soul
I would do this:
//useful comment to the next poor soul
If the code is on a live platform, away from direct access to the source control repository, then I will add comments to highlight the changes made as part of the fix for a bug on the live system.
Otherwise no: the message that you enter at check-in should contain all the info you need.
When I make bug fixes or enhancements in third-party libraries/components, I often add comments. This makes it easier to find and move the changes if I need to use a newer version of the library/component.
In my own code I seldom comment bug fixes.
I don't work on multi-person projects, but I sometimes add comments about a certain bug to a unit test.
Remember, there's no such thing as bugs, just insufficient testing.
Since I do as much TDD as possible (everything else is social suicide, because every other method will force you to work endless hours), I seldom fix bugs.
Most of the time I add special remarks like this one to the code:
// I KNOW this may look strange to you, but I have to use
// this special implementation here - if you don't understand that,
// maybe you are the wrong person for the job.
Sounds harsh, but most people who call themselves "developers" deserve no other remarks.