Why does the native spell-checker of Eclipse not work in the following sentence?
$10\%$ of all natives like exotic spices from the neighboring countries.
The native spell-checker of Eclipse together with Texlipse seems not to check lines that contain a percentage sign, whether it is escaped or not. If you first write a line with a misspelling and comment it out later, the wrong spelling is still marked. If you first write the percentage symbol and then your comment, the comment is not checked. And if you have an escaped percentage symbol in your sentence, the rest of the sentence is not spell-checked.
Related
I wrote a parser which recognizes elements of text based on certain patterns.
My program is able to recognize paragraphs, chapters, etc. The problem is that it shouldn't recognize these elements when they appear inside a quote. For example:
Paragraph 1
Something here...
would be processed as a Paragraph.
And:
Paragraph 1
"Paragraph 2"
shouldn't. But as my program is based on regexp patterns, it looks for the word "Paragraph". I go line by line and match patterns against each line. I don't know how to tell my program: if you see a quotation mark, leave the text alone and don't do anything with it. My mentor told me to use raise, but I'm not sure how to do that here.
OK, so I'm still a bit of a beginner, and I don't know if there is a way to direct the regex to ignore things inside quotes. But if I wanted to solve this problem, I would first make a copy of the text to be parsed, run a regex over that copy to delete everything inside quotes, and then run the parser over the remaining text.
A bit kludgy and inelegant I admit, and may have performance issues over a large enough text, but it would get the job done.
See HERE for a link to the documentation of Ruby regexes. About a third of the way down, it discusses quotes:
/\p{Pi}/ - 'Punctuation: Initial Quote'
/\p{Pf}/ - 'Punctuation: Final Quote'
You may be able to bake that into the regex, using ^ to negate a character class, to direct it to ignore items in quotes.
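For what it's worth, here is a minimal sketch of that combined idea in Ruby. The helper name strip_quoted is made up for the example, and note one subtlety: the straight " character belongs to punctuation class Po, so \p{Pi}/\p{Pf} only match typographic quotes such as “ and ”.

# Delete straight-quoted spans, then spans wrapped in typographic
# quotes, and only then run the line-by-line pattern matching.
def strip_quoted(text)
  text.gsub(/"[^"]*"/, '').gsub(/\p{Pi}[^\p{Pf}]*\p{Pf}/, '')
end

strip_quoted(%(Paragraph 1\n"Paragraph 2")).scan(/Paragraph \d+/)
# => ["Paragraph 1"]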
We are comparing different search engines for our research
archives, and having browsed the Xapian-Omega documentation, we
decided to try it out, since the Omega option appears to be an
appropriate solution with several interesting search options.
We installed Xapian-Omega on a Linux server (Debian 7) and tested
the setup with success. However, we are unsure how one can
employ, or perhaps even enable, the use of wildcards or regular
expressions with Xapian-Omega.
We read that for Xapian one has to enable the wildcard option via
the "QueryParser flags", but we did not see much information
regarding examples with the Omega CGI. Although the latter runs
well, wildcard options (such as * for the general wildcard and ?
for a single character) do not seem to work as expected by
default, and they would be useful, even though stemming,
substrings, etc. may be functional.
Could someone clarify this? i.e. explain, or indicate a page with
an example or two.
E.g. it would be interesting to be able to employ standard simple
wildcard searches with a certain precision, such as:
medic* for medicine, medical, medicament
or ? for single characters.
Can regexps be recognised with Omega?
e.g. sep[ae]r[ae]te(\w+)?
or searches for structured formats such as email addresses, credit
card numbers or certain formula types in research papers, etc.
In an old note from Olly Betts on the dev mailing list regarding
this, one suggestion was to grep the index file, but that would
defeat the RAD advantage of Omega.
Any examples of searches using Omega with wildcards or regular
expressions would be most appreciated... even an indication of a
page where this theme is well presented, with examples
illustrating how to develop advanced searches using Xapian alone
(PHP or Python perhaps), would be most welcome.
(We are not concerned for the moment about the eventual
substantial increase in the size of the index, or in the time
needed to index the archive.)
You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
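If you would rather drive Xapian directly instead of going through Omega, here is a minimal sketch with the Ruby bindings (the database path is an assumption; the PHP and Python bindings expose the same flag):

require 'xapian'

db = Xapian::Database.new("/var/lib/xapian-omega/data/default")
qp = Xapian::QueryParser.new
qp.database = db  # wildcards are expanded against the index's terms
query = qp.parse_query("medic*",
                       Xapian::QueryParser::FLAG_WILDCARD |
                       Xapian::QueryParser::FLAG_DEFAULT)
enquire = Xapian::Enquire.new(db)
enquire.query = query
enquire.mset(0, 10).matches.each { |m| puts m.docid }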
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed them into the search. Where it becomes difficult is if the regex expands to a lot of terms (eg just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
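As a rough illustration of that approach (this is not a built-in feature, just a sketch of the idea above; the path is an assumption, and allterms comes from the Ruby bindings' convenience wrapper):

require 'xapian'

db = Xapian::Database.new("/var/lib/xapian-omega/data/default")
rx = /\Asep[ae]r[ae]te/  # anchored, so a constant-prefix jump stays possible
# Collect the unstemmed terms matching the regex, then OR them together.
terms = db.allterms.map { |t| t.term }.grep(rx)
query = terms.map { |t| Xapian::Query.new(t) }
             .reduce { |a, b| Xapian::Query.new(Xapian::Query::OP_OR, a, b) }
# query is nil if nothing matched; real code should handle that case.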
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
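Sketched again with the Ruby bindings, assuming spelling data was added at index time (e.g. via Xapian::TermGenerator's FLAG_SPELLING):

require 'xapian'

db = Xapian::Database.new("/var/lib/xapian-omega/data/default")
qp = Xapian::QueryParser.new
qp.database = db
qp.stemmer = Xapian::Stem.new("english")
qp.stemming_strategy = Xapian::QueryParser::STEM_SOME
qp.parse_query("seperate",
               Xapian::QueryParser::FLAG_SPELLING_CORRECTION |
               Xapian::QueryParser::FLAG_DEFAULT)
puts qp.corrected_query_string  # => "separate", when the spelling data allows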
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.
James Aylett: Thank you for your answer and help, and my apologies for this belated reply; I was distracted by other work.
We had already seen the OmegaScript page, but it was not clear to us how to employ these options with the CGI interface. Also, the use of * seems to be for trailing characters only, is that correct? i.e. not for internal groups of characters, e.g. omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wildcard characters, sometimes represented by ? in certain search engines. Could you comment on this?
Regarding the use of regular expressions, we had imagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses; there are many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extraction of such text, perhaps together with some surrounding text, could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.
I have a large source code base where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them? I imagine first extracting all comments from the code into a single text file (with source file / line number info if possible), then piping this through some language-detection app.
If it matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spell-checker. This way, the text that isn't recognized will get underlined and be easy to spot.
Other than that, I don't see an easy way to go through with this.
You could write a program that reads the files and prints only the comments to another output file, which you then spell-check. This might be a waste of time, as you could easily spot the comments yourself.
If you do write such a program, however, keep in mind that there are three things to check for (a rough sketch follows this list):
If comment starts with /*, make sure it stops reading when encountering */
If comment starts with //, only read one line - unless:
If line starting with // ends with \, read next line as well
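Here is a rough sketch of such an extractor in Ruby, assuming C-style comments and ignoring complications such as comment markers hidden inside string literals (the file name is just an example):

def each_comment(path)
  in_block = false   # inside a /* ... */ comment
  continued = false  # previous // line ended with a backslash
  File.foreach(path).with_index(1) do |line, lineno|
    if in_block
      yield path, lineno, line.chomp
      in_block = false if line.include?("*/")
    elsif continued
      yield path, lineno, line.chomp
      continued = line.rstrip.end_with?("\\")
    elsif (i = line.index("/*"))
      yield path, lineno, line.chomp[i..-1]
      in_block = line.index("*/", i + 2).nil?
    elsif (i = line.index("//"))
      yield path, lineno, line.chomp[i..-1]
      continued = line.rstrip.end_with?("\\")
    end
  end
end

each_comment("main.c") { |f, n, c| puts "#{f}:#{n}: #{c}" }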
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...
(Using Ruby 1.8)
I only have a brief understanding of encoding and such... but what I want to know is: in any given script handling any given text file, is there some universal library or call I can make to turn non-standard characters into their nearest printable equivalents? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-823
That hyphen is literally just a hyphen as I've typed it out. In the file, though, it's something that looks like a hyphen (an en-dash?), but when copying and pasting it... for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least into something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking characters, or to ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in that answer; it explains the problem and ways to fix it, and ends up recommending Iconv too.
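For the en-dash example above, a small sketch assuming the stray byte comes from Windows-1252 (0x96 is that encoding's en-dash); //TRANSLIT asks iconv to substitute the closest ASCII equivalent and //IGNORE drops anything it can't map, though exact transliterations depend on the platform's iconv:

require 'iconv'  # ships with Ruby 1.8

dirty = "08\x9623"
clean = Iconv.conv('ASCII//TRANSLIT//IGNORE', 'WINDOWS-1252', dirty)
puts clean  # => "08-23" on a typical glibc system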
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
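For example, with Ruby 1.8's byte-oriented strings and a hypothetical stray byte, collapsing each run of unexpected bytes into a single ASCII hyphen:

"08\x9623".gsub(/[^a-zA-Z0-9]+/, '-')  # => "08-23"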
When I see a small program written for some students, I often see something like this (Haskell, German):
ueber = "What the haeck!"
instead of
über = "What the häck!"
As many modern languages are specified to allow non-ASCII characters in declaration names via UTF-8, is there a special reason to avoid them in a project that is sure to be used only by people who are able to input these characters (say, a team of German students), or is this just for historical reasons?
I know that you should keep names within a-zA-Z_0-9 if you develop an application internationally, but is there any reason to avoid this in a "local" project?
is there a special reason to avoid them in a project that is sure to be used only by people who are able to input these characters
That is certainly the main reason. Other reasons that come to mind are that many development tools, search functions, editors, parsers, documentation generators, code search engines, etc. will not expect non-ASCII input in code.
Also, you never know where your code may be used one day! The smallest innocent school project can grow into a nice Open-Source tool that gets used around the globe one day. In that case, ASCII is the smallest common denominator, at least at the moment.
I've had to work on a project started by French developers. They had to spend quite a bit of time translating their program to English when more people joined the project. Teach your German students this lesson up front, and not only will they be able to share their code with others, they'll no longer need an über or ueber variable either.
BTW, ü is an alphabetic character. + and - are non-alphanumeric, and I'd say it's obvious why they're disliked in function names.