Evaluating language identification methods - algorithm

Part of my thesis work is to evaluate a number of language detection methods that are already available and then implement one of them.
For this I have chosen the following methods:
N-Gram-Based Text Categorization by Cavnar and Trenkle
Statistical Identification of Language by Ted Dunning
Using compression-based language models for text categorization by Teahan and Harper
Character Set Detection
A composite approach to language/encoding detection
I first have to evaluate the methods and preferably present a table with the accuracy of each. My question is: to find the accuracy of each method, do I need to build the language models from training data, test them, and record the accuracy myself, or is there another approach I can follow? Most of the papers already include such accuracy tables, but I am not sure whether it is acceptable in my program to simply take those figures and present them in the report.
Appreciate any thoughts on this.

I would also suggest asking your thesis advisor. Implementing all of them will be a lot of work, and it is very difficult to really compare them without being able to test them. If I remember correctly the last three have not been well evaluated in the literature, so it would be difficult to compare their results. I have implemented (and evaluated) only the first one of those myself. One big question is also how big a part of your thesis this LI evaluation and implementation is?
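For a sense of what "build the models, test them, record the accuracy" involves for the first method, here is a minimal sketch of the rank-order ("out-of-place") comparison from the Cavnar and Trenkle paper. It is a simplification under stated assumptions, not a faithful reimplementation: the padding scheme, the profile size, the penalty for unseen n-grams, and the two toy training sentences are all invented for illustration, and a real evaluation would need far more training and held-out test text.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Return character n-grams (length 1..max_n) ranked by frequency."""
    # Normalize whitespace and pad with a space on each side (an assumption).
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Most frequent first; ties broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[g]) if g in rank else penalty
        for r, g in enumerate(doc_profile)
    )

def identify(text, models):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))

def accuracy(test_set, models):
    """Fraction of (text, language) pairs that are identified correctly."""
    correct = sum(identify(text, models) == lang for text, lang in test_set)
    return correct / len(test_set)

# Toy training data, invented for illustration only.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
MODELS = {lang: ngram_profile(text) for lang, text in TRAIN.items()}
```

With something like this in place, each row of the accuracy table is just `accuracy(held_out_pairs, MODELS)` for one method, where `held_out_pairs` is a list of `(text, language)` pairs that were not used for training.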

Related

QA Algorithm for Q Processing

What Algorithm/method do I use for a Question Answering System's Question Processing?
I have been searching for possible algorithms for my Question Answering System. The only thing that I think would be possible to use is parsing, but I asked about parsing in my last question, and from the answers there I think it cannot be used (I'm not sure).
My idea for using parsing is to cut the question into pieces, word by word, and pass each word through a store of words that determines what kind of word (noun, adjective, verb, etc.) it is. My purpose in using parsing is to remove, or rather to determine, the topic of the question.
The other idea of mine is the ChatterBot. A chatterbot uses a query of words (correct me if I'm mistaken), and those words are assigned to other words. It then randomly chooses a word from its query.
Example: User's Statement: Hello > ChatterBot's Possible Replies: Hi,Hello,Hey
I'm not quite sure which method/algorithm to use for Question Answering. I have read the Wikipedia article: http://en.wikipedia.org/wiki/Question_answering but I do not quite understand which algorithm to use for question processing.
Thank you.
PS: I'm developing in Javascript. Q = Question
You could use a naive bayes classifier in order to look at the questions and determine their subject. You'd need a lot of training data and a fairly narrow domain.
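To make the naive Bayes suggestion concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing that guesses the subject of a question. The class name, the whitespace tokenization, and the four training questions are assumptions made up for illustration; as noted above, a usable system would need far more training data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training questions for two hypothetical subjects.
classifier = NaiveBayes()
classifier.train("what is the capital of france", "geography")
classifier.train("which river flows through egypt", "geography")
classifier.train("who wrote hamlet", "literature")
classifier.train("which author wrote the odyssey", "literature")
```

Calling `classifier.classify("who wrote the iliad")` then returns the label whose training questions share the most vocabulary with the new question, weighted by how common each word is under that label.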
The sophisticated responses to this problem involve a lot of machine inference techniques which are a bit out of my skill level to explain extremely well. My idea is to use a Markov network in which each word has an edge to one or two words next to it. A series of tests is applied to each word to indicate the likely membership of that word in one of its possible meanings (for example, Mark is more likely a name if it's capitalized, but if the next word is 'a' it is probably used in the sense of a verb). From there the machine can attempt to determine the actual meaning of the sentence, which will rely on the use of, again, unimaginably large amounts of training data.
Coursera's Probabilistic Graphical Models class (Probably their NLP class too) would probably be the best resource if you're interested in becoming skilled in this area. (PGM is the only reason I know anything about this!)
Here's a great book that you may need to read to learn a lot about NLP and question answering systems: http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210
The book has a full section (V. Applications) that will help you a lot in developing a good system.
Note, however, that the book discusses theories and algorithms only (no code).
It's not only about parsing text; you'll need to understand the context to provide a better answer. In practice, you need to extract some keywords and ignore everything else.
You may also want to read up on keywords (bag of words) and algorithms like TF-IDF.
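As a rough illustration of the keyword idea, the following sketch scores each word of a question by TF-IDF over a small bag-of-words collection, so that words appearing in every question score zero and the distinctive ones rise to the top. The function names, the whitespace tokenization, and the unsmoothed `log(N / df)` weighting are assumptions; libraries use several variants of the formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in tokenized for word in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

def top_keywords(doc_scores, k=2):
    """Highest-scoring words first; ties broken alphabetically."""
    return sorted(doc_scores, key=lambda w: (-doc_scores[w], w))[:k]

# Invented example questions.
QUESTIONS = [
    "what is the capital of france",
    "what is the population of france",
    "what is the capital of italy",
]
SCORES = tf_idf(QUESTIONS)
```

Here words such as "what", "is", "the" and "of" occur in every question and therefore get a score of zero, which is the "ignore everything else" part.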

Expert system for writing programs?

I am brainstorming an idea of developing a high level software to manipulate matrix algebra equations, tensor manipulations to be exact, to produce optimized C++ code using several criteria such as sizes of dimensions, available memory on the system, etc.
Something which is similar in spirit to tensor contraction engine, TCE, but specifically oriented towards producing optimized rather than general code.
The desired end result is software which is expert in producing parallel programs in my domain.
Does this sort of development fall into the category of expert systems?
What other projects out there work in the same area of producing code given the constraints?
What you are describing is more like a Domain-Specific Language.
http://en.wikipedia.org/wiki/Domain-specific_language
It wouldn't be called an expert system, at least not in the traditional sense of this concept.
Expert systems are rule-based inference engines, whereby the expertise in question is clearly encapsulated in the rules. The system you suggest, while possibly encapsulating insight about the nature of the problem domain inside a linear algebra model of sorts, would act more as a black box than an expert system. One of the characteristics of expert systems is that they can produce an "explanation" of their reasoning, and such a feature is possible in part because the knowledge representation, while formalized, remains close to simple statements in a natural language; matrices and operations on them, while possibly being derived upon similar observation of reality, are a lot less transparent...
It is unclear from the description in the question whether the system you propose would optimize existing code (possibly in a limited domain), or whether it would produce optimized code, in that case driven by some external goal/function...
Well production systems (rule systems) are one of four general approaches to computation (Turing machines, Church recursive functions, Post production systems and Markov algorithms [and several more have been added to that list]) which more or less have these respective realizations: imperative programming, functional programming, rule based programming - as far as I know Markov algorithms don't have an independent implementation. These are all Turing equivalent.
So rule based programming can be used to write anything at all. Also early mathematical/symbolic manipulation programs did generally use rule based programming until the problem was sufficiently well understood (whereupon the approach was changed to imperative or constraint programming - see MACSYMA - hmmm MACSYMA was written in Lisp so perhaps I have a different program in mind or perhaps they originally implemented a rule system in Lisp for this).
You could easily write a rule system to perform the matrix manipulations. You could keep a trace depending on logical support to record the actual rules fired that contributed to a solution (some rules that fire might not contribute directly to a solution after all). Then for every rule you have a mapping to a set of C++ instructions (these don't have to be "complete" - they act more like a semi-executable requirement) which are output as an intermediate language. That is then read by a parser to link it to the required input data and apply any kind of fix-up needed. You might find it easier to generate functional code - for one thing, after the fix-up you could more easily optimize the output code in functional source.
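A very small sketch of that idea, written in Python for brevity: each rule has a name, a condition on an operation, and a C++ fragment it maps to, and the generator keeps a trace of which rules fired. Everything concrete here is hypothetical: the rule names, the size threshold of 64, and the emitted C++ function names (`naive_matmul`, `blocked_matmul`, `vector_add`) are placeholders, not a real library.

```python
# Each rule: (name, condition on the operation, C++ template it maps to).
RULES = [
    ("use_naive_loop",
     lambda op: op["kind"] == "matmul" and op["n"] <= 64,
     "naive_matmul({a}, {b}, {out}, {n});"),
    ("use_blocked_matmul",
     lambda op: op["kind"] == "matmul" and op["n"] > 64,
     "blocked_matmul({a}, {b}, {out}, {n}, /*block=*/32);"),
    ("use_vector_add",
     lambda op: op["kind"] == "add",
     "vector_add({a}, {b}, {out}, {n});"),
]

def generate(ops):
    """Fire the first matching rule per operation; return (trace, code lines)."""
    trace, code = [], []
    for op in ops:
        for name, condition, template in RULES:
            if condition(op):
                trace.append(name)                   # record which rule fired
                code.append(template.format(**op))   # emit the mapped fragment
                break
        else:
            raise ValueError(f"no rule matches operation {op!r}")
    return trace, code

# Hypothetical input: two multiplications of different sizes and one addition.
OPS = [
    {"kind": "matmul", "a": "A", "b": "B", "out": "C", "n": 16},
    {"kind": "matmul", "a": "C", "b": "D", "out": "E", "n": 512},
    {"kind": "add", "a": "E", "b": "F", "out": "G", "n": 512},
]
TRACE, CODE = generate(OPS)
```

The trace is what would let such a system "explain" why a particular fragment was chosen, and the emitted lines play the role of the intermediate language that a later pass links to real data.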
Having said that, other contributors have outlined a domain-specific language approach, and that is what the TCE people did too (my suggestion is the same approach, just using rules).

What does this software quote mean?

I was reading Code Complete (2nd Edition), and came across a quote in the margin on page 87 by Bertrand Meyer.
Ask not first what the system does; ask WHAT it does it to!
What exactly is the point Mr. Meyer is trying to get across here? I have some rough ideas, but I would like to make sure I really understand.
... So this is the second fallacy of teleology
- to attribute goal-directed
behavior to things that are not
goal-directed, perhaps without even
thinking of the things as alive and
spirit-inhabited, but only thinking, X
happens in order to Y. "In order to"
is mentalistic language, even though
it doesn't seem to name a blatantly
mental property like "fearful" or
"thinks it can fly". — Eliezer Yudkowsky, artificial intelligence theorist
concerned with self-improving AIs with stable goal systems
Bertrand Meyer's homily suggests that sound reasoning about systems is grounded in knowing what concrete entities are altered by the system; the purpose of the alterations is an emergent property.
I believe the point here is not on what the system does, but on the data it operates on and what those operations are.
This provides two major thinking shifts:
You think of the data and concepts first
You think of operations on that data
With those two "baselines" you will be better prepared to organize a system to achieve your goals so that operations on data are well understood and make sense.
In effect, he is laying the ground work to be able to write the "contracts" on the code you write.
A Google search picked up Art Gittleman's Computing With C# and the .Net Framework:
Bertrand Meyer gives an example of
payroll program, which produces
paychecks from timecards. Management
may later want to extend this program
to produce statistics or tax
information. The payroll function
itself may need to be changed to
produce weekly checks instead of
biweekly checks, for example. The
procedures used to implement the
original payroll program would need to
be changed to make any of these
modifications. Meyer notes that any of
these payroll programs will manipulate
the same sort of data, employee
records, company regulations, and so
forth.
Focusing on the more stable
aspect of such systems, Meyer states a
principle: "Ask not first what the
system does: Ask WHAT it does to!";
and a definition: "Object-oriented
design is the method which leads to
software architectures based on
objects every system or subsystem
manipulates (rather than "the"
function it meant to ensure)."
Today we take UML's class diagrams and other OOAD approaches for granted, but this was something that was "discovered" along the way.
Also see Object-Oriented Design.
My opinion is that the quote is meant as a method to find good abstractions in your software. The text next to this quote deals with finding real-world objects to design your classes.
A simple example would be something like this:
You are making software for a bank. Because your software is working with bank accounts, it should have a class for an account. Then you start thinking what properties accounts have and the interactions you can have with accounts.
Of course, this quote makes more sense if the objects you are trying to model aren't as clear as this case.
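The bank example can be sketched as a class that starts from the thing the system acts on (the account) and only then lists what is done to it. The attribute and method names, and the choice to store the balance as an integer number of cents, are assumptions for illustration.

```python
class Account:
    """The object the system 'does it to': an account and its operations."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance  # stored in cents to avoid float rounding

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("withdrawal must be positive")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
```

Features such as statements or interest would then be added as further operations on `Account`, rather than by restructuring a top-level "what the system does" procedure.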
Fred Brooks stated it this way:
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious."
Domain-Driven Design... Understand the problem the software is designed to solve. What "domain" entities (data abstractions) does the system manipulate? And what does it do to those domain entities?

Algebraic logic

Both Wolfram Alpha and Bing are now providing the ability to solve complex, algebraic logic problems (ie "solve for x, given this equation"), and not just evaluate simple arithmetic expressions (eg "what's 5+5?"). How is this done?
I can read most types of code that might get thrown at me, so it doesn't really make a difference what you use to explain and represent the algorithm. I find that bash makes a really good pseudo-code, not to mention it's actually functional, so that'd be ideal. Also, I'm fairly familiar with its ins and outs. Sorry to go off on a tangent, but it really irritates me to see people spend effort on crunching out "pseudocode" when they could be getting something 100% functional for just slightly more effort. Anyway, thanks so much in advance.
There are 2 main methods to solve:
Numeric methods. Numerical methods mean, basically, that the solver tries to change the value of x until the equation is satisfied. More info on numerical methods.
Symbolic math. The solver manipulates the equation as a string of symbols, by a number of formal rules. It's not that different from algebra we learn in school, the solver just knows a lot of different rules. More info on computer algebra.
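The two approaches can be contrasted in a few lines. The sketch below is in Python rather than bash, and both functions are simplified stand-ins chosen for illustration: bisection is only one of many numerical methods, and the "symbolic" part handles nothing more than a linear equation a*x + b = c by applying two algebra rules exactly. This is not how Mathematica or any real computer algebra system is implemented.

```python
from fractions import Fraction

def bisect_root(f, lo, hi, tol=1e-9):
    """Numeric: narrow an interval around x until f(x) is (nearly) zero."""
    if f(lo) * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Keep the half of the interval where the sign change happens.
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def solve_linear(a, b, c):
    """Symbolic-style: solve a*x + b = c exactly by rewriting the equation."""
    if a == 0:
        raise ValueError("not a linear equation in x")
    rhs = Fraction(c) - Fraction(b)  # rule 1: subtract b from both sides
    return rhs / Fraction(a)         # rule 2: divide both sides by a
```

For example, `bisect_root(lambda x: x * x - 2, 0, 2)` approximates the square root of 2 to within the tolerance, while `solve_linear(3, 1, 2)` yields the exact fraction 1/3 rather than 0.333...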
Wolfram|Alpha (W|A) is based on the Mathematica kernel, combined with a natural language parser (which is also built primarily with Mathematica). They have a whole heap of curated data and associated formula that can be used once the question has been interpreted.
There's a blog post describing some of this which came out at the same time as W|A.
Finally, Bing simply uses the (non-free) API to answer questions via W|A.

Minimum CompSci Knowledge Needed for Writing Desktop Apps

Having been a hobbyist programmer for 3 years (mainly Python and C) and never having written an application longer than 500 lines of code, I find myself faced with two choices :
(1) Learn the essentials of data structures and algorithm design so I can become a l33t computer scientist.
(2) Learn Qt, which would help me build projects I have been itching to build for a long time.
For learning (1), everyone seems to recommend reading CLRS. Unfortunately, reading CLRS would take me at least a year of study (or more, I'm not Peter Krumins). I also understand that to accomplish any moderately complex task using (2), I will need to understand at least the fundamentals of (1), which brings me to my question: assuming I use C++ as the programming language of choice, which parts of CLRS would give me sufficient knowledge of algorithms and data structures to work on large projects using (2)?
In other words, I need a list of theoretical CompSci topics absolutely essential for everyday application programming tasks. Also, I want to use CLRS as a handy reference, so I don't want to skip any material critical to understanding the later sections of the book.
Don't get me wrong here. Discrete math and the theoretical underpinnings of CompSci have been on my "TODO: URGENT" list for about 6 months now, but I just don't have enough time owing to college work. After a long time, I have 15 days off to do whatever the hell I like, and I want to spend these 15 days building applications I really want to build rather than sitting at my desk, pen and paper in hand, trying to write down the solution to a textbook problem.
(BTW, a less-math-more-code resource on algorithms will be highly appreciated. I'm just out of high school and my math is not at the level it should be.)
Thanks :)
This could be considered heresy, but the vast majority of application code does not require much understanding of algorithms and data structures. Most languages provide libraries which contain collection classes, searching and sorting algorithms, etc. You generally don't need to understand the theory behind how these work, just use them!
However, if you've never written anything longer than 500 lines, then there are a lot of things you DO need to learn, such as how to write your application's code so that it's flexible, maintainable, etc.
For a less-math, more code resource on algorithms than CLRS, check out Algorithms in a Nutshell. If you're going to be writing desktop applications, I don't consider CLRS to be required reading. If you're using C++ I think Sedgewick is a more appropriate choice.
Try some online comp sci courses. Berkeley has some, as does MIT. Software engineering radio is a great podcast also.
See these questions as well:
What are some good computer science resources for a blind programmer?
https://stackoverflow.com/questions/360542/plumber-programmers-vs-computer-scientists#360554
Heed the wisdom of Don and just do it. Can you define the features that you want your application to have? Can you break those features down into smaller tasks? Can you organize the code produced by those tasks into a coherent structure?
Of course you can. Identify any 'risky' areas (areas that you do not understand, e.g. something that requires more math than you know, or special algorithms you would have to research) and either find another solution, prototype a solution, or come back to SO and ask specific questions.
Moving from 500 LOC to a real (even if small) application is not that easy.
As Don was pointing out, you'll need to learn a lot of things about code (flexibility, reuse, etc.), and you need to learn some basics of configuration management as well (Visual SourceSafe, svn?).
But the main issue is that you need a way to avoid being overwhelmed by your functionality/code pair. That is not easy. What I can suggest is to put something in place to 'automatically' test your code (even in a very basic way) via some regression tests. Otherwise it's going to be hard.
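A "very basic" regression test can be as small as a table of inputs with the outputs recorded when the code was known to work, plus a loop that re-checks them after every change. The function under test (`word_count`) and the cases below are hypothetical; in a real project a framework such as Python's `unittest` would usually replace the hand-written loop.

```python
def word_count(text):
    """Hypothetical application function under test."""
    return len(text.split())

# Each case pairs an input with the output recorded when the code worked.
REGRESSION_CASES = [
    ("", 0),
    ("hello", 1),
    ("hello  world", 2),
    ("  leading and trailing  ", 3),
]

def run_regression_tests():
    """Return a list of (input, expected, actual) for every failing case."""
    failures = []
    for text, expected in REGRESSION_CASES:
        actual = word_count(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures
```

An empty list from `run_regression_tests()` means nothing that used to work has been broken by the latest change.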
As you can see, I don't think this is related at all to data structures, algorithms, or whatever.
Good luck and let us know
I must say that sitting down with a dry old textbook and reading it through is not the way to learn how to do anything effectively, even if you are making notes. Doing it is the best way to learn, using the textbooks as a reference. Indeed, using sites like this as a reference.
As for data structures - learn which one is good for whatever situation you envision: Sets (sorted and unsorted), Lists (ArrayList, LinkedList), Maps (HashMap, TreeMap). Complexity of doing basic operations - adding, removing, searching, sorting, etc. That will help you to select an appropriate library data structure to use in your application.
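As a small illustration of picking a library structure by the operation you need, here are three helpers using Python's built-in counterparts (set, dict, and a sorted list with the standard `bisect` module) as rough analogues of the Java-style names above. The helper names are made up for this example, and the complexity notes in the comments are the usual average-case figures.

```python
import bisect

def unique_in_order(items):
    """Remove duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        # Membership in a set is O(1) on average; in a list it would be O(n).
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def count_words(words):
    """Count occurrences using a hash map (dict): O(1) average per update."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def contains_sorted(sorted_items, target):
    """Binary search in an already sorted list: O(log n) per lookup."""
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target
```

The point is not the helpers themselves but that each one is only a few lines once the right container has been chosen.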
Also make sure you're reasonably comfortable with MVC - i.e., ensure your model is separated from your view (the Qt front-end) as well as possible. Best would be to have the model and algorithms working on their own, and then put the GUI on top. Or a unit test on top. Etc...
Good luck!
It's like saying you want to move to France, so should you learn French from a book, and what are the essential words - or should you just go to France and find out which words you need to know from experience and from copying the locals.
Writing code is part of learning computer science. I was writing code long before I'd even heard of the term, and lots of people were writing code before the term was invented.
Besides, you say you're itching to write certain applications. That can't be taught, so just go ahead and do it. Some things you only learn by doing.
(The theoretical foundations will just give you a deeper understanding of what you wind up doing anyway, which will mainly be copying other people's approaches. The only caveat is that in some cases the theoretical stuff will tell you what's futile to attempt - e.g. if one of your itches is to solve an NP complete problem, you probably won't succeed :-)
I would say the practical aspects of coding are more important. In particular, source control is vital if you don't use that already. I like bzr as an easy to set up and use system, though GUI support isn't as mature as it could be.
I'd then move on to one or both of the classics about the craft of coding, namely
The Pragmatic Programmer
Code Complete 2
You could also check out the list of recommended books on Stack Overflow.

Resources