Pseudocode interpreter? - algorithm

Like lots of you guys on SO, I often write in several languages. And when it comes to planning (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I was taught to do this using flow diagrams or UML-like diagrams, in retrospect I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, Perl and Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.
Common idioms might include Java-like braces for scope, Pythonic list comprehensions or indentation, C++-like inheritance, C#-style lambdas, and Matlab-like slices and matrix operations.
I noticed that it's actually quite easy for people to recognise exactly what I'm trying to do, and quite easy for them to intelligently translate it into other languages. Of course, that step involves considering the corner cases and the moments where each language behaves idiosyncratically.
But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if, etc. Clearly I'd have to exclude many 'odd' languages like Lisp and the APL derivatives, but...
So my questions are:
1. Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than Eclipse's syntax trees or Google Translate's language-guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
2. Is it theoretically possible to create a single interpreter or compiler that recognises which language idiom you're using at any moment and (maybe "intelligently") executes it or translates it to a runnable form, and flags the corner cases where my syntax is ambiguous with regard to behaviour? Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs), and knowing when to use list- vs array-like representations.
3. Is there any language or interpreter in existence that can manage this kind of flexible interpreting?
4. Have I missed an obvious obstacle to this being possible?
Edit:
Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate it into real Python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - spacing, brackets, optional helper words like let or then, the context in which variables were previously used, etc. - plus knowledge of common conventions (like capitalised names, i for iteration, and some simplistic, limited understanding of the naming of variables/methods, e.g. containing the words get, asynchronous, count, last, previous, my, etc.). In real pseudocode, variable naming is as informative as the operations themselves!
Using these clues it will make assumptions about the implementation of each operation (such as 0- or 1-based indexing, when exceptions should be caught or ignored, which variables ought to be const/global/local, where execution should start and end, which bits should be in separate threads, and noticing when numerical units match or need converting). Each assumption will have a given certainty - and the program will list the assumptions on each statement as it coaxes what you write into something executable!
For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDEs, will read all the definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...) I guess it should attempt to execute everything and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are if you're being ambiguous.
It will certainly be some time before it can manage unusual examples like @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!
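In case it helps picture it, this is the kind of clue-scoring I have in mind - a very rough sketch where the hint table, the language buckets and the "certainty" measure are all placeholders I've made up, not an existing library:

import re

# Placeholder hint table: each language bucket gets a few tell-tale regexes.
LANGUAGE_HINTS = {
    "python": [r"\bdef\b", r"\belif\b", r":\s*$", r"\bfor\s+\w+\s+in\b"],
    "c-like": [r"[{};]\s*$", r"\bint\b", r"->", r"\+\+"],
    "matlab": [r"\bend\b", r"\bfunction\b", r"\w+\(\d+:\d+\)"],
    "basic":  [r"\bLet\b", r"\bThen\b", r"\bDim\b"],
}

def guess_language(line):
    """Score one pseudocode line against each language's hints."""
    scores = {lang: sum(bool(re.search(p, line)) for p in pats)
              for lang, pats in LANGUAGE_HINTS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    # The "certainty" is just the share of matched hints - enough to list
    # an assumption next to each statement.
    return best, scores[best] / total

for line in ["for c in customers:", "int count = 0;", "Let x = 5 Then Print x"]:
    print(line, "->", guess_language(line))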

I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else, the language is not just the language: there are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write of "actual code".
In C# you have the .NET Framework, in C++ you have the STL, in Java you have the Java class libraries, etc.
The differences between those libraries are too big to be just syntactic nuances.
<subjective>
There have been attempts at unifying the language constructs of different languages into a "unified syntax". That was called 4GL, and it never really took off.
</subjective>
As a side note, I have seen a code example about a page long that was valid as C#, Java and JavaScript code. That can serve as an example of where it is impossible to determine the actual language used.
Edit:
Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.
foreach c in ImportantCustomers{== OrderValue >=$1M}
    SendMailInviteToSpecialEvent(c)
Now tell me what language it is and write an interpreter for that.

1. To detect what programming language is used: Detecting programming language from a snippet
2. I think it should be possible. The approach in 1. could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to an intermediate form based on that detection, along with any important syntax (e.g. begin/end wrappers). Then the next line/clause, etc. Basically, write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm (a rough sketch of this follows below).
3. I doubt that this has been done... it seems like the cognitive load of learning to write, say, Python-compatible pseudocode would be much lower than that of debugging the cases where your interpreter fails.
4a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
4b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.
Note that I don't think 4a and 4b are necessarily obstacles to this being possible. I just think it won't be useful for any practical purpose.
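Here is the sort of thing I mean for point 2 - a very rough sketch where the per-style matchers and the intermediate forms are invented for illustration, not a real parser:

import re

# Tiny made-up matchers: (style name, pattern, normaliser to an intermediate form)
CHUNK_MATCHERS = [
    ("python-if",         re.compile(r"^if (?P<c>.+):$"),
     lambda m: ("IF", m.group("c"))),
    ("c-if",              re.compile(r"^if \((?P<c>.+)\)\s*\{?$"),
     lambda m: ("IF", m.group("c"))),
    ("python-assignment", re.compile(r"^(?P<l>\w+)\s*=\s*(?P<r>[^;]+)$"),
     lambda m: ("SET", m.group("l"), m.group("r"))),
    ("c-assignment",      re.compile(r"^(?P<l>\w+)\s*=\s*(?P<r>.+);$"),
     lambda m: ("SET", m.group("l"), m.group("r"))),
    ("matlab-assignment", re.compile(r"^(?P<l>\w+)\s*=\s*(?P<r>.+);$"),
     lambda m: ("SET", m.group("l"), m.group("r"))),
]

def normalise(line):
    """Try every matcher; return an intermediate form, flagging ambiguity."""
    hits = []
    for name, pattern, build in CHUNK_MATCHERS:
        m = pattern.match(line.strip())
        if m:
            hits.append((name, build(m)))
    if not hits:
        return ("UNKNOWN", line)
    if len(hits) > 1:
        # e.g. a trailing ';' could be a C statement or MATLAB output suppression
        return ("AMBIGUOUS", line, [name for name, _ in hits])
    return hits[0][1]

for line in ["if x > 3:", "if (x > 3) {", "count = count + 1", "total = total + x;"]:
    print(line, "->", normalise(line))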

Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.
Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all the languages involved, and they would have to cooperate. This is a tremendous job in itself.
Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.
And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?
What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.
If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.

An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.
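For instance, the library-matching part might start out as nothing more than fuzzy string matching; the snippet library and phrasing below are invented purely for illustration:

import difflib

# An invented mini-library of "things people type" mapped to real code.
SNIPPET_LIBRARY = {
    "sort the list descending": "items.sort(reverse=True)",
    "read lines from a file":   "lines = open(path).read().splitlines()",
    "sum of squares":           "total = sum(x * x for x in xs)",
}

def suggest(pseudocode_line):
    """Offer the closest known snippet, or ask the user to clarify."""
    match = difflib.get_close_matches(pseudocode_line,
                                      list(SNIPPET_LIBRARY), n=1, cutoff=0.5)
    if match:
        return "Did you mean: " + SNIPPET_LIBRARY[match[0]]
    return "No close match - please clarify."

print(suggest("sort list descending"))   # close enough to the first entry
print(suggest("fetch a url"))            # nothing similar - ask the user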
People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.
-Neil

Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program, and it could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html
Note that the later language C++, when it resolves possible ambiguities, limits the scope of the type coercions it tries, and it will flag an error if there is not a unique best interpretation.

I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.

Does code already exist that recognises the programming language of a text file?
Yes, the Unix file command.
(Surely this must be a less complicated task than Eclipse's syntax trees or Google Translate's language-guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:
def median(seq):
    """Returns the median of a list."""
    seq_sorted = sorted(seq)
    if len(seq) & 1:
        # For an odd-length list, return the middle item
        return seq_sorted[len(seq) // 2]
    else:
        # For an even-length list, return the mean of the 2 middle items
        return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2
Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.
This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations (illustrated in the snippet after this list) are:
Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?
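My own illustration of how Python alone happens to settle those three cases; other languages choose differently, which is exactly the problem:

print(2 ^ 3)         # 1 - bitwise XOR (as in C); BASIC would read ^ as exponentiation
print(2 ** 3)        # 8 - Python spells exponentiation as **
print(True or "x")   # True - 'or' is logical here; SQL's || would concatenate strings
try:
    print(1 + "2")   # Python refuses to guess...
except TypeError as err:
    print("TypeError:", err)   # ...while JavaScript answers "12" and Perl answers 3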
Is there any language or interpreter in existence that can manage this kind of flexible interpreting?
On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line
= Y + Z
The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.
This programmer had a convention of starting comment blocks with a line of hyphens, like this:
C ----------------------------------------
But one day, they forgot the leading C, and the compiler choked trying to combine dozens of variables with what it thought were subtraction operators.
"Flexible parsing" is not always a good thing.

To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."
The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:
Notation "A /\ B" := (and A B).
Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:
syntax expr: .while.().do.().end.while is -> 25;
Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.

Related

Testing membership in context-free language

I'm working on a slot-machine mini-game application. The rules for what constitutes a winning prize are rather complex (n of a kind, n of any kind, specific sequences), and to make matters even more complicated, this code should work for a slot-machine with (n >= 3) reels.
So, after some thought, I believe defining a context-free language is the most efficient and extensible way to go. This way I could define the grammar in an XML file.
So my question is, given a string of symbols S, how do I go about testing if S is in a given Context-Free Language? Would I simply exhaust rules until I'm out of valid rules/symbols, or is there a known algorithm that could help? Thanks.
Also, a language like this seems non-regular, am I correct? I've never been good at proofs, so I've avoided trying.
Any comments on my approach would be appreciated as well.
Thanks.
"...given a string of symbols S, how do I go about testing if S is in
a given Context-Free Language?"
If a string w is in L(G); the process of finding a sequence of production rules of G by which w is derived is call parsing. So, you have to create a parse tree to search for some derivation. To do this you perform an exhaustive Breadth-First-Search. There is a serious issue that arises: The searching process may never terminate. To prevent endless searches you have to transform the grammer into what is known as normal form.
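Concretely, once the grammar is in Chomsky Normal Form you can use the CYK algorithm, which always terminates. A compact sketch; the toy reel grammar below is purely illustrative:

from itertools import product

# Toy grammar in Chomsky Normal Form: S -> A B | A S, A -> "cherry", B -> "bell"
TERMINALS = {"A": {"cherry"}, "B": {"bell"}}
BINARY = {"S": {("A", "B"), ("A", "S")}}

def cyk(words, start="S"):
    """Return True iff the sequence of symbols is derivable from 'start'."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals deriving words[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][0] = {nt for nt, terms in TERMINALS.items() if w in terms}
    for span in range(2, n + 1):                  # length of the substring
        for i in range(n - span + 1):             # where it starts
            for cut in range(1, span):            # where to split it in two
                left = table[i][cut - 1]
                right = table[i + cut][span - cut - 1]
                for nt, rules in BINARY.items():
                    if any(pair in rules for pair in product(left, right)):
                        table[i][span - 1].add(nt)
    return start in table[0][n - 1]

print(cyk(["cherry", "bell"]))             # True
print(cyk(["cherry", "cherry", "bell"]))   # True
print(cyk(["bell", "cherry"]))             # False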
"Also, a language like this seems non-regular, am I correct?"
Not necessarily. Every regular language is context-free (because it can be described by a CFG), but not every context-free language is regular.
General cases of context-free grammars are hard to evaluate.
However, there are methods to parse grammars in subsets of the context-free grammars.
For example: SLR and LL grammars are often used by compilers to parse programming languages, which are also context-free languages. To use these, your grammar must be in one of these "families" (remember - there are an infinite number of grammars for each context-free language).
Some practical tools you might want to use, which are generally used for compilers, are JavaCC in Java and Bison in C++.
(If I remember correctly, Bison is an SLR parser and JavaCC is an LL parser, but I could be wrong)
P.S.
For a specific slot machine, with n slots and k symbols, the language is definitely regular, since there are at most k^n "words" in it, and every finite language is regular. Things obviously get complicated if you are looking for a grammar covering all slot machines.
Your best bet is to actually code this in a proper programming language. A CFG is overkill, because it can be extremely hard to encode some of the, as you say, "rather complex" rules. For example, grammars are poorly suited to talking about the number of things.
For example, how would you code "the number of cherries is > the number of any other object" in such a language? How would the person you're giving the program to do so? CFGs cannot easily express such concepts, and regular expressions cannot sanely do so by any stretch.
The answer is that grammars are not right for this task, unless the slot machine is trying to make English sentences.
You also have to consider what happens when TWO or more "prize sequences" match! Assuming you want to give out the highest prize, you need an ordered list of recognizers. This is not to say you can't code your recognizers with (for example) regular expressions in addition to arbitrary functions. I'm just saying that general CFG parsing is overkill, because what CFGs get you over regular languages (i.e. regular expressions) is the ability to consider parse trees of arbitrary depth (like nested parentheses of level N or more), which is probably not what you care about.
This is not to say that you don't, for example, want to allow regular expressions. You can make that job easy by using a parser generator (see http://en.wikipedia.org/wiki/Comparison_of_parser_generators) to recognize regexes involving cherries, bananas and pears, which you can then embed; though you might want to simply roll your own recursive descent parser (assuming again that you don't care about CFGs, especially if your tokens are of bounded length).
For example, here is how I might implement it in pseudocode (ideally you'd use a statically typechecked language with good list manipulation, which I can't think of off the top of my head):
rules = []

function Rule(name, code) {
    this.name = name
    this.code = code
    rules.push(this) # adds them in order
}

##########################

Rule("All the same", regex(.*))

Rule("No two-in-a-row", function(list, counts) {
    not regex(.{2}).match(list)
})

Rule("More cherries than anything else", function(list, counts) {
    counts[cherries] > counts[x] for all x in counts
    or
    sorted(counts.items())[0] == cherries
    or
    counts.greatest() == cherries
})

for token in [cherry, banana, ...]:
    Rule("At least 50% "+token, function(list, counts){
        counts[token] >= list.length/2
    })
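If you want something that actually runs, here is one way the same idea could look in Python; the rule names and symbols are illustrative, and rules are checked in registration order so the first match wins:

from collections import Counter

RULES = []   # checked in registration order, so the first match wins

def rule(name):
    """Decorator that registers a predicate over (reels, symbol counts)."""
    def register(predicate):
        RULES.append((name, predicate))
        return predicate
    return register

@rule("All the same")
def all_same(reels, counts):
    return len(counts) == 1

@rule("More cherries than anything else")
def mostly_cherries(reels, counts):
    return all(counts["cherry"] > n for sym, n in counts.items() if sym != "cherry")

@rule("At least 50% bananas")
def half_bananas(reels, counts):
    return counts["banana"] >= len(reels) / 2

def best_prize(reels):
    counts = Counter(reels)
    for name, predicate in RULES:
        if predicate(reels, counts):
            return name
    return "No prize"

print(best_prize(["cherry", "cherry", "cherry"]))   # All the same
print(best_prize(["cherry", "cherry", "banana"]))   # More cherries than anything else
print(best_prize(["banana", "banana", "pear"]))     # At least 50% bananas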

Why doesn't Haskell have symbols (a la ruby) / atoms (a la erlang)?

The two languages where I have used symbols are Ruby and Erlang and I've always found them to be extremely useful.
Haskell does have algebraic datatypes, but I still think symbols would be mighty convenient. An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
The syntactic sugar for atoms can be minor - :something or <something> is an atom. All atoms are instances of a Type called Atom which derives Show and Eq. You can then use it for more descriptive error codes, for example
type ErrorCode = Atom
type Message = String
data Error = Error ErrorCode Message
loginError = Error :redirect "Please login first"
In this case :redirect is more efficient than using a string ("redirect") and easier to understand than an integer (404).
The benefit may seem minor, but I say it is worth adding atoms as a language feature (or at least a GHC extension).
So why have symbols not been added to the language? Or am I thinking about this the wrong way?
I agree with camccann's answer that it's probably missing mainly because it would have to be baked quite deeply into the implementation and it is of too little use for this level of complication. In Erlang (and Prolog and Lisp) symbols (or atoms) usually serve as special markers and serve mostly the same notion as a constructor. In Lisp, the dynamic environment includes the compiler, so it's partly also a (useful) compiler concept leaking into the runtime.
The problem is the following: symbol interning is impure (it modifies the symbol table). It is still referentially transparent, because we never modify an existing object, but if implemented naïvely it can lead to space leaks in the runtime. In fact, as currently implemented in Erlang you can actually crash the VM by interning too many symbols/atoms (the current limit is 2^20, I think), because they can never get garbage collected. It's also difficult to implement in a concurrent setting without a huge lock around the symbol table.
Both problems can be (and have been) solved, however. For example, see Erlang EEP 20. I use this technique in the simple-atom package. It uses unsafePerformIO under the hood, but only in (hopefully) rare cases. It could still use some help from the GC to perform an optimisation similar to indirection shortening. It also uses quite a few IORefs internally which isn't too great for performance and memory usage.
In summary, it can be done but implementing it properly is non-trivial. Compiler writers always weigh the power of a feature against its implementation and maintenance efforts, and it seems like first-class symbols lose out on this one.
I think the simplest answer is that, of the things Lisp-style symbols (which is where both Ruby and Erlang got the idea, I believe) are used for, in Haskell most are either:
Already done in some other fashion--e.g. a data type with a bunch of nullary constructors, which also behave as "convenient names for integers".
Awkward to fit in--things that exist at the level of language syntax instead of being regular data usually have more type information associated with them, but symbols would have to either be distinct types from each other (nearly useless without some sort of lightweight ad-hoc sum type) or all the same type (in which case they're barely different from just using strings).
Also, keep in mind that Haskell itself is actually a very, very small language. Very little is "baked in", and of the things that are most are just syntactic sugar for other primitives. This is a bit less true if you include a bunch of GHC extensions, but GHC with -XAndTheKitchenSinkToo is not the same language as Haskell proper.
Also, Haskell is very amenable to pseudo-syntax and metaprogramming, so there's a lot you can do even without having it built in. Particularly if you get into TH and scary type metaprogramming and whatever else.
So what it mostly comes down to is that most of the practical utility of symbols is already available from other features, and the stuff that isn't available would be more difficult to add than it's worth.
Atoms aren't provided by the language, but can be implemented reasonably as a library:
http://hackage.haskell.org/package/simple-atom
There are a few other libs on hackage, but this one looks the most recent and well-maintained.
Haskell uses type constructors* instead of symbols so that the set of symbols a function can take is closed, and can be reasoned about by the type system. You could add symbols to the language, but it would put you in the same place that using strings would - you'd have to check all possible symbols against the few with known meanings at runtime, add error handling all over the place, etc. It'd be a big workaround for all the compile-time checking.
The main difference between strings and symbols is interning - symbols are atomic and can be compared in constant time. Both are types with an essentially infinite number of distinct values, though, and against the grain of Haskell's specifying arguments and results with finite types.
* I'm more familiar with OCaml than Haskell, so "type constructor" may not be the right term. Things like None or Just 3.
An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
Use Enum instead.
data FileType = GZipped | BZipped | Plain
  deriving Enum

descr ft = ["compressed with gzip",
            "compressed with bzip2",
            "uncompressed"] !! fromEnum ft

Are there good alternative Scheme syntaxes?

I imagine Scheme (and perhaps Lisp) could be made more `user friendly' by using a different syntax. For example, instead of nested S-expressions with ugly parentheses, one could devise some kind of syntax closer to some of the more widely used languages (e.g. Java-like without needing to define classes).
It's not necessarily a bad thing if it's more verbose. For example, the syntax may require line separators and commas in the places where many people will expect them, and expect explicit return statements. Also, it doesn't seem that difficult to allow some operators to be used infix style (just obey the generally accepted operator preference rules).
And if it doesn't make things too messy, the syntax could even be backwards-compatible, so that in any place where an expression is expected, a normal S-expression between parentheses can be used.
What are your opinions and ideas about this? And does anything like this exist? (I expect it does, but "Scheme" is a worthless google term, I can't find anything!)
Originally, Lisp was planned to use a syntax called M-Expressions, with S-Expressions being only a transitional solution for easier compiler building. When M-Expressions were ready to be introduced, the programmers who had already taken on Lisp just stayed with what they had become accustomed to, and M-Expressions never caught on.
There is an infix notation in Guile, but it's rarely used. A good Lisp programmer doesn't even see the parens anymore, and prefix notation does have its merits...
I think "sweet expressions" might be one of the more thoughtful approaches to getting rid of the parentheses in Lisp. It apparently even supports macros.
http://www.dwheeler.com/readable/sweet-expressions.html
However, I think most people eventually get over the parentheses or use another language.
Take a look at "sweet-expressions", which provides a set of additional abbreviations for traditional s-expressions. They add syntactically-relevant indentation, a way to do infix, and traditional function calls like f(x). Unlike nearly all past efforts to make Lisps readable, sweet-expressions are backwards-compatible (you can freely mix well-formatted s-expressions and sweet-expressions), generic, and homoiconic.
Sweet-expressions were developed on http://readable.sourceforge.net and there is a sample implementation.
For Scheme there is an SRFI for sweet-expressions: http://srfi.schemers.org/srfi-110/
Try SRFI 49 for size. :-P
(Seriously, though, as Rafe commented, "I don't think anybody wants this".)
Some people consider Python to be a kind of Scheme with infix notation for operators, algebraic notation for functions and which uses a more "java-like" syntax for representing the language. I don't agree with that assessment, but I can see where the idea comes from.
The big problem with changing the notation for Scheme is that macros become very hard to write (to see how hard, take a look at the Nimrod language or Boo). Instead of working directly with the code as lists, you have to parse the input language first. This usually involves constructing an AST (abstract syntax tree) for the language from the input. When working directly with Scheme, this is unnecessary.
However, you might check out the SIX expression syntax in Gambit Scheme. There's a nice set of slides here which contains a discussion of this:
http://www.iro.umontreal.ca/~gambit/Gambit-inside-out.pdf
But don't tell anyone about it! (The inside joke is that someone suggests writing a Lisp without parentheses and with infix notation about once a day, and someone announces an implementation about once a month.)
There are some languages that do exactly that. For instance: Dylan.

What makes a language readable or not readable? [closed]

I heard people say they can understand their python code a year later but not their XYZ code. Why? I don't know what is good about python syntax or what is bad about another. I like C#, but I have a feeling VB.NET code is easier to read. I am doing language design, so what do you find makes code/syntax/language readable or not readable?
Experience.
IMO, one of the big things is significant white space. Block indentation goes a long way, and languages like Python and F# that make white space significant can help with readability.
Code in Java and C# tends to be structured, and readability becomes a focal point of how it was coded to begin with, not of the language itself.
Code is readable when it's written in a style that explicitly states what you want to do.
This only depends on the language insofar as:
it allows you to express what you want (functional programming!)
it doesn't encourage cryptic statements
The rest depends on the style you use to write code (even Perl can be understandable!), but certain languages make it easier to write hacky statements.
Clear:
expr = if not (ConditionA and ConditionB) then ... else ...
Unclear:
expr = (!(conditionA && conditionB)) ? ... : ...
Clear:
foreach line in lines:
    if (line =~ /regex/):
        // code
Unclear:
... if /regex/ foreach(@lines);
Clear:
x = length [ x | x <- [1..10], even x ]
Unclear:
int x = 0;
for (int i = 1; i <= 10; ++i)
    if ((i & 1) == 0) ++x;
Generally what makes Python considered readable is that it forces a standardized indentation. This means that you'll never be forced to wonder whether you're in an if block or a function; it is clear as day. Even poorly written code therefore becomes obvious.
One language which I generally consider difficult to read is PHP, for the same reason (or rather, its opposite). Since programmers are allowed to indent at will, and store variables anywhere, it can get convoluted very quickly. Further, since PHP historically did not have case-sensitive function names (PHP < 4.4.7 I believe), there really isn't any consistency in the implementation of the core language either... (Don't get me wrong, I like the language, but a bad coder can REALLY make a mess).
JavaScript also has a lot of problems with undisciplined developers. You'll find yourself wondering where variables have been defined and what scope you're in. Code will not be in one consolidated place, but rather spread across multiple files, and often lurking where unexpected.
ActionScript 3 is a bit better. Generally, there has been a move to have everyone use similar syntax, and Adobe has gone so far as to define its standards and make them accessible and common. It does not take much to see how the ECMAScript implementation which is supported by a for-profit company is superior to the generalized one.
Readability is a function that takes a lot of inputs. I don't think it's really possible to compile a full list of things that can affect a language's readability. The most general way to describe it is "minimizing cognitive load." A few major factors:
Subtleties of meaning. If two code snippets look very similar at a glance but do different things, it hurts readability because the reader has to stop and deduce what's actually happening.
Meaningless code — aka boilerplate. This doesn't necessarily mean code that does nothing, but code that doesn't tell me anything about what we're actually doing. Every bit of code that doesn't express the actual intent of a function or object reduces readability by that much.
Cramming meaning — aka golf. This is the opposite of the boilerplate problem. It's possible to compress code so far that the reader is forced to stop and examine it pretty much character by character. The exact line where this occurs is somewhat subjective (which is part of why some people love Perl and some people hate it), but it's definitely a real phenomenon.
The programmer makes code readable or unreadable, not the language. Thinking otherwise is just fooling yourself. This is because the only people who are qualified to judge readability are those who know the language. To the non-programmer, all languages are equally unreadable.
I heard people say they can understand their python code a year later but not their XYZ code. Why?
Firstly, I don't think that people say that based solely on syntax. There are a lot of other factors to take into consideration, to name just a few:
The fact that some languages tend to promote only one right way to do something (like Python), and others promote many different ways (Ruby for example, from what I hear [disclaimer: I am not a Ruby programmer])
The libraries the language has. The better designed ones tend to be incredibly easy to understand without needing documentation, and this also tends to help remember. A language with good libraries will therefore make things easier.
Having said that, my personal take on Python is the fact that many people call it "executable pseudo-code". It supports a wide variety of constructs that tend to appear in pseudo-code and, by extension, are the standard way to think about things.
Also, Python's un-C-like syntax, one of the features that makes it so disliked by so many people, makes it look more like pseudocode.
Well, that's my take on Python's readability.
To be honest, when it comes to what makes a language readable, it really seems to boil down to a combination of simplicity and personal preference. (Of course, it is always possible to write unreadable code in any language if you try hard enough.) Since personal preference can't really be controlled, it comes down to ease of expression - the more complicated it is in a language to use simple features, the more difficult that language is likely to be in general from a readability standpoint.
Requiring a word where one character will suffice - a dig at Pascal and VB.
Compare:
Block ()
Begin
    // Content
End

vs.

Block
{
    // Content
}
It requires extra brain processing to read a word and mentally associate it with a concept, while a single symbol is immediately recognized by its image.
It is the same as the difference between natural languages: ordinary textual languages vs. symbolic languages with hieroglyphs (the Asian group). Processing the first group is slower because a text has to be parsed into a set of concepts, while hieroglyphs represent concepts themselves. Compare it with what you already know - will serialization/deserialization from XML be faster than a custom search over a binary format?
IMHO, the more a computer language resembles a spoken language, the more readable it is. For extreme examples, take languages like J or Whitespace or Brainfuck... completely unreadable to the untrained eye.
But a language that resembles English can be more easily understood. Not that this makes it the best language, as COBOL can attest.
I think it has more to do with the person writing the code rather than the actual language itself. You can write very readable code in any language, and unreadable code in any language. Even a complex Regular expression can be formatted and commented so as to make it easy to read.
A coworker of mine used to have a saying: "You can write crap code in any language." I liked it and wanted to share it today. What makes code readable? Here are my thoughts:
The ability to read the syntax of the language.
Well formatted code.
Meaningfully named variables and functions
Comments to explain complex processing. Beware: too many comments can make the code hard to read.
Short functions are easier to read than long ones.
None of these have anything to do with the language, it's all about the coder, and the quality of their work.
I would say that code is readable by virtue of its simplicity.
You should get at first sight what it does and what its purpose is. Why write a thousand lines of code when only a few do what is required?
This is the spirit of a functional language like F#, for instance.
For me it's mainly a question of whether the language allows you to develop more readable abstractions which prevent you getting lost in the details.
This is where OOP comes in very handy with the hiding of details. If I can hide the details of a task behind an interface that has the behavior of a common concept (e.g. iterators in C++), I usually don't have to read the implementation details.
I think language design (for normal languages, not Brainfuck :) ) doesn't matter that much. To make code readable you should follow standards and code conventions, and don't forget about refactoring.
It's all about clean code.
Keep it small, simple, well named, and formatted.
class ShoppingCart {
    def add(item) {
        println "you added some $item"
    }

    def remove(item) {
        println "you just took out the $item"
    }
}

def myCart = new ShoppingCart()

myCart.with {
    add "juice"
    add "milk"
    add "cookies"
    add "eggs"
    remove "cookies"
}
The literacy level of the reader.
Two distinct aspects, really. First is syntax and whitespace. Python enforces a whitespace standard, dropping unnecessary {, } and ; characters. This makes it easy on the eyes. Second, and most importantly, clarity of expression - i.e. how easy it is to map code back to the way you think. There are several features (and non-features) in programming languages that contribute to the latter point:
Disallowing jumps. The goto statement in C is a typical example. Code that doesn't keep jumping out of structured blocks is easier to read.
Minimizing side-effects. Global variables are evil, remember?
Using more tailored functions. How can your head track a for loop with 5 iteration variables? The Common Lisp loop is much easier to read (although VERY difficult to write, but that's a different story)
Lexical closures. You can figure out a variable's value by just looking at it, as opposed to running the code in your head, and then figuring out which statement is shadowing which.
A couple of examples:
(loop
   for tweet = (master-response-parser (twitter-show-status tweet-id))
   for tweet-id = tweet-id then (gethash in-reply-to tweet)
   while tweet-id
   collecting tweet)
and
listOfFacs = [x | x <- [1 ..], x == sumOfFacDigits x]
  where sumOfFacDigits n = sum [factorial (read [d] :: Integer) | d <- show n]
        factorial k = product [1 .. k]
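And a small Python toy example of my own for the lexical-closure point: the value a name refers to is pinned down where you read it, instead of depending on whatever a shared variable happens to hold when the code finally runs.

def make_greeter(greeting):
    def greet(name):
        # 'greeting' is fixed by the enclosing call - readable at a glance
        return greeting + ", " + name + "!"
    return greet

hello = make_greeter("Hello")
hi = make_greeter("Hi")
print(hello("Ada"))    # Hello, Ada!
print(hi("Grace"))     # Hi, Grace!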
Concerning the syntax, I think it is imperative that it be fairly descriptive. For instance, in many languages, you have the foreach statement, and each one handles it a bit differently.
// PHP
foreach ($array as $variable) ...
// Ruby
array.each{ |variable| ... }
// Python
for variable in array: ...
// Java
for (String variable : array)
Honestly, I feel that PHP and Python have the clearest means of understanding, but, it all boils down to how smart and clear the programmer wants to be. For instance, a bad programmer could write the following:
// PHP
foreach ($user as $_user) ...
My guess is that you would have almost no idea what the heck the code is doing unless you tracked back and attempted to figure out what $user was and why you were iterating over it. Being clear and concise is all about making small chunks of code make sense without having to trace back through the program to figure out what variables/function names are.
Also, I would have to completely agree with whitespace. Tabs, newlines and spacing in-between operators really make a huge difference!
Edit 1: I might also interject that some languages have the syntax and tools readily available to make things more clear. Take Ruby for example:
if [1,2,3].include? variable then ... end
versus, say, Java:
if (variable == 1 || variable == 2 || variable == 3) { ... }
One of these (IMHO) is certainly more clear and readable than the other.

Avoiding Mixup of Language Details

Today someone asked me what was wrong with their source code. It was obvious. "Use double equals in place of that single equal in that if statement. Um, I think..." As I remember some languages actually take a single equals for comparison. Since I sometimes forget or mix up the syntax details among the several languages I use, I stepped over to my laptop to try a quickie experiment.
It costs a bit of time and is a break in the flow to try "quick" experiments (though maybe the practice is good for memory.) What tips do you have for keeping straight in your mind the syntax (and other) details of multiple languages?
(And nowadays, this applies just as well to the many wiki-like markups!)
To me, the hardest part isn't the syntax - usually you get into the right mode when looking at the code you're working on. The really hard part is remembering the library of the language so you don't go reinventing the wheel over and over again. Now if only people would organize their help files so it was easy to search for particular things in the library.
IDEs that can draw red and yellow squiggles can help, until you develop that mental muscle memory.
One of the annoying things with XCode (for Cocoa/ObjectiveC) is that you don't get said squiggles until you compile. (As opposed to Eclipse/Java where you get live squiggles).
In my case it's just experience. I think once you code in a language for long enough your brain seems to be able to do language-context-switching with it.
Indeed, on SO I advised someone not to forget to avoid if (a = b) in Java, and someone reminded me that it is legal only if a and b are boolean! Of course, the advice is good for C, C++, JavaScript and a number of other C-like languages.
Likewise, I realized only recently that var v in JavaScript has function-level scope only, not brace-level scope.
Somehow, that's the pitfall of having similar syntaxes, but different behaviors.
As an anecdote, some people on the Lua mailing list complain that this language isn't C-like, with the terse and familiar curly braces, the += and ++, the bitwise operators. They say it hurts adoption of the language, because people are more familiar with C-like syntax.
That's nonsense: Basic was (and still is) widely used with its verbose syntax, and so is Pascal (Delphi). And a lot of people find the Lua syntax readable and easy to learn, good for those not familiar with programming (game AI specialists, for example).
Moreover, and to the point, Lua is designed to be integrated into C/C++ programs and to be extended with C[++] functions. And people say the quite different syntaxes help with the mindset shift.

Resources