Unicode in Bourne Shell source code - shell

Is it safe to use UTF-8, and not just the 7-bit ASCII subset, in modern Bourne Shell interpreters, be it in comments (e.g., using box-drawing characters), or when passing arguments to a function or program? I consider the question of whether filesystems can safely handle Unicode in path names to be outside the scope of this question.
I know at least to not put a BOM in my shell scripts… ever, as that would break the kernel's shebang line parsing.
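(For reference, one way to check a script for a stray BOM, assuming dd and od are available; the bytes ef bb bf at the very start mean a UTF-8 BOM is present:)
dd if=script.sh bs=1 count=3 2>/dev/null | od -An -tx1    # "ef bb bf" = UTF-8 BOM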

The thing about UTF-8 is that any old code that's just passing string data along and uses the C string convention of terminating strings with a null byte works fine. That generally characterizes how the shell handles command names and arguments.
Even if the shell does some string processing with special meanings for ASCII characters, UTF-8 still mostly works fine because ASCII characters encode exactly the same in UTF-8. So for example the shell will still be able to recognize all its keywords and syntax characters like []{}()<>/.?;'"$&* etc. That characterizes how string literals and other syntax bits of a script should be handled.
You should be able to use UTF-8 in comments, string literals, command names, and command arguments. (Of course the system will have to support UTF-8 file names to have UTF-8 commands, and the commands will have to handle UTF-8 command line arguments.)
You may not be able to use UTF-8 in function names or variable names, since the shell may only accept ASCII characters there. If your locale is UTF-8, an interpreter that uses the locale-based character classification functions internally might accept UTF-8 identifiers as well, but that's probably not portable.
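As a rough sketch (assuming a UTF-8 locale and a UTF-8-capable terminal; the variable name is deliberately plain ASCII, since a UTF-8 identifier may or may not be accepted):
#!/bin/sh
# ── box-drawing characters in a comment ──
greeting='héllo wörld'        # UTF-8 in a string literal
printf '%s\n' "$greeting"     # UTF-8 passed as a command argument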

It really depends on what you are trying to do... In general, plain vanilla Bourne-derived shells cannot handle Unicode characters inside the scripts, which means your script text must be plain 7-bit ASCII if you care about portability. At the same time pipes are completely encoding neutral, so you can have things like a | b where a outputs UTF-8 and b receives it. So, assuming find is capable of handling UTF-8 paths and your processing tool for them can work with UTF-8 strings, you should be OK.
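For instance, something along these lines should be fine, with grep standing in for whatever UTF-8-aware processing tool you actually use:
# find emits UTF-8 path names; the pipe passes the bytes through untouched
find . -type f -name '*.txt' | grep 'Düsseldorf'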

Multi-byte support was added to the Bourne Shell in 1989, and given that Unicode was introduced in 1992, you cannot expect UTF-8 from a shell that is older than Unicode. SunOS introduced Unicode support when it became available.
So any Bourne Shell that was derived from the SVr4 Bourne Shell and compiled and linked to a modern library environment should support UTF-8 in scripts.
If you would like to verify that, you can get a portable version of the OpenSolaris Bourne Shell from the schily-tools: http://sourceforge.net/projects/schilytools/files/
osh is the original Bourne Shell, just made portable.
sh is the Bourne Shell with modern enhancements.

Related

Escaping for proper shell injection prevention

I need to run some shell commands from a Lua interpreter embedded into another Mac/Windows application where shell commands are the only way to achieve certain things, like opening a help page in a browser. If I have a list of arguments (which might be the result of user input), how can I escape each argument to prevent trouble?
Inspired by this article, an easy solution seems to be to escape all non-alphanumeric characters, on Unix-like systems with \, and on Windows with ^. As far as I can tell, this prevents any argument from causing
execution of another command because of an intervening newline, ; (Unix) or & (Windows)
command substitution on Unix with $ or `
variable evaluation on Windows with %
redirection with <, | and >
In addition, any character that on the respective platform works as escape character will be escaped properly.
This seems sound to me, but are there any pitfalls I might have missed? I know that in bash, \ followed by a newline will effectively remove the newline, which is not a problem here.
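To illustrate the Unix side, with every non-alphanumeric character backslash-escaped the whole thing stays a single literal argument (printf is only a stand-in for the real program here):
printf '[%s]\n' foo\;\ rm\ \-rf\ \~    # prints [foo; rm -rf ~] -- nothing is executed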
EDIT
My conclusion: There is no single mechanism that works on both Windows and *nix simply by swapping the escape character. It turns out it is not so straightforward to make sure that a Windows program actually sees the command-line arguments we want it to see, because splitting the command string into arguments on Windows is not handled by the shell but by the called program itself.
Therefore two layers of escaping need to be taken into account:
First, the Windows shell will process what we give it. What it might do is variable substitution at %, splitting into multiple commands at &, or piping to another command at |.
Then it will hand a single command string to the called program, which the program itself will split, ideally (but not necessarily) following the rules described by Microsoft.
Assuming it follows those rules, one can work one's way backwards, first escaping according to those rules, then escaping further for the shell.
Calling sub-processes with dynamic arguments is prone to error and danger, and many languages don't provide good mechanisms to protect the developer. In Python, for example, os.system() is no longer recommended; instead, the subprocess module provides a proper mechanism for safely making system calls. In particular, you pass subprocess.run() a list of arguments rather than a single string, thereby avoiding the need to implement any error-prone escaping in the first place.
A quick search for a subprocess-like tool for Lua turned up lua-subprocess, which doesn't appear to be actively developed, but it might still be better than trying to implement proper escaping yourself.
If you must do so, take a look at the Python code for shlex.quote() (source) - it properly escapes an input string for use "in a shell command line":
# use single quotes, and put single quotes into double quotes
# the string $'b is then quoted as '$'"'"'b'
return "'" + s.replace("'", "'\"'\"'") + "'"
You ought to be able to replicate that in Lua.
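If it helps, the effect of that quoting scheme is easy to check directly in a shell (this is just a demonstration of the quoting, nothing Lua-specific):
printf '%s\n' '$'"'"'b'    # prints $'b -- the $, the quote and the b all come through literally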

Is Bash an interpreted language?

From what I've read so far, bash seems to fit the definition of an interpreted language:
it is not compiled into a lower format
every statement ends up calling a subroutine / set of subroutines already translated into machine code (e.g., echo foo calls a precompiled executable)
the interpreter itself, bash, has already been compiled
However, I could not find a reference to bash on Wikipedia's page for interpreted languages, or by extensive searches on Google. I've also found a page on Programmers Stack Exchange that seems to imply that bash is not an interpreted language; if it's not, then what is it?
Bash is definitely interpreted; I don't think there's any reasonable question about that.
There might possibly be some controversy over whether it's a language. It's designed primarily for interactive use, executing commands provided by the operating system. For a lot of that particular kind of usage, if you're just typing commands like
echo hello
or
cp foo.txt bar.txt
it's easy to think that it's "just" for executing simple commands. In that sense, it's quite different from interpreted languages like Perl and Python which, though they can be used interactively, are mainly used for writing scripts (interpreted programs).
One consequence of this emphasis is that its design is optimized for interactive use. Strings don't require quotation marks, most commands are executed immediately after they're entered, most things you do with it will invoke external programs rather than built-in features, and so forth.
But as we know, it's also possible to write scripts using bash, and bash has a lot of features, particularly flow control constructs, that are primarily for use in scripts (though they can also be used on the command line).
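For instance, the same flow-control construct works both in a script file and typed directly at the prompt (a throwaway example, assuming some *.txt files exist in the current directory):
for f in *.txt; do wc -l "$f"; done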
Another distinction between bash and many scripting languages is that a bash script is read, parsed, and executed in order. A syntax error in the middle of a bash script won't be detected until execution reaches it. A Perl or Python script, by contrast, is parsed completely before execution begins. (Things like eval can change that, but the general idea is valid.) This is a significant difference, but it doesn't mark a sharp dividing line. If anything it makes Perl and Python more similar to compiled languages.
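You can see this with a throwaway two-line script; bash happily runs the first line and only complains when it reaches the second:
echo "this line runs first"
if then fi    # syntax error, but it is only reported after the echo has already executed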
Bottom line: Yes, bash is an interpreted language. Or, perhaps more precisely, bash is an interpreter for an interpreted language. (The name "bash" usually refers to the shell/interpreter rather than to the language that it interprets.) It has some significant differences from other interpreted languages that were designed from the start for scripting, but those differences aren't enough to remove it from the category of "interpreted languages".
Bash is an interpreter according to the GNU Bash Reference Manual:
Bash is the shell, or command language interpreter, for the GNU operating system.

Can curses functions like mvprintw() work with the usual ASCII codes?

I've developed a little console C++ game that uses ASCII graphics, using cout for the moment. But because I want to make things work better, I have to use PDCurses. The thing is, curses functions like printw() or mvprintw() don't use the regular ASCII codes, and for this game I really need to use the smiley characters, hearts, spades and so on.
Is there a way to make curses work with the regular ASCII codes?
You shouldn't think of characters like the smiley face as "regular ASCII codes", because they really aren't ASCII at all. (ASCII only covers characters 32 through 126, plus the control codes below 32 and DEL at 127.) They're a special case, and the only reason you're able to see them in (I assume?) your Windows CMD shell is that it's maintaining backwards compatibility with IBM Code Page 437 (or similar) from ancient DOS systems. Meanwhile, outside of the DOS box, Windows uses a completely different mapping, Windows-1252 (a modified version of ISO-8859-1), or similar, for its 8-bit, so-called "ANSI" character set. But both of these types of character sets are obsolete, compared to Unicode. Confused yet? :)
With curses, your best bet is to use pure ASCII, plus the defined ACS_* macros, wherever possible. That will be portable. But it won't get you a smiley face. With PDCurses, there are a couple of ways to get that smiley face: If you can safely assume that your console is using an appropriate code page, then you can pass the A_ALTCHARSET attribute, or'ed with the character, to addch(); or you can use addrawch(); or you can call raw_output(TRUE) before printing the character. (Those are all roughly equivalent.) Alternatively, you can use the "wide" build of PDCurses, figure out the Unicode equivalents of the CP437 characters, and print those, instead. (That approach is also portable, although it's questionable whether the characters will be present on non-PCs.)

Ruby system() doesn't accept UTF-8?

I am using Ruby 1.9.3 on Windows and trying to perform an action where I write filenames to a file, one per line (we'll call it a filelist), and then later read this filelist and call system() to run another program, passing it a filename from the filelist. That program, which I'm calling with system(), takes the filename I pass it and converts it to a binary format to be used in a proprietary system.
Everything works up to the point of calling system(). I have a UTF-8 filelist, and reading the filename from the filelist is giving me the proper result. But when I run
system("c:\foo.exe -arg #{bar}")
the arg "bar" being passed is not in UTF-8 format. If I run the program manually with a Japanese, chinese, or whatever filename it works fine and codes the file correctly, but if I do it using system() it won't. I know the variable in bar is stored properly because I use it elsewhere without issue.
I've also tried:
system("c:\foo.exe -arg #{bar.encoding("UTF-8")}")
system("c:\foo.exe -arg #{bar.force_encoding("UTF-8")}")
and neither works. I can only assume the issue here is passing Unicode to system().
Can someone else confirm whether system() does, in fact, support this or not?
Here is the block of code:
$fname.each do |file|
  flist.write("#{file}\n") # This is written properly in UTF-8
  system("ia.exe -r \"#{file}\" -q xbfadd") # The file being passed here is not encoding right!
end
Ruby's system() function, like that in most scripting languages, is a veneer over the C standard library system() call. The MS C runtime uses Win32 ANSI APIs for all the byte-oriented C stdlib functions.
The ANSI APIs use the Windows system locale (aka 'ANSI codepage') to map between byte-oriented strings and Windows's native UTF-16LE strings, which are used for filenames and shell commands. Unfortunately, it is impossible to set the system locale to UTF-8; you can set the codepage to 65001 (Windows's equivalent of UTF-8) on a particular console, but the MS CRT has long-standing bugs in its handling of code page 65001 which make a lot of applications fail.
So using the standard cross-platform byte-oriented C interfaces means you can't support Unicode filenames or shell commands, which is rather sad. Some scripting languages have added support for Unicode filenames by calling the Win32 'W' (Unicode) APIs explicitly instead of the C stdlib interfaces. Ruby 1.9.x is making progress in this area, but system() has not been looked at yet.
You can fix it by calling the Win32 API yourself, for example CreateProcessW, but it's not especially pretty.
I upvoted bobince's answer; I believe it is correct.
The only thing I'd add is that an additional workaround, this being a Windows problem, is to write the command line out to a batch file and then use system() to call the batch file.
I used this approach to successfully work around the problem while running Calibre's ebook-convert command-line tool for a book with UTF-8/non-English characters in its title.
I think that bobince's answer is correct, and the solution that worked for me was:
system("c:\\foo.exe -arg #{bar.encode("ISO-8859-1")}")

Safe way to localize bash scripts?

In the BashFAQ on Greg's Wiki, the following is written:
Don't mark strings that contain variables or other substitutions.
and
Bash (at least up through 4.0) performs locale expansion before other substitutions. Thus, in a case like this:
echo "The answer is $answer"
The literal string $answer will become part of the marked string.
Now I can understand that using variables in strings marked as translatable is security-wise dangerous as described in http://www.gnu.org/software/gettext/manual/html_node/bash.html.
However, neither removing the variables nor splitting the strings is viable, as this makes the translation difficult/impossible (because of the different sentence structure in e.g. Russian, French, German and English).
So my question is: Does any sane and safe way of bash localization exist, or does one have to use a more expressive programming language (like Python, Ruby or Perl) when it comes to localization?
http://www.linuxtopia.org/online_books/advanced_bash_scripting_guide/localization.html looks like a good tutorial for Bash localization using gettext, but I have not used it.
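For what it's worth, a minimal sketch of that approach (assuming GNU gettext is installed and a compiled myscript.mo catalog exists under the chosen TEXTDOMAINDIR; both names here are placeholders) looks roughly like this. eval_gettext performs the variable substitution after the translation lookup, which is what makes marked strings containing variables workable:
#!/bin/bash
. gettext.sh                        # provides gettext, eval_gettext, ...
TEXTDOMAIN=myscript                 # placeholder catalog name
TEXTDOMAINDIR=/usr/share/locale     # where myscript.mo would live
export TEXTDOMAIN TEXTDOMAINDIR

answer=42
eval_gettext "The answer is \$answer"; echo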
