Ruby system() doesn't accept UTF-8?

I am using Ruby 1.9.3 on Windows and trying to perform an action where I write filenames to a file, one per line (we'll call it a filelist), and then later read this filelist and call system() to run another program, passing it a filename from the filelist. The program I'm calling with system() takes the filename I pass it and converts it to a binary format used by a proprietary system.
Everything works up to the point of calling system(). I have a UTF-8 filelist, and reading the filename from the filelist is giving me the proper result. But when I run
system("c:\foo.exe -arg #{bar}")
the arg "bar" being passed is not in UTF-8 format. If I run the program manually with a Japanese, chinese, or whatever filename it works fine and codes the file correctly, but if I do it using system() it won't. I know the variable in bar is stored properly because I use it elsewhere without issue.
I've also tried:
system("c:\foo.exe -arg #{bar.encoding("UTF-8")}")
system("c:\foo.exe -arg #{bar.force_encoding("UTF-8")}")
and neither works. I can only assume the issue here is passing Unicode to system().
Can someone else confirm whether system() does or does not support this?
Here is the block of code:
$fname.each do |file|
  flist.write("#{file}\n") # This is written properly in UTF-8
  system("ia.exe -r \"#{file}\" -q xbfadd") # The file being passed here is not encoded right!
end

Ruby's system() function, like that in most scripting languages, is a veneer over the C standard library system() call. The MS C runtime uses Win32 ANSI APIs for all the byte-oriented C stdlib functions.
The ANSI APIs use the Windows system locale (aka the 'ANSI codepage') to map between byte-oriented strings and Windows's native UTF-16LE strings, which are used for filenames and shell commands. Unfortunately, it is impossible to set the system locale to UTF-8; you can set the codepage to 65001 (Windows's equivalent of UTF-8) on a particular console, but the MS CRT has long-standing bugs in its handling of code page 65001 which make a lot of applications fail.
So using the standard cross-platform byte-oriented C interfaces means you can't support Unicode filenames or shell commands, which is rather sad. Some scripting languages have added support for Unicode filenames by calling the Win32 'W' (Unicode) APIs explicitly instead of the C stdlib interfaces. Ruby 1.9.x is making progress in this area, but system() has not been looked at yet.
You can fix it by calling the Win32 API yourself, for example CreateProcessW, but it's not especially pretty.
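For the adventurous, here is a rough, untested sketch of the idea from Ruby 1.9 on Windows, using the old Win32API wrapper (system_w is a made-up helper name, error handling is omitted, and the struct sizes assume the 32-bit Ruby builds of that era):
require 'Win32API'
# Run a command line via CreateProcessW so it reaches Windows as UTF-16LE.
def system_w(cmd)
  create = Win32API.new('kernel32', 'CreateProcessW', 'PPLLLLLPPP', 'L')
  wait   = Win32API.new('kernel32', 'WaitForSingleObject', 'LL', 'L')
  close  = Win32API.new('kernel32', 'CloseHandle', 'L', 'L')
  cmd_w  = (cmd + "\0").encode('UTF-16LE').force_encoding('BINARY') # NUL-terminated UTF-16LE
  si     = [68].pack('L') + "\0" * 64 # STARTUPINFOW with cb = 68, rest zeroed
  pi     = "\0" * 16                  # PROCESS_INFORMATION
  return false if create.call(nil, cmd_w, 0, 0, 0, 0, 0, nil, si, pi).zero?
  hproc, hthread = pi.unpack('LL')
  wait.call(hproc, 0xFFFFFFFF)        # INFINITE
  close.call(hthread)
  close.call(hproc)
  true
end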

I upvoted bobince's answer; I believe it is correct.
The only thing I'd add is that an additional workaround, this being a Windows problem, is to write the command line out to a batch file and then use system() to call the batch file.
I used this approach to successfully work around the problem while running Calibre's ebook-convert command-line tool on a book with UTF-8/non-English characters in its title.
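In Ruby the workaround looks roughly like this (illustrative only: run_job.bat and the chcp line are my additions, title stands for the UTF-8 name, and whether it helps still depends on the program being called):
File.open("run_job.bat", "w:UTF-8") do |bat|
  bat.puts "@echo off"
  bat.puts "chcp 65001 > nul" # switch the console code page to UTF-8 first
  bat.puts "ebook-convert \"#{title}.epub\" \"#{title}.mobi\""
end
system("cmd /c run_job.bat")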

I think that bobince's answer is correct, and the solution that worked for me was:
system("c:\\foo.exe -arg #{bar.encode("ISO-8859-1")}")

Getting the default RTL codepage in Lazarus

The Lazarus wiki states:
Lazarus (actually its LazUtils package) takes advantage of that API
and changes it to UTF-8 (CP_UTF8). It means also Windows users now use
UTF-8 strings in the RTL
In our cross-platform and cross-compiler code, we'd like to detect this specific situation. The GetACP() Windows API function still returns 1252, and so does the GetDefaultTextEncoding() function in Lazarus. But the text (specifically, the filename returned by the FindFirst() function) contains a UTF-8-encoded filename, and the codepage of the string variable is 65001 too.
So, how do we figure out that the RTL operates with UTF-8 strings by default? I've spent several hours trying to figure this out from the Lazarus source code, but I am probably missing something...
I understand that in many scenarios we need to inspect the codepage of each specific string, but I am interested in a way to find out the default RTL codepage, which is UTF-8 in Lazarus yet the Windows-defined one in FPC/Windows without Lazarus.
It turns out that there is no single code page variable or function. The results of filesystem API calls are converted to the codepage defined in the DefaultRTLFileSystemCodePage variable. The only problem is that this variable is present in the source code and is supposed to be in the system unit, but the compiler doesn't see it.
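Since the global isn't visible, one pragmatic check (a sketch, assuming FPC 3.x, where StringCodePage and CP_UTF8 live in the System unit) is to inspect a string the RTL actually hands back:
program CheckRtlCodePage;
uses SysUtils;
var
  sr: TSearchRec;
begin
  if FindFirst('*', faAnyFile, sr) = 0 then
  begin
    { StringCodePage reports the string's dynamic code page; CP_UTF8 = 65001 }
    if StringCodePage(sr.Name) = CP_UTF8 then
      WriteLn('the RTL returned a UTF-8 string');
    FindClose(sr);
  end;
end.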

Is Bash an interpreted language?

From what I've read so far, bash seems to fit the definition of an interpreted language:
it is not compiled into a lower format
every statement ends up calling a subroutine / set of subroutines already translated into machine code (e.g., echo foo calls a precompiled executable)
the interpreter itself, bash, has already been compiled
However, I could not find a reference to bash on Wikipedia's page for interpreted languages, or through extensive searches on Google. I also found a page on Programmers Stack Exchange that seems to imply that bash is not an interpreted language; if it's not, then what is it?
Bash is definitely interpreted; I don't think there's any reasonable question about that.
There might be some controversy over whether it's a language. It's designed primarily for interactive use, executing commands provided by the operating system. For a lot of that particular kind of usage, if you're just typing commands like
echo hello
or
cp foo.txt bar.txt
it's easy to think that it's "just" for executing simple commands. In that sense, it's quite different from interpreted languages like Perl and Python which, though they can be used interactively, are mainly used for writing scripts (interpreted programs).
One consequence of this emphasis is that its design is optimized for interactive use. Strings don't require quotation marks, most commands are executed immediately after they're entered, most things you do with it will invoke external programs rather than built-in features, and so forth.
But as we know, it's also possible to write scripts using bash, and bash has a lot of features, particularly flow control constructs, that are primarily for use in scripts (though they can also be used on the command line).
Another distinction between bash and many scripting languages is that a bash script is read, parsed, and executed in order. A syntax error in the middle of a bash script won't be detected until execution reaches it. A Perl or Python script, by contrast, is parsed completely before execution begins. (Things like eval can change that, but the general idea is valid.) This is a significant difference, but it doesn't mark a sharp dividing line. If anything it makes Perl and Python more similar to compiled languages.
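A tiny script makes the difference concrete (demo.sh is a hypothetical name):
echo "this line runs before bash notices the problem"
fi   # syntax error: 'fi' with no matching 'if'
echo "this line is never reached"
Running bash demo.sh prints the first line and only then reports the syntax error; a Python file with an equivalent mistake would refuse to start at all.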
Bottom line: Yes, bash is an interpreted language. Or, perhaps more precisely, bash is an interpreter for an interpreted language. (The name "bash" usually refers to the shell/interpreter rather than to the language that it interprets.) It has some significant differences from other interpreted languages that were designed from the start for scripting, but those differences aren't enough to remove it from the category of "interpreted languages".
Bash is an interpreter according to the GNU Bash Reference Manual:
Bash is the shell, or command language interpreter, for the GNU operating system.

Why does Scala use a reversed shebang (!#) instead of just setting the interpreter to scala

The Scala documentation shows that the way to create a Scala script is like this:
#!/bin/sh
exec scala "$0" "$@"
!#
/* Script here */
I know that this executes scala with the name of the script file and the arguments passed to it, and that the scala command apparently knows to read a file that starts like this and to ignore everything up to the reversed shebang !#.
My question is: is there any reason why I should use this (rather verbose) format for a scala script, rather than just:
#!/bin/env scala
/* Script here */
This, as far as I can tell from a quick test, does exactly the same thing, but is less verbose.
How old is the documentation? This sort of thing (often referred to as 'the exec hack') was usually recommended before /bin/env was common, and it was the best way to get the functionality. Note that /usr/bin/env is more common than /bin/env and ought to be used instead.
Note that it's /usr/bin/env, not /bin/env.
There are no benefits to using an intermediate shell instead of /usr/bin/env, except running in some rare antique Unix variants where env isn't in /usr/bin. Well, technically SCO still exists, but does Scala even run there?
However, the advantage of the shell variant is that it gives you an opportunity to tune what is executed, for example to add elements to PATH or CLASSPATH, or to add options such as -savecompiled to the interpreter (as shown in the manual). This may be why the documentation suggests the shell form.
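For instance, a header along these lines (illustrative: the jar path is made up, while -savecompiled and -classpath are documented scala options) caches the compiled script and extends the classpath:
#!/bin/sh
exec scala -savecompiled -classpath /opt/myapp/lib/myapp.jar "$0" "$@"
!#
println("args: " + args.mkString(", "))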
I am not on the Scala development team and I don't know what the historical motivation for the Scala documentation was.
Scala did not always support /usr/bin/env. There was no particular reason for that; I imagine the person who wrote the shell-scripting support simply wasn't familiar with that syntax back in the mid-00s. The documentation followed what was supported; I added /usr/bin/env support at some point (IIRC) but, it would seem, never bothered changing the documentation.

Unicode in Bourne Shell source code

Is it safe to use UTF-8, and not just the 7-bit ASCII subset, in modern Bourne shell interpreters, be it in comments (e.g., using box-drawing characters) or in arguments passed to a function or program? I consider whether filesystems can safely handle Unicode in path names to be outside the scope of this question.
I know at least to not put a BOM in my shell scripts… ever, as that would break the kernel's shebang line parsing.
The thing about UTF-8 is that any old code that just passes string data along, and that uses the C convention of terminating strings with a null byte, works fine. That generally characterizes how the shell handles command names and arguments.
Even if the shell does some string processing with special meanings for ASCII characters, UTF-8 still mostly works fine, because ASCII characters encode exactly the same way in UTF-8. So the shell will, for example, still be able to recognize all its keywords and syntax characters like []{}()<>/.?;'"$&*. That characterizes how string literals and other syntactic bits of a script should be handled.
You should be able to use UTF-8 in comments, string literals, command names, and command arguments. (Of course, the system will have to support UTF-8 file names for UTF-8 command names to work, and the commands will have to handle UTF-8 command-line arguments.)
You may not be able to use UTF-8 in function or variable names, since the shell may be looking for strings of ASCII characters there. That said, if your locale is UTF-8, an interpreter that uses the locale-based character classification functions internally might accept UTF-8 identifiers as well, but that is probably not portable.
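A quick sanity check along these lines (illustrative; it assumes a UTF-8 locale and a filesystem that accepts UTF-8 names):
#!/bin/sh
# ── a UTF-8 box-drawing comment is harmless ──
greeting="こんにちは"       # UTF-8 string literal
printf '%s\n' "$greeting"  # the bytes pass through unchanged
touch "résumé.txt"         # UTF-8 file name as a command argument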
It really depends on what you are trying to do... In general, plain vanilla Bourne-derived shells cannot handle Unicode characters inside the script text itself, which means the script should stick to ASCII if you care about portability. At the same time, pipes are completely encoding-neutral, so you can have things like a | b where a outputs UTF-8 and b receives it. So, assuming find is capable of handling UTF-8 paths and your tool for processing them can work with UTF-8 strings, you should be OK.
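For example (illustrative; any POSIX find and sed will do):
find . -name '*.txt' -print | sed 's|^\./||'
Neither side of the pipe interprets the path bytes beyond the ASCII dot and slash, so UTF-8 names come out intact.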
Multi-byte support was added to the Bourne Shell in 1989, and given that Unicode was introduced in 1992, you cannot expect UTF-8 from a shell that is older than Unicode. SunOS introduced Unicode support once it became available.
So any Bourne Shell that was derived from the SVr4 Bourne Shell and compiled and linked to a modern library environment should support UTF-8 in scripts.
If you'd like to verify that, you can get a portable version of the OpenSolaris Bourne Shell from the schily-tools: http://sourceforge.net/projects/schilytools/files/
osh is the original Bourne Shell made portable only.
sh is the Bourne Shell with modern enhancements.

Compilers for shell scripts

Do you know if there's any tool for compiling bash scripts?
It doesn't matter if that tool is just a translator (for example, something that converts a bash script to a C program), as long as the translated result can be compiled.
I'm looking for something like shc (it's just an example -- I know that shc doesn't work as a compiler). Are there any other similar tools?
A Google search brings up CCsh, but it will set you back $50 per machine for a license.
The documentation says that CCsh compiles Bourne Shell (not bash...) scripts to C code and that it understands how to replicate the functionality of 50-odd standard commands, avoiding the need to fork them.
But CCsh is not open source, so if it doesn't do what you need (or expect) you won't be able to look at the source code to figure out why.
I don't think you're going to find anything, because you can't really "compile" a shell script. You could write a simple script that converts all lines to calls to system(3) and then "compile" that as a C program, but this wouldn't give a major performance boost over anything you're currently using, and it might not handle variables correctly. Don't do this.
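To see why that buys you nothing, here is roughly what such a naive translation of a two-command script would look like (illustrative only):
#include <stdlib.h>
int main(void) {
    /* each original script line becomes a system(3) call, and system(3)
       itself runs /bin/sh -c ..., so the shell is still being forked */
    system("echo hello");
    system("cp foo.txt bar.txt");
    return 0;
}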
The problem with "compiling" a shell script is that shell scripts just call external programs.
In theory you could actually get a good performance boost.
Think of all the
if [ x"$MYVAR" == x"TheResult" ]; then echo "TheResult Happened" fi
(note invocation of test, then echo, as well as the interpreting needed to be done.)
which could be replaced by
if (!strcmp(myvar, "TheResult")) printf("TheResult Happened");
In C: no process launching, no path searching. Lots of goodness.
