Incorporating bash scripts into an R package? - bash

Background
I am writing an R package to support reproducible research. At this point, the workflow is mostly held together by bash scripts, and I can run an analysis by sending a single command like ./runscript.sh. I use bash for the following:
file manipulation: tar, rsync, rename
running bash files locally and via ssh
running R scripts using R --vanilla that in turn call R functions
find and replace text within files using sed
submitting jobs via qsub
It seems to me that it would be much more efficient (cleaner and easier) to execute the entire workflow from an R function or R script. I am partial to R since I am more familiar with it and mostly work within emacs ESS.
Questions
Would it be worthwhile to encapsulate all of these uses of bash within R using the system and files functions?
Are there other R packages that I have not yet found that would be helpful for doing this?
Notes
Following Al3xa's answer, I realize that it is important to note that the speed penalty of using, e.g., R vs. bash versions of tar and gsub on 1000-2000 files would likely be smaller than the current rate-limiting steps in the workflow: computations by JAGS (~10-20 min) and FORTRAN (>4 hrs).

I'm a big fan of using R as your "integrated" environment vs. bash scripts. I'm in the process of moving all of my bash and ruby scripts to Rscript as I need to make changes to them.
There are only a couple of reasons not to move everything into R that come to mind. I'm referring mainly to using Rscript to accomplish this
1) Speed, which from my testing is a moderate impact in any situation I've come across, and would be trivial relative to the times you mentioned.
2) Portability, in that paths to Rscript, etc. may be different across systems. I've had no problems writing things on OS X and moving them to a Linux server, but they might break on Windows.
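One way to soften the path problem, as a minimal sketch: look the interpreter up on PATH instead of hard-coding its location. The function name find_tool is hypothetical; command -v is the standard shell lookup.

```bash
#!/usr/bin/env bash
# find_tool NAME...: print the path of the first NAME found on PATH.
# Returns non-zero if none of the names can be found.
find_tool() {
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}

# Possible usage in a wrapper script (assumes Rscript is the front end you want):
#   rscript=$(find_tool Rscript) || { echo "Rscript not found" >&2; exit 1; }
#   "$rscript" analysis.R
```

This only helps on systems with a POSIX-like shell, so it does nothing for the Windows case by itself.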
The advantages in my book are:
1) Much easier for me to write. I don't have to switch back and forth between the slight idiosyncrasies with things like conditional statements and for loops.
2) More forgiving. I can't describe how much time I've spent trying to get bash scripts to work because I accidentally had a space where I shouldn't have. R is much nicer in that regard (yes, of course, we should all follow conventions in R perfectly, but I'd rather that it not stall me up for hours if I don't).
3) I do better work. For tarring a file it doesn't matter, but I find I do better text manipulation in R than in awk/sed, for example.
Re: packages that are helpful -- This doesn't exist, to my knowledge, but I'd love a version of make that's based on R. make's syntax is one of the most inflexible out there (tabs vs spaces? really?) - I'd love to write an R-based alternative. Some day, I will...

Well, there are functions like tar, gsub, etc. Anyway, I guess you want to create a cross-platform solution. You should prefer bash for the sake of speed, and use R only for R-specific functions. I don't find it useful to wrap all system-based commands within system and/or file.* calls; it would be much slower. If you're using Linux, I suggest littler over the Rscript interface.

Related

Is it possible to capture the bitstream of post-interpreted code? (pre-execution) eg. speedup calls I make often

I've wondered this many times and in many cases, and I like to learn so general or close-but-more needed answers are acceptable to me.
I'll get specific, to help explain the question. Please remember that this question is more about accelerating common interpreted language calls (yes, exactly the same arguments), than it is about the specific programs I'm calling in this case.
Here we go:
Using i3WM I use i3lock-fancy to lock my workspace with a key-combo mapped to the command:
i3lock-fancy -p -f /usr/share/fonts/fantasque_mono.ttf
So here is why I think this is possible, though my google-fu has failed me:
i3lock-fancy is a bash script, and bash is an interpreted language
each time I run the command I call it with the same arguments
Theoretically the interpreter is spitting out the same bitstream to be executed, right?
Please don't complain about portability; I understand that the captured bitstream would not be portable.
For visual people:
When I call the above command > bash interpreter converts bash-code to byte-code > CPU executes byte-code
I want to:
execute command > bash interpreter converts to byte-code > save to file
so that I can effectively skip interpretation (since it's EXACTLY the same every time):
call file > CPU executes byte-code
What I tried:
Looking around on SO before asking the question led me to shc, which is similar in some ways to what I'm asking for.
But this is not what shc is for (thanks @stefan).
is there a way to do this which is more like what I've described?
Simply put, is there a way to interpret bash, and save the result without actually running it?

jq or xsltproc alternative for s-expressions?

I have a project which contains a bunch of small programs tied together using bash scripts, as per the Unix philosophy. Their exchange format originally looked like this:
meta1a:meta1b:meta1c AST1
meta2a:meta2b:meta2c AST2
Where the :-separated fields are metadata and the ASTs are s-expressions which the scripts pass along as-is. This worked fine, as I could use cut -d ' ' to split the metadata from the ASTs, and cut -d ':' to dig into the metadata. However, I then needed to add a metadata field containing spaces, which breaks this format. Since no field uses tabs, I switched to the following:
meta1a:meta1b:meta1c:meta 1 d\tAST1
meta2a:meta2b:meta2c:meta 2 d\tAST2
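For reference, a small sketch of how the tab-separated variant can still be taken apart with cut. The helper names (records, metadata, ast, meta_field) are made up for illustration, and the sample records are the two lines shown above.

```bash
#!/usr/bin/env bash
# Emit the two sample records, with a real tab between metadata and AST.
records() {
  printf 'meta1a:meta1b:meta1c:meta 1 d\tAST1\n'
  printf 'meta2a:meta2b:meta2c:meta 2 d\tAST2\n'
}

# cut splits on tabs by default, so no -d is needed for the outer split.
metadata() { cut -f 1; }
ast()      { cut -f 2-; }

# meta_field N: pick the N-th colon-separated metadata field.
meta_field() { cut -d ':' -f "$1"; }

records | metadata | meta_field 4
# prints:
#   meta 1 d
#   meta 2 d
```

It works, but every new kind of field needs another delimiter that must never occur in the data.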
Since I envision more metadata fields being added in the future, I think it's time to switch to a more structured format rather than playing a game of "guess the punctuation".
Instead of delimiters and cut I could use JSON and jq, or I could use XML and xsltproc, but since I'm already using s-expressions for the ASTs, I'm wondering if there's a nice way to use them here instead?
For example, something which looks like this:
(echo '(("foo1" "bar1" "baz1" "quux 1") ast1)'
echo '(("foo2" "bar2" "baz2" "quux 2") ast2)') | sexpr 'caar'
"foo1"
"foo2"
My requirements are:
Straightforward use of stdio with minimal boilerplate, since that's where my programs read/write their data
Easily callable from shell scripts or provide a very compelling alternative to bash's process invocation and pipelining
Streaming I/O if possible; ie. I'd rather work with one AST at a time rather than consuming the whole input looking for a closing )
Fast and lightweight, especially if it's being invoked a few times; each AST is only a few KB, but they can add up to hundreds of MB
Should work on Linux at least; cross-platform would be nice
The obvious choice is to use a Lisp/Scheme interpreter, but the only one I'm experienced with is Emacs, which is far too heavyweight. Perhaps another implementation is more lightweight and suited to this?
In Haskell I've played with shelly, turtle and atto-lisp, but most of my code was spent converting between String/Text/ByteString, wrapping/unwrapping Lisps, implementing my own car, cdr, cons, etc.
I've read a little about scsh, but don't know if that would be appropriate either.
You might give Common Lisp a try.
Straightforward use of stdio with minimal boilerplate, since that's
where my programs read/write their data
(loop for (attributes ast) = (safe-read) do (print ...))
Read/write from standard input and output.
safe-read should disable execution of code at read-time. There is at least one implementation. Don't eval your AST directly unless you perfectly know what's in there.
Easily callable from shell scripts or provide a very compelling
alternative to bash's process invocation and pipelining
In the same spirit as java -jar ..., you can launch your Common Lisp executable, e.g. sbcl, with a script in argument: sbcl --load file.lisp. You can even dump a core or an executable core of your application with everything preloaded (save-lisp-and-die).
Or, use cl-launch which does the above automatically, and portably, and generates shell scripts and/or makes executable programs from your code.
Streaming I/O if possible; ie. I'd rather work with one AST at a time
rather than consuming the whole input looking for a closing )
If the whole input stream starts with a (, then read will read up to the closing ) character, but in practice this is rarely done: source code in Common Lisp is not enclosed in one pair of parentheses per file, but written as a sequence of forms. If your stream produces not one but many s-exps, the reader will read them one at a time.
Fast and lightweight, especially if it's being invoked a few times;
each AST is only a few KB, but they can add up to hundreds of MB
Fast it will be, especially if you save a core. Lightweight, well, it is well-known that Lisp images can take some disk space (e.g. 46MB), but this is rarely an issue. Why is it important? Maybe you have another definition of what lightweight means, because this is unrelated to the size of the ASTs you will be parsing. There should be no problem reading those ASTs, though.
Should work on Linux at least; cross-platform would be nice
See Wikipedia. For example, Clozure CL (CCL) runs on Mac OS X, FreeBSD, Linux, Solaris and Windows, 32/64 bits.
Working on a slightly different task, I again found the need to process a bunch of s-expressions. This time I needed to perform some non-trivial processing of the given s-expressions (extracting lists of symbols used, etc.), rather than having the option to pass them along as opaque strings.
I gave Racket a try and was pleasantly surprised; it was much nicer than the other Lisps I've used before (Emacs Lisp and various application-specific Scheme scripts), since it has nice documentation and a batteries included standard library.
Some of the relevant points for this kind of task:
"Ports" for reading and writing data. These can be (dynamically?) scoped across an expression, and default to stdio (i.e. (current-input-port) defaults to stdin and (current-output-port) defaults to stdout). Ports make stdio and file access about as nice to use as a shell: more verbose, but fewer gnarly edge-cases.
Various conversion functions like port->string, file->lines, read, etc. make it easy to get data at the appropriate form of granularity (characters, lines, strings, expressions, etc.).
I couldn't find a "standard" way to read multiple s-expressions, since read only returns one, so iteration/recursion would be needed to do this in a streaming fashion.
If streaming isn't needed, I found it easiest to read the whole input as a string, append "(\n" and "\n)", then use (with-input-from-string my-modified-input read) to get one big list.
I found Racket's startup time to be pretty slow, so I wouldn't recommend invoking a script over and over as part of a loop if speed is a concern. It was easy enough to move my looping into Racket and have the script invoked once though.

BASH shell process control - any other examples of controlling/scheduling work

I've inherited a medium sized project in which the main (batch) program is fed work through a large set of shell scripts that do a lot of process control (waiting for process to complete, sleeping, checking for conditions, etc) [ and reprocessed through perl scripts ]
Are there other examples of process control by shell scripts? I would like to see what other people have done as a comparison (as I'm not really fond of the 6,668-line shell script).
The comparison may show that the current program works and doesn't need to be messed with, or that it is too cumbersome and doing it another way would be easier to maintain, but I need other examples to judge.
To reduce the "generality" of the question here's an example of what I'm looking for: procsup
The Inquisitor project relies on process control from shell scripts extensively. You might want to see its directory with the main function set or the directory with tests (i.e. slave processes) that it runs.
This is quite a general question, so giving specific answers may be a little difficult. (And you won't be happy with a 5000-line example.) Most probably the architecture of your application is faulty and requires a rather complete rework.
As you probably already know, process control with bash is pretty simple:
./test_script.sh &
test_script_pid=$!
wait $test_script_pid # waits until it's done
./test_script2.sh
echo $? # Prints return code of previous command
You can do the same things with, for example, Python's subprocess module (or with Perl, obviously). If you have a complex architecture with a large number of different programs, then the process control is obviously non-trivial.
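Scaling the same idea up to several jobs could look roughly like this. It is only a sketch: run_all is a hypothetical helper, and each argument is assumed to be a command string that is safe to hand to bash -c.

```bash
#!/usr/bin/env bash
# run_all CMD...: start every command in the background, wait for each one,
# print the number of failed commands, and succeed only if none failed.
run_all() {
  local pids=() failures=0 cmd pid
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")            # remember the PID of the job just started
  done
  for pid in "${pids[@]}"; do
    # wait returns the exit status of that particular child
    wait "$pid" || failures=$((failures + 1))
  done
  echo "$failures"
  [ "$failures" -eq 0 ]
}

# Example: run_all './test_script.sh' './test_script2.sh'
```

Collecting PIDs and waiting on each one individually is what lets you tell which job failed; a bare wait with no arguments only blocks until everything is done.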
That is an awfully big shell script. Have you considered refactoring it?
From the sound of it, there may be a lot of instances where you could replace several lines of code with a call to a shell function. If you can simplify the code in this way, then it will be easier to see where there are errors in the logic.
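As an illustration of the kind of function meant here (a sketch, not taken from the script in question): a repeated "sleep, check a condition, give up after a while" loop can be pulled into one helper. The name wait_until and the one-second polling interval are assumptions.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds.
# Returns 0 on success, 1 if TIMEOUT_SECONDS pass without success.
wait_until() {
  local timeout=$1 waited=0
  shift
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example: wait up to 60 seconds for a (hypothetical) marker file to appear.
#   wait_until 60 test -e /tmp/job.done || echo "job did not finish" >&2
```

Every call site then shrinks to one line, and a bug in the polling logic only has to be fixed once.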
I've used this tactic successfully with a humongous PERL script and it turned out to have some serious logic errors and to be a security risk because it had embedded passwords that were obfuscated in an easily reversible way. The passwords that were exposed could have been used by persons unknown (well, a disgruntled employee) to shut down an entire global network.
Some managers were leaning towards making a security exception because this script was so important, but when the logic error was explained and it was clear that this script was providing incorrect data, it was decided that no data was better than dirty data. The guy who wrote that script taught himself programming with a PERL book and the writing of the script.

Real life SHELL SCRIPTS usage?

I'm learning UNIX/LINUX shell scripting and trying to think about its appropriate usage.
The only thing that comes to mind is that it would be nice for, let's say, backup operations and log management. But I'm sure it goes way beyond that... or does it?
I'm sure there are people on this site who use shell scripting on a daily basis.
Can you tell me what you use it for in your organization/business?
Thanks:)
Why use shell scripts
Basically, there are any number of tasks related to backup, maintenance, etc. that need to be automated, and shell scripts do that.
You can do almost everything in shell, but it is easy to write ugly and slow scripts.
First domain of expertise of shells is to start and combine other programs. This is exceptionally well suited for:
file manipulations: list, move, copy, compress, archive
text lines manipulation: filter (grep), modify (sed), delete lines (sed), combine files (paste), sort (sort), unify (sort -u)
All those operations are NOT shell operations, but the shell is the glue that puts them all together.
file operations are generally combined with flow control instructions (while, if, for)
line operations are combined with pipes (|) and named pipes (mkfifo)
Things you can do in less than 20 lines with shell commands.
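A tiny example of that glue, assuming a made-up log format of "TIME LEVEL message": grep filters, sed modifies, and sort -u unifies, with the shell only connecting them.

```bash
#!/usr/bin/env bash
# unique_errors: read log lines on stdin, keep ERROR lines,
# strip the leading timestamp field, and print each distinct message once.
unique_errors() {
  grep 'ERROR' | sed 's/^[^ ]* //' | sort -u
}

printf '%s\n' \
  '10:00 ERROR disk full' \
  '10:01 INFO ok' \
  '10:02 ERROR disk full' \
  '10:03 ERROR net down' | unique_errors
# prints:
#   ERROR disk full
#   ERROR net down
```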
I personally use it to batch miscellaneous daily/weekly commands and start up long running processes. They can be unwieldy and hard to debug when they get big. Unknown variables evaluate to empty strings (icky).
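The empty-string behaviour can be switched off with set -u. A minimal sketch, using a deliberately misspelled (hypothetical) variable name:

```bash
#!/usr/bin/env bash
unset nmae   # make sure the misspelled name really is unset

# Default behaviour: the unset variable silently expands to nothing.
greet_loose() { echo "hello ${nmae}"; }

# With set -u (here confined to a subshell by the parenthesised body),
# expanding an unset variable is an error and the subshell exits non-zero.
greet_strict() ( set -u; echo "hello ${nmae}" )

greet_loose                                   # prints "hello " with nothing after it
greet_strict 2>/dev/null || echo "caught the typo"
```

Many people put set -u near the top of every script and use ${var:-default} where an unset value really is acceptable.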
Scripting languages such as Python, Perl, and Ruby become more attractive as the code becomes more complex.
I work on an actively developed software project that runs in a unix environment. Unfortunately it uses a lot of different environment variables for configuration and stashes binary programs, data files, and shared libraries on version dependent paths.
All that is a pain to set up.
But it gets worse: at any given time I might want to work with the stable version, the pretty-stable-but-more-up-to-date version, the bleeding-edge-every-new-feature version, or my personally hacked development version.
Switching between them is an even bigger hassle.
Enter a shell script which ensures that I am set up for exactly one version at a time. Ta da!
BTW--The script I use for this makes extensive use of the accepted answer to How do I manipulate $PATH elements in shell scripts?, so you know Stack Overflow works for me in the real world. Moreover, I've infected several other people with this technology.
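This is not the script described above, just a rough sketch of the idea. The directory layout /opt/proj/VERSION/bin and the version names are invented for the example.

```bash
#!/usr/bin/env bash
# path_without DIR PATHSTRING: print PATHSTRING with every element
# equal to DIR removed.
path_without() {
  local dir=$1 result="" element elements
  IFS=':' read -r -a elements <<< "$2"
  for element in "${elements[@]}"; do
    if [ "$element" != "$dir" ]; then
      result="${result:+$result:}$element"
    fi
  done
  printf '%s\n' "$result"
}

# use_version NAME: drop every known version's bin directory from PATH,
# then put the requested one in front, so exactly one version is active.
use_version() {
  local old
  for old in stable testing dev; do        # hypothetical version names
    PATH=$(path_without "/opt/proj/$old/bin" "$PATH")
  done
  PATH="/opt/proj/$1/bin:$PATH"
}
```

Because it changes PATH in the current shell, a file like this has to be sourced (for example from .bashrc) rather than executed; the same pattern extends to other variables such as LD_LIBRARY_PATH.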
I've seen and worked-on full-blown applications (medical records and scheduling processing) written in Korn shell.
Batch programming, PostScript print filters, automatic mailers and automated airline checkin systems, regular stock price tracking, software installers, et al, et al.
Better question = what could not be programmed in Shell?
For our company, we use shell scripts for the following:
backups - it would be disastrous for us if we lost our data. Various parts of our backup, like database backups, offsite backups, continuous backups, etc., all use shell scripts; most run daily and some run once a week.
update dates - we do not use ntp so we rely on sh scripts to update the date due to firewall restrictions.
log cleanup
send emails
I didn't think bash programming was particularly powerful until I saw that the OS startup scripts are all written in it. That made me re-examine my assumptions. I now have several dozen important shell scripts that I've written over the years that automate some common tasks.
For example, I wrote one that polls the current load average, and then executes a provided command if it exceeds a certain value (useful for examining events that only happen once or twice a day).
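A load-triggered runner along those lines might be sketched like this. It is not the script described above: the function names are hypothetical, and reading /proc/loadavg is Linux-specific.

```bash
#!/usr/bin/env bash
# exceeds VALUE LIMIT: succeed if VALUE > LIMIT, compared as numbers
# (awk does the floating-point comparison the shell cannot do itself).
exceeds() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}

# current_load: print the 1-minute load average (Linux only).
current_load() {
  cut -d ' ' -f 1 /proc/loadavg
}

# run_if_loaded LIMIT COMMAND [ARGS...]: run COMMAND only when the
# current load is above LIMIT.
run_if_loaded() {
  local limit=$1
  shift
  if exceeds "$(current_load)" "$limit"; then
    "$@"
  fi
}

# Example, e.g. called every minute from cron:
#   run_if_loaded 8 sh -c 'ps aux > /tmp/high-load-snapshot.txt'
```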
Another that I wrote iterates through all the mysql databases on the server and outputs a mysqldump for each one into its own appropriately-named .sql file.
Another iterates through a list of homedirs and changes the ownership of all the files under the corresponding public_html dir to match the user who should own them to be compliant with suPHP's restrictions.
Another examines the current hardware configuration and downloads, installs, and configures appropriate software for monitoring the health of the currently-attached RAID controller.
These are all relatively simple tasks that could be done by hand -- but whenever I find myself doing the same task more than once, I write a shell script to automate the process.
I also built a base-64 decoder in bash just to see if I could. It works, but it's terribly slow. I use shell scripting for simple tasks that primarily involve executing other programs. I often use Perl when a significant amount of string processing is required, and I use Python for the more complex scripting tasks. The more languages you know, the better you will be at choosing the right one for the job.

Pitfalls of using shell scripts to wrap a program?

Consider a program that needs its environment set. It is in Perl and I want to modify the environment (to search for libraries in a special spot).
Every time I mess with the standard way to do things in UNIX I pay a heavy price, and I pay a penalty in flexibility.
I know that by using a simple shell script I will inject an additional process into the process tree. Any process accessing its own process tree might be thrown for a little bit of a loop.
Anything recursive in a nontrivial way would need to defend against multiple expansions of the environment.
Anything resembling being in a pipe of programs (or closing and opening STDIN, STDOUT, or STDERR) is my biggest area of concern.
What am I doing to myself?
What am I doing to myself?
Getting yourself all het up over nothing?
Wrapping a program in a shell script in order to set up the environment is actually quite standard and the risk is pretty minimal unless you're trying to do something really weird.
If you're really concerned about having one more process around — and UNIX processes are very cheap, by design — then use the exec keyword, which instead of forking a new process, simply exec's a new executable in place of the current one. So, where you might have had
#!/bin/bash -
export FOO=hello   # export so the Perl process actually sees it
PATH=/my/special/path:${PATH}
perl myprog.pl
You'd just say
#!/bin/bash -
export FOO=hello   # export so the Perl process actually sees it
PATH=/my/special/path:${PATH}
exec perl myprog.pl
and the spare process goes away.
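If you want to convince yourself, here is a small sketch that prints the process ID before and after an exec; the function name is made up, and bash stands in for the wrapped program.

```bash
#!/usr/bin/env bash
# pids_around_exec: start a wrapper shell that prints its own PID and then
# exec's a second shell, which prints its PID as well.
# Because exec replaces the process instead of forking, both lines match.
pids_around_exec() {
  bash -c 'echo "$$"; exec bash -c "echo \$\$"'
}

pids_around_exec
```

The two numbers printed are the same, which is exactly what "the spare process goes away" means.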
This trick, however, is almost never worth the bother; the one counter-example is that if you can't change your default shell, it's useful to say
$ exec zsh
in place of just running the shell, because then you get the expected behavior for process control and so forth.
