KornShell or Bash Shell for ETL process? - bash

I am working on a project where I need to load data into a data warehouse using an ETL process. I have data in CSV, unstructured, and flat-file formats. I am thinking about using shell scripting to carry out the ETL process. I know a little about both the Bash shell and KornShell (ksh), but I am very new to ETL. So my question is: which is the better option for an ETL process, Bash or KornShell?
An answer from users experienced with ETL processes and shell scripting would be highly appreciated.
Thanks in advance.

Typically, my ETL processes use SQL statements to do in-database transformation, so they're really "ELT" processes. The shell simply serves as the tool to move files around, execute data loads and extracts, and execute SQL statements. If your DW is on a sufficiently powerful system, it's usually the best place to do the transformation work, unless you are set on having a system outside the EDW that does data transformations.
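A minimal sketch of that shell-as-glue pattern, assuming a PostgreSQL warehouse and the psql client (all host, path, and table names here are hypothetical):

#!/bin/bash
set -e                                                # stop at the first failure
rsync -av source_host:/exports/daily.csv /staging/    # move the extract into place
psql -d warehouse -c "\copy staging.daily FROM '/staging/daily.csv' CSV HEADER"   # execute the load
psql -d warehouse -f transform_in_db.sql              # run the in-database SQL transformations

The shell does nothing clever here; all the real transformation logic lives in the SQL script.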
The choice of shell for the ELT process I've described is really one of maintenance. Who will support this when you're gone? Does the company have lots of folks who know bash, but only one who knows ksh? Or is it 99% a .NET shop? Then I'd suggest writing your ETL in little C# console apps. The choice of language you use to execute your ETL, when you're not using a real "ETL" tool, should be driven by these factors, not by which language is 'best'.

Korn is slightly more portable. Bash is a lot more powerful. "Bourne shell" is a good least common denominator.
All things being equal, I'd recommend "bash". Especially if your platform is Linux.
IMHO ..
PS:
The name "bash" stands for "Bourne Again Shell", a pun on its heritage from the original "Bourne" shell. Bourne scripts are bash-compatible, but not vice versa.

Related

BASH shell process control - any other examples of controlling/scheduling work

I've inherited a medium-sized project in which the main (batch) program is fed work through a large set of shell scripts that do a lot of process control (waiting for processes to complete, sleeping, checking for conditions, etc.) [and reprocessed through Perl scripts].
Are there other examples of process control by shell scripts? I would like to see what other people have done as a comparison (as I'm not really fond of the 6,668-line shell script).
It may turn out that the current program works and shouldn't be messed with, or that, for maintenance reasons, it's too cumbersome and doing it another way will be easier to maintain; either way, I need other examples.
To reduce the "generality" of the question, here's an example of what I'm looking for: procsup
The Inquisitor project relies extensively on process control from shell scripts. You might want to see its directory with the main function set, or its directory with tests (i.e. slave processes) that it runs.
This is quite a general question, and therefore giving specific answers may be a little difficult. (And you won't be happy with a 5,000-line example.) Most probably the architecture of your application is faulty and requires a rather complete rework.
As you probably already know, process control with bash is pretty simple:
./test_script.sh &
test_script_pid=$!
wait $test_script_pid # waits until it's done
./test_script2.sh
echo $? # Prints return code of previous command
You can do the same things with, for example, Python's subprocess module (or with Perl, obviously). If you have a complex architecture with a large number of different programs, then the process is obviously non-trivial.
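To extend the sketch above: running several slave scripts in parallel and collecting each exit status is only a few more lines (the script names are made up):

./slave1.sh & pid1=$!        # start both slaves in the background
./slave2.sh & pid2=$!
wait "$pid1"; status1=$?     # wait for each and capture its return code
wait "$pid2"; status2=$?
if [ "$status1" -ne 0 ] || [ "$status2" -ne 0 ]; then
    echo "a slave process failed" >&2
    exit 1
fi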
That is an awfully big shell script. Have you considered refactoring it?
From the sound of it, there may be a lot of instances where you could replace several lines of code with a call to a shell function. If you can simplify the code in this way, then it will be easier to see where there are errors in the logic.
I've used this tactic successfully with a humongous Perl script, and it turned out to have some serious logic errors and to be a security risk, because it had embedded passwords that were obfuscated in an easily reversible way. The passwords that were exposed could have been used by persons unknown (well, a disgruntled employee) to shut down an entire global network.
Some managers were leaning towards making a security exception because the script was so important, but when the logic error was explained and it became clear that the script was producing incorrect data, it was decided that no data was better than dirty data. The guy who wrote that script had taught himself programming with a Perl book and the writing of the script itself.

Incorporating bash scripts into an R package?

Background
I am writing an R package to support reproducible research. At this point, the workflow is mostly held together by bash scripts, and I can run an analysis by sending a single command like ./runscript.sh. I use bash for the following (a sketch of such a script follows the list):
file manipulation: tar, rsync, rename
running bash files locally and via ssh
running R scripts using R --vanilla that in turn call R functions
find and replace text within files using sed
submitting jobs via qsub
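A minimal sketch of what such a runscript.sh might look like, with every file and host name hypothetical (and sed -i assuming GNU sed):

#!/bin/bash
set -e
tar -xzf data.tar.gz                           # unpack the input data
sed -i 's/OLD_LABEL/NEW_LABEL/g' config.txt    # find and replace text within a file
R --vanilla < analysis.R                       # run the R driver script
qsub batch_job.sh                              # submit the long-running job to the queue
rsync -av results/ user@host:results/          # ship results to the remote machine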
It seems to me that it would be much more efficient (cleaner and easier) to execute the entire workflow from an R function or R script. I am partial to R since I am more familiar with it and mostly work within Emacs ESS.
Questions
Would it be worthwhile to encapsulate all of these uses of bash within R, using the system and file.* functions?
Are there other R packages that I have not yet found that would be helpful for doing this?
Notes
Following Al3xa's answer, I realize it is important to note that the speed penalty of using, e.g., the R versions of tar and gsub rather than the bash versions on 1000-2000 files would likely be smaller than the current rate-limiting steps in the workflow: computations by JAGS (~10-20 min) and FORTRAN (>4 hrs).
I'm a big fan of using R as your "integrated" environment vs. bash scripts. I'm in the process of moving all of my bash and ruby scripts to Rscript as I need to make changes to them.
There are only a couple of reasons not to move everything into R that come to mind (I'm referring mainly to using Rscript to accomplish this):
1) Speed: from my testing the impact is moderate in any situation I've come across, and it would be trivial relative to the times you mentioned.
2) Portability: paths to Rscript, etc., may be different across systems. I've had no problems writing things on OS X and moving them to a Linux server, but things might break on Windows.
The advantages in my book are:
1) It's much easier for me to write. I don't have to switch back and forth between the slight idiosyncrasies of things like conditional statements and for loops.
2) It's more forgiving. I can't describe how much time I've spent trying to get bash scripts to work because I accidentally had a space where I shouldn't have. R is much nicer in that regard (yes, of course, we should all follow conventions in R perfectly, but I'd rather it not stall me for hours if I don't).
3) I do better work. For tarring a file it doesn't matter, but I find I do better text manipulation in R than in awk/sed, for example.
Re: packages that are helpful -- this doesn't exist, to my knowledge, but I'd love a version of make that's based on R. make's syntax is one of the most inflexible out there (tabs vs. spaces? really?), and I'd love to write an R-based alternative. Some day, I will...
Well, there are R functions like tar, gsub, etc. Anyway, I guess you're willing to create a cross-platform solution. You should prefer bash for the sake of speed, and use R only for R-specific functions. I don't find it useful to wrap all system-based commands within system and/or file.*; it would be much slower. If you're using Linux, I suggest littler over the Rscript interface.

Real life SHELL SCRIPTS usage?

I'm learning UNIX/Linux shell scripting and trying to think about its appropriate usage.
The only thing that comes to mind is that it'd be nice for, let's say, backup operations and log management... but I'm sure it goes way beyond that. Or does it?
I'm sure there are people here who use shell scripting on a daily basis.
Can you tell me what you use it for in your organization/business?
Thanks:)
Why use shell scripts
Basically, there are any number of tasks related to backup, maintenance, etc. that need to be automated, and shell scripts do that.
You can do almost anything in shell, but it is easy to write ugly and slow scripts.
The first domain of expertise of shells is starting and combining other programs. This is exceptionally well suited for:
file manipulations: list, move, copy, compress, archive
text line manipulation: filter (grep), modify (sed), delete lines (sed), combine files (paste), sort (sort), deduplicate (sort -u)
All of those operations are NOT shell operations, but the shell is the glue that puts them all together.
File operations are generally combined with flow-control instructions (while, if, for);
line operations are combined with pipes (|) and named pipes (mkfifo).
These are the kinds of things you can do in fewer than 20 lines of shell commands.
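For instance, a sketch in that spirit: a loop plus a pipeline that reports the ten most common error messages across a set of log files (the paths and message format here are invented):

for f in /var/log/app/*.log; do
    grep 'ERROR' "$f"                 # filter the interesting lines from each file
done | sed 's/^.*ERROR: //' | sort | uniq -c | sort -rn | head -10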
I personally use it to batch miscellaneous daily/weekly commands and to start up long-running processes. Shell scripts can be unwieldy and hard to debug when they get big, and unknown variables evaluate to empty strings (icky).
Scripting languages such as Python, Perl, and Ruby become more attractive as the code becomes more complex.
I work on an actively developed software project that runs in a unix environment. Unfortunately it uses a lot of different environment variables for configuration and stashes binary programs, data files, and shared libraries on version dependent paths.
All that is a pain to set up.
But it gets worse: at any given time I might want to work with the stable version, the pretty-stable-but-more-up-to-date version, the bleeding-edge-every-new-feature version, or my personally hacked development version.
Switching between them is an even bigger hassle.
Enter a shell script which ensures that I am set up for exactly one version at a time. Ta-da!
BTW, the script I use for this makes extensive use of the accepted answer to "How do I manipulate $PATH elements in shell scripts?", so you know Stack Overflow works for me in the real world. Moreover, I've infected several other people with this technology.
I've seen and worked-on full-blown applications (medical records and scheduling processing) written in Korn shell.
Batch programming, PostScript print filters, automatic mailers and automated airline checkin systems, regular stock price tracking, software installers, et al, et al.
Better question = what could not be programmed in Shell?
For our company, we use shell scripts for the following:
backups - it would be very disastrous for us if we lost our data. The various parts of our backup, like database backups, offsite backups, continuous backups, etc., all use shell scripts that run daily, and some run once a week.
date updates - we do not use NTP, so we rely on sh scripts to update the date, due to firewall restrictions.
log cleanup
send emails
I didn't think bash programming was particularly powerful until I saw that the OS startup scripts are all written in it. That made me re-examine my assumptions. I now have several dozen important shell scripts that I've written over the years that automate some common tasks.
For example, I wrote one that polls the current load average, and then executes a provided command if it exceeds a certain value (useful for examining events that only happen once or twice a day).
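A minimal sketch of such a watcher, assuming Linux's /proc/loadavg (the threshold and polling interval are arbitrary):

threshold=4
while true; do
    load=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
    # awk handles the floating-point comparison; exit status 0 means "exceeded"
    if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
        "$@"                                # run the command passed as arguments
    fi
    sleep 60
done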
Another that I wrote iterates through all the MySQL databases on the server and outputs a mysqldump for each one into its own appropriately named .sql file.
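That script might look something like this sketch, assuming credentials come from ~/.my.cnf and that a /backup directory exists:

for db in $(mysql -N -e 'SHOW DATABASES'); do
    case "$db" in
        information_schema|performance_schema) continue ;;   # skip MySQL's internal schemas
    esac
    mysqldump "$db" > "/backup/$db.sql"
done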
Another iterates through a list of home directories and changes the ownership of all the files under the corresponding public_html dir to match the user who should own them, to comply with suPHP's restrictions.
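A rough sketch of that ownership fixup, assuming home directories live under /home and are named after their users:

for home in /home/*; do
    user=$(basename "$home")
    if [ -d "$home/public_html" ]; then
        chown -R "$user" "$home/public_html"    # suPHP wants files owned by the account's user
    fi
done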
Another examines the current hardware configuration and downloads, installs, and configures appropriate software for monitoring the health of the currently-attached RAID controller.
These are all relatively simple tasks that could be done by hand -- but whenever I find myself doing the same task more than once, I write a shell script to automate the process.
I also built a base-64 decoder in bash just to see if I could. It works, but it's terribly slow. I use shell scripting for simple tasks that primarily involve executing other programs. I often use Perl when a significant amount of string processing is required, and I use Python for the more complex scripting tasks. The more languages you know, the better you will be at choosing the right one for the job.

What goodies are present in UNIX shells sans BASH?

I have been using bash for as long as I have been using Unix (Linux/Solaris). Now I am interested to know: what do shells like ksh and zsh offer that is better? What do 'geeks' use?
I'm partial to zsh (it's like a blend of ksh and bash). The guide has a nice overview of its features. This page has a nice chart showing the availability of different features in different shells.
bash: In my experience, most people use bash, partly because it is the standard shell in most Linux systems.
ksh: Many Solaris systems use ksh instead, but that seems to be losing popularity to bash.
csh: csh used to be more popular, but it was generally superseded by tcsh; tcsh is not bad for those who are very comfortable with C-like syntax.
zsh: I have not used zsh, but it seems to be very feature rich.
Personally, I prefer bash, because it is installed on almost every unix-compatible OS, it is very versatile, and is a good compromise between a simple command-line tool and a scripting language.
Bash is the best to know for the broadest compatibility; you can sit down at basically any Unix system and it will be there.
Zsh is probably one of the most modern shells. There's all kinds of crazy fun stuff you can do with it.
I use ksh93 by preference. This means that the basics of ksh are available on pretty much any system I find myself on, so my interactive experience and 98% of my complex profile stay the same.
bash is a bit slow, but like many FSF programs it tries to incorporate all known features. I use ksh93, which has largely converged with bash. The main advantage of ksh is that it has a nice interface for extending the shell with C code. It's also a little easier to customize, e.g., to make the Tab key do whatever you want, in context. The disadvantage is that command completion is not built in; you have to program it.
Avoid csh and its derivatives :-)
In my experience, there are very few goodies in standard Unix shells (where that means, to me, csh, sh, ksh) that are not also present in at least an equivalent form in bash. Consequently, as long as you are comfortable that bash will be on all your machines, you may as well use it to get the maximum functionality.
OTOH, if you want to deal with portability, you will probably use ksh, which hews pretty close to the POSIX standard - with some extensions (bash is also fairly close to the POSIX standard, but with rather more extensions).
I really like the POSIX $(cmd) notation in place of the classic backticks
`cmd`
(that was not fun in Markdown!). One major reason for liking it is that it is much, much easier to nest the calls:
gcclib=$(dirname $(dirname $(which gcc)))/lib
Getting that right on one line with backticks is silly enough that you would not attempt to make it into a one-liner. That notation is in ksh and bash; it is not in the classic Bourne shell (/bin/sh, but be aware that /bin/sh on some machines is not the classic Bourne shell but bash in disguise), nor in the C shell.
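For comparison, the same one-liner written with backticks needs escalating backslash escapes at every nesting level, something like:

gcclib=`dirname \`dirname \\\`which gcc\\\`\``/lib

which is exactly the kind of line you write once, get wrong twice, and never want to maintain.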
If you're using bash and you're happy with it, no need to change right away. It's a good shell. Knowing the history of it tells you something about it too: Bourne Again SHell. It was a good attempt to make a better shell than the C-Shell and its derivatives (like tcsh), allowing you to use /bin/sh syntax for scripting (or for interactive use), but adding some of the nicer features of csh (like history and so on).
The Korn shell and Bash have a lot in common, in concept anyway. Like /bin/sh, the Korn shell came from AT&T originally, and wasn't open sourced until relatively recently. It has a good history mechanism, and does file locking on history state files so that if they're mounted on network file servers, multiple copies of the shell will properly write to the history file without clobbering each other. It also supports /bin/sh syntax, and incorporates some of the good things about /bin/csh. There's a lot to ksh, and it's generally a pretty good shell, if you can find it. I used to use it on Solaris, especially back when I was working at Sun. I didn't want to install anything that didn't come with the OS, since I installed a new OS several times a week, so this was a good choice.
Now I use either Bash or zsh. I prefer zsh because of its rich set of features for command completion and for writing shell functions in general (for my interactive shells -- when programming scripts, I stick to pretty standard Bourne shell stuff).
As others have said, it's best to avoid any version of the C-shell, except for those shells that give you some features of /bin/csh but aren't derived from /bin/csh code.
I had one rigged up to provide per-session shell history.
The unique thing here was each window had its own shell history. Quite convenient.

Bash or KornShell (ksh)? [closed]

I am not new to *nix; however, lately I have been spending a lot of time at the prompt. My question is: what are the advantages of using KornShell (ksh) or the Bash shell? What are the pitfalls of using one over the other?
Looking to understand from the perspective of a user, rather than purely scripting.
The differences between KornShell and Bash are minimal. There are certain advantages one has over the other, but the differences are tiny:
In BASH it is much easier to set a prompt that displays the current directory; doing the same in KornShell is hackish.
Kornshell has associative arrays and BASH doesn't. Now, the last time I used Associative arrays was... Let me think... Never.
Kornshell handles loop syntax a bit better. You can usually set a value in a Kornshell loop and have it available after the loop.
Bash handles getting exit codes from pipes in a cleaner way.
Kornshell has the print command which is way better than the echo command.
Bash has tab completion, even in older versions.
Kornshell has the r history command that allows me to quickly rerun older commands.
KornShell has the syntax cd old new, which replaces old with new in your current directory path and cds over there. It's convenient when you are in a directory called /foo/bar/barfoo/one/bar/bar/foo/bar and you need to cd to /foo/bar/barfoo/two/bar/bar/foo/bar. In KornShell, you can simply do cd one two and be done with it; in BASH, you'd have to cd ../../../../../two/bar/bar/foo/bar.
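A rough sketch illustrating a few of the points in the list above (the prompt strings are just examples):

# bash: a prompt showing the current directory is one escape away
PS1='\u@\h:\w\$ '
# ksh: there is no \w escape, but PS1 is re-expanded each time it is shown,
# so embed the variable instead (single quotes defer the expansion)
PS1='${PWD} $ '
# ksh's print command, e.g. suppressing the trailing newline portably
print -n "no trailing newline"
# ksh's "cd old new": substitute one path component and go there
cd /foo/bar/barfoo/one/bar/bar/foo/bar
cd one two    # now in /foo/bar/barfoo/two/bar/bar/foo/bar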
I'm an old KornShell guy because I learned Unix in the 1990s, and that was the shell of choice back then. I can use Bash, but I get frustrated by it at times because, out of habit, I use some minor feature that KornShell has and BASH doesn't, and it doesn't work. So, whenever possible, I set KornShell as my default.
However, I am going to tell you to learn BASH. Bash is now implemented on most Unix systems as well as on Linux, and there are simply more resources available for learning BASH and getting help than for KornShell. If you need to do something exotic in BASH, you can go on Stack Overflow, post your question, and you'll get a dozen answers in a few minutes -- and some of them will even be correct!
If you have a KornShell question and post it on Stack Overflow, you'll have to wait for some old, past-his-prime hacker like me to wake up from his nap before you get an answer. And forget getting any response if they're serving pudding in the old-age home that day.
BASH is simply the shell of choice now, so if you've got to learn something, might as well go with what is popular.
Bash.
The various UNIX and Linux implementations have various different source-level implementations of ksh; some are real ksh, some are pdksh implementations, and some are just symlinks to some other shell that has a "ksh" personality. This can lead to weird differences in execution behavior.
At least with bash you can be sure that it's a single code base, and all you need to worry about is what (usually minimum) version of bash is installed. Having done a lot of scripting on pretty much every modern (and not-so-modern) UNIX, programming for bash is more reliably consistent in my experience.
I'm a KornShell veteran, so know that I speak from that perspective.
However, I have been comfortable with the Bourne shell, ksh88, and ksh93, and for the most part I know which features are supported in which. (I'll skip ksh88 here, as it's not widely distributed anymore.)
For interactive use, take whatever fits your need. Experiment. I like being able to use the same shell for interactive use and for programming.
I went from ksh88 on SVR2 to tcsh, to ksh88sun (which added significant internationalisation support), and then to ksh93. I tried bash and hated it because it flattened my history. Then I discovered shopt -s lithist and all was well. (The lithist option ensures that newlines are preserved in your command history.)
For shell programming, I'd seriously recommend ksh93 if you want a consistent programming language, good POSIX conformance, and good performance, as many common Unix commands can be available as builtin functions.
If you want portability, use at least both, and make sure you have a good test suite.
There are many subtle differences between shells. Consider for example reading from a pipe:
b=42 && echo one two three four |
read a b junk && echo $b
This will produce different results in different shells. The KornShell runs pipelines from back to front; the last element in the pipeline runs in the current process. Bash did not support this useful behaviour until v4.x, and even then it's not the default.
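For what it's worth, bash 4.2 and later can opt in to the ksh behaviour via the lastpipe option; a small sketch (note that lastpipe only takes effect when job control is off, i.e. in non-interactive shells or after set +m):

#!/bin/bash
shopt -s lastpipe             # run the last pipeline element in the current shell
b=42
echo one two three four | read a b junk
echo "$b"                     # prints "two", as ksh does, rather than "42"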
Another example illustrating consistency: the echo command itself, which was made obsolete by the split between BSD and SysV Unix, each of which introduced its own convention for not printing newlines (and other behaviour). The result of this can still be seen in many 'configure' scripts.
Ksh took a radical approach to that and introduced the print command, which actually supports both methods (the -n option from BSD, and the trailing \c special character from SysV).
However, for serious systems programming I'd recommend something other than a shell, like Python or Perl. Or take it a step further and use a platform like Puppet, which allows you to watch and correct the state of whole clusters of systems, with good auditing.
Shell programming is like swimming in uncharted waters, or worse.
Programming in any language requires familiarity with its syntax, its interfaces and behaviour. Shell programming isn't any different.
This is a bit of a Unix vs. Linux battle. Most, if not all, Linux distributions have bash installed and ksh optional. Most Unix systems, like Solaris, AIX, and HP-UX, have ksh as the default.
Personally I always use ksh, I love the vi completion and I pretty much use Solaris for everything.
For scripts, I always use ksh because it smooths over gotchas.
But I find bash more comfortable for interactive use. For me the emacs key bindings and tab completion are the main benefits. But that's mostly force of habit, not any technical issue with ksh.
I don't have experience with ksh, but I have used both bash and zsh. I prefer zsh over bash because of its support for very powerful file globbing, variable expansion modifiers, and faster tab completion.
Here's a quick intro: http://friedcpu.wordpress.com/2007/07/24/zsh-the-last-shell-youll-ever-need/
For one thing, bash has tab completion. This alone is enough to make me prefer it over ksh.
Z shell has a good combination of ksh's unique features with the nice things that bash provides, plus a lot more stuff on top of that.
#foxxtrot
Actually, the standard shell is the Bourne shell (sh). /bin/sh on Linux is actually bash, but if you're aiming for cross-platform scripts, you're better off sticking to the features of the original Bourne shell or writing them in something like Perl.
My answer would be 'pick one and learn how to use it'. They're both decent shells; bash probably has more bells and whistles, but they both have the basic features you'll want. bash is more universally available these days. If you're using Linux all the time, just stick with it.
If you're programming, trying to stick to plain sh for portability is good practice, but with bash available so widely these days, that bit of advice is probably a bit old-fashioned.
Learn how to use completion and your shell history; read the manpage occasionally and try to learn a few new things.
Available on most UNIX systems, ksh is standards-compliant, clearly designed, and well-rounded.
I think the books and help available for ksh are sufficient and clear, especially the O'Reilly book.
Bash is a mess. I keep it as the root login shell for Linux at home only.
For interactive use, I prefer zsh on Linux/UNIX. I run scripts in zsh, but I'll test most of my scripts and functions in AIX ksh, though.
Bash is the benchmark, but that's mostly because you can be reasonably sure it's installed on every *nix out there. If you're planning to distribute the scripts, use Bash.
I cannot really address the actual programming differences between the shells, unfortunately.
Bash is the standard for Linux.
My experience is that it is easier to find help for bash than for ksh or csh.
