Why use builtin commands over external programs (in bash scripts)? - performance

This question is concerned with the negative impacts of using external programs instead of built-in constructs -- specifically sed, and external programs in general.
My thinking is that in order to maximize compatibility across UNIX systems, one should use builtin commands. However, some programs are virtually standard. Consider this example:
# Both functions print an array definition for use in
# assignments, for loops, etc.
uses_external() {
    declare -p $1 \
        | sed -e "s/declare \-a [^=]*=\'\(.*\)\'\$/\1/" \
        | sed "s/\[[0-9]*\]\=//g"
}
uses_builtin() {
    local r=$( declare -p $1 )
    r=${r#declare\ -a\ *=}
    echo ${r//\[[0-9]\]=}
}
In terms of compatibility, is there much of a difference between uses_builtin() and uses_external()?
With regard to compatibility, is there a certain class of external programs that are nearly universal? Is there a resource that gives this kind of info? (For the example above, I had to read through many sources before I felt comfortable assuming that sed is a more compatible choice than awk or a second language.)
I really want to weigh the pros and cons, so feel free to point out other considerations between builtin commands and external programs (e.g. performance, robustness, support, etc.). Or is the question of "builtin vs external" generally a per-program matter?

Objectively speaking, using built-in commands is more efficient, since you don't need to fork any new processes for them. (Subjectively speaking, the overhead of such forking may be negligible.) On the other hand, a long series of built-in commands that could be subsumed by a single call to an external program may end up slower than that one call.
Use of built-ins may or may not produce more readable code. It depends on who is reading the code.
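For a rough sense of the difference in a particular case, a minimal sketch (the array name arr, the repetition count, and the redirection to /dev/null are all arbitrary choices) is to time both helpers with bash's time keyword:
arr=(one two three four five)
# Run each helper repeatedly so the per-call fork overhead becomes visible.
time for i in {1..500}; do uses_external arr; done >/dev/null
time for i in {1..500}; do uses_builtin arr; done >/dev/null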

Counterintuitively, the builtin version is slower for large data sets with your example; see: Parameter expansion slow for large data sets

Related

Is there a command to check the datatype of the input? [duplicate]

In Python I can get variable type by:
>>> i = 123
>>> type(i)
<type 'int'>
I saw on this page that there are no variable types in bash. The explanation given is:
Untyped variables are both a blessing and a curse. They permit more flexibility in scripting and make it easier to grind out lines of code (and give you enough rope to hang yourself!). However, they likewise permit subtle errors to creep in and encourage sloppy programming habits.
But I'm not sure what it means and what are the real advantages (and drawbacks).
Bash doesn't have types in the same way as Python (although I would say that Python has classes rather than types). Bash variables do have attributes, which are set (mostly) through declare, but the range of attributes is fairly small. You can inspect a variable's attributes with declare -p. For example, declare -i creates an integer:
declare -i num
num=42
declare -p num
Gives:
declare -i num="42"
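The -i attribute also makes assignments evaluate their right-hand side arithmetically (a small illustrative example; the values are arbitrary):
declare -i num
num=2+3     # arithmetic evaluation: num is now 5
num=hello   # an unset variable name evaluates to 0, so num is now 0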
But this is a poor feature compared to Python, or almost any modern language. The problem is that in something like Bash the basic type is a text string, and that's fine if all you need is text strings for things like filenames. But once you start needing to do heavy processing you need other types. Bash doesn't support floating point, for example. You also need compound types, like a class describing a file with all the attributes that a file can have.
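For example, bash arithmetic is integer-only, so floating-point work is typically handed off to an external tool such as bc or awk (a minimal sketch):
echo $(( 10 / 4 ))             # prints 2: integer division, the fraction is truncated
echo "scale=2; 10 / 4" | bc    # prints 2.50
awk 'BEGIN { print 10 / 4 }'   # prints 2.5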
Bash 4 does have associative arrays (declare -A), similar to Python dictionaries, which extends functionality considerably.
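A quick sketch (the variable name and keys are arbitrary):
declare -A capital
capital[France]=Paris
capital[Japan]=Tokyo
echo "${capital[Japan]}"   # Tokyo
echo "${!capital[@]}"      # the keys, in no guaranteed order: France Japan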
Even so, most would agree that Object Orientation is pretty much impossible in Bash, although some would argue that it can be done in Korn shell (which has much more powerful features). http://en.wikipedia.org/wiki/Object-oriented_programming
What bash has is fine for what it is meant for - simple processing that is quick and easy to get working. But there is a critical mass beyond which using such a language becomes unwieldy, error-prone, and slow. That critical mass can be one of scale (i.e. a large amount of data) or of complexity.
There is no simple cut-off point where you should stop using Bash and switch to Python. It's just that as programs get larger and more complex, the case for using Python gets stronger.
I should add that shell scripts rarely get smaller and less complex over time!

Why avoid subshells?

I've seen a lot of answers and comments on Stack Overflow
that mention doing something to avoid a subshell. In some
cases, a functional reason for this is given
(most often, the potential need to read a variable
outside the subshell that was assigned inside it), but in
other cases, the avoidance seems to be viewed as an end
in itself. For example:
- "union of two columns of a tsv file", suggesting { ... ; } | ... rather than ( ... ) | ..., so there's a subshell either way.
- "unhide hidden files in unix with sed and mv commands"
- "Linux bash script to copy files", explicitly stating "the goal is just to avoid a subshell".
Why is this? Is it for style/elegance/beauty? For
performance (avoiding a fork)? For preventing likely
bugs? Something else?
There are a few things going on.
First, forking a subshell might be unnoticeable when it happens only once, but if you do it in a loop, it adds up to a measurable performance impact. The performance impact is also greater on platforms such as Windows, where forking is not as cheap as it is on modern Unix-likes.
Second, forking a subshell means you have more than one context, and information is lost in switching between them -- if you change your code to set a variable in a subshell, that variable is lost when the subshell exits. Thus, the more your code has subshells in it, the more careful you have to be when modifying it later to be sure that any state changes you make will actually persist.
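A minimal illustration of that kind of lost state (the variable name count and the sample data are arbitrary): a pipeline runs its while loop in a subshell, so increments made there vanish:
count=0
printf '%s\n' a b c | while read -r line; do (( count++ )); done
echo "$count"   # prints 0: the loop ran in a subshell created by the pipe
count=0
while read -r line; do (( count++ )); done < <(printf '%s\n' a b c)
echo "$count"   # prints 3: process substitution keeps the loop in the current shell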
See BashFAQ #24 for some examples of surprising behavior caused by subshells.
Sometimes examples are helpful.
f='fred';y=0;time for ((i=0;i<1000;i++));do if [[ -n "$( grep 're' <<< $f )" ]];then ((y++));fi;done;echo $y
real 0m3.878s
user 0m0.794s
sys 0m2.346s
1000
f='fred';y=0;time for ((i=0;i<1000;i++));do if [[ -z "${f/*re*/}" ]];then ((y++));fi;done;echo $y
real 0m0.041s
user 0m0.027s
sys 0m0.001s
1000
f='fred';y=0;time for ((i=0;i<1000;i++));do if grep -q 're' <<< $f ;then ((y++));fi;done >/dev/null;echo $y
real 0m2.709s
user 0m0.661s
sys 0m1.731s
1000
As you can see, in this case, the difference between using grep in a subshell and parameter expansion to do the same basic test is close to 100x in overall time.
Following the question further, and taking into account the comments below (which, to my mind, fail to demonstrate what they set out to demonstrate), I checked the following code:
https://unix.stackexchange.com/questions/284268/what-is-the-overhead-of-using-subshells
time for((i=0;i<10000;i++)); do echo "$(echo hello)"; done >/dev/null
real 0m12.375s
user 0m1.048s
sys 0m2.822s
time for((i=0;i<10000;i++)); do echo hello; done >/dev/null
real 0m0.174s
user 0m0.165s
sys 0m0.004s
This is actually far, far worse than I expected: in fact almost two orders of magnitude slower in overall time, and almost THREE orders of magnitude slower in sys call time, which is absolutely incredible.
https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html
The point of demonstrating this is that it's easy to fall into the habit of testing things via a subshell running grep, sed, or gawk (or even a bash builtin like echo) -- a habit I tend to fall into myself when hacking fast. It's worth realizing that this carries a significant performance hit, and it's probably worth the time to avoid those constructs when bash builtins can handle the job natively.
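As a concrete sketch of that kind of substitution (the path and patterns here are only illustrative), parameter expansion and [[ ]] can often replace a forked basename or grep:
path=/usr/local/bin/bash
file=$(basename "$path")   # forks a subshell plus an external program
file=${path##*/}           # same result, no fork
if echo "$file" | grep -q sh; then echo match; fi   # two more forks per test
if [[ $file == *sh* ]]; then echo match; fi         # pure builtin, no fork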
By carefully reviewing a large program's use of subshells and replacing them with other methods where possible, I was able to cut about 10% of the overall execution time in a just-completed set of optimizations. (This is not the first or the last time I have done this; the program has already been optimized several times, so gaining another 10% is actually quite significant.)
So it's worth being aware of.
Because I was curious, I wanted to confirm what 'time' is telling us here:
https://en.wikipedia.org/wiki/Time_(Unix)
The total CPU time is the combination of the amount of time the CPU or
CPUs spent performing some action for a program and the amount of time
they spent performing system calls for the kernel on the program's
behalf. When a program loops through an array, it is accumulating user
CPU time. Conversely, when a program executes a system call such as
exec or fork, it is accumulating system CPU time.
As you can see, particularly in the echo loop test, the cost of the forks is very high in terms of system calls to the kernel; those forks really add up (700x more time spent on sys calls!).
I'm in an ongoing process of resolving some of these issues, so these questions are quite relevant to me and to the global community of users of the program in question. That is, this is not an arcane academic point for me; it's real-world, with real impacts.
Well, here's my interpretation of why this is important: it's answer #2!
No performance gain is too small to matter, even when it's only about avoiding one subshell. Call me Mr Obvious, but the concept behind that thinking is the same one behind avoiding useless uses of <insert tool here>, like cat|grep, sort|uniq, or even cat|sort|uniq, etc.
That concept is the Unix philosophy, which ESR summed up well by a reference to KISS: Keep It Simple, Stupid!
What I mean is that when you write a script, you never know how it may end up being used, so every little byte or cycle you can spare matters: if your script ends up eating billions of lines of input, it will be better off by that many forks/bytes/cycles.
I think the general idea is it makes sense to avoid creating an extra shell process unless otherwise required.
However, there are too many situations where either can be used, and where one makes more sense than the other, to say that one way is better overall. It seems to me to be purely situational.

Create Reduced Ordered Binary Decision Diagram (ROBDD) from truth table

Is there a software package (preferably an application, not a library) that creates Reduced Ordered Binary Decision Diagrams (ROBDDs) from a given truth table (in some text format)?
You can also try this: http://formal.cs.utah.edu:8080/pbl/BDD.php
It is the best tool for BDDs I have used so far.
With any BDD library you can do what you want; of course, you must write a piece of code yourself.
If you are looking for a lightweight tool, I often use an applet like this to have a quick look at the BDD of a function:
http://tams-www.informatik.uni-hamburg.de/applets/java-bdd/
BDDs are a memory-constrained data structure because of their heavy reliance on detecting duplicate sub-truthtables. Most BDD packages you'll find aren't a good fit for large, general truth tables; they are instead optimized for very sparse or highly repetitive expressions.
With the standard BDD packages, you work with expressions operating on variables. So you'd have to iterate over your truth table, constructing something like a sum-of-products expression for the 1s in the table. Along the way, most libraries will dynamically reorder the variables to fit memory constraints, causing another huge slowdown. After around 24 variables, even with very sparse truth tables, these libraries will start to thrash on modern systems.
If you're only looking for the final BDD nodes, given a truth table with its variable ordering already implicitly defined, you can skip a lot of the complexity of external libraries and their horrible runtimes by just using some Unix text-processing tools.
A good resource on BDDs, Knuth's TAOCP v4.1b, shows the equivalence of BDD nodes and their "beads": sub-truthtables that are non-square strings. I'm going to address a simpler version, ZDDs, which have similar structures called "zeads": positive-part sub-truthtables that are not completely zero. To generalize back to BDDs, replace sed+grep in the pipeline with a program that filters out square strings instead of keeping non-zero positive parts.
To print all the zeads of a truthtable (given as a one-line file of ascii '1's and '0's, newline at end), run the following command after setting the number of variables and filename:
# MAX is the number of variables; FILENAME holds one line of 2^MAX ascii '0'/'1' characters.
# At each level: fold splits the table into equal-width chunks, sed keeps the even-numbered
# chunks (the positive parts; 1~2 is GNU sed addressing), grep drops all-zero chunks,
# and sort -u deduplicates what survives: the zeads at that level.
MAX=8
FILENAME="8_vars_truthtable.txt"
for (( ITER=0; ITER<=MAX; ITER++ )); do
    INTERVAL=$(( 2 ** ITER ))
    fold -w "${INTERVAL}" "${FILENAME}" | sed -n '1~2!p' | grep -v "^0*$" | sort -u
done
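For instance (a hypothetical input; the 3-input parity function is an arbitrary choice), a 3-variable truth table is a single line of 2^3 = 8 bits; create it, set MAX and FILENAME accordingly, and run the loop above:
printf '01101001\n' > 3_vars_truthtable.txt   # rows 000..111 of the 3-input XOR (parity) function
MAX=3
FILENAME="3_vars_truthtable.txt"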
This has many benefits over BDD packages:
Simple with essentially no extraneous dependencies.
External sorting means no thrashing, unlike in-memory hash tables.
Easily parallelizable and scalable if you understand line buffering and disk caching when forking in the for loop.
If you write to intermediate files, the sorting will parallelize too.
I use it regularly for truthtables up to 32 variables, which are unrealistic to handle with BDD libraries. It doesn't tax the memory system at all, barely using a few MB. But if you have a ton of RAM available, a decent OS like Linux will gladly use it all for disk caching to make it even faster.

`cleartool lsco -r -cvi -me` is extremely slow compared to `cleartool lsco -graphical`. Is it possible to improve its performance?

I'd like to be able to utilize lsco on the command line for better integration with Emacs, but it runs prohibitively slowly!
Usually, GUIs are slower!
From the technote "Recursively checkout and checkin elements":
It is recommended that if performance is degraded due to this recursive operation that either the operation be changed (say to checkout/checkin in smaller chunks) or to stop the operation all together.
In other words, the recursive nature of lsco (and associated commands) doesn't scale well.
As opposed to the GUI, which might very well launch several requests per main directory involved.
Not with the exact same behavior. But if you use a dynamic view,
cleartool lspriv -co -s
is probably much faster than
cleartool lsco -r -cview
And if the former isn't what you want, maybe you can set up a filter script around it to fit your needs.
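Such a filter can be a thin wrapper (a hypothetical sketch: it assumes the -s output of lspriv is one element pathname per line, and /vobs/myproj is only a placeholder path):
#!/bin/bash
# Hypothetical wrapper: list checkouts under a given VOB path in the current dynamic view.
prefix=${1:-/vobs/myproj}
cleartool lspriv -co -s | grep "^${prefix}/"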

Is the Unix Philosophy falling out of favor in the Ruby community? [closed]

David Korn, a proponent of the Unix philosophy, chided Perl programmers a few years ago in a Slashdot interview for writing monolithic Perl scripts without making use of the Unix toolkit through pipes, redirection, etc. "Unix is not just an operating system," he said, "it is a way of doing things, and the shell plays a key role by providing the glue that makes it work."
It seems that reminder could apply equally to the Ruby community. Ruby has great features for working together with other Unix tools through popen, STDIN, STDOUT, STDERR, ARGF, etc., yet it seems that increasingly, Rubyists are opting to use Ruby bindings and Ruby libraries and build monolithic Ruby programs.
I understand that there may be performance reasons in certain cases for going monolithic and doing everything in one Ruby process, but surely there are a lot of offline and asynchronous tasks that could be well handled by Ruby programs working together with other small programs each doing one thing well in the Unix fashion, with all the advantages that this approach offers.
Maybe I'm just missing something obvious. Is the Unix Philosophy still as relevant today as it was 10 years ago?
The Unix philosophy of pipes and simple tools is for text. It is still relevant, but perhaps not as relevant as it used to be:
We are seeing more tools whose output is not designed to be easily parseable by other programs.
We are seeing much more XML, where there is no particular advantage to piping text through filters, and where regular expressions are a risky gamble.
We are seeing more interactivity, whereas in Unix pipes information flows in one direction only.
But although the world has changed a little bit, I still agree with Korn's criticism. It is definitely poor design to create large, monolithic programs that cannot interoperate with other programs, no matter what the language. The rules are the same as they have always been:
Remember your own program's output may be another program's input.
If your program deals in a single kind of data (e.g., performance of code submitted by students, which is what I've been doing for the last week), make sure to use the same format for both input and output of that data.
For interoperability with existing Unix tools, inputs and outputs should be ASCII and line-oriented. Many IETF Internet protocols (SMTP, NNTP, HTTP) are sterling examples.
Instead of writing a big program, consider writing several small programs connected with existing programs by shell pipelines. For example, a while back the xkcd blog had a scary pipeline for finding anagrams in /usr/share/dict/words (a sketch in that spirit appears after this list).
Work up to shell scripts gradually by making your interactive shell one you can also script with. (I use ksh but any POSIX-compatible shell is a reasonable choice.)
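Here is a sketch in that spirit (not the original xkcd pipeline; it just chains small tools, and the awk step prints consecutive lines that share the same sorted-letter key):
# Tag each word with its letters sorted, sort on that key, and report neighbours
# that share a key -- those are anagrams of each other.
while read -r word; do
    key=$(fold -w1 <<<"$word" | sort | tr -d '\n')
    printf '%s %s\n' "$key" "$word"
done < /usr/share/dict/words \
    | sort \
    | awk 'prev == $1 { print prevword, $2 } { prev = $1; prevword = $2 }'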
In conclusion there are really two highly relevant ways of reusing code:
Write small programs that fit together well when connected by shell pipelines (Unix).
Write small libraries that fit together well when connected by import, #include, load, require, or use (Ruby, C++ STL, C Interfaces and Implementations, and many others).
In the first paradigm, dependency structure is simple (always linear) and therefore easy to understand, but you're more limited in what you can express. In the second paradigm, your dependency structure can be any acyclic graph—lots more expressive power, but that includes the power to create gratuitous complexity.
Both paradigms are still relevant and important; for any particular project, which one you pick has more to do with your clients and your starting point than with any intrinsic merit of the paradigm. And of course they are not mutually exclusive!
I think that the Unix philosophy started falling out of favor with the creation of Emacs.
My vote is yes. Subjective, but excellent programming question.
Just a personal anecdote from a time when we were re-writing a mass print-output program for insurance carriers. We were literally scolded by advisors for "programming" in the shell. We were made aware that it was too disconnected and that the languages were too disparate to be complete.
Maybe.
All of a sudden multi-processor Intel boxen became commonplace and fork() didn't really perform as horribly as everyone was always warned in the new age of applications (think VB days). The bulk print programs (which queried a db, transformed to troff output and then to PostScript via msgsnd and then off to the LPD queue in hundreds of thousands) scaled perfectly well against all the systems and didn't require rewrites when the VB runtimes changed.
To your question:
I'm with Mr. Korn, but it is not Perl's fault; it is the Perl programmers who decide that Perl alone is sufficient. In multi-process systems maybe it is good enough.
I hope that Ruby, Perl, Python, and (gasp) even Java developers can keep their edge in the shell pipeline. There is inherent value in the development philosophy for implicit scaling and interfacing, separation of duties, and modular design.
Approached properly, with our massively-cored processing units on the horizon, the Unix philosophy may again gain ground.
It does not appear to be completely lost. I read a recent blog entry by Ryan Tomayko that extolled the UNIX philosophy and how it is embraced by the Unicorn HTTP server. However, he did have the same general feeling that the Ruby community is ignoring the UNIX philosophy in general.
I guess the rather simple explanation is that Unix tools are only available on Unix. The vast majority of users, however, run Windows.
I remember the last time I tried to install Nokogiri, it failed because it couldn't run uname -p. There was absolutely no valid reason to do that. All the information that can be obtained by running uname -p is also available from within Ruby. Plus, uname -p is actually not even Unix, it's a non-standard GNU extension which isn't even guaranteed to work on Unix, and is for example completely broken on several Linux distributions.
So, you could either use Unix and lose 90% of your users, or use Ruby.
No. Unix and Ruby are alive and well
Unix and Ruby are both alive and well. Unix vendor Apple's stock is headed into orbit, Linux has an irrevocably dug-in position running servers, development, and lab systems, and Microsoft's desktop-software empire is surrounded by powerful SaaS barbarians like Google and an army of allies.
Unix has never had a brighter future, and Ruby is one of its key allies.
I'm not sure the software-tools pattern is such a key element of Unix anyway. It was awesome in its day, given the generally clunky, borderline-worthless quality of competing CLI tools, but Unix introduced many other things as well, including an elegant process and I/O model.
Also, I think you will find that many of those Ruby programs use various Unix and software-tools interfaces internally. Check for popen() and various Process methods.
I think Ruby simply has its own sphere of influence.

Resources