multiple sed commands: when semicolon, when pipeline? - bash

When I construct a complicated operation in sed, I often start with
cat infile | sed 'expression1' | sed 'expr2' ...
and then optimize that into
cat infile | sed 'expr1;expr2;expr3' | sed 'expr4' | sed 'expr5;expr6' ...
What guidelines are there for which expressions can be combined with semicolons into a single command?
So far, I just ad hoc combine s///'s, and don't combine //d's.
(The optimization is for running it tens of millions of times. Yes, it's measurably faster.)
(Posted here instead of on superuser.com, because that has 20x fewer questions about sed.)

The operation that you're carrying out is fundamentally different in each case.
When you "combine" sed commands using a pipe, the whole file is processed by every invocation of sed. This incurs the cost of launching a separate process for every part of your pipeline.
When you use a semicolon-separated list of commands, each command is applied in turn to every line in the file, using a single instance of sed.
Depending on the commands you're using, the output of these two things could be very different!
If you don't like using semicolons to separate commands, I would propose another option: use sed -e 'expr1' -e 'expr2' -e 'expr3' file. Alternatively, many tools including sed support -f to pass a file containing commands. You can put each command on its own line instead of using semicolons, for clarity.
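As a quick illustration that joining independent substitutions with semicolons, with separate -e options, or with a pipeline all give the same result (the substitutions are arbitrary examples, not from the question):
$ printf 'foo bar\n' | sed 's/foo/1/' | sed 's/bar/2/'
1 2
$ printf 'foo bar\n' | sed 's/foo/1/;s/bar/2/'
1 2
$ printf 'foo bar\n' | sed -e 's/foo/1/' -e 's/bar/2/'
1 2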

Generally, s and d can coexist peacefully. It's when the different commands interact with each other that you may have to break down and use separate scripts, or switch to a richer language with variables etc.
For example, a sed script which adds thousands separators to numbers which lack them should probably be kept completely separate from other processing. The modularity is probably more important than any possible efficiency gains in the long run.
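(For what it's worth, such a thousands-separator script usually needs a label and a branch, which is exactly the kind of command interaction that argues for keeping it separate. A common sketch, assuming GNU sed for \B and \>:)
$ echo 1234567 | sed -e :a -e 's/\B[0-9]\{3\}\>/,&/' -e ta
1,234,567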

What guidelines are there for which expressions can be combined with semicolons into a single command? So far, I just ad hoc combine s///'s, and don't combine //d's.
sed has many more commands than just s and d. If those are the only ones you're using, though, then you can join as many as you like in the same sed run. The result will be the same as for a pipeline of multiple single-command seds. If you're going to do that, however, then consider either using a command file, as @anubhava suggested, or giving each independent expression via its own -e argument; either one is much clearer than a single expression consisting of multiple semicolon-separated commands.
Even if you use other commands, for the most part you will get the same result from performing a sequence of commands via a single sed process as you do by performing the same commands in the same order via separate sed processes. The main exceptions I can think of involve commands that are necessarily dependent on one another, such as labels and branches; commands manipulating the hold space and those around them; commands grouped within braces ({}); and the p command under sed -n.
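A minimal illustration of such an exception, using N to join lines (the input is made up): within a single sed the substitution can see the embedded newline, but in a pipeline the second sed never does.
$ printf 'a\nb\n' | sed 'N;s/\n/ /'
a b
$ printf 'a\nb\n' | sed 'N' | sed 's/\n/ /'
a
b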
With that said, sed programs get very cryptic very fast. If you're writing a complicated transformation then consider carefully taking @EdMorton's advice and writing the whole thing as a (single) awk program instead of one or several sed programs.

Better to avoid multiple sed invocations with
sed -f mycmd.awk
where mycmd.awk contains each sed command listed on a separate line.
As per man sed:
-f command_file
Append the editing commands found in the file command_file to the list of commands. The editing commands should each be listed on a separate line.
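For example, a command file might look like this (the expressions are placeholders, not from the question):
$ cat mycmd.awk
s/foo/bar/g
/^#/d
$ sed -f mycmd.awk infile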

Related

How does Bash parse multi-flag commands?

I'm trying to create an overly simplified version of bash. I've tried splitting the program into "lexer + expander, parser, executor".
In the lexer I store my data (commands, flags, files) and create tokens out of them. My procedure is simply to loop through the given input character by character and use a state machine to handle states; a state is either a special character, an alphanumeric character, or a space.
When I'm in an alphanumeric state I'm at a command; the way I know where the next flag is, is when I encounter an alphanumeric state again or when input[i] == '-'. Now the problem is with multi-flag commands.
For example:
$ ls -la | grep "*.c"
I successfully get the commands ls and grep and the flags -la and *.c.
However, with multi-flag commands like:
$ sed -i "*.bak" "s/a/b/g" file1 file2
It seems very difficult to me, and I can't yet figure out how to know where the flags to a specific command end. So my question is: how does bash parse these multi-flag commands? Any suggestions regarding my problem would be appreciated!
The shell does not attempt to parse command arguments; that's the responsibility of the utility. The range of possible command argument syntaxes, both in use and potentially useful, is far too great to attempt that.
On Unix-like systems, the shell identifies individual arguments from the command line, mostly by splitting at whitespace but also taking into account the use of quotes and a variety of other transformations, such as "glob expansion". It then makes a vector of these arguments ("argv") and passes the vector to execve, which hands them to the newly created process.
On Windows systems, the shell doesn't even do that. It just hands over the command-line as a string, and leaves it to the command-line tool to do everything. (In order to provide a modicum of compatibility, there's an intermediate layer which is called by the application initialization code, which eventually calls main(). This does some basic argument-splitting, although its quoting algorithm is quite a bit simplified from that used by a Unix shell.)
No command-line shell that I know of attempts to identify command-line flags. And neither should you.
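A quick way to see this for yourself is a script that prints each argument it receives on its own line (showargs is a made-up name for this demo):
$ cat showargs
#!/bin/bash
for a in "$@"; do printf '<%s>\n' "$a"; done
$ bash showargs -la "*.c" file1 file2
<-la>
<*.c>
<file1>
<file2>
By the time the program runs, the quotes are gone and no globbing has happened; -la and *.c are just ordinary strings, and it is entirely up to the program (ls, grep, sed, …) to decide what they mean.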
For a bit of extracurricular reading, here's the description of shell parsing from the Posix standard: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html. Trying to implement all that goes far beyond the requirements given to you for this assignment, and I'm certainly not recommending that you do that. But it might still be interesting, and understanding it will help you immensely if you start using a shell.
Alternatively, you could try reading the Bash manual, which might be easier to understand. Note that Bash implements a lot of extensions to the Posix standard.

zsh: argument list too long: sudo

I have a command which I need to run in which one of the args is a list of comma-separated ids. The list of ids is over 50k. I've stored the list of ids in a file and I'm running the command in the following way:
sudo ./mycommand --ids `cat /tmp/ids.txt`
However I get an error zsh: argument list too long: sudo
This I believe is because the kernel has a max size of arguments it can take. One option for me is to manually split the file into smaller pieces (since the ids are comma separated I can't just break it evenly) and then run the command each time for each file.
Is there a better approach?
ids.txt file looks like this:
24342,24324234,122,54545,565656,234235
Converting comments into a semi-coherent answer.
The file ids.txt contains a single line of comma-separated values, and the total size of the file can be too big to be the argument list to a program.
Under many circumstances, using xargs is the right answer, but it relies on being able to split the input up in to manageable chunks of work, and it must be OK to run the program several times to get the job done.
In this case, xargs doesn't help because of the size and format of the file.
It isn't stated absolutely clearly that all the values in the file must be processed in a single invocation of the command. It also isn't absolutely clear whether the list of numbers must all be in a single argument or whether multiple arguments would work instead. If multiple invocations are not an issue, it is feasible to reformat the file so that xargs can split it into manageable chunks, and if need be each chunk can be turned back into a single comma-separated argument, as sketched below.
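If multiple invocations did turn out to be acceptable, the reformatting and chunking could look roughly like this (a sketch only: mycommand and --ids come from the question; the chunk size of 1000 is arbitrary):
tr ',' '\n' </tmp/ids.txt |   # one ID per line
  xargs -n 1000 |             # regroup into lines of 1000 space-separated IDs
  while read -r chunk; do
    sudo ./mycommand --ids "$(echo "$chunk" | tr ' ' ',')" </dev/null
  done
The </dev/null keeps mycommand from reading (and consuming) the loop's own standard input.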
However, it appears that these options are not acceptable. In that case, something has to change.
If you must supply a single argument that is too big for your system, you're hosed until you change something — either the system parameters or your program.
Changing the program is usually easier than reconfiguring the o/s, especially when you take into account having to redo the configuration after o/s upgrades.
One option worth reviewing is changing the program to accept a file name instead of the list of numbers on the command line:
sudo ./mycommand --ids-list=/tmp/ids.txt
and the program opens and reads the ID numbers from the file. Note that this preserves the existing --ids …comma,separated,list,of,IDs notation. The use of the = is optional; a space also works.
Indeed, many programs work on the basis that arguments provided to it are file names to be processed (the Unix filter programs — think grep, sed, sort, cat, …), so simply using:
sudo ./mycommand /tmp/ids.txt
might be sufficient, and you could have multiple files in a single invocation by supplying multiple names:
sudo ./mycommand /tmp/ids1.txt /tmp/ids2.txt /tmp/ids3.txt …
Each file could be processed in turn. Whether the set of files constitutes a single batch operation or each file is its own batch operation depends on what mycommand is really doing.

How to get rid of bash control characters by evaluating them?

I have an output file (namely a log from screen) containing several control characters. Inside the screen, I have programs running that use control characters to refresh certain lines (examples would be top or anything printing progress bars).
I would like to output a tail of this file using PHP. If I simply read in that file and echo its contents (either using PHP functions or through calling tail), the output is messy and much more than these last lines, as it also includes things that have been overwritten. If I instead run tail in the command line, it returns just what I want because the terminal evaluates the control characters.
So my question is: Is there a way to evaluate the control characters, getting the output that a terminal would show me, in a way that I could then use elsewhere (e.g., write to a file)?
@5gon12eder's answer got rid of some control characters (thanks for that!) but it did not handle the carriage return part that was even more important to me.
I figured out that I could just delete anything from the beginning of a line to the last carriage return inside that line and simply keep everything after that, so here is my sed command accomplishing that:
sed 's/^.*\r\([^\r]\+\)\r\?$/\1\r/g'
The output can then be further cleaned using @5gon12eder's answer:
cat screenlog.0 | sed 's/^.*\r\([^\r]\+\)\r\?$/\1\r/g' | sed 's,\x1B\[[0-9?;]*[a-zA-Z],,g'
Combined, this looks exactly like I wanted.
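For instance, with a line that was refreshed twice (the input here is made up; cat -v just makes the remaining carriage return visible as ^M):
$ printf 'progress 10%%\rprogress 50%%\rdone\n' | sed 's/^.*\r\([^\r]\+\)\r\?$/\1\r/g' | cat -v
done^M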
I'm not sure what you mean by “evaluating” the control characters but you could remove them easily.
Here is an example using sed but if you are already using PHP, its internal regex processing functionality seems more appropriate. The command
$ sed 's,\x1B\[[0-9?;]*[a-zA-Z],,g' file.dat
will dump the contents of file.dat to standard output with all ANSI escape sequences removed. (And I'm pretty sure that nothing else is removed except if your file contains invalid escape sequences in which case the operation is ill-defined anyway.)
Here is a little demo:
$ echo -e "This is\033[31m a \033[umessy \033[46mstring.\033[0m" > file.dat
$ cat file.dat
# The output of the above command is not shown to protect small children
# that might be browsing this site.
$ reset # your terminal
$ sed 's,\x1B\[[0-9?;]*[a-zA-Z],,g' file.dat
This is a messy string.
The less program has some more advanced logic built in to selectively replace some escape sequences. Read the man page for the relevant options.

Bash - control external command's output

I'm writing a bash script to make DVD authoring more automated (but, mainly, so that I can learn some more bash scripting) and I'm trying to find out if it's possible to control how an external command presents its output.
For instance, the output from ffmpeg is a load of (to me) irrelevant cruft about options, libraries, streams, progress and so on.
What I really want is to be able to select for display only the lines with the input and output filenames and then to display the progress on the same line each time. Similarly for mkisofs and wodim.
I've tried Googling for this and am beginning to suspect that either it's not possible or nobody's thought of it before (or, possibly, that it's so obvious that nobody thinks it necessary to say how :-) ).
Many thanks, in advance,
David Shaw
You want to use grep and pipes. They are your friends. You want to pipe the output of ffmpeg into grep and have it output only lines containing the text you want.
Assuming you have the input and output file names as command-line arguments $1 and $2 to your shell script, you might try something like
ffmpeg .... | grep "$1\|$2"
            ^         ^
            |         +-- escape and OR character
            +-- pipe character
The '\|' is the OR (alternation) operator in GNU grep's basic regular expressions: a backslash followed by '|'. An unescaped '|' would be treated as a literal character by grep in a basic regex (and, unquoted, as the shell's pipe character), so it needs the backslash here.
This will output only the lines that contain the file names you are looking for.
This assumes all output is via stdout. If ffmpeg is outputting text via stderr then you will need to add a redirect at the end of the ffmpeg line to send that back to stdout.
EDIT: I used the wrong quotes in the first example. Use double quotes or it won't expand the parameters $1 and $2
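ffmpeg does in fact write its progress and diagnostics to stderr, so in practice the pipeline usually needs that redirect (the ... stands for whatever options you are already passing):
ffmpeg ... 2>&1 | grep "$1\|$2"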

Can using cat create problems when passing the output to other commands?

In bash, there are multiple ways to direct input and output. For example, these commands do the same thing:
sort <input_file >output_file
cat input_file | sort >output_file
Generally I'd prefer the second way, because I prefer my commands to read left-to-right.
But the answer to this question says:
"sort" can use temporary files to work with input files larger than memory
Which makes me wonder, when sorting a huge file, if cat would short-circuit that process.
Can using cat create problems when passing the output to other commands?
There is a term I throw around a lot called Useless Use of Cat (UUoC), and the 2nd option is exactly that. When a utility can take input on STDIN (such as sort), using redirection not only saves you an extra call to an external process such as cat, it also avoids the overhead of a pipeline.
Other than the extra process and pipeline, the only other "problem" I see is that you would be subject to pipeline buffering.
Update
Apparently, there is even a website dedicated to giving out a UUoC Award
"I prefer my commands to read left-to-right"
<input_file sort >output_file
(The canonical way to write this is of course sort input_file >output_file.)
The sort command handles large files regardless of whether the input arrives via standard input (a pipe or I/O redirection) or from files named directly on the command line.
Note that you could (and probably should) write:
sort -o output_file input_file
That will work correctly even if the input and output files are the same (or if you have multiple input files, one of which is also the output file).
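For example (a sketch; data.txt is a made-up file name):
sort -o data.txt data.txt   # safe: sort reads all its input before it opens the -o output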
I see that SiegeX has already taken you to task for abusing cat -- feline abuse, as it is also known. I'll support his efforts. There are times when it is appropriate to use cat. There are fewer times when it is appropriate than is often recognized.
One example of appropriate use is with the tr command and multiple sources of data:
cat "$@" | tr ...
That is necessary because tr only reads its standard input and only writes to its standard output - the ultimate in 'pure filter' programs.
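For instance, a tiny script that lower-cases everything in the files named as its arguments might look like this (the tr arguments are just an example):
#!/bin/sh
# concatenate every named file (or stdin if none are given) and fold to lower case
cat "$@" | tr '[:upper:]' '[:lower:]'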
The authors of Unix have also noted that the general-purpose 'cat inputs | command' construct is often used instead of the more specialized input redirection (citation missing; the books needed are not at hand).
