Realtime removal of carriage return in shell - bash

For context, I'm attempting to create a shell script that simplifies the realtime console output of ffmpeg, only displaying the current frame being encoded. My end goal is to use this information in some sort of progress indicator for batch processing.
For those unfamiliar with ffmpeg's output, it outputs encoded video information to stdout and console information to stderr. Also, when it actually gets to displaying encode information, it uses carriage returns to keep the console screen from filling up. This makes it impossible to simply use grep and awk to capture the appropriate line and frame information.
The first thing I've tried is replacing the carriage returns using tr:
$ ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | tr '\r' '\n'
This works in that it displays realtime output to the console. However, if I then pipe that information to grep or awk or anything else, tr's output is buffered and is no longer realtime. For example: $ ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | tr '\r' '\n'>log.txt results in a file that is immediately filled with some information, then 5-10 secs later, more lines get dropped into the log file.
At first I thought sed would be great for this: $ ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | sed 's/\r/\n/', but it gets to the line with all the carriage returns and waits until the processing has finished before it attempts to do anything. I assume this is because sed works on a line-by-line basis and needs the whole line to have completed before it does anything else, and even then it doesn't replace the carriage returns anyway. I've tried various regexes for the carriage return and newline, and have yet to find one that replaces the carriage return. I'm running OSX 10.6.8, so I am using BSD sed, which might account for that.
I have also attempted to write the information to a log file and use tail -f to read it back, but I still run into the issue of replacing carriage returns in realtime.
I have seen that there are solutions for this in Python and Perl; however, I'm reluctant to go that route immediately. First, I don't know Python or Perl. Second, I have a completely functional batch-processing shell application that I would need to either port or figure out how to integrate with Python/Perl. Probably not hard, but not what I want to get into unless I absolutely have to. So I'm looking for a shell solution, preferably bash, but any of the OSX shells would be fine.
And if what I want is simply not doable, well I guess I'll cross that bridge when I get there.

If it is only a matter of output buffering by the receiving application after the pipe, then you could try using gawk (or some BSD awks) or mawk, which can flush their buffers. For example, try:
... | gawk '1;{fflush()}' RS='\r\n' > log.txt
Alternatively, if your awk does not support this, you can force it by repeatedly closing the output file and appending the next line:
... | awk '{sub(/\r$/,x); print>>f; close(f)}' f=log.out
Or you could just use shell, for example in bash:
... | while IFS= read -r line; do printf "%s\n" "${line%$'\r'}"; done > log.out

Libc line-buffers stdout when it is connected to a terminal and fully buffers it (with a 4KB buffer) when it is connected to a pipe. This happens in the process generating the output, not in the receiving process: it's ffmpeg's fault, in your case, not tr's. Try using unbuffer or stdbuf to disable the output buffering:
unbuffer ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | tr '\r' '\n'
stdbuf -e0 -o0 ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | tr '\r' '\n'
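If your end goal is just the current frame number, the unbuffered pipeline can feed an awk stage that extracts and flushes it. The following is only a sketch: it assumes stdbuf is available (as in the commands above), and it assumes the status line starts with frame= followed by the number, which can vary between ffmpeg versions.
stdbuf -o0 -e0 ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | stdbuf -o0 tr '\r' '\n' | awk '/^frame=/ { sub(/^frame= */, ""); print $1; fflush() }'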

The buffering of data between processes in a pipe is controlled by some system limits, which, at least on my system (Fedora 17), cannot be modified:
$ ulimit -a | grep pipe
pipe size (512 bytes, -p) 8
$ ulimit -p 1
bash: ulimit: pipe size: cannot modify limit: Invalid argument
$
Although this buffering mostly governs how much excess data the producer is allowed to produce before it is stopped when the consumer is not consuming at the same speed, it might also affect the timing of delivery of smaller amounts of data (I am not quite sure of this).
That is the buffering of the pipe data itself, and I do not think there is much to tweak there. However, the programs reading and writing the piped data may also buffer their stdin/stdout data, and that is what you want to avoid in your case.
Here is a perl script that should do the translation with minimal input buffering and no output buffering:
#!/usr/bin/perl
use strict;
use warnings;
use Term::ReadKey;

my $ReadKeyTimeout = 10; # seconds
$| = 1;                  # OUTPUT_AUTOFLUSH

# Read one character at a time, translating carriage returns to newlines
while ( my $key = ReadKey($ReadKeyTimeout) ) {
    if ( $key eq "\r" ) {
        print "\n";
        next;
    }
    print $key;
}
However, as already pointed out, you should make sure that ffmpeg does not buffer its output if you want real-time response.
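For example, if the script is saved as cr2nl.pl (a name used here purely for illustration) and made executable, it would sit in the pipeline like this, possibly with unbuffer or stdbuf in front of ffmpeg as suggested above:
ffmpeg -i "ScreeningSchedule-1.mov" -y "test.mp4" 2>&1 | ./cr2nl.pl > log.txt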

Related

Saving filtered redis-cli output to a file

I am trying to find when and how the cache is flushed. I planned to use the command redis-cli monitor | grep -iE "del|flush" > redis_log.txt for that, but for some reason the file is empty. If I use the command without the > redis_log.txt part, it shows the correct output in the terminal; if I use the redis-cli monitor > redis_log.txt command, it also saves the actual output to the file. But together it fails: only an empty file is created. Has anybody met a similar issue before?
As mentioned in the comments, the issue you notice certainly comes from the I/O buffering applied to the grep command, especially when its standard output is not attached to a terminal, but redirected to a file or so.
To be more precise, see e.g. this nice blog article which concludes with this wrap-up:
Here’s how buffering is usually set up:
STDIN is always buffered.
STDERR is never buffered.
if STDOUT is a terminal, line buffering will be automatically selected. Otherwise, block buffering (probably 4096 bytes) will be used.
[…] these 3 points explain all “weird” behaviors.
General solution
To tweak the I/O streams buffering of a program, a very handy program provided by coreutils is stdbuf.
So for your use case:
you may want to replace grep -iE "del|flush" with:
stdbuf -o0 grep -iE "del|flush" to completely disable STDOUT buffering;
or, if you'd prefer a trade-off and just want STDOUT line-buffering, replace it with:
either stdbuf -oL grep -iE "del|flush",
or grep --line-buffered -iE "del|flush".
Wrap-up
Finally, as suggested by @jetchisel, you'll probably want to redirect STDERR as well to your log file in order not to miss some error messages… Hence, for example:
redis-cli monitor | stdbuf -o0 grep -iE "del|flush" > redis_log.txt 2>&1

Nothing prints after piping ping through two commands

Running this:
ping google.com | grep -o 'PING'
Will print PING to the terminal, so I assume that means that the stdout of grep was captured by the terminal.
So why doesn't the following command print anything? The terminal just hangs:
ping google.com | grep -o 'PING' | grep -o 'IN'
I would think that the stdout of the first grep command would be redirected to the stdin of the second grep. Then the stdout of the second grep would be captured by the terminal and printed.
This seems to be what happens if ping is replaced with echo:
echo 'PING' | grep -o 'PING' | grep -o 'IN'
IN is printed to the terminal, as I would expect.
So what's special about ping that prevents anything from being printed?
You could try being more patient :-)
ping google.com | grep -o 'PING' | grep -o 'IN'
will eventually display output, but it might take half an hour or so.
Under Unix, the standard output stream handed to a program when it starts up is "line-buffered" if the stream is a terminal; otherwise it is fully buffered, typically with a buffer of 8 kilobytes (8,192 characters). Buffering means that output is accumulated in memory until the buffer is full, or, in the case of line-buffered streams, until a newline character is sent.
Of course, a program can override this setting, and programs which produce only small amounts of output -- like ping -- typically make stdout line-buffered regardless of what it is connected to. But grep does not do so (although you can tell GNU grep to do that by using the --line-buffered command-line option).
"Pipes" (which are created to implement the | operator) are not considered terminals. So the grep in the middle will have fully-buffered output, meaning that its output will be held back until 8k characters have been written. That will take a while in your case, because each line contains only five characters (PING plus a newline), and they are produced once a second. So the buffer will fill up after about 1640 seconds, which is a bit over 27 minutes.
Many Unix distributions come with a program called stdbuf which can be used to change the buffering of the standard streams before running a program. (If you have stdbuf, you can find out how it works by typing man 1 stdbuf.) Programming languages like Perl generally provide other mechanisms to control this buffering from within the program. (In Perl, you can force a flush after every write using the builtin variable $|, or the autoflush(BOOL) io handle method.)
Of course, when a program terminates successfully, all output buffers are "flushed" (sent to their respective streams). So
echo PING | grep -o 'PING' | grep -o 'IN'
will immediately output its only output line. But ping does not terminate unless you provide a count command-line option (-c N; see man ping). So if you need immediate piped throughput, you may need to modify buffering behaviour.
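Putting that together for this example, either of the following should start printing IN once per second (a sketch using the GNU options mentioned above, so it assumes GNU grep and coreutils):
ping google.com | grep --line-buffered -o 'PING' | grep -o 'IN'
ping google.com | stdbuf -oL grep -o 'PING' | grep -o 'IN'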

Piping sometimes does not lead to immediate output

I observed a few times now that A | B | C may not lead to immediate output, although A is constantly producing output. I have no idea how this even may be possible. From my understanding all three processes ought to be working on the same time, putting their output into the next pipe (or stdout) and taking from the previous pipe when they are finished with one step.
Here's an example where I am currently experiencing that:
tcpflow -ec -i any port 8340 | tee second.flow | grep -i "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
What is supposed to happen:
I watch one port for TCP packets. Anything that arrives should be in a certain XML format, and I want to grep the Manufacturer and the SerialNumber out of these packets. I would also like to get the full, unmodified output in a text file "second.flow" for later reference.
What happens:
Everything as desired, but instead of getting output every 10 seconds (I'm sure I get these outputs every ten seconds!) I have to wait for a long time and then a lot is printed at once. It's like one of the tools gobbles up everything in a buffer and only prints it if the buffer is full. I don't want that. I want to get each line as fast as possible.
If I replace tcpflow ... with cat second.flow, it works immediately. Can someone describe what's going on? And in case it's obvious: is there another way to achieve the same result?
Every layer in a series of pipes can involve buffering; by default, tools that don't specify buffering behavior for stdout will use line buffering when outputting to a terminal, and block buffering when outputting anywhere else (including piping to another program or a file). In a chained pipe, all but the last stage will see their output as not going to the terminal, and will block buffer.
So in your case, tcpflow might be producing output constantly, and if it's doing so, tee should be producing data almost at the same rate. But grep is going to limit that flow to a trickle, and won't produce output until that trickle exceeds the size of the output buffer. It's already performed the filtering and called fwrite or puts or printf, but the data is waiting for enough bytes to build up behind it before sending it along to awk, to reduce the number of (expensive) system calls.
cat second.flow produces output immediately because as soon as cat finishes producing output, it exits, flushing and closing its stdout in the process, which cascades, when each step finds its stdin to be at EOF, it exits, flushing and closing its stdout. tcpflow isn't exiting, so the cascade of EOFs and flushing isn't happening.
For some programs, in the general case, you can change the buffering behavior by using stdbuf (or unbuffer, though that can't do line buffering to balance efficiency, and has issues with piped input). If the program is using internal buffering, this still might not work, but it's worth a shot.
In your specific case, though, it's likely grep that's causing the interruption (it produces only a trickle of output that sticks in the buffer, whereas tcpflow and tee are producing a torrent, and awk writes to the terminal and is therefore line-buffered by default), so you can just adjust your command line to:
tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
At least for Linux's grep (I'm not sure whether the switch is standard), that makes grep change its own output buffering to line-oriented buffering explicitly, which should remove the delay. If tcpflow itself is not producing enough output to flush regularly (you implied it did, but you could be wrong), you'd use stdbuf on it (but not on tee, which, as the stdbuf man page notes, manually changes its buffering, so stdbuf doesn't do anything there) to make it line-buffered:
stdbuf -oL tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
Update from comments: It looks like some flavors of awk block-buffer prints to stdout even when connected to a terminal. For mawk (the default on many Debian-based distros), you can non-portably disable this by passing the -Winteractive switch at invocation. Alternatively, to work portably, you can call system("") after each print, which forces output flushing on all implementations of awk. Sadly, the obvious fflush() is not portable to older implementations of awk, but if you only care about modern awks, just use fflush() to be obvious and mostly portable.
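Applied to the awk stage of this pipeline, that flush would look something like this (a sketch; the system("") variant is the fallback for older awks):
... | awk -F'[<>]' '{print $3; fflush()}'
... | awk -F'[<>]' '{print $3; system("")}'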
Reduce Buffering
Each application in the pipeline can do its own buffering. You may want to see if you can reduce buffering in tcpflow, as your other commands are line-oriented and unlikely to be the source of your buffering issue. I didn't see any specific options for buffer control in tcpflow, though the -b flag for max_bytes may help in circumstances where the text you want to work with is near the front of the flow.
You can also try modifying the buffering of tcpflow using stdbuf from GNU coreutils. This may help to reduce latency in your pipeline, but the man page provides the following caveats:
NOTE: If COMMAND adjusts the buffering of its standard streams ('tee' does for example) then that will override corresponding changes by 'stdbuf'. Also some filters (like 'dd' and 'cat' etc.) don't use streams for I/O, and are thus unaffected by 'stdbuf' settings.
As an example, the following may reduce output buffering of tcpflow:
stdbuf --output=0 tcpflow -ec -i any port 8340 # unbuffered output
stdbuf --output=L tcpflow -ec -i any port 8340 # line-buffered output
unless one of the caveats above applies. Your mileage may vary.

Tailing a logfile and processing each line is missing data when converting a file with ffmpeg

I am running a script to tail a log file as per the code snippet below. I am running into a problem whereby the line passed into $line is missing a number of bytes from the beginning when several lines are written to the log file at nearly the same time.
I can check the file afterwards and see that the offending line is complete in the file, so why is it incomplete in the script? Some kind of buffering issue, perhaps?
The processing can sometimes take several seconds to complete; would that make a difference?
#!/bin/bash
tail -F /var/log/mylog.log | while read line
do
    log "$line"
    ffmpeg -i "from.wav" "to.mp3"
done
Full line in file
"12","","765467657","56753763","test"
example logged $line
657","56753763","test"
Update
I have done some more debugging of my code and it seems the processing that is causing the problem is a call to ffmpeg used to convert a wav to an mp3. If I swap that with just a sleep then the problem goes away. Could ffmpeg affect the buffer somehow?
If you are on a platform with a reasonably recent version of GNU Coreutils (e.g. any fairly recent Linux distro), you can use stdbuf to force line buffering.
The example in the stdbuf manpage is highly relevant:
tail -f access.log | stdbuf -oL cut -d ' ' -f1 | uniq
This will immediately display unique entries from access.log
In a while read loop, ffmpeg also reads from standard input, so it consumes part of the data that was meant for read. To prevent this behavior, a common workaround is redirecting ffmpeg's standard input to /dev/null, as shown below:
tail -F /var/log/mylog.log | while read line
do
    log "$line"
    ffmpeg -i "from.wav" "to.mp3" < /dev/null
done
There are also other commands, such as ssh, mplayer, HandBrakeCLI ..., that display the same behavior in a while loop.
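Another workaround, if you'd rather not touch ffmpeg's invocation, is to feed the loop through a separate file descriptor so that every command inside the loop keeps its normal standard input. This is a sketch that assumes bash, since it relies on process substitution and read -u:
while read -u 3 line
do
    log "$line"
    ffmpeg -i "from.wav" "to.mp3"   # keeps its normal stdin; the log lines arrive on fd 3
done 3< <(tail -F /var/log/mylog.log)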

Why piping to the same file doesn't work on some platforms?

In cygwin, the following code works fine
$ cat junk
bat
bat
bat
$ cat junk | sort -k1,1 |tr 'b' 'z' > junk
$ cat junk
zat
zat
zat
But in the Linux shell (GNU/Linux), it seems that overwriting doesn't work:
[41] othershell: cat junk
cat
cat
cat
[42] othershell: cat junk |sort -k1,1 |tr 'c' 'z'
zat
zat
zat
[43] othershell: cat junk |sort -k1,1 |tr 'c' 'z' > junk
[44] othershell: cat junk
Both environments run BASH.
I am asking this because sometimes, after I do some text manipulation, this caveat forces me to make a tmp file. But I know that in Perl you can give the -i flag to overwrite the original file after some operations/manipulations. I just want to ask if there is any foolproof method in a Unix pipeline to overwrite the file that I am not aware of.
Four main points here:
"Useless use of cat." Don't do that.
You're not actually sorting anything with sort. Don't do that.
Your pipeline doesn't say what you think it does. Don't do that.
You're trying to over-write a file in-place while reading from it. Don't do that.
One of the reasons you are getting inconsistent behavior is that you are piping to a process that has redirection, rather than redirecting the output of the pipeline as a whole. The difference is subtle, but important.
What you want is to create a compound command with Command Grouping, so that you can redirect the input and output of the whole pipeline. In your case, this should work properly:
{ sort -k1,1 | tr 'c' 'z'; } < junk > sorted_junk
Please note that without anything to sort, you might as well skip the sort command too. Then your command can be run without the need for command grouping:
tr 'c' 'z' < junk > sorted_junk
Keep redirections and pipelines as simple as possible. It makes debugging your scripts much easier.
However, if you still want to abuse the pipeline for some reason, you could use the sponge utility from the moreutils package. The man page says:
sponge reads standard input and writes it out to the specified
file. Unlike a shell redirect, sponge soaks up all its input before
opening the output file. This allows constructing pipelines that read
from and write to the same file.
So, your original command line can be re-written like this:
cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
and since junk will not be overwritten until sponge receives EOF from the pipeline, you will get the results you were expecting.
In general this can be expected to break. The processes in a pipeline are all started up in parallel, so the > junk at the end of the line will usually truncate your input file before the process at the head of the pipelining has finished (or even started) reading from it.
Even if bash under Cygwin lets you get away with this, you shouldn't rely on it. The general solution is to redirect to a temporary file and then rename it when the pipeline is complete.
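In script form that general approach looks something like this (a sketch; junk.tmp is just an arbitrary temporary name):
sort -k1,1 < junk | tr 'b' 'z' > junk.tmp && mv junk.tmp junk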
If you want to edit that file, you can just use the editor:
ex junk << EOF
%!(sort -k1,1 |tr 'b' 'z')
x
EOF
Overwriting the same file in a pipeline is not advised, because when you make a mistake you can't get it back (unless you have a backup or it's under version control).
This happens because the input and output in a pipeline are automatically buffered (which gives you the impression that it works), but the commands actually run in parallel. Different platforms can buffer the output in different ways (based on the settings), so on some you end up with an empty file (because the output file gets created/truncated at the start), and on others with a half-finished file.
The solution is to use some method where the file is only overwritten after the full input has been buffered and processed, i.e. on EOF.
This can be achieved by:
Use a utility which soaks up all its input before opening the output file.
This can be done with sponge (as opposed to unbuffer from the expect package).
Avoid using I/O redirection syntax (which can create the empty file before starting the command).
For example, using tee (which buffers its standard streams):
cat junk | sort | tee junk
This would only work with sort, because sort needs all of its input before it can produce any output. So if your command doesn't use sort, add one.
Another tool which can be used is stdbuf, which modifies the buffering of a command's standard streams and lets you specify the buffer size.
Use text processor which can edit files in-place (such as sed or ex).
Example:
$ ex -s +'%!sort -k1' -cxa myfile.txt
$ sed -i '' s/foo/bar/g myfile.txt
Using the following simple script, you can make it work like you want to:
$ cat junk | sort -k1,1 |tr 'b' 'z' | overwrite_file.sh junk
overwrite_file.sh
#!/usr/bin/env bash
OUT=$(cat -)                    # soak up all of stdin before touching the file
FILENAME="$*"
echo "$OUT" | tee "$FILENAME"   # then write it back to the file (and to stdout)
Note that if you don't want the updated file to be sent to stdout, you can use this approach instead:
overwrite_file_no_output.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" > "$FILENAME"
