How to limit output from a potentially too verbose command? - bash

I am looking for a bash snippet for limiting the amount of console output from a shell command that could potentially become too verbose.
The purpose of this is for usage in build/CI environments where you do want to limit the amount of console output in order to prevent overloading the CI server (or even a client tailing the output).
Full requirements:
display only up to 100 lines from the top (head) of the command output
display only up to 100 lines from the bottom (tail) of the command output
archive both stdout and stderr in full into a command.log.gz file
console output must be displayed relatively in real time; a solution that outputs the result only at the end is not acceptable, as we need to be able to see execution progress.
Current findings
unbuffer could be used to force the stdout/stderr to be unbuffered
|& tee can be used to send output to both archiver and tail/head
|& gzip --stdout >command.log.gz could archive the console output
head -n100 and tail -n100 can be used to limit the console output, but they introduce at least some problems, like undesired results if the number of output lines is under 200.

From what I understand, you need to limit output online (while it is being generated).
Here is a function that I think would be useful for you.
limit_output() {
    FullLogFile="./output.log"  # log file that keeps a full copy of the input
    typeset -i MAX=15           # number of lines to show from head and from tail
    typeset -i LINES=0          # number of lines displayed so far
    # tee saves a copy of the input into the log file
    tee "$FullLogFile" | {
        # The braces keep this whole block in one subshell,
        # so LINES keeps its value after the while loop for the final if
        while read -r Line; do
            if [[ $LINES -lt $MAX ]]; then
                LINES=LINES+1
                echo "$Line"        # display the first few lines on screen
            elif [[ $LINES -lt $(($MAX*2)) ]]; then
                LINES=LINES+1       # keep counting lines for a little longer
                echo -n "."         # reduce each line to a single dot
            else
                echo -n "."         # reduce each line to a single dot
            fi
        done
        echo ""                     # finish the line of dots
        # Tail the last few lines: those not shown in the head, and not more than MAX
        if [[ $LINES -gt $MAX ]]; then
            tail -n $(($LINES-$MAX)) "$FullLogFile"
        fi
    }
}
Use it in a script, load it into the current shell, or put it in .bash_profile so it is loaded for every user session.
Usage examples: cat /var/log/messages | limit_output or ./configure | limit_output
The function will read standard input, save it to a log file, display the first MAX lines, then reduce each further line to a single dot (.) on screen, and finally display the last MAX lines (or fewer, if the output was shorter than MAX*2).

Here is my current incomplete solution, which for convenience demonstrates processing a 10-line output and which will (hopefully) limit the output to the first two lines and the last two lines.
#!/bin/bash
seq 10 | tee >(gzip --stdout >output.log.gz) | tail -n2
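One way to fill in the missing head/tail limiting around that tee is sketched below. It is an illustration rather than a finished tool: limit_head_tail and some_command are placeholder names I made up, and a command that block-buffers its output may still need unbuffer or stdbuf -oL in front of it. The first MAX lines stream through in near real time, a rolling awk buffer holds the candidate tail, and the last MAX lines are replayed once the command exits (by definition the tail cannot be printed any earlier), while tee archives everything to command.log.gz.
#!/bin/bash
# Sketch only: head/tail limiting with a full gzipped archive.
limit_head_tail() {
    local max=${1:-100}
    awk -v max="$max" '
        NR <= max { print; next }          # head: pass lines through as they arrive
        { buf[NR % max] = $0 }             # tail: keep a rolling window of the last max lines
        END {
            if (NR <= max) exit            # short output: everything was already printed
            start = NR - max + 1
            if (start < max + 1) start = max + 1
            if (start > max + 1) print "[... " start - max - 1 " lines omitted ...]"
            for (i = start; i <= NR; i++) print buf[i % max]
        }'
}

some_command 2>&1 | tee >(gzip --stdout > command.log.gz) | limit_head_tail 100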

One way I use to achieve this is:
./configure | tee output.log | head -n 5; tail -n 2 output.log
What this does is:
Write the complete output to a file called output.log using tee
Only print the first 5 lines using head -n 5
At the end, print the last two lines from the written output.log using tail -n 2

Related

Show only newly added lines of logfile in terminal

I use tail -f to show the contents of a logfile.
What I want is when the logfile content changes, instead of appending the new lines to my screen, only the newly added lines should be shown on my screen.
So as if a clearscreen was made every time before printing the new lines.
I tried to find a solution by web search but couldn't find anything useful.
edit:
In my case it happens that several lines will be added at once (it is a php error logfile). So I am looking for a solution where more than the single last line can be shown on screen.
The watch command in combination with the tail command shows the last line of a log file, refreshing every 2 seconds. It doesn't refresh whenever a new line is appended to the log file, but since you can specify an interval it might help for your use case.
watch -t tail -1 <path_to_logfile>
If you need a faster interval, like every 0.5 seconds, you can specify it with the -n option, i.e.:
watch -t -n 0.5 tail -1 <path_to_logfile>
Try
$ watch 'tac FILE | grep -m1 -C2 PATTERN | tac'
where
PATTERN is any keyword (or regexp) to identify errors you seek in the log,
tac prints the lines in reverse,
-m is a max count of matching lines to grep,
-C is any number of lines of context (before and after the match) to show (optional).
That would be similar to
$ tail -f FILE | grep -C2 PATTERN
if you didn't mind just appending occurrences to the output in real-time.
But if you don't know any generic PATTERN to look for at all,
you'd have to just follow all the updates as the logfile grows:
$ tail -n0 -f FILE
Or even, create a copy of the logfile and then do a diff:
Copy: cp file.log{,.old}
Refresh the webpage with your .php code (or whatever, to trigger the error)
Run: diff file.log{,.old}
(or, if you prefer sort to diff: $ sort file.log{,.old} | uniq -u)
The curly braces are shorthand for both filenames (see Brace Expansion in man bash)
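For illustration, you can preview the expansion with echo:
echo cp file.log{,.old}     # prints: cp file.log file.log.old
echo diff file.log{,.old}   # prints: diff file.log file.log.old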
If you must avoid any temp copies, store the line count in memory:
z=$(grep -c ^ file.log)
Refresh the webpage to trigger an error
tail -n +$z file.log
The latter approach can be built upon, to create a custom scripting solution more suitable for your needs (check timestamps, clear screen, filter specific errors, etc). For example, to only show the lines that belong to the last error message in the log file updated in real-time:
$ clear; z=$(grep -c ^ FILE); while true; do d=$(date -r FILE); sleep 1; b=$(date -r FILE); if [ "$d" != "$b" ]; then clear; tail -n +$z FILE; z=$(grep -c ^ FILE); fi; done
where
FILE is, obviously, your log file name;
grep -c ^ FILE counts all lines in a file (unlike cat FILE | wc -l, which counts only newline characters and so can miss a final line without a trailing newline);
sleep 1 sets the pause/delay between checks of the file timestamp to 1 second, but you could change it to a floating point number as well (the shorter the interval, the higher the CPU usage).
To simplify any repetitive invocations in future, you could save this compound command in a Bash script that could take a target logfile name as an argument, or define a shell function, or create an alias in your shell, or just reverse-search your bash history with CTRL+R. Hope it helps!
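As a sketch of the script variant suggested above, here is the same loop with the logfile name taken as the first argument (the script name is arbitrary):
#!/bin/bash
# watch-last-error.sh <logfile> -- the compound command above, wrapped as a script
FILE=$1
clear
z=$(grep -c ^ "$FILE")
while true; do
    d=$(date -r "$FILE")
    sleep 1
    b=$(date -r "$FILE")
    if [ "$d" != "$b" ]; then
        clear
        tail -n +$z "$FILE"
        z=$(grep -c ^ "$FILE")
    fi
done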

No new line produced by >>

I have the following piece of code that selects two line numbers in a file, extracts everything between these lines, replaces the new line characters with tabs and places them in an output file. I want all lines extracted within one loop to be on the same line, but lines extracted on different loops to go on a new line.
for ((i=1; i<=numTimePoints; i++)); do
    # Get the starting point for line extraction. This is just an integer.
    startScan=$(($(echo "${dataStart}" | sed -n ${i}p)+1))
    # Get the end point for line extraction. This is just an integer.
    endScan=$(($(echo "${dataEnd}" | sed -n ${i}p)-1))
    # From file ${file}, take all lines between ${startScan} and ${endScan}.
    # Replace newlines with tabs and append to file ${tmpOutputFile}
    head -n ${endScan} ${file} | tail -n $((${endScan}-${startScan}+1)) | tr "\n" "\t" >> ${tmpOutputFile}
done
This script works mostly as intended; however, all extracted lines end up appended to the previous line rather than placed on new lines (as I thought >> would do). In other words, if I now run cat ${tmpOutputFile} | wc it returns 0 12290400 181970555. Can anyone point out what I'm doing wrong?
Any redirection, including >>, has nothing to do with newline creation at all -- redirection operations don't generate output themselves, newlines or otherwise; they only control where file descriptors (stdout, stderr, etc.) are connected, and it's the programs performing the writes that are responsible for the content.
Consequently, your tr '\n' '\t' is entirely preventing newlines from being added to the output file -- there's nowhere one could come from that doesn't go through that pipeline.
Consider the following instead:
while read -r startScan <&3 && read -r endScan <&4; do
    # generate your output
    head -n "$endScan" "$file" | tail -n $(( endScan - startScan + 1 )) | tr '\n' '\t'
    # append your newline
    printf '\n'
done 3<<<"$dataStart" 4<<<"$dataEnd" >"$tmpOutputFile"
Note:
We aren't paying the cost of running sed to extract startScan and endScan, but rather are reading them a line at a time from herestrings created from the contents of dataStart and dataEnd
We're redirecting to our output file exactly once, and reusing that file handle for the entire loop (over multiple commands -- first the pipeline, and then the printf)
We're actually running a printf to generate that newline, rather than expecting it to be somehow implicitly created by magic.
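To see the two-stream read pattern from the loop header in isolation, here is a tiny self-contained demo with made-up start/end values:
# hypothetical sample data: one start and one end line number per iteration
dataStart=$'2\n7'
dataEnd=$'5\n9'
while read -r startScan <&3 && read -r endScan <&4; do
    echo "would extract lines $startScan..$endScan"
done 3<<<"$dataStart" 4<<<"$dataEnd"
# prints:
# would extract lines 2..5
# would extract lines 7..9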

bash - how to remove first 2 lines from output

I have the following output in a text file:
106 pages in list
.bookmarks
20130516 - Daily Meeting Minutes
20130517 - Daily Meeting Minutes
20130520 - Daily Meeting Minutes
20130521 - Daily Meeting Minutes
I'm looking to remove the first 2 lines from my output. The particular shell script that I execute always produces those first 2 lines.
This is how I generated and read the file:
#Lists
PGLIST="$STAGE/pglist.lst";
RUNSCRIPT="$STAGE/runPagesToMove.sh";
#Get List of pages
$ATL_BASE/confluence.sh $CMD_PGLIST $CMD_SPACE "$1" > "$PGLIST";
# BUILD executable script
echo "#!/bin/bash" >> $RUNSCRIPT 2>&1
IFS=''
while read line
do
    echo "$ATL_BASE/confluence.sh $CMD_MVPAGE $CMD_SPACE \"$1\" --title \"$line\" --newSpace \"$2\" --parent \"$3\"" >> $RUNSCRIPT 2>&1
done < $PGLIST
How do I remove those top 2 lines?
You can achieve this with tail:
tail -n +3 "$PGLIST"
-n, --lines=K
        output the last K lines, instead of the last 10; or use -n +K
        to output starting with the Kth
The classic answer would use sed to delete lines 1 and 2:
sed 1,2d "$PGLIST"
awk way:
awk 'NR>2' "$PGLIST"
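Any of these can also be applied at the point where the list is generated, so the header lines never reach $PGLIST in the first place (sticking with the variable names from the question):
$ATL_BASE/confluence.sh $CMD_PGLIST $CMD_SPACE "$1" | tail -n +3 > "$PGLIST"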

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files. I was asking myself what would be the best option (in terms of performance).
There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first open the whole file and then fetch row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
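The same pattern generalizes to an arbitrary line number, e.g. for line 42:
n=42
sed -n "${n}{p;q}" file   # print line $n, then quit without reading the rest of the file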
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (if you're on a slow machine, decrease this!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"
    # create file containing j lines
    seq 1 $j > file
    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0)
                    t=$( { time head -1 file > /dev/null; } 2>&1);;
                1)
                    t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2)
                    t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3)
                    t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
            esac
            avg=$avg+$t
        done
        echo "scale=3;($avg)/$n" | bc
    done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q}' file
.002
read line < file && echo $line
0
Results from a file with 1,000,000 lines.
So the times for sed -n 1p grow linearly with the length of the file, while the timings for the other variations stay constant (and negligible), as they all quit after reading the first line.
Note: timings are different from the original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file again, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example, or at least for several days.
For example, Linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching-effect "interference" is both OS and hardware dependent.
So: pick one file, read it with a command. Now it is cached. Run the same test command several dozen times; this samples the effect of the command and child process creation, not your I/O hardware.
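In concrete terms, that could look something like the following (the file name and repetition count are arbitrary):
# warm the page cache once, so later timings measure command startup cost rather than disk I/O
cat file > /dev/null
# then time several dozen repetitions of the command under test
for i in {1..25}; do
    time head -1 file > /dev/null
done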
This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q} solution above.
Test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeats 100 times, each time selecting the next line. So it selects line 10000000,10000001,10000002, ... and so on till 10000099
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid passing through cat. I didn't measure it, but head should be faster on larger files, as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them -- unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
    echo "$LINE"
done < your_input.txt
This is much, much faster than any other (Bash-based) method out there. All other methods (like sed) re-read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 line reads, whereas the while loop just maintains a position cursor (based on IFS), so it only does 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line each time) versus ~0-1 seconds (while...read based, reading through the file once).
The above example also shows how to better set IFS to a newline (with thanks to Peter from the comments), and this will hopefully fix some of the other issues sometimes seen when using while ... read ... in Bash.
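For the original question (one specific line rather than every line), the same builtin-only idea could look like the hypothetical helper below; nth_line is just a name I picked, not a standard command:
# print line $1 of file $2 using only bash builtins (sketch)
nth_line() {
    local n=$1 file=$2 i=0 line
    while IFS= read -r line; do
        if (( ++i == n )); then
            printf '%s\n' "$line"
            return 0
        fi
    done < "$file"
    return 1   # the file has fewer than n lines
}

nth_line 20 /var/log/messages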
For the sake of completeness, you can also use the basic Linux command cut:
cut -d $'\n' -f <linenumber> <filename>

how to make a winmerge equivalent in linux

My friend recently asked how to compare two folders in Linux and then run meld against any text files that are different. I'm slowly catching on to the Linux philosophy of piping many granular utilities together, and I put together the following solution. My question is: how could I improve this script? There seems to be quite a bit of redundancy, and I'd appreciate learning better ways to script Unix.
#!/bin/bash
dir1=$1
dir2=$2
# show files that are different only
cmd="diff -rq $dir1 $dir2"
eval $cmd # print this out to the user too
filenames_str=`$cmd`
# remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different
tmp1=`echo "$filenames_str" | sed -n '/ differ$/p'`
# grab just the first filename for the lines of output
tmp2=`echo "$tmp1" | awk '{ print $2 }'`
# convert newlines sep to space
fs=$(echo "$tmp2")
# convert string to array
fa=($fs)
for file in "${fa[#]}"
do
# drop first directory in path to get relative filename
rel=`echo $file | sed "s#${dir1}/##"`
# determine the type of file
file_type=`file -i $file | awk '{print $2}' | awk -F"/" '{print $1}'`
# if it's a text file send it to meld
if [ $file_type == "text" ]
then
# throw out error messages with &> /dev/null
meld $dir1/$rel $dir2/$rel &> /dev/null
fi
done
Please preserve/promote readability in your answers. An answer that is shorter but harder to understand won't qualify as an answer.
It's an old question, but let's work on it a bit just for fun, without thinking about the final goal (maybe SCM) or about tools that already do this in a better way. Let's just focus on the script itself.
In the OP's script, there is a lot of string processing inside bash, using tools like sed and awk, sometimes more than once in the same command line or inside a loop executing n times (once per file).
That's OK, but it's necessary to remember that:
Each time the script calls any of those programs, a new process is created in the OS, and that is expensive in time and resources. So the fewer programs are called, the better the performance of the executing script:
diff 2 times (1 just to print to user)
sed 1 time processing diff result and 1 time for each file
awk 1 time processing sed result and 2 times for each file (processing file result)
file 1 time for each file
That doesn't apply to echo, read, test and others that are builtin commands of bash, so no external program is executed.
meld is the final command that will display the files to user, so it doesn't count.
Even with builtin commands, a redirection pipeline | has a cost too, because the shell has to create pipes, duplicate handles, and maybe even fork copies of itself (and a subshell is a process too). So again: less is better.
The messages of the diff command are locale dependent, so if the system is not in English, the whole script won't work.
With that in mind, let's clean up the original script a bit, maintaining the OP's logic:
#!/bin/bash
dir1=$1
dir2=$2
# Set english as current language
LANG=en_US.UTF-8
# (1) show files that are different only
diff -rq $dir1 $dir2 |
# (2) remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different, delete all but left filename
sed '/ differ$/!d; s/^Files //; s/ and .*//' |
# (3) determine the type of file
file -i -f - |
# (4) for each file
while IFS=":" read file file_type
do
# (5) drop first directory in path to get relative filename
rel=${file#$dir1}
# (6) if it's a text file send it to meld
if [[ "$file_type" =~ "text/" ]]
then
# throw out error messages with &> /dev/null
meld ${dir1}${rel} ${dir2}${rel} &> /dev/null
fi
done
A little explaining:
A single chain of commands cmd1 | cmd2 | ... where the output (stdout) of the previous one is the input (stdin) of the next one.
Execute sed just once to perform 3 operations (separated by ;) on the diff output:
Delete lines not ending with " differ" (keeping only files present in both dirs but different)
Delete "Files " at the beginning of the remaining lines
Delete from " and " to the end of the remaining lines
Execute the file command once to process the whole file list on stdin (option -f -)
Use a bash while loop to read two values separated by : from each line of stdin.
Use bash parameter expansion to strip the $dir1 prefix from the filename
Use the bash [[ ]] test to compare the file type against a regular expression
For clarity, I didn't take into account that file and directory names may contain spaces. In such cases, both scripts will fail. To avoid that, it is necessary to enclose any reference to a file/dir name variable in double quotes.
I didn't use awk, because it is powerful enough that it could replace almost the entire script ;-)
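Just to illustrate that last remark, here is a rough, untested sketch of what replacing the sed stage (step 2) with awk might look like; the rest of the pipeline would stay as above:
# sketch only: awk keeps the " differ" lines and strips the "Files ... and ..." wrapping itself
diff -rq "$dir1" "$dir2" |
    awk '/ differ$/ { sub(/^Files /, ""); sub(/ and .*/, ""); print }' |
    file -i -f -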

Resources