bash, chaining grep commands, processing the data as it comes [duplicate]

This question already has answers here:
Why no output is shown when using grep twice?
(5 answers)
Closed 5 years ago.
When I use a single grep command, it processes and outputs the data live as it comes.
Here is my simple test file test.sh:
echo a
sleep 1
echo b
sleep 1
echo ab
sleep 1
echo ba
sleep 1
echo baba
I do the following:
sh test.sh | grep a
a
ab
ba
baba
All good so far: 'a' appears immediately, then 'ab', and so on.
But when I chain multiple grep commands like this:
sh ./test.sh | grep a | grep b
ab
ba
baba
I only get the output at the end, not as it comes!
The terminal stays empty until the entire script has run, and then everything is printed in one go.
Why is that?
How can I chain/cascade multiple greps without losing that 'process and output as it comes' property?
This is for grepping and processing huge live logs with a lot of data, where I can only afford to save the filtered version to disk; the raw output would fill up the disk quite quickly.

There's an option called --line-buffered:
Other Options
--line-buffered
Use line buffering on output. This can cause a performance penalty.
So:
sh ./test.sh | grep --line-buffered a | grep b
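If more than two greps are chained, every grep whose output goes into another pipe needs the flag; the last one writes to the terminal and is already line-buffered by default. A minimal sketch, reusing the same test.sh:
sh ./test.sh | grep --line-buffered a | grep --line-buffered b | grep ab
For filters that lack such an option, GNU coreutils' stdbuf can usually force line-buffered output instead, e.g.:
sh ./test.sh | stdbuf -oL grep a | grep b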

Related

Bash Terminal: Write only specific lines to logfile

I'm running a simulation with lots of terminal output, which would exceed my disk space if I saved it all to a logfile (e.g. by "cmd > logfile"). Now I would like to follow the entire terminal output, but at the same time save specific data/lines/values from this output to a file.
1) Is there a way to do that in bash?
2) Or, as an alternative: is it possible to save the logfile, extract the needed data, and then delete the processed lines to avoid generating a huge logfile?
If you want to save into logfile only the output containing mypattern, but want to see all the output at the terminal, you could issue:
cmd 2>&1 | tee /dev/tty | grep 'mypattern' > logfile
I also assumed that cmd may write to standard error as well as standard output, which is why 2>&1 is added after cmd.
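For example, with a hypothetical simulation script sim.sh whose interesting lines contain the word RESULT, the filtered lines could be appended across runs with >> instead of >:
./sim.sh 2>&1 | tee /dev/tty | grep 'RESULT' >> results.log
Everything still scrolls by on the terminal via /dev/tty, while only the matching lines accumulate in results.log.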
What criteria are you using to decide which lines to keep?
1
One common filter is to just store stderr.
cmdlist 2>logfile # stdout still to console
2
For a more sophisticated filter, if you have specific patterns you want to save to the log, you can use sed. Here's a simplistic example -
seq 1 100 | sed '/.*[37]$/w tmplog'
This will generate numbers from 1 to 100 and send them all to the console, but capture all numbers that end with 3 or 7 to tmplog. It can also accept more complex lists of commands to help you be more comprehensive -
seq 1 100 | sed '/.*[37]$/w 37.log
/^2/w 37.log'
Cf. the sed manual for more detailed breakdowns.
You probably also want error output, so it might be a good idea to save that too.
seq 1 100 2>errlog | sed '/.*[37]$/w patlog'
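As an illustrative variant (the file names here are just placeholders), different patterns can also be written to different files in a single pass, since each w command takes its own file name:
seq 1 100 | sed -e '/[37]$/w ends37.log' -e '/^2/w starts2.log'
Every line still flows through to stdout as usual; only the matching lines are additionally copied to the named files.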
3
For a more complex space-saving plan, create a named pipe, and compress the log from that in a background process.
$: mkfifo transfer # creates a queue file (FIFO) on disk
$: gzip < transfer > log.gz & # reads from FIFO in bg, compresses to log
$: seq 1 100 2>&1 | tee transfer # tee writes one copy to stdout, one to file
This will show all the output as it comes, but also duplicate a copy to the named pipe; gzip will read it from the named pipe and compress it.
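To inspect the compressed log afterwards, the usual gzip tools apply, for example:
zcat log.gz | grep 'mypattern'
or zless log.gz for paging.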
3b
You could replace the tee with the sed for double-dipping space reduction if required -
$: mkfifo transfer
$: gzip < transfer > log.gz &
$: seq 1 100 2>&1 | sed '/.*[37]$/w transfer'
I don't really recommend this, as you might filter out something you didn't realize you would need.

Use a for loop and save output in different files

I'm trying to produce 2 files from one command: in one file I put only a single entry, and in the other the complete list. This is the example:
I tried various commands
#!/bin/bash
for i in {1..4}
do
echo "test" >one >>list
done
I need the file "one" to contain only the last loop iteration, and "list" to contain every iteration.
You can use tee for this; since tee still writes to stdout, you can do something like:
#!/bin/bash
for i in {1..4}
do
echo "test" | tee one >>list
done
or this, if you want to see the echoes when you run it; the -a flag tells tee to append rather than truncate:
#!/bin/bash
for i in {1..4}
do
echo "test" | tee one | tee -a list
done
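A quick sanity check, assuming the second version is saved as loop.sh and list does not exist yet (remember that -a appends, so list keeps growing across runs):
bash loop.sh     # prints "test" four times
cat one          # a single "test" - only the last iteration survives the truncation
wc -l list       # 4 - one line appended per iteration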

tracking status/progress in gnu parallel

I've implemented parallel in one of our major scripts to perform data migrations between servers. Presently, the output is presented all at once (-u) in pretty colors, with periodic echoes of status from the function being executed, depending on which sequence is being run (e.g. 5/20: $username: rsyncing homedir or 5/20: $username: restoring account). These are all echoed directly to the terminal running the script, and accumulate there. Depending on the length of time a command is running, however, output can end up well out of order, and long-running rsync commands can be lost in the shuffle. But I don't want to wait for long-running processes to finish in order to get the output of the following processes.
In short, my issue is keeping track of which arguments are being processed and are still running.
What I would like to do is send parallel into the background with (parallel args command {#} {} ::: $userlist) & and then track progress of each of the running functions. My initial thought was to use ps and grep liberally along with tput to rewrite the screen every few seconds. I usually run three jobs in parallel, so I want to have a screen that shows, for instance:
1/20: user1: syncing homedir
current file: /home/user1/www/cache/file12589015.php
12/20: user12: syncing homedir
current file: /home/user12/mail/joe/mailfile
5/20: user5: collecting information
current file:
I can certainly get the above status output together no problem, but my current hangup is separating the output from the individual parallel processes into three different... pipes? variables? files? so that it can be parsed into the above information.
Not sure if this is much better:
echo hello im starting now
sleep 1
# start parallel and send the job to the background
temp=$(mktemp -d)
parallel --rpl '{log} $_="Working on@arg"' -j3 background {} {#} ">$temp/{1log} 2>&1;rm $temp/{1log}" ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
while kill -0 $! 2>/dev/null ; do
cd "$temp"
clear
tail -vn1 *
sleep 1
done
rm -rf "$temp"
It makes a logfile for each job, tails all logfiles every second, and removes each logfile when its job is done.
The logfiles are named 'working on ...'.
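A rough equivalent without the --rpl definition, naming each log after the job number {#} instead (this assumes the background function from the script below is defined and exported with export -f background):
temp=$(mktemp -d)
parallel -j3 background {} {#} ">$temp/job{#}.log 2>&1; rm $temp/job{#}.log" ::: foo bar baz :::+ 5 3 4 &
while kill -0 $! 2>/dev/null; do
    clear
    tail -vn1 "$temp"/job*.log 2>/dev/null   # -v prints each file name as a header
    sleep 1
done
rm -rf "$temp"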
I believe that this is close to what I need, though it isn't very tidy and probably isn't optimal:
#!/bin/bash
background() { #dummy load. $1 is text, $2 is number, $3 is position
echo $3: starting sleep...
sleep $2
echo $3: $1 slept for $2
}
progress() {
echo starting progress loop for pid $1...
while [ -d /proc/$1 ]; do
clear
tput cup 0 0
runningprocs=`ps faux | grep background | egrep -v '(parallel|grep)'`
numprocs=`echo "$runningprocs" | wc -l`
for each in `seq 1 ${numprocs}`; do
line=`echo "$runningprocs" | head -n${each} | tail -n1`
seq=`echo $line | rev | awk '{print $3}' | rev`
# print select elements from the ps output
echo working on `echo $line | rev | awk '{print $3, $4, $5}' | rev`
# print the last line of the log for that sequence number
cat logfile.log | grep ^$seq\: | tail -n1
echo
done
sleep 1
done
}
echo hello im starting now
sleep 1
export -f background
# start parallel and send the job to the background
parallel -u -j3 background {} {#} '>>' logfile.log ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
pid=$!
progress $pid
echo finished!
I'd rather not depend on scraping all the information from ps, and would prefer to get the actual line output of each parallel process, but a guy's gotta do what a guy's gotta do. Regular output is sent to a logfile for parsing later on.
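For coarser tracking, GNU parallel also ships some built-in reporting that can complement (not replace) the per-job log tailing above: --progress and --bar show how many jobs are running and done, and --joblog records start time, runtime and exit status per completed job, e.g.:
parallel --bar --joblog jobs.log -u -j3 background {} {#} '>>' logfile.log ::: foo bar baz :::+ 5 6 5
tail -f jobs.log    # one record per finished job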

Bash pipes and Shell expansions

I've changed my data source in a bash pipe from cat ${file} to cat file_${part_number}, because preprocessing was causing ${file} to be truncated at 2GB; splitting the output eliminated the preprocessing issues. However, while testing this change, I was unable to work out how to get Bash to keep behaving the same for some basic operations I was using to test the pipeline.
My original pipeline is:
cat giantfile.json | jq -c '.' | python postprocessor.py
With the original pipeline, if I'm testing changes to postprocessor.py or the preprocessor and I want to just test my changes with a couple of items from giantfile.json I can just use head and tail. Like so:
cat giantfile.json | head -n 2 - | jq -c '.' | python postprocessor.py
cat giantfile.json | tail -n 3 - | jq -c '.' | python postprocessor.py
The new pipeline that fixes the preprocessor issues is:
cat file_*.json | jq -c '.' | python postprocessor.py
This works fine, since every file gets output eventually. However, I don't want to wait 5-10 minutes for each test, so I tried to test with just the first 2 lines of input using head.
cat file_*.json | head -n 2 - | jq -c '.' | python postprocessor.py
Bash sits there working far longer than it should, so I try:
cat file_*.json | head -n 2 - | jq -c '.'
And my problem is clear. Bash is outputting the content of all the files as if head was not even there because each file now has 1 line of data in it. I've never needed to do this with bash before and I'm flummoxed.
Why does Bash behave this way, and how do I rewrite my little bash command pipeline to work the way it used to, allowing me to select the first/last n lines of data to work with for testing?
My guess is that when you split the json up into individual files, you managed to remove the newline character from the end of each line, with the consequence that the concatenated output (cat file_*.json) is really only one line in total, because cat will not insert newlines between the files it is concatenating.
If the files were really one line each with a terminating newline character, piping through head -n 2 should work fine.
You can check this hypothesis with wc, since that utility counts newline characters rather than lines. If it reports that the files have 0 lines, then you need to fix your preprocessing.
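A quick way to check, and a possible workaround while the preprocessing is being fixed (a sketch, assuming the missing trailing newlines are indeed the culprit):
wc -l file_*.json    # 0 lines per file would confirm missing trailing newlines
awk 1 file_*.json | head -n 2 | jq -c '.' | python postprocessor.py
awk prints each input record followed by a newline, so the concatenation is back to one line per file and head exits after two lines, as before.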

GNU 'ls' command not outputting the same over a pipe [duplicate]

This question already has answers here:
Why does ls give different output when piped
(3 answers)
Closed 6 years ago.
When I execute the command ls on my system, I get the following output:
System:~ user# ls
asd goodfile testfile this is a test file
However, when I pipe ls to another program (such as cat or gawk), the following is output:
System:~ user# ls | cat
asd
goodfile
testfile
this is a test file
How do I get ls to read the terminal size and output the same over a pipe as it does when printing directly to the terminal?
Use ls -C to get columnar output again.
When ls detects that its output isn't a terminal, it assumes that its output is being processed by some other process that wants to parse it, so it switches to -1 (one-entry-per-line) mode to make parsing easier. To make it format in columns as when it's outputting directly to a terminal, use -C to switch back to column mode.
(Note, you may also have to use --color if you care about color output, which is also normally suppressed by outputting to a pipe.)
Maybe -x ("list entries by lines instead of by columns"), possibly together with -w ("assume screen width instead of current value"), is what you need.
When the output goes to a pipe or non-terminal, the output format is like ls -1. If you want the columnar output, use ls -C instead.
The reason for the discrepancy is that it is usually easier to parse one-line-per-file output in shell scripts.
Since I'm using bash, I used the following to achieve the desired output:
System:~ user# ls -C -w "$(tput cols)" | cat
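Putting the pieces together, a sketch that restores both the column layout and the colors when piping into a pager:
ls -C -w "$(tput cols)" --color=always | less -R
--color=always forces the color escape codes even though stdout is a pipe, and less -R renders them as colors instead of showing raw escape sequences.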

Resources