Splitting and looping over live command output in Bash

I am archiving data and using split to produce several parts, while also printing the names of the output files (split reports them on STDERR, which I am redirecting to STDOUT). However, the loop over the output doesn't run until after the command returns.
Is there any way to actively loop over the STDOUT of a command before it returns?
The following is what I currently have, but it only prints the list of filenames after the command returns:
export IFS=$'\n'
for line in `data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1`; do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done

Try this:
data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1 | while read -r line
do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done
Note however that any variables set in the while loop will not preserve their values after the loop (the while loop runs in a subshell).
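If you do need the loop's variables afterwards, one workaround (a sketch, reusing the question's command names) is to feed the loop from process substitution so that it runs in the current shell:
#!/bin/bash
# The loop body now runs in the current shell, so count survives the loop.
count=0
while read -r line; do
FILENAME=$(echo "$line" | awk '{ print $3 }')
echo " - $FILENAME"
count=$((count + 1))
done < <(data_producing_command | split -d -b "$CHUNK_SIZE" --verbose - "$ARCHIVE_PREFIX" 2>&1)
echo "wrote $count chunks"   # still set here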

There's no reason for the for loop or the read or the echo. Just pipe the stream to awk:
... | split -d -b $CHUNK_SIZE --verbose - test 2>&1 |
awk '{printf " - %s\n", $3 }'
You are going to see some delay from buffering, but unless your system is very slow or you are very perceptive, you're not likely to notice it.
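If the buffering delay does matter, you can make awk flush after every line; fflush() is supported by gawk, mawk and BSD awk (a sketch):
... | split -d -b $CHUNK_SIZE --verbose - test 2>&1 |
awk '{ printf " - %s\n", $3; fflush() }'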

The command substitution needs[1] to run before the for loop can start:
for item in $(command which produces items); do ...
whereas a while read -r can start consuming output as soon as the first line is produced (or, more realistically, as soon as the output buffer is full):
command which produces items |
while read -r item; do ...
[1] Well, it doesn't absolutely need to, from a design point of view, I suppose, but that's how it currently works.
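A quick way to see the difference (a sketch, with sleep standing in for a slow producer):
# for loop: silent for 3 seconds, then both lines appear at once
for x in $( { echo one; sleep 3; echo two; } ); do echo "got $x"; done
# while read: "got one" appears immediately
{ echo one; sleep 3; echo two; } | while read -r x; do echo "got $x"; done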
As William Pursell already noted, there is no particular reason to run Awk inside a while read loop, because that's something Awk does quite well on its own, actually.
command which produces items |
awk '{ print " - " $3 }'
Of course, with a reasonably recent GNU Coreutils split, you could simply do
split --filter='printf " - %s\n" "$FILE"; cat >"$FILE"' ... options
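Wired into the question's pipeline, that would look roughly like this (a sketch; --filter requires GNU split, which sets $FILE to each chunk's name and feeds the chunk data to the filter's stdin):
data_producing_command |
split -d -b "$CHUNK_SIZE" --filter='printf " - %s\n" "$FILE"; cat >"$FILE"' - "$ARCHIVE_PREFIX"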

Related

Bash Iterative approach in place of process substitution not working as expected

Complete bash noob here. I had the following command (1.), essentially generating a wordlist from a messy, tab-delimited input file. It worked as expected but seemed a bit naive for what I needed:
cat users.txt | tee >(cut -f 1 >> cut_out.txt) >(cut -f 2 >> cut_out.txt) >(cut -f 3 >> cut_out.txt) >(cut -f 4 >> cut_out.txt)
Output:
W Humphrey
SummersW
FoxxR
noreply
DaibaN
PeanutbutterM
PetersJ
DaviesJ
BlaireJ
GongoH
MurphyF
JeffersD
HorsemanB
...
Thought I could cut down on the ridiculous command above with the following
cat users.txt | for i in {1..4}; do cut -f $i >> cut_out.txt; done
Output:
HumphreyW
The command above only returned a single word from the list and some white-space.
The solution: I knew I could get it working by simply looping the entire command instead. This did exactly what I wanted, but I still want to know why the command above (2.) returned an almost empty file:
for i in {1..4}; do cat users.txt | cut -f $i >> cut_out.txt; done
I have a solution; more so I wanted an explanation, because I am dumb and still learning about I/O in bash. Cheers.
Just a remark
awk -F '[\t]' '{for(i = 1; i <= 4; i++) print $i}' users.txt > cut_out.txt
Is basically what your cat ... | tee >(cut ...) ... does.
If the order of the output is unimportant, and there are only four columns in the file, simply
tr '\t' '\n' <users.txt >cut_out.txt
If you only want the first four columns in any order,
cut -f1-4 users.txt |
tr '\t' '\n' >cut_out.txt
(Thanks to @KamilCuk for raising this in a comment.)
Otherwise your third attempt is basically fine, though you want to avoid the useless cat and redirect only once:
for i in {1..4}; do
cut -f "$i" users.txt
done > cut_out.txt
This is obviously less efficient than only reading the file once. If the file is small enough to fit into memory, you could write a simple Awk script to read it once and split it up into variables, and then write out these variables in the order you want.
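A sketch of that one-pass approach (assuming exactly four tab-separated columns):
awk -F '\t' '
{ for (i = 1; i <= 4; i++) col[i] = col[i] $i "\n" }   # accumulate each column in memory
END { for (i = 1; i <= 4; i++) printf "%s", col[i] }   # emit column by column
' users.txt > cut_out.txt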
The second attempt is wrong because cat only supplies a single instance of the data to the pipe, and the first iteration of the loop consumes it all.
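You can watch that happen with a toy example (hypothetical data):
# the line is fed to the pipe once; the first cut reads to EOF,
# so iterations 2-4 see empty input and print nothing
printf 'a\tb\tc\td\n' | for i in {1..4}; do cut -f "$i"; done
# prints only: a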

Trouble Allocating Memory in Bash Script

I tried to automate the process of cleaning up various wordlists I am working with. This is the following code for it:
#!/bin/bash
# Removes spaces and duplicates in a wordlist
echo "Please be in the same directory as wordlist!"
read -p "Enter Worldlist: " WORDLIST
RESULT=$( awk '{print length, $0}' $WORDLIST | sort -n | cut -d " " -f2- )
awk '!(count[$0]++)' $RESULT > better-$RESULT
This is the error I receive after running the program:
./wordlist-cleaner.sh: fork: Cannot allocate memory
First post, I hope I formatted it correctly.
You didn't describe your intentions or desired output, but I guess this may do what you want
awk '{print length, $0}' "$WORDLIST" | sort -n | cut -d " " -f2- | uniq > better-RESULT
Notice that it's better-RESULT instead of better-$RESULT, since you don't want the command's output expanded into the filename. That expansion is presumably also the source of the error: $RESULT holds the entire sorted wordlist, so the second awk call was handed the whole list as arguments (treating every word as a filename), and the shell could not allocate the memory to fork it.
Yeah okay I got it to run successfully. I was trying to clean up wordlists I was downloading of the net. I have some knowledge of the basic variable usage in Bash, but not enough of the text manipulation commands like sed or awk. Thanks for the support.

Evaluating a log file using a sh script

I have a log file with a lot of lines with the following format:
IP - - [Timestamp Zone] 'Command Weblink Format' - size
I want to write a script.sh that gives me the number of times each website has been clicked.
The command:
awk '{print $7}' server.log | sort -u
should give me a list which puts each unique weblink in a separate line. The command
grep 'Weblink1' server.log | wc -l
should give me the number of times Weblink1 has been clicked. I want a command that reads each line produced by the Awk command above into a variable, and then a loop that runs the grep command on each extracted weblink. I could use
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Text read from file: $line"
done
(source: Read a file line by line assigning the value to a variable) but I don't want to save the output of the Awk script in a .txt file.
My guess would be:
while IFS='' read -r line || [[ -n "$line" ]]; do
grep '$line' server.log | wc -l | ='$variabel' |
echo " $line was clicked $variable times "
done
But I'm not really familiar with connecting commands in a loop, as this is my first time. Would this loop work and how do I connect my loop and the Awk script?
Shell commands in a loop connect the same way they do without a loop, and you aren't very close. But yes, this can be done in a loop if you want the horribly inefficient way for some reason such as a learning experience:
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
n=$(grep -c "$line" server.log)
echo "$line" clicked $n times
done
# you only need the read || [ -n ] idiom if the input can end with an
# unterminated partial line (is ill-formed); awk print output can't.
# you don't really need the IFS= and -r because the data here is URLs
# which cannot contain whitespace and shouldn't contain backslash,
# but I left them in as good-habit-forming.
# in general variable expansions should be doublequoted
# to prevent wordsplitting and/or globbing, although in this case
# $line is a URL which cannot contain whitespace and practically
# cannot be a glob. $n is a number and definitely safe.
# grep -c does the count so you don't need wc -l
or more simply
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
echo "$line" clicked $(grep -c "$line" server.log) times
done
However if you just want the correct results, it is much more efficient and somewhat simpler to do it in one pass in awk:
awk '{n[$7]++}
END{for(i in n){
print i,"clicked",n[i],"times"}}' server.log |
sort
# or GNU awk 4+ can do the sort itself, see the doc:
awk '{n[$7]++}
END{PROCINFO["sorted_in"]="#ind_str_asc";
for(i in n){
print i,"clicked",n[i],"times"}}' server.log
The associative array n collects the values from the seventh field as keys, and on each line, the value for the extracted key is incremented. Thus, at the end, the keys in n are all the URLs in the file, and the value for each is the number of times it occurred.
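For instance, with hypothetical log lines (a sketch):
# server.log:
# 1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" - 512
# 5.6.7.8 - - [01/Jan/2024:00:00:05 +0000] "GET /index.html HTTP/1.1" - 512
# 5.6.7.8 - - [01/Jan/2024:00:00:09 +0000] "GET /about.html HTTP/1.1" - 316
# output of the one-pass awk:
# /about.html clicked 1 times
# /index.html clicked 2 times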

Bash awk append to same line

There are numerous posts about removing leading white space and appending an entry to a single existing line in a file using awk. None of my attempts work - just three examples here of the many I have tried.
Say I have a file called $log with a single line
a:b:c
and I want to add a fourth entry,
awk '{ print $4"d" }' $log | tee -a $log
output seems to be a newline:
a:b:c:
d
whereas, I want all on the same line;
a:b:c:d
try
BEGIN { FS = ":" } ; awk '{ print $4"d" }' $log | tee -a $log
or, this - avoid a new line
awk 'BEGIN { ORS=":" }; { print $4"d" }' $log | tee -a $log
no change:
a:b:c:
d
awk is placing a space after c: and then writing d to the next line.
EDIT: | tee -a $log appears to be necessary to write the additional string to the file.
$log contains 39 variables and was generated using awk without | tee -a
odd...
The actual command to write $40 to the single line entries
awk '{ print $40"'$imagedir'" }' $log
output
+ awk '{ print $40"/home/geoland/Asterism-DEVEL/DSO" }'
/home/geoland/.asterism/log
but this does not write to the $log file.
How should I append d to the same line without leading white space using awk - also looking at sed xargs and other alternatives.
Using awk:
awk '{ print $0":d" }' file
Using sed:
sed 's/$/:d/' file
Using only bash:
while IFS= read -r line; do
echo "$line:d"
done < file
Using sed:
$ echo a:b:c | sed 's,\(^.*$\),\1:d,'
a:b:c:d
Thanks all... This is the solution I went with. I also needed to write the entire line to a perpetual log file because the log file is overwritten at each new process instance.
I will further investigate an awk solution.
logname=$imagedir/log_$name
while IFS=: read -r line; do
echo "$line$imagedir"
done < $log | tee $logname
This places $imagedir directly behind the last ':' separator.
There is probably room for refinement.
I too am not entirely sure what you're trying to do here.
Your command line, awk '{ print $4"d" }' $log | tee -a $log is problematic in a number of ways.
First, your awk script tries to print the 4th field, which is empty. Unless you say otherwise, fields are separated by whitespace, and the string a:b:c has no whitespace, so it is all one field and $4 is empty. So awk prints just "d". And tee -a appends to your existing logfile, so what you're seeing is the original data, along with the d printed by awk. That's totally expected.
Second, you have tee appending to the same file that awk is in the process of reading. This won't make an endless loop, as awk should stop reading the input file after whatever was the last byte when the file was opened, but it does mean you may end up with repeated data in the file.
Your other attempts, aside from some syntactical errors, all suffer from the same assumption that $4 means something that it does not.
The following awk snippet sets the input and output field separators to :, then sets the 4th field to "d", then prints the line.
$ echo "a:b:c" | awk 'BEGIN{FS=OFS=":"} {$4="d"} 1'
a:b:c:d
Is that what you want?
If you really do need to append this data to an existing log file, you can do so with tee -a or simple >> redirection. Just bear in mind that awk will only see the content of the file as of the time it was run, and by appending, you are not replacing lines.
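For example, to replace the line rather than append a second one (a sketch; the temp-file name is illustrative):
awk 'BEGIN{FS=OFS=":"} {$4="d"} 1' "$log" > "$log.tmp" && mv "$log.tmp" "$log"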
One other thing. If you are actually hoping to use the content of the shell variable $imagedir inside awk, you should pass the variable in rather than exiting your quotes. For example:
$ echo "a:b:c" | awk -v d="foo/bar" 'BEGIN{FS=OFS=":"} {$4=d} 1'
a:b:c:foo/bar
sed "s|$|$imagedir|" file | tee newfile
This does the trick: it reads file and writes its contents, with the substitution applied, to newfile, so the image directory can be read later by a separate standalone process.
Because the variable is a directory path containing several / characters, those would be interpreted as sed delimiters and would need escaping; I had difficulty getting that right with a variable.
A neater option is to use an alternative delimiter, here | (not to be confused with the shell pipe that follows it).

How can I split a string in shell?

I have two strings and I want to split with space and use them two by two:
namespaces="Calc Fs"
files="calc.hpp fs.hpp"
for example, I want to use like this: command -q namespace[i] -l files[j]
I'm a noob in Bourne Shell.
Put them into an array like so:
#!/bin/bash
namespaces="Calc Fs"
files="calc.hpp fs.hpp"
i=1
j=0
name_arr=( $namespaces )
file_arr=( $files )
command -q "${name_arr[i]}" -l "${file_arr[j]}"
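A variant (a sketch) that splits without the globbing risk of an unquoted expansion is read -a:
read -r -a name_arr <<< "$namespaces"
read -r -a file_arr <<< "$files"
command -q "${name_arr[1]}" -l "${file_arr[0]}"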
echo "hello world" | awk '{split($0, array, " ")} END{print array[2]}'
is how you would split a simple string.
If what you want to do is loop through combinations of the two split strings, then you want something like this:
for namespace in $namespaces
do
for file in $files
do
command -q $namespace -l $file
done
done
EDIT:
or to expand on the awk solution that was posted, you could also just do:
echo $foo | awk '{print $'$i'}'
EDIT 2:
Disclaimer: I do not profess to be any kind of expert in awk at all, so there may be small errors in this explanation.
Basically what the snippet above does is pipe the contents of $foo into the standard input of awk. Awk reads its standard input line by line, separating each line into fields based on a field separator, which is any run of whitespace by default. Awk executes the program it is given as an argument. In this case, the shell expands '{print $'$i'}' using the value of the shell variable i, so for i=2 awk sees {print $2}, which simply tells it to print field number 2 of each line of its input.
If you want to learn more I think that this blog post does a pretty good job of describing the basics (as well as the basics of sed and grep) if you skip past the more theoretical stuff at the start (unless you're into that kind of thing).
I wanted to find a way to do it without arrays, here it is:
paste -d " " <(tr " " "\n" <<< $namespaces) <(tr " " "\n" <<< $files) |
while read namespace file; do
command -q $namespace -l $file
done
Two special constructs are used here: process substitution (<(...)) and here strings (<<<). The here string is a shortcut for echo $namespaces | tr " " "\n". Process substitution is a shortcut for FIFO creation; it allows paste to be run on the output of commands instead of files.
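To spell out the two shortcuts (a sketch):
# here string: these two lines are equivalent
tr " " "\n" <<< "$namespaces"
echo "$namespaces" | tr " " "\n"
# process substitution: <(cmd) expands to a path such as /dev/fd/63,
# which paste opens and reads as if it were an ordinary file
paste -d " " <(tr " " "\n" <<< "$namespaces") <(tr " " "\n" <<< "$files")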
If you are using zsh this could be very easy:
files="calc.hpp fs.hpp"
# all elements
print -l ${(s/ /)files}
# just the first one
echo ${${(s/ /)files}[1]}