In bash, how to ingest lines from a file, and then treat that set of lines as a file itself

In a bash script, is there a way to ingest specific lines from a file, and then treat that set of lines as a file itself? For example, let's say file input_file.txt looks like
boring line 1 A
boring line 2 B
boring line 3 C
start of interesting block
line 1 A
line 2 B
line 3 C
end of interesting block
boring line 4 D
There's not much I can assume about the structure of the file except that 1) it will indeed have "start of interesting block" and "end of interesting block" and 2) it will generally be much larger and more complicated than this example. And let's say I want to do a great deal of processing in the "interesting block".
So I need to do something sort of like this:
interesting_lines=$(tail -n 6 input_file.txt | head -n 5)
process_1 interesting_lines #(Maybe process_1 is a "grep" or something more complicated.)
process_2 interesting_lines
etc.
Of course, that won't work, which is why I'm asking this question, but that's the idea that I need.
I guess one thing that would work would be
tail -n 6 input_file.txt | head -n 5 > tmpfile
process_1 tmpfile
process_2 tmpfile
etc.
rm tmpfile
but I am trying to avoid temporary files.

You can use process substitution.
interesting_lines=$(tail -n 6 input_file.txt | head -n 5)
process_1 <(printf '%s\n' "$interesting_lines")
process_2 <(printf '%s\n' "$interesting_lines")
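If process_1 reads standard input rather than taking a filename argument, a bash here-string is a slightly shorter alternative (a small variation on the same idea):
process_1 <<< "$interesting_lines"
process_2 <<< "$interesting_lines"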

Don't use head or tail. You can give a range with sed or awk:
process_1 <(sed -n '/start of interesting block/,/end of interesting block/p' input_file.txt)
process_2 <(awk '/start of interesting block/,/end of interesting block/' input_file.txt)
When you need better control over your boundaries, use awk more cleverly. The next solution only seems more verbose, but now you can add all kinds of conditions:
process_1 <(awk '/start of interesting block/ {interesting=1}
/end of interesting block/ {interesting=0}
interesting {print}
' input_file.txt)
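For example, if you want to exclude the marker lines themselves, one sketch is to clear the flag before the print check (skipping the end marker) and set it after (skipping the start marker):
process_1 <(awk '/end of interesting block/ {interesting=0}
interesting {print}
/start of interesting block/ {interesting=1}
' input_file.txt)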
When you have to look in a very long file for only a few interesting lines, you can do
sed -n '/start of interesting block/,/end of interesting block/p' input_file.txt |
tee >(process_1) | process_2
Demo:
printf "%s\n" {1..10} | tee >(sed 's/4/xxx/p') | sed 's/^/ /'

Related

Bash Iterative approach in place of process substitution not working as expected

Complete bash noob here. I had the following command (1.) and it worked as expected, but it seemed a bit naive for what I needed: essentially generating a wordlist from a messy, tab-delimited input file.
cat users.txt | tee >(cut -f 1 >> cut_out.txt) >(cut -f 2 >> cut_out.txt) >(cut -f 3 >> cut_out.txt) >(cut -f 4 >> cut_out.txt)
Output:
W Humphrey
SummersW
FoxxR
noreply
DaibaN
PeanutbutterM
PetersJ
DaviesJ
BlaireJ
GongoH
MurphyF
JeffersD
HorsemanB
...
I thought I could cut down on the ridiculous command above with the following (2.):
cat users.txt | for i in {1..4}; do cut -f $i >> cut_out.txt; done
Output:
HumphreyW
The command above only returned a single word from the list and some whitespace.
The solution: I knew that I could get it working by simply looping the entire command instead. This did exactly what I wanted, but I still want to know why the command above (2.) returned an almost empty file.
for i in {1..4}; do cat users.txt | cut -f $i >> cut_out.txt; done
I have a solution; I mostly want an explanation, because I am still learning about I/O in bash. Cheers.
Just a remark
awk -F '[\t]' '{for(i = 1; i <= 4; i++) print $i}' users.txt > cut_out.txt
Is basically what your cat ... | tee >(cut ...) ... does.
If the order of the output is unimportant, and there are only four columns in the file, simply
tr '\t' '\n' <users.txt >cut_out.txt
If you only want the first four columns in any order,
cut -f1-4 users.txt |
tr '\t' '\n' >cut_out.txt
(Thanks to @KamilCuk for raising this in a comment.)
Otherwise your third attempt is basically fine, though you want to avoid the useless cat and redirect only once:
for i in {1..4}; do
cut -f "$i" users.txt
done > cut_out.txt
This is obviously less efficient than only reading the file once. If the file is small enough to fit into memory, you could write a simple Awk script to read it once and split it up into variables, and then write out these variables in the order you want.
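A minimal sketch of that idea, assuming tab-delimited input and that the output should be all of column 1, then all of column 2, and so on:
awk -F '\t' '
  { for (i = 1; i <= 4; i++) col[i] = col[i] $i "\n" }   # buffer each column in memory
  END { for (i = 1; i <= 4; i++) printf "%s", col[i] }   # emit the columns in order
' users.txt > cut_out.txt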
The second attempt is wrong because cat only supplies a single instance of the data to the pipe, and the first iteration of the loop consumes it all.
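A quick demo of that: the first command substitution drains the pipe, and the second iteration finds it empty:
$ printf 'a b c' | for i in 1 2; do echo "iteration $i: $(cat)"; done
iteration 1: a b c
iteration 2: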

Optimize sed for multiple replacements

I have a file, users.txt, with words like,
user1
user2
user3
I want to find these words in another file, data.txt and add a prefix to it. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1 and so on. I have written simple shell script like
for user in `cat users.txt`
do
sed -i 's/'${user}'/New_&/' data.txt
done
For ~1000 words, this program takes minutes to run, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions in Optimize shell script for multiple sed replacements, but still did not see much improvement.
Is there any other way to make this process faster?
Sed is known to be very fast (probably only worse than C).
Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter is known to be faster.
Since you only have one-line-at-a-time semantics, you could run it with GNU parallel (on a multi-core CPU) like this:
cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'
If you are working with plain ASCII files, you can speed things up by using the "C" locale:
LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
You can turn your users.txt into sed commands like this:
$ sed 's|.*|s/&/New_&/|' users.txt
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/
And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:
sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt
Your approach goes through all of data.txt for every single line in users.txt, which makes it slow.
If you can't use process substitution, you can use
sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt
instead.
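One caveat (a side note, not from the original answer): the generated s/user1/New_user1/ also matches user1 inside longer tokens such as user12. With GNU sed you can anchor the generated commands with word boundaries:
sed 's|.*|s/\\b&\\b/New_&/|' users.txt | sed -f - data.txt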
Or, in one go, we can do something like this. Let us say we have a data file with 500k lines.
$> wc -l data.txt
500001 data.txt
$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct 5 00:25 data.txt
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe
Let us say that our users.txt has a few keywords, which are to be prefixed with "ab_" in the file "data.txt":
$> cat users.txt
file
maybe
test
So we want to read users.txt and, for every word, change that word to a new word. For example, "file" to "ab_file", "maybe" to "ab_maybe", and so on.
We can run a while loop, read the input words to be prefixed one by one, and then run a perl command over the file with the input word stored in a variable. In the example below, the read word is passed to the perl command as $word.
I timed this task and it runs fairly quickly. I did it on a CentOS 7 VM hosted on my Windows 10 machine.
time cat users.txt |while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real 0m1.973s
user 0m1.846s
sys 0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe
In the above code, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.
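Note that this still rewrites data.txt once per word. If the word list grows, a single pass is possible by joining the words into one alternation (a sketch, assuming the words contain no regex metacharacters):
perl -pi -e 'BEGIN { chomp(@words = `cat users.txt`); $re = join "|", @words } s/($re)/ab_$1/g' data.txt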

Echo something while piping stdout

I know how to pipe stdout:
./myScript | grep 'important'
Example output of the above command:
Very important output.
Only important stuff here.
But while grepping I would also like to echo something for each line so it looks like this:
1) Very important output.
2) Only important stuff here.
How can I do that?
Edit: Apparently, I haven't specified well enough what I want to do. Numbering the lines is just an example; I want to know in general how to add text (any text, including variables and whatnot) to pipe output. I see one can achieve that with awk '{print $0}', where building on $0 is the kind of solution I'm looking for.
Are there any other ways to achieve this?
This will number the hits from 1:
./myScript | grep 'important' | awk '{printf("%d) %s\n", NR, $0)}'
1) Very important output.
2) Only important stuff here.
This will give you the line number of each hit in the original input:
./myScript | grep -n 'important'
3:Very important output.
47:Only important stuff here.
If you want line numbers on the new output running from 1..n, where n is the number of lines in the new output:
./myScript | awk '/important/{printf("%d) %s\n", ++i, $0)}'
# ^ Grep part ^ Number starting at 1
A solution with a while loop is not suited for large files, so you should only use it when you do not have a lot of important stuff:
i=0
while read -r line; do
((i++))
printf "(%s) Look out: %s" $i "${line}"
done < <(./myScript | grep 'important')
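More generally, any line-rewriting filter in the middle of a pipeline can add text to the output; for a fixed prefix, for instance:
./myScript | grep 'important' | sed 's/^/>> /'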

Getting head to display all but the last line of a file: command substitution and standard I/O redirection

I have been trying to get the head utility to display all but the last line of standard input. The actual code that I needed is something along the lines of cat myfile.txt | head -n $(($(wc -l)-1)). But that didn't work. I'm doing this on Darwin/OS X which doesn't have the nice semantics of head -n -1 that would have gotten me similar output.
None of these variations work either.
cat myfile.txt | head -n $(wc -l | sed -E -e 's/\s//g')
echo "hello" | head -n $(wc -l | sed -E -e 's/\s//g')
I tested out more variations and in particular found this to work:
cat <<EOF | echo $(($(wc -l)-1))
>Hola
>Raul
>Como Esta
>Bueno?
>EOF
3
Here's something simpler that also works.
echo "hello world" | echo $(($(wc -w)+10))
This one understandably gives me an illegal line count error. But it at least tells me that the head program is not consuming the standard input before passing stuff on to the subshell/command substitution, a remote possibility, but one that I wanted to rule out anyway.
echo "hello" | head -n $(cat && echo 1)
What explains the behavior of head and wc and their interaction through subshells here? Thanks for your help.
head -n -1 will give you all except the last line of its input (this works with GNU head; as noted in the question, the macOS head doesn't support negative counts).
head is the wrong tool. If you want to see all but the last line, use:
sed \$d
The reason that
# Sample of incorrect code:
echo "hello" | head -n $(wc -l | sed -E -e 's/\s//g')
fails is that wc consumes all of the input and there is nothing left for head to see. wc inherits its stdin from the subshell in which it is running, which is reading from the output of the echo. Once it consumes the input, it returns and then head tries to read the data...but it is all gone. If you want to read the input twice, the data will have to be saved somewhere.
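You can watch this happen with a quick demo: both commands in the group share the pipe as stdin, and the first one drains it, leaving nothing for cat:
$ printf 'a\nb\n' | { wc -l; echo "cat sees:"; cat; }
2
cat sees: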
Using sed:
sed '$d' filename
will delete the last line of the file.
$ seq 1 10 | sed '$d'
1
2
3
4
5
6
7
8
9
For Mac OS X specifically, I found an answer from a comment to this Q&A.
Assuming you are using Homebrew, run brew install coreutils then use the ghead command:
cat myfile.txt | ghead -n -1
Or, equivalently:
ghead -n -1 myfile.txt
Lastly, see brew info coreutils if you'd like to use the commands without the g prefix (e.g., head instead of ghead).
cat myfile.txt | echo $(($(wc -l)-1))
This works. It's overly complicated: you could just write echo $(($(wc -l)-1)) <myfile.txt or echo $(($(wc -l <myfile.txt)-1)). The problem is the way you're using it.
cat myfile.txt | head -n $(wc -l | sed -E -e 's/\s//g')
wc consumes all the input as it's counting the lines. So there is no data left to read in the pipe by the time head is started.
If your input comes from a file, you can redirect both wc and head from that file.
head -n $(($(wc -l <myfile.txt) - 1)) <myfile.txt
If your data may come from a pipe, you need to duplicate it. The usual tool to duplicate a stream is tee, but that isn't enough here, because the two outputs from tee are produced at the same rate, whereas here wc needs to fully consume its input before head can start. So instead, you'll need to use a single tool that can detect the last line, which is a more efficient approach anyway.
Conveniently, sed offers a way of matching the last line. Either printing all lines but the last, or suppressing the last output line, will work:
sed -n '$! p'
sed '$ d'
Here is a one-liner that can get you the desired output, and it can be used more generally for getting all lines from a file except the last n lines.
grep -n "" myfile.txt \ # output the line number for each line
| sort -nr \ # reverse the file by using those line numbers
| sed '1,4d' \ # delete first 4 lines (last 4 of the original file)
| sort -n \ # reverse the reversed file (correct the line order)
| sed 's/^[0-9]*://' # remove the added line numbers
Here is the above command as an actual, runnable one-liner (the annotated version can't be executed because of the added comments):
grep -n "" myfile.txt | sort -nr | sed '1,4d' | sort -n | sed 's/^[0-9]*://'
It's a little cumbersome, and this problem can be solved with more comprehensive commands like ghead, but when you can't or don't want to download such tools, it's nice to be able to do this with the more basic options. I've been in situations where it's simply not an option to get better tools.
awk 'NR>1{print p}{p=$0}'
This buffers each line and prints the previous one, so the last line is never printed. For this job, an awk one-liner is a bit longer than a sed one.

get the second last line from shell pipeline

I want to get the second last line from the ls -l output.
I know that
ls -l|tail -n 2| head -n 1
can do this, just wondering if sed can do this in just one command?
ls -l|sed -n 'x;$p'
It can't do third-to-last though, because sed only has one hold space, so it can only remember one earlier line. And since it processes lines one at a time, it does not know that a line is next-to-last while processing it. awk could return third-to-last, because you can have an arbitrary number of variables there, but the script would be much longer than tail -n X | head -n 1.
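For instance, one sketch of third-to-last in awk rotates three variables:
ls -l | awk '{a=b; b=c; c=$0} END{print a}'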
In an awk one-liner:
echo -e "aaa\nbbb\nccc\nddd" | awk '{v[c++]=$0}END{print v[c-2]}'
ccc
Try this to print the second-to-last line of a file: for every line except the last, it saves the line in the hold space and deletes it; on the last line, x exchanges the spaces, so the held (second-to-last) line is printed.
sed -e '$!{h;d;}' -e x filename
tac filename | sed -n 2p
(but that involves a pipe, too)
