Selectively using 'grep --after-context' on one term but not another - shell

I'm pulling in loads of data from a network and filtering for foo and bar, e.g.
for i in example.com example.org example.net
do
    echo "Data from $i"
    curl "$i/data.csv" | grep -E --after-context=3 "foo|bar"
done
Every time foo appears, I need to see the next few lines (grep --after-context=3), but when bar appears, I only need that single line.
Is it possible to make this work in a single grep, sed, awk (or other standard Unix) command?

One way:
curl .... | awk '/foo/{x=NR+3}(NR<=x) || /bar/'
When foo is encountered, x is set to the current line number plus 3, so the condition (NR<=x) prints the foo line and the next 3 lines. /bar/ prints any line containing bar.
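As a quick sanity check of that condition on made-up input (foo should bring along the next three lines, bar only itself):

```shell
# Lines a, b, c follow foo and are printed; d and e are not; bar prints alone.
printf '%s\n' foo a b c d bar e |
awk '/foo/{x=NR+3}(NR<=x) || /bar/'
# prints: foo, a, b, c, bar
```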

awk 'BEGIN {np=0} /bar/ {print; next} /foo/ {np=1;ln=NR;print;next} ln!=0 && NR>(ln+3) {np=0;ln=0} np==1 {print}' INPUTFILE
Instead of the grep, you might use the above. What it does:
in BEGIN it initializes the no-print variable np.
/bar/ {print} — if you can't figure this out, well... (the next skips all remaining rules and moves on to the next record).
/foo/ {np=1;ln=NR;print} prints foo lines, saves the current line number in ln, and turns on printing of the following lines.
If the current line number (NR) is more than 3 past the saved line number, printing is switched off.
If printing is on (np==1), the line is printed.

This might work for you (GNU sed):
sed -n '/foo/,+3{p;b};/bar/p' file

Related

Bash Script: Grabbing First Item Per Line, Throwing Into Array

I'm fairly new to the world of writing Bash scripts and am needing some guidance. I've begun writing a script for work, and so far so good. However, I'm now at a part that needs to collect database names. The names are actually stored in a file, and I can grep them.
The command I was given is cat /etc/oratab which produces something like this:
# This file is used by ORACLE utilities. It is created by root.sh
# and updated by the Database Configuration Assistant when creating
# a database.
# A colon, ':', is used as the field terminator. A new line terminates
# the entry. Lines beginning with a pound sign, '#', are comments.
#
# The first and second fields are the system identifier and home
# directory of the database respectively. The third field indicates
# to the dbstart utility that the database should, "Y", or should not,
# "N", be brought up at system boot time.
#
OEM:/software/oracle/agent/agent12c/core/12.1.0.3.0:N
*:/software/oracle/agent/agent11g:N
dev068:/software/oracle/ora-10.02.00.04.11:Y
dev299:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
xtst161:/software/oracle/ora-10.02.00.04.11:Y
dev360:/software/oracle/ora-11.02.00.04.02:Y
dev361:/software/oracle/ora-11.02.00.04.02:Y
xtst215:/software/oracle/ora-11.02.00.04.02:Y
xtst216:/software/oracle/ora-11.02.00.04.02:Y
dev298:/software/oracle/ora-11.02.00.04.03:Y
xtst160:/software/oracle/ora-11.02.00.04.03:Y
I turned around and wrote grep ":/software/oracle/ora" /etc/oratab so it can grab everything I need, which is 10 databases. Not the most elegant way, but it gets what I need:
dev068:/software/oracle/ora-10.02.00.04.11:Y
dev299:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
xtst161:/software/oracle/ora-10.02.00.04.11:Y
dev360:/software/oracle/ora-11.02.00.04.02:Y
dev361:/software/oracle/ora-11.02.00.04.02:Y
xtst215:/software/oracle/ora-11.02.00.04.02:Y
xtst216:/software/oracle/ora-11.02.00.04.02:Y
dev298:/software/oracle/ora-11.02.00.04.03:Y
xtst160:/software/oracle/ora-11.02.00.04.03:Y
So, if I want to grab just the name, such as dev068 or xtst161, how do I do that? For what I need to do with this project moving forward, I think I should store them in an array. As mentioned in the documentation, a colon is the field terminator. How could I whip this together so I have an array, something like:
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
I feel like I may be asking for too much assistance here but I'm truly at a loss. I would be happy to clarify if need be.
It is much simpler using awk:
awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
To populate a BASH array with the above output, use:
mapfile -t arr < <(awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab)
To check output:
declare -p arr
declare -a arr='([0]="dev068" [1]="dev299" [2]="xtst036" [3]="xtst161" [4]="dev360" [5]="dev361" [6]="xtst215" [7]="xtst216" [8]="dev298" [9]="xtst160")'
We can pipe the output of grep to the cut utility to extract the first field, taking colon as the field separator.
Then, assuming there are no whitespace or glob characters in any of the names (which would be subject to word splitting and filename expansion), we can use a command substitution to run the pipeline, and capture the output in an array by assigning it within the parentheses.
names=($(grep ':/software/oracle/ora' /etc/oratab | cut -d: -f1))
Note that the above command actually makes use of word splitting on the command substitution output to split the names into separate elements of the resulting array. That is why we must be sure that no whitespace occurs within any single database name, otherwise that name would be internally split into separate elements of the array. The only characters within the command substitution output that we want to be taken as word splitting delimiters are the line feeds that delimit each line of output coming off the cut utility.
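Putting that together on a trimmed-down, hypothetical oratab (the /tmp path and the two entries are invented for this sketch):

```shell
# Two matching entries plus a comment line that the grep should skip.
cat > /tmp/oratab.sample <<'EOF'
# comment line
dev068:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
EOF

# Word splitting on the command substitution output builds the array.
names=($(grep ':/software/oracle/ora' /tmp/oratab.sample | cut -d: -f1))
echo "${#names[@]}"   # 2
echo "${names[0]}"    # dev068
```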
You could also use awk for this:
awk -F: '!/^#/ && $2 ~ /^\/software\/oracle\/ora-/ {print $1}' /etc/oratab
The first pattern excludes any commented-out lines (starting with a #). The second pattern looks for your expected directory pattern in the second field. If both conditions are met it prints the first field, which is the Oracle SID. The -F: flag sets the field delimiter to a colon.
With your file that gets:
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
Depending on what you're doing you could finesse it further and check the last flag is set to Y; although that is really to indicate automatic start-up, it can sometimes be used to indicate that a database isn't active at all.
And you can put the results into an array with:
declare -a DBS=($(awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab))
and then refer to ${DBS[1]} (which evaluates to dev299) etc.
If you'd like them into a Bash array:
$ cat > toarr.bash
#!/bin/bash
while read -r line
do
if [[ $line =~ .*Y$ ]] # they seem to end in a "Y"
then
arr[$((i++))]=${line%%:*}
fi
done < file
echo ${arr[*]} # here we print the array arr
$ bash toarr.bash
dev068 dev299 xtst036 xtst161 dev360 dev361 xtst215 xtst216 dev298 xtst160

Use grep to print only the context

Using grep, you can print lines that match your search query. Adding the -C 2 option will print two lines of surrounding context, like this:
> grep -C 2 'lorem'
some context
some other context
**lorem ipsum**
another line
yet another line
Similarly, you can use grep -B 2 or grep -A 2 to print matching lines with two preceding or two following lines, respectively, for example:
> grep -A 2 'lorem'
**lorem ipsum**
another line
yet another line
Is it possible to skip the matching line and only print the context? Specifically, I would like to only print the line that is exactly 2 lines above a match, like this:
> <some magic command>
some context
If you can allow a couple of grep instances to be used, you can try it as I mentioned in the comments section.
$ grep -v "lorem" < <(grep -A2 "lorem" file)
another line
yet another line
$ grep -A2 "lorem" file | grep -v "lorem"
another line
yet another line
If you are interested in a dose of awk, there is a cool way to do it:
$ awk -v count=2 '{a[++i]=$0;}/lorem/{for(j=NR-count;j<NR;j++)print a[j];}' file
another line
yet another line
It works by storing every line of the file in an array as it reads; when the pattern lorem is found, the special variable NR (the current record number) points at the line containing the match. Looping over the count lines before it, as set by -v count, prints the lines needed.
If you want to print the matched line as well, just change the for-loop condition to j<=NR instead of j<NR. That's it!
There’s no way to do this purely through a grep command. If there’s only one instance of lorem in the text, you could pipe the output through head.
grep -B2 lorem t | head -1
If there may be multiple occurrences of lorem, you could use awk:
awk '{second_previous=previous; previous=current_line; current_line=$0}; /lorem/ { print second_previous; }'
This awk command saves each line (along with the previous and the one before that) in variables so when it encounters a line containing lorem, it prints the second last line. If lorem happens to occur in the first or second line of the input, nothing would be printed.
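A condensed sketch of the same rotation, with shortened variable names, on the question's example input:

```shell
# sp = second-previous line, p = previous line, c = current line.
# Rotate before testing, so sp is exactly two lines above a match.
printf '%s\n' 'some context' 'some other context' 'lorem ipsum' 'another line' |
awk '{sp=p; p=c; c=$0} /lorem/{print sp}'
# prints: some context
```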
awk, as others have said, is your friend here. You don't need complex loops or arrays or other junk, though; basic patterns suffice.
When you use -B N (and the --no-group-separator flag), you get output in groups of M=N+1 lines, assuming the matches are at least N lines apart. To select precisely one of those lines (in your question, you want the very first of the group), you can use modular arithmetic (tested with GNU awk).
awk -vm=3 -vx=1 'NR%m==x{print}'
You can think of the lines being numbered like this: they count up until you reach the match, at which point they go back to zero. So set m to N+1 and x to the line you want to extract.
1 some context
2 some other context
0 **lorem ipsum**
So the final command would be
grep -B2 --no-group-separator lorem $input | awk -vm=3 -vx=1 'NR%m==x{print}'
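On the example data this yields exactly the line two above the match (GNU grep is assumed, for --no-group-separator):

```shell
# Three-line sample: the match is on line 3, so the group is lines 1-3.
printf '%s\n' 'some context' 'some other context' 'lorem ipsum' > /tmp/ctx.txt
grep -B2 --no-group-separator lorem /tmp/ctx.txt |
awk -v m=3 -v x=1 'NR%m==x{print}'
# prints: some context
```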

How to join lines not starting with specific pattern to the previous line in UNIX?

Please take a look at the sample file and the desired output below to understand what I am looking for.
It can be done with loops in a shell script but I am struggling to get an awk/sed one liner.
SampleFile.txt
These are leaves.
These are branches.
These are greenery which gives
oxygen, provides control over temperature
and maintains cleans the air.
These are tigers
These are bears
and deer and squirrels and other animals.
These are something you want to kill
Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Desired output
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
With sed:
sed ':a;N;/\nThese/!s/\n/ /;ta;P;D' infile
resulting in
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Here is how it works:
sed '
:a # Label to jump to
N # Append next line to pattern space
/\nThese/!s/\n/ / # If the newline is NOT followed by "These", append
# the line by replacing the newline with a space
ta # If we changed something, jump to label
P # Print part until newline
D # Delete part until newline
' infile
The N;P;D is the idiomatic way of keeping multiple lines in the pattern space; the conditional branching part takes care of the situation where we append more than one line.
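For example, on the "bears" records from the sample file (GNU sed assumed):

```shell
# The continuation line does not start with "These", so it is joined
# to the previous line with a space.
printf '%s\n' 'These are tigers' 'These are bears' \
  'and deer and squirrels and other animals.' |
sed ':a;N;/\nThese/!s/\n/ /;ta;P;D'
```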
This works with GNU sed; for other seds, like the one found on macOS, the one-liner has to be split up so that branching commands and labels are in separate -e expressions, the newlines may have to be escaped, and we need an extra semicolon:
sed -e ':a' -e 'N;/'$'\n''These/!s/'$'\n''/ /;ta' -e 'P;D;' infile
This last command is untested; see this answer for differences between different seds and how to handle them.
Another alternative is to enter the newlines literally:
sed -e ':a' -e 'N;/\
These/!s/\
/ /;ta' -e 'P;D;' infile
But then, by definition, it's no longer a one-liner.
Please try the following:
awk 'BEGIN {accum_line = "";} /^These/{if(length(accum_line)){print accum_line; accum_line = "";}} {accum_line = length(accum_line) ? accum_line " " $0 : $0;} END {if(length(accum_line)){print accum_line;}}' < data.txt
The code consists of three parts:
The block marked by BEGIN is executed before anything else; it's useful for global initialization.
The block marked by END is executed when the regular processing has finished; it's good for wrapping things up, like printing the last collected line (as in this case).
The rest is the code performed for each line. First, the pattern is checked and, on a match, the previously accumulated line is printed and reset. Second, the current line is appended to the accumulator regardless of its contents.
awk '$1=="These"{if(row!="")print row; row=$0} $1!="These"{row=row " " $0} END{print row}'
you can take it from there: blank lines, separators,
other unspecified behaviors (untested)
another awk if you have support for multi-char RS (gawk has)
$ awk -v RS="These" 'NR>1{$1=$1; print RS, $0}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Explanation: set the record separator (RS) to "These" and skip the first (empty) record. Reassigning a field ($1=$1) forces awk to rebuild the record, squeezing its internal newlines into single spaces; then print the record separator followed by the rest of the record.
$ awk '{printf "%s%s", (NR>1 ? (/^These/?ORS:OFS) : ""), $0} END{print ""}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Not a one-liner (but see end of answer!), but an awk-script:
#!/usr/bin/awk -f
NR == 1 { line = $0 }
/^These/ { print line; line = $0 }
! /^These/ { line = line " " $0 }
END { print line }
Explanation:
I'm accumulating, building up, lines that start with "These" with lines not starting with "These", outputting the completed lines whenever I find the next line with "These" at the beginning.
Store the first line (the first "record").
If the line starts with "These", print the accumulated (previous, now complete) line and replace whatever we have found so far with the current line.
If it doesn't start with "These", accumulate the line (i.e concatenate it with the previously read incomplete lines, with a space in between).
When there's no more input, print the last accumulated (now complete) line.
Run like this:
$ ./script.awk data.in
As a one-liner:
$ awk 'NR==1{c=$0} /^These/{print c;c=$0} !/^These/{c=c" "$0} END{print c}' data.in
... but why you would want to run anything like that on the command line is beyond me.
EDIT Saw that it was the specific string "These" (/^These/) that was what should be looked for. Previously had my code look for uppercase letters at the start of the line (/^[A-Z]/).
Here is a sed program which avoids branches. I tested it with the --posix option. The trick is to use an "anchor" (a string which does not occur in the file):
sed --posix -n '/^These/!{;s/^/DOES_NOT_OCCUR/;};H;${;x;s/^\n//;s/\nDOES_NOT_OCCUR/ /g;p;}'
Explanation:
write DOES_NOT_OCCUR at the beginning of lines not starting with "These":
/^These/!{;s/^/DOES_NOT_OCCUR/;};
append the pattern space to the hold space
H;
If the last line is read, exchange pattern space and hold space
${;x;
Remove the newline at the beginning of the pattern space which is added by the H command when it added the first line to the hold space
s/^\n//;
Replace all newlines followed by DOES_NOT_OCCUR with blanks and print the result
s/\nDOES_NOT_OCCUR/ /g;p;}
Note that the whole file is accumulated in sed's process memory (the hold space); unless the file is huge this should not be a problem.

awk output is acting weird

cat TEXT | awk -v var=$i -v varB=$j '$1~var , $1~varB {print $1}' > PROBLEM HERE
I am passing two variables from an array to parse a very large text file by range. And it works, kind of.
if I use ">" the output to the file will ONLY be the last three lines as verified by cat and a text editor.
if I use ">>" the output to the file will include one complete read of TEXT and then it will divide the second read into the ranges I want.
if I let the output go through to the shell I get the same problem as above.
Question:
It appears awk is reading every line and printing it. Then it goes back and selects the ranges from the TEXT file. It does not do this if I use constants in the range pattern search.
I understand awk must read all lines to find the ranges I request.
why is it printing the entire document?
How can I get it to ONLY print the ranges selected?
This is the last hurdle in a big project and I am beating my head against the table.
Thanks!
Give this a try; you didn't assign varB the right way:
yours: awk -v var="$i" -varB="$j" ...
mine : awk -v var="$i" -v varB="$j" ...
^^
Aside from the typo, you can't use shell variables inside /.../; instead you have to match with the regular ~ operator. Also, quote your shell variables (not strictly needed here, but it sets a good example). For example:
seq 1 10 | awk -v b="3" -v e="5" '$0 ~ b, $0 ~ e'
should print 3..5 as expected
It sounds like this is what you want:
awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
e.g.
$ cat file
1
2
foo
3
4
bar
5
foo
6
bar
7
$ awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
foo
3
4
bar
foo
6
bar
but without sample input and expected output it's just a guess and this would not address the SHELL behavior you are seeing wrt use of > vs >>.
Here's what happened. I used an array to input into my variables. I set the counter for what I thought was the total length of the array. When the final iteration of the array was reached, there was a null value returned to awk for the variable. This caused it to print EVERYTHING. Once I correctly had a counter with the correct number of array elements the printing oddity ended.
As far as the > vs >> goes, I don't know. It did stop, but I wasn't as careful in documenting it. I think what happened is that I used $1 in the print command to save time, and with each line it printed at the end it erased the whole file and left the last three identical matches. Something to ponder. Thanks Ed for the honest work. And no thank you to Robo responses.

'grep +A': print everything after a match [duplicate]

This question already has answers here:
How to get the part of a file after the first line that matches a regular expression
(12 answers)
Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3 http://www.yahoo.com
But I don't know how to get the file after line number 3. Also, I know there is a flag in grep -A print the lines after your match. However, you need to specify how many lines you want after the match. I am wondering is there something to get around that issue. Like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line containing yahoo, we set the variable y=1 and then skip that line (the next command jumps to the next line, skipping any further processing of the current one). Without next, the yahoo line itself would also be printed.
The second part is a short hand for:
y != 0 { print }
Which means: for each line, if variable y is non-zero, print that line. In AWK, referring to a variable creates it with a value of zero or the empty string, depending on context. Before encountering yahoo, y is 0, so the script prints nothing; after encountering yahoo, y is 1, so every subsequent line is printed.
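Applied to the sample list:

```shell
printf '%s\n' http://www.google.com http://www.bing.com http://www.yahoo.com \
  http://www.baidu.com http://www.yandex.com |
awk '/yahoo/{y=1;next}y'
# prints only the baidu and yandex lines
```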
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the script given with -e means "from the first line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
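The two forms side by side on a shortened version of the sample list (match kept vs. match dropped):

```shell
printf '%s\n' http://www.google.com http://www.yahoo.com http://www.baidu.com > /tmp/urls.txt
sed -n '/yahoo/,$p' /tmp/urls.txt   # the yahoo line and everything after it
sed '1,/yahoo/d' /tmp/urls.txt      # only what follows the yahoo line
```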
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get index of the "yahoo" word
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract totallines with index
result=`expr $totallines - $index`
# Prints the match and all lines after it
grep -A $result "yahoo" filepath
