How to extract a word from a grep result in shell? - shell

Using the shell, I want to search for a sub-string and print it together with the word that follows it.
e.g. logfile has the line "today is monday and this is:1234 so I am in."
if grep -q "this is:" ./logfile; then
    # here I want to print only the sub-string plus the next word, i.e. "this is:1234"
    # echo ???
fi

You can use sed with \1 to display the matched string in \(..\):
sed 's/.*\(this is:[0-9a-zA-Z]*\).*/\1/' logfile
EDIT: The above command only works well for single-line input.
When you have a file with more lines, you only want to print the lines that match:
sed -n 's/.*\(this is:[0-9a-zA-Z]*\).*/\1/p' logfile
When you have a large file and only want the first match, you could combine this command with head -1, but it is better to stop scanning after the first match. You can use q to quit, but only after a match:
sed -n '/.*\(this is:[0-9a-zA-Z]*\).*/{s//\1/p;q}' logfile

You can use a regular expression with a look-behind, if you want only the next word:
$ grep --perl-regexp -o '(?<=(this is:))(\S+)' ./logfile
1234
If you want both, then just:
$ grep --perl-regexp -o 'this is:\S+' ./logfile
this is:1234
The -o option instructs grep to return only the matching part.
In the commands above, we assumed that a "word" is a sequence of non-space characters. You can adjust that according to your needs.
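For example, if the value after the colon is known to be numeric, a narrower class does the job (still assuming a grep built with PCRE support):
$ grep --perl-regexp -o '(?<=this is:)[0-9]+' ./logfile
1234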

If you have a system with GNU extensions (but aren't certain it was compiled with optional PCRE support), consider:
if result=$(grep -E -m 1 -o 'this is:[^[:space:]]+' logfile); then
    echo "value is: ${result#*:}"
fi
${varname#value} expands to the contents of varname, but with value stripped from the beginning if present. Thus, ${result#*:} strips everything up to the first colon in result.
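A quick illustration of that expansion by itself, using the value grep would capture here:
result="this is:1234"
echo "${result#*:}"    # everything up to the first ":" is stripped, printing 1234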
However, this may not work on systems without the non-POSIX options -o or -m.
If you want to support non-GNU systems, awk is a tool worth considering. Unlike answers that require nonportable extensions (like grep -P), the following should work on any modern platform (tested with GNU awk, recent BSD awk, and mawk; it also produces no warnings with gawk --posix --lint):
# note that the constant 8 is the length of "this is:"
# GNU awk has cleaner syntax, but trying to be portable here.
if value=$(awk '
    BEGIN { matched=0; } # by default, this will trigger END to exit as failure
    /this is:/ {
        match($0, /this is:([^[:space:]]+)/);
        print substr($0, RSTART+8, RLENGTH-8);
        matched=1;  # tell END block to use zero exit status
        exit(0);    # stop processing remaining file contents, jump to END
    }
    END { if(matched == 0) { exit(1); } }
' logfile); then
    echo "Found value of $value"
else
    echo "Could not find a value in file"
fi

You can look for everything up to, but not including, the next space like this:
grep -Eo "this is:[^[:space:]]+" logfile
The [] introduces the set of characters you are looking for, and the ^ at the start complements the set, so here it matches any character that is not whitespace. The + says there must be one or more such characters.
The -E tells grep to use extended regular expressions and the -o means to only print the matched part.

Related

Count number of grep occurrences and store it a variable

I want to do something like this - grep for a string in a particular file, store it in a variable and be able to print just the number of occurrences.
#!/bin/bash
count=$(grep *something* *somefile*| wc -l)
echo $count
This always gives a 0 value, when I know it should be more.
This is what I intend to do, but it's taking forever to finish the script.
if egrep -iq "Android 6.0.1" $filename; then
    count=$(egrep -ic "Android 6.0.1" $filename)
    echo 'Operating System Version leaked number of times: '$count
fi
I have 7 other such if statements and I am running this for around 20 files.
Any more efficient way to make it faster?
grep has its own counting flag
-c, --count
Suppress normal output; instead print a count of matching lines for
each input file. With the -v, --invert-match option (see below), count
non-matching lines. (-c is specified by POSIX.)
count=$(grep -c 'match' file)
Note that the match part is quoted as well, so if you use special characters they are not interpreted by the shell.
Also, as stated in the excerpt from that man page, multiple matches on a single line are counted as a single match, since it only counts matching lines:
$ echo "hello hello hello hello
hello
> bye" | grep -c "hello"
2
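(Not from the man page excerpt above: if you do need every occurrence counted rather than matching lines, one common workaround is to combine -o with wc -l.)
$ echo "hello hello hello hello
> hello
> bye" | grep -o "hello" | wc -l
5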
A much more efficient approach would be to run Awk once.
awk -v patterns="foo,bar,baz" 'BEGIN { n=split(patterns, pats, ",") }
{ for (i=1; i<=n; ++i) if ($0 ~ pats[i]) ++hits[i] }
END { for (i=1; i<=n; ++i) printf("%8d%s\n", hits[i], pats[i]) }' list of files
For bonus points, format the output in machine-readable format (depending on where it ends up, JSON might be a good choice); and/or add the human-readable explanation for the significance of each hit to the END block.
If that's not what you want, running grep -Eic and ditching any zero value would already improve your run time over grepping the file twice for each match in the worst case. (The pessimal situation would be when the last line and no other line matches your pattern.)
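For instance, here is a sketch of that idea, reusing the pattern and $filename from the question; a single grep -Eic per pattern replaces the egrep -iq / egrep -ic pair:
count=$(grep -Eic "Android 6.0.1" "$filename")
if [ "$count" -gt 0 ]; then
    echo 'Operating System Version leaked number of times: '$count
fi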

How to search the content of a file using sed

I tried several commands and read a few tutorials, but none helped. I'm using GNU sed 4.2.2.
I just want to search in a file for:
func1(int a)
{
I don't know if it is because of the newline followed by "{", but it just doesn't work.
A correct command would help, and an explanation even more. Thanks!
Note: after "int a)" I typed Enter, but I think Stack Overflow doesn't show the newline. To avoid confusion: I just want to search for func1(int a)'newline'{.
sed -n '/func1/{N;/func1(int a)\n{/p}'
Explanation:
sed -n '/func1/{ # look for a line with "func1"
N # append the next line
/func1(int a)\n{/p # try to match the whole pattern "func1(int a)\n{"
# and print if matched
}'
With grep it would be for example:
grep -Pzo 'func1\(int a\)\n{'
but notice that, thanks to -z, the input to grep will be one large "line" that includes the newline characters too (unless the input contains null characters). Parentheses have to be escaped in this case because it is a Perl regular expression (-P). -o makes grep print just the matched pattern.
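A quick way to try the sed version (sample.c and its contents are just a throw-away example, not from the question):
$ printf 'func1(int a)\n{\n  return a;\n}\n' > sample.c
$ sed -n '/func1/{N;/func1(int a)\n{/p}' sample.c
func1(int a)
{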

'grep +A': print everything after a match [duplicate]

This question already has answers here:
How to get the part of a file after the first line that matches a regular expression
(12 answers)
Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after http://www.yahoo.com; the result should look like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3:http://www.yahoo.com
But I don't know how to get the part of the file after line number 3. Also, I know grep has an -A flag to print lines after your match. However, you need to specify how many lines you want after the match. I am wondering whether there is something to get around that issue. Like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1 and then skip that line (the next command jumps to the next line, skipping any further processing of the current one). Without the next command, the yahoo line itself would be printed.
The second part is shorthand for:
y != 0 { print }
which means: for each line, if the variable y is non-zero, print that line. In AWK, referring to a variable creates it with a value of zero or the empty string, depending on context. Before encountering yahoo, y is 0, so the script prints nothing. After encountering yahoo, y is 1, so every subsequent line is printed.
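Run against file1 from the question (the answer's data.txt holds the same list), this gives:
$ awk '/yahoo/{y=1;next}y' file1
http://www.baidu.com
http://www.yandex.com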
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1), a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes (meaning "the first line that matches this regexp"). (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
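For instance, a purely numeric range answers the "after line number 3" part of the question directly (assuming, as in the example, that the yahoo line is line 3):
sed -n '4,$p' file1
prints everything from line 4 to the end of the file.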
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the script given with -e means "from the first line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
awk '/yahoo/ ? c++ : c' file1
On the yahoo line itself the post-increment c++ still evaluates to 0 (false), so that line is not printed; on every later line c is non-zero (true), so awk's default action prints the line.
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get the line number of the "yahoo" line
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract index from totallines
result=`expr $totallines - $index`
# Gives the desired output (grep -A prints the matching line plus the $result lines after it)
grep -A $result "yahoo" filepath

Grep characters before and after match?

Using this:
grep -A1 -B1 "test_pattern" file
will produce one line before and after the matched pattern in the file. Is there a way to display not lines but a specified number of characters?
The lines in my file are pretty big so I am not interested in printing the entire line but rather only observe the match in context. Any suggestions on how to do this?
3 characters before and 4 characters after
$> echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}'
23_string_and
grep -E -o ".{0,5}test_pattern.{0,5}" test.txt
This will match up to 5 characters before and after your pattern. The -o switch tells grep to only show the match and -E to use an extended regular expression. Make sure to put the quotes around your expression, else it might be interpreted by the shell.
You could use
awk '/test_pattern/ {
match($0, /test_pattern/); print substr($0, RSTART - 10, RLENGTH + 20);
}' file
You mean, like this:
grep -o '.\{0,20\}test_pattern.\{0,20\}' file
?
That will print up to twenty characters on either side of test_pattern. The \{0,20\} notation is like *, but specifies zero to twenty repetitions instead of zero or more. The -o says to show only the match itself, rather than the entire line.
I'll never easily remember these cryptic command modifiers, so I took the top answer and turned it into a function in my ~/.bashrc file:
cgrep() {
    # For files that are one long run of tens of thousands of characters,
    # use cgrep to print 30 characters before and after the search pattern.
    if [ $# -eq 2 ] ; then
        # Format was 'cgrep "search string" /path/to/filename'
        grep -o -P ".{0,30}$1.{0,30}" "$2"
    else
        # Format was 'cat /path/to/filename | cgrep "search string"'
        grep -o -P ".{0,30}$1.{0,30}"
    fi
} # cgrep()
Here's what it looks like in action:
$ ll /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
-rw-r--r-- 1 rick rick 25780 Jul 3 19:05 /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
$ cat /tmp/rick/scp.Mf7UdS/Mf7UdS.Source | cgrep "Link to iconic"
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
$ cgrep "Link to iconic" /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
The file in question is one continuous 25K-character line, and it is hopeless to find what you are looking for with regular grep.
Notice the two different ways you can call cgrep, which parallel the ways grep itself is used.
There is a "niftier" way of writing the function where "$2" is only passed when it is set, which would save four lines of code. I don't have it handy, though; something like ${parm2} $parm2. If I find it I'll revise the function and this answer.
With gawk, you can use the match function:
x="hey there how are you"
echo "$x" |awk --re-interval '{match($0,/(.{4})how(.{4})/,a);print a[1],a[2]}'
ere are
If you are OK with perl, here is a more flexible solution: the following will print three characters before the pattern, followed by the actual pattern, and then 5 characters after the pattern.
echo hey there how are you |perl -lne 'print "$1$2$3" if /(.{3})(there)(.{5})/'
ey there how
This can also be applied to words instead of just characters. The following will print one word before the actual matching string.
echo hey there how are you |perl -lne 'print $1 if /(\w+) there/'
hey
The following will print one word after the pattern:
echo hey there how are you |perl -lne 'print $2 if /(\w+) there (\w+)/'
how
The following will print one word before the pattern, then the actual word, and then one word after the pattern:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(\w+)( there )(\w+)/'
hey there how
If using ripgrep, the equivalent would be:
rg -o ".{0,5}test_pattern.{0,5}" test.txt
You can use one grep with a regexp to find the match, and a second grep to highlight it:
echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}' | grep string
23_string_and
With ugrep you can specify -ABC context with option -o (--only-matching) to show the match with extra characters of context before and/or after the match, fitting the match plus the context within the specified -ABC width. For example:
ugrep -o -C30 pattern testfile.txt
gives:
1: ... long line with an example pattern to match. The line could...
2: ...nother example line with a pattern.
On a terminal the output is color-highlighted. Multiple matches on a line are either shown with [+nnn more] or, with option -k (--column-number), each is shown individually with its context and column number.
The context width is the number of Unicode characters displayed (UTF-8/16/32), not just ASCII.
I personally do something similar to the posted answers, but since the dot key, like any key, can be tapped or held down, and I often don't need a lot of context (if I needed whole lines I might use grep -C, but like you I usually don't want full lines before and after), I find it quicker to just tap the dot key once per character of context I want, or hold it down for more.
e.g. echo zzzabczzzz | grep -o '.abc..'
will show the abc pattern with one character before and two after (in regex language, a dot matches any character). Others used dots too, but with curly braces to specify the repetition.
If I wanted to be strict about between 0 and x characters, or exactly y characters, then I'd use the curly braces and -P, as others have done.
There is a setting for whether dot matches a newline, which you can look into if that is a concern.

Capturing Groups From a Grep RegEx

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
echo $name
done
So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.
I'd like to use grep only, if possible. If not, then sed or something like it, but please no Python or Perl, etc. I would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?
If you're using Bash, you don't even have to use grep:
files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files    # unquoted in order to allow the glob to expand
do
    if [[ $f =~ $regex ]]
    then
        name="${BASH_REMATCH[1]}"
        echo "${name}.jpg"     # concatenate strings
        name="${name}.jpg"     # same thing stored in a variable
    else
        echo "$f doesn't match" >&2    # this could get noisy if there are a lot of non-matching files
    fi
done
It's better to put the regex in a variable. Some patterns won't work if included literally.
This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:
123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz
To eliminate the second and fourth examples, make your regex like this:
^[0-9]+_([a-z]+)_[0-9a-z]*
which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:
^[0-9]+_([a-z]+)_[0-9a-z]*$
then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
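A quick check of the anchored pattern, using two of the example strings above:
regex='^[0-9]+_([a-z]+)_[0-9a-z]*$'
[[ 123_abc_d4e5 =~ $regex ]] && echo "matches"              # prints "matches"
[[ xyz123_abc_d4e5.xyz =~ $regex ]] || echo "no match"      # prints "no match"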
If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):
name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg
The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.
In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
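For instance, with one of the sample names from above, the grep part on its own prints:
f=123_abc_d4e5.jpg
echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)'    # prints: abc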
This isn't really possible with pure grep, at least not generally.
But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:
echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep would remove any lines that didn't match your overall pattern, and the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
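For example, with one of the sample names from above:
echo 123_abc_d4e5.jpg | cut -d _ -f 2    # prints: abc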
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.
The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
Not possible in just grep I believe
for sed:
name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`
I'll take a stab at the bonus though:
echo "$name.jpg"
This is a solution that uses gawk. It's something I find I need to use often, so I created a function for it:
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
to use just do
$ echo 'hello world' | regex1 'hello\s(.*)'
world
str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
week="${BASH_REMATCH[1]}"
day="${BASH_REMATCH[2]}"
hour="${BASH_REMATCH[3]}"
echo $week --- $day ---- $hour
fi
output:
1 --- 2 ---- 1
A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name will have the value abc.
See Apple developer docs, search forward for 'Parameter Expansion'.
I prefer a one-line Python or Perl command; both are often included in major Linux distributions.
echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' | python -c $'
import re
import sys
for i in sys.stdin:
    g = re.match(r\'.*href="(.*)"\', i)
    if g is not None:
        print(g.group(1))
'
and to handle files:
ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
    i = i.strip()
    f = open(i, "r")
    for j in f:
        g = re.match(r\'.*href="(.*)"\', j)
        if g is not None:
            print(g.group(1))
    f.close()
'
The following example shows how to extract the 3-character sequence from a filename using a regex capture group:
for f in 123_abc_123.jpg 123_xyz_432.jpg
do
echo "f: " $f
name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
echo "name: " $name
done
Outputs:
f: 123_abc_123.jpg
name: abc
f: 123_xyz_432.jpg
name: xyz
So the if-regex conditional in perl filters out all non-matching lines; for those lines that do match, it applies the capture group(s), which you can access with $1, $2, ... respectively.
If you have bash, you can use extended globbing:
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
