Shell Scripting: "grep -w" not to select words separated by "-" - shell

I have 3 words.
abcd-1234
abcd-abcd
abcd
Is it possible to select/print the 3rd word "abcd" with grep -w or a similar command?

This should work:
grep '[a-zA-Z]'
More specifically, to match only the alphabetic characters at the beginning:
echo "abcd-1234" | grep -o '^[a-zA-Z]*'
That should be good enough for the given examples.
Regarding your comment, try this:
data.txt
abcd-1234
abcd-4678
abcd
abcd-as334s
abcd-abcd
cat data.txt | grep -ow '^[a-zA-Z]*' | sort -u

And why do you want to achieve this using -w if you can simply achieve this by -v (A.K.A. --invert-match):
grep -v "-" data.txt
Output:
abcd
Ok, -w only gets entire words, but a hyphen does not always split a word. If you don't like the hyphen, best thing to say is that you don't like the hyphen (hence -v "-").
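As a rough illustration of the point above, the sketch below compares -w with -x (match the whole line) on the three words from the question. The variable names are made up for the example.

```shell
# Sample input from the question, one word per line.
words=$(printf '%s\n' 'abcd-1234' 'abcd-abcd' 'abcd')

# -w treats the hyphen as a word delimiter, so every line matches "abcd".
with_w=$(printf '%s\n' "$words" | grep -c -w 'abcd')

# -x requires the pattern to match the entire line, so only "abcd" is left.
with_x=$(printf '%s\n' "$words" | grep -x 'abcd')

echo "lines matched with -w: $with_w"
echo "lines matched with -x: $with_x"
```

If the word itself is not known in advance, something like grep -x '[[:alpha:]]*' should select the lines that consist of letters only.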

Related

How to grep only matching string from this result?

I am just simply trying to grab the commit ID, but not quite sure what I'm missing:
➜ ~ curl https://github.com/microsoft/vscode/releases -s | grep -oE 'microsoft/vscode/commit/(.*?)/hovercard'
microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard
The only thing I need back from this is ccbaa2d27e38e5afa3e5c21c1c7bef4657064247.
This works just fine on regex101.com and in ruby/python. What am I missing?
If supported, you can use grep -oP
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | grep -oP "microsoft/vscode/commit/\K.*?(?=/hovercard)"
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
Another option is to use sed with a capture group
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | sed -E 's/microsoft\/vscode\/commit\/([^\/]+)\/hovercard/\1/'
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
The point is that grep does not support extracting capturing group submatches. If you install pcregrep you could do that with
curl https://github.com/microsoft/vscode/releases -s | \
pcregrep -o1 'microsoft/vscode/commit/(.*?)/hovercard' | head -1
The | head -1 part is to fetch the first occurrence only.
I would suggest using awk here:
awk 'match($0,/microsoft\/vscode\/commit\/[^\/]*\/hovercard/){print substr($0,RSTART+24,RLENGTH-34);exit}'
The regex will match a line containing
microsoft\/vscode\/commit\/ - microsoft/vscode/commit/ fixed string
[^\/]* - zero or more chars other than /
\/hovercard - a /hovercard string.
substr($0,RSTART+24,RLENGTH-34) prints the part of the line that starts at index RSTART+24 (24 is the length of microsoft/vscode/commit/) and is RLENGTH-34 characters long, where 34 is the length of microsoft/vscode/commit/ (24) plus the length of /hovercard (10).
The exit command will fetch you the first occurrence. Remove it if you need all occurrences.
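The offsets can also be computed rather than hard-coded. The following is only a sketch of the same match/substr idea: the sample line is a hypothetical stand-in for one line of the page, and the variable names p and s are invented for the example.

```shell
# Hypothetical sample line; the real input would come from curl.
line='<a href="/microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard">'

prefix='microsoft/vscode/commit/'
suffix='/hovercard'

# match() sets RSTART and RLENGTH; the prefix and suffix lengths are
# subtracted so that only the commit ID between them is printed.
commit=$(printf '%s\n' "$line" | awk -v p="$prefix" -v s="$suffix" '
  match($0, /microsoft\/vscode\/commit\/[^\/]*\/hovercard/) {
    print substr($0, RSTART + length(p), RLENGTH - length(p) - length(s))
    exit
  }')

echo "$commit"
```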
You can use sed:
curl -s https://github.com/microsoft/vscode/releases |
sed -En 's=.*microsoft/vscode/commit/([^/]+)/hovercard.*=\1=p' |
head -n 1
head -n 1 is to print the first match (there are 10).
grep -o will print (only) everything that matches, including microsoft/ etc.
Your task cannot be achieved with Mac's grep. grep -o prints all matching text (compared to the default behaviour of printing matching lines), including microsoft/ etc. A grep that implements Perl regex (like GNU grep on Linux) could make use of lookahead/lookbehind (grep -Po '(?<=microsoft/vscode/commit/)[^/]+(?=/hovercard)'). But it's just not available in Mac's grep.
On macOS you don't have GNU utilities available by default. You can just pipe your output to a simple awk like this:
curl https://github.com/microsoft/vscode/releases -s |
grep -oE 'microsoft/vscode/commit/[^/]+/hovercard' |
awk -F/ '{print $(NF-1)}'
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
3a6960b964327f0e3882ce18fcebd07ed191b316
f4af3cbf5a99787542e2a30fe1fd37cd644cc31f
b3318bc0524af3d74034b8bb8a64df0ccf35549a
6cba118ac49a1b88332f312a8f67186f7f3c1643
c13f1abb110fc756f9b3a6f16670df9cd9d4cf63
ee8c7def80afc00dd6e593ef12f37756d8f504ea
7f6ab5485bbc008386c4386d08766667e155244e
83bd43bc519d15e50c4272c6cf5c1479df196a4d
e7d7e9a9348e6a8cc8c03f877d39cb72e5dfb1ff

How to grep and match the first occurrence of a line?

Given the following content:
title="Bar=1; Fizz=2; Foo_Bar=3;"
I'd like to match the first occurrence of the Bar value, which is 1. Also, I don't want to rely on the surroundings of the word (like the double quote in front), because the pattern could be in the middle of the line.
Here is my attempt:
$ grep -o -m1 'Bar=[ ./0-9a-zA-Z_-]\+' input.txt
Bar=1
Bar=3
I've used -m/--max-count, which is supposed to stop reading the file after num matches, but it didn't work. Why doesn't this option work as expected?
I could combine it with head -n1, but I'm wondering whether it is possible to achieve that with grep alone.
grep is line-oriented, so it apparently counts matches in terms of lines when using -m[1]
- even if multiple matches are found on the line (and are output individually with -o).
While I wouldn't know how to solve the problem with grep alone (except with GNU grep's -P option - see anubhava's helpful answer), awk can do it (in a portable manner):
$ awk -F'Bar=|;' '{ print $2 }' <<<"Bar=1; Fizz=2; Foo_Bar=3;"
1
Use print "Bar=" $2, if the field name should be included.
Also note that the <<< method of providing input via stdin (a so-called here-string) is specific to Bash, Ksh, Zsh; if POSIX compliance is a must, use echo "..." | grep ... instead.
[1] Options -m and -o are not part of the grep POSIX spec., but both GNU and BSD/OSX grep support them and have chosen to implement the line-based logic.
This is consistent with the standard -c option, which counts "selected lines", i.e., the number of matching lines:
grep -o -c 'Bar=[ ./0-9a-zA-Z_-]\+' <<<"Bar=1; Fizz=2; Foo_Bar=3;" yields 1.
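The line-based counting can be checked with a small experiment. This is a sketch only: it uses the sample line from the question and a simplified pattern (Bar=[0-9]*) chosen for the example, and it relies on -m and -o, which are available in GNU and BSD grep but are not POSIX.

```shell
line='title="Bar=1; Fizz=2; Foo_Bar=3;"'

# -m1 stops after one matching LINE, so -o still prints both matches on it.
shown=$(printf '%s\n' "$line" | grep -o -m1 'Bar=[0-9]*' | wc -l | tr -d ' ')

# -c counts matching lines too, not individual matches.
count=$(printf '%s\n' "$line" | grep -c 'Bar=[0-9]*')

# Keeping only the first match needs an extra step such as head.
first=$(printf '%s\n' "$line" | grep -o 'Bar=[0-9]*' | head -n 1)

echo "matches printed with -o -m1: $shown"
echo "lines counted with -c: $count"
echo "first match: $first"
```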
Using perl based regex flavor in gnu grep you can use:
grep -oP '^(.(?!Bar=\d+))*Bar=\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
Bar=1
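(.(?!Bar=\d+))* matches zero or more characters, none of which is immediately followed by Bar=\d+, so the match cannot run past the first Bar=\d+. Because the character directly before Bar= can never be consumed this way, the pattern only matches when the first Bar= is at the very start of the line, as in the example above.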
If intent is to just print the value after = then use:
grep -oP '^(.(?!Bar=\d+))*Bar=\K\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
1
You can use grep -P (assuming you are on gnu grep) and positive look ahead ((?=.*Bar)) to achieve that in grep:
echo "Bar=1; Fizz=2; Foo_Bar=3;" | grep -oP -m 1 'Bar=[ ./0-9a-zA-Z_-]+(?=.*Bar)'
First use a grep to make the line start with Bar, and then get the Bar at the start of the line:
grep -o "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
When you have a large file, you can optimize with
grep -o -m1 "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"

command to count occurrences of word in entire file

I am trying to count the occurrences of a word in a file.
If word occurs multiple times in a line, I will count is a 1.
Following command will give me the output but will fail if line has multiple occurrences of word
grep -c "word" filename.txt
Is there any one liner?
You can use grep -o to show the exact matches and then count them:
grep -o "word" filename.txt | wc -l
Test
$ cat a
hello hello how are you
hello i am fine
but
this is another hello
$ grep -c "hello" a # Normal `grep -c` fails
3
$ grep -o "hello" a
hello
hello
hello
hello
$ grep -o "hello" a | wc -l # grep -o solves it!
4
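Note that grep -o "hello" also counts occurrences inside longer words. The sketch below extends the test file with one extra line (othello, added purely for illustration) to show how -w might be combined with -o to count whole words only.

```shell
# The test file from above plus one hypothetical extra line, "othello".
sample=$(printf '%s\n' \
  'hello hello how are you' \
  'hello i am fine' \
  'but' \
  'this is another hello' \
  'othello')

# Lines that contain the string at least once.
lines=$(printf '%s\n' "$sample" | grep -c 'hello')

# Every occurrence, including the one inside "othello".
occurrences=$(printf '%s\n' "$sample" | grep -o 'hello' | wc -l | tr -d ' ')

# Whole-word occurrences only.
whole_words=$(printf '%s\n' "$sample" | grep -o -w 'hello' | wc -l | tr -d ' ')

echo "lines: $lines, occurrences: $occurrences, whole words: $whole_words"
```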
Set RS in awk for a shorter one.
awk 'END{print NR-1}' RS="word" file
GNU awk allows it to be done in a single command, without the use of multiple piped commands:
awk -v w="word" '$1==w{n++} END{print n}' RS=' |\n' file
tr -s ' ' '\n' < file | grep -c word
This assumes that all words in the file have spaces between the words. If there's punctuation concatenating the word to itself, or otherwise no spaces on a single line between the word and itself, they'll count as one.
grep word filename.txt | wc -l
grep prints the lines that match, then wc -l prints the number of lines matched

How to grep, excluding some patterns?

I'd like to find lines in files with an occurrence of some pattern and an absence of some other pattern. For example, I need to find all files/lines including loom except ones with gloom. So, I can find loom with the command:
grep -n 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp)
Now, I want to search for loom excluding gloom. However, both of the following commands failed:
grep -v 'gloom' -n 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp)
grep -n 'loom' -v 'gloom' ~/projects/**/trunk/src/**/*.@(h|cpp)
What should I do to achieve my goal?
EDIT 1: I mean that loom and gloom are the character sequences (not necessarily the words). So, I need, for example, bloomberg in the command output and don't need ungloomy.
EDIT 2: There is sample of my expectations.
Both of following lines are in command output:
I faced the icons that loomed through the veil of incense.
Arty is slooming in a gloomy day.
Both of following lines aren't in command output:
It’s gloomyin’ ower terrible — great muckle doolders o’ cloods.
In the south west round of the heigh pyntit hall
How about just chaining the greps?
grep -n 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp) | grep -v 'gloom'
Another solution without chaining grep:
egrep '(^|[^g])loom' ~/projects/**/trunk/src/**/*.@(h|cpp)
Between the brackets, you exclude the character g before any occurrence of loom; the ^ alternative covers the case where loom is at the start of the line.
A bit old, but oh well...
The most up-voted solution from @houbysoft will not work, as it will exclude any line with "gloom" in it, even if it also has "loom". According to OP's expectations, we need to include lines with "loom", even if they also have "gloom" in them. This line needs to be in the output: "Arty is slooming in a gloomy day.", but it will be excluded by a chained grep like
grep -n 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp) | grep -v 'gloom'
Instead, the egrep regex example of Bentoy13 works better
egrep '(^|[^g])loom' ~/projects/**/trunk/src/**/*.@(h|cpp)
as it will include any line with "loom" in it, regardless of whether or not it has "gloom". On the other hand, if it only has gloom, it will not include it, which is precisely the behaviour OP wants.
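The difference between the two approaches can be seen on the sample lines from the question. This is only a sketch: the third line is abbreviated and written with plain ASCII apostrophes, and grep -E is used as the equivalent of egrep.

```shell
# Sample lines based on EDIT 2 of the question (third line abbreviated).
sample=$(printf '%s\n' \
  'I faced the icons that loomed through the veil of incense.' \
  'Arty is slooming in a gloomy day.' \
  "It's gloomyin' ower terrible" \
  'In the south west round of the heigh pyntit hall')

# Chained greps drop every line that contains "gloom" anywhere.
chained=$(printf '%s\n' "$sample" | grep 'loom' | grep -v -c 'gloom')

# The single regex keeps any line with a "loom" that is not preceded by "g".
regex=$(printf '%s\n' "$sample" | grep -E -c '(^|[^g])loom')

echo "chained greps keep $chained line(s)"
echo "single regex keeps $regex line(s)"
```

The chained version loses the "slooming ... gloomy" line, which the question expects to see in the output.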
Just use awk, it's much simpler than grep in letting you clearly express compound conditions.
If you want to skip lines that contain both loom and gloom:
awk '/loom/ && !/gloom/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.@(h|cpp)
or if you want to print them:
awk '/(^|[^g])loom/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.@(h|cpp)
and if the reality is you just want lines where loom appears as a word by itself:
awk '/\<loom\>/{ print FILENAME, FNR, $0 }' ~/projects/**/trunk/src/**/*.@(h|cpp)
-v is the "inverted match" flag, so piping is a very good way:
grep "loom" ~/projects/**/trunk/src/**/*.@(h|cpp) | grep -v "gloom"
Simply use grep -v multiple times.
#Content of file
[root@server]# cat file
1
2
3
4
5
#Exclude the line or match
[root@server]# cat file |grep -v 3
1
2
4
5
#Exclude the line or match multiple
[root@server]# cat file |grep -v "3\|5"
1
2
4
/*You might be looking something like this?
grep -vn "gloom" `grep -l "loom" ~/projects/**/trunk/src/**/*.@(h|cpp)`
The BACKQUOTES are used like brackets for commands, so in this case with -l enabled,
the code in the BACKQUOTES will return you the file names, then with -vn to do what you wanted: have filenames, linenumbers, and also the actual lines.
UPDATE Or with xargs
grep -l "loom" ~/projects/**/trunk/src/**/*.@(h|cpp) | xargs grep -vn "gloom"
Hope that helps.*/
Please ignore what I've written above, it's rubbish.
grep -n "loom" `grep -l "loom" tt4.txt` | grep -v "gloom"
#this part gets the filenames with "loom"
#this part gets the lines with "loom"
#this part gets the linenumber,
#filename and actual line
You can use grep -P (perl regex) supported negative lookbehind:
grep -P '(?<!g)loom\b' ~/projects/**/trunk/src/**/*.@(h|cpp)
I added \b for word boundaries.
grep -n 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp) | grep -v 'gloom'
Question: search for 'loom' excluding 'gloom'.
Answer:
grep -w 'loom' ~/projects/**/trunk/src/**/*.@(h|cpp)

How to remove the last character from a bash grep output

COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2`
outputs something like this
"Abc Inc";
What I want to do is I want to remove the trailing ";" as well. How can i do that? I am a beginner to bash. Any thoughts or suggestions would be helpful.
This will remove the last character contained in your COMPANY_NAME var, regardless of whether or not it is a semicolon:
echo "$COMPANY_NAME" | rev | cut -c 2- | rev
I'd use sed 's/;$//'. eg:
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | sed 's/;$//'`
foo="hello world"
echo ${foo%?}
hello worl
I'd use head --bytes, or head -c for short (GNU head). Because the pipeline's output ends with a newline, which counts as the last byte, drop two bytes to remove the semicolon as well:
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | head --bytes -2`
head outputs only the beginning of a stream or file. Typically it counts lines, but it can be made to count characters/bytes instead. head --bytes 10 will output the first ten bytes, but head --bytes -10 will output everything except the last ten.
NB: you may have issues if the final character is multi-byte, but a semicolon isn't.
I'd recommend this solution over sed or cut because
It's exactly what head was designed to do, thus less command-line options and an easier-to-read command
It saves you having to think about regular expressions, which are cool/powerful but often overkill
It saves your machine having to think about regular expressions, so will be imperceptibly faster
I believe the cleanest way to strip a single character from a string with bash is:
echo ${COMPANY_NAME:: -1}
but I haven't been able to embed the grep piece within the curly braces, so your particular task becomes a two-liner:
COMPANY_NAME=$(grep "company_name" file.txt); COMPANY_NAME=${COMPANY_NAME:: -1}
This will strip any character, semicolon or not, but can get rid of the semicolon specifically, too.
To remove ALL semicolons, wherever they may fall:
echo ${COMPANY_NAME//;/}
To remove only a semicolon at the end:
echo ${COMPANY_NAME%;}
Or, to remove multiple semicolons from the end (this needs shopt -s extglob):
echo ${COMPANY_NAME%%+(;)}
For great detail and more on this approach, The Linux Documentation Project covers a lot of ground at http://tldp.org/LDP/abs/html/string-manipulation.html
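For reference, here is a small sketch of suffix removal that sticks to POSIX parameter expansion, so it should not depend on Bash-specific features. The sample values and variable names are invented for the example.

```shell
COMPANY_NAME='"Abc Inc";'

# Remove one trailing semicolon, if there is one.
no_semicolon=${COMPANY_NAME%;}

# Remove the last character, whatever it is.
no_last_char=${COMPANY_NAME%?}

# Remove a run of trailing semicolons: first isolate the run by deleting
# the longest prefix that ends in a non-semicolon, then strip that suffix.
many='"Abc Inc";;;'
trailing=${many##*[!;]}
no_trailing=${many%"$trailing"}

echo "$no_semicolon"
echo "$no_last_char"
echo "$no_trailing"
```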
Using sed, if you don't know what the last character actually is:
$ grep company_name file.txt | cut -d '=' -f2 | sed 's/.$//'
"Abc Inc"
Don't abuse cats. Did you know that grep can read files, too?
The canonical approach would be this:
grep "company_name" file.txt | cut -d '=' -f 2 | sed -e 's/;$//'
A smarter approach would use a single perl or awk statement, which can filter and apply different transformations at once. For example, something like this:
COMPANY_NAME=$( perl -ne '/company_name=(.*);/ && print $1' file.txt )
You don't have to chain so many tools. Just one awk command does the job:
COMPANY_NAME=$(awk -F"=" '/company_name/{gsub(/;$/,"",$2) ;print $2}' file.txt)
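The same filter-and-transform idea can be sketched with a single sed call. The input below is a hypothetical stand-in for file.txt (the company_id line is made up), piped in with printf so the example is self-contained.

```shell
# Hypothetical file contents; in practice sed would read file.txt directly.
input=$(printf '%s\n' 'company_id=42;' 'company_name="Abc Inc";')

# -n suppresses default output; the p flag prints only the line where the
# substitution succeeded, with the key and the trailing semicolon removed.
COMPANY_NAME=$(printf '%s\n' "$input" | sed -n 's/^company_name=\(.*\);$/\1/p')

echo "$COMPANY_NAME"
```

This keeps the surrounding quotation marks, as in the output shown in the question.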
In Bash using only one external utility:
IFS='= ' read -r discard COMPANY_NAME <<< $(grep "company_name" file.txt)
COMPANY_NAME=${COMPANY_NAME/%?}
Assuming the quotation marks are actually part of the output, couldn't you just use the -o switch to return everything between the quote marks?
COMPANY_NAME="\"ABC Inc\";"; echo "$COMPANY_NAME" | grep -o "\"*.*\""
you can strip the beginnings and ends of a string by N characters using this bash construct, as someone said already
$ fred=abcdefg.rpm
$ echo ${fred:1:-4}
bcdefg
HOWEVER, this is not supported in older versions of bash.. as I discovered just now writing a script for a Red hat EL6 install process. This is the sole reason for posting here.
A hacky way to achieve this is to use sed with extended regex like this:
$ fred=abcdefg.rpm
$ echo $fred | sed -re 's/^.(.*)....$/\1/g'
bcdefg
Some refinements to answer above. To remove more than one char you add multiple question marks. For example, to remove last two chars from variable $SRC_IP_MSG, you can use:
SRC_IP_MSG=${SRC_IP_MSG%??}
cat file.txt | grep "company_name" | cut -d '=' -f 2 | cut -d ';' -f 1
I am not finding that sed 's/;$//' works. It doesn't trim anything, though I'm wondering whether it's because the character I'm trying to trim off happens to be a "$". What does work for me is sed 's/.\{1\}$//'.

Resources