How do you display all the words that start with 1 uppercase letter using grep in bash? - bash

I tried something like this
sed 's/^[ \t]*//' file.txt | grep "^[A-Z].* "
but it shows only the lines whose first word starts with an uppercase letter.
file.txt content:
Something1 something2
word1 Word2
this is lower
The output is Something1 something2, but I would like it to also show the second line, because it also has a word that starts with an uppercase letter.

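With GNU grep, grep -P "[A-Z]+\w*" file.txt will work. Or, as @Shawn said in the comment below, grep -P '\b[A-Z]' file.txt will also work. If you only want the words, and not the entire line, grep -Po "[A-Z]+\w*" file.txt will give you the individual words. Note that "[A-Z]+\w*" is not anchored at a word boundary, so it also matches a capital in the middle of a word (the Bar in fooBar); the \b version only matches at the start of a word.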

With GNU grep, you can use
grep '\<[[:upper:]]' file
grep '\b[[:upper:]]' file
NOTE:
\< - a leading word boundary (\b is a word boundary)
[[:upper:]] - any uppercase letter.
See the online demo:
#!/bin/bash
s='Something1 something2
word1 Word2
this is lower
папа Петя'
grep '\<[[:upper:]]' <<< "$s"
Output:
Something1 something2
word1 Word2
папа Петя

How do you display all the words
That's simple:
grep -wo '[A-Z]\w*'
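A minimal sketch that combines the ideas above: \< to anchor at the start of a word, [[:upper:]] for the capital, and -o to print only the words. The function name capitalized_words and the sample lines are made up for illustration, and \< assumes GNU grep.

```shell
# Print each word that begins with an uppercase letter, one per line.
# \< is a GNU extension: it only matches at the start of a word,
# so the capital inside "fooBar" is not reported.
capitalized_words() {
  grep -o '\<[[:upper:]][[:alnum:]_]*' "$@"
}

words=$(printf '%s\n' \
  'Something1 something2' \
  'word1 Word2' \
  'this is lower' \
  'fooBar baz' | capitalized_words)
printf '%s\n' "$words"
```

With the sample input this should print Something1 and Word2, each on its own line.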

Related

How to delete a line of the text file from the output of checklist

I have a text file:
$100 Birthday
$500 Laptop
$50 Phone
I created a --checklist from the text file
[ ] $100 Birthday
[*] $500 Laptop
[*] $50 Phone
the output is $100 $50
How can I delete the line of $100 and $50 in the text file, please?
The expected output of text file:
$100 Birthday
Thank you!
with grep and cut
grep -xf <(grep '\[ ]' file2.txt | cut -d\ -f3-) file1.txt
with grep and sed
grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
explanation
use grep to select lines from text file
$ grep Birthday file1.txt
100 Birthday
cut splits a line into columns. -f 2 prints only the 2nd column, while -f 2- prints everything from the 2nd column on. The delimiter -d is a single space ' ' here (some characters must be escaped with \),
and we can use a pipe | as input (instead of a file)
$ echo one two three | cut -d \  -f 2-
two three
$ grep Birthday file1.txt | cut -d \  -f 2-
Birthday
(note the two whitespaces after the backslash: the first is the escaped delimiter, the second separates it from -f)
assuming we have a text file temp.txt
$ cat temp.txt
Birthday
Laptop
Phone
grep can also read list of search patterns from another file as input instead
$ grep -f temp.txt file1.txt
100 Birthday
500 Laptop
50 Phone
or we print the file content with cat and hand its output to grep as if it were a file, using <(...)
$ grep -f <(cat temp.txt) file1.txt
100 Birthday
500 Laptop
50 Phone
Now let's generate temp.txt from the checklist. You only want to grep the lines containing [ ] and cut starting from the 3rd column (again, some characters have a special meaning and must therefore be escaped: \[)
$ grep '\[ ]' file2.txt
[ ] 100 Birthday
$ grep '\[ ]' file2.txt | cut -d\ -f3-
100 Birthday
You don't need temp.txt and can therefore pass the list straight to grep -f with what is called process substitution, <(...)
$ grep -f <(grep '\[ ]' file2.txt | cut -d\ -f3-) file1.txt
100 Birthday
grep reads every line of temp.txt as a PATTERN, and some characters have a special meaning in a regex: ^ stands for the beginning of a line and $ for the end of a line. To be nitpicky, the correct search pattern should therefore be '^100 Birthday$' so it won't match "1100 Birthday 2".
You might have noticed that I dropped the $ currency sign from your input files for that reason. You can keep it, but then tell grep to take the patterns literally with -F, and add the -x flag so that only a whole line "100 Birthday" matches (no regex for line start/end needed). The two flags can be combined.
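To see what -F and -x each contribute, here is a small sketch that keeps the $ sign. The scratch directory, the file contents, and the variable names loose and exact are assumptions for illustration only.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
printf '%s\n' '$100 Birthday' '$100 Birthday 2' '$500 Laptop' > "$dir/file1.txt"
printf '%s\n' '$100 Birthday' > "$dir/patterns.txt"

# -F: patterns are fixed strings, but they may match anywhere in a line.
loose=$(grep -Ff "$dir/patterns.txt" "$dir/file1.txt")

# -F -x: fixed strings that must match the whole line.
exact=$(grep -Fxf "$dir/patterns.txt" "$dir/file1.txt")

printf 'with -F only:\n%s\n' "$loose"
printf 'with -F -x:\n%s\n' "$exact"
rm -r "$dir"
```

With -F alone both "$100 Birthday" lines are selected; adding -x leaves only the exact one.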
sed [OPTION] 's/regexp/replacement/command' [file]
sed is more common when it comes to text editing. Instead of grep | cut we can do it with one single command:
grep '\[ ]' | cut -d\ -f3- and sed 's/\[ ] *//'
basically target the same lines and delete [ ] from them.
There are however some extra flags required, because sed is a stream editor and prints the whole file by default. To emulate grep's behavior we use
-n option to suppress the automatic printing of every line
p flag to print only the lines where a substitution was made
and for regexp
\[ ] (text to replace)
' *' = ' ' (whitespace) + * (star)
meaning: the previous character repeated 0 or more times, in particular all trailing whitespaces
(replacement is empty because we just want to delete)
so a working, similar sed command will look like this
sed -n 's/\[ ] *//p' file2.txt
And that's in my opinion all it takes for a checklist. You have, however, two redundant files and want to match your cloned checklist against the original file, so let me show you more complicated things.
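Instead of deleting the checkbox, let's output captured groups. These small examples explain it better than I can. \1 stands for the first capture group ( ), \2 for the second, and so on (kind of internal variables). The -r flag (explained further down) lets us write the groups as plain ( ).
$ echo aaabcccdd | sed -r 's/(aaa)b(ccc)dd/\1/'
aaa
$ echo aaabcccdd | sed -r 's/(aaa)b(ccc)dd/\2/'
ccc
$ echo aaabcccdd | sed -r 's/(aaa)b(ccc)dd/\1 \2/'
aaa ccc
$ echo aaabcccdd | sed -r 's/(aaa)b(ccc)dd/lets \1 replace \2 this/'
lets aaa replace ccc this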
so in this example sed 's/\[ ] \(.*\)/\1/' we use for regexp (without -r, the group parentheses must be escaped as \( \))
\[ ] (text to replace)
' ' (trailing whitespace)
and inside the first capture group \( \) the desired "100 Birthday"
.* = . (dot) + * (star)
meaning: the previous character repeated 0 or more times (in particular a dot here)
but the dot . itself is regex for ANY char now (special meaning)
so the capture group is all the rest of the line
and for replacement we use (only)
\1 first capture group
$ sed -n 's/\[ ] \(.*\)/\1/p' file2.txt
100 Birthday
But there is more :)
Instead of matching only the ' ' space character, there are other regex constructs with a special meaning
\s will match a space or a tab (a GNU extension)
+ repeats the previous character 1 or more times (note the difference to *, 0 or more times); it belongs to extended regex
\s+ will match a series of whitespace characters
and to make it work we need one more flag
-r use extended regular expressions (a capture group is then written as plain ( ) instead of \( \))
so with this command you can extract all search patterns from your cloned checklist...
$ sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt
100 Birthday
...and finally let it run against your original file (without the need of temp.txt)
$ grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
100 Birthday
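The commands above only print the lines to keep; they do not change file1.txt. A hedged sketch of the missing last step follows, assuming the same file1.txt/file2.txt layout as in this answer. The scratch directory and the names patterns.txt and file1.new are invented for illustration, and a temporary pattern file is used instead of <(...) only so the sketch also runs in plain sh.

```shell
# Recreate the two files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '100 Birthday' '500 Laptop' '50 Phone' > "$dir/file1.txt"
printf '%s\n' '[ ] 100 Birthday' '[*] 500 Laptop' '[*] 50 Phone' > "$dir/file2.txt"

# Unchecked entries become the patterns ([[:space:]] is the POSIX spelling of \s).
sed -n 's/^\[ ][[:space:]]*//p' "$dir/file2.txt" > "$dir/patterns.txt"

# grep cannot edit in place: write the kept lines to a new file, then rename it.
# The rename is skipped when grep selects nothing (non-zero exit status).
grep -Fxf "$dir/patterns.txt" "$dir/file1.txt" > "$dir/file1.new" &&
  mv "$dir/file1.new" "$dir/file1.txt"

remaining=$(cat "$dir/file1.txt")
printf '%s\n' "$remaining"
rm -r "$dir"
```

After the run, file1.txt in the scratch directory holds only the line 100 Birthday.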

Sed to remove substring | Can I make a flexible pattern to remove numbers before tab?

I wanted to ask some advice on an issue that I'm having in removing a substring from a string. I have a file with many lines like the following:
DOG; CSQ| 0.1234 | abcd | \t CAT
where \t represents a literal tab.
My aim is to remove a substring by using sed 's/CSQ.*|//g' so that I can get the following output:
DOG; CAT
However I face a problem where all the rows aren't formatted the same. For example, I also get lines such as:
DOG; CSQ| 0.1234 | abcd | 0 \t CAT
DOG; CSQ| 0.1234 | abcd | 0.9187 \t CAT
My code fails at this point because instead of getting DOG; CAT for all lines, I get:
DOG; CAT
DOG; 0 CAT
DOG; 0.9187 CAT
I've searched for possible solutions but I'm having difficulty (I'm also quite new to bash). I imagine there's something that I can do with sed that will handle all cases, but I'm not sure.
You can find and replace all text from CSQ till the last | and all chars after that till the tab including it using
sed 's/CSQ.*|.*\t//' file > newfile
See the online demo.
The CSQ.*|.*\t is a BRE pattern (the \t escape is a GNU sed extension, not POSIX) that matches
CSQ - a CSQ string
.* - any text
| - a pipe char
.* - any text
\t - TAB char.
If the \t are two-char combinations (a backslash followed by t), double the backslash before t:
sed 's/CSQ.*|.*\\t//' file > newfile
See this online demo.
So optionally match it.
sed 's/CSQ.*|\( [0-9.]*\)\?//g'
You can have fun learning regex online with regex crosswords.
awk makes this pretty easy.
$: awk '/CSQ.*\t/{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
Note that the file has to contain actual tabs, not \t sequences. Inside the awk regex, \t stands for a real tab character.
If there are no other formatted lines in the file that you want, then maybe just
$: awk '{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
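Since \t inside a sed pattern is a GNU extension, here is a sketch that puts a real tab character into the pattern instead. It assumes the lines contain actual tabs; the function name strip_csq and the sample lines are made up for illustration.

```shell
# Hold a literal tab in a variable so the sed pattern needs no \t escape.
tab=$(printf '\t')

# Remove everything from CSQ through the last pipe, plus whatever
# follows it up to and including the tab.
strip_csq() {
  sed "s/CSQ.*|[^${tab}]*${tab}//" "$@"
}

cleaned=$(printf 'DOG; CSQ| 0.1234 | abcd |\tCAT\nDOG; CSQ| 0.1234 | abcd | 0\tCAT\nDOG; CSQ| 0.1234 | abcd | 0.9187\tCAT\n' | strip_csq)
printf '%s\n' "$cleaned"
```

All three sample lines should come out as DOG; CAT.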

Match multiple patterns with grep and print only the matched patterns

I have a file that looks like
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
I want to extract out field1 and field2 from each line of the file in bash. I want field1 and field2 to appear in the same line for each line. So the output should look like-
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
I wrote a grep expression like -
grep -E '"field1":"[a-z]*".*"field2":"[a-z]*"' -o
But because of the .* in between, it outputs all the text between those two expressions. I also tried
grep -E '"field1":"[a-z]*"|"field2":"[a-z]*"' -o
But this outputs every field1 and every field2 on a separate line.
How do I get the expected output?
You can use grep with awk to format the result:
grep -oE '"(field1|field2)":"[^"]*"' file | awk 'NR%2{p=$0; next} {print p, $0}'
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
use sed:
echo abcdef | sed 's/\(.\).*\(.\)/\1\2/'
# yields: af
for your situation:
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/' yourfile
if some lines don't match at all, then do your grep first, e.g.,
grep -Eo '"field1":"[a-z]*".*"field2":"[a-z]*"' yourfile |
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/'
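As a variation on the two-step grep | sed above, sed can drop the non-matching lines itself with -n and the p flag. This is only a sketch: it swaps [a-z]* for the [^"]* class from the grep answer so that values with digits or dashes are also captured, and the function name extract_fields and the sample lines are invented.

```shell
# Print "field1" and "field2" side by side; lines lacking either one
# are suppressed because -n disables automatic printing and p only
# prints lines where the substitution succeeded.
extract_fields() {
  sed -n 's/.*\("field1":"[^"]*"\).*\("field2":"[^"]*"\).*/\1 \2/p' "$@"
}

pairs=$(printf '%s\n' \
  'xx "field1":"abc" yy "field2":"d-9" zz' \
  'a line without any fields' | extract_fields)
printf '%s\n' "$pairs"
```

Only the first sample line produces output: "field1":"abc" "field2":"d-9".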

How can I remove a pattern from the beginning of lines between two words using sed or awk

I want to remove a pattern at the beginning of each line of a paragraph that contains word1 in its first line and ends with word2. For example, given the following file, I want to substitute --MW with nothing
--MW Word1 this is paragraph number 1
--MW aaa
--MW bbb
--MW ccc
--MW word2
I want to get as result :
Word1 this is paragraph number 1
aaa
bbb
ccc
word2
Thanks in advance
Using sed
sed '/Word1/,/word2/s/--MW //' file
Using awk
awk '/Word1/,/word2/{sub(/--MW /,a)}1' file
Both act on the lines between and including the matched phrases and do a substitution on each of those lines. They print all lines.
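A small sketch of the range form, with the pattern additionally anchored by ^ so that only a leading --MW is removed. The function name and the two extra lines before and after the paragraph are made up, to show that lines outside the range keep their prefix.

```shell
# Strip a leading "--MW " only on lines from the first /Word1/ match
# through the next /word2/ match; every line is still printed.
strip_prefix_in_range() {
  sed '/Word1/,/word2/s/^--MW //' "$@"
}

result=$(printf '%s\n' \
  '--MW before the paragraph' \
  '--MW Word1 this is paragraph number 1' \
  '--MW aaa' \
  '--MW word2' \
  '--MW after the paragraph' | strip_prefix_in_range)
printf '%s\n' "$result"
```

The first and last lines keep --MW; the three lines in between lose it.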
If you have your text in myfile.txt you could try:
awk 'BEGIN{f=0}$2=="Word1"{f=1}{if (f==1) {$1="";print $0}else{print $0}}$2=="word2"{f=0}' myfile.txt
If you are sure the pattern is going to be in the beginning of the line, then this command might help:
sed 's/^--MW //' file.txt
Please test and let us know if this worked fine with you.
Hopefully, this will do it for you:
$ echo "--MW Word1 this is paragraph number 1" | cut -d ' ' -f 2-
Here you pass the text to the cut command and remove the first token, using a space as the token separator, while keeping the rest of the tokens, i.e., from the second to the end.

How to retrieve digits including the separator "."

I am using grep to get a string like this: ANS_LENGTH=266.50 then I use sed to only get the digits: 266.50
This is my full command: grep --text 'ANS_LENGTH=' log.txt | sed -e 's/[^[[:digit:]]]*//g'
The result is : 26650
How can this line be changed so the result still shows the separator: 266.50
You don't need grep if you are going to use sed. Just use sed's /.../ address to match the lines you need to print.
sed -n '/ANS_LENGTH/s/[^=]*=\(.*\)/\1/p' log.txt
-n will suppress printing of lines that do not match /ANS_LENGTH/
Using a captured group, we print the value next to the = sign.
The p flag at the end prints the lines that match our /.../ address.
If your grep happens to support -P option then you can do:
grep -oP '(?<=ANS_LENGTH=).*' log.txt
(?<=...) is a look-behind construct: the match must be preceded by ANS_LENGTH=, but that prefix is not part of the match. This requires the -P option
-o allows us to print only the value part.
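You need to keep the literal dot as well as the digits.
Try sed -e 's/[^[:digit:].]//g'
Inside a bracket expression the dot is already a literal dot, so it needs no backslash. The character class is written [:digit:] inside a single pair of brackets; the extra [ and ] in your original expression end up being treated as ordinary characters.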
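A quick sketch to compare the two bracket expressions on the sample value from the question. The variable names are made up.

```shell
value='ANS_LENGTH=266.50'

# Delete everything that is not a digit: the dot is lost.
digits_only=$(printf '%s\n' "$value" | sed 's/[^[:digit:]]//g')

# Add a literal dot to the set of characters to keep.
with_dot=$(printf '%s\n' "$value" | sed 's/[^[:digit:].]//g')

printf '%s\n%s\n' "$digits_only" "$with_dot"
```

This should print 26650 followed by 266.50.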
Here is some awk example:
cat file:
some data ANS_LENGTH=266.50 other=22
not mye data=43
gnu awk (due to RS)
awk '/ANS_LENGTH/ {f=NR} f&&NR-1==f' RS="[ =]" file
266.50
awk '/ANS_LENGTH/ {getline;print}' RS="[ =]" file
266.50
Plain awk
awk -F"[ =]" '{for(i=1;i<=NF;i++) if ($i=="ANS_LENGTH") print $(i+1)}' file
266.50
awk '{for(i=1;i<=NF;i++) if ($i~"ANS_LENGTH") {split($i,a,"=");print a[2]}}' file
266.50
