Search first occurrence and print until next delimiter, but match whole word only - bash

I have a file with multiple lines of text similar to:
foo
1
2
3
bar
fool
1
2
3
bar
food
1
2
3
bar
So far the following gives me a closer answer:
sed -n '/foo/,/bar/ p' file.txt | sed -e '$d'
...but it fails by introducing duplicates if it encounters words like "food" or "fool". I want to make the code above do a whole word match only (i.e. grep -w), but inserting the \b switch doesn't seem to work:
sed -n '/foo/\b,/bar/ p' file.txt | sed -e '$d'
I would like to print anything after "foo" (including the first foo) up until "bar", but matching only "foo", and not "foo1".

Use the Regex tokens ^ and $ to indicate the start and end of a line respectively:
sed -n '/^foo$/,/^bar$/ p' file.txt

sed -n '/\<foo\>/,/\<bar\>/ p' file.txt
Or maybe this, if foo and bar have to be the first word on a line:
sed -n '/^\<foo\>/,/^\<bar\>/ p' file
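A minimal sketch for trying the whole-word version on the sample from the question. It assumes GNU sed (\< and \> are GNU extensions) and uses a hypothetical temporary file in place of file.txt. The q command is one possible way to stop after the first foo...bar block, it is not part of the answers above.

```shell
# Hypothetical sample input mirroring the question.
sample=$(mktemp)
printf '%s\n' foo 1 2 3 bar fool 1 2 3 bar food 1 2 3 bar > "$sample"

# Print from the whole-word "foo" line through the next "bar" line,
# then quit so that only the first block is printed.
first_block=$(sed -n '/\<foo\>/,/\<bar\>/{p;/\<bar\>/q;}' "$sample")
printf '%s\n' "$first_block"

rm -f "$sample"
```

Piping the result through sed '$d', as in the question, would drop the closing bar line.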

Related

How to delete a line of the text file from the output of checklist

I have a text file:
$100 Birthday
$500 Laptop
$50 Phone
I created a --checklist from the text file
[ ] $100 Birthday
[*] $500 Laptop
[*] $50 Phone
the output is $100 $50
How can I delete the line of $100 and $50 in the text file, please?
The expected output of text file:
$100 Birthday
Thank you!
with grep and cut
grep -xf <(grep '\[ ]' file2.txt | cut -d' ' -f3-) file1.txt
with grep and sed
grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
explanation
use grep to select lines from text file
$ grep Birthday file1.txt
100 Birthday
cut splits each line into columns. -f 2 prints only the 2nd column, while -f 2- prints everything from the 2nd column onward. The delimiter given with -d is a single space here; the shell would otherwise swallow it, so it must be quoted as ' ' or escaped with \
and we can use a pipe | as input (instead of a file)
$ echo one two three | cut -d ' ' -f 2-
two three
$ grep Birthday file1.txt | cut -d ' ' -f 2-
Birthday
(in the escaped form, cut -d \  -f 2-, note the two whitespaces after the backslash: the first is the delimiter, the second separates the arguments)
assuming we have a text file temp.txt
$ cat temp.txt
Birthday
Laptop
Phone
grep can also read a list of search patterns from another file instead
$ grep -f temp.txt file1.txt
100 Birthday
500 Laptop
50 Phone
or we print the file content with cat and hand its output to grep via process substitution <(...)
$ grep -f <(cat temp.txt) file1.txt
100 Birthday
500 Laptop
50 Phone
Now let's generate temp.txt from the checklist. You only want to grep lines containing [ ] and cut starting from the 3rd column (again, some characters have special meaning and must therefore be escaped: \[)
$ grep '\[ ]' file2.txt
[ ] 100 Birthday
$ grep '\[ ]' file2.txt | cut -d' ' -f3-
100 Birthday
You don't need temp.txt and can therefore feed the list straight to grep -f, using what is called process substitution <(...)
$ grep -f <(grep '\[ ]' file2.txt | cut -d' ' -f3-) file1.txt
100 Birthday
grep reads every line from temp.txt as a PATTERN, and some characters have special meaning in a regex: ^ stands for the beginning of a line and $ for the end. To be nitpicky, the correct search pattern should therefore be '^100 Birthday$', so that it won't match "1100 Birthday 2".
You might have noticed that I dropped the $ currency sign from your input files for a reason. You can keep it, but then tell grep to take all patterns literally with -F, and add -x so that only whole lines such as "100 Birthday" match (no regex for line start/end needed)
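To see the difference the flags make, here is a small sketch. The extra line "1100 Birthday 2" and the temporary directory are hypothetical test data, not part of the original files.

```shell
# Hypothetical test data in a scratch directory.
work=$(mktemp -d)
printf '%s\n' '100 Birthday' '1100 Birthday 2' > "$work/file1.txt"
printf '%s\n' '100 Birthday' > "$work/patterns.txt"

# Plain -f treats each pattern as a regex that may match anywhere in a line.
loose=$(grep -f "$work/patterns.txt" "$work/file1.txt")

# -F takes the patterns literally, -x requires the whole line to match.
strict=$(grep -Fxf "$work/patterns.txt" "$work/file1.txt")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$work"
```

Only the strict form leaves out the "1100 Birthday 2" line.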
sed [OPTION] 's/regexp/replacement/command' [file]
sed is more common when it comes to text editing. Instead of grep | cut we can do it with one single command:
grep '\[ ]' | cut -d' ' -f3- and sed 's/\[ ] *//'
basically target the same lines and delete the [ ] from them.
There are, however, some extra flags required, because sed is a stream editor and prints every line of the file by default. To emulate grep's behavior we use
-n option to suppress the automatic printing of every line
p flag to print only the lines where a substitution was made
and for the regexp
\[ ] (text to replace)
' *' = ' ' (whitespace) + * (star)
meaning: the previous character repeated 0 or more times, in particular all the whitespace following the checkbox
(the replacement is empty because we just want to delete)
so a working sed command that behaves similarly looks like this
sed -n 's/\[ ] *//p' file2.txt
And that's, in my opinion, all it takes for a checklist. However, you have two redundant files and want to match your cloned checklist against the original file, so let me show you more complicated things.
Instead of deleting the checkbox, let's output captured groups. These examples explain it better than I can. \1 stands for the first capture group, \2 for the second, and so on (kind of like internal variables). In sed's default (basic) regex syntax a group is written \( \)
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\1/'
aaa
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\2/'
ccc
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\1 \2/'
aaa ccc
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/lets \1 replace \2 this/'
lets aaa replace ccc this
so in this example, sed 's/\[ ] \(.*\)/\1/', we use for the regexp
\[ ] (text to replace)
' ' (the whitespace after it)
and inside the first capture group \( \) the desired "100 Birthday"
.* = . (dot) + * (star)
meaning: the previous character repeated 0 or more times (in particular a dot here)
but the dot . itself is regex for ANY char now (special meaning)
so the capture group is all the rest of the line
and for the replacement we use (only)
\1 first capture group
$ sed -n 's/\[ ] \(.*\)/\1/p' file2.txt
100 Birthday
But there is more :)
Instead of matching only ' ' whitespace, there are further regex tokens with special meaning (\s is a GNU extension)
\s will match whitespace, including space and tab
+ the previous character repeated 1 or more times (note the difference to *, 0 or more times)
\s+ will match a series of whitespace characters
and to make it work we need one more flag
-r use extended regular expressions, in which + and ( ) work without backslashes
so with this command you can extract all search patterns from your cloned checklist...
$ sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt
100 Birthday
...and finally let it run against your original file (without the need of temp.txt)
$ grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
100 Birthday
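The commands above only print the result. If the goal is to actually rewrite the text file, one possible sketch is to write to a temporary file and move it over the original. This assumes, as in the expected output of the question, that only the unchecked entries should remain. The scratch directory and the patterns.txt and file1.new names are hypothetical. The $ signs are kept here, which is why grep gets -F.

```shell
# Hypothetical copies of the two files from the question, with the $ signs kept.
work=$(mktemp -d)
printf '%s\n' '$100 Birthday' '$500 Laptop' '$50 Phone' > "$work/file1.txt"
printf '%s\n' '[ ] $100 Birthday' '[*] $500 Laptop' '[*] $50 Phone' > "$work/file2.txt"

# Extract the text of every unchecked entry.
sed -En 's/^\[ \][[:space:]]+(.*)/\1/p' "$work/file2.txt" > "$work/patterns.txt"

# Keep only the lines that match one of those entries exactly and literally,
# then replace the original file with the filtered copy.
grep -Fxf "$work/patterns.txt" "$work/file1.txt" > "$work/file1.new"
mv "$work/file1.new" "$work/file1.txt"

result=$(cat "$work/file1.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

Writing to a separate file first avoids truncating file1.txt while grep is still reading it.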

Sed to remove substring | Can I make a flexible pattern to remove numbers before tab?

I wanted to ask some advice on an issue that I'm having in removing a substring from a string. I have a file with many lines like the following:
DOG; CSQ| 0.1234 | abcd | \t CAT
where \t represents a literal tab.
My aim is to remove a substring by using sed 's/CSQ.*|//g' so that I can get the following output:
DOG; CAT
However I face a problem where all the rows aren't formatted the same. For example, I also get lines such as:
DOG; CSQ| 0.1234 | abcd | 0 \t CAT
DOG; CSQ| 0.1234 | abcd | 0.9187 \t CAT
My code fails at this point because instead of getting DOG; CAT for all lines, I get:
DOG; CAT
DOG; 0 CAT
DOG; 0.9187 CAT
I've searched for possible solutions but I'm having difficulty (I'm also quite new to bash). I imagine there's something that I can do with sed that will handles all cases but I'm not sure.
You can find and remove all text from CSQ up to the last |, plus all chars after that up to and including the tab, using
sed 's/CSQ.*|.*\t//' file > newfile
See the online demo.
The CSQ.*|.*\t pattern uses BRE syntax (the \t escape is a GNU sed extension) and matches
CSQ - a CSQ string
.* - any text
| - a pipe char
.* - any text
\t - TAB char.
If the \t are literal two-char combinations (a backslash followed by t), double the backslash before t:
sed 's/CSQ.*|.*\\t//' file > newfile
See this online demo.
The number after the last | is not always there, so match it optionally.
sed 's/CSQ.*|\( [0-9.]*\)\?//g'
You can have fun learning regex online with regex crosswords.
awk makes this pretty easy.
$: awk '/CSQ.*\t/{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
Note that the file has to contain actual tabs, not literal \t sequences; awk interprets the \t in the pattern as a tab character.
If every line in the file has this format, then maybe just
$: awk '{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
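To check the sed approach from the first answer against real tab characters, here is a hedged sketch. The sample lines are rebuilt from the question, and the tab variable is an assumption that keeps the pattern working on sed versions that do not understand \t. The trailing " *" also eats the space before CAT, which goes slightly beyond the original command.

```shell
# Hypothetical sample lines; printf turns \t into a real tab character.
input=$(printf 'DOG; CSQ| 0.1234 | abcd | \t CAT\nDOG; CSQ| 0.1234 | abcd | 0 \t CAT\nDOG; CSQ| 0.1234 | abcd | 0.9187 \t CAT')

# Keep a literal tab in a variable so the pattern does not rely on \t support.
tab=$(printf '\t')

# Remove from CSQ through the last |, everything up to the tab,
# the tab itself, and any spaces that follow it.
cleaned=$(printf '%s\n' "$input" | sed "s/CSQ.*|[^${tab}]*${tab} *//")
printf '%s\n' "$cleaned"
```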

unix sed substitute nth occurence misfunction?

Let's say I have a string which contains multiple occurrences of the letter Z.
For example: aaZbbZccZ.
I want to print parts of that string, each time up to the next occurrence of Z:
aaZ
aaZbbZ
aaZbbZccZ
So I tried using unix sed for this, with the command sed s/Z.*/Z/i, where i is an index running from 1 to the number of Z's in the string. As far as my understanding of sed goes, this should delete everything that comes after the i'th Z. In practice, however, it only works for i=1, as in sed s/Z.*/Z/, but not as I increment i: sed s/Z.*/Z/2, for example, just prints the entire original string. It feels as if I am missing something about how sed works, since according to multiple manuals it should work.
edit: for example, when applying sed s/Z.*/Z/2 to the string aaZbbZccZ, I expect to get aaZbbZ, as everything after the 2nd occurrence of Z gets deleted.
The sed command below comes close to what you are looking for, except that it also removes the last Z, which the second expression then appends again.
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//1g;s/$/Z/'
aaZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//2g;s/$/Z/'
aaZbbZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//3g;s/$/Z/'
aaZbbZccZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//4g;s/$/Z/'
aaZbbZccZddZ
Edit:
Modified according to Aaron suggestion.
Edit2:
If you don't know how many Z's there are in the string, it's safer to use the command below. Otherwise an additional Z is added at the end.
-r - enables extended regular expressions
-e - separates sed operations, the same as ; but easier to read in my opinion.
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//1g' -e 's/([^Z])$/\1Z/'
aaZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//2g' -e 's/([^Z])$/\1Z/'
aaZbbZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//3g' -e 's/([^Z])$/\1Z/'
aaZbbZccZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//4g' -e 's/([^Z])$/\1Z/'
aaZbbZccZddZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//5g' -e 's/([^Z])$/\1Z/'
aaZbbZccZddZ
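As for why the attempt in the question prints the whole string: Z.* is greedy and already swallows everything from the first Z to the end of the line, so the line contains only one match and a 2nd occurrence never exists. A rough sketch of the loop from the question, built on the Z[^Z]* idea above, could look like this. It assumes GNU sed (combining a number with g is a GNU extension), and print_prefixes is a hypothetical helper name.

```shell
# Hypothetical helper: print the prefix up to each successive Z of the given string.
print_prefixes() {
  s=$1
  # Count the Z characters by deleting every other character.
  only_z=$(printf '%s' "$s" | tr -cd 'Z')
  n=${#only_z}
  i=1
  while [ "$i" -le "$n" ]; do
    # Drop the i-th "Z plus following non-Z characters" chunk and all later ones,
    # then put the closing Z back.
    printf '%s\n' "$s" | sed "s/Z[^Z]*//${i}g;s/\$/Z/"
    i=$((i + 1))
  done
}

print_prefixes 'aaZbbZccZ'
```

Because the loop never runs past the number of Z's, the extra trailing Z from the 4g example above does not show up.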
This should do what you expect (see comments) unless your string can contain line breaks:
# -n suppresses automatic printing
echo 'aaZbbZccZ' | sed -n '{
# Add a line break after each Z
s/Z/Z\
/g
# Print the result so the next sed command can consume it
p
}' | sed -n '{
# Copy the first line into the hold buffer
1 {
h
}
# For all remaining lines
2,$ {
# Swap the pattern space and the hold buffer
x
# Remove the line break
s/\n//
# Print the accumulated result
p
# Swap back so the current line is in the pattern space again
x
# Append the current line, preceded by a newline, to the hold buffer
H
}
}'
The idea is to break the string into multiple lines (each terminated with Z), which are processed one by one by the second sed command.
In the second sed we use the hold buffer to remember previous lines: print the aggregated result, append the new line, and each time remove the line break we previously added.
And the output is
aaZ
aaZbbZ
aaZbbZccZ
This might work for you (GNU sed):
sed -n 's/Z/&\n/g;:a;/\n/P;s/\n\(.*Z\)/\1/;ta' file
Use sed's grep-like option -n so that content is printed only explicitly. Append a newline after each Z. If there were no substitutions, then there is nothing to be done. Print up to the first newline, remove the first newline if the following characters contain a Z, and repeat.

Replace the last six spaces with comma

How would I replace the last six spaces on each line of a text file with commas using bash?
I have:
$ cat myfile
foo bar foo 6 1 3 23 1 20
foo bar 6 1 2 18 1 15
foo 5 5 0 15 1 21
What I want is:
$ cat myfile
foo bar foo,6,1,3,23,1,20
foo bar,6,1,2,18,1,15
foo,5,5,0,15,1,21
Any help is appreciated! Thanks!
It looks like the rule could be to substitute any space before a digit for a comma:
sed 's/ \([0-9]\)/,\1/g' file
Alternatively, following your specification (replace the last six spaces), you could go for something like this:
awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NF-6?FS:(i<NF?",":RS))}' file
This loops through the fields in the input, printing each one followed by either a space (FS), a comma, or a newline (RS), depending on how close it is to the end of the line.
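The same loop can be wrapped so that the number of separators is a parameter. This is only a sketch: replace_last_spaces and the awk variable n are hypothetical names. Like the original, it relies on awk's default field splitting, so runs of several spaces would be collapsed to one.

```shell
# Hypothetical wrapper: $1 is how many trailing separators become commas.
replace_last_spaces() {
  awk -v n="$1" '{
    for (i = 1; i <= NF; ++i)
      # Space before the last n separators, comma for the last n, newline at the end.
      printf "%s%s", $i, (i < NF - n ? FS : (i < NF ? "," : RS))
  }'
}

printf '%s\n' 'foo bar foo 6 1 3 23 1 20' 'foo bar 6 1 2 18 1 15' 'foo 5 5 0 15 1 21' |
  replace_last_spaces 6
```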
A more complete sed solution, with rev added (to reverse each line), might be
rev myfile | sed 's/ /,/; s/ /,/; s/ /,/; s/ /,/; s/ /,/; s/ /,/' | rev
The sed part that replaces the first occurrences can of course be simplified if needed!
This might work for you (GNU sed):
sed -r ':a;s/(.*) /\1,/;x;s/^/x/;/^x{6}/{z;x;b};x;ba' file
This uses greedy matching to find the last space on a line, and keeps track of the number of spaces replaced with a counter in the hold space.

using sed how to put space after numbers in a big string

This question was asked in an interview and I could not answer it, so I am looking for some help here to understand the logic, i.e. how to put a space between a number string and a character string.
Given the string "1abc2abcd3efghi10z11jkl100pqrs", what command would you use to get the following result:
"1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs"
Thanks in advance.
Here is another -- yet simple -- way to think about it:
echo "1abc2abcd3efghi10z11jkl100pqrs" | \
sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g'
This adds a whitespace at each digit-to-letter and letter-to-digit boundary.
() captures a group, and \1 and \2 refer back to the first and second captured groups.
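Because this version only inserts a space at a boundary, it needs no cleanup of leading or trailing spaces. A short sketch on a hypothetical input that starts with letters and ends with a digit (-E is used here instead of -r, both enable extended regular expressions in GNU sed):

```shell
# Hypothetical input that begins with letters and ends with a digit.
spaced=$(echo 'abc12def3' |
  sed -E 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g')
printf '%s\n' "$spaced"
```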
With GNU sed:
$ echo "1abc2abcd3efghi10z11jkl100pqrs" | sed -e 's/[0-9]\+/ & /g' -e 's/^ \| $//g'
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
With awk:
$ echo "1abc2abcd3efghi10z11jkl100pqrs" | awk '{gsub(/[0-9]+/," & ",$0); $1=$1}1'
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
gsub will substitute every number with itself, surrounded by a space before and after.
$1=$1 will recompute the entire line, joining the fields with OFS (by default a single space), which also drops the leading and trailing spaces.
I would have chosen sed over awk:
echo "1abc2abcd3efghi10z11jkl100pqrs" | sed 's/[0-9]\+/ & /g; s/^[ ]//; s/[ ]$//'
It surrounds each run of digits with spaces and afterwards removes the (possibly) leading and trailing ones.
It yields:
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
echo 1abc2abcd3efghi10z11jkl100pqrs | \
sed -r -e 's/([[:digit:]]+)/ \1 /g' -e 's/^ *//g' -e 's/ *$//g'
Take the expression -e 's/([[:digit:]]+)/ \1 /g' first.
The parentheses around [[:digit:]]+ 'capture' each sequence of one or more digits. Since it's the first capture group, it's referenced in the substitution by \1 (then there's the space before and after:  \1 ).
The g tells sed to perform this substitution 'globally' on the input.
The -r before the expression tells sed to use extended regular expressions.
The other two 'expressions' (each expression has -e before it to show that it's an expression):
-e 's/^ *//g' will remove leading whitespace, and -e 's/ *$//g' will remove trailing whitespace.
Using perl:
echo 1abc2abcd3efghi10z11jkl100pqrs | perl -F'(\d+)' -ane \
'$F[0] and print "@F\n" or print "@F[1..$#F]"'
Some explanation:
-an together tell Perl to split each line of input and put the resulting fields into the array @F.
-F specifies a delimiter of one or more digits to use with -an to split the input. The parentheses cause the delimiters themselves to be stored in the array, not just the strings they separate.
-e specifies the code to run after each line is read. We simply want to print the contents of @F, with the default field separator (space) used to separate elements of the array. The and...or combination is used to ignore the first field if it is empty, as it will be if the input line starts with a delimiter.
