I'm trying to remove unknown characters between 2 known markers from a variable using bash.
For example:
string="This text d #! more text jsdlj end and mo{re ;re end text.text"
I want to remove all the characters between the last occurrence of the word "text " (the last one that still has an "end" after it) and the first occurrence of "end" that follows it, keeping both markers.
result="This text d #! more text end and mo{re ;re end text.text"
I'll be using it as part of a find -print0 | xargs -0 bash -c 'command; command...etc.' script.
I've tried
echo $string | sed 's/[de][ex][ft][^\-]*//' ;
but that does it from the first "ext" and "-" (not the last "ext" before the end marker) and also does not retain the markers.
Any suggestions?
EDIT: So far the outcomes are as follows:
string="text text text lk;sdf;-end end 233-end.txt"
start="text "
end="-end"
Method 1
[[ $string =~ (.*'"${start}"').*('"${end}"'.*) ]] || :
nstring="${BASH_REMATCH[1]}${BASH_REMATCH[2]}" ;
echo "$nstring" ;
>"text text text -end.txt"
Required output = "text text text -end end 233-end.txt"
Method 2
temp=${cname%'"$end"'*}
nend=${cname#"$temp"}
nstart=${temp%'"$start"'*}
echo "$nstart$nend"
>"text text -end.txt"
Required output = "text text text -end end 233-end.txt"
Method 3
nstring=$(sed -E "s/(.*'"$start"').*('"$end"')/\1\2/" <<< "$string")
echo "$nstring";
>"text text text -end.txt"
Required output = "text text text -end end 233-end.txt"
Method 4
nstring=$(sed -En "s/(^.*'"$start"').*('"$end"'.*$)/\1\2/p" <<< "$string")
echo "$nstring" ;
>"text text text -end.txt"
Required output = "text text text -end end 233-end.txt"
Using Bash's Regex match:
#!/usr/bin/env bash
string='This text and more text jsdlj-end.text'
[[ $string =~ (.*text\ ).*(-end.*) ]] || :
printf %s\\n "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
UPDATE: question has been updated with more details for dealing with a string that contains multiple start and end markers.
The new input string:
This text d #! more text jsdlj end and mo{re ;re end text.text
Test case:
start marker = 'text'
end marker = 'end'
objective = remove all text between last start marker and before the first end marker (actually replace all said text with a single space)
Input with all markers in bold:
This **text** d #! more **text** jsdlj **end** and mo{re ;re **end** **text**.**text**
Input with the two markers of interest in bold:
This text d #! more **text** jsdlj **end** and mo{re ;re end text.text
Desired result:
This text d #! more text end and mo{re ;re end text.text
While we can use sed to remove the desired text (replace <space>jsdlj<space> with <space>), we have to deal with the fact that sed does greedy matching (fine for finding the 'last' start marker) but does not do non-greedy matching (needed to find the 'first' end marker). We can get around this limitation by switching out our end marker with a single-character replacement, simulate a non-greedy match, then switch back to the original end marker.
m1='text' # start marker
m2='end' # end marker
string="This text d #! more text jsdlj end and mo{re ;re end text.text"
sed -E "s/${m2}/#/g;s/(^.*${m1})[^#]*(#.*$)/\1 \2/;s/#/${m2}/g" <<< "${string}"
Where:
-E - enable Extended regex support (includes capture groups)
s/${m2}/#/g - replace our end marker with the single character # (OP needs to determine what character cannot show up in expected input strings)
(^.*${m1}) - 1st capture group; greedy match from start of string up to last start marker before ...
[^#]* - match everything that's not the # character
(#.*$) - 2nd capture group; everything from # character until end of string
\1 \2 - replace entire string with 1st capture group + <space> + 2nd capture group
s/#/${m2}/g - replace single character # with our end marker
This generates:
This text d #! more text end and mo{re ;re end text.text
Personally, I'd probably opt for a more straightforward parameter expansion approach (similar to Jetchisel's answer), but that could be a bit problematic for inline xargs processing.
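For comparison, the same test case can be handled with parameter expansion alone. This is only a sketch: the function name squeeze_between is made up for illustration, and it assumes both markers occur in the string (there is no handling for a missing marker).

```shell
# Hypothetical helper: collapse the text between the last start marker that
# still has an end marker after it and the first end marker that follows.
# Usage: squeeze_between STRING START_MARKER END_MARKER
# Assumption: both markers are present in STRING.
squeeze_between() (
    string=$1 m1=$2 m2=$3
    head=${string%"$m2"*}        # everything before the last end marker
    prefix=${head%"$m1"*}$m1     # cut back to the last start marker inside it
    rest=${string#"$prefix"}     # whatever follows that start marker
    tail=$m2${rest#*"$m2"}       # from the first end marker onwards
    printf '%s %s\n' "$prefix" "$tail"
)

squeeze_between "This text d #! more text jsdlj end and mo{re ;re end text.text" text end
# prints: This text d #! more text end and mo{re ;re end text.text
```

The quoted "$m1", "$m2" and "$prefix" inside the expansions are matched literally, so characters such as # or { in the input need no escaping. Inside an xargs bash -c '...' script the nested quoting still needs care.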
Original answer
One sed idea using capture groups:
$ string="This text and more text jsdlj-end.text"
$ sed -En 's/(^.*text ).*(-end.*$)/\1\2/p' <<< "${string}"
This text and more text -end.text
Where:
-En - enable Extended regex support (and capture groups) and (-n) disable default printing of pattern space
(^.*text ) - first capture group = start of line up to last text
.* - everything between the 2 capture groups
(-end.*$) - second capture group = from -end to end of string
\1\2/p - print the contents of the 2 capture groups.
Though this runs into issues if there are multiple -end strings on the 'end' of the string, eg:
$ string="This text and more text jsdlj-end -end.text"
$ sed -En 's/(^.*text ).*(-end.*$)/\1\2/p' <<< "${string}"
This text and more text -end.text
Whether this is correct or not depends on the desired output (and assuming this type of 'double' ending string is possible).
With Parameter Expansion.
string="This text and more text jsdlj-end.text"
temp=${string%-*}
end=${string#"$temp"}
start=${temp% *}
echo "$start$end"
This is a bit tricky using only a POSIX extended regex (ERE), but easy with a Perl-compatible regex (PCRE). Therefore, we switch from sed to perl.
To get the last text (that still has an end afterwards), put a .* in front. The closest end to that text can then be matched using a non-greedy .*?.
Here we also put \b around text and end to avoid matching parts of other words (for example, the word send should not be matched even though it contains end too).
perl -pe 's/(.*\btext\b).*?(\bend\b)/$1 $2/' <<< "$string"
Related
I'm trying to do some regex matching in bash.
I'd like to match multiple block of indented (space or tab) content, with the block itself starting with a keyword.
Some other content could be present in the file.
Using this sample content :
keyword aaa match1
Some other content
keyword ccc match2
    indentend content
    matching
Some other content
    with indendation
keyword ddd match2
    indented content still matching
I managed to come up with this regex: (^keyword.*(?:\n^\h+.*)*), which seems to work; everything matches as expected:
https://regex101.com/r/kvMlKK/1
Expected output would be to print every matches :
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
Unfortunately, I did not find a way to print all matches in bash. I can use grep/sed/awk/perl without any problem (edit: I meant that I have access to all these commands in the environment I am working with).
Edit:
grep -E --include \*.md '(^keyword.*(?:\n^\h+.*)*)' $(dirname "$0")/../_inbox/draft.md
Using grep it does not return the full match, only first line because of the lack of multi-line matching support I guess.
I am not familiar with awk/sed, I did not get any meaningful results (even if it seems to be better to use them for multi-line matching).
Edit: if that could work on multiple files that would be awesome
Thanks for your help!
You can do it in pure bash by looping over the input line by line, because a single bash regex match cannot return every multi-line block at once.
#!/bin/bash
# Flag to track whether inside indented block
indented=0
# Read input line by line
while IFS= read -r line; do
    # Check if line starts with keyword
    # ([[:blank:]] is space or tab; "\t" inside a bash string is not a tab)
    reg='^[[:blank:]]*keyword'
    if [[ $line =~ $reg ]]; then
        # Print line
        printf "%s\n" "$line"
        # Set flag to indicate inside indented block
        indented=1
    else
        # Check if line starts with whitespace and inside indented block
        reg='^[[:blank:]]+'
        if [[ $line =~ $reg && $indented -eq 1 ]]; then
            # Print line
            printf "%s\n" "$line"
        else
            # Reset flag to indicate outside indented block
            indented=0
        fi
    fi
done < "input"
You can do it in awk too:
awk '/^[ \t]*keyword/{p=1; print; next} p && /^[ \t]/{print; next} {p=0}' input
Or use sed (GNU sed syntax):
sed -n '/^[[:blank:]]*keyword/{:a;p;n;/^[[:blank:]]*keyword/ba;/^[[:blank:]]/ba;}' input
Using awk:
$ awk '!/^[\t ]/{p=0} /^keyword/{p=1} p' file
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
$
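For the "multiple files" part of the question: awk accepts several file arguments, and FNR restarts at 1 for each file, so the flag can be reset there to stop a block from spilling over into the next file. A minimal sketch, assuming the same "keyword" marker as above; the function name print_keyword_blocks is made up for illustration.

```shell
# Hypothetical helper: print every keyword block from each file given as an argument.
print_keyword_blocks() {
    awk '
        FNR == 1    { p = 0 }   # first line of a new file: forget any open block
        !/^[\t ]/   { p = 0 }   # an unindented line closes the current block
        /^keyword/  { p = 1 }   # a keyword line opens a block
        p                       # print the line while inside a block
    ' "$@"
}
```

It would be called as, for example, print_keyword_blocks *.md.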
I would like to replace the last 3 lines with another string, using sed, tr, or another bash solution.
Given file:
{
[
{
text text text
text text text
text text text
}
],
[
{
text text text
text text text
text text text
}
]
}
desired result:
{
[
{
text text text
text text text
text text text
}
],
[
{
text text text
text text text
text text text
bar
I tried this with sed
sed -i '' 's/\}\s+\]\s+\}/bar/g' foobar.hcl
tried this with tr
tr -s 's/\}[:blank:]\][:blank:]\}/bar/g' <foobar.hcl
With perl, you can read the entire input as a single string using the -0777 option. This is not suitable if the input is large enough to run out of available memory.
# this will replace all remaining whitespaces at the end
# with a single newline
perl -0777 -pe 's/\}\s+]\s+\}\s*\z/bar\n/' foobar.hcl
# this will preserve all remaining whitespaces, if any
perl -0777 -pe 's/\}\s+]\s+\}(?=\s*\z)/bar/' foobar.hcl
Once it is working, you can use perl -i -0777 ... for in-place editing.
This might work for you (GNU sed):
sed '1N;:a;N;/^\s*}\s*\n\s*]\s*\n}\s*$/{s//bar/;N;ba};P;D' file
Open a 3 line window and pattern match.
Using an array - assumes "text text text" has some actual nonspace, non-punctuation characters.
mapfile -t x < file                               # read the lines into an array
c=${#x[@]}                                        # count the lines
until [[ "${x[-1]}" =~ [^[:space:][:punct:]] ]]   # until the last line has data
do let c--                                        # one line fewer
   x=( "${x[@]:0:$c}" )                           # reassign array without last line
done
x+=( bar )                                        # add the desired string
printf '%s\n' "${x[@]}" > file                    # write file without unwanted lines
Allows for any number of blank lines &c. Even }]} and such, so long as it isn't on the same line with the data.
How can I just replace the last character (it's a }) from a string? I need everything before the last character but replace the last character with some new string.
I tried many things with awk and sed but didn't succeed.
For example:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
}'
should become:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cf2 Its red now
}'
This replaces the last occurrence of:
}
with
\\cf2 Its red now
}
sed would do this:
# replace '}' in the end
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/}$/\\cf2 Its red now}/'
# replace any last character
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/\(.\)$/\\cf2 Its red now\1/'
Replacing the trailing } could be done like this (with $ as the PS1 prompt and > as the PS2 prompt):
$ str="...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
> \\f0
> }"
$ echo "$str"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
}
$ echo "${str%\}}\cf2 It's red now
}"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
\cf2 It's red now
}
$
The first 3 lines assign your string to my variable str. The next 4 lines show what's in the string. The 2 lines:
echo "${str%\}}\cf2 It's red now
}"
contain a (grammar-corrected) substitution of the material you asked for, and the last lines echo the substituted value.
Basically, ${str%tail} removes the string tail from the end of $str; I remember it because "percent" ends in 't' for tail (and the analogous ${str#head} uses "hash", which starts with 'h' for head).
See shell parameter expansion in the Bash manual for the remaining details.
If you don't know the last character, you can use a ? metacharacter to match the end instead:
echo "${str%?}and the extra"
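As a quick illustration of the two operators, here is a small sketch; the sample string and the function name replace_last_char are made up for the example.

```shell
str='head-body-tail'
printf '%s\n' "${str%-tail}"   # head-body       (% trims a match from the tail)
printf '%s\n' "${str#head-}"   # body-tail       (# trims a match from the head)
printf '%s\n' "${str%?}X"      # head-body-taiX  (? matches any single character)

# Hypothetical helper: replace the last character of $1 with the string $2.
replace_last_char() {
    printf '%s\n' "${1%?}$2"
}

replace_last_char 'abc}' ']'   # abc]
```

Both forms remove the shortest match; the doubled operators %% and ## remove the longest one.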
First make a string with newlines
str=$(printf "%s\n%s\n%s" '\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural' '\\f0' "}'")
Now you look for the last } in your string and replace it including a newline.
The $ address makes sure it will only replace it on the last line; & stands for the matched string.
echo "${str}" |sed '$ s/}[^}]$/\\\\cf2 Its red now\n&/'
The above solution only works when the } is at the last line. It becomes more difficult when you also want to support str2:
str2=$(printf "Extra } here.\n%s\nsome other text" "${str}")
You cannot match the } on the last line any more. Removing the $ address for the last line will result in replacing all } characters (I added a } at the beginning of str2). You only want to replace the last one.
The /1 flag limits the substitution to the first match on a line. Replacing the last and not the first is done by reversing the order of lines with tac. Since you will tac again after the replacement, you need to use a different order in your sed replacement string.
echo "${str2}" | tac |sed 's/}[^}]$/&\n\\\\cf2 Its red now/1' |tac
In awk:
$ awk ' BEGIN { RS=OFS=FS="" } $NF="\\\\cf2 Its red now\n}"' file
RS="" puts awk in paragraph mode, so records are separated by blank lines and the sample is read as one record (change it to suit your needs)
OFS=FS="" separates characters each to its own field
$NF="\\\\cf2 Its red now\n}" replaces the char in the last field ($NF=}) with the quoted text
awk '{sub(/\\f0/,"\\f0\n\\\\\cfs Its red now")}1' file
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cfs Its red now
}'
I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
if line contains a; do
while read line; do
'if line does not contain b'
'append the line to output.txt'; fi
done
done
fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re
with open('in.txt') as f, open('out.txt', 'w') as output:
    # re.DOTALL lets "." match newlines, so passages spanning several lines are captured
    output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), re.DOTALL)))
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then make it executable (chmod +x yank.awk) and run it like so:
./yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
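The loop design from the question can also be written in plain shell. This is a minimal sketch, not a tuned solution: the function name extract_passages is made up, and it assumes the marker lines contain "START OF TEXT" and "END OF TEXT" as in the question's sample rather than the "PROJECT" wording used above.

```shell
# Hypothetical helper: read stdin and print only the lines strictly between
# a line containing the start marker and the next line containing the end marker.
extract_passages() (
    start='START OF TEXT'   # assumed marker text; adjust to the real files
    end='END OF TEXT'
    capture=0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *"$start"*) capture=1 ;;   # marker line itself is not printed
            *"$end"*)   capture=0 ;;
            *)  if [ "$capture" -eq 1 ]; then
                    printf '%s\n' "$line"
                fi ;;
        esac
    done
)
```

It would be used as extract_passages < in.txt >> output.txt. For thousands of files the awk and sed versions are likely to be noticeably faster than a shell read loop.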
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby
in_block = False
def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret
for lines_are_text,lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed the command-line length limit of your system.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing as a default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.
Is there a way to tell Visual Studio to only show the file name in the search results instead of the full path? By default it shows the full file path followed by the actual line.
If it did, I could make the results window take less space on the screen and wouldn't have to bother with scrolling to see the actual line (yes, the full paths are quite long).
Yes, it's possible, although it involves a registry hack:
Go to HKCU\Software\Microsoft\VisualStudio\10.0\Find
Create a new string value named Find result format with the value $f$e($l,$c):$t\r\n
Enjoy
Here are the values you can use in the string:
Files
$p path
$f filename
$v drive/unc share
$d dir
$n name
$e .ext
Location
$l line
$c col
$x end col if on first line, else end of first line
$L span end line
$C span end col
Text
$0 matched text
$t text of first line
$s summary of hit
$T text of spanned lines
Char
\n newline
\s space
\t tab
\\ backslash
\$ $