How to substitute the last occurrence of a pattern in a large file efficiently - Windows

Given a file with the following contents:
<root>
<a></a>
<b></b>
</root>
The command should output:
<root>
<a></a>
<b></b>
Things I've tried using the GNU Win32 port of sed:
Remove the last two lines.
This is fast, but it assumes </root> is the second to last line and will cause a bug if it's not.
sed -e '$d' test.xml | sed -e '$d'
Substituting all occurrences of </root> with an empty string.
This works, but is slower than the first solution, and will break if there are nested <root> elements (unlikely).
sed -e 's|</root>||' test.xml
The file I'm dealing with can be large so efficiency is important.
Is there a way to limit sed substitution to the last occurrence in the file? Or is there some other utility that would be faster?

Using Perl with File::ReadBackwards should be very fast (relative, I know, but still...). perlfaq5 has a topic on going through a file backwards and removing lines. You can check for your pattern using this topic's code as a starting point.
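As a rough shell-only approximation of the read-backwards idea (not the Perl module itself), the file can be reversed line by line with tac, the first match removed, and the result reversed again. This is only a sketch: it assumes tac (GNU coreutils) and awk are available, the function name strip_last_root is made up for illustration, and the whole file still passes through the pipeline.

```shell
# Hypothetical helper: remove the last </root> in the file given as $1.
strip_last_root() {
  tac "$1" | awk '
    # In the reversed stream the last occurrence is the first one seen.
    !done && sub(/<\/root>/, "") {
      done = 1
      if ($0 == "") next      # drop the line if nothing is left on it
    }
    { print }
  ' | tac
}
```

A line that becomes empty after the substitution is dropped, which matches the expected output in the question. If one line holds several </root> tags, sub() removes the leftmost one on that line.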

With sed:
sed -e ':a;N;$!ba;s|\(.*\)</root>\n\(.*\)|\1\2|'

How about using awk for this.
AWK:
awk '/^<\/root>$/{next}/<\/root>/{sub(/<\/root>/,"");print;next}1' filename
The first /pattern/{action} statement looks for lines containing only </root>. If the pattern matches, the action skips the line.
The second /pattern/{action} statement looks for lines containing </root> anywhere in the line. If the pattern matches, the sub function replaces it with nothing and the rest of the line is printed.
The third statement, 1, is a pattern that is always true; it is reached only by lines that do not contain </root>, and it prints them unchanged.
I did a quick test and this was the result -
Test:
[jaypal:~/Temp] cat tmp
<root>
<a></a>
<b></b>
</root>
<root>
<a></a>
<b></b>
</root><root>
<a></a>
<b></b></root>
[jaypal:~/Temp] awk '/^<\/root>$/{next}/<\/root>/{sub(/<\/root>/,"");print;next}1' tmp
<root>
<a></a>
<b></b>
<root>
<a></a>
<b></b>
<root>
<a></a>
<b></b>
SED:
This should also work, though it will remove every </root> and not just the last occurrence.
sed '/<\/root>/,$s///' filename

This might work for you:
sed '/<\/root>/,/<root>/{/<\/root>/{h;d};H;//{x;p};${x;s/[^\n]*\n//p};d}' file
This assumes that each <root> tag is matched with a closing </root> tag and that these tags occur on separate lines (as per the example).
Explanation:
1. Focus on lines between a closing </root> tag and an opening <root> tag or end-of-file.
2. If it is a closing </root> tag, save it in the hold space (HS), then delete it and start a new cycle.
3. For all other lines within focus (see point 1), append them to the HS.
4. If it is an opening <root> tag, swap to the HS and print out its contents.
5. If it is the end-of-file, i.e. between a </root> tag and the last line of the file, swap to the HS, delete the first line (the closing </root> tag) and print the remainder.
6. For all lines within the focus, delete and start a new cycle.
An alternative solution with two passes:
sed -n '/<\/root>/=' file | sed -n '$s/$/d/p' | sed -f - file
Explanation:
Print out the line numbers of closing </root> tags
Generate a sed delete command from the last matched line number.
Pipe the command through to an instance of sed reading the source file.
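The same two-pass idea can be sketched with awk alone: the first pass records the number of the last matching line, the second pass edits only that line. drop_last_match is a hypothetical helper name, the pattern is treated as a regular expression, and only the leftmost match on that last line is removed.

```shell
# Hypothetical helper: drop_last_match PATTERN FILE
drop_last_match() {
  pattern=$1
  file=$2
  # Pass 1: remember the number of the last line that matches (0 if none).
  last=$(awk -v pat="$pattern" '$0 ~ pat { n = NR } END { print n + 0 }' "$file")
  # Pass 2: remove the pattern from that line only; drop the line if it ends up empty.
  awk -v pat="$pattern" -v last="$last" '
    NR == last { sub(pat, ""); if ($0 == "") next }
    { print }
  ' "$file"
}
```

Called as drop_last_match '</root>' test.xml, it writes the result to standard output. Neither pass holds more than one line in memory, at the cost of reading the file twice.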

Use the time command to see which one is more efficient. sed should be efficient.
$ time command
In my opinion, nothing is faster than grep. Try it with awk's index() to see if it is any faster.

Related

Assign Array elements to the matched pattern in order

I have an XML file with similar blocks throughout:
<name> test </name>
<marker>
<name> test </name>
<xyz> some txt </xyz>
<abc> something </abc>
<name>test</name>
<marker>
<name>test</name>
Now I want to find each <marker> and replace the lines above and below the first marker with test1, those around the second marker with test2, and so on.
I tried:
array=(test1 test2)
for ((i=0; i<${#array[@]}; i++)); do
sed -i '/<marker>/!b;n;c<name>'"${array[$i]}"'<\/name>' filename
done
The problem is that it always replaces all the values with test2.
I want a sequential replacement: the first marker should have test1 above and below it, the second marker should have test2 above and below it, and so on.
This might work for you (GNU sed):
sed -r '1{x;s/^/1/;x};N;/\n<marker>/!{P;D};N;G;s/.*(\n.*\n).*\n(.*)/<name>test\2<\/name>\1<name>test\2<\/name>/;x;s/.*/expr & + 1/e;x' file
At the start of the file prime the counter with 1. The counter is held in the hold space and incremented after each substitution.
Make a window of two lines throughout the file's length. If the second line in the window does not begin with <marker>, print the first line, delete it, and repeat. Otherwise, append a third line and then append the counter from the hold space. Using pattern matching, substitute the first and third lines with the required text.
Finally increment the counter, ready for the next match and print the last three lines that have been amended.
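For comparison, the same counter-and-window logic can be sketched in awk, which some may find easier to follow than the hold-space version. This is an illustrative sketch, not the answerer's method: number_markers is a made-up name, and it assumes markers are never on adjacent lines.

```shell
# Hypothetical helper: number the <name> lines around each <marker> in file $1.
number_markers() {
  awk '
    {
      line = $0
      if (pending) {                      # line right after a marker
        line = "<name>test" n "</name>"
        pending = 0
      } else if (line ~ /^<marker>/) {    # marker: rewrite the held line above it
        n++
        if (have_prev) prev = "<name>test" n "</name>"
        pending = 1
      }
      if (have_prev) print prev           # output lags one line behind the input
      prev = line
      have_prev = 1
    }
    END { if (have_prev) print prev }
  ' "$1"
}
```

The output is delayed by one line so that the line above a marker can still be rewritten when the marker is seen.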
Well, I didn't find any easy way to do this with a one-liner, but a simple script can get it done:
matchcount=$(grep -c '<marker>' test-input.txt)
i=1
while [[ $i -le $matchcount ]]
do
# line number of the first <marker> that has not been processed yet
line=$(grep -m 1 -n '<marker>' test-input.txt | grep -o '^[0-9]*')
nextline=$((line+1))
prevline=$((line-1))
cmd1="${prevline}s/.*/<name>test${i}<\/name>/"
cmd2="${line}s/.*/REPLACED/"
cmd3="${nextline}s/.*/<name>test${i}<\/name>/"
sed -i "$cmd1" test-input.txt
sed -i "$cmd2" test-input.txt
sed -i "$cmd3" test-input.txt
((i = i + 1))
done
sed -i 's/REPLACED/<marker>/' test-input.txt
Explanation:
Iterate as many times as <marker> appears in the file
Find the first occurrence of <marker> with grep and save its line number and those of the surrounding lines.
For each line use a different sed command: replace the name, or replace the marker occurrence so it doesn't match again.
Replace every REPLACED back to <marker> once it's done.
I'm sure it can be made in fewer lines but this is made to be simple to read. You're welcome to improve it if you want.
Here is what you can do.
Operations are restricted to range 1,/marker/ and marker is replaced by another word, to avoid a second matching. A last sed at the end restores all marker values.
To simplify, replacement is done with multiline replacement string with quoted '\n'.
array=(test1 test2)
marker='<marker>'
processed='<markerprocessed>' # or whatever cannot happen in input
for ((i=0; i<${#array[@]}; i++)); do
replace='<name>'${array[$i]}'<\/name>\n'${processed}'\n<name>'${array[$i]}'<\/name>' # edit as required
# double quotes so that the shell expands ${marker} inside the sed script
sed -i -e "1,/${marker}/s/${marker}/${replace}/" "$file"
done
sed -i "s/${processed}/${marker}/" "$file"

Bash change number to another value on specific line

I'm new to bash scripting, and I'm looking for a way to change a number to another value on a specific line.
I have a file named foo.config containing about 100 lines of configuration.
For example, I have
<UpdateInterval>2</UpdateInterval>
and I need to find this line in foo.config and replace the number (it can be anything from 0 to 10; in my example it is 2) with 0 every time.
Like this:
<UpdateInterval>0</UpdateInterval>
How can I do it with sed? Please suggest.
Part of the file:
<InstallUrl />
<TargetCulture>en</TargetCulture>
<ApplicationVersion>1.0.1.8</ApplicationVersion>
<AutoIncrementApplicationRevision>true</AutoIncrementApplicationRevision>
<UpdateEnabled>true</UpdateEnabled>
<UpdateInterval>2</UpdateInterval>
<UpdateIntervalUnits>hours</UpdateIntervalUnits>
<ProductName>xxxxxxxxxxxx</ProductName>
<PublisherName />
<SupportUrl />
<FriendlyName>xxxxxxxxxxxx</FriendlyName>
<OfficeApplicationDescription />
<LoadBehavior>3</LoadBehavior>
sed and similar tools (grep, awk) are never good tools for parsing XML/HTML data. Use a proper XML/HTML parser, such as xmlstarlet:
xmlstarlet ed -L -O -u "//UpdateInterval" -v 0 foo.config
ed - edit mode
-L - edit the file inplace
-O - omit xml declaration
-u - update action
"//UpdateInterval" - xpath expression
-v 0 - the new value of the element to be updated
The final (example) foo.config contents:
<root>
<InstallUrl/>
<TargetCulture>en</TargetCulture>
<ApplicationVersion>1.0.1.8</ApplicationVersion>
<AutoIncrementApplicationRevision>true</AutoIncrementApplicationRevision>
<UpdateEnabled>true</UpdateEnabled>
<UpdateInterval>0</UpdateInterval>
<UpdateIntervalUnits>hours</UpdateIntervalUnits>
<ProductName>xxxxxxxxxxxx</ProductName>
<PublisherName/>
<SupportUrl/>
<FriendlyName>xxxxxxxxxxxx</FriendlyName>
<OfficeApplicationDescription/>
<LoadBehavior>3</LoadBehavior>
</root>
The <root> tag was added for demonstration purposes; your XML/HTML structure should have its own root (outermost) tag.
In a very simple way, you may try:
sed -E 's/^<UpdateInterval>[0-9]+/<UpdateInterval>0/' foo.config
This will search for <UpdateInterval> at the beginning of a line (note the ^) and then a number ([0-9] stands for a digit and + for a repetition of one or more). This bit will be replaced with <UpdateInterval>0. The / characters separate what you search and what will replace it. The s command is a search and replace.
It will take the file foo.config as input and you will get the output on standard output. If you want your output on the same file, you may do:
sed -E 's/^<UpdateInterval>[0-9]+/<UpdateInterval>0/' foo.config >foo.temp
mv foo.temp foo.config
Or more simply:
sed -i -E 's/^<UpdateInterval>[0-9]+/<UpdateInterval>0/' foo.config
Note that this is not a good way to do the substitution if your config file contains general XML. It will only work in the simplest of cases (but will do for your example.) If your XML bit may be in the middle of a line, remove the ^ character. The search and replace expression assumes that there is no whitespace around the XML tags.
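As a small variation on the caveats above, the sketch below drops the ^ anchor so indented lines also match, and includes the closing tag in the pattern so that only purely numeric values are touched. The function name reset_update_interval is invented for this example; it writes to standard output and leaves the file itself alone.

```shell
# Hypothetical helper: print file $1 with every numeric UpdateInterval set to 0.
reset_update_interval() {
  # | is used as the delimiter so the / in the closing tag needs no escaping.
  sed 's|<UpdateInterval>[0-9][0-9]*</UpdateInterval>|<UpdateInterval>0</UpdateInterval>|' "$1"
}
```

Lines such as <UpdateIntervalUnits>hours</UpdateIntervalUnits> are left unchanged because the tag name must be followed directly by > and digits.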
A solution using an XML parsing tool:
{ echo '<root>'; cat foo.config; echo '</root>'; } |
xmlstarlet ed -O -P -u //UpdateInterval -v 0 |
sed '1d;$d' |
sponge foo.config
The first line is to make the config file into a proper XML file.
The second line updates the value.
The third line removes the root tags.
The last line rewrites the config file. sponge comes from the moreutils package, which needs to be installed.

removing the first character in a text file using Sed

I have 100s of enormous text files in the form of
p 127210 3240293 23423234 3242323423
3242323 23423423 23423234 32423423
which I want to turn to
127210 3240293 23423234 3242323423
3242323 23423423 23423234 32423423
I've tried using
sed '1 s/^.//' input > output
but that gives me
 127210 3240293 23423234 3242323423
3242323 23423423 23423234 32423423
i.e. an annoying space where the p was. Can anyone help me modify the sed expression to get the output without the space?
Thanks
The following awk may help.
awk '!/^[0-9]+/{$1="";$0=$0;$1=$1} 1' Input_file
Explanation: the condition !/^[0-9]+/ checks whether a line does not start with digits. If so, the first field is emptied (because you don't want p in the output), then $0=$0 re-splits the line into fields and $1=$1 rebuilds it, which removes the leading space.
Output will be as follows.
127210 3240293 23423234 3242323423
3242323 23423423 23423234 32423423
sed -E 's/^[[:alpha:]]?//;s/^[[:space:]]*//' infile > outfile
should do it. (The -E is needed so that ? means "optional" rather than a literal question mark.)
You can remove non-numeric characters at the beginning of the lines:
sed 's/^[^0-9]*//' input > output
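Since the question mentions hundreds of files and the prefix only appears on the first line, the same expression can be restricted to line 1 and wrapped in a loop. This is a sketch under assumptions not stated in the question: the inputs are *.txt files in one directory and the results go to a second, already existing directory; strip_prefix_all is a made-up name.

```shell
# Hypothetical helper: strip_prefix_all SRC_DIR DST_DIR
strip_prefix_all() {
  for f in "$1"/*.txt; do
    # The 1 address limits the substitution to the first line of each file.
    sed '1s/^[^0-9]*//' "$f" > "$2/$(basename "$f")"
  done
}
```

Restricting the address to line 1 means the remaining lines are copied through without attempting a substitution on each of them.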
sed 's/^. //' foo.txt > foo2.txt

Extract text matching pattern X after having searched for pattern Y (bash)

In a bash script, how can I extract text from an XML file that begins with <abc> and ends with </abc> and comes after a pattern that I need to look for?
Example of the input file:
<111>
<abc>
text
</abc>
<def>
text
</def>
</111>
<222>
<abc>
text to extract
</abc>
</222>
My goal would be to display "text to extract" indicating I'm looking for the pattern <222>.
Your XML example doesn't have a root element.
<111> and <222> are not valid XML tag names.
Unless you are sure the format of your XML is fixed, don't use regex to parse it.
XPath would be the way to go.
Assuming the 111 and 222 tags are named t111 and t222 and there is a root element:
xmllint --xpath "//t222/abc/text()" your.xml
This is really ugly and you really should use @Kent's answer, but if you really, really insist:
grep -A 999 "<222>" file.xml | grep -A1 "<abc>" | tail -n 1
It takes up to 999 lines after finding your pattern <222>, and then, from that, it takes the single line following <abc> and from that it takes the last line.
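A slightly less fragile line-based variant tracks state in awk instead of relying on a fixed 999-line window. This is still not XML parsing; it is a sketch that assumes each tag sits on its own line as in the example, and extract_abc is a hypothetical name.

```shell
# Hypothetical helper: extract_abc SECTION_TAG FILE
extract_abc() {
  awk -v tag="$1" '
    $0 ~ "<" tag ">"        { in_section = 1 }   # opening tag of the wanted section
    $0 ~ "</" tag ">"       { in_section = 0 }   # closing tag ends the section
    in_section && /<\/abc>/ { in_abc = 0 }
    in_section && in_abc    { print }            # lines between <abc> and </abc>
    in_section && /<abc>/   { in_abc = 1 }
  ' "$2"
}
```

For the sample input, extract_abc 222 file.xml prints only the lines between <abc> and </abc> inside the <222> block, however many there are.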
Using GNU awk for multi-char RS and gensub():
$ awk -v RS='^$' '{print gensub(/.*<222>.*<abc>\n(.*)\n<\/abc>.*/,"\\1","")}' file
text to extract

Awk/Sed - how to print selection between two patterns?

From reference: catonmat.net I think I could get the selection I'm interested in between two patterns using the following:
Source Text (one line): 6 June 2013 08.32.435 UTF+8 Report /content/folder[@name='....' Failure ....
Here the important part is the path to the report, therefore I am using:
awk '/content\/folder\[@name=/,/Failure/' source.csv
I got the entire matched line, instead of only the content path between the two matches.
I have also tried:
sed -n '/content\/folder\[@name/,/Failure/ {/content\/folder\[@name\|Failure/!p}' source.csv
Still returning the entire line...
What was wrong?
Try this:
sed -n '\|content/folder\[@name.*Failure|s|.*content/folder\[@name\(.*\)Failure.*|\1|p' source.csv
/re1/,/re2/ is for selecting a range of lines, not a range of text within a line. Since content/folder and Failure are on the same line, you don't need a range, just a regex that matches a line containing both. Then use s/// to extract the part between them.
sed 's,.*/content/folder\[@name=\(.*\)Failure.*,\1,' source.csv
grep -Po '(?<=@name=).*(?=Failure)' source.csv
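If the two delimiters are fixed strings rather than regular expressions, the text between them on one line can also be cut out with awk's index() and substr(), which avoids escaping brackets altogether. This is an illustrative sketch; the function name between and its two arguments are assumptions, and backslashes in the markers would be interpreted by awk's -v option.

```shell
# Hypothetical helper: between START STOP  (reads lines from standard input)
between() {
  awk -v start="$1" -v stop="$2" '
    {
      s = index($0, start)
      if (s == 0) next                         # line has no start marker
      rest = substr($0, s + length(start))     # everything after the start marker
      e = index(rest, stop)
      if (e == 0) next                         # no stop marker after the start
      print substr(rest, 1, e - 1)
    }'
}
```

Unlike the greedy .* in the sed and grep versions, this stops at the first occurrence of the stop marker after the start marker, and lines lacking either marker are skipped.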

Resources