Copy text between two strings in a file using bash

I have an XML file. From it, I want to copy the text between two strings.
Sample line from XML file:
some stuff.........<br/><br/><br/>http://example.com/copythislink.php<br/><br/>After you.........some more stuff
I want to copy all the text between
<br/><br/><br/>
and
<br/><br/>After you
These two strings occur only once in the XML file. I tried using sed, but it returns an error because of the < characters.

You can use this sed command:
sed 's#.*<br/><br/><br/>\(.*\)<br/><br/>After you.*#\1#' yourfile.xml
Or, if you want to print only the extracted URL (suppressing non-matching lines):
sed -n 's#.*<br/><br/><br/>\(.*\)<br/><br/>After you.*#\1#p' yourfile.xml
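As a quick sanity check, the command can be run against a file containing just the sample line (the file name sample.xml is only illustrative):

```shell
# Build a one-line file resembling the question's sample (hypothetical content).
printf '%s\n' 'some stuff<br/><br/><br/>http://example.com/copythislink.php<br/><br/>After you more stuff' > sample.xml

# -n suppresses automatic printing; the trailing p flag prints only lines
# where the substitution actually matched, so the output is just the URL.
sed -n 's#.*<br/><br/><br/>\(.*\)<br/><br/>After you.*#\1#p' sample.xml
# → http://example.com/copythislink.php
```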

Using gnu grep
grep -Po '(?<=<br/><br/><br/>)((?!<br/><br/>After you).)*' file
Explanation
(?<=<br/><br/><br/>) is a positive look-behind assertion
(?!<br/><br/>After you) is a negative look-ahead assertion
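Under the same assumptions (GNU grep built with PCRE support; file name illustrative), the command behaves like this:

```shell
# Sample line modeled on the question (hypothetical content).
printf '%s\n' 'stuff<br/><br/><br/>http://example.com/copythislink.php<br/><br/>After you stuff' > sample.xml

# The lookbehind anchors the match right after <br/><br/><br/>; the
# negative lookahead stops it just before <br/><br/>After you.
grep -Po '(?<=<br/><br/><br/>)((?!<br/><br/>After you).)*' sample.xml
# → http://example.com/copythislink.php
```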

If you only need to extract the URI, a simple grep would have been enough. For example, something like:
grep -o "http:\/\/[A-Za-z0-9\.\/]*" test.xml
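A minimal sketch of that grep with a made-up input line (note the character class only covers letters, digits, dots and slashes, so URLs with other characters would need a broader class):

```shell
printf '%s\n' 'stuff <br/>http://example.com/copythislink.php<br/> more' > test.xml

# -o prints only the part of the line that matches the pattern.
grep -o "http:\/\/[A-Za-z0-9\.\/]*" test.xml
# → http://example.com/copythislink.php
```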
However, if you really want to capture the text between those two strings (whatever the content, even if it doesn't contain a URI), sat's solution works well.

Related

Shell: find a string between two patterns

I have a response from a curl command to a text file that looks like this:
<att id="owner"><val>objs/13///user</val></att><att id="por"><val>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid"><val>14</val></att><att id="webDavPartialUrl"><val>/Users/user%
I need to find the string between the string >objs/8/</val> and <att id="uid">
I have tried awk, sed and grep, but all have issues with special characters like those above. Is there an option to treat the text as plain characters?
Using grep with -- (explained here)
$ grep -o -- '>objs/8/</val>.*<att id="uid">' pattern
>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid">
For more specific matching with grep, you can refer to this question.
Otherwise, because your input seems to be XML, you should consider using an XPATH expression on it. More specifically, it seems that you want to
retrieve <att id="subType">, which should be easy to express.
Adding <test> and </test> around your sample, I was able to use xmllint to retrieve the value.
$ xmllint --xpath '/test/att[@id="subType"]' pattern
<att id="subType"><val>null</val></att>
Using Perl:
perl -ne 'print "$1\n" if m#>objs/8/</val>(.*)<att id="uid">#' file
output:
</att><att id="subType"><val>null</val></att>
Explanation:
$1 is the captured string (.*)
m## is used here as the matching operator instead of the standard Perl //, in order to ignore the special / characters

How to pull a value from between 2 strings which occur several times in a file

I am trying to pull the value from between 2 strings and line-break each result. I am then hoping to combine this with another value from the same document, pulled the same way. The problem is there are NO line breaks in this file and it is quite large. Here is an example of the file.
<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>
My end result would ideally look something like this.
ID-----DNS_NAME
47-----myhost.domain.local
60-----yourhost.domain.local
My closest attempts so far have been creating variables with grep, but I can't seem to format them into a table. I'm also very new to scripting, so forgive my ignorance.
If your grep supports -P (--perl-regexp), then you're free to use the regex below.
$ grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)' file | sed 'N;s/\n/-----/g'
47-----myhost.domain.local
60-----yourhost.domain.local
\K discards the previously matched characters from the printed output.
(?=...) is a positive lookahead assertion, which asserts where the match must end without consuming any characters.
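With a shortened sample inlined (GNU grep with -P support assumed), the whole pipeline looks like this:

```shell
# Two ID/DNS_NAME pairs on one line, as in the question (hypothetical data).
printf '%s\n' '<ID>47</ID><DNS_NAME>myhost.domain.local</DNS_NAME><ID>60</ID><DNS_NAME>yourhost.domain.local</DNS_NAME>' > file

# grep emits the values one per line; sed's N joins each pair of lines
# and replaces the embedded newline with the ----- separator.
grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)' file |
  sed 'N;s/\n/-----/g'
# 47-----myhost.domain.local
# 60-----yourhost.domain.local
```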
Here is a GNU awk solution (needed due to the multi-character RS) to get your data:
awk -v RS="<ID>" -F"<|>" 'NR>1 {print $1"-----"$9}' file
47-----myhost.domain.local
60-----yourhost.domain.local
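The same idea can be checked end to end by piping the question's sample line straight into awk (GNU awk, or another awk that accepts a multi-character RS, is assumed):

```shell
# Each <ID> starts a new record; fields are split on < or >, so the ID is
# $1 and the DNS name lands in $9. NR>1 skips the empty leading record.
printf '%s' '<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>' |
  awk -v RS="<ID>" -F"<|>" 'NR>1 {print $1"-----"$9}'
# 47-----myhost.domain.local
# 60-----yourhost.domain.local
```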

Delete line containing one of multiple strings

I have a text file and I want to remove all lines containing the words: facebook, youtube, google, amazon, dropbox, etc.
I know to delete lines containing a string with sed:
sed '/facebook/d' myfile.txt
I don't want to run this command five different times though for each string, is there a way to combine all the strings into one command?
Try this:
sed '/facebook\|youtube\|google\|amazon\|dropbox/d' myfile.txt
From GNU's sed manual:
regexp1\|regexp2
Matches either regexp1 or regexp2. Use parentheses to use
complex alternative regular expressions. The matching process tries
each alternative in turn, from left to right, and the first one that
succeeds is used. It is a GNU extension.
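A minimal demonstration with a made-up myfile.txt (GNU sed assumed, since \| is a GNU extension):

```shell
printf 'visit facebook\nread docs\nwatch youtube\nwrite code\n' > myfile.txt

# Lines matching any of the alternatives are deleted; the rest pass through.
sed '/facebook\|youtube\|google\|amazon\|dropbox/d' myfile.txt
# read docs
# write code
```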
grep -vf wordsToExcludeFile myfile.txt
"wordsToExcludeFile" should contain the words you don't want, one per line.
If you need to save the result back to the same file, then add this to the command:
> myfile.new && mv myfile.new myfile.txt
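Putting the two pieces together, with illustrative file contents (writing to a temporary file first is what makes the in-place update safe):

```shell
# One pattern per line; grep -vf drops every line matching any of them.
printf 'facebook\nyoutube\ngoogle\namazon\ndropbox\n' > wordsToExcludeFile
printf 'visit facebook\nread docs\nuse dropbox\nwrite code\n' > myfile.txt

# Write to a new file first, then replace the original.
grep -vf wordsToExcludeFile myfile.txt > myfile.new && mv myfile.new myfile.txt
cat myfile.txt
# read docs
# write code
```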
With awk
awk '!/facebook|youtube|google|amazon|dropbox/' myfile.txt > filtered.txt

Grep page source for URL

I have a webpage source in a text doc, there's a few lines like so:
"rid" : 'http://web.site/urlhere',
How do I use Linux/terminal to grep just the http://web.site/urlhere portion?
You can pass the -o option to grep to tell it to only display the matching pattern.
grep -o http://web.site/urlhere somefile.txt
Assuming you're looking for generic URLs, you could start with something like this (and probably improve it):
grep -o "'http.*'" someFile.txt | sed "s/'//g"
This will search for the text http after a single quote and will include all the characters from that line until the last single quote. It will then pipe the result (only the matching pattern) to sed and remove the single quotes.
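For instance, with the question's line placed in a hypothetical somefile.txt:

```shell
printf '%s\n' "\"rid\" : 'http://web.site/urlhere'," > somefile.txt

# grep -o keeps only the quoted URL; sed then strips the single quotes.
grep -o "'http.*'" somefile.txt | sed "s/'//g"
# → http://web.site/urlhere
```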
Note: You could run into trouble if you have more single quotes after the url (but your question doesn't mention that)...
Since your question is very non-specific, there are probably many other input conditions that could cause problems, but the above should be a good starting point.
More info on grep: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Find and replace html code for multiple files within multiple directories

I have a very basic understanding of shell scripting, but what I need to do requires more complex commands.
For one task, I need to find and replace html code within the index.html files on my server. These files are in multiple directories with a consistent naming convention. ([letter][3-digit number]) See the example below.
files: index.html
path: /www/mysite/board/today/[rsh][0-9]/
string to find: (div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I hope you don't mind the pseudo-regex. The folders containing my target index.html files look similar to r099, s017, h123. And suffice it to say, the HTML code I'm trying to replace is relatively long, but it's still just a string.
The second task is similar to the first, only the filename changes as well.
files: [rsh][0-9].html
path: www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/
string: (div id="id")[code](/div)<--include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I've seen other examples on SO and elsewhere on the net that simply show scripts modifying files under a single directory to find & replace a string without any special characters, but I haven't seen an example similar to what I'm trying to do just yet.
Any assistance would be greatly appreciated.
Thank You.
You have three separate sub-problems:
replacing text in a file
coping with special characters
selecting files to apply the transformation to
1. The canonical text replacement tool is sed:
sed -e 's/PATTERN/REPLACEMENT/g' <INPUT_FILE >OUTPUT_FILE
If you have GNU sed (e.g. on Linux or Cygwin), pass -i to transform the file in place. You can act on more than one file in the same command line.
sed -i -e 's/PATTERN/REPLACEMENT/g' FILE OTHER_FILE…
If your sed doesn't have the -i option, you need to write to a different file and move that into place afterwards. (This is what GNU sed does behind the scenes.)
sed -e 's/PATTERN/REPLACEMENT/g' <FILE >FILE.tmp
mv FILE.tmp FILE
2. If you want to replace a literal string by a literal string, you need to prefix all special characters by a backslash. For sed patterns, the special characters are .\[^$* plus the separator for the s command (usually /). For sed replacement text, the special characters are \& and newlines. You can use sed to turn a string into a suitable pattern or replacement text.
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
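To see the escaping in action, here is a small round trip with made-up strings containing sed metacharacters:

```shell
string_to_replace='price is $5 (really)'
replacement_string='cost & tax'

# Escape regex metacharacters in the pattern and \ & in the replacement.
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')

printf '%s\n' 'note: price is $5 (really) applies' |
  sed -e "s/$pattern/$replacement/g"
# → note: cost & tax applies
```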
3. To act on multiple files directly in one or more directories, use shell wildcards. Your requirements don't seem completely consistent; I think these are the patterns you're looking for, but be sure to review them.
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
This will match files like /www/mysite/board/today/r012/index.html and /www/mysite/person/4/5/6/card/2011/h7.html, but not /www/mysite/board/today/subdir/s012/index.html or /www/mysite/board/today/r1234/index.html.
If you need to act on files in subdirectories recursively, use find. It doesn't seem to be in your requirements and this answer is long enough already, so I'll stop here.
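For completeness, a minimal recursive sketch with find (GNU sed's -i assumed; the directory layout and strings are purely illustrative):

```shell
# Build a tiny tree mimicking the board/today layout (hypothetical).
mkdir -p demo/r099 demo/s017
printf 'old text\n' > demo/r099/index.html
printf 'old text\n' > demo/s017/index.html

# find descends into subdirectories; -exec ... + batches the files
# into as few sed invocations as possible.
find demo -name 'index.html' -exec sed -i -e 's/old text/new text/g' {} +
cat demo/r099/index.html
# → new text
```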
4. Putting it all together:
string_to_replace='(div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)'
replacement_string='(div id="id")<--include="(path)"-->(/div)'
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
sed -i -e "s/$pattern/$replacement/g" \
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html \
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
Final note: you seem to be working on HTML with regular expressions. That's often not a good idea.
Finding the files can easily be done using find -regex:
find www/mysite/board/today -regex ".*[rsh][0-9][0-9][0-9]/index\.html"
find www/mysite/person -regex ".*[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9]\.html"
Due to the nature of HTML, replacing the content might not be very easy with sed, so I would suggest using an HTML or XML parsing library in a Perl script. Can you provide a short sample of an actual HTML file and the result of the replacements?
