How to remove from file specific content from another file? - shell

I have a file foo.txt:
$cat foo.txt
<ul>
<li>
<p>something</p>
</li>
<li>
<p>something else</p>
</li>
</ul>
And a bar.txt:
$cat bar.txt
<li>
<p>something</p>
</li>
And I want the desired output:
<ul>
<li>
<p>something else</p>
</li>
</ul>
I have tried:
$sed '{/r bar.txt/} d' foo.txt
But it didn't work, and I cannot do:
$sed '/<li>/,/</li>/ d' foo.txt
because there are other elements.

This awk one-liner works for your example:
awk -v RS="" '{gsub(/\n/,"\x99")}NR==FNR{t=$0;next}{gsub(t,"");gsub(/\x99/,"\n");print}' bar foo
The output is not exactly the same (an empty line is left where the block was removed), but you get the idea. A short explanation follows the example below:
kent$ head foo bar
==> foo <==
<ul>
<li>
<p>something</p>
</li>
<li>
<p>something else</p>
</li>
</ul>
==> bar <==
<li>
<p>something</p>
</li>
kent$ awk -v RS="" '{gsub(/\n/,"\x99")}NR==FNR{t=$0;next}{gsub(t,"");gsub(/\x99/,"\n");print}' bar foo
<ul>
<li>
<p>something else</p>
</li>
</ul>
Short explanation:
The basic idea is to replace each line break with an invisible character (\x99 in the example), so that each file becomes a single-line string. Then we can do the match and replacement. After processing the strings, replace every \x99 with a line break again to restore the original format. This idea works for sed too, but it is a bit more complicated: you have to make a label and play with the pattern/hold spaces...
In the example I just used RS="" (I am a bit lazy). You could use the sprintf function to build the one-line string instead; that would be more generic, since both of your real files could contain empty lines (your example doesn't, however).
The point is the invisible char replacement part.
Good luck!

sed is an excellent tool for simple substitutions on a single line; for anything else, use awk. Here is a GNU awk solution:
$ gawk -v RS='\0' -v ORS= 'NR==FNR{re=$0;next} {sub(re,"")} 1' bar.txt foo.txt
<ul>
<li>
<p>something else</p>
</li>
</ul>
If "bar.txt" can contain RE metacharacters and you find those causing undesirable matches in the sub() (unlikely when matching large amounts of text), then switch to an index()+substr() solution, which works with strings instead of REs, e.g.:
$ gawk -v RS='\0' -v ORS= '
NR==FNR { str=$0; rlength=length(str); next }
rstart = index($0,str) { $0 = substr($0,1,rstart-1) substr($0,rstart+rlength) }
1' bar.txt foo.txt
<ul>
<li>
<p>something else</p>
</li>
</ul>
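For awk implementations that do not treat RS='\0' the way GNU awk does, the same index()+substr() idea can be sketched by joining the lines by hand. This is only a sketch under assumptions: the helper name remove_block is made up for illustration, the block file is assumed to be non-empty, and only the first occurrence is removed.

```shell
#!/bin/sh
# remove_block BLOCKFILE TARGETFILE  (hypothetical helper name)
# Prints TARGETFILE with the first literal occurrence of BLOCKFILE's
# contents removed. Lines are concatenated manually, so no GNU-only RS
# setting is required.
remove_block() {
  awk '
    NR == FNR { block = block $0 "\n"; next }   # first file: text to remove
              { text  = text  $0 "\n" }         # second file: text to edit
    END {
      start = index(text, block)                # literal search, no regex
      if (start)
        text = substr(text, 1, start - 1) substr(text, start + length(block))
      printf "%s", text
    }' "$1" "$2"
}

# Demo with the two files from the question, recreated in a temp directory.
dir=$(mktemp -d)
printf '%s\n' '<ul>' '<li>' '<p>something</p>' '</li>' \
              '<li>' '<p>something else</p>' '</li>' '</ul>' > "$dir/foo.txt"
printf '%s\n' '<li>' '<p>something</p>' '</li>' > "$dir/bar.txt"
result=$(remove_block "$dir/bar.txt" "$dir/foo.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

Because the block is matched as a plain string, characters such as . or * in bar.txt need no escaping.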

Related

How to substitute / delete multiple lines in place?

I want to use find . -name *.php -exec COMMAND {} \; on a Debian-based system to delete a pattern like this:
<?php
#bVj7Gt#
line1
...
lineX
#/bVj7Gt#
?>
The line after <?php = hash + six alphanumeric + hash
The line before ?> = hash + slash + six alphanumeric + hash
This may or may not be what you're looking for (since you didn't provide sample input/output we could test against) using GNU awk for multi-char RS:
$ cat file
foo
<?php
#bVj7Gt#
line1
...
lineX
#/bVj7Gt#
?>
bar
$ awk -v RS='<[?]php\n#[[:alnum:]]{6}#.*#/[[:alnum:]]{6}#\n[?]>\n' -v ORS= '1' file
foo
bar
Make it awk -i inplace -v RS=... if you want to do "inplace" editing.
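Where GNU awk is not available, a line-by-line state machine is one possible substitute for the multi-char RS approach. The sketch below is not the answer's method: the helper name strip_marked_block is hypothetical, it assumes each marker sits alone on its line, it does not check that the opening and closing IDs are identical, and an unterminated block would swallow the rest of the file.

```shell
#!/bin/sh
# strip_marked_block FILE  (hypothetical helper name)
# Prints FILE without any run of lines shaped like:
#   <?php / #XXXXXX# / ... / #/XXXXXX# / ?>
# where XXXXXX is six alphanumeric characters.
strip_marked_block() {
  awk '
    skipping {                        # inside a marked block: drop lines
      if (closing && $0 == "?>") { skipping = 0; closing = 0; next }
      # remember whether this line is a closing marker such as #/bVj7Gt#
      closing = (length($0) == 9 && $0 ~ /^#\/[A-Za-z0-9]+#$/)
      next
    }
    held != "" {                      # previous line was "<?php"
      if (length($0) == 8 && $0 ~ /^#[A-Za-z0-9]+#$/) {
        held = ""; skipping = 1; next # opening marker: start dropping
      }
      print held; held = ""           # ordinary PHP block: keep its opener
    }
    $0 == "<?php" { held = $0; next } # hold the line until we see the next one
    { print }
    END { if (held != "") print held }
  ' "$1"
}

# Demo with the sample file from the answer above.
dir=$(mktemp -d)
printf '%s\n' 'foo' '<?php' '#bVj7Gt#' 'line1' '...' 'lineX' \
              '#/bVj7Gt#' '?>' 'bar' > "$dir/file"
cleaned=$(strip_marked_block "$dir/file")
printf '%s\n' "$cleaned"
rm -rf "$dir"
```

To change a file in place with this helper, one option is to redirect its output to a temporary file and then move that file over the original.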

How to select a specific string within a line with sed or awk, without printing the whole line

I want to select a specific string within a line in a big txt file with sed or awk, but I always get the whole line, and each line is 100,000+ characters long.
For example, I have:
</div><div class="follow withFollow" id="user-id-1234567890"> <a href="/app/users/id-1234567890/test/ </div><div class="follow withFollow" id="user-id-0123456789"> <a href="/app/users/id-0123456789/test/" 12345678990 1234877890 1234767890 1245456780 123456790 withFollow" id="user-id-9873456789">
The only thing I want is the numbers in:
withFollow" id="user-id-1234567890">, withFollow" id="user-id-0123456789">, withFollow" id="user-id-9873456789">
output:
1234567890
0123456789
9873456789
I tried many things, like:
sed -n '/**user-id-**/,/**">**/p' FILE
awk '/**user-id-**/,/**">**/p' FILE
awk '/**user-id-**/,/**">**/p' FILE | grep -Eo "[0-9]{1,15}" > output.txt
With the last one I also got the other numbers in the same line, not only those within id="user-id-1234567890">.
You could use grep:
$ grep -oP 'user-id-\K[^"]*' file
1234567890
0123456789
9873456789
Or if you only want to match digits:
grep -oP 'user-id-\K\d*' file

How to get only part of a line using grep/sed/awk with regex?

I have an HTML file from which I need to get only a specific part. The biggest challenge here is that this HTML file doesn't have line breaks, so my grep expression isn't working well.
Here is my HTML file:
<p>Test1</p><p>Test2</p>
Note that I have two anchors (<a>) on this line.
I want to get the second anchor and I was trying to get it using:
cat example.html | grep -o "<a.*Test2</p></a>"
Unfortunately, this command returns the whole line, but I want only:
<p>Test2</p>
I don't know how to do this with grep or sed, I'd really appreciate any help.
With GNU awk for multi-char RS, if it's the second record you want:
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} NR==2' file
<p>Test2</p>
or if it's the record labeled "Test2":
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} /<p>Test2<\/p>/' file
<p>Test2</p>
or:
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"; FS="</?p>"} $2=="Test2"' file
<p>Test2</p>
Using Perl:
$ perl -pe '@a = split(m~(?<=</a>)~, $_);$_ = $a[1]' file
<p>Test2</p>
Breakdown:
perl -pe ' ' # Read line for line into $_
# and print $_ at the end
m~(?<=</a>)~ # Match the position after
# each </a> tag
@a = split( , $_); # Split into array @a
$_ = $a[1] # Take second item
This should do:
grep -o '<a[^>]*><p>Test2</p></a>' example.html

Remove last string from every line in file

I have a txt file, and I want to use sed, awk, grep or any combination of them to remove the last string of every line in it.
I think I need something like sed '$s/', but I can't quite figure it out. Thanks.
Example:
fab foo bar
fab foo fab
fab foo foo
Desired output:
fab foo
fab foo
fab foo
Note that the last string will be different in every line.
How about
awk '{$NF=""}1' file
sed -r 's/[[:space:]]*[^[:space:]]+[[:space:]]*$//' file
The above removes the last string of non-spaces on every line along with any white space before or after that string.
awk '{if(NF)NF--}1' file
The condition is needed because NF can't be assigned a negative value and blank lines already have NF == 0. If there are no blank lines in the input, then awk 'NF--' file is enough.
You could try the below sed command,
sed 's/ \+[^ ]\+$//' file
Through perl,
perl -pe 's/\S+\s*$//' file
To also remove the spaces before the last word:
perl -pe 's/\s+\S+\s*$//' file
Another approach is to print every field except for the last:
awk '
{
for(i=1; i<NF; i++)
{
printf "%s ", $i;
}
printf "\n";
}' file
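As a possible variation on the sed answer above, the same pattern can be written in basic regular expression syntax, so it needs no extended-regex flag. The helper name drop_last_word and the sample lines (including the blank line and the trailing spaces) are made up for this sketch. Unlike awk '{$NF=""}1', which leaves the separator before the emptied field, this also removes the white space around the last word.

```shell
#!/bin/sh
# drop_last_word FILE  (hypothetical helper name)
# Removes the last run of non-space characters on each line, together
# with any white space directly before or after it.
drop_last_word() {
  sed 's/[[:space:]]*[^[:space:]]\{1,\}[[:space:]]*$//' "$1"
}

# Demo: three-word lines, a blank line, and a line with trailing spaces.
dir=$(mktemp -d)
printf '%s\n' 'fab foo bar' 'fab foo fab' '' 'fab foo foo  ' > "$dir/words.txt"
trimmed=$(drop_last_word "$dir/words.txt")
printf '%s\n' "$trimmed"
rm -rf "$dir"
```

Blank lines pass through unchanged because the pattern requires at least one non-space character.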

How to get the text between tags?

<li><b> Some Text:</b></li><li><b> Some Text:</b></li>
<pg>something else</pg> <li><b> Some Text:</b> </li>
<li><b> Some Text:</b></li>
<li><b> Some Text:</b> More Text </li> <li><b> Some Text:</b> More Text </li>
This is my input string, and
Some Text:
Some Text:
Some Text:
Some Text: More Text
Some Text: More Text
is the output I want, but all I got was
Some Text:
Some Text:
Some Text: More Text
This is my shell script on Linux:
#!/bin/sh
sed -n -e 's/.*<li>\(.*\)<\/li>.*/\1/p' $1 > temp
sed -e 's/<[<\/b]*>//g' temp >out
Please give me some idea of where I went wrong.
Here is one way with GNU awk (the first line is a blank line):
$ gawk '
RT=="</b>"||RT=="</li>" && NF {
gsub(/^ *| *$/,"")
printf "%s%s",(ORS=!(NR%2)?"":"\n"),$0
}
END { print "\n" }' RS='</?b>|</?li>' file
Some Text:
Some Text:
Some Text:
Some Text:
Some Text:More Text
Some Text:More Text
If you don't mind using a third-party tool - the multi-platform web-scraping utility xidel - it gets as simple as:
xidel file.html -e '/li'
This extracts the text-only content of all (top-level) li elements and prints each on a separate line to produce the desired output.
First things first: Generally speaking, use a tool that understands HTML (see my other answer) rather than awk or sed for HTML parsing - as @chepner succinctly puts it:
Do not parse HTML with sed or awk; sed is designed for line-based editing, and awk for field-based tasks. Neither is suitable for general structured text whose elements may span more than one line.
Thus, the solutions below work in limited circumstances, but do not generalize well.
@jaypal has already provided a GNU awk (gawk)-specific answer.
Here's one that should work with all awk flavors that accept regexes as input record separators (RS) (such as gawk, mawk, and nawk):
awk -v RS='</?li>\n*' '
/^<b>/ { t=$0; gsub(/<\/?b>/, "", t); gsub(/^ +| +$/, "", t); print t}
' file
Older and POSIX-compliant awk flavors - such as the BSD-based one in OSX - only accept a single, literal char. as RS, so the above won't work; on OSX, the following sed command achieves the same (works on Linux, too):
sed -E 's/<\/?li>/\'$'\n''/g' file |
sed -En '/^<pg>/! { /[^ ]/ { s/<\/?b>//g; s/^ +| +$//gp; }; }'
Both solutions trim leading and trailing spaces from the output lines.
#!/bin/sh
Your first sed line does not do what you want it to do:
You will only match ONE occurrence per line
sed -n -e 's/.*<li>\(.*\)<\/li>.*/\1/p' $1 > temp
this...........................^^
which matches....the rest of the line (obviously not what you expected)
One quick workaround is to change every </li> into </li> plus linefeed before any other processing.
#!/bin/sh
sed -e 's/<\/li>/<\/li>\n/g' "$1" |\
sed -n -e 's/.*<li>\(.*\)<\/li>/\1/p' |\
sed -e 's/<[\/b]*>//g' >out
I am no sed expert... somebody else may have a more elegant solution.