I don't know a good way to do this (sed/awk/perl). I combined multiple chapters of HTML files into one, and the result has the following structure:
title
title
title
<p>first chapter contents, multiple
pages</p>
title
title
title
<p>Second chapter contents, multiple pages
more informations</p>
title
title
title
<p>Third chapter contents, multiple pages
few more details</p>
I want to reorganize them as shown below:
title
title
title
title
title
title
title
title
title
<p>first chapter contents, multiple
pages</p>
<p>Second chapter contents, multiple pages
more informations</p>
<p>Third chapter contents, multiple pages
few more details</p>
I have five chapters in one HTML file to reorganize. I tried to use the sed hold buffer, but that seems too difficult with my current knowledge. I am not restricted to sed or awk. Any help will be highly appreciated, thanks.
Edit
Sorry, I altered the source file: it also has a few lines that start with neither
<a nor <p.
Is there any way to write something like an inverse selection in sed, for example
/^<a!/p/
How about running sed twice, first outputting the <a> tags, then the <p> tags:
sed -n '/^<a/p' input.txt
sed -n '/^<p/p' input.txt
Using holdspace it could be done like this:
sed -n '/^<a/p; /^<p/H; ${g; s/\n//; p}' input.txt
This prints all <a> tags and appends all <p> tags to the hold space. At the end of the document ($), it gets the hold space and prints it. H always adds a newline before appending to the hold space; we don't want the first of those newlines, so we remove it with s/\n//.
If you want to store the output, you can redirect it
sed -n '/^<a/p; /^<p/H; ${g; s/\n//; p}' input.txt > output.txt
To use sed -i directly, we need to restructure the code a bit:
sed -i '${x; G; s/\n//; p}; /^<p/{H;d}' input.txt
But this is getting a bit tedious.
If you have lines starting with other characters, and just want to move all starting with an <a> tag to the front, you can do
sed -n '/^<a/p; /^<a/! H; ${g; s/\n//; p}' input.txt
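To see the negated address at work, here is a small sketch on made-up input. The temporary file and the `<a href="#ch…">` lines are assumptions for illustration, not the real document. `/^<a/!` selects every line that does not start with `<a`, so continuation lines of a paragraph are held too.

```shell
# Hypothetical sample: link lines mixed with <p> lines and a continuation line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="#ch1">title 1</a>
<p>first chapter contents, multiple
pages</p>
<a href="#ch2">title 2</a>
<p>Second chapter contents</p>
EOF

# Print <a> lines right away; hold every other line and print the
# collected lines once the last input line has been read.
result=$(sed -n '/^<a/p; /^<a/!H; ${g; s/\n//; p;}' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The two link lines come out first, followed by the remaining lines in their original order.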
Grep works too:
(grep -F '<a' test.txt ; grep -F '<p' test.txt)
sed -n '/^ *<[aA]/ !H
/^ *<[aA]/ p
$ {x;s/\n//;p;}
' YourFile
If the line does not begin with <a (more exactly <a href="#chapter; both upper and lower case are allowed), append it to the hold buffer.
If it does, print the line.
At the end, load the buffer, remove the first newline (every line is stored with an append, so the buffer starts with a newline) and print the content.
Using awk
awk '{if ($0~/<a/) a[NR]=$0; else b[NR]=$0} END {for (i=1;i<=NR;i++) if (a[i]) print a[i];for (j=1;j<=NR;j++) if (b[j]) print b[j]}' file
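A variation on the same idea, as a sketch only: print the link lines immediately and buffer the rest in input order. This avoids indexing by NR and also keeps lines that are empty or consist of `0`, which the `if (a[i])` test would skip. The sample input is made up, and unlike the command above the match is anchored to the start of the line.

```shell
# Hypothetical sample, including a line that is just "0".
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="#ch1">title 1</a>
<p>first chapter</p>
0
<a href="#ch2">title 2</a>
<p>second chapter</p>
EOF

result=$(awk '
  /^<a/ { print; next }          # link lines go out immediately
  { rest[++n] = $0 }             # everything else is buffered in input order
  END { for (i = 1; i <= n; i++) print rest[i] }
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```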
title
title
title
title
title
title
title
title
title
<p>first chapter contents, multiple
pages</p>
<p>Second chapter contents, multiple pages
more informations</p>
<p>Third chapter contents, multiple pages
few more details</p>
Related
Suppose I have text as:
This is a sample text.
I have 2 sentences.
text is present there.
I need to replace the whole text between the two 'text' words. The required result is
This is a sample text.
I have new sentences.
text is present there.
I tried the command below, but it's not working:
sed -i 's/text.*?text/text\
\nI have new sentence/g' file.txt
With your shown samples, please try the following. sed doesn't support lazy matching in regexes. With awk's RS set to the empty string (paragraph mode), the whole sample is read as one record, so a single substitution can span the lines; this is written for the shown samples only. Create a variable val that holds the new text; a simple sub() then does the rest to produce your expected output.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file
The above code prints the output to the terminal. Once you are happy with the results and want to save the output into Input_file itself, try the following code.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file > temp && mv temp Input_file
You have already solved your problem using awk, but in case anyone else will be looking for a sed solution in the future, here's a sed script that does what you needed. Granted, the script is using some advanced sed features, but that's the fun part of it :)
replace.sed
#!/usr/bin/env sed -nEf
# This pattern determines the start marker for the range of lines where we
# want to perform the substitution. In our case the pattern is any line that
# ends with "text." — the `$` symbol meaning end-of-line.
/text\.$/ {
# [p]rint the start-marker line.
p
# Next, we'll read lines (using `n`) in a loop, so mark this point in
# the script as the beginning of the loop using a label called `loop`.
:loop
# Read the next line.
n
# If the last read line doesn't match the pattern for the end marker,
# just continue looping by [b]ranching to the `:loop` label.
/^text/! {
b loop
}
# If the last read line matches the end marker pattern, then just insert
# the text we want and print the last read line. The net effect is that
# all the previous read lines will be replaced by the inserted text.
/^text/ {
# Insert the replacement text
i\
I have a new sentence.
# [print] the end-marker line
p
}
# Exit the script, so that we don't hit the [p]rint command below.
b
}
# Print all other lines.
p
Usage
$ cat lines.txt
foo
This is a sample text.
I have many sentences.
I have many sentences.
I have many sentences.
I have many sentences.
text is present there.
bar
$
$ ./replace.sed lines.txt
foo
This is a sample text.
I have a new sentence.
text is present there.
bar
Substitute
sed -i 's/I have 2 sentences./I have new sentences./g' file.txt
sed -i 's/[A-Z]\s[a-z].*/I have new sentences./g' file.txt
Insert
sed -i -e '2iI have new sentences.' -e '2d' file.txt
I need to replace whole text between two 'text' words.
If I understand correctly, the first text. (with a dot) is at the end of the first line and the second text is at the beginning of the third line. With awk you can get the required result by accumulating the lines in the variable s:
awk -v s='\nI have new sentences.\n' '/text.?$/ {s=$0 s;next} /^text/ {s=s $0;print s;s=""}' file
This is a sample text.
I have new sentences.
text is present there.
I have text between HTML tags. For example:
<td>vip</td>
There can be any text between the <td></td> tags.
How can I cut the text out of these tags and put other text between them?
I need to do it via bash/shell.
How can I do this?
First of all, I tried to get this text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt
As a result I get <td>vip</td>, but according to the documentation I should get only vip.
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that it only works for <td> tags. It replaces everything from the opening <td> through the closing </td> (the tags are matched too and then written back) with TEXT_TO_REPLACE_BY.
You can use this to get the value vip
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
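A sed range address such as `/<td>/,/<\/td>/` selects whole lines, which is why the attempt in the question printed the complete tag. To work on part of a line, use `s` with a capture group. The sketch below reads and replaces a cell value on a made-up line with two cells; `guest` is a placeholder, and `[^<]*` is used instead of `.*` so that each cell is matched separately rather than everything between the first `<td>` and the last `</td>`.

```shell
# Hypothetical input line with two cells.
line='<td>vip</td><td>admin</td>'

# Read the text of the first cell.
first=$(printf '%s\n' "$line" | sed -n 's,^[^<]*<td>\([^<]*\)</td>.*,\1,p')

# Replace the text of every cell on the line.
replaced=$(printf '%s\n' "$line" | sed 's,<td>[^<]*</td>,<td>guest</td>,g')

printf '%s\n%s\n' "$first" "$replaced"
```

This only handles cells without nested markup; for real HTML a proper parser is the safer choice.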
If your Input_file is the same as the shown example, then the following may help you too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This simply prints the tag with echo, uses awk with > and < as field separators, and prints the 3rd field, which is the text you asked for.
d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>
See this thread : How to remove the second line of consecutive lines starting with the same word?
Instead of keeping the first duplicate consecutive line starting with "TITLE", I would like to only keep the last one, to get from this input:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
TITLE extra info
DATA some more data
This output:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
Also, I'd like to be able to handle an arbitrary number of repetitions, and not only 2 (if by example 7 lines in a row start by "TITLE", only keep the last one).
As in the other post, it can be a perl/bash/sed/awk command that keeps only the last line and outputs the rest of the file as well. I've been working on this for a long time, but I could only find solutions that do the opposite of what I want.
With sed:
sed '/^TITLE/ { :a $! { N; /\nTITLE/ { s/.*\n//; ba; }; }; }' filename
That is:
/^TITLE/ { # if a line begins with TITLE
:a # jump label for looping.
$! { # unless we hit the end of input (in case the file
# ends with title lines)
N # fetch the next line
/\nTITLE/ { # if it begins with TITLE as well
s/.*\n// # remove the first
ba # go back to a
}
}
}
Just reverse the order of lines, then print the now-first occurrence, then reverse them again:
$ tac file | awk '$1!=prev; {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
or if there can be multiple consecutive DATA lines and you want to keep all of those:
$ tac file | awk '!($1=="TITLE" && $1==prev); {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
If you're looking for a Perl one-line solution, like the one in the question that you linked, then this will do
perl -ne'if (/^TITLE/) {$t = $_} else {print $t, $_; $t = ""}' myfile
Note that it will not print a TITLE line at all unless it is followed by a line that doesn't begin with TITLE
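If a run of TITLE lines at the very end of the file should be kept as well, one option is an awk sketch like the following. It holds the newest TITLE line and flushes it either before the next non-TITLE line or at end of input. The sample is the input from the question plus one extra trailing line (`TITLE trailing`), which is an addition made here to show the end-of-file case.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
TITLE extra info
DATA some more data
TITLE trailing
EOF

result=$(awk '
  /^TITLE/ { held = $0; pending = 1; next }           # keep only the newest TITLE
  { if (pending) { print held; pending = 0 } print }  # flush it before other lines
  END { if (pending) print held }                     # TITLE run at end of file
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

It needs neither tac nor GNU extensions, so it should also run with the stock awk on a Mac.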
This might work for you (GNU sed):
sed -r 'N;/^(TITLE ).*\n\1/!P;D' file
This looks at two lines at a time and does not print the first one if both of them start with TITLE.
I've been searching for a long time, and have not been able to find a working answer for my problem.
I have a line from an HTML file extracted with sed '162!d' skinlist.html, which contains the text
<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard">.
I want to extract the text Dwarf Red Beard, but that text is modular (can be changed), so I would like to extract the text between title=" and ".
I cannot, for the life of me, figure out how to do this.
awk 'NR==162 {print $4}' FS='"' skinlist.html
set field separator to "
print only line 162
print field 4
Solution in sed
sed -n '162 s/^.*title="\(.*\)".*$/\1/p' skinlist.html
Extracts line 162 in skinlist.html and captures the title attribute's contents in \1.
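One caveat, shown as a sketch: `\(.*\)"` is greedy and runs to the last double quote on the line. That is fine for the line in the question, where title is the last attribute, but not if another attribute follows. Matching `[^"]*` stops at the closing quote of the title. The `class="skin"` attribute below is made up for the demonstration.

```shell
# Hypothetical line with an extra attribute after title.
line='<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard" class="skin">'

# Greedy capture: runs to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*title="\(.*\)".*$/\1/p')

# Capture up to the next double quote only.
title=$(printf '%s\n' "$line" | sed -n 's/.* title="\([^"]*\)".*/\1/p')

printf '%s\n%s\n' "$greedy" "$title"
```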
The shell's variable expansion syntax allows you to trim prefixes and suffixes from a string:
line="$(sed '162!d' skinlist.html)" # extract the relevant line from the file
temp="${line#* title=\"}" # remove from the beginning through the first match of ' title="'
if [ "$temp" = "$line" ]; then
echo "title not found in '$line'" >&2
else
title="${temp%%\"*}" # remove from the first '"' through the end
fi
You can pass it through another sed or add expressions to that sed like -e 's/.*title="//g' -e 's/">.*$//g'
also sed
sed -n '162 s/.*"\([a-zA-Z ]*\)"./\1/p' skinlist.html
This may be a bit complex, but here it goes:
Assuming I have an XML that looks as follows:
<a>
<b>000</b>
<c>111</c>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
How can I, using sed on a Mac, get a resulting XML that looks as follows:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
Basically:
Matching 2 consecutive lines that are of the form <b>...</b> followed by <c>...</c>
Taking the value between <c>...</c> and placing it (plus a space character) right after <b> on the line before it
Removing the second line <c>...</c>
Thank you.
If sed is too much for this, please advise anything else as long as I can run it from a mac shell.
Not the most beautiful solution, but it seems to work :-)
$ tr '\n' '#' < input | sed 's|<b>\([0-9]\+\)</b>#<c>\([0-9]\+\)</c>|<b>\2 \1</b>|g' | tr '#' '\n'
output:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
or a bit more general:
$ tr '\n' '#' < f1 | sed 's|<b>\([^<]*\)</b>#<c>\([^<]*\)</c>|<b>\2 \1</b>|' | tr '#' '\n'
using [^<]* to match anything between the tags
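Since the question allows anything that runs from a Mac shell, here is an awk sketch that avoids the newline juggling. It holds a `<b>...</b>` line for one step and merges it with a directly following `<c>...</c>` line. The function name `emit_prev` and the variable names are made up here, and the value of `<c>` is assumed not to contain `&` or backslashes, which are special in the replacement of `sub()`.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a>
<b>000</b>
<c>111</c>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
EOF

result=$(awk '
  # Print the held <b> line, if there is one.
  function emit_prev() { if (have) print prev; have = 0 }

  # A <b>...</b> line: hold it until we have seen the next line.
  /^[[:space:]]*<b>[^<]*<\/b>[[:space:]]*$/ { emit_prev(); prev = $0; have = 1; next }

  # A <c>...</c> line right after a held <b> line: merge and drop the <c> line.
  have && /^[[:space:]]*<c>[^<]*<\/c>[[:space:]]*$/ {
      c = $0
      sub(/^[[:space:]]*<c>/, "", c)
      sub(/<\/c>[[:space:]]*$/, "", c)
      sub(/<b>/, "<b>" c " ", prev)
      print prev; have = 0; next
  }

  # Any other line: flush a held <b> line, then print as-is.
  { emit_prev(); print }
  END { emit_prev() }
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Only a `<c>` line that immediately follows a `<b>` line is merged, so the later `<c>444</c>` stays untouched.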
Ruby would support multi-line patterns:
ruby -e 'print gets(nil).sub(/<b>([^\n]*)<\/b>\n<c>([^\n]*)<\/c>/m,"<b>\\2 \\1</b>")' file.txt