I have an HTML file with some text repeated throughout the document. The repeated strings have font size 4 or 5, and my goal is to delete all of those
repeated strings except the first appearance.
For example:
India! with size=5 appears 9 times and with size=4 appears 2 times. I'd like to remove all appearances of India! with size=5 except the first.
India!
I've tried the sed command in bash (I'm open to suggestions using other tools) as shown below, but it doesn't work because it removes everything after the first match:
sed 's/<font size=\"[4-5]\".*<\/font>//g'
and I get as output only this:
<!DOCTYPE html> <html> <body>
<h1>Some header</h1>
<p> </p>
<p> This is other text. </p>
</body>
</html>
My input file is this:
<!DOCTYPE html>
<html>
<body>
<h1>Some header</h1>
<p>
<font size="5">India!</font>
<p>
<font size="4">Japan!</font>
</p>
</p>
<p>Some text 1</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 2</p>
<p>
<font size="5">India!</font>
<p>
<font size="4">Japan!</font>
</p>
</p>
<p>Some text 3</p>
<p>
<font size="5">Uganda!</font>
</p>
<p>Some text 4</p>
<p>
<font size="5">India!</font>
<p>
<font size="4">Japan!</font>
</p>
</p>
<p>Some text 5</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 6</p>
<p>
<font size="5">Cameroon!</font>
</p>
<p>Some text 7</p>
<p>
<font size="4">India!</font>
</p>
<p>Some text 8</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 9</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 10</p>
<p>
<font size="5">Pakistan!</font>
</p>
<p>Some text 11</p>
<p>
<font size="5">Pakistan!</font>
</p>
<p>Some text 12</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 13</p>
<p>
<font size="4">Uganda!</font>
</p>
<p>
<font size="5">India!</font>
</p>
<p>Some text 14</p>
<p>
<font size="4">India!</font>
</p>
<p> This is other text. </p>
</body>
</html>
The image below shows the input (left) and the desired output (right), both as text and as an HTML preview.
As you requested in your comment, here is a slightly different program to remove the associated paragraph tags as well.
To remove the <p> and </p> lines before and after each line you want removed (the duplicates), I found it conceptually easier to run through the file twice.
On the first pass, I keep track of whether I've already seen each combination of font size and country, just as before. In addition, I record the line numbers (FNR) of the lines that need to be removed. The code "knows" it is on the first pass when NR == FNR: NR is the total number of records read so far, and FNR is the record number within the current file, so the two are equal only while awk is parsing the first file argument.
On the second pass through the same input file, I print the current record unless it is marked as suppressed. FNR is used to index the suppress array because a line's FNR is the same on the first pass as on the second.
Lastly, in order to tell awk to parse the file twice, we'll need to pass the input file to awk twice on the command line.
Here's the revised code. I also illustrate how to parse your input file twice by adding the file (let's call it input.html) two times to the command line:
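awk -F"[\"<>= ]*" '
# First pass (NR == FNR): mark repeated font lines and their neighbours.
NR == FNR {
    if ( $2 == "font" )
    {
        # $4 is the font size, $5 is the text inside the tag.
        if ( seen[ $4,$5 ] )
            suppress[ NR - 1 ] = suppress[ NR ] = suppress[ NR + 1 ] = 1
        seen[ $4,$5 ] = 1
    }
    next
}
# Second pass: print every line that was not marked.
! suppress[ FNR ]
' input.html input.html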
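If you'd rather not rely on awk's field splitting, the same two-pass idea can be sketched in Ruby. This is only a rough sketch under the same assumption as the awk version (every tag sits on its own line, with <p> and </p> on the lines immediately before and after a font line). The method name drop_repeated_fonts and the FONT_RE pattern are my own and not part of the question.

```ruby
# Matches a font line and captures the size (4 or 5) and the enclosed text.
FONT_RE = /<font size="([45])">(.*?)<\/font>/

def drop_repeated_fonts(lines)
  seen = {}
  suppress = {}
  # Pass 1: mark each repeated font line and its two neighbouring lines.
  lines.each_with_index do |line, i|
    m = FONT_RE.match(line)
    next unless m
    key = [m[1], m[2]]
    [i - 1, i, i + 1].each { |n| suppress[n] = true } if seen[key]
    seen[key] = true
  end
  # Pass 2: keep only the lines that were not marked.
  lines.each_with_index.reject { |_, i| suppress[i] }.map(&:first)
end

sample = [
  '<p>', '<font size="5">India!</font>', '</p>',
  '<p>Some text 1</p>',
  '<p>', '<font size="5">India!</font>', '</p>',
  '<p>', '<font size="4">India!</font>', '</p>'
]
puts drop_repeated_fonts(sample)
```

For a real file you could pass File.readlines("input.html", chomp: true) instead of the sample array. Like the awk script, this is line-based and will not cope with markup laid out differently.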
Here's an awk 'solution' for you:
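awk -F"[\"<>= ]*" '
$2 == "font" {
    # Print a font line only the first time its size ($4) and text ($5) are seen.
    if ( !printed[ $4,$5 ] )
        print
    printed[ $4,$5 ] = 1
    next
}
# Print every other line unchanged.
1
'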
Since awk is not a robust HTML parser, it's really not a great general solution. However, if your input files are consistent, this small script may do the trick.
Related
I'm trying to get content from an element whose @id attribute matches the context node's @idref. For example, given the following XML (just a contrived sample)...
<doc>
<toc>
<entry idref="ch1"/>
<entry idref="ch2"/>
</toc>
<body>
<chapter id="ch1">
<title>Chapter 1</title>
<para/>
</chapter>
<chapter id="ch2">
<title>Chapter 2</title>
<para/>
</chapter>
<chapter id="ch3">
<title>Chapter 3</title>
<para/>
</chapter>
</body>
</doc>
From the [entry] element, how can I get the content of [title] within the [chapter] whose @id matches the current @idref?
So, basically find chapter[where chapter @id = current entry @idref]/title
I've tried
string(//chapter[@id = @idref]/title)
string(//chapter[@id = ./@idref]/title)
string(//chapter[@id = current()/@idref]/title)
all with no luck.
Can you try this expression on your xml?
//chapter[@id=//toc/entry/@idref]/string-join((title,@id),' ')
Output:
Chapter 1 ch1
Chapter 2 ch2
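If your XPath engine has no current() (it is an XSLT function, not part of plain XPath 1.0), another option is to do the lookup from the host language: read each entry's idref, then select the chapter with that id. Here is a minimal sketch using Ruby's REXML. It assumes the idref values contain no quote characters, since they are interpolated into the XPath string, and the variable names are only illustrative.

```ruby
require 'rexml/document'

xml = <<~'XML'
  <doc>
    <toc>
      <entry idref="ch1"/>
      <entry idref="ch2"/>
    </toc>
    <body>
      <chapter id="ch1"><title>Chapter 1</title><para/></chapter>
      <chapter id="ch2"><title>Chapter 2</title><para/></chapter>
      <chapter id="ch3"><title>Chapter 3</title><para/></chapter>
    </body>
  </doc>
XML

doc = REXML::Document.new(xml)

# For each <entry>, read its idref and look up the chapter with that id.
titles = REXML::XPath.match(doc, '//toc/entry').map do |entry|
  idref = entry.attributes['idref']
  title = REXML::XPath.first(doc, "//chapter[@id='#{idref}']/title")
  [idref, title && title.text]
end

titles.each { |idref, text| puts "#{idref}: #{text}" }
```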
I'm trying to use xpath to return the value "Vancouver", from either the comment or the text after it. Can anyone point me in the right direction?
The location li is always the first item but is not always present, and the number of list items after it varies for each item.
<item>
<title>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li> <!-- ABC Location=Vancouver -->Location: Vancouver</li>
<li> <!-- More comments -->Text</li>
<li> text</li>
</ul>
</description>
</item>
This will pull it from the text after the comment:
substring-after(//ul[@class='class1']/li[position()=1 and contains(.,'Location:')],'Location: ')
This specifies the first <li> inside the <ul> of class 'class1', only when it contains 'Location:', and takes the string after 'Location:'. If you want to relax the requirement that it be the first li, use this:
substring-after(//ul[@class='class1']/li[contains(.,'Location:')],'Location: ')
This isn't elegant, and it could cause issues if your "Location: #####" were to change structurally, because this is a static solution, but it works for the above:
substring(//item//li[1],12,string-length(//item//li[1])-10)
And this returns the string equivalent, not a node.
Rushed this one a bit, so I'll give a better solution with time but this is just something to think about...
(it just strips off "Location: " and returns whatever's after it..)
Use:
substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
To extract the location from the comment use:
substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
"/>
==========
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
"/>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<item>
<title/>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li>
<!-- ABC Location=Vancouver -->Location: Vancouver
</li>
<li>
<!-- More comments -->Text
</li>
<li> text</li>
</ul>
</description>
</item>
the two XPath expressions are evaluated and the results of the evaluations are copied to the output:
Vancouver
==========
Vancouver
I have HTML like this:
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>
So I need to get each pair together: Hello1 with World1, Hello2 with World2, etc.
UPDATE: I use Ruby Mechanize library
The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:
require 'mechanize'
require 'pp'
html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
  paragraph = header.xpath("following-sibling::p[1]").text
  results << [header.text, paragraph]
end
pp results
EDIT:
This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.
EDIT #2:
This example answers a follow-up question to the original solution:
require 'nokogiri'
require 'pp'
html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML
doc = Nokogiri::HTML(html)
results = []
doc.xpath("//h1").each do |header|
  h1 = header.xpath("following-sibling::p/font/b").text
  results << h1
end
pp results
H1 tags with nested elements are invalid, so Nokogiri corrects the error during the parsing process. The process to get at the formerly nested elements is very similar to the original solution.
Note: I glossed over the XPath part of this request. This answer is for an XSLT style sheet instead.
Expanding your XML example to give it a root element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>
You could use a for-each loop along with "following-sibling" to get the elements with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output encoding="UTF-8" method="text"/>
<xsl:template match="/">
<!-- start looking for <h1> nodes -->
<xsl:for-each select="/root/h1">
<!-- output the h1 text -->
<xsl:value-of select="."/>
<!-- print a dash for spacing -->
<xsl:text> - </xsl:text>
<!-- select the next <p> node -->
<xsl:value-of select="following-sibling::p[1]"/>
<!-- print a new line -->
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output would look like this:
Hello1 - World1
Hello2 - World2
Hello3 - World3
I have two XML files. The first is:
<a>
<b>
<c1>1</c1>
</b>
<b>
<c1>2</c1>
</b>
<b "id" = "true">
<c1>3</c1>
<d "do" ="me"></d>
</b>
<b id ="true">
<c1>4</c1>
</b>
</a>
And the second is:
<a>
<b>
<c1>5</c1>
</b>
</a>
I want to update an element from first.xml:
<b "id" = "true">
<c1>3</c1>
<d "do" ="me"></d>
</b>
with an element from second.xml:
<b>
<c1>5</c1>
</b>
I tried to achieve that by deleting all the <b> nodes from first.xml and adding the <b> node taken from second.xml. I am able to delete all the <b> nodes, but I am not able to get an element from second.xml and add it to first.xml.
After cleaning up the source XML, this seems to be what you're looking for:
xml1 = <<EOT
<a>
<b>
<c1>1</c1>
</b>
<b>
<c1>2</c1>
</b>
<b id="true">
<c1>3</c1>
<d do="me"></d>
</b>
<b id="true">
<c1>4</c1>
</b>
</a>
EOT
xml2 = <<EOT
<a>
<b>
<c1>5</c1>
</b>
</a>
EOT
require 'nokogiri'
doc1 = Nokogiri::XML(xml1)
doc2 = Nokogiri::XML(xml2)
doc1_b = doc1.at('//b[@id="true"]/c1/..')
doc2_b = doc2.at('b')
doc1_b.replace(doc2_b)
puts doc1.to_html
Which outputs:
<a>
<b>
<c1>1</c1>
</b>
<b>
<c1>2</c1>
</b>
<b>
<c1>5</c1>
</b>
<b id="true">
<c1>4</c1>
</b>
</a>
doc1.at('//b[@id="true"]/c1/..') means "find the first b tag with id="true" that has a child c1 node".
The option
//b[@id="true" and d/@do="me"]
combined with the above answer solves my question.
I am using Mechanize/Nokogiri and need to parse the following HTML string.
Can anyone help me with the XPath syntax to do this, or suggest any other method that would work?
<table>
<tr class="darkRow">
<td>
<span>
<a href="?x=mSOWNEBYee31H0eV-V6JA0ZejXANJXLsttVxillWOFoykMg5U65P4x7FtTbsosKRbbBPuYvV8nPhET7b5sFeON4aWpbD10Dq">
<span>4242YP</span>
</a>
</span>
</td>
<td>
<span>Subject of Meeting</span>
</td>
<td>
<span>
<span>01:00 PM</span>
<span>Nov 11 2009</span>
<span>America/New_York</span>
</span>
</td>
<td>
<span>30</span>
</td>
<td>
<span>
<span>example@email.com</span>
</span>
</td>
<td>
<span>39243368</span>
</td>
</tr>
.
.
.
<more table rows with the same format>
</table>
I want this as the output:
"4242YP","Subject of Meeting","01:00 PM Nov 11 2009 America/New_York","30","example@email.com", "39243368"
.
.
.
<however many rows exist in the html table>
Something like this?
items=doc.xpath('//tr').map {|row| row.xpath('.//span/text()').select{|item| item.text.match(/\w+/)}.map {|item| item.text} }
returns:
=> [["4242YP", "Subject of Meeting", "01:00 PM", "Nov 11 2009", "America/New_York", "30", "example@email.com", "39243368"], ["abcdefg"]]
The select keeps only the text nodes that contain at least one word character (i.e. it drops the whitespace-only text nodes that some of your spans have). You may need to refine the "select" filter for your specific case.
I added a minimalist row that contained a span containing abcdefg, so that you can see the nested array.
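If you want the exact output from the question, with one quoted string per <td> (so that the three date spans are joined into a single cell), here is a rough sketch. It uses Ruby's bundled REXML rather than Nokogiri so it runs without extra gems, which assumes the table is well-formed XML. The helper name row_to_csv is made up for illustration, and the href in the sample is shortened.

```ruby
require 'rexml/document'

html = <<~'HTML'
  <table>
    <tr class="darkRow">
      <td><span><a href="?x=abc"><span>4242YP</span></a></span></td>
      <td><span>Subject of Meeting</span></td>
      <td><span><span>01:00 PM</span><span>Nov 11 2009</span><span>America/New_York</span></span></td>
      <td><span>30</span></td>
      <td><span><span>example@email.com</span></span></td>
      <td><span>39243368</span></td>
    </tr>
  </table>
HTML

# Build one quoted, comma-separated line per table row.
def row_to_csv(row)
  cells = REXML::XPath.match(row, 'td').map do |td|
    # Keep only the innermost spans (those with no child elements) and
    # join their text, so the three date parts end up in one cell.
    leaves = REXML::XPath.match(td, './/span').reject(&:has_elements?)
    leaves.map { |span| span.text.to_s.strip }.join(' ')
  end
  cells.map { |cell| %("#{cell}") }.join(',')
end

doc = REXML::Document.new(html)
lines = REXML::XPath.match(doc, '//tr').map { |row| row_to_csv(row) }
puts lines
```

With Nokogiri the same grouping should be possible by iterating over row.xpath('td') and joining the non-blank text of each cell, but I have only sketched the REXML variant here.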
Here's part of the XSL to transform your input, if you have an XSL transformer:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates select="//tr"/>
</xsl:template>
<xsl:template match="tr">
"<xsl:value-of select="td/span/a/span"/>","<xsl:value-of select="td[position()=2]/span"/>","<xsl:value-of select="td[position()=3]/span/span[position()=1]"/>"
</xsl:template>
</xsl:stylesheet>
Output produced looks like this:
"4242YP","Subject of Meeting","01:00 PM"
"4242YP","Subject of Meeting","01:00 PM"
(I duplicated your first table row).
The XSL select attributes give you a good idea of the XPath expressions you'd need to get the rest.