I have an XPath expression that should do the following: check the @name attribute and check whether the text starts with "Machinery"; if both conditions are fulfilled, "something" will be shown.
I have several fields with naming:
custom.DOC.machinery1
custom.DOC.machinery2
custom.DOC.machinery3
custom.DOC.machinery4
custom.DOC.machinery5
I need to check whether any of the above fields (whose names start with machinery) has a text value that starts with "Machinery".
So I tried the following XPath, but it does not seem to work:
/node/node/node/node[@class = 'InfoType12']/metalist/meta[starts-with(@name,'custom.DOC.machinery') and starts-with(text(),'Machinery')]
It would be great if I could combine the working XPath above (based on your modified XPath) with the following XPath:
/node/node/node/node[@class = 'InfoType12']/metalist/meta[@name = 'custom.DOC.machinery1.asterisk' or @name = 'custom.DOC.machinery2.asterisk' or @name = 'custom.DOC.machinery3.asterisk' or @name = 'custom.DOC.machinery4.asterisk' or @name = 'custom.DOC.machinery5.asterisk' and contains(text(),'**')]
They are currently in two separate sections, but I think it may be possible to merge them.
Is something wrong in my XPath(s)? Any other ideas?
I'm using XPath 1.0. The XML is too large to insert in full.
<node class="MultiLangRoot" type="MultiLang" primary-aspect-id="131846" primary-aspect-name="en" primary-specific-culture="en-GB" primary-writing-mode="ltr">
<node class="Layout">
<title>Declaration_of_conformity_NEW</title>
</node>
<node class="MultiLangAspect" type="MultiLang" aspect-id="131846" aspect-name="en" writing-mode="ltr" target-culture="en-GB" character-set="0" htmlhelp-lcid="0x809 English (United Kingdom)">
<title>English</title>
<node id="10976987147" xml:space="preserve" aspect-name="en" specific-culture="en-GB" class="Project" type="Unknown" color="-2354116" wf-state="NotReleased" base-id="10976987147" creator="Jesse Victoor" creationtime="2021-06-16T08:42:32" modificator="Jesse Victoor" modificationtime="2021-06-16T08:42:32" releasedby="" releasedate="" releasedate-invariant="" versionnumber="1" tms-state="Neutral" tms-state-displayname="Language-neutral">
<metalist>
<meta name="Title">New project</meta>
<meta name="Document_typeTaxonomyLink" collectable="true">Declaration of Conformity</meta>
<multi-meta id="3370875395" name="Document_typeTaxonomyLink" collectable="true">
<multi-meta-element id="3371659147" base-id="3371659147" value="Declaration of Conformity"/>
</multi-meta>
</metalist>
<title>New project</title>
<node id="10978966155" format="portrait" class="InfoType12" type="TextNode" color="-65536" wf-state="NotReleased" reuseid="10978974347" base-reuseid="10978974347" base-id="10978966155" chapternumbering="no" showintoc="no" creator="Jesse Victoor" creationtime="2021-06-17T10:30:13" modificator="Jesse Victoor" modificationtime="2021-06-17T10:30:13" releasedby="" releasedate="" releasedate-invariant="" versionnumber="1" content-creator="Jesse Victoor" content-creationtime="17/06/2021 10:28:18" content-creationtime-invariant="2021-06-17T10:28:18" content-modificator="Jesse Victoor" content-modificationtime="30/07/2021 10:23:45" content-modificationtime-invariant="2021-07-30T10:23:45" tms-state="Neutral" tms-state-displayname="Language-neutral" xmlns:st4="http://www.schema.de/2002/ST4/DocuManager" aspect-name="en" specific-culture="en-GB">
<title>3P663050-1_DOC safety (non PED) - EU in scope of MD</title>
<metalist>
<meta name="custom.DOC.affix">* = , , 1, 2, 3, ..., 9</meta>
<meta name="custom.DOC.EN_standard">EN60335-2-40, </meta>
<meta name="custom.DOC.machinery1">Machinery 2006/42/EU</meta>
<meta name="custom.DOC.machinery1.asterik">**</meta>
<meta name="custom.DOC.machinery2">Low Voltage 2014/35/EU</meta>
<meta name="custom.DOC.machinery3">Electromagnetic Compatibility 2014/30/EU</meta>
<meta name="custom.DOC.machinery3.asterik">*</meta>
<meta name="custom.DOC.manufacturer">Daikin Europe N.V.</meta>
<meta name="custom.DOC.sign.left">Yes</meta>
<meta name="custom.DOC.sign.right">No</meta>
<meta name="custom.DOC.standalone.sign.left">No</meta>
<meta name="custom.DOC.standalone.sign.right">Yes</meta>
<meta name="custom.DOC.TCF">Yes</meta>
<meta name="custom.DOC.units1">Units</meta>
<meta name="custom.first.flow">Left page</meta>
<meta name="meta.Pub.Format" sysValue="portrait">Portrait</meta>
<meta name="Title">3P663050-1_DOC safety (non PED) - EU in scope of MD</meta>
</metalist>
<content>
<modref src="b1f75d7b-ada4-4088-ba31-fd7a1bb91b24"/>
<modref src="c8943c42-7c0f-4e77-806f-2872da269a4d"/>
<modref src="7300b119-c8ea-4945-ba51-84d50302a575"/>
<modref src="7658bf10-d95c-4cd9-b906-101f5e148949"/>
<modref src="f603d081-6e60-48ba-b112-6d907865cc49"/>
<modref src="d2115d02-441c-4ec1-b82d-2210a4a62d08"/>
<modref src="f75ff48b-8f83-495c-972e-a2af20cd033c"/>
<modref src="e927b213-cf96-4fff-b61d-d9b27a7a0ec4"/>
<modref src="d2338324-ba6d-4b20-862c-10efcb28fb7e"/>
<modref src="1dc84a97-13ba-4eed-840b-e0a1fe1360fa"/>
</content>
<node id="10979452683" class="TextModule2" type="TextModule" color="-1969921" wf-state="NotReleased" base-id="10979452683" chapternumbering="yes" showintoc="yes" creator="Jesse Victoor" creationtime="2021-06-17T15:37:52" modificator="Jesse Victoor" modificationtime="2021-06-17T15:37:52" releasedby="" releasedate="" releasedate-invariant="" versionnumber="1" content-creator="Jesse Victoor" content-creationtime="17/06/2021 15:37:52" content-creationtime-invariant="2021-06-17T15:37:52" content-modificator="Administrator" content-modificationtime="29/04/2016 02:25:51" content-modificationtime-invariant="2016-04-29T02:25:51" tms-state="Neutral" tms-state-displayname="Language-neutral" linkid="f75ff48b-8f83-495c-972e-a2af20cd033c" xmlns:st4="http://www.schema.de/2002/ST4/DocuManager" aspect-name="en" specific-culture="en-GB">
<title>DENV is authorized to compiled TCF table</title>
<metalist>
<meta name="Title">DENV is authorized to compiled TCF table</meta>
</metalist>
<content>
<table-container type="noframe">
<table hsdl-cm="0.40 6.77 0.40 6.98 0.40 6.97 0.40 6.38" type="scaled">
<tbody>
<tr>
<td>
<p type="p_table_l">
<b>01**</b>
</p>
<p type="p_table_l">
<b>02**</b>
</p>
<p type="p_table_l">
<b>03**</b>
</p>
<p type="p_table_l">
<b>04**</b>
</p>
<p type="p_table_l">
<b>05**</b>
</p>
<p type="p_table_l">
<b>06**</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. is authorised to compile the Technical Construction File.</p>
<p type="p_table_l">Daikin Europe N.V. hat die Berechtigung die Technische Konstruktionsakte zusammenzustellen.</p>
<p type="p_table_l">Daikin Europe N.V. est autorisé à compiler le Dossier de Construction Technique.</p>
<p type="p_table_l">Daikin Europe N.V. is bevoegd om het Technisch Constructiedossier samen te stellen.</p>
<p type="p_table_l">Daikin Europe N.V. está autorizado a compilar el Archivo de Construcción Técnica.</p>
<p type="p_table_l">Daikin Europe N.V. è autorizzata a redigere il File Tecnico di Costruzione.</p>
</td>
<td>
<p type="p_table_l">
<b>07**</b>
</p>
<p type="p_table_l">
<b>08**</b>
</p>
<p type="p_table_l">
<b>09**</b>
</p>
<p type="p_table_l">
<b>10**</b>
</p>
<p type="p_table_l">
<b>11**</b>
</p>
<p type="p_table_l">
<b>12**</b>
</p>
</td>
<td>
<p type="p_table_l">Η Daikin Europe N.V. είναι εξουσιοδοτημένη να συντάξει τον Τεχνικό φάκελο κατασκευής.</p>
<p type="p_table_l">A Daikin Europe N.V. está autorizada a compilar a documentação técnica de fabrico.</p>
<p type="p_table_l">Компания Daikin Europe N.V. уполномочена составить Комплект технической документации.</p>
<p type="p_table_l">Daikin Europe N.V. er autoriseret til at udarbejde de tekniske konstruktionsdata.</p>
<p type="p_table_l">Daikin Europe N.V. är bemyndigade att sammanställa den tekniska konstruktionsfilen.</p>
<p type="p_table_l">Daikin Europe N.V. har tillatelse til å kompilere den Tekniske konstruksjonsfilen.</p>
</td>
<td>
<p type="p_table_l">
<b>13**</b>
</p>
<p type="p_table_l">
<b>14**</b>
</p>
<p type="p_table_l">
<b>15**</b>
</p>
<p type="p_table_l">
<b>16**</b>
</p>
<p type="p_table_l">
<b>17**</b>
</p>
<p type="p_table_l">
<b>18**</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. on valtuutettu laatimaan Teknisen asiakirjan.</p>
<p type="p_table_l">Společnost Daikin Europe N.V. má oprávnění ke kompilaci souboru technické konstrukce.</p>
<p type="p_table_l">Daikin Europe N.V. je ovlašten za izradu Datoteke o tehničkoj konstrukciji.</p>
<p type="p_table_l">A Daikin Europe N.V. jogosult a műszaki konstrukciós dokumentáció összeállítására.</p>
<p type="p_table_l">Daikin Europe N.V. ma upoważnienie do zbierania i opracowywania dokumentacji konstrukcyjnej.</p>
<p type="p_table_l">Daikin Europe N.V. este autorizat să compileze Dosarul tehnic de construcţie.</p>
</td>
<td>
<p type="p_table_l">
<b>19**</b>
</p>
<p type="p_table_l">
<b>20**</b>
</p>
<p type="p_table_l">
<b>21**</b>
</p>
<p type="p_table_l">
<b>22**</b>
</p>
<p type="p_table_l">
<b>23**</b>
</p>
<p type="p_table_l">
<b>24**</b>
</p>
<p type="p_table_l">
<b>25**</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. je pooblaščen za sestavo datoteke s tehnično mapo.</p>
<p type="p_table_l">Daikin Europe N.V. on volitatud koostama tehnilist dokumentatsiooni.</p>
<p type="p_table_l">Daikin Europe N.V. е оторизирана да състави Акта за техническа конструкция.</p>
<p type="p_table_l">Daikin Europe N.V. yra įgaliota sudaryti šį techninės konstrukcijos failą.</p>
<p type="p_table_l">Daikin Europe N.V. ir autorizēts sastādīt tehnisko dokumentāciju.</p>
<p type="p_table_l">Spoločnosť Daikin Europe N.V. je oprávnená vytvoriť súbor technickej konštrukcie.</p>
<p type="p_table_l">Daikin Europe N.V. Teknik Yapı Dosyasını derlemeye yetkilidir.</p>
</td>
</tr>
</tbody>
</table>
</table-container>
</content>
</node>
<node id="10979459083" class="TextModule2" type="TextModule" color="-1969921" wf-state="NotReleased" base-id="10979459083" chapternumbering="yes" showintoc="yes" creator="Jesse Victoor" creationtime="2021-06-17T15:37:52" modificator="Jesse Victoor" modificationtime="2021-06-17T15:37:52" releasedby="" releasedate="" releasedate-invariant="" versionnumber="1" content-creator="Jesse Victoor" content-creationtime="17/06/2021 15:37:52" content-creationtime-invariant="2021-06-17T15:37:52" content-modificator="Administrator" content-modificationtime="29/04/2016 02:25:51" content-modificationtime-invariant="2016-04-29T02:25:51" tms-state="Neutral" tms-state-displayname="Language-neutral" linkid="e927b213-cf96-4fff-b61d-d9b27a7a0ec4" xmlns:st4="http://www.schema.de/2002/ST4/DocuManager" aspect-name="en" specific-culture="en-GB">
<title>DENV is authorized to compiled TCF table</title>
<metalist>
<meta name="Title">DENV is authorized to compiled TCF table</meta>
</metalist>
<content>
<table-container type="noframe">
<table hsdl-cm="0.40 0.40 6.67 0.40 6.88 0.40 6.87 0.40 6.28" type="scaled">
<tbody>
<tr>
<td>
<p type="p_table_l">
<b/>
</p>
</td>
<td>
<p type="p_table_l">
<b>01***</b>
</p>
<p type="p_table_l">
<b>02***</b>
</p>
<p type="p_table_l">
<b>03***</b>
</p>
<p type="p_table_l">
<b>04***</b>
</p>
<p type="p_table_l">
<b>05***</b>
</p>
<p type="p_table_l">
<b>06***</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. is authorised to compile the Technical Construction File.</p>
<p type="p_table_l">Daikin Europe N.V. hat die Berechtigung die Technische Konstruktionsakte zusammenzustellen.</p>
<p type="p_table_l">Daikin Europe N.V. est autorisé à compiler le Dossier de Construction Technique.</p>
<p type="p_table_l">Daikin Europe N.V. is bevoegd om het Technisch Constructiedossier samen te stellen.</p>
<p type="p_table_l">Daikin Europe N.V. está autorizado a compilar el Archivo de Construcción Técnica.</p>
<p type="p_table_l">Daikin Europe N.V. è autorizzata a redigere il File Tecnico di Costruzione.</p>
</td>
<td>
<p type="p_table_l">
<b>07***</b>
</p>
<p type="p_table_l">
<b>08***</b>
</p>
<p type="p_table_l">
<b>09***</b>
</p>
<p type="p_table_l">
<b>10***</b>
</p>
<p type="p_table_l">
<b>11***</b>
</p>
<p type="p_table_l">
<b>12***</b>
</p>
</td>
<td>
<p type="p_table_l">Η Daikin Europe N.V. είναι εξουσιοδοτημένη να συντάξει τον Τεχνικό φάκελο κατασκευής.</p>
<p type="p_table_l">A Daikin Europe N.V. está autorizada a compilar a documentação técnica de fabrico.</p>
<p type="p_table_l">Компания Daikin Europe N.V. уполномочена составить Комплект технической документации.</p>
<p type="p_table_l">Daikin Europe N.V. er autoriseret til at udarbejde de tekniske konstruktionsdata.</p>
<p type="p_table_l">Daikin Europe N.V. är bemyndigade att sammanställa den tekniska konstruktionsfilen.</p>
<p type="p_table_l">Daikin Europe N.V. har tillatelse til å kompilere den Tekniske konstruksjonsfilen.</p>
</td>
<td>
<p type="p_table_l">
<b>13***</b>
</p>
<p type="p_table_l">
<b>14***</b>
</p>
<p type="p_table_l">
<b>15***</b>
</p>
<p type="p_table_l">
<b>16***</b>
</p>
<p type="p_table_l">
<b>17***</b>
</p>
<p type="p_table_l">
<b>18***</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. on valtuutettu laatimaan Teknisen asiakirjan.</p>
<p type="p_table_l">Společnost Daikin Europe N.V. má oprávnění ke kompilaci souboru technické konstrukce.</p>
<p type="p_table_l">Daikin Europe N.V. je ovlašten za izradu Datoteke o tehničkoj konstrukciji.</p>
<p type="p_table_l">A Daikin Europe N.V. jogosult a műszaki konstrukciós dokumentáció összeállítására.</p>
<p type="p_table_l">Daikin Europe N.V. ma upoważnienie do zbierania i opracowywania dokumentacji konstrukcyjnej.</p>
<p type="p_table_l">Daikin Europe N.V. este autorizat să compileze Dosarul tehnic de construcţie.</p>
</td>
<td>
<p type="p_table_l">
<b>19***</b>
</p>
<p type="p_table_l">
<b>20***</b>
</p>
<p type="p_table_l">
<b>21***</b>
</p>
<p type="p_table_l">
<b>22***</b>
</p>
<p type="p_table_l">
<b>23***</b>
</p>
<p type="p_table_l">
<b>24***</b>
</p>
<p type="p_table_l">
<b>25***</b>
</p>
</td>
<td>
<p type="p_table_l">Daikin Europe N.V. je pooblaščen za sestavo datoteke s tehnično mapo.</p>
<p type="p_table_l">Daikin Europe N.V. on volitatud koostama tehnilist dokumentatsiooni.</p>
<p type="p_table_l">Daikin Europe N.V. е оторизирана да състави Акта за техническа конструкция.</p>
<p type="p_table_l">Daikin Europe N.V. yra įgaliota sudaryti šį techninės konstrukcijos failą.</p>
<p type="p_table_l">Daikin Europe N.V. ir autorizēts sastādīt tehnisko dokumentāciju.</p>
<p type="p_table_l">Spoločnosť Daikin Europe N.V. je oprávnená vytvoriť súbor technickej konštrukcie.</p>
<p type="p_table_l">Daikin Europe N.V. Teknik Yapı Dosyasını derlemeye yetkilidir.</p>
</td>
</tr>
</tbody>
</table>
</table-container>
</content>
</node>
<linklist>
<link name="TextModuleLink" id="10978969484" target="EU-declaration of conformity table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976335115" target-base-id="10976335115" source-id="10978966155" label="b1f75d7b-ada4-4088-ba31-fd7a1bb91b24" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10978970508" target="Declares under its sole responsibility that the PRODUCTS table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976381579" target-base-id="10976381579" source-id="10978966155" label="c8943c42-7c0f-4e77-806f-2872da269a4d" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10978971276" target="Are in conformity with the following directive(s) or regulation(s), provided that the products" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976429963" target-base-id="10976429963" source-id="10978966155" label="7300b119-c8ea-4945-ba51-84d50302a575" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10978972044" target="as amended." source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976569099" target-base-id="10976569099" source-id="10978966155" label="7658bf10-d95c-4cd9-b906-101f5e148949" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10978972812" target="following the provisions of:" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976643851" target-base-id="10976643851" source-id="10978966155" label="f603d081-6e60-48ba-b112-6d907865cc49" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10978973580" target="Certificate A-B-C" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10976733835" target-base-id="10976733835" source-id="10978966155" label="d2115d02-441c-4ec1-b82d-2210a4a62d08" color="-3283201" external="0"/>
<link name="TextModuleLink" id="10979511564" target="DENV is authorized to compiled TCF table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10979452683" target-base-id="10979452683" source-id="10978966155" label="f75ff48b-8f83-495c-972e-a2af20cd033c" color="-3283201" external="0"/>
<link name="TextModuleLink" id="11292616844" target="DENV is authorized to compiled TCF table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10979459083" target-base-id="10979459083" source-id="10978966155" label="e927b213-cf96-4fff-b61d-d9b27a7a0ec4" color="-3283201" external="0"/>
<link name="TextModuleLink" id="11292618380" target="DICZ is authorized to compiled TCF table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10979503627" target-base-id="10979503627" source-id="10978966155" label="d2338324-ba6d-4b20-862c-10efcb28fb7e" color="-3283201" external="0"/>
<link name="TextModuleLink" id="11292617612" target="DICZ is authorized to compiled TCF table" source="3P663050-1_DOC safety (non PED) - EU in scope of MD" target-id="10979493387" target-base-id="10979493387" source-id="10978966155" label="1dc84a97-13ba-4eed-840b-e0a1fe1360fa" color="-3283201" external="0"/>
</linklist>
</node>
</node>
</node>
</node>
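Two things in the second expression are worth checking. In XPath 1.0, `and` binds more tightly than `or`, so `contains(text(),'**')` applies only to the last name test unless the `or` chain is wrapped in parentheses. Also, the sample spells the suffix `asterik`, while the expression tests for `asterisk`. The sketch below is not XPath; it emulates the intended merged condition in Python over a trimmed, hypothetical stand-in for the metalist, using the `asterik` spelling from the sample. The function name `matching_meta_names` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the metalist of the InfoType12 node.
SAMPLE = """
<node class="InfoType12">
  <metalist>
    <meta name="custom.DOC.machinery1">Machinery 2006/42/EU</meta>
    <meta name="custom.DOC.machinery1.asterik">**</meta>
    <meta name="custom.DOC.machinery2">Low Voltage 2014/35/EU</meta>
    <meta name="custom.DOC.machinery3.asterik">*</meta>
  </metalist>
</node>
"""


def matching_meta_names(root):
    """Return names of meta elements that satisfy either merged condition."""
    hits = []
    for meta in root.findall("./metalist/meta"):
        name = meta.get("name", "")
        text = meta.text or ""
        if not name.startswith("custom.DOC.machinery"):
            continue
        is_marker = name.endswith(".asterik")  # spelling taken from the sample
        if is_marker and "**" in text:
            # (name is one of the marker fields) AND text contains '**'
            hits.append(name)
        elif not is_marker and text.startswith("Machinery"):
            # name is a machinery field AND text starts with 'Machinery'
            hits.append(name)
    return hits


print(matching_meta_names(ET.fromstring(SAMPLE)))
```

The explicit `if`/`elif` grouping plays the role that parentheses around the `or` chain would play in the XPath predicate.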
<tr><td class=term>1st param</td>
<td>PUTIN
<div class='info-icon'>
<a href='#' onmouseover='show_pd(351);' onmouseout='hide_pd(351);' id='info-icon-351'></a>
</div>
<div id='pd-351' style='display: none; position: absolute;'>
<b>СПРАВКА</b>
<br /><br />
<P align=justify><NOBR><STRONG>ABS</STRONG></NOBR>bla-bla-bla text</P>
<P align=justify>bla-bla-bla text 2</P>
<P align=justify>bla-bla-bla text 3</P>
<P align=justify>bla-bla-bla text 4</P>
</div>
</td>
I need to extract only "PUTIN".
So far I have:
//td[@class="term"][contains(text(), "1st param")]/following-sibling::td/[not(self::p)]
With some adjustments to your XML, the following XPath
//td[@class="term"][contains(text(), "1st param")]/following-sibling::td/node()[1]
outputs PUTIN.
The adjustments were changing <td class=term> to <td class="term"> and every <P align=justify> to <P align="justify"> (maybe not necessary in your setup, but required by the XPath evaluator I used).
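The same idea (take only the first text node of the cell that follows the "term" cell) can be sketched in Python with the standard library. This assumes the row has been made well-formed first, as described above; the trimmed `ROW` markup and the helper name `value_after_term` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed, shortened version of the row from the question (assumption).
ROW = """
<tr><td class="term">1st param</td>
<td>PUTIN
<div class="info-icon"><a href="#" id="info-icon-351"></a></div>
<div id="pd-351"><p align="justify">bla-bla-bla text</p></div>
</td></tr>
"""


def value_after_term(row, term):
    """Return the leading text of the td that follows the matching term cell."""
    cells = row.findall("td")
    for label, value in zip(cells, cells[1:]):
        if label.get("class") == "term" and term in (label.text or ""):
            # .text holds only the text before the first child element,
            # which corresponds to node()[1] in the XPath above.
            return (value.text or "").strip()
    return None


print(value_after_term(ET.fromstring(ROW), "1st param"))
```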
I am parsing a rating site to find out which ratings a given company has.
The ratings can vary between 1 and 5, and they can all be extracted with this code:
require 'mechanize'

a = Mechanize.new
page = a.get(url)
reviews = page.search(".reviewcontent")
reviews.each do |r|
rating = r.at_css(".s1, .s2, .s3, .s4, .s5")
puts rating # => <span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
puts rating.inspect # => #<Nokogiri::XML::Element:0x3fe0e108783c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1087440 name="class" value="s5">, #<Nokogiri::XML::Attr:0x3fe0e108742c name="itemprop" value="reviewRating">, #<Nokogiri::XML::Attr:0x3fe0e1087404 name="itemscope">, #<Nokogiri::XML::Attr:0x3fe0e10873dc name="itemtype" value="http://schema.org/Rating">] children=[#<Nokogiri::XML::Text:0x3fe0e108648c "\r\n ">, #<Nokogiri::XML::Element:0x3fe0e108634c name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e108625c name="itemprop" value="worstRating">, #<Nokogiri::XML::Attr:0x3fe0e1086248 name="content" value="1">]>, #<Nokogiri::XML::Element:0x3fe0e10898bc name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e10897cc name="itemprop" value="bestRating">, #<Nokogiri::XML::Attr:0x3fe0e10897b8 name="content" value="5">]>, #<Nokogiri::XML::Element:0x3fe0e1088b10 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1088994 name="itemprop" value="ratingValue">, #<Nokogiri::XML::Attr:0x3fe0e1088980 name="content" value="5">]>]>
end
I am interested in this line: <meta itemprop="ratingValue" content="5"> and specifically the value of content, which in this case is 5.
How do I extract this value?
Edit:
puts reviews.to_html gives this result:
<div class="reviewcontent">
<p class="r-m ">
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
</p>
<time datetime="2011-09-15T18:16:10.0000000+02:00" class="ndate strong" title="15. september 2011 - 18:16:10" pubdate>
15. september 2011
<span title="2011-09-15T18:16:10.0000000+02:00"></span>
</time><meta itemprop="dateCreated" content="2011-09-15T18:16:10.0000000+02:00">
<h3 itemprop="headline" class="summary da">
Tip Top
</h3>
<p itemprop="reviewBody">
Bestilte en del fluer, en krogskærper og andre småting.<br>Kom 3 dage efter bestilling og alt var, som det skulle.
</p>
<span class="imagezoom">
</span>
<div class="actions">
<input type="hidden" name="ReviewId" value="4e7240ea00006400020e3b0e"><input type="hidden" name="UserName" value="Strit"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4e7240ea00006400020e3b0e#allcomments" class="comments fb-comments-label" id="FB-comment-box-0">
<span></span>
Kommentar (<comments-count href="http://trustpilot.com/review/scandicfly.dk#4e7240ea00006400020e3b0e">?</comments-count>)
</a>
<a class="useful" data-reviewid="4e7240ea00006400020e3b0e" href="#"><span> </span>
Find nyttig
</a>
<a class="replyAsCompany" href="#"><span></span>
Svar som firma
</a>
<a class="report" data-reviewid="932622" href="#"><span></span>
Rapportér
</a>
</div>
<div class="fb-comments-wrapper">
<div class="social-guidelines">Sociale retningslinjer</div>
</div>
<div class="companyComments" id="CompanyComments_932622">
<div class="companyComments" id="CompanyComments_4e7240ea00006400020e3b0e">
</div>
</div>
</div><div class="reviewcontent">
<p class="r-m ">
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
</p>
<time datetime="2011-04-05T16:05:06.0000000+02:00" class="ndate" title="5. april 2011 - 16:05:06" pubdate>
5. april 2011
<span title="2011-04-05T16:05:06.0000000+02:00"></span>
</time><meta itemprop="dateCreated" content="2011-04-05T16:05:06.0000000+02:00">
<h3 itemprop="headline" class="summary da">
en god og flot oplevelse
</h3>
<p itemprop="reviewBody">
Købte en fiskestang hos ScandicFly. Faktra ordrebekræftigelse og det hele præsenteret meget flot. Der kom desuden et notis om min fiskestang var afsendt.<br>Et par dage efter kom min fiskestang med posten forsvarligt pakket ind.
</p>
<span class="imagezoom">
</span>
<div class="actions">
<input type="hidden" name="ReviewId" value="4d9b3db2000064000209035f"><input type="hidden" name="UserName" value="Peter Leter"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4d9b3db2000064000209035f#allcomments" class="comments fb-comments-label" id="FB-comment-box-1">
<span></span>
Kommentar (<comments-count href="http://trustpilot.com/review/scandicfly.dk#4d9b3db2000064000209035f">?</comments-count>)
</a>
<a class="useful" data-reviewid="4d9b3db2000064000209035f" href="#"><span></span>
Find nyttig
</a>
<a class="replyAsCompany" href="#"><span></span>
Svar som firma
</a>
<a class="report" data-reviewid="590687" href="#"><span></span>
Rapportér
</a>
</div>
<div class="fb-comments-wrapper">
<div class="social-guidelines">Sociale retningslinjer</div>
</div>
<div class="companyComments" id="CompanyComments_590687">
<div class="companyComments" id="CompanyComments_4d9b3db2000064000209035f">
</div>
</div>
You can use the XPath below:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-_HTML_
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5">
</span>
_HTML_
doc.at("//meta[@itemprop = 'ratingValue']/@content").to_s
# => "5"
In your case, write it as below (the leading dot keeps the search relative to the matched span):
r.at_css(".s1, .s2, .s3, .s4, .s5").at(".//meta[@itemprop = 'ratingValue']/@content").to_s
Just to clean up Babai's answer a bit, how about:
doc.at('meta[itemprop="ratingValue"]')[:content]
Actually you could have just:
rating[:class][/\d/]
See why?
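Both ideas (reading the content attribute of the ratingValue meta, or taking the digit out of the s1..s5 class name) can be sketched without Nokogiri. The following uses Python's standard `html.parser` and is only an illustration: the class `RatingParser`, the helper `extract_ratings`, and the second review with rating 3 are hypothetical.

```python
import re
from html.parser import HTMLParser

# First review is shortened from the question; the second one is made up.
HTML = """
<div class="reviewcontent">
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
</div>
<div class="reviewcontent">
<span class="s3" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="3"></span>
</div>
"""


class RatingParser(HTMLParser):
    """Collect ratings from the meta tags and from the span class names."""

    def __init__(self):
        super().__init__()
        self.from_meta = []
        self.from_class = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("itemprop") == "ratingValue":
            self.from_meta.append(int(attrs["content"]))
        elif tag == "span" and attrs.get("itemprop") == "reviewRating":
            # The class is s1..s5, so its digit is the rating itself.
            match = re.search(r"\d", attrs.get("class") or "")
            if match:
                self.from_class.append(int(match.group()))


def extract_ratings(html):
    parser = RatingParser()
    parser.feed(html)
    return parser.from_meta, parser.from_class


print(extract_ratings(HTML))
```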
Hi, I have the markup structure below; I grabbed just one column from the entire table.
Within the td with class "mon", every tag other than those with class "monTime" should be hidden (that will be done with CSS). Then all "staffBox" divs should be re-sorted according to the value in the p tag with class "monTime".
So in the example below, the order should be Vanessa, Adele, then Zoe (naturally sorted by their starting time in "monTime").
I don't mind changing the class-name structure if that is necessary.
<td class="mon">
<div class="staffBox">
<h4>Adele</h4>
<p class="monTime">7AM - 7AM</p>
<p class="tueTime">7AM - 6AM</p>
<p class="wedTime">12AM - 5AM</p>
<p class="thuTime">8AM - 12AM</p>
<p class="friTime">6AM - 12AM</p>
<p class="satTime">12AM - 10AM</p>
<p class="sunTime">12AM - 9AM</p>
</div>
<div class="staffBox">
<h4>Zoe</h4>
<p class="monTime">1PM - 6PM</p>
<p class="tueTime"> - </p>
<p class="wedTime"> - </p>
<p class="thuTime"> - </p>
<p class="friTime"> - </p>
<p class="satTime"> - </p>
<p class="sunTime"> - </p>
</div>
<div class="staffBox">
<h4>Vanessa</h4>
<p class="monTime">3AM - 6AM</p>
<p class="tueTime"> - </p>
<p class="wedTime"> - </p>
<p class="thuTime"> - </p>
<p class="friTime"> - </p>
<p class="satTime"> - </p>
<p class="sunTime"> - </p>
</div>
</td>
Have a look at the jQuery tablesorter plugin, it's what I always use for sorting.
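Whichever plugin does the DOM work, the core of the problem is a sort key that turns a range such as "7AM - 7AM" into a comparable start time. Here is a minimal sketch of that key in Python, kept separate from any jQuery code; the helper name `start_hour` and the choice to sort blank ranges last are assumptions.

```python
import re


def start_hour(time_range):
    """Convert the start of a range like '1PM - 6PM' to a 24-hour number."""
    match = re.match(r"\s*(\d{1,2})(AM|PM)", time_range)
    if match is None:
        return 24  # blank ranges such as ' - ' sort after every real time
    hour = int(match.group(1)) % 12  # 12AM -> 0, 12PM -> 0 before the offset
    return hour + 12 if match.group(2) == "PM" else hour


# (name, monTime) pairs taken from the markup in the question.
STAFF = [("Adele", "7AM - 7AM"), ("Zoe", "1PM - 6PM"), ("Vanessa", "3AM - 6AM")]

ordered = sorted(STAFF, key=lambda entry: start_hour(entry[1]))
print([name for name, _ in ordered])
```

The same key function could be ported to a custom parser or comparator in whatever client-side sorter is used.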
I receive HTML like that shown below from a server. I rebuild the textual part by using the XPath expression @"//text()" and appending the "nodeContent" value to a string. The code is something like this:
for (int i=2; i<[resultXPathQuery count]; i++) {
[mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:@"nodeContent"]];
[mytext appendString:@"\n"];
}
I obtain:
Line 1
line 2
line 3
line 4
How can I build the textual part so that empty nodes are also taken into account?
I would like to obtain:
Line 1
line 2
line 3
line 4
<html><head><title>A title</title><style type="text/css">
ol{margin:0;padding:0}p{margin:0}
.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
.c7{color:#aaaaaa;font-family:Times New Roman}
.c3{color:#0000ee;text-decoration:underline}
.c5{color:inherit;text-decoration:inherit}
.c2{font-size:12pt;font-family:Times New Roman}
.c4{height:12pt}.c1{direction:ltr}
body{color:#000000;font-size:12pt;font-family:Times New Roman}
h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font- family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
</head>
<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c4 c1"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c3 c2"><span class="c1"></span></p>
<p class="c1"><span class="c7">line 4</span></p>
</body></html>
EDIT
Actually, I noticed that the HTML can be more complicated, so selecting all the span elements or all the p elements is not enough. Moreover, several span elements can appear in the same p element, and in that case I must not start a new line in my string.
This is the body of a more complicated returned html:
<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p>
<p class="c1 c10"><span></span></p>
<p class="c4"><span>gfgfgfd</span></p>
<p class="c4"><span>f</span></p>
<p class="c4">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<p class="c4"><span class="c7">ghgfhgfh</span></p>
<p class="c4"><span class="c7">gfhgfhgf</span></p>
<p class="c5">
<span class="c7">hgfh </span>
<span class="c0">gfdgfg</span></p>
<p class="c5"><span class="c0">fgfdgfdgfd</span></p>
<p class="c5"><span class="c0">gdfgdfgfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">fgfdgfdg</span></p>
<p class="c5">
<span class="c0">fgffgfdgfg</span>
<span class="c0 c11">gfgfdgfd fgd fd</span>
<span class="c0">fdgfdg</span></p>
<p class="c5"><span class="c0">fgfdgfdgf</span></p>
<p class="c5"><span class="c0">gfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gdfgfd</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c5"><span class="c0">gfhgfh</span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c2" start="1">
<li class="c3"><span class="c0">gfhg</span></li>
<li class="c3"><span class="c0">hgfh</span></li>
<li class="c3"><span class="c0">hgf</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
</body>
I need an XPath expression that selects the p, h1, h2, ..., h6 and li elements and handles their inner text in such a way that new lines and empty lines are properly detected.
For the example above you can use //span which will return all the <span> elements regardless of their contents. It looks like you are doing some other filtering also because //text() should also return your CSS block and A Title from the <title> and first <span>.
I would rather use a regex for this one:
Grab all the content between the body tags (you can also do that with XPath)
Replace </p> by </p>\n
Strip tags
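The three steps above can be sketched as follows. This is a rough illustration in Python rather than Objective-C, and it makes a few assumptions that are not in the original answer: line breaks in the HTML source are discarded first so that several spans inside one p stay on one line, and the closing tags of h1 to h6 and li are treated like `</p>`. The function name `body_to_text` is hypothetical.

```python
import re
from html import unescape


def body_to_text(html):
    """Turn block-level elements into lines; empty blocks become empty lines."""
    # Step 1: grab everything between the body tags.
    match = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body = match.group(1) if match else html
    # Source line breaks carry no meaning here, so drop them (assumption).
    body = re.sub(r"\s*\n\s*", "", body)
    # Step 2: put a newline after each closing block tag.
    body = re.sub(r"</(p|h[1-6]|li)>", r"</\1>\n", body, flags=re.I)
    # Step 3: strip the remaining tags and decode entities.
    return unescape(re.sub(r"<[^>]+>", "", body))


SAMPLE = (
    '<html><head><style>p{margin:0}</style></head><body class="c6">'
    '<p class="c1"><span>Line 1</span></p>\n'
    '<p class="c4 c1"><span></span></p>\n'
    '<p class="c1">\n<span>line </span>\n<span>2</span></p>\n'
    "</body></html>"
)

print(repr(body_to_text(SAMPLE)))
```

Regex-based tag stripping is fragile on arbitrary HTML, so this is only reasonable when the server output is as regular as the samples shown.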
Can I use Html Agility Pack to make the output nicely indented, with unnecessary whitespace stripped?
HAP is not going to give you the results you are after.
Try using a .net wrapper for HtmlTidy such as the one found here
using System;
using System.IO;
using System.Net;
using Mark.Tidy;

namespace CleanupHtml
{
    /// <summary>
    /// http://markbeaton.com/SoftwareInfo.aspx?ID=81a0ecd0-c41c-48da-8a39-f10c8aa3f931
    /// </summary>
    internal class Program
    {
        private static void Main(string[] args)
        {
            // Download the page to clean up.
            string html =
                new WebClient().DownloadString(
                    "http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat/2610903#2610903");

            using (Document doc = new Document(html))
            {
                // Suppress diagnostics and ask for indented XHTML output.
                doc.ShowWarnings = false;
                doc.Quiet = true;
                doc.OutputXhtml = true;
                doc.OutputXml = true;
                doc.IndentBlockElements = AutoBool.Yes;
                doc.IndentAttributes = false;
                doc.IndentCdata = true;
                doc.AddVerticalSpace = false;
                doc.WrapAt = 120;
                doc.CleanAndRepair();

                // Write the cleaned markup to the console and to a file.
                string output = doc.Save();
                Console.WriteLine(output);
                File.WriteAllText("output.htm", output);
            }
        }
    }
}
Results:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy for Windows (vers 14 October 2008), see www.w3.org" />
<title>
Html Agility Pack: make code look neat - Stack Overflow
</title>
<link rel="stylesheet" href="http://sstatic.net/so/all.css?v=6638" type="text/css" />
<link rel="shortcut icon" href="http://sstatic.net/so/favicon.ico" />
<link rel="apple-touch-icon" href="http://sstatic.net/so/apple-touch-icon.png" />
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href=
"http://sstatic.net/so/opensearch.xml" />
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js">
</script>
<script type="text/javascript" src="http://sstatic.net/so/js/master.js?v=6523">
</script>
<script type="text/javascript">
//<![CDATA[
var imagePath='http://sstatic.net/so/img/';
//]]>
</script>
<link rel="canonical" href="http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat" />
<link rel="alternate" type="application/atom+xml" title=
"Feed for question 'Html Agility Pack: make code look neat'" href="/feeds/question/2593147" />
<script src="http://sstatic.net/so/js/question.js?v=6714" type="text/javascript">
</script>
<script type="text/javascript">
//<![CDATA[
var fkey = "b00609a1a5f2966a687eca3f84e4dd64";
$(function() {
vote.init(2593147);
comments.init();
styleCode();
});
//]]>
</script>
</head>
<body>
<noscript>
<div id="noscript-padding"></div></noscript>
<div id="notify-container"></div><script type="text/javascript">
//<![CDATA[
$(function() { notify.showFirstTime(); });
//]]>
</script>
<div class="container">
<div id="header">
<div id="topbar">
<div id="hlinks">
<a href=
"/users/login?returnurl=%2fquestions%2f2593147%2fhtml-agility-pack-make-code-look-neat%2f2610903">login</a>
<span class="lsep">|</span> careers <span class=
"lsep">|</span> about <span class="lsep">|</span> faq
</div>
<div id="hsearch">
<form id="search" action="/search" method="get" name="search">
<div>
<input name="q" class="textbox" tabindex="1" onfocus="if (this.value=='search') this.value = ''" type=
"text" maxlength="80" size="28" value="search" />
</div>
</form>
</div>
</div><br class="cbt" />
<div id="hlogo">
<img src="http://sstatic.net/so/img/logo.png" width="250" height="61" alt="Stack Overflow" />
</div>
<div id="hmenus">
<div class="nav">
<ul>
<li class="youarehere">
Questions
</li>
<li>
Tags
</li>
<li>
Users
</li>
<li>
Badges
</li>
<li>
Unanswered
</li>
</ul>
</div>
<div class="nav" style="float:right">
<ul>
<li style="margin-right:0px">
Ask Question
</li>
</ul>
</div>
</div>
</div>
<div id="content">
<div id="question-header">
<h2>
<a href="/questions/2593147/html-agility-pack-make-code-look-neat" class="question-hyperlink">Html Agility
Pack: make code look neat</a>
</h2>
</div>
<div id="mainbar">
<div id="question" class="">
<div class="everyonelovesstackoverflow">
<script type="text/javascript">
//<![CDATA[
document.write('<s'+'cript lang' + 'uage="jav' + 'ascript" src="http://ads.stackoverflow.com/a.aspx?ZoneID=3&Task=Get&IFR=False&PageID=52405&SiteID=1&Random=' + (+new Date()) + '&Keywords=htmlagilitypack">');
document.write('</'+'scr'+'ipt>');
//]]>
</script> <noscript>
<div>
<a href=
"http://ads.stackoverflow.com/a.aspx?ZoneID=3&Task=Click&Mode=HTML&SiteID=1&PageID=52405">
<img src=
"http://ads.stackoverflow.com/a.aspx?ZoneID=3&Task=Get&Mode=HTML&SiteID=1&PageID=52405"
alt="" /></a>
</div></noscript>
</div>
<table>
<tr>
<td class="votecell">
<div class="vote">
<input type="hidden" value="2593147" /> <img class="vote-up" src=
"http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
"This question is useful and clear (click again to undo)" /> <span class="vote-count-post">1</span>
<img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
alt="vote down" title="This question is unclear or not useful (click again to undo)" /> <img class=
"vote-favorite" src="http://sstatic.net/so/img/vote-favorite-off.png" width="32" height="31" alt=
"star" title="This is a favorite question (click again to undo)" />
<div class="favoritecount"></div>
</div>
</td>
<td>
<div>
<div class="post-text">
<p>
Can I use Html Agility Pack to make the output look nicely indented, unnecessary white space
stripped?
</p>
</div>
<div class="post-taglist">
<a href="/questions/tagged/htmlagilitypack" class="post-tag" title=
"show questions tagged 'htmlagilitypack'" rel="tag">htmlagilitypack</a>
</div>
<table class="fw">
<tr>
<td class="vt">
<div class="post-menu">
<a id="flag-post-2593147" title="flag this post for serious problems" name=
"flag-post-2593147">flag</a>
</div>
</td>
<td class="post-signature owner">
<div class="user-info">
<div class="user-action-time">
asked <span title="2010-04-07 14:13:47Z" class="relativetime">2 days ago</span>
</div>
<div class="user-gravatar32">
<a href="/users/51795/illdev"><img src=
"http://www.gravatar.com/avatar/52dc0db2cdacc6e9769d074a37466317?s=32&d=identicon&r=PG"
height="32" width="32" alt="" /></a>
</div>
<div class="user-details">
illdev<br />
<span class="reputation-score" title="reputation score">53</span><span title=
"5 bronze badges"><span class="badge3">●</span><span class=
"badgecount">5</span></span>
</div>
</div><br class="cbt" />
<div class="accept-rate cool" title=
"this user has accepted an answer for 2 of 4 eligible questions">
50% accept rate
</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
<tr>
<td class="votecell"></td>
<td>
<div id="comments-2593147" class="comments">
<table>
<tbody>
<tr id="comment-2600849" class="comment">
<td></td>
<td class="comment-text">
<div>
what output? From where? some more details perhaps? – <a href=
"/users/97614/sam-holder" title="1868" class="comment-user">Sam Holder</a> <span class=
"comment-date"><span title="2010-04-07 14:16:41Z">2 days ago</span></span>
</div>
</td>
</tr>
<tr id="comment-2600851" class="comment">
<td></td>
<td class="comment-text">
<div>
<i>(reference)</i> <a href="http://htmlagilitypack.codeplex.com/Wikipage" rel=
"nofollow">htmlagilitypack.codeplex.com/Wikipage</a> – <a href=
"/users/208809/gordon" title="16497" class="comment-user">Gordon</a> <span class=
"comment-date"><span title="2010-04-07 14:16:55Z">2 days ago</span></span>
</div>
</td>
</tr>
<tr id="comment-2624419" class="comment">
<td></td>
<td class="comment-text">
<div>
output = html code output – <a href="/users/51795/illdev" title="53" class=
"comment-user owner">illdev</a> <span class="comment-date"><span title=
"2010-04-10 13:14:42Z">12 secs ago</span></span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
</div>
<div id="answers">
<a name="tab-top" id="tab-top"></a>
<div id="answers-header">
<div id="subheader">
<h2>
2 Answers
</h2>
<div id="tabs">
<a href="/questions/2593147?tab=oldest#tab-top" title=
"Answers in the order they were given">oldest</a> <a href="/questions/2593147?tab=newest#tab-top"
title="Most recent answers first">newest</a> <a class="youarehere" href=
"/questions/2593147?tab=votes#tab-top" title="Answers with the most votes first">votes</a>
</div>
</div>
</div><a name="2610845"></a>
<div id="answer-2610845" class="answer">
<table>
<tr>
<td class="votecell">
<div class="vote">
<input type="hidden" value="2610845" /> <img class="vote-up" src=
"http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
"This answer is useful (click again to undo)" /> <span class="vote-count-post">0</span>
<img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
alt="vote down" title="This answer is not useful (click again to undo)" />
</div>
</td>
<td>
<div class="post-text">
<p>
A variation of this question has been answered recently
</p>
<ul>
<li>
<a href=
"http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak/2507673#2507673">
http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak/2507673#2507673</a>
</li>
</ul>
<p>
Basically the outcome of this was that while you <strong>can</strong> use HtmlAgilityPack to
clean it up a bit by using the fix nested tags.
</p>
<p>
The best solution is to use something called Tidy which is an application that was originally
created by some developers at w3c and then made open source. Its the engine that powers the w3c
validator as well.
</p>
<p>
This article covers how to use it but you had to sign up (free) to view it:
</p>
<ul>
<li>
<a href="http://www.devx.com/dotnet/Article/20505/1763/" rel=
"nofollow">http://www.devx.com/dotnet/Article/20505/1763/</a>
</li>
</ul>
<p>
It seems like a legit article but its funny because nobody else seems to have covered this
topic in the last six years...
</p>
</div>
<table class="fw">
<tr>
<td class="vt">
<div class="post-menu">
<a href="/questions/2593147/html-agility-pack-make-code-look-neat/2610845#2610845" title=
"permalink to this answer">link</a><span class="lsep">|</span><a id="flag-post-2610845"
title="flag this post for serious problems" name="flag-post-2610845">flag</a>
</div>
</td>
<td align="right" class="post-signature">
<div class="user-info">
<div class="user-action-time">
answered <span title="2010-04-09 20:55:18Z" class="relativetime">16 hours ago</span>
</div>
<div class="user-gravatar32">
<a href="/users/156388/rtpharry"><img src=
"http://www.gravatar.com/avatar/6811db2b37e824fdf6c5c4fcdddd4146?s=32&d=identicon&r=PG"
height="32" width="32" alt="" /></a>
</div>
<div class="user-details">
rtpHarry<br />
<span class="reputation-score" title="reputation score">88</span><span title=
"6 bronze badges"><span class="badge3">●</span><span class=
"badgecount">6</span></span>
</div>
</div>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="votecell"></td>
<td>
<div id="comments-2610845" class="comments dno">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
</div>
<div class="everyonelovesstackoverflow">
<script type="text/javascript">
//<![CDATA[
document.write('<s'+'cript lang' + 'uage="jav' + 'ascript" src="http://ads.stackoverflow.com/a.aspx?ZoneID=14&Task=Get&IFR=False&PageID=52405&SiteID=1&Random=' + (+new Date()) + '&Keywords=htmlagilitypack">');
document.write('</'+'scr'+'ipt>');
//]]>
</script> <noscript>
<div>
<a href=
"http://ads.stackoverflow.com/a.aspx?ZoneID=14&Task=Click&Mode=HTML&SiteID=1&PageID=52405">
<img src=
"http://ads.stackoverflow.com/a.aspx?ZoneID=14&Task=Get&Mode=HTML&SiteID=1&PageID=52405"
alt="" /></a>
</div></noscript>
</div><a name="2610903"></a>
<div id="answer-2610903" class="answer">
<table>
<tr>
<td class="votecell">
<div class="vote">
<input type="hidden" value="2610903" /> <img class="vote-up" src=
"http://sstatic.net/so/img/vote-arrow-up.png" width="40" height="25" alt="vote up" title=
"This answer is useful (click again to undo)" /> <span class="vote-count-post">0</span>
<img class="vote-down" src="http://sstatic.net/so/img/vote-arrow-down.png" width="40" height="25"
alt="vote down" title="This answer is not useful (click again to undo)" />
</div>
</td>
<td>
<div class="post-text">
<p>
Output as XHTML and run that through an <a href=
"http://msdn.microsoft.com/en-us/library/system.xml.xmltextwriter.indentation.aspx" rel=
"nofollow">XmlTextWriter</a>
</p>
</div>
<table class="fw">
<tr>
<td class="vt">
<div class="post-menu">
<a href="/questions/2593147/html-agility-pack-make-code-look-neat/2610903#2610903" title=
"permalink to this answer">link</a><span class="lsep">|</span><a id="flag-post-2610903"
title="flag this post for serious problems" name="flag-post-2610903">flag</a>
</div>
</td>
<td align="right" class="post-signature">
<div class="user-info">
<div class="user-action-time">
answered <span title="2010-04-09 21:02:34Z" class="relativetime">16 hours ago</span>
</div>
<div class="user-gravatar32">
<a href="/users/242897/sky-sanders"><img src=
"http://www.gravatar.com/avatar/df4a7fbd8a054fd6193ca0ee62952f1f?s=32&d=identicon&r=PG"
height="32" width="32" alt="" /></a>
</div>
<div class="user-details">
Sky Sanders<br />
<span class="reputation-score" title="reputation score">4,014</span><span title=
"2 silver badges"><span class="badge2">●</span><span class=
"badgecount">2</span></span><span title="14 bronze badges"><span class=
"badge3">●</span><span class="badgecount">14</span></span>
</div>
</div>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="votecell"></td>
<td>
<div id="comments-2610903" class="comments dno">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
</div>
<form id="post-form" action="/questions/2593147/answer/submit" method="post" name="post-form">
<h2 class="space">
Your Answer
</h2><script src="http://sstatic.net/so/Js/wmd.js?v=6016" type="text/javascript">
</script> <script type="text/javascript">
//<![CDATA[
$(function() {
editorReady(1, heartbeat.answers);
});
//]]>
</script>
<div id="post-editor">
<div id="wmd-container">
<div id="wmd-button-bar"></div>
<textarea id="wmd-input" name="post-text" cols="92" rows="15" tabindex="101">
</textarea>
</div>
<div class="community-option">
<input id="communitymode" name="communitymode" type="checkbox" /> <label for="communitymode" title=
"community owned posts do not generate any reputation for the owner, have a lower reputation barrier for collaborative editing, and show only a revision history instead of a signature block">
community wiki</label>
</div>
<div id="wmd-preview"></div>
<div id="edit-block">
<input id="fkey" name="fkey" type="hidden" value="b00609a1a5f2966a687eca3f84e4dd64" /> <input id=
"author" name="author" type="text" />
</div>
</div>
<div class="form-item">
<table>
<tr>
<td class="vm">
<label for="openid_identifier">OpenID Login</label> <input id="openid_identifier" name=
"openid_identifier" class="openid-identifer" type="text" size="40" maxlength="200" value=""
tabindex="104" />
<div class="form-item-info">
Get an OpenID
</div>
</td>
<td class="orcell">
<div class="orword">
or
</div>
<div class="orline"></div>
</td>
<td class="vm">
<div>
<label for="display-name">Name</label> <input id="display-name" name="display-name" type="text"
size="30" maxlength="30" value="" tabindex="105" />
</div>
<div>
<label for="m-address">Email</label> <input id="m-address" name="m-address" type="text" size=
"40" maxlength="100" value="" tabindex="106" /> <span class="edit-field-overlay" style=
"color:#999; font-weight:normal">never shown</span>
</div>
<div>
<label for="home-page">Home Page</label> <input id="home-page" name="home-page" type="text"
size="40" maxlength="200" value="" tabindex="107" />
</div>
</td>
</tr>
</table>
</div>
<div class="form-submit cbt">
<input id="submit-button" type="submit" value="Post Your Answer" tabindex="110" />
</div>
</form>
<h2 class="space">
Not the answer you're looking for? Browse other questions tagged <a href=
"/questions/tagged/htmlagilitypack" class="post-tag" title="show questions tagged 'htmlagilitypack'" rel=
"tag">htmlagilitypack</a> or ask your own question.
</h2>
</div><img src="/posts/2593147/ivc/1707" class="dno" alt="" />
</div>
A variation of this question was answered recently:
Which is the best HTML tidy pack? Is there any option in HTML agility pack to make HTML webpage tidy?
Basically, the outcome was that while you can use HtmlAgilityPack to clean the markup up a bit by using its fix-nested-tags option, the best solution is to use something called Tidy, an application that was originally created by some developers at the W3C and then made open source. It's the engine that powers the W3C validator as well.
This article covers how to use it, but you have to sign up (free) to view it:
http://www.devx.com/dotnet/Article/20505/1763/
It seems like a legit article, but it's funny because nobody else seems to have covered this topic in the last six years...
See a similar question here: HtmlAgilityPack: how to create indented HTML? and my answer:
No, and it's a "by design" choice. There is a big difference between XML (or XHTML, which is XML, not HTML), where most of the time whitespace has no specific meaning, and HTML.
This is not such a minor improvement, as changing whitespace can change the way some browsers render a given HTML chunk, especially malformed HTML (which is in general handled well by the library). The Html Agility Pack was designed to minimize changes to the way the HTML is rendered, not to the way the markup is written.
I'm not saying it's not feasible or plain impossible. Obviously you can convert to XML and voilà (and you could write an extension method to make this easier), but the rendered output may be different in the general case.
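As a small illustration of why indenting can change the rendered output, here is a sketch in Python using the standard `xml.dom.minidom` module rather than the Html Agility Pack. The fragment and the helper name `text_of` are made up for illustration. It pretty-prints a fragment in which two inline elements touch, then compares the text content before and after.

```python
from xml.dom import minidom

# Hypothetical fragment: two inline elements with no whitespace between them.
COMPACT = "<p><b>bold</b><i>italic</i></p>"


def text_of(node):
    """Concatenate all text nodes under a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


original = minidom.parseString(COMPACT)
pretty = original.documentElement.toprettyxml(indent="  ")
reparsed = minidom.parseString(pretty)

print(pretty)
print(repr(text_of(original)))
print(repr(text_of(reparsed)))
```

The indented version contains whitespace between the <b> and <i> elements that was not in the original. A browser treats such whitespace between inline elements as a space, so the two words that were rendered joined together would now be rendered apart.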