xpath syntax in Scrapy - xpath

team = hxs.select('//table[@class="tablehead"/tbody/tr[contains[.@class, "player"]')
The structure of the web site whose table I want to select is as follows:
<html>
<body>
<table>
<tbody>
<tr>
<td>...</td>
<td>...</td>
...
</tr>
</tbody>
</table>
</body>
</html>
Since there are multiple tables in the web site, I only want to select the one whose class is defined as "tablehead". Also, for that table, I only want to select the tags whose class attributes contain the string "player". My attempt above looks a bit spotty to begin with. I tried running the crawler, and it says that the line I produced above is an invalid xpath line. Any advice would be nice.

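I've come across this problem before. Try omitting tbody from the XPath expression: browsers often add tbody to the DOM they display even when the raw HTML that Scrapy downloads does not contain it.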
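To illustrate the tbody point, here is a minimal Python sketch. It uses only the standard library's xml.etree.ElementTree rather than Scrapy itself, and the sample markup is made up for the demonstration (a table served without tbody). ElementTree supports only a subset of XPath, which is enough to show the difference between the two paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup: the server sends no <tbody> element.
RAW = (
    '<html><body><table class="tablehead">'
    '<tr class="player"><td>x</td></tr>'
    '</table></body></html>'
)

root = ET.fromstring(RAW)

# A path that insists on tbody finds nothing in the raw markup.
with_tbody = root.findall(".//table[@class='tablehead']/tbody/tr")

# Using the descendant axis (//) matches whether or not tbody is present.
without_tbody = root.findall(".//table[@class='tablehead']//tr")

print(len(with_tbody), len(without_tbody))
```

If the page you crawl really does contain tbody in its source, both paths match, so the second form is the safer default.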

//table[@class="tablehead"/tbody/tr[contains[.@class, "player"]
Correcting this results in:
//table[@class='tablehead']/tbody/tr[contains(@class, 'player')]
This selects every tr whose class attribute contains the string "player" and that is a child of a tbody that is a child of any table in the XML document whose class attribute has the string value "tablehead".
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"//table[@class='tablehead']
/tbody/tr[contains(@class, 'player')]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document (made just a little bit more realistic):
<html>
<body>
<table class="tablehead">
<tbody>
<tr class="major-player">
<td>player1</td>
<td>player2</td>
</tr>
</tbody>
</table>
</body>
</html>
the XPath expression is evaluated and the selected nodes (just one in this case) are copied to the output:
<tr class="major-player">
<td>player1</td>
<td>player2</td>
</tr>
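For anyone who wants to check the same selection outside XSLT, here is a rough Python equivalent using only the standard library. This is a sketch, not Scrapy code: ElementTree has no contains() function, so the substring test on the class attribute is done in plain Python, and the extra rows and second table in the sample are my own additions to show what gets filtered out.

```python
import xml.etree.ElementTree as ET

# The sample document from above, plus made-up rows that should NOT match.
DOC = """<html>
<body>
<table class="tablehead">
<tbody>
<tr class="major-player">
<td>player1</td>
<td>player2</td>
</tr>
<tr class="header">
<td>name</td>
</tr>
</tbody>
</table>
<table class="other">
<tbody>
<tr class="player"><td>ignored</td></tr>
</tbody>
</table>
</body>
</html>"""


def select_player_rows(xml_text):
    """Approximate //table[@class='tablehead']/tbody/tr[contains(@class, 'player')]."""
    root = ET.fromstring(xml_text)
    rows = root.findall(".//table[@class='tablehead']/tbody/tr")
    # ElementTree lacks contains(), so filter on the attribute value here.
    return [tr for tr in rows if "player" in tr.get("class", "")]


rows = select_player_rows(DOC)
for tr in rows:
    print(tr.get("class"), [td.text for td in tr.findall("td")])
```

Only the "major-player" row of the "tablehead" table is kept; the "header" row and the row in the other table are skipped.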

Related

Create sitemap from the content of the CommerceTools database

I need to create the sitemap file of my CommerceTools based shop and it would be great if it could be done automatically from the contents of the CTP database.
Do you know if there is a module, tool or extension already developed that allows this task?
EDIT->
I am aware that each online store can be built with a different technology.
In our specific case, the front-end is based on Sunrise for JVM, so it would be convenient for this tool to be created for this technology, although it is not essential.
I also recognize that each project can have its specific features that make it different from any other (mainly static content or from an external CMS) so I understand that creating a universal tool is very complex.
Anyway I think it would be great to have some tool that could be able to create a "sitemap-products.xml" from the most dynamic content of CTP using the slug of categories and products.
Then this "sitemap-products.xml" could be called from a sitemapindex from which you link both this and other secondary sitemaps that can be self-generated by the CMS (if you have it) and / or other more static that can be created and maintained manually by the development team.
<-EDIT
Thanks in advance.
I will give you a simple rule for creating a perfect sitemap from the database.
Sitemap.php :
<?php
$site = "https://yourdomain.com/"; // your site URL, with a trailing slash "/"
$chfreqprod = "weekly"; // how often the product pages change
$priority = "0.8"; // priority
$date = date("Y-m-d\TH:i:s+02:00", time()); // "i" is minutes ("m" would print the month)
define ('DB_USER', 'changeWithYourUser');
define ('DB_PASSWORD', 'changeWithYourPassword');
define ('DB_HOST', 'localhost');
define ('DB_NAME', 'changeWithYourDataBase');
// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO on current PHP versions.
$conn = mysql_connect(DB_HOST, DB_USER, DB_PASSWORD) or die("Could not connect to the database.");
mysql_select_db(DB_NAME, $conn) or die("Could not select the database.");
header("Content-Type: text/xml;charset=utf-8");
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<?xml-stylesheet type=\"text/xsl\" href=\"smap.xsl\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">";
$query = @mysql_query("SELECT * FROM products LIMIT 0,25000");
while($row = @mysql_fetch_array($query)){
$product = $row['product_seo'];
echo "<url>
<loc>".$site.$product.".html</loc>
<lastmod>".$date."</lastmod>
<changefreq>".$chfreqprod."</changefreq>
<priority>".$priority."</priority>
</url>";
}
echo "</urlset>";
?>
smap.xsl :
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:html="http://www.w3.org/TR/REC-html40"
xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="1.0" encoding="UTF-8" indent="yes" />
<xsl:template match="/">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>XML Sitemap</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="robots" content="noindex,follow" />
<style type="text/css">
body {
font-family:"Lucida Grande","Lucida Sans Unicode",Tahoma,Verdana;
font-size:13px;
}
#intro {
background-color:#CFEBF7;
border:1px #2580B2 solid;
padding:5px 13px 5px 13px;
margin:10px;
}
#intro p {
line-height:16.8667px;
}
#intro strong {
font-weight:normal;
}
table {
width:100%;
}
td {
font-size:11px;
}
th {
text-align:left;
padding-right:30px;
font-size:11px;
background-color:#E1E3EE;
}
tr.high {
background-color:whitesmoke;
}
tr:hover {
background-color:#E8EAF2;
}
#footer {
width:100%;
padding:2px;
margin-top:10px;
font-size:8pt;
color:gray;
text-align:center;
}
#footer a {
color:gray;
}
a {
color:#000;
text-decoration:none;
}
a:hover {
text-decoration:underline;
}
</style>
</head>
<body>
<xsl:apply-templates></xsl:apply-templates>
</body>
</html>
</xsl:template>
<xsl:template match="sitemap:urlset">
<h1 align="center">XML Sitemap</h1>
<div id="content">
<table cellpadding="5">
<tr style="border-bottom:1px black solid;">
<th width="70%">URL</th>
<th width="5%">Priority</th>
<th width="12%">Change frequency</th>
<th width="13%">Last modified</th>
</tr>
<xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:for-each select="./sitemap:url">
<tr>
<xsl:if test="position() mod 2 != 1">
<xsl:attribute name="class">high</xsl:attribute>
</xsl:if>
<td>
<xsl:variable name="itemURL">
<xsl:value-of select="sitemap:loc"/>
</xsl:variable>
<a href="{$itemURL}">
<xsl:value-of select="sitemap:loc"/>
</a>
</td>
<td>
<xsl:value-of select="concat(sitemap:priority*100,'%')"/>
</td>
<td>
<xsl:value-of select="concat(translate(substring(sitemap:changefreq, 1, 1),concat($lower, $upper),concat($upper, $lower)),substring(sitemap:changefreq, 2))"/>
</td>
<td>
<xsl:value-of select="concat(substring(sitemap:lastmod,0,11),concat(' ', substring(sitemap:lastmod,12,5)))"/>
</td>
</tr>
</xsl:for-each>
</table>
</div>
<div id="footer">Index Sitemap by www.adydev.com</div>
</xsl:template>
<xsl:template match="sitemap:sitemapindex">
<h1 align="center">XML Sitemap Index</h1>
<div id="content">
<table cellpadding="5">
<tr style="border-bottom:1px black solid;">
<th width="85%">URL of sub-sitemap</th>
<th width="15%">Last modified</th>
</tr>
<xsl:for-each select="./sitemap:sitemap">
<tr>
<xsl:if test="position() mod 2 != 1">
<xsl:attribute name="class">high</xsl:attribute>
</xsl:if>
<td>
<xsl:variable name="itemURL">
<xsl:value-of select="sitemap:loc"/>
</xsl:variable>
<a href="{$itemURL}">
<xsl:value-of select="sitemap:loc"/>
</a>
</td>
<td>
<xsl:value-of select="concat(substring(sitemap:lastmod,0,11),concat(' ', substring(sitemap:lastmod,12,5)))"/>
</td>
</tr>
</xsl:for-each>
</table>
</div>
<div id="footer">Index Sitemap by www.adydev.com</div>
</xsl:template>
</xsl:stylesheet>
.htaccess :
RewriteRule ^sitemap\.xml$ sitemap.php [L]
For multilanguage sitemap, index sitemap and automate sitemap, please contact me. Thank you!
There is no standard module or extension available; the sitemap is frontend-specific since everybody has different URL patterns and non-commerce content on the site.
A sitemap needs to be built fitting to the frontend technology your project is developed in.
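As a rough idea of what such a project-specific generator could look like, here is a minimal Python sketch that builds a "sitemap-products.xml" style document from a list of slugs. Everything concrete in it is an assumption for illustration: the base URL, the slugs, and the URL pattern (base URL plus slug) are made up, and fetching the slugs from the CTP API is left out entirely.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, slugs, changefreq="weekly", priority="0.8"):
    """Return a sitemap XML string with one <url> entry per slug."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for slug in slugs:
        url = ET.SubElement(urlset, "url")
        # Hypothetical URL pattern: adapt this to your frontend's routes.
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + slug
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


# Made-up category and product slugs standing in for data read from CTP.
xml_text = build_sitemap(
    "https://shop.example.com/",
    ["shoes", "shoes/red-sneaker"],
)
print(xml_text)
```

The resulting file could then be listed in a sitemap index next to the sitemaps produced by a CMS or maintained by hand.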
I have returned to this question to report that we finally solved our need with a module for Play Framework that generates sitemaps from the URLs you pass to it.
We downloaded the module from the repository of its creators (https://github.com/edulify/play-sitemap-module.edulify.com) and configured separate providers for products, categories and static pages, since we wanted each type of link to have its own refresh frequency and priority for search engines. With that in place, our sitemap.xml is now generated automatically every 24h.
If anyone needs help implementing this functionality in a Sunrise-based store, contact me and I will try to help.
Thank you very much to all for trying to help us.
Greetings.
Miguel

Nokogiri and tables

I am parsing a web page with a standard structure as follows:
<html>
<body>
<table>
<tbody>
<tr class="active">
<td>name1</td>
<td>name2</td>
<td>name3</td>
</tr>
</tbody>
</table>
</body>
</html>
For the life of me, I can't access the 'tbody' or 'tr' elements.
require 'nokogiri'
require 'open-uri'
response = URI.open('http://my_url')
node = Nokogiri::HTML(response).css('table')
puts node
Returns
#<Nokogiri::XML::Element:0x8294c08c name="table" attributes=[#<Nokogiri::XML::Attr:0x8294c014 name="id" value="beta-users">] children=[#<Nokogiri::XML::Text:0x82953bc0 "\n">]>
I have tried various tricks but can't seem to dig deeper down to a lower-level child than 'table'.
At best, I can get to the lowest-level Text object by using
node.children
but
node.children.text
returns "\n".
Despite searching for some hours, I am none the wiser about how to sort it out. Any thoughts?
There is an unclosed class attribute value in your sample; it should be:
<html>
<body>
<table>
<tbody>
<tr class="active">
<td>name1</td>
<td>name2</td>
<td>name3</td>
</tr>
</tbody>
</table>
</body>
</html>
After correcting this, you can:
node = Nokogiri::HTML(response).css('table tbody tr td')
node.each {|child| puts child.text}
name1
name2
name3

xpath - how to find an embedded li with an input element inside it?

Given this HTML:
<li class="check_boxes input optional" id="activity_roles_input">
<fieldset class="choices">
<legend class="label"><label>Roles</label></legend>
<input id="activity_roles_none" name="activity[role_ids][]" type="hidden" value="" />
<ol class="choices-group">
<li class="choice">
<label for="activity_role_ids_104">
<input id="activity_role_ids_104" name="activity[role_ids][]" type="checkbox" value="104" />Language Therapist
</label>
</li>
<li class="choice">
<label for="activity_role_ids_103">
<input id="activity_role_ids_103" name="activity[role_ids][]" type="checkbox" value="103" />Speech Therapist
</label>
</li>
</ol>
</fieldset>
</li>
I am trying to use Selenium and xpath with it.
I am trying to select the first 'checkbox' input element link.
I am having problems selecting the element.
I cannot use the db ID (104) as this is for repeated tests with new ID's each time. I need to select the 'first' input checkbox, based on it having the text for Language Therapist.
I have tried:
xpath=(//li[contains(@id,'activity_roles_input')])//input
and
xpath=(//li[contains(@id,'activity_roles_input')])//contains('Language Therapist")
but it is not finding the element.
When I do:
xpath=(//li[contains(@id,'activity_roles_input')])
it gets to the input set. The problem I am having is selecting the first input checkbox control for 'Language Therapist'.
First, find any <li> containing the text, and then look among its descendants for the first checkbox.
xpath=(//li[contains(., "Language Therapist")]/descendant::input[@type="checkbox"][1])
(From Michael)
The above worked for me. In the end I actually used
xpath=(//li[contains(@id,'activity_roles_input')]/descendant::input[@type="checkbox"][1])
because I liked identifying the element by its CSS ID.
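One thing worth checking with the text-based expression: the outer li (id "activity_roles_input") contains every label in its text, so it matches contains(., ...) for any role and its first checkbox is always the first one on the page. The Python sketch below shows one way to pick the checkbox per label. It uses the standard library only, not Selenium, and restricting the search to the inner li elements with class "choice" is my own assumption about what is wanted.

```python
import xml.etree.ElementTree as ET

# The markup from the question (it happens to be well-formed XML).
FORM = """<li class="check_boxes input optional" id="activity_roles_input">
<fieldset class="choices">
<legend class="label"><label>Roles</label></legend>
<input id="activity_roles_none" name="activity[role_ids][]" type="hidden" value="" />
<ol class="choices-group">
<li class="choice">
<label for="activity_role_ids_104">
<input id="activity_role_ids_104" name="activity[role_ids][]" type="checkbox" value="104" />Language Therapist
</label>
</li>
<li class="choice">
<label for="activity_role_ids_103">
<input id="activity_role_ids_103" name="activity[role_ids][]" type="checkbox" value="103" />Speech Therapist
</label>
</li>
</ol>
</fieldset>
</li>"""


def checkbox_for_label(root, label_text):
    """Return the first checkbox inside the li.choice whose text contains label_text."""
    for li in root.iter("li"):
        # Skip the outer li: its text contains every label.
        if li.get("class") != "choice":
            continue
        if label_text in "".join(li.itertext()):
            for inp in li.iter("input"):
                if inp.get("type") == "checkbox":
                    return inp
    return None


root = ET.fromstring(FORM)
print(checkbox_for_label(root, "Language Therapist").get("id"))
print(checkbox_for_label(root, "Speech Therapist").get("id"))
```

In XPath terms this corresponds roughly to //li[@class='choice'][contains(., "Speech Therapist")]//input[@type="checkbox"], which avoids relying on the database ID.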
An interesting thing to notice when I run this small XSL against your XML:
XSL:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="//li[@id ='activity_roles_input']">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
Roles
Language Therapist
Speech Therapist
You have
xpath=(//li[contains(@id,'activity_roles_input')])//input
Shouldn't that be
xpath=(//li[contains(@id,'activity_roles_input')]//input)
or rather
xpath=(//li[@id='activity_roles_input']//input)
?
xpath=(//li[@id='activity_roles_input']//input[1])

Ruby Nokogiri - XPATH using URL

I have this table:
<tr>
<td><b>Amount</b></td>
<td><b>Due Date</b></td>
<td><b>Link</b></td>
</tr>
<tr>
<td>02/13/2012</td>
<td>$81.66</td>
<td><a onclick="javascript:window.open('/cso/displaypdfbill?selectedBillkey=449409587','_blank');" href="javascript: void(0);">View Bill</a></td>
</tr>
<tr>
<td>01/13/2012</td>
<td>$181.66</td>
<td><a onclick="javascript:window.open('/cso/displaypdfbill?selectedBillkey=543409587','_blank');" href="javascript: void(0);">View Bill</a></td>
</tr>
I am looping through the table and extracting the Bill key in each row. I removed the Billkey and stored it into a variable.
BillKey = 449409587
What I want is to get the <tr> where that BillKey is located:
So I should have:
02/13/2012 $81.66 View Bill
I am having trouble writing the XPATH to get the <tr>.
Use:
string(table/tr
[td/a/@onclick
[substring
(.,
string-length()
- 21
)
=
$vEnding
]
]
)
where $vEnding must be substituted by the string: "=449409587','_blank');"
So, the complete XPath expression after this substitution is:
string(table/tr
[td/a/@onclick
[substring
(.,
string-length()
- 21
)
=
"=449409587','_blank');"
]
]
)
XSLT - based verification:
This XSLT transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="vEnding">=449409587','_blank');</xsl:variable>
<xsl:template match="/">
<xsl:copy-of select=
"string(table/tr
[td/a/@onclick
[substring
(.,
string-length()
- 21
)
=
$vEnding
]
]
)
"/>
</xsl:template>
</xsl:stylesheet>
when applied on the following XML document (the provided one wrapped in a single top element table):
<table>
<tr>
<td>
<b>Amount</b>
</td>
<td>
<b>Due Date</b>
</td>
<td>
<b>Link</b>
</td>
</tr>
<tr>
<td>02/13/2012</td>
<td>$81.66</td>
<td>
<a onclick=
"javascript:window.open('/cso/displaypdfbill?selectedBillkey=449409587','_blank');" href="javascript: void(0);">View Bill</a>
</td>
</tr>
<tr>
<td>01/13/2012</td>
<td>$181.66</td>
<td>
<a onclick=
"javascript:window.open('/cso/displaypdfbill?selectedBillkey=543409587','_blank');" href="javascript: void(0);">View Bill</a>
</td>
</tr>
</table>
evaluates the XPath expression and copies to the output the result of the evaluation:
02/13/2012
$81.66
View Bill
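The "- 21" in the expression follows from the length of the ending string: it is 22 characters long and XPath positions are 1-based, so substring(., string-length() - 21) returns the last 22 characters of the onclick value. The Python sketch below applies the same "ends with" test using only the standard library. It is an illustration rather than Nokogiri code, and the function name find_bill_row is made up.

```python
import xml.etree.ElementTree as ET

# The table from the question, wrapped in a single <table> element.
TABLE = """<table>
<tr><td><b>Amount</b></td><td><b>Due Date</b></td><td><b>Link</b></td></tr>
<tr><td>02/13/2012</td><td>$81.66</td>
<td><a onclick="javascript:window.open('/cso/displaypdfbill?selectedBillkey=449409587','_blank');" href="javascript: void(0);">View Bill</a></td></tr>
<tr><td>01/13/2012</td><td>$181.66</td>
<td><a onclick="javascript:window.open('/cso/displaypdfbill?selectedBillkey=543409587','_blank');" href="javascript: void(0);">View Bill</a></td></tr>
</table>"""


def find_bill_row(table_xml, bill_key):
    """Return the tr whose link's onclick attribute ends with the given bill key."""
    ending = "=%s','_blank');" % bill_key
    root = ET.fromstring(table_xml)
    for tr in root.findall("tr"):
        for a in tr.findall("td/a"):
            if a.get("onclick", "").endswith(ending):
                return tr
    return None


ending_length = len("=449409587','_blank');")
row = find_bill_row(TABLE, "449409587")
cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
print(ending_length, cells)
```

Including the "=" and the trailing "','_blank');" in the comparison keeps a shorter key from matching the tail of a longer one.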

XPath to find attributes where the name starts with a given value

With this xml:
<div val1="q">a</div>
<div val2="w">b</div>
<div val3="e">c</div>
<div some="r">d</div>
<div thing="t">f</div>
<div name="y">g</div>
we want to find only
<div val1="q">a</div>
<div val2="w">b</div>
<div val3="e">c</div>
which are those nodes having an attribute where the attribute name begins with val
You can try this :
//div/@*[starts-with(name(.), 'val')]
if you know that you are looking for the first attribute of the div element.
Edit:
Sorry didn't realize you wanted to select the elements themselves. You could use parent::div or what you did, but the proper way of doing this would be to select directly the div themselves :
//div[@*[starts-with(name(), 'val')]]
have you tried with .../@val* ?
which are those nodes having an attribute where the attribute name
begins with val
Use:
//div[@*[starts-with(name(), 'val')]]
This selects any div element in the document, that has at least one attribute, whose name starts with the string "val".
XSLT - based verification:
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="//div[@*[starts-with(name(), 'val')]]"/>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document (produced from the provided XML fragment):
<html>
<div val1="q">a</div>
<div val2="w">b</div>
<div val3="e">c</div>
<div some="r">d</div>
<div thing="t">f</div>
<div name="y">g</div>
</html>
selects and outputs the wanted nodes:
<div val1="q">a</div>
<div val2="w">b</div>
<div val3="e">c</div>
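The same test can be written outside XPath as well. The sketch below is a plain Python equivalent using only the standard library: ElementTree's XPath subset has no name() or starts-with(), so the attribute names are checked in Python. The helper name divs_with_attr_prefix is made up for this example.

```python
import xml.etree.ElementTree as ET

# The XML document used in the verification above.
DOC = """<html>
<div val1="q">a</div>
<div val2="w">b</div>
<div val3="e">c</div>
<div some="r">d</div>
<div thing="t">f</div>
<div name="y">g</div>
</html>"""


def divs_with_attr_prefix(root, prefix):
    """Return div elements having at least one attribute whose name starts with prefix."""
    return [
        div
        for div in root.iter("div")
        if any(name.startswith(prefix) for name in div.attrib)
    ]


root = ET.fromstring(DOC)
matches = divs_with_attr_prefix(root, "val")
print([div.text for div in matches])
```

Note that the check is on the attribute name, not its value, so the div with name="y" is not selected.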

Resources