XPath expression from a website to extract price information

I am trying to extract the price from this type of website using XPath, but I don't have any experience. Using an add-on I got this XPath expression, which is not working: //div[@class='teaser--product-prices public-product']/div[@class='ui-table' and 1]/div[@class='ui-table-cell' and 1]/div[@class='teaser--product-final-price large sell-price' and 1]
The website also uses a dot (.) as the thousands separator and doesn't use a dot for decimals, so I would really appreciate a way to remove the first dot and add one for decimals within the XPath expression.
The expression will be used with the Content Egg WordPress plugin to feed price information to a website.
The website is https://www.public-cyprus.com.cy/product/tileoraseis/tileoraseis/tileorasi-samsung-65-smart-8k-qled-qe65q950t/prod10634476pp/

You could use the following XPath expression :
concat(translate(//div[@class="product-main-container product-page"]/@data-price,".",""),".",translate(//div[@class="product-main-container product-page"]/@data-decimals,"€",""))
We use the concat function to join the results of two XPath expressions: the first removes the dot from the data-price attribute with the translate function, the second removes the euro symbol from data-decimals, again with translate. We also add the decimal separator during the concat step.
Output :
6599.00
Side note: this might not work, since I don't know whether your plugin supports these XPath functions.
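If you want to sanity-check the expression before wiring it into the plugin, here is a minimal sketch using the browser's built-in XPath API from the developer console on the product page; it simply evaluates the expression above, and the expected output is the value quoted in this answer:
// Minimal sketch: evaluate the expression in the browser console on the product page.
// STRING_TYPE makes document.evaluate() return the string value of the whole expression.
var expr = 'concat(' +
  'translate(//div[@class="product-main-container product-page"]/@data-price, ".", ""),' +
  '".",' +
  'translate(//div[@class="product-main-container product-page"]/@data-decimals, "€", ""))';
var price = document.evaluate(expr, document, null, XPathResult.STRING_TYPE, null).stringValue;
console.log(price); // expected: "6599.00"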

How to combine several XPaths into one in order to extract text from a web page?

I have a few XPaths as below:
//*[#id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p
//*[#id="729c0860-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[#id="2555ab30-bb84-11ea-9e8b-277e7f6208b2"]/div/div/div[1]/div/p
//*[#id="7e100250-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[#id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
All of the above are used to extract text from a single web page, since the text is located in different viewports, but I wish to find a single XPath that extracts the text for all of them. Is it possible to use 'and' with multiple IDs to extract all of it through one XPath?
Any other suggestions would be appreciated.
You can use the or operator for the last four,
and the union operator | to add the first one.
So to select all five expressions in one, use the following expression:
//*[#id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p | //*[#id="729c0860-a71d-11ea-b994-53a3e91a35c2" or #id="2555ab30-bb84-11ea-9e8b-277e7f6208b2" or #id="7e100250-a71d-11ea-b994-53a3e91a35c2" or #id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
A shorter and more generic solution could be :
(//div/div/div[1]/div/p|//div/p)[parent::*[string-length(@id)=36 and substring(@id,24,1)="-"]]
The first part, in parentheses, specifies the end of the path. Since the @id attributes all have the same length (36 characters), we check that length inside the predicate. We also verify the presence of a - at a specific position with substring.
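As a rough illustration (not part of the original answer), the combined union expression can be evaluated in a browser console in a single call, returning every matched paragraph at once; the IDs are the ones from the question:
// Minimal sketch: evaluate the single union expression and collect the text of
// every matched <p> element.
var expr = '//*[@id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p' +
  ' | //*[@id="729c0860-a71d-11ea-b994-53a3e91a35c2"' +
  ' or @id="2555ab30-bb84-11ea-9e8b-277e7f6208b2"' +
  ' or @id="7e100250-a71d-11ea-b994-53a3e91a35c2"' +
  ' or @id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p';
var result = document.evaluate(expr, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
var texts = [];
for (var i = 0; i < result.snapshotLength; i++) {
  texts.push(result.snapshotItem(i).textContent.trim());
}
console.log(texts); // one entry per matched paragraph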

How to get multiple occurrences of an element with XPath when using normalize-space and substring-before

I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matched element on the fly with:
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen')), to combine both actions.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching part.
I don't understand how adding normalize-space and/or substring-before influences the "main" expression in such a way that it stops recognizing multiple occurrences of the targeted element and only gets the first. Without the addition it matches everything as it should.
How is it possible to adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() expect an argument with a cardinality of 1, meaning they can only accept one occurrence of the element you are trying to normalize or take a substring of. Each of your expressions results in a sequence of three nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea.)
In light of that, try:
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the apply operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
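For readers stuck on XPath 1.0 (for example, a browser's document.evaluate), a minimal sketch of the usual workaround is to select the nodes with XPath and do the string handling on the host-language side; this is only an illustration under that assumption, not part of the original answer:
// Minimal sketch (XPath 1.0 in a browser): select all three divs, then apply
// the normalize-space / substring-before logic per node in JavaScript.
var result = document.evaluate("//div[@class='col-md-9 col-xs-12']", document, null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
var values = [];
for (var i = 0; i < result.snapshotLength; i++) {
  var raw = result.snapshotItem(i).textContent;
  var normalized = raw.replace(/\s+/g, ' ').trim();  // normalize-space()
  values.push(normalized.split('Bewertung')[0]);     // text before 'Bewertung' (assumes the word is present)
}
console.log(values); // one string per occurrence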

XPath expression

This question regards XPath expressions.
I want to find the average of the length of all URLs in a Web page, that point to a .pdf file.
So far I have constructed the following expression, but it does not work:
sum(string-length(string(//a/@href[contains(., ".pdf")]))) div
count(//a/@href[contains(., ".pdf")])
Any help will be appreciated!
You will need XPath 2.0.
For calculating the sum of the string lengths, you would either
need a concatenated string of all @hrefs to pass to string-length($string as xs:string) (which only allows a single string as its parameter), but concat(...) only takes an arbitrary number of atomic strings, not a sequence of them; or
apply string-length(...) to every @href as @Navin Rawat proposed - but using arbitrary functions in axis steps is a new feature of XPath 2.0.
If using XPath 2.0, there are functions avg(...) and ends-with(...) which help you in stripping down the expression to
avg(//a/@href[ends-with(., '.pdf')]/string-length())
If you have to stick with XPath 1.0, all you can do is use my expression below to fetch the URLs and calculate the average outside XPath.
Anyway, the subexpression you proposed will fail at URLs like http://example.net/myfile.pdf.txt. Only compare the end of the URL:
//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href
And you missed a path step for the attribute, so you've been trying to average the string length of the link names right now.
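For the XPath 1.0 route mentioned above (fetch the URLs, then compute the average outside XPath), a minimal sketch in a browser console might look like the following; it reuses the expression from this answer and is only meant as an illustration:
// Minimal sketch (XPath 1.0 in a browser): fetch the .pdf hrefs, then compute
// the average string length in JavaScript.
var result = document.evaluate(
  "//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href",
  document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
var total = 0;
for (var i = 0; i < result.snapshotLength; i++) {
  total += result.snapshotItem(i).value.length; // .value of an attribute node
}
var average = result.snapshotLength ? total / result.snapshotLength : 0;
console.log(average);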
Please put something like:
sum(//a/@href[contains(.,'.pdf')]/string-length()) div count(//a/@href[contains(.,'.pdf')])

How to discover a date or a number near a word - only with regex

I am still learning the intricacies of regex, and am wondering whether it is possible, with a single regex, to find a number that is at a given distance from a word.
Consider the following text
DateClient
15-01-20130060 15-01-20140010 15-01-20150020
I want that my regex matches just 15-01-2013.
I know I can get the full DateClient 15-01-2013 with DateClient\W+\d{2}-\d{2}-\d{4} and then apply a second regex afterwards, but I'm trying to build a configurable, agnostic system that gives power to the user, so I would like a single regex that just matches 15-01-2013.
Is this even feasible?
Any suggestions?
You can use a capturing group:
DateClient\W+(\d{2}-\d{2}-\d{4})
Example in JavaScript (you didn't specify a language):
var str = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020";
var date = str.match(/DateClient\W+(\d{2}-\d{2}-\d{4})/)[1];
EDIT (following the addition of the Ruby tag):
In Ruby you can use
(?<=DateClient\W)(\d{2}-\d{2}-\d{4})
Check out lookbehind for matching only the date. However, lookbehind support in your environment may be limited.
Or you could just use a capturing group, which you will be able to extract from the match result.
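For completeness, a quick sketch of the lookbehind variant in JavaScript (assuming an engine with lookbehind support, i.e. ES2018 or later); the match itself is then just the date:
var str = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020";
// Lookbehind keeps "DateClient" out of the match, so no capturing group is needed.
var date = str.match(/(?<=DateClient\W)\d{2}-\d{2}-\d{4}/)[0];
console.log(date); // "15-01-2013"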

XPath on Wikipedia Summary

I'm currently trying to extract the blurb, or summary from any given Wikipedia page, using XPath. Now, there are many places online where this has already been done: http://jimblackler.net/blog/?p=13, How to use XPath or xgrep to find information in Wikipedia?.
But, when I try to use similar XPath expressions, on a variety of pages, the returned results are strange. For the sake of this question, let's assume I'm trying to retrieve the very first paragraph in the printable Wikipedia page on Boston: http://en.wikipedia.org/w/index.php?title=Boston&printable=yes.
When I try to use this expression /html/body/div[@id='content']/div[@id='bodyContent']//p, only the last four words of the paragraph, "in the United States.", are returned.
Actually, the expression used above could be simplified to //div/p, but the results are the same.
Strangely, the pages I linked to previously seem to use similar methods and return great results; originally, I imagined this was due to Wikipedia changing the formatting of their pages in recent years, but honestly, I can't seem to find what's wrong with either expression.
Does anyone have any idea about this?
When I try to use this expression /html/body/div[@id='content']/div[@id='bodyContent']//p, only the last four words of the paragraph, "in the United States.", are returned.
There are a few problems here:
1. The XML document is in a default namespace. Writing XPath expressions to select nodes in a document that is in a default namespace is the most frequently asked question about XPath -- search for "XPath and default namespace". In short, any unprefixed name will most probably cause nothing to be selected. One must register the default namespace and associate a specific prefix with it; then any element name in the XPath expression must be written with this prefix. So, the expression above becomes:
/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p
where the "x:" prefix is associated to the "http://www.w3.org/1999/xhtml" namespace.
2. Even the above expression doesn't select (only) the wanted node. In order to select only the first x:p from the above, the XPath expression must be specified as (note the brackets):
(/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1]
3. As you want the text of the paragraph, an easy way to do this is to use the standard XPath function string():
string((/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1])
When this XPath expression is evaluated, I get the text of the paragraph -- for example in the XPath Visualizer I wrote some years ago.
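As a rough illustration of points 1-3 (assuming the page is parsed as XHTML/XML, as in this answer), the prefixed expression can be evaluated in a browser with a namespace resolver that maps the x prefix to the XHTML namespace:
// Minimal sketch: a resolver mapping the "x" prefix to the XHTML namespace,
// then evaluating the string() expression from point 3.
var nsResolver = function (prefix) {
  return prefix === 'x' ? 'http://www.w3.org/1999/xhtml' : null;
};
var expr = "string((/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1])";
var firstParagraph = document.evaluate(expr, document, nsResolver,
  XPathResult.STRING_TYPE, null).stringValue;
console.log(firstParagraph); // text of the first paragraph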
