Extract XPath

I want to retrieve the XPath of an attribute (for example, the "brand" of a product on a retailer's website).
One way of doing it is to use Firefox add-ons like XPather or XPath Checker: open the website in Firefox and right-click the attribute I am interested in. This is OK, but I want to capture this information for many attributes, and right-clicking each and every attribute may be time-consuming. The other problem I have is that the attributes I am interested in may exist for one product while other attributes only appear on some other product, so I would have to go to that product and do it all manually again.
Is there an automated or programmatic way of retrieving the XPath of the desired attributes from a website, rather than having to do this manually?

Note that not all websites serve valid XML that you can run XPath on.
That said, you should check out HTML parsers that allow you to use XPath on HTML even when it is not valid XML.
Since you did not specify the technology you are working with, I'll suggest the .NET HTML Agility Pack; if you need others, search for questions dealing with this here on SO.
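For example, here is a minimal sketch in Python using lxml, whose lenient HTML parser accepts broken markup and still lets you run XPath over the result (the snippet HTML is invented for illustration; the HTML Agility Pack offers the same idea in .NET):

```python
# Minimal sketch: running XPath over non-well-formed HTML with lxml.
# The markup below is invented and deliberately left unclosed.
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="product">
    <span class="label">Brand</span><span>Acme</span>
""")  # the lenient parser repairs the missing closing tags

# XPath works even though the input was never valid XML
print(page.xpath('//div[@class="product"]/span[2]/text()'))  # ['Acme']
```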

The solution I use for this kind of thing is to write an XPath something like this:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking at the next sibling in the HTML. Without a specific URL to look at, I can't help any further.
This is a generalised version; you can make more specific versions by replacing the asterisks with tag types, and you can navigate differently by replacing the following-sibling axis with something else.
I use XPaths in import.io to make APIs for this kind of thing all the time. It's just a matter of finding an XPath that's generic enough to find the HTML no matter where it is on the page, but specific enough to get the right data.
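As a sketch of this label-lookup approach in code (Python with lxml here; the spec-table HTML is invented, and the real page structure will differ):

```python
# Sketch of the label/following-sibling technique with lxml (Python).
# The HTML is an invented stand-in for a product spec table.
from lxml import html

doc = html.fromstring("""
<table>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Color</td><td>Blue</td></tr>
  <tr><td>Size</td><td>M</td></tr>
</table>""")

for label in ("Brand", "Color", "Size", "Material"):
    # Find the element whose text equals the label, then take its next sibling.
    hit = doc.xpath('//*[text()="%s"]/following-sibling::*[1]/text()' % label)
    print(label, "->", hit[0] if hit else "(not on this product)")
```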

Related

Setting the correct xpath

I'm trying to set the right XPath for use with RSelenium, but I'm not very experienced in this area, so any help would be much appreciated.
Since I'm not allowed to post pictures yet, I have tried to add a link to a screenshot of the HTML:
The html
I need R to scrape the dates (28-10-2020 - 13-11-2020), but so far I have not been able to set the correct XPath when using html_nodes.
I'm trying to scrape from sites like this one: https://www.boligsiden.dk/adresse/topperne-9-3-33-2620-albertslund-01650532___9__3__33
I usually do this in Python rather than R.
As you can see in this image, when you right-click on the element concerned you get a drop-down menu with an option to copy an XPath to the element.
Other than that, the site layout and the XPath might change, so a full XPath is only a good option in the short run; I'd rather use driver.find_element_by_xpath('//button[contains(text(), "Login")]').click()
In your case that would be driver.find_element_by_xpath("//*[contains(@class, 'u-pb-4 u-block')]")
I hope this helps; it is mostly the same across different languages.
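Putting that together as a sketch in Python with Selenium (the class name comes from the page as it looked at the time and may have changed since; this uses the old find_element_by_xpath API from the answer above):

```python
# Sketch: scraping the date range with Selenium (Python).
# Assumes chromedriver is installed; the class name may have changed
# on the live site since this answer was written.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.boligsiden.dk/adresse/topperne-9-3-33-2620-albertslund-01650532___9__3__33")

dates = driver.find_element_by_xpath("//*[contains(@class, 'u-pb-4 u-block')]")
print(dates.text)  # expected to contain "28-10-2020 - 13-11-2020"
driver.quit()
```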

Unable to identify element in Blue Prism using XPath

I have spied an input text box using the Application Modeller of Blue Prism and was able to successfully highlight the text box using the below XPath:
/HTML/BODY(1)/DIV(4)/main(1)/DIV(1)/DIV(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(2)/IFRAME(1)/HTML/BODY(1)/DIV(2)/FORM(1)/DIV(3)/TABLE(2)/TBODY(1)/TR(1)/TD(1)/DIV(1)/DIV(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/SPAN(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/TABLE(1)/TBODY(1)/TR(1)/TD(1)/INPUT(1)
I wanted to use a more robust XPath and to achieve that I was trying to use the below XPath:
//*[#id="CT"]/div/div/div/div[1]/div[1]/table/tbody[1]/tr/td/input[1]
The above XPath identified the element correctly in Chrome, but I got the error message below when trying the same in Blue Prism:
Error - Highlighting results - Object reference not set to an instance of an object.
Let me know if I am doing anything incorrectly.
Sorry for replying to a pretty old one! The workaround we've devised for this scenario (where making the path dynamic would require too long a loop/search) is to use jQuery snippets. If the page uses jQuery, it is trivial to execute these queries very quickly through Blue Prism's ability to execute JavaScript functions.
And we put in an enhancement request, because it'd be supremely useful functionality.
Update: as a user points out below, the vanilla JS querySelector method is probably safer and more future-proof than jQuery, where it can be used.
Blue Prism does not fully support the XPath spec; alas the construct you're attempting to use here won't work.
Alternatively, you can set the Path attribute of an application modeler entry to be Dynamic, which allows you to insert dynamic parameters from the process/object level to pinpoint elements you'd like to interact with.
Unfortunately, Blue Prism doesn't actually use "real" XPaths, but only an extremely limited subset: absolute paths without wildcards. (Note: it is technically possible to match the XPath to a string with wildcards, but this seemingly causes BP to check every single element in the document and is so slow it is almost never the right solution.)
For cases where an element can't be robustly identified via the Blue Prism application modeler (maybe because it requires complex or dynamic selectors), my workaround is to inject a JS snippet. JS can select elements much more reliably, and it can then generate the Blue Prism path for that element.
Returning data from JS to Blue Prism is not trivial, but one of the nicer solutions is to have JS create a <script id="_output"> element, put JSON inside it, then have Blue Prism read the contents of this element.
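Blue Prism's own JavaScript-invocation step can't be shown here, but the same handoff pattern can be sketched with Selenium in Python (the selector and field names are illustrative):

```python
# Sketch of the JS handoff described above: select the element in JS,
# publish the result as JSON in a <script id="_output"> element, then
# read that element back from the automation tool.
import json
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

driver.execute_script("""
    // Robust selection happens in JS, where querySelector is available.
    var el = document.querySelector('h1');
    var out = document.createElement('script');
    out.id = '_output';
    out.type = 'application/json';
    out.textContent = JSON.stringify({found: !!el, text: el ? el.textContent : null});
    document.body.appendChild(out);
""")

payload = json.loads(
    driver.find_element_by_id("_output").get_attribute("textContent"))
print(payload)  # e.g. {'found': True, 'text': 'Example Domain'}
driver.quit()
```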

ImportXML and Google Spreadsheet issue

I'm scraping a few product descriptions from a website and bringing them into a Google spreadsheet using importXML.
It has gone fairly smoothly, but there is one major snag that I would love to correct, and I need your help!
The website in question prohibits those posting products from including contact information (usually email addresses) in the product description. Sometimes people ignore the rule and include it anyway. When this occurs, the website automatically hides the contact information in the product description, replacing it with [obscured], as in "...please feel free to contact me at [obscured]" or something close to that. The [obscured] appears in a different colour and is obviously treated differently by the website.
When these product descriptions are imported into my spreadsheet, the [obscured] causes the scraping to be 'bumped': the description text stops prior to [obscured], the word [obscured] appears in an adjacent cell all by itself, and the description text that follows [obscured] then continues in a third cell.
This separation ruins the alignment and logic in my spreadsheet, as product descriptions containing an [obscured] become broken up and misaligned from those that do not.
I would love to have my importXML or XPath accommodate this and essentially ignore the [obscured]. I don't mind it being included in the scraped description, but I want to stop the breaking-up into three separate adjacent cells.
The [obscured] is part of a 'span' that occasionally lies within the description class 'desc' I am calling.
Is there a way to instruct importXML to import the 'desc' class but ignore/omit the span that might sometimes appear within it?
I've included the source code (inspect element in Safari) below:
<div class="desc descFull collapsed">
<span class="obscureText">[obscured]</span>
As mentioned, this span only occurs in some of the product descriptions, not all of them.
Does anyone know what kind of syntax I would use in the importXML call to fetch 'desc' but ignore the 'span', or otherwise prevent the splitting into three cells when the [obscured] is encountered?
My current call is
=ImportXML(A1,"//div[#class='desc']")
which works fine, unless the [obscured] span is encountered.
Thank you for any help you can give!
Unless Google Drive is breaking the definition of XPath, XPath can't be used to query CSS classes the way CSS selectors can.
The XPath //div[@class='desc'] will only match a div element whose class attribute is literally "desc". It won't match "desc descFull collapsed", because the string is different.
As for excluding the text of the obscured node, that would require finding the text nodes and excluding one, which would return a node-set, not a string, and you wouldn't be able to concatenate these back together using XPath 1.0. If Google Drive uses XPath 2.0 it might be possible, using the techniques in that linked question.
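That said, the usual XPath 1.0 workaround for matching one class token among several is to pad and normalize the class attribute, so a formula along these lines (a sketch, untested against your sheet) should match "desc descFull collapsed" as well as a bare "desc":
=ImportXML(A1,"//div[contains(concat(' ', normalize-space(@class), ' '), ' desc ')]")
The padding with spaces ensures that the token 'desc' matches exactly, without also matching classes like 'descFull'.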

How to retrieve plain text from a formatted website to use in UIWebView

Not sure if what I want to do is possible, but what I am hoping to do is gather certain pieces of text from a website, remove the header, footer, background and all formatting, and place it into my application in a scroll view or something similar.
I'll give you an example: imagine I was making Wikipedia's iPhone app, and I wanted to download the wiki article on dogs without the header, sidebars, etc., just the text. How would I go about doing this?
I understand that I have not provided any example code or anything I've tried, but that's because in this case I'm lost! That doesn't mean I want full chunks of code either. Any help will do. If this doesn't work, I will just have to make 'mobile optimised' versions of the web pages I want to include in my app.
Thanks
(Edit: the term I was trying to use was 'strip the web page of its HTML coding')
You may be going about this the wrong way, or perhaps even asking the wrong question.
Does the target website have an API or datafeed of some kind?
Can you get the information you need in JSON or XML format directly from the site?
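If such an API or feed does exist, consuming it is far simpler than parsing HTML. Here's a sketch in Python with a purely hypothetical endpoint and field names:

```python
# Sketch: reading a JSON API instead of scraping HTML (Python).
# The endpoint and field names below are hypothetical.
import json
import urllib.request

with urllib.request.urlopen("https://example.com/api/articles/dogs") as resp:
    article = json.load(resp)

print(article["title"])
print(article["body"])  # plain text: no header, sidebar or markup to strip
```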
I think you've misunderstood the technology. HTML is merely the framework on which the formatting and data are hung.
Parsing the HTML page seems like an awfully big headache, and I doubt you'll ever get it to work reliably, because almost all sites these days are partially or wholly generated on the server side; the page you see is only the result.
Some sites hold the information in memory and others fetch it dynamically, through Ajax for example, which means that simply parsing the static HTML will yield zero data.
Another issue you should be aware of is that simply copying the data from generated websites may open you up to copyright issues.
You have to parse the HTML code, search for the part that you want, and throw away the parts that you do not need. This is more or less brute-forcing: you have to write the parser by hand, and if the code of the website changes you are screwed. But maybe there is an Atom or RSS feed you can parse instead. That would be much easier, and you would not depend on the website layout, because the RSS/Atom feed is just the data. For parsing RSS you could try NSXMLParser.
Then you have to make a valid HTML page out of the data and present it in the UIWebView.
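The feed idea, sketched in Python for illustration (the question's iOS context would use NSXMLParser instead, and the feed URL here is a placeholder):

```python
# Sketch: parsing an RSS feed instead of scraping page layout (Python).
# The feed URL is a placeholder; real feeds expose the same item structure.
import urllib.request
import xml.etree.ElementTree as ET

with urllib.request.urlopen("https://example.com/feed.rss") as resp:
    tree = ET.parse(resp)

for item in tree.iter("item"):
    print(item.findtext("title"))
    print(item.findtext("description"))  # the data, free of site layout
```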

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use HTMLAgilityPack to transform it to XML. It will speed everything up and will enable you, using a simple XPath, to go directly to the BODY.
After that, you have to run over all the divs (you can get all the DIV elements in a list from the Agility Pack) and get whatever you want.
There's a simple technique to do this, based on analysing how "noisy" the HTML is, i.e., what the ratio of markup to displayed text is across an HTML page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique, giving some Python code to illustrate it.
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you wanted to use it, it would make sense to clean the HTML first, e.g. using BeautifulSoup.
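A minimal sketch of that noise-ratio heuristic in Python, using BeautifulSoup as suggested above (the scoring is a deliberate simplification of the linked article's approach, and the HTML is invented):

```python
# Sketch: rank block elements by the ratio of plain text to markup,
# a simplified form of the "noisy HTML" heuristic described above.
# Requires beautifulsoup4; the HTML is a stand-in for a real page.
from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <div id="menu"><a href="/">Home</a> <a href="/about">About</a></div>
  <div id="content"><p>A long continuous block of article text that the
  heuristic should prefer over navigation links and boilerplate.</p></div>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

def text_density(tag):
    # Displayed text length relative to total markup length.
    text_len = len(tag.get_text(strip=True))
    markup_len = len(str(tag))
    return text_len / markup_len if markup_len else 0.0

for div in sorted(soup.find_all("div"), key=text_density, reverse=True):
    print(round(text_density(div), 2), div.get("id"))
```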
I would recommend Vit Baisa's thesis on Web Content Cleaning, I think he has some code too, but I can't find a link for it. There is also a discussion of the very same problem on the natural language processing LingPipe blog.
