Tool for extracting an XPath query from a specified/selected node - xpath

Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web scraping with Google Spreadsheets, using the importXML function to update some values automatically. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[@class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I noticed that the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes a long time to load, so it's pretty inconvenient. Also, I don't even remember how I figured out the //td[@class='xl7825385'] query. That is why I'm wondering if there is a more practical method of pointing to page elements.

Some clues :
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, the built-in dev tools of your browser can help you generate an absolute XPath. Select an element in the inspector, right-click on it, then choose Copy > Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like Chropath can generate absolute or relative XPath for you.
https://autonomiq.io/chropath/
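To see roughly what such a tool does under the hood, here is a minimal Python sketch. It assumes a well-formed XML/XHTML document and uses only the standard library's xml.etree.ElementTree; the helper name absolute_xpath and the sample markup are made up for this illustration, and real-world pages usually need an HTML-tolerant parser instead.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root, target):
    """Build a positional, absolute XPath for `target` inside `root`.

    Hypothetical helper for illustration; not part of any library.
    """
    # ElementTree nodes do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count siblings with the same tag.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Made-up sample document standing in for a scraped page.
doc = ET.fromstring(
    "<html><body><table><tr><td>a</td><td class='price'>42</td></tr></table>"
    "<span>x</span><span>y</span></body></html>"
)
target = doc.findall(".//span")[1]  # the "selected" node
path = absolute_xpath(doc, target)
print(path)
```

Like the expressions copied from dev tools, the result relies on element positions, so it is likely to break when the page layout changes; an attribute-based query such as //td[@class='...'] tends to survive longer.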

Related

Octoparse and relative Xpath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows for scraping locally. Octoparse automatically detects the Title, Title_URL, and Content webpage data and correctly sets up the Pagination, Scroll Page, and Loop Item workflow to extract those fields. However, it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, as these appear to be embedded from an iframe. I am able to add Date and Podcast time duration as custom fields using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this results in the same value being copied for each record. When I attempt to fix this by using the Relative XPath setting in Octoparse to loop over each item with //span[@class="cp-episode-date"], so as to gather each unique value, it does not get any values, even though this relative XPath finds all items when I search with Chrome DevTools. I saw what might be another helpful post on Stack Exchange about this, but I was not able to make sense of it.
The expression //span[@class="cp-episode-date"] works as a relative XPath in that it finds multiple Date items in Chrome DevTools, but it is not complete, and I am not sure how to implement the iframe traversal that Octoparse's Relative XPath setting needs for the custom Date and Podcast time duration fields I added. I even tried installing the SelectorsHub Chrome extension, but it didn't bring up the nested selector view for querying the XPath the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I am already showing below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When Absolute Path is used - //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When Relative Path is used - //span[@class="cp-episode-date"]
There are plenty of iframes inside the webpage. I don't know if Octoparse could handle this. Choose another starting point.
For example, use Apple Podcast :
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates could be recovered with the following XPath :
//div[@class="l-row"]//time[@class]/@aria-label
Other possibility, scrape the following page :
https://feeds.captivate.fm/the-website-coach/
Dates could be recovered with the following XPath :
//h4/text()
Even easier, get directly the data from this URL (.json file) :
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
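As a rough illustration of the JSON route, the sketch below pulls the title, date, and duration out of a lookup response in Python. The field names (results, wrapperType, trackName, releaseDate, trackTimeMillis) are assumptions about the response format and should be checked against the real output; the sample payload is invented so the snippet runs without a network call.

```python
import json

# Invented sample shaped like what the lookup URL is assumed to return.
sample = """
{"resultCount": 2, "results": [
  {"wrapperType": "track", "kind": "podcast", "trackName": "Example Show"},
  {"wrapperType": "podcastEpisode", "kind": "podcast-episode",
   "trackName": "Episode 1", "releaseDate": "2021-09-27T06:00:00Z",
   "trackTimeMillis": 1500000}
]}
"""


def episode_rows(payload):
    """Return (title, date, minutes) for each episode entry in the payload."""
    rows = []
    for item in json.loads(payload).get("results", []):
        # Assumption: the first entry describes the show itself, not an episode.
        if item.get("wrapperType") != "podcastEpisode":
            continue
        millis = item.get("trackTimeMillis")
        minutes = None if millis is None else round(millis / 60000)
        # Keep only the YYYY-MM-DD part of the timestamp.
        rows.append((item.get("trackName"), item.get("releaseDate", "")[:10], minutes))
    return rows


rows = episode_rows(sample)
print(rows)
```

With the real response, the payload would come from the URL above (for example via urllib.request) instead of the inline sample.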

Getting an XPath from an XML document

I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the information for "WillPay" information from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value of "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values onto the google spreadsheet document.
What I've tried is working out the full XPath, and my best attempt was
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this xpath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay
An alternative :
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
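To try out predicates like these offline, here is a small Python sketch using the standard library's xml.etree.ElementTree. The sample document is invented and only guesses at the feed's nesting from the element names used in the expressions above. ElementTree supports only a subset of XPath (no ancestor:: axis and no trailing /@attribute step), so the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real feed's structure is an assumption.
xml_text = """
<AOSBS_XML><Meetings>
  <MeetingInfo Venue="S1"><Races>
    <RaceInfo RaceNo="1"><Pools>
      <PoolInfo Pool="WIN"><OddsSet>
        <OddsInfo Number="1" WillPay="3350"/>
        <OddsInfo Number="2" WillPay="1200"/>
      </OddsSet></PoolInfo>
      <PoolInfo Pool="PLA"><OddsSet>
        <OddsInfo Number="1" WillPay="900"/>
      </OddsSet></PoolInfo>
    </Pools></RaceInfo>
  </Races></MeetingInfo>
</Meetings></AOSBS_XML>
"""
root = ET.fromstring(xml_text)

# Filter on the attributes at each level, then collect Number -> WillPay.
query = (
    ".//MeetingInfo[@Venue='S1']//RaceInfo[@RaceNo='1']"
    "//PoolInfo[@Pool='WIN']//OddsInfo"
)
will_pay = {node.get("Number"): node.get("WillPay") for node in root.findall(query)}
print(will_pay)

# A strict child-by-child path that skips a level matches nothing.
skipped = root.findall("./Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo")
print(len(skipped))
```

If the real document nests RaceInfo between Races and Pools, as the first expression above suggests, that would explain why a strict path without that step returns nothing.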

Confused about XPath Syntax

Problem Summary:
Hi, I'm trying to learn to use the Scrapy Framework for python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I was going to use a different site for practice rather than just copy them on Alibaba. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down that tutorial page on the scrapehero site, at the "Construct Xpath selectors for the product list" section). The problem is that I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to figure out the right syntax, but I haven't been able to get it.
Background info:
So what I want is- from https://www.mlb.com/scores, I want an xpath() command which will return an array with all the games displayed.
Following along with the tutorial, my understanding is that I'd want to inspect the elements on the webpage, determine their class/id, and specify that in the xpath command.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[@class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[@class='g5-component]")
response.xpath("//li[@class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[@class='mlb-scores__list-item']")
response.xpath("//div[@!data-game-pk-id > 0]")'
response.xpath("//div[contains(@class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to specify I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page, you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try to use Scrapy-Splash to get the target content from your start_urls, or you can find the direct HTTP request used to get the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
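Once you have the JSON, no XPath is needed at all. The following Python sketch shows the general shape of the parsing step. The key names (dates, games, gamePk, teams, away, home, team, name, score) are assumptions about the response and should be verified against what the Network tab shows; the sample payload is invented so the snippet runs offline. In a Scrapy callback, the payload would be response.text.

```python
import json

# Invented sample shaped like the assumed schedule response.
sample = """
{"dates": [{"date": "2019-06-26", "games": [
  {"gamePk": 1, "teams": {
     "away": {"team": {"name": "Away Team A"}, "score": 3},
     "home": {"team": {"name": "Home Team B"}, "score": 5}}},
  {"gamePk": 2, "teams": {
     "away": {"team": {"name": "Away Team C"}},
     "home": {"team": {"name": "Home Team D"}}}}
]}]}
"""


def list_games(payload):
    """Flatten the assumed dates -> games nesting into one list of dicts."""
    games = []
    for day in json.loads(payload).get("dates", []):
        for game in day.get("games", []):
            teams = game.get("teams", {})
            away = teams.get("away", {})
            home = teams.get("home", {})
            games.append({
                "date": day.get("date"),
                "game_pk": game.get("gamePk"),
                "away": away.get("team", {}).get("name"),
                "home": home.get("team", {}).get("name"),
                # Scores may be missing for games that have not started.
                "score": (away.get("score"), home.get("score")),
            })
    return games


games = list_games(sample)
for game in games:
    print(game)
```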

Identifying objects in Tosca with Xpath

I have recently been brushing up my skills in TOSCA; I worked with it 2 years ago and then switched to Selenium. I noticed that the new TOSCA allows identification using XPath, which I am now really familiar with. However, I cannot make it work in TOSCA, even though I am sure the object identification itself works, because I am testing my XPath in Google Chrome developer tools.
Something as simple as (//*[text()='Forgot Password?'])[1] does not seem to be working. Could I be missing something?
This is the webpage I am using as reference for this example:
https://www.freecrm.com/index.html
XPath certainly can be used to identify elements of an HTML web UI in Tosca.
Since the question was originally posted, the "Forgot Password?" link at https://www.freecrm.com/index.html appears to have changed so that its text is now "Forgot your password?" and is actually located at https://ui.freecrm.com/.
To account for that change, this answer uses "(//*[text()='Forgot your password?'])[1]" instead of the expression provided in the original post.
With the text modification, the expression works to identify the element in XScan after wrapping it in double quotes:
"(//*[text()='Forgot your password?'])[1]"
Some things to keep in mind when using XPath in Tosca:
It seems that XPath expressions need to be wrapped in double quotes (") so that XScan knows when to start evaluating XPath instead of using its normal rules. Looking closely at the expression that is pregenerated when XScan starts, we see that it is wrapped in double quotes:
"id('ui')/div[1]/div[1]/div[1]/a[1]"
A valid XPath expression doesn't necessarily guarantee uniqueness, so it is helpful to pay attention to any feedback messages at the bottom of XScan. There is a significant difference between "The selected element was not found" and "The selected element is not unique". The former simply indicates XScan can't find a match, the latter indicates that XScan matches successfully, but cannot uniquely identify the element.
My experience has been that it helps to explicitly identify the element type to reduce the possibility of ambiguity. If the idea is to target the anchor element so that tests can click a link, then reduce the scope from any element, i.e. "(//*[text()='Forgot your password?'])[1]", to only anchor elements with that text: "//a[text()='Forgot your password?']".
In general, Tricentis (or at least the trainers with whom I have spoken) recommends using methods other than XPath to identify a target if they are available. That said, in my experience I've had better luck with XPath than with "Identify by Anchor".
An XPath expression is visible and editable in the XModuleAttribute properties without having to rescan. Personally, I find it easier to work with than the XML value of the RelativeId property that is generated when using Identify by Anchor.
With Anchor, I've had issues where XModuleAttributes scanned in one browser can no longer be found when switching to another browser, specifically from IE to Chrome. With XPath, I've not had these issues.
While XPath works well to identify the properties of one element with attributes of another because it can identify the relationship between them (very common with controls in Angular applications), the same can often be accomplished by adapting the engine layer using the TBox API (i.e. building a custom control). This requires some initial work up front from developer resources, but it can significantly improve how tests steer these controls in addition to reducing the need for Automation Specialists to have to rely on XPath.
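The difference between "not found" and "not unique" discussed above can be reproduced outside Tosca. The Python sketch below is only an approximation: it uses the standard library's xml.etree.ElementTree, an invented page fragment, and a hypothetical helper match_status that mimics what //tag[text()='...'] would match. It says nothing about how XScan evaluates expressions internally.

```python
import xml.etree.ElementTree as ET

# Invented fragment in which two different elements carry the same text.
page = ET.fromstring(
    "<div>"
    "<span>Forgot your password?</span>"
    "<a href='/reset'>Forgot your password?</a>"
    "<a href='/login'>Login</a>"
    "</div>"
)


def match_status(root, tag, text):
    """Classify the matches for //tag[text()='text']; tag='*' means any element."""
    hits = [
        elem for elem in root.iter()
        if tag in ("*", elem.tag) and (elem.text or "") == text
    ]
    if not hits:
        return "not found"
    return "unique" if len(hits) == 1 else "not unique"


# Any element with that text: matches both the span and the link.
print(match_status(page, "*", "Forgot your password?"))
# Narrowed to anchors: exactly one match.
print(match_status(page, "a", "Forgot your password?"))
# Outdated link text: no match at all.
print(match_status(page, "a", "Forgot Password?"))
```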
What I know is that you can identify elements with XPath when working with XML messages in Tosca API testing. Your use case seems to be UI testing, but I am not sure about that.
Did you try to use XScan to scan the page? Usually Tosca automatically calculates an XPath expression for you that you can use immediately.
Please see the manual for details.
If it still does not work please try to be more specific? What isn't working? Error message? Unexpected behavior? ...
Tosca provides its own set of attributes for locating any type of element. You can directly select as many attributes as you need to make your element unique, along with the index of that element. Just make sure that you are not using any dynamic values in the 'id' or 'class-name' of that element, and that the index is not too large (for example, 20 out of 100 is harder to maintain than 5 out of 10), which will help if you need to update it in the future.
Also make use of parent elements that can be located uniquely, and then locate your expected element relative to them.
Tosca provides various ways to locate an element, just like Selenium, and in addition it provides other properties. Under transition properties you will find the XPath, and it will be an absolute XPath; since you know Selenium, you know the difference between absolute and relative XPath. If your relative XPath is not working, I would suggest you go with:
1. Identify by ID or name
2. Identify by anchor
Try "Load all properties" at the bottom right. For me, though, it showed up without clicking on it.

Verify sorting in Selenium

Has anyone tested sorting with Selenium? I'd like to verify that sorting a table in different ways works (a-z, z-a, state, date, etc.). Any help would be very much appreciated.
/Göran
Before checking it with Selenium, you have to do one small thing: store the table values (as they should appear after sorting) in a string or array.
Now perform the sorting using selenium and capture the new list as
string new_list= selenium.gettable("xpath");
Now compare both the values and check whether they are same or not.
I have shared a strategy to test sorting feature of an application on my blog. You can use this to automate test cases that verify the sorting feature of an application. You could use it on place like the search result page, item listing and report module of the application. The strategy explained does not require creation of test data and is fully scalable.
You can get value of fields like this:
//div[@id='sortResult']/div[1]/div (this'd be row 1 of the search result)
//div[@id='sortResult']/div[2]/div ( row 2)
(I'm making some assumptions about the HTML structure here, but you get my drift...)
These can be quite fragile assertions; I'd recommend you anchor these XPath references to an outer container element (not the root of your document, as lots of "automatic" tools do).
When you click sort, the value changes. You'll have to find out what the values are supposed to be.
Also watch out for browser compatibility with such xpaths. They're not always ;)
The way I approached this was to define the expected sorted results as an array and then iterate over the results returned from the sorted page to make sure they met my expectations.
It's a little slow, but it does work. (We actually managed to find a few low-level sorting defects on multiple pages this way..)
You could use the WebDriver API from Selenium 2.0 (currently in alpha) to return an array of elements with the findElements command before and after the sort. This becomes a bit more difficult however if what you're sorting is paginated.
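The comparison step that several of the answers above describe can be sketched in plain Python, independent of how the cell texts are collected. The helper name is_sorted and the sample values are made up for illustration; in a real test, the list would hold the text of the elements returned by the driver after clicking the sort control.

```python
def is_sorted(values, reverse=False, key=str.casefold):
    """Return True if `values` already appear in the expected sort order.

    `key` defaults to a case-insensitive comparison; pass another function
    (for example one that parses dates or numbers) for other column types.
    """
    expected = sorted(values, key=key, reverse=reverse)
    return list(values) == expected


# Made-up column contents as they might be read from the page.
ascending = ["Alabama", "alaska", "Texas"]
descending = ["Texas", "alaska", "Alabama"]

print(is_sorted(ascending))                 # a-z check
print(is_sorted(descending, reverse=True))  # z-a check
print(is_sorted(["2019-06-26", "2019-06-01"], key=str))  # ISO dates, out of order
```

Sorting a copy of what the page shows, rather than hard-coding the expected rows, avoids maintaining test data, but it only proves the order is consistent with the chosen key, so the key has to mirror the application's sort rule.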

Resources