Can somebody help me write this expression? It's driving me crazy!
Take this site for example http://www.jigsaw-online.com/
I'm trying to build up an expression where I can get all links under a category of my choice.
E.g. I want all four /a under New In
I can get the New In link via
//header//li[@class='nav-level-1-list']/a[contains(text(),'New In')]
I've then tried going up a level to then get all the links via:
//header//li[@class='nav-level-1-list']//li/a
but that doesn't work because it's still trying to find an anchor that contains 'New In'
How can I combine these two expressions together so I can get all the links under the category?
//header//li[@class='nav-level-1-list']/ul[preceding-sibling::a[contains(text(),'New In')]]//li/a (untested, as I have nothing handy to run XPath against the HTML)
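To sanity-check it, here's a minimal sketch with Python's lxml against a hand-written fragment that mimics the structure described in the question (the markup is my assumption, not copied from the live site):

# Minimal sketch: test the combined XPath with lxml against assumed markup.
from lxml import html

doc = html.fromstring("""
<header>
  <ul>
    <li class="nav-level-1-list">
      <a href="/new-in">New In</a>
      <ul>
        <li><a href="/new-in/dresses">Dresses</a></li>
        <li><a href="/new-in/knitwear">Knitwear</a></li>
      </ul>
    </li>
  </ul>
</header>
""")

links = doc.xpath(
    "//header//li[@class='nav-level-1-list']"
    "/ul[preceding-sibling::a[contains(text(),'New In')]]//li/a"
)
for a in links:
    print(a.text_content(), a.get("href"))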
I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the "WillPay" information from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value of "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values onto the google spreadsheet document.
What I've tried is locating the XPath of the whole chain, and my best attempt was
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this xpath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay
An alternative:
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
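If you want to test this outside of Sheets first, here's a rough sketch with Python and lxml; the URL is a placeholder since the original link isn't included in the question, and the query is the first expression above:

# Rough sketch: fetch the XML and pull the WillPay values with lxml.
# XML_URL is a placeholder; the real feed URL isn't in the question.
import requests
from lxml import etree

XML_URL = "https://example.com/AOSBS_XML"  # placeholder

root = etree.fromstring(requests.get(XML_URL).content)
values = root.xpath(
    '//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]'
    '//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay'
)
print(values)  # expect something like ['3350']

For Google Sheets, the same query should work with IMPORTXML, e.g. =IMPORTXML("https://example.com/AOSBS_XML", "//MeetingInfo[@Venue='S1']//PoolInfo[@Pool='WIN']//OddsInfo[@Number='1']/@WillPay"), again with the real feed URL substituted in.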
Problem Summary:
Hi, I'm trying to learn to use the Scrapy Framework for python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I was going to use a different site for practice rather than just copy them on Alibaba. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down that tutorial page on the scrapehero site, at the "Construct Xpath selectors for the product list" section). The problem is I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to figure out the right syntax, but I haven't been able to get it.
Background info:
So what I want is- from https://www.mlb.com/scores, I want an xpath() command which will return an array with all the games displayed.
Following along with the tutorial, what I understand about how to do this is that I'd want to inspect the elements from the webpage, determine their class/id, and specify that in the xpath command.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[#class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[#class='g5-component]")
response.xpath("//li[#class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[#class='mlb-scores__list-item']")
response.xpath("//div[#!data-game-pk-id > 0]")'
response.xpath("//div[contains(#class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've only been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to be more specific I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try to use Scrapy-Splash to render the target content from your start_urls, or you can find the direct HTTP request used to fetch the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
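Once you have that JSON, parsing it is straightforward. A rough sketch (I've trimmed the query string to the essentials, and the field names reflect my reading of the statsapi response, so verify them against the actual payload):

# Rough sketch: fetch the schedule JSON and list the games.
# Field names ("dates", "games", "teams", "status") are assumptions
# based on the statsapi response and should be verified.
import requests

url = "https://statsapi.mlb.com/api/v1/schedule?sportId=1&date=2019-06-26"
data = requests.get(url).json()
for day in data.get("dates", []):
    for game in day.get("games", []):
        away = game["teams"]["away"]["team"]["name"]
        home = game["teams"]["home"]["team"]["name"]
        print(away, "at", home, "-", game["status"]["detailedState"])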
I'm looking into the possibilities of JMeter, and it looks just great. However, one of the things my testing script should be able to do is search for some values and click on a random resulting link.
So what I would need to automate is:
Entering the values in the searchbox (I could do this by using the correct GET url in a second page, but how do I do this 5000 times?)
Clicking on one of the results listed.
Thanks for the help!
This can be done using the CSV Data Set Config: it loads a CSV file and feeds the contents into the variables of your choice:
http://jmeter.apache.org/usermanual/component_reference.html#CSV_Data_Set_Config
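If you don't already have 5000 search values on hand, a short script can generate the CSV; a minimal sketch (the file name and the searchTerm column are just examples; in JMeter you'd reference the column as ${searchTerm} in your HTTP request):

# Minimal sketch: write one search term per line for CSV Data Set Config.
import csv

with open("search_terms.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(5000):
        writer.writerow([f"term {i}"])  # replace with real search values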
After that, you can use the Regular Expression Extractor to extract the URLs you want from the resulting HTML, and follow those links:
http://jmeter.apache.org/usermanual/component_reference.html#Regular_Expression_Extractor
You could use an HTML Link Parser for your second part (clicking on one of the results):
http://jmeter.apache.org/usermanual/component_reference.html#HTML_Link_Parser
See http://theworkaholic.blogspot.co.at/2009/11/randomly-clicking-links-in-jmeter.html for an example of using the link parser in a context similar to your question.
I'm somewhat new to automation, and am learning everything autodidactically, so forgive me if my terminology is a bit off. I've searched high and low for an answer to this question, and I can't seem to find anything. I presume it's my small vocabulary when it comes to this stuff... anyway...
I'm attempting to write a test that performs all the actions necessary to complete a tutorial by using the recorder. However, for one particular step, the element ID changes. For example, the ID I'm trying to click is this:
//li[@id='message_661119']/div[2]/div[2]/a/img
However, for each new user that is performing the tutorial "quest", the number of the id changes.
Is there any way to get Selenium to recognize, or use, wildcards? Example:
//li[@id='message_******']/div[2]/div[2]/a/img
Of course, the example above does not work.
Any advice would be immensely helpful. Thank you!!
You can use starts-with() for this:
//li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img
It's one of the examples mentioned in Locating Techniques in Selenium's docs for starts-with().
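For completeness, here's a minimal sketch of the same locator through Selenium's Python bindings (the URL is a placeholder, and the rest of the path is taken from the question):

# Minimal sketch: click the image via a starts-with XPath locator.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/tutorial")  # placeholder URL

driver.find_element(
    By.XPATH, "//li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img"
).click()
driver.quit()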
In the Target field of the command in Selenium IDE where you can see message_123123, click the drop-down list and choose the xpath:idRelative option; if that one doesn't work, try other options that don't include that annoying message_123123. This way you identify the webpage element by its location rather than its id. I solved my issue this way.
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure if the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'crawl web' operator in RapidMiner).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPaths from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even install them) since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need to have some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files.' I've looked at 'process documents from web' but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the html pages internally using RapidMiner is a two step process:
Step 1 Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan with the following difference:
Instead of the Crawl Web operator, use the Process Documents from Web operator. There will not be an option to specify the output directory, because the results will be loaded into the ExampleSet.
The ExampleSet will contain links matching the crawling rules.
Step 2 Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html but only from 7:40 with the following difference:
put the Extract Information subprocess inside the Process Documents from Web operator that was created previously.
The ExampleSet will contain the links and the attributes matching the XPath queries.
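If RapidMiner still feels limiting, the same two-step idea (read URLs from a list, apply an XPath to each page, store nothing on disk) also fits in a short script. A rough Python sketch, offered as an alternative rather than a RapidMiner feature (the file name and query are placeholders):

# Rough sketch: fetch each URL from a list and apply an XPath to it.
import requests
from lxml import html

with open("urls.txt") as f:                # one URL per line (placeholder name)
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    doc = html.fromstring(requests.get(url).content)
    matches = doc.xpath("//title/text()")  # replace with your own XPath
    print(url, matches)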
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)