XPath - Extract specific file name from string - xpath

I'm trying to extract just the filename from a JavaScript link in import.io, e.g. googlebolver.htm from href="javascript:finpopup('googlebolver.htm',920,620,0)"
I've managed to get to the 'link' (javascript:finpopup('googlebolver.htm',920,620,0)) with the following XPath
//*[text()='GOOGLE.MAPS']/@href
but I would like to get the actual address on its own.
As I am running the import.io Extractor on multiple URLs, I want it to find something like *.htm
I believe this may be possible by using the substring function, but I don't know how to do it.
The following questions on this site looked promising, but one only works for fixed-length strings and the other I don't completely understand, and it works only for a specific 'word':
Extract value from javascript object in site using xpath and import.io
How to use substring() with Import.io?
Thanks in advance for your help
EDIT: Here is the URL

You can use the XPath functions substring-after and substring-before to select the text after, say, (' and before ',
In your example, it would be
substring-before(substring-after(//*[text()='GOOGLE.MAPS']/@href,"('"),"',")
Note: I don't know whether import.io supports these standard XPath functions.
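If it helps to see the expression run outside import.io, here is a minimal sketch using Python's lxml (the HTML snippet is made up to mirror the example above); whether import.io itself accepts the nested functions is a separate question.
# Minimal sketch: evaluate the nested substring-before/substring-after
# expression with lxml against made-up markup matching the example above.
from lxml import html

snippet = """
<a href="javascript:finpopup('googlebolver.htm',920,620,0)">GOOGLE.MAPS</a>
"""

doc = html.fromstring(snippet)
# String-valued XPath expressions return a plain string from xpath()
filename = doc.xpath(
    """substring-before(substring-after(//*[text()='GOOGLE.MAPS']/@href, "('"), "',")"""
)
print(filename)  # -> googlebolver.htm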

Related

How can I find an image source based on xpath elements including a slash?

I'm trying to get an image on a remote page based on its source value containing "/images/".
As an example: https://images-na.ssl-images-amazon.com/images/W/WEBP_402378-T1/images/G/01/kindle/ku/KU-retail-lp_KUPrePaid._CB661222046_.jpg
Because it has the /images/ in the source, this xpath should theoretically work:
@$xpath->query('//*[contains(@src,"/images/")]/@src')[0];
Unfortunately it does not. I thought it might be an escaping issue, but that doesn't seem to work either. What's the trick here?
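As a side note, contains(@src, "/images/") is valid XPath and does match src attributes containing that substring; a quick standalone check (sketched here with Python's lxml rather than PHP, against made-up markup) behaves as expected, so an empty result in practice may mean the image markup never reaches DOMXPath (for example, if it is added by JavaScript after the page loads).
# Sketch (Python/lxml, hypothetical markup) showing the contains(@src, ...)
# expression selecting only the src values that include /images/.
from lxml import html

snippet = """
<div>
  <img src="https://images-na.ssl-images-amazon.com/images/G/01/kindle/ku/KU-retail-lp_KUPrePaid._CB661222046_.jpg">
  <img src="https://example.com/assets/logo.png">
</div>
"""

doc = html.fromstring(snippet)
sources = doc.xpath('//*[contains(@src,"/images/")]/@src')
print(sources)  # only the URL containing /images/ is printed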

Confused about XPath Syntax

Problem Summary:
Hi, I'm trying to learn to use the Scrapy framework for Python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I wanted to practice on a different site rather than just copying their Alibaba example. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down that tutorial page, at the "Construct Xpath selectors for the product list" section). The problem is I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to figure out the right syntax, but I haven't been able to get it.
Background info:
So what I want is- from https://www.mlb.com/scores, I want an xpath() command which will return an array with all the games displayed.
Following along with the tutorial, my understanding of how to do this is that I'd want to inspect the elements on the webpage, determine their class/id, and specify that in the XPath command.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[@class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[@class='g5-component]")
response.xpath("//li[@class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[@class='mlb-scores__list-item']")
response.xpath("//div[@!data-game-pk-id > 0]")
response.xpath("//div[contains(@class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to be more specific I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try to use Scrapy-Splash to render the target content from your start_urls, or you can find the direct HTTP request used to fetch the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
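For instance, a spider can call that schedule endpoint directly and walk the JSON instead of scraping the rendered page. The sketch below trims the query string down to a couple of parameters, and the field names (dates, games, gamePk, teams/home/team/name, ...) are assumptions based on inspecting the response; verify them against the live JSON before relying on them.
# Sketch: query the statsapi schedule endpoint directly and emit one item per
# game. The URL is a trimmed-down version of the one above, and the JSON field
# names are assumed from inspection of the response.
import json

import scrapy


class MlbScoresSpider(scrapy.Spider):
    name = "mlb_scores"
    start_urls = [
        "https://statsapi.mlb.com/api/v1/schedule?sportId=1&date=2019-06-26"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for day in data.get("dates", []):
            for game in day.get("games", []):
                yield {
                    "gamePk": game.get("gamePk"),
                    "status": game.get("status", {}).get("detailedState"),
                    "home": game.get("teams", {}).get("home", {}).get("team", {}).get("name"),
                    "away": game.get("teams", {}).get("away", {}).get("team", {}).get("name"),
                }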

Getting Cell Contents Via XPath for ImportXML()

I am trying to scrape data from https://www.snpedia.com/index.php/Rs7136259 to create an automated database of genomic information using Google Sheets.
I would like to retrieve the odds ratio contained in a table on the page. I have tried to figure out the XPath, but nothing I do works. I copied the XPath from Inspect Element, but that returns a #N/A error. The information I am trying to scrape is the "Odds Ratio".
My current query:
=importxml(J2,"//*div[@id="mw-content-text"]/table/tr[7]/td")
Thanks for your input. I have searched the other links but could not figure it out. Sorry for being so green.
As noted in the comments, *div is not valid XPath. Another problem is that you have double quotes inside of double quotes, which is also invalid.
It looks like this works:
=importxml(J2,"//*[@id='mw-content-text']/table/tr[7]/td")
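If you want to sanity-check the expression outside of Sheets, the same XPath can be evaluated against the raw page HTML; the sketch below uses Python with requests and lxml purely for verification. The tr[7] row index comes from the question and the path assumes the page structure at the time, so it may break if the wiki layout changes.
# Verification sketch only (not part of the Sheets setup): fetch the page and
# run the same XPath with lxml. tr[7] is the row index from the question and
# may shift if the table changes.
import requests
from lxml import html

resp = requests.get("https://www.snpedia.com/index.php/Rs7136259")
tree = html.fromstring(resp.content)
cells = tree.xpath("//*[@id='mw-content-text']/table/tr[7]/td")
print([c.text_content().strip() for c in cells])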

How can I find the content between multiple parameter strings (E.G. /contact-group/{ID}/member/{CONTACT-ID})

I searched around but couldn't find anything on this specifically. I'm looking for a way to find the content of a URL (in this case these are URIs in a REST API).
A few examples of these look like:
/currency/{currency-id}
Or
/contact-group/{ID}/member/{CONTACT-ID}
The parameters can always be different, but they always sit between {}, at different places within the string. I know how to replace them when there is only one in the URI, but the programmer won't know them at runtime, and I'm trying to avoid having to define them; because of this, when a URI contains multiple parameters I'm not sure how to capture each of them.
Happy for any ideas on how to get around this!
Seems like you're looking for a basic example of routing:
# in config/routes.rb
get "/:param_1/:param_2", to: "MyController#some_action"
Then in the controller you'd be able to get params[:param_1] and such.
You can see Rails' routing guide for more info
Maybe I'm not totally understanding your question, though. If you're looking to be able to capture a variable number of params, there's a special syntax for passing arrays in the query param.
See this: Passing array of parameters through get in rails
The answer to this was here
Basically, using parameterset = url.scan(/{.+?}/) (replace url with the name of your string, and the pattern inside scan with whatever matches your parameters), I can then do
parameterset.each { |x| x.... etc}
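For comparison, the same non-greedy brace pattern carries over to other languages; here is a quick Python sketch of the idea (the URI string is just the example from the question).
# Python sketch of the same approach: scan a URI template for every {...}
# placeholder with a non-greedy regex, then strip the braces.
import re

uri = "/contact-group/{ID}/member/{CONTACT-ID}"

placeholders = re.findall(r"\{.+?\}", uri)
print(placeholders)                            # ['{ID}', '{CONTACT-ID}']
print([p.strip("{}") for p in placeholders])   # ['ID', 'CONTACT-ID']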

Wiki quotes API?

I would like to get a structured version of a Wikiquote page via JSON (basically I need all the phrases).
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get the full HTML source code. I need each phrase as an element of an array.
How could I achieve that with DBpedia?
For one thing, I am not sure whether you can query Wikiquote using DBpedia, and secondly, DBpedia only gives you infobox data in a structured way; it does not expose the article content in any structured form. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying already gives you text, so this makes things easier, though not completely.
Try this piece of code in your console:
require 'json'
require 'open-uri'
require 'nokogiri'

# Fetch the parse result as JSON; the rendered HTML sits under parse.text.*
content = JSON.parse(open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*']
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer. Of course it is not completely right, because you will get a lot of unnecessary data, but if you dig into Nokogiri and XPath and work out how to pinpoint the nodes you need, you can get a solution that returns correct quotes at least 90% of the time.
Just change the format to JSON. Look up the MediaWiki API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text
