Xpath implementation in Google Sheets - xpath

XPath newbie question, so forgive me if this seems straightforward, but I really have looked everywhere for the answer!
I'm trying to build a process for extracting all my playlists from Spotify and making it universal, allowing migration across various platforms. I will gladly share once completed as I know many people would find this useful.
I'm unfortunately stumped on trying to extract some data from:
[http://musicbrainz.org/ws/2/artist/?query=%22faith%20no%20more%22][1]
I am looking to extract the id from the artist element, which should be b15ebd71-a252-417d-9e1c-3e6863da68f8. I can get this working in BaseX with the following:
declare namespace mmd="http://musicbrainz.org/ns/mmd-2.0#";
declare variable $doc := doc("http://musicbrainz.org/ws/2/artist/?query=%22faith%20no%20more%22");
$doc/mmd:metadata/mmd:artist-list/mmd:artist/@id
However, in Google Sheets using Importxml, the best I can do is:
=IMPORTXML("http://musicbrainz.org/ws/2/artist/?query=%22faith%20no%20more%22","//@id")
This results in all 3 id results being returned:
b15ebd71-a252-417d-9e1c-3e6863da68f8
489ce91b-6658-3307-9877-795b68554c98
83f22bb6-4631-443c-bace-9fae8540362a
I am completely stumped and any help will be greatly appreciated.
Kind regards,
James

I haven't been able to find any useful documentation on Google's IMPORTXML, but there is no evidence that it provides any way to establish a namespace binding, or that it supports the XPath 2.0 syntax *:metadata to select elements independent of namespace. If that's the case then you may need to resort to the horrible construct *[local-name()='metadata']/*[local-name()='artist-list']/*[local-name()='artist']
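To illustrate what namespace-independent matching buys you, here is a minimal sketch in Python rather than Sheets, using the standard library's xml.etree.ElementTree (the `{*}` wildcard needs Python 3.8 or later). The sample XML is a hand-trimmed, hypothetical stand-in for the MusicBrainz response, and the `child` element is invented purely to show a second id; only the artist id comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the MusicBrainz response (assumption:
# the real document has the metadata/artist-list/artist shape implied by
# the BaseX query in the question; <child> is invented for illustration).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist-list>
    <artist id="b15ebd71-a252-417d-9e1c-3e6863da68f8">
      <name>Faith No More</name>
      <child id="some-other-id"/>
    </artist>
  </artist-list>
</metadata>"""

root = ET.fromstring(SAMPLE)

# "{*}name" matches an element called "name" in any namespace, which plays
# the same role as *[local-name()='name'] in XPath 1.0.
artist_ids = [a.get("id") for a in root.findall("{*}artist-list/{*}artist")]

# Rough equivalent of //@id: every id attribute anywhere in the document.
all_ids = [e.get("id") for e in root.iter() if e.get("id") is not None]

print(artist_ids)
print(all_ids)
```

The first list contains only the artist id, while the second also picks up ids on nested elements, which is the same over-matching that //@id shows in the question. In IMPORTXML, the corresponding step would presumably be to append /@id to the local-name() path above.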

Related

Correct way to search videos with multiple keywords with OR condition for youtube search API

I'm trying to use the YouTube Data search and video APIs in my web application to display the most-viewed videos related to several keywords. I plan to use two calls in total: the first gets a list of IDs from the search API, and the second gets the details for those IDs from the video API.
My question is about the search API. From trial and error, if I enter multiple keywords separated by spaces in the q parameter, the search API appears to treat them as an AND condition, which differs from the usual behavior of search engines such as Google. As far as I can tell, searching with an OR condition seems to work if I put OR between the keywords, but I would like to confirm that my assumption is correct, officially if possible.
I expected to find this kind of specification in the official documentation, but I've had no luck. It would be very helpful if you could share links, if any exist, or give me the official answer.
By the way, this is my first post on Stack Overflow. If my question is missing anything, please let me know.

is there a specific way to write xpaths into rapidminer for web crawling

I have tried so many options, over many days to try and extract data. I don't know where I am going wrong.
for example, I am on the website reviewcentre.com and am looking at car selling site reviews.
I am struggling badly to retrieve information, most of my xpaths appear incorrect.
Where can I best learn how to do this properly, I have spent days at this.
https://www.reviewcentre.com/car_dealers/we_buy_any_car_-_wwwwebuyanycarcom-review_14068020
I know how to copy xpaths, but when it comes to rapidminer, I can't extract the data.
I know I am doing it wrong, but I don't know what's right unfortunately.
examples include
//*[@id="ReviewTitle-14068020"]
h:html/h:head/h:title/text()
this one works!
//*[@id="ReviewBox-14068020"]/div[1]/div[2]/p[2]/span
I appear to have no problem retrieving the XPath from the website, but using it to extract data in RapidMiner is not working at all. I would really appreciate it if anyone could point me in the right direction.
Obviously, you don't want to use unique IDs in your xpaths.
Make sure you have understood the concept of xml namespaces, too.

Using wildcards in Selenium IDE

I'm somewhat new to automation, and am learning everything auto-didactically, so forgive me if my terminology is a bit off. I've searched high and low for an answer to this question, and I can't seem to find anything. I presume it's my small vocabulary when it comes to this stuff... anyway...
I'm attempting to write a test that performs all the actions necessary to complete a tutorial by using the recorder. However, for one particular step, the element ID changes. For example, the ID I'm trying to click is this:
//li[@id='message_661119']/div[2]/div[2]/a/img
However, for each new user that is performing the tutorial "quest", the number of the id changes.
Is there any way to get Selenium to recognize, or use, wildcards? Example:
//li[@id='message_******']/div[2]/div[2]/a/img
Of course, the example above does not work.
Any advice would be immensely helpful. Thank you!!
You can use starts-with() for this:
//li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img
It's one of the examples mentioned in Locating Techniques in Selenium's docs for starts-with().
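To see what a prefix match like this selects, here is a small Python sketch. It is not Selenium: ElementTree's limited XPath support has no starts-with(), so str.startswith stands in for it, and the markup is made up to mirror the div[2]/div[2]/a/img shape from the question.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the tutorial page (assumption).
PAGE = """<ul>
  <li id="message_661119"><div/><div><div/><div><a><img src="a.png"/></a></div></div></li>
  <li id="sidebar_1"><div/><div><div/><div><a><img src="b.png"/></a></div></div></li>
</ul>"""

root = ET.fromstring(PAGE)

# Emulates //li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img:
# keep only <li> elements whose id begins with the fixed prefix, then
# follow the same positional path down to the image.
matches = [
    img
    for li in root.iter("li")
    if li.get("id", "").startswith("message_")
    for img in li.findall("div[2]/div[2]/a/img")
]

print([img.get("src") for img in matches])
```

Only the image under the message_ item is selected, whatever number follows the prefix.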
In Selenium IDE, go to the Target field of the command where you can see message_123123, open the dropdown list, and choose the xpath:idRelative option. If that one doesn't work, try the other options that do not include that annoying message_123123. This way you'll identify the webpage element by its location rather than its id. I solved my issue this way.

Can rapidminer extract xpaths from a list of URLS, instead of first saving the HTML pages?

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure if the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'Crawl Web' operator in RapidMiner).
I've seen the following tutorials from Neil Mcguigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my pc. And the web crawler simply lacks critical features so I'm unable to use it for my purposes. Is there a way I can just make it read the URLS, and scrape the xpath's from each of those URLS?
I've also looked at other tools for extracting html from pages, but I've been unable to figure out how they work (or even install) since I'm not a programmer. Rapidminer on the other hand is easy to install, the operator descriptions make sense but I've been unable to connect them in the right order.
I need to have some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files.' I've looked at 'process documents from web' but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages locally using RapidMiner is a two-step process:
Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference: instead of the Crawl Web operator, use the Process Documents from Web operator. There will not be an option to specify the output directory, because the results will be loaded into the ExampleSet. The ExampleSet will contain links matching the crawling rules.
Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40, with the following difference: put the Extract Information subprocess inside the Process Documents from Web operator created previously. The ExampleSet will contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) to pull out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable. Eg. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you want an easy way of achieving this. I've recently dived into the world of NLP (Natural Language Processing) and text mining, and it's a daunting field, most of which went far over my head.
I did manage to code some functionality that resembles what you're looking for, though I did it in PHP. What I would suggest is that if you want it tailored to your project (Wrangl), you do it yourself,
using the Porter stemming algorithm, for which I'm sure there will be Ruby code.
Ruby Porter stemmer
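As a rough idea of the do-it-yourself route, here is a minimal sketch in Python (not Ruby) of frequency-based keyword extraction. Everything in it is an assumption for illustration: the stopword list is tiny, `crude_stem` is a hypothetical stand-in that strips a few suffixes and is not the Porter algorithm, and `top_keywords` is a made-up helper name.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real one would be longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "or",
             "the", "to", "vs", "with"}


def crude_stem(word):
    """Strip a few common suffixes. NOT the Porter algorithm, only a stand-in."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_keywords(text, limit=10):
    """Return up to `limit` stems, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(w) for w in words if w not in STOPWORDS]
    return [stem for stem, _ in Counter(stems).most_common(limit)]


print(top_keywords("Datamapper vs ActiveRecord: comparing Ruby ORMs and ORM features", limit=3))
```

Stemming lets "ORMs" and "ORM" count as the same term, which is why it helps before counting; a proper Porter stemmer would handle far more cases than this stand-in. Nothing here writes to disk, so the same approach should suit a host like Heroku.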
You can try the salsaAPI to automatically extract keywords and categorize the debates!
