XPath scrape using google sheets

I have been struggling to get any XPath technique to work in Octoparse and similar software. After reading posts here I'm now trying Google Sheets, and I can't get it to work there either.
Input: A SlideShare presentation URL (e.g. https://www.slideshare.net/carologic/ai-and-machine-learning-demystified-by-carol-smith-at-midwest-ux-2017)
Intended output: Slideshare embed url (in this case: https://www.slideshare.net/slideshow/embed_code/key/wZudqqTdctjWXA)
I think this would be the way to get the output using google sheets: =importxml(A1,"//meta[@itemprop='embedURL']/@content")
It is not working for me (failure to fetch url). With Octoparse etc I just got a blank value.
I'm being daft here, no doubt. Any help would be useful.

It doesn't work because SlideShare is owned by LinkedIn, and they have put a lot of effort into ensuring it can't be scraped, including from Google Sheets. It used to be possible, but I believe they eventually caught on to the workaround.

Related

Direct URL to "I'm Feeling Lucky" for images

I have a website for book reviews. I offer a link to the Amazon entry of the books. I discovered after a bit of research that the direct URL for Google's "I'm Feeling Lucky" is:
http://www.google.com/search?hl=en&q=TITLE+AUTHOR+amazon&btnI=745
Which works magic because then I don't have to manually include the Amazon link in my database and directly links to the Amazon page (works 99.99% of the times).
I was wondering if there was an equivalent for images (whether Google or some alternative) to retrieve an image URL based on keywords only (for the purpose of getting the book cover image).
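
The "I'm Feeling Lucky" URL pattern quoted above can be assembled from a title and author with proper escaping. This is a minimal sketch: the function name lucky_url is hypothetical, and the btnI value is simply copied from the URL in the question.

```python
from urllib.parse import urlencode


def lucky_url(title: str, author: str) -> str:
    """Build an "I'm Feeling Lucky" search URL aimed at the book's Amazon page."""
    params = {
        "hl": "en",
        "q": f"{title} {author} amazon",
        "btnI": "745",  # value copied verbatim from the URL in the question
    }
    # urlencode escapes spaces as "+" and keeps the parameter order shown above
    return "http://www.google.com/search?" + urlencode(params)


print(lucky_url("Dune", "Frank Herbert"))
```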

There's no such thing for Google Images, but you might be able to use another web service to do what you want. I noticed that when you search for a book, the first image result isn't always its cover. Sometimes it's a photo of the author, sometimes it's an image from a review of the book, so you can hardly rely on that.

It should not be hard to parse the Amazon page and get the image and link, but Google also has a Google Books API that returns all the information about a book in JSON format. You can try it online in the API Explorer (the covers are in the results too). Click here to see an example (click "Execute" to run it).
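
As a rough sketch of the Google Books approach, the code below builds a volumes search URL and pulls a cover link out of the JSON response. The helper names (volumes_url, first_cover, fetch_cover) are hypothetical, and it assumes the response keeps covers under items[].volumeInfo.imageLinks; check the current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.googleapis.com/books/v1/volumes"


def volumes_url(title: str, author: str) -> str:
    """Build a search URL restricted to the given title and author."""
    query = f"intitle:{title} inauthor:{author}"
    return API + "?" + urlencode({"q": query, "maxResults": 1})


def first_cover(response: dict):
    """Return the first cover image URL in a parsed response, or None."""
    for item in response.get("items", []):
        links = item.get("volumeInfo", {}).get("imageLinks", {})
        cover = links.get("thumbnail") or links.get("smallThumbnail")
        if cover:
            return cover
    return None


def fetch_cover(title: str, author: str):
    """Query the API over the network and return a cover URL, or None."""
    with urlopen(volumes_url(title, author), timeout=10) as resp:
        return first_cover(json.load(resp))
```

Calling fetch_cover("Dune", "Frank Herbert") would perform the network request; first_cover can be exercised offline against a saved response.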

Unfortunately, the public Google search engine doesn't support that. You should use the Custom Search API to implement such a feature in your application. Alternatively, use XGoogle (an unofficial Python wrapper for Google Search services; see the google_dl tool for an example).
Other suggestions:
YQL by Yahoo (see the yql-tables repo on GitHub for examples).
Alternative search engines.
E.g. in Wolfram Alpha you can type "show image of laptop" and it will give you the first popular picture. However, you need to use the Wolfram|Alpha APIs or some script (see this ChatBot for an example) to pick up the direct link.

Google Chrome Malware Warning when including images from image search API

I'm using Google and Bing image search APIs to provide a way for users of my web app to search for images to include in the documents they create in the app. A (rare?) problem I encountered today: a result from either Bing or Google (I'm going to assume Bing) caused the Google Chrome Malware detector to go off.
Is there any good way to avoid this that I'm not aware of, aside from only using the Google Image API (which is being deprecated!) since I assume they filter out results from sites they think contain malware?
There doesn't seem to be any performant way on my end to check these results before displaying them to prevent this error from occurring, and I'm very worried that any less savvy computer users will think my site is at fault (not to mention being unable to make the warning go away).
I guess I'm also making the assumption here that images from random Internet sites are okay to include in the page as long as they are returned by these APIs...I do copy them over to our own S3 account a few minutes after they are added to the document in case they are changed/removed on the external site...
EDIT: The result is indeed being included from the Bing API, and it is from thefatlossauthority.com.
I would prefer a solution based in Ruby, but given a general solution I'm more than willing to implement it myself.

Google Suggest, how it works?

How does Google Suggest work? How does it manage to update the web page on the client so quickly, based on information in a distant Google database? Why does the web page not look ‘jumpy’ if it is being frequently updated?

It uses AJAX.
When you are writing your query, it searches for the 10 most requested words matching yours. Then it writes minified JSON on an invisible DIV element. Fast, but still resource intensive.
Try to install Firebug on Firefox or use the Developer Console on Chrome, open the console and start writing "Youtube" or whatever you want. You will see the minified JSON responses.
Good luck :D
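
The server side of that exchange (find the most requested queries matching what was typed, return them as minified JSON) might look roughly like the sketch below. It is not Google's implementation: the in-memory table of query counts and the names build_index and suggest are assumptions made for illustration.

```python
import json
from bisect import bisect_left


def build_index(query_counts: dict) -> list:
    """Sort (query, count) pairs so queries sharing a prefix are contiguous."""
    return sorted(query_counts.items())


def suggest(index: list, prefix: str, limit: int = 10) -> list:
    """Return up to `limit` queries starting with `prefix`, most requested first."""
    start = bisect_left(index, (prefix,))  # first entry >= the prefix
    matches = []
    for query, count in index[start:]:
        if not query.startswith(prefix):
            break  # left the contiguous run of matching queries
        matches.append((query, count))
    matches.sort(key=lambda pair: -pair[1])
    return [query for query, _ in matches[:limit]]


# Made-up request counts, for illustration only
index = build_index({"youtube": 900, "yoga": 120, "young": 50,
                     "youtube music": 300, "yahoo": 400})
# Minified JSON, as the browser would receive it
print(json.dumps(suggest(index, "you"), separators=(",", ":")))
```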

In addition to the front-end handling others have talked about, which jQuery is a great example of, you might also be interested in how they approach the idea on the backend. Dr. Peter Norvig has written about how to create a spelling corrector, where similar approaches could be used to find close matches.

The whole page is not being updated. Only parts of it are, using AJAX (Asynchronous JavaScript and XML). Ajax requests can be made in JavaScript, and the page is updated when the response comes back.
A far more interesting question is how does Google actually search 10bn+ documents in a teeny tiny fraction of a second :)

xpath in =importXML() for extracting meta descriptions

I'm trying to use Xpath to pull in the meta descriptions from web pages, using Google Sheets.
I have this working to pull in the titles: =importXml(www.example.com; "//title")
Here are two sources of my learning:
http://seogadget.co.uk/playing-around-with-importxml-in-google-spreadsheets/
http://docs.google.com/support/bin/answer.py?hl=en&answer=75507
I have read many other posts on this site, and these seem to be close to what I want:
"/html/head/meta[@name='description']/@content"
"/*/head/meta[@name='description']/@content"
"//head/meta[@name=\"description\"]/@content"
None of these work in Google Sheets, which specifies to write it in Xpath. The only difference, is that in Google Sheets you are to use ' in place of " (hence why description is like that). I've honestly tried it about 219 different ways....no luck.
Any ideas? Thanks in advance!

//meta[@name='description']/@content
So your full call in Google Sheets would be
=importxml(A1,"//meta[@name='description']/@content")
I've built some awesome SEO tools using importXML - this is just the start of it mate! :)
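
For checking the same expression outside Sheets, here is a minimal Python sketch using only the standard library. It assumes the page is well-formed XML (real pages often are not, in which case an HTML-aware parser would be needed), and because ElementTree supports only a subset of XPath, the trailing /@content step is replaced by reading the attribute. The sample markup and the name meta_description are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed sample page (not a real site)
PAGE = """<html>
  <head>
    <title>Example page</title>
    <meta name="description" content="A short summary of the page."/>
    <meta name="keywords" content="xpath, sheets"/>
  </head>
  <body><p>Hello</p></body>
</html>"""


def meta_description(markup: str):
    """Return the content of <meta name="description">, or None if absent."""
    root = ET.fromstring(markup)
    # Same attribute predicate as in the Sheets formula
    node = root.find(".//meta[@name='description']")
    return node.get("content") if node is not None else None


print(meta_description(PAGE))
```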

RSS API to get TED videos

I'm developing an app on the Android platform that fetches TED videos and replicates the TED experience.
I want to let users browse the talks in these ways:
by category, by views, by speaker, and by tag name.
Unfortunately, after a lot of googling I still haven't found a good way to get or separate the talks by the above criteria using FeedBurner feeds like the ones below:
http://feeds.feedburner.com/TedtalksHD
http://feeds2.feedburner.com/tedtalks_video/
Is there an API or some other way to do this better? I tried the Google Reader API, but there the feeds are not listed by category.
I really appreciate your help.

At first I thought this would be a job for a Yahoo Pipe, but after looking at the feed it seems every item is tagged with the same Higher Education category. No luck going that route.
I think you might want to look at the YouTube Data APIs.
http://code.google.com/apis/youtube/2.0/developers_guide_protocol_understanding_video_feeds.html#Understanding_Video_Entries
It looks like that data set gives you a lot more information, including the number of views and favorites on YouTube.
Oops, I forgot to mention that the TED videos are always on YouTube at http://www.youtube.com/user/TEDtalksDirector
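
The category check described above can be reproduced by grouping feed items by their category element. This is only a sketch: the sample feed is made up to mirror what the answer reports, the name titles_by_category is hypothetical, and a real feed may carry its categories in namespaced elements that this code does not read.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up RSS 2.0 snippet where every item has the same category
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Sample talks</title>
<item><title>Talk A</title><category>Higher Education</category></item>
<item><title>Talk B</title><category>Higher Education</category></item>
</channel></rss>"""


def titles_by_category(feed_xml: str) -> dict:
    """Map each category name to the titles of the items tagged with it."""
    groups = defaultdict(list)
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        categories = [c.text for c in item.findall("category") if c.text]
        for category in categories or ["(uncategorized)"]:
            groups[category].append(title)
    return dict(groups)


print(titles_by_category(SAMPLE_FEED))
```

If every talk lands under a single key, as here, the feed's categories cannot be used to split the talks.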
