Using XPath to get an href from HTML

This is my first time posting here, and I am new to Google Apps. I am putting together a URL in a spreadsheet for a LinkedIn company page, for example: http://www.linkedin.com/company/National-Renewable-Energy-Laboratory
Can I use =importXML from a Google spreadsheet plus XPath to get the website URL that is listed on each company page?
I have gotten to the point where I can extract all the hrefs from the page, and the link that I need is among them, but I only want the website URL.
Here is what I am using so far:
=importXML(R2, "//*[@href]")
Here is a link to my spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AheVK6uxf6AvdHhILTFrR1k4Wl9tWW5OVWpRRUJKMlE
The code is in S2
Appreciate your response.

//*[@href] matches elements that have an href attribute, not the href attributes themselves. Try //@href instead.
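
To make the element-versus-attribute distinction concrete, here is a minimal Python sketch. It assumes a small, well-formed stand-in document (the markup is hypothetical, not LinkedIn's real page) and uses the standard library's ElementTree, which supports only a subset of XPath. ElementTree can match elements with //*[@href] but cannot select attributes with //@href, so the attribute step is emulated by reading the value from each matched element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a company page (not real LinkedIn markup).
SAMPLE = """
<html>
  <body>
    <a href="/jobs">Jobs</a>
    <p>No link here</p>
    <a href="http://www.example.org">Website</a>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Counterpart of //*[@href]: the matching ELEMENTS that carry an href.
elements = root.findall(".//*[@href]")

# Counterpart of //@href: the attribute VALUES. ElementTree cannot select
# attributes directly, so each value is read from its matched element.
hrefs = [el.get("href") for el in elements]

# One possible way to isolate an external website link: keep absolute URLs only.
external = [h for h in hrefs if h.startswith("http")]

print([el.tag for el in elements])
print(hrefs)
print(external)
```

In a spreadsheet the same narrowing would be done inside the XPath itself. The filter on absolute URLs here is only an illustrative assumption about what separates the website link from the others.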

It's more complicated, but a good solution would be to use the LinkedIn API, which you can access using UrlFetchApp.

Related

Web Scraping: XPath for Pagination

I am trying to scrape a few company websites with Octoparse. I can't seem to get my XPath right for pagination. The website pages do not have a 'Next' button. I am trying to scrape data from each page.
Any suggestions?
I have tried the following XPath (along with a few other failures):
//*[@id="main"]/div[2]/section/div[1]/nav/ul/li[1]/a/following-sibling::li[1]/a
Here is an example of a company website I am testing it on.

You need the page that follows the current one. This is quite easy with following-sibling:
//li[./a[@class="current"]]/following-sibling::li[1]
You can read about this here

Answering my own question as I modified Redyukov Pavel's solution which worked:
//a[@class='current']/../following-sibling::li[1]/a[1]
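
The logic behind both expressions (find the list item holding the current-page link, then take the first list item after it) can be sketched in plain Python. This is only an illustration under stated assumptions: the pagination markup is hypothetical, the helper name next_page_href is made up for this example, and the standard library's ElementTree has no following-sibling axis, so the sibling step is done by walking the list items in order.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; real sites will differ.
SAMPLE = """
<nav>
  <ul>
    <li><a href="/page/1">1</a></li>
    <li><a class="current" href="/page/2">2</a></li>
    <li><a href="/page/3">3</a></li>
  </ul>
</nav>
""".strip()


def next_page_href(root):
    """Return the href of the link after the current page, or None."""
    for ul in root.iter("ul"):
        items = ul.findall("li")  # direct <li> children, in document order
        for index, li in enumerate(items):
            # Skip list items that do not hold the current-page link.
            if li.find("a[@class='current']") is None:
                continue
            if index + 1 < len(items):
                # Emulates following-sibling::li[1]/a
                link = items[index + 1].find("a")
                return link.get("href") if link is not None else None
            return None  # the current page is the last one
    return None


root = ET.fromstring(SAMPLE)
print(next_page_href(root))
```

Returning None on the last page gives a crawler a natural stopping condition.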

IMPORTHTML or IMPORTXML to collect data from a site

I have made several attempts to collect the data within this table:
I have tried the simple forms of the two functions named in the title, without success.
I would like to know if anyone knows another way to collect this data in Google Sheets.
Site Link:
https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad

The table you want to scrape is under JavaScript control, so it can't be scraped.
All you can get from that site into Google Sheets is:
=ARRAY_CONSTRAIN(IMPORTDATA(
"https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad&team_id=13217"); 10000; 10)

Because the page you are trying to scrape is rendered using JavaScript (i.e. the content you are looking to scrape is not in the markup), you will not be able to use a tool like Google Sheets.
However, you can easily scrape this by using a "headless browser": a browser without a UI that renders your URL, JavaScript included. Once the page is loaded, you query the data using XPath, etc.
Check out Puppeteer for an example of a JS framework that you can use for this task.

XPath scrape using google sheets

I have been struggling to get any XPath technique to work in Octoparse and similar software. After reading posts here I'm now trying Google Sheets, and I can't get it to work there either.
Input: A slideshare presentation url (eg https://www.slideshare.net/carologic/ai-and-machine-learning-demystified-by-carol-smith-at-midwest-ux-2017)
Intended output: Slideshare embed url (in this case: https://www.slideshare.net/slideshow/embed_code/key/wZudqqTdctjWXA)
I think this would be the way to get the output using Google Sheets: =importxml(A1,"//meta[@itemprop='embedURL']/@content")
It is not working for me (failure to fetch url). With Octoparse etc I just got a blank value.
I'm being daft here, no doubt. Any help would be useful.

It doesn't work because SlideShare is owned by LinkedIn, which has put a lot of effort into ensuring it can't be scraped, including from Google Sheets. It used to be possible, but I believe they eventually caught on to the workaround.

Import IO- Using XPath to show "more" content

I'm totally stumped on this and reaching out for help!
I'm using the Import.io crawler to extract reviews from TripAdvisor. However, when I am training the crawler, the "more" button is inactive.
Here's an example of the page: http://www.tripadvisor.co.uk/Hotel_Review-g295424-d306662-Reviews-Hilton_Dubai_Jumeirah_Resort-Dubai_Emirate_of_Dubai.html#REVIEWS
Here is the XPath to the review in full: //*[@id="UR288083139"]/div[2]/div/div[3]
And to the More button:
//*[@id="review_288083139"]/div[1]/div[2]/div/div/div[3]/p/span
Is it possible to have an Xpath so the full review is included in Import.io?

One way you can do this is by using a Crawler then an Extractor. This would split the process into two parts.
Create a crawler that you'd train to capture the links for every review on the page. Make sure that you select link for the column.
Create an Extractor to capture the full review from the links you got from the crawler.
Voila! You got all reviews!
Note: If you already have all the links for the pages you need the reviews from, better make an Extractor instead of a Crawler. This way, you can chain the API to the other extractor. You'd only need a crawler if you don't know all the links.
Hope this helps!

It looks like the HTML is NOT on the page before you click that button, and there isn't a URL which has that data on it. So you may be out of luck.
You could try playing around with the developer console to see if you can find the full reviews buried in an XML file or dynamic URL somewhere. I'm not sure how, though.

xpath in =importXML() for extracting meta descriptions

I'm trying to use Xpath to pull in the meta descriptions from web pages, using Google Sheets.
I have this working to pull in the titles: =importXml(www.example.com; "//title")
Here are two sources of my learning:
http://seogadget.co.uk/playing-around-with-importxml-in-google-spreadsheets/
http://docs.google.com/support/bin/answer.py?hl=en&answer=75507
I have read many other posts on this site, and this seems to be the similar idea of what I want:
"/html/head/meta[@name='description']/@content"
"/*/head/meta[@name='description']/@content"
"//head/meta[@name=\"description\"]/@content"
None of these work in Google Sheets, whose documentation says to write the query in XPath. The only difference is that in Google Sheets you use ' in place of " inside the query (hence the quoting around description above). I've honestly tried it about 219 different ways, with no luck.
Any ideas? Thanks in advance!

//meta[@name='description']/@content
So your full call in Google Sheets would be:
=importxml(A1,"//meta[@name='description']/@content")
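
For anyone who wants to check the expression outside a spreadsheet, here is a minimal Python sketch of the same two steps: match the meta element by its name attribute, then read its content attribute. It assumes a small, well-formed sample page (hypothetical markup) and uses the standard library's ElementTree, which cannot evaluate a trailing /@content step, so the attribute is read with get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE = """
<html>
  <head>
    <title>Example Domain</title>
    <meta name="viewport" content="width=device-width" />
    <meta name="description" content="A short summary of the page." />
  </head>
  <body></body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Counterpart of //meta[@name='description']: the element itself.
meta = root.find(".//meta[@name='description']")

# Counterpart of the trailing /@content step: read the attribute value.
description = meta.get("content") if meta is not None else None

print(description)
```

The predicate on name is what keeps the viewport tag (and any other meta tags) out of the result.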
I've built some awesome SEO tools using importXML - this is just the start of it mate! :)
