I am trying to scrape a few company websites with Octoparse. I can't seem to get my XPath right for pagination. The website pages do not have a 'Next' button. I am trying to scrape data from each page.
Any suggestions?
I have tried the following XPath (along with a few other failures):
//*[@id="main"]/div[2]/section/div[1]/nav/ul/li[1]/a/following-sibling::li[1]/a
Here is an example of a company website I am testing it on.
You need the page next to the current page. This is quite easy with following-sibling:
//li[./a[@class="current"]]/following-sibling::li[1]
You can read about this here
Answering my own question: I modified Redyukov Pavel's solution, and this worked:
//a[@class='current']/../following-sibling::li[1]/a[1]
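If it helps anyone else debugging this kind of thing: before plugging an XPath into Octoparse, you can test it straight in the browser console on one of the target pages. A minimal sketch (the expression is the accepted one above; the class name and structure obviously depend on the site you're scraping):

// Run in the browser console on a listing page to see what the XPath actually selects.
var xpath = "//a[@class='current']/../following-sibling::li[1]/a[1]";
var result = document.evaluate(
  xpath,                               // expression under test
  document,                            // context node
  null,                                // no namespace resolver needed for HTML
  XPathResult.FIRST_ORDERED_NODE_TYPE, // only the first match is interesting here
  null
);
var nextLink = result.singleNodeValue;
console.log(nextLink ? nextLink.href : "no match - check the class name and structure");

If the console logs the URL of the next page, the same expression should behave in Octoparse's pagination settings.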
I am trying to scrape data from an infinite scroll site. I need a list of URLs, though, to be able to parse the data I need. The site is http://www.cabinetparts.com/c/hinges/european-cabinet-hinges/standard-european-cabinet-hinges. I'm looking to find something like http://www.cabinetparts.com/c/hinges/european-cabinet-hinges/standard-european-cabinet-hinges#/Page=2... Page=3 etc. I was able to find this by inspecting other infinite scroll sites in Chrome, but this site continues to elude me. Thanks in advance!
I'm totally stumped on this and reaching out for help!
I'm using the Import.io crawler to extract reviews from TripAdvisor. However, when I am training the crawler, the "more" button is inactive.
Here's an example of the page: http://www.tripadvisor.co.uk/Hotel_Review-g295424-d306662-Reviews-Hilton_Dubai_Jumeirah_Resort-Dubai_Emirate_of_Dubai.html#REVIEWS
Here is the XPath to the review in full: //*[@id="UR288083139"]/div[2]/div/div[3]
And to the More button:
//*[@id="review_288083139"]/div[1]/div[2]/div/div/div[3]/p/span
Is it possible to write an XPath so that the full review is included in Import.io?
One way you can do this is by using a Crawler then an Extractor. This would split the process into two parts.
1. Create a crawler that you'd train to capture the links for every review on the page. Make sure that you select "link" for the column.
Sample review from the website
2. Create an Extractor to capture the full review from the links you got from the crawler.
Voila! You got all reviews!
Note: If you already have all the links for the pages you need the reviews from, it's better to make an Extractor instead of a Crawler. This way, you can chain the API to the other extractor. You'd only need a crawler if you don't know all the links.
Hope this helps!
It looks like the HTML is NOT on the page before you click that button, and there isn't a URL which has that data on it. So you may be out of luck.
You could try playing around with the developer console to see if you can find the full reviews buried in an XML file or dynamic URL somewhere. I'm not sure how, though.
I am trying to create a one-page WordPress website, something like the ones you sometimes see in ThemeForest's WP section: the whole website is a long page that has everything in one place, from about us, to portfolio, to some blog posts, to contacts.
Placing all things on one page is not difficult. But when I started thinking about how to present individual posts and pages, I realised that I probably need a general way of getting posts' data via AJAX and creating new blocks with JS. How should I go about this? I suppose this has been done before, but I struggle to find something this specific on the Codex or a tutorial with best practices.
Any advice or link will be greatly appreciated.
You could use a plugin such as jQuery EasyTabs (you can download it here), which has a built-in Ajax component.
I've found that the easiest way is to just get all content to load into the divs ahead of time, vs. trying to load all pages through Ajax. However, appending something like '?ajax/ajax' to the end of your URLs through the EasyTabs plugin is one option that I have successfully used in the past.
If you decide to use the EasyTabs functionality, there is ample documentation on the page that I linked to.
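If you'd rather pull the post data itself instead of preloading everything into divs, another option (not EasyTabs, and only a sketch) is the WordPress REST API, which is core since WP 4.7 and available as the REST API plugin before that. The '#posts' container below is a hypothetical element in your one-page template:

// Sketch: fetch recent posts over the WP REST API and append them as blocks.
// '#posts' is a placeholder container id in the one-page template.
fetch('/wp-json/wp/v2/posts?per_page=5')
  .then(function (response) { return response.json(); })
  .then(function (posts) {
    var container = document.querySelector('#posts');
    posts.forEach(function (post) {
      var block = document.createElement('article');
      block.innerHTML =
        '<h2>' + post.title.rendered + '</h2>' +
        post.content.rendered;
      container.appendChild(block);
    });
  })
  .catch(function (err) { console.error('Could not load posts', err); });

The same request works with jQuery's $.getJSON if you're already loading jQuery for EasyTabs.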
First time posting here and newbie at Google Apps. I am putting together a URL in a spreadsheet for a LinkedIn company, for example: http://www.linkedin.com/company/National-Renewable-Energy-Laboratory
Can I use =importXML from a Google spreadsheet plus XPath to get the website URL that is listed on each company page?
I have gotten to a point where I can extract all the hrefs from the page, and the link that I need is among them, but I just want the website URL.
Here is what I am using so far:
=importXML(R2, "//*[@href]")
Here is a link to my spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AheVK6uxf6AvdHhILTFrR1k4Wl9tWW5OVWpRRUJKMlE
The code is in S2
Appreciate your response.
//*[@href] matches elements that have href, not the href attributes themselves. Try //@href instead.
It's more complicated, but a good solution would be to use the LinkedIn API, which you can access using UrlFetchApp.
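As a rough sketch of what this could look like in Apps Script without the full API: the fetch call is standard UrlFetchApp, but the extraction regex is purely an assumption about the page markup, and LinkedIn may block or require login for such requests.

// Google Apps Script custom function, callable from the sheet as =COMPANY_WEBSITE(R2).
// UrlFetchApp.fetch() and custom functions are standard Apps Script; the extraction
// regex is a guess at the markup, not anything LinkedIn documents.
function COMPANY_WEBSITE(pageUrl) {
  var html = UrlFetchApp.fetch(pageUrl).getContentText();
  var links = html.match(/href="(https?:\/\/[^"]+)"/g) || [];
  for (var i = 0; i < links.length; i++) {
    var url = links[i].replace(/^href="|"$/g, '');
    if (url.indexOf('linkedin.com') === -1) {
      return url; // first external link; may or may not be the company website
    }
  }
  return 'not found';
}

If you go the API route instead, UrlFetchApp is still the right tool for the HTTP calls, but you'll also need to handle LinkedIn's OAuth flow.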
I found this problem all over the net but no answer yet, so maybe someone here has solved it?
I built a page relying heavily on jquery.address. It's got one index page and the rest loads dynamically via Ajax, following Google's /#!/ scheme for crawlable pages. Now I want to add Facebook's Like or Share button, but I can't get it to grab the actual page title or URL.
Whatever I do, it always falls back to the title and URL of the index page. I tried:
(obviously) changing the title and Open Graph meta on load of the new parts.
"linking" the crawler page (?_escaped_fragment_=xyx) but specifying the #! page in the meta.
"sharing" with a given title and URL.
I never get anything but a link to the index page, or a blank "share" to the right URL with the title and thumbnail ignored.
Has anyone got a similar setup working?
Thanks for any hints,
thomas
Facebook is actually using #! now and it works! If you build your site so that http://site.de/?_escaped_fragment_=something is identical to http://site.de/#!/something, all you have to do is "share" the #! URL and it'll display the info from the escaped-fragment page.
Use this URL to check: http://developers.facebook.com/tools/debug
But: A much cleaner solution to the problem can be found here: http://github.com/browserstate/history.js/wiki/Intelligent-State-Handling
My guess would be that Facebook's crawler doesn't run JavaScript and will always display whatever's actually in the page it gets from the server.
Facebook share has a BRUTAL cache, last time I checked it was impossible to change the title / description data once it was scraped :(
The issue I had was the og:url and the actual url of the page did not match. I also read a number of comments about the og data being just after the title element, but I don't think that solved anything.
With regard to issues of caching, it is true that Facebook's caching is "brutal", but it does not cache anything for the lint tool: http://developers.facebook.com/tools/debug.
I use no-hash-bang URLs when sharing links. I process the hard links and redirect them to a hash-bang URL client-side using JavaScript. That way, if a crawler goes to the hard-linked page, it will display the information just as it would if JavaScript were enabled.
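For what it's worth, that client-side redirect can be as small as the sketch below; it assumes the hard links live at plain paths (e.g. /Christmas/vs/Bacon) and that jquery.address handles the #! route once the browser lands there:

// Send JS-enabled browsers from the hard link to the #! equivalent.
// Crawlers (which don't run JS) stay on the fully rendered hard-linked page.
(function () {
  var path = window.location.pathname;
  var alreadyHashBang = window.location.hash.indexOf('#!') === 0;
  if (path !== '/' && !alreadyHashBang) {
    window.location.replace('/#!' + path);
  }
})();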
Compare:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Flikeapage.com%2F%23!%2FChristmas%2Fvs%2FBacon
and
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Flikeapage.com%2FChristmas%2Fvs%2FBacon
Hope this helps.