I'm totally stumped on this and reaching out for help!
I'm using the Import.io crawler to extract reviews from TripAdvisor. However, when I am training the crawler, the "more" button is inactive.
Here's an example of the page: http://www.tripadvisor.co.uk/Hotel_Review-g295424-d306662-Reviews-Hilton_Dubai_Jumeirah_Resort-Dubai_Emirate_of_Dubai.html#REVIEWS
Here is the XPath to the review in full: //*[@id="UR288083139"]/div[2]/div/div[3]
And to the More button:
//*[@id="review_288083139"]/div[1]/div[2]/div/div/div[3]/p/span
Is it possible to write an XPath so that the full review is included in Import.io?
One way you can do this is by using a Crawler then an Extractor. This would split the process into two parts.
Create a Crawler and train it to capture the link to every review on the page. Make sure that you select "link" as the column type.
Create an Extractor to capture the full review from the links you got from the crawler.
Voilà! You've got all the reviews!
Note: If you already have all the links for the pages you need the reviews from, it is better to create an Extractor instead of a Crawler. This way, you can chain the API to the other Extractor. You only need a Crawler if you don't know all the links.
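The two-stage idea can also be sketched outside Import.io. The Python below is only an illustration, not Import.io's API: the markup, the `review-link` and `full-review` class names, the URLs, and the `FAKE_PAGES` dictionary that stands in for real HTTP fetches are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for HTTP fetches: URL -> well-formed markup.
FAKE_PAGES = {
    "/hotel": (
        "<div>"
        "<a class='review-link' href='/review/1'>Great stay...</a>"
        "<a class='review-link' href='/review/2'>Too noisy...</a>"
        "<a class='other' href='/about'>About</a>"
        "</div>"
    ),
    "/review/1": "<div><div class='full-review'>Great stay, friendly staff.</div></div>",
    "/review/2": "<div><div class='full-review'>Too noisy at night.</div></div>",
}

def collect_review_links(listing_markup):
    """Stage 1 (the "crawler"): gather the link to every review."""
    root = ET.fromstring(listing_markup)
    return [a.get("href") for a in root.iter("a")
            if a.get("class") == "review-link"]

def extract_full_review(review_markup):
    """Stage 2 (the "extractor"): pull the full text from one review page."""
    node = ET.fromstring(review_markup).find(".//div[@class='full-review']")
    return "".join(node.itertext()).strip() if node is not None else None

links = collect_review_links(FAKE_PAGES["/hotel"])
reviews = [extract_full_review(FAKE_PAGES[link]) for link in links]
```

The point of the split is that stage 2 never has to click "more": it reads the dedicated review page, where the full text is already in the markup.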
Hope this helps!
It looks like the HTML is NOT on the page before you click that button, and there isn't a URL which has that data on it. So you may be out of luck.
You could try playing around with the developer console to see if you can find the full reviews buried in an XML file or dynamic URL somewhere. I'm not sure how, though.
I am trying to scrape a few company websites with Octoparse. I can't seem to get my XPath right for pagination. The website pages do not have a 'Next' button. I am trying to scrape data from each page.
Any suggestions?
I have tried the following XPath (along with a few other failures):
//*[@id="main"]/div[2]/section/div[1]/nav/ul/li[1]/a/following-sibling::li[1]/a
Here is an example of a company website I am testing it on.
You need the page that follows the current page. This is quite easy with following-sibling:
//li[./a[@class="current"]]/following-sibling::li[1]
You can read about this here
Answering my own question: I modified Redyukov Pavel's solution, and the following worked:
//a[@class='current']/../following-sibling::li[1]/a[1]
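For readers who want to check the logic outside Octoparse, here is a minimal Python sketch of what `//a[@class='current']/../following-sibling::li[1]/a[1]` selects. It uses only the standard library, whose ElementTree does not support the `following-sibling` axis, so the sibling step is done by hand. The `SAMPLE_NAV` markup and its URLs are hypothetical, and real pages would need a tolerant HTML parser rather than an XML one.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup with no "Next" button.
SAMPLE_NAV = """
<nav><ul>
  <li><a href="/page/1">1</a></li>
  <li><a class="current" href="/page/2">2</a></li>
  <li><a href="/page/3">3</a></li>
  <li><a href="/page/4">4</a></li>
</ul></nav>
"""

def next_page_href(markup):
    """Return the href of the page link that follows the current one."""
    root = ET.fromstring(markup)
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            link = child.find("a")
            if link is None or link.get("class") != "current":
                continue
            # Equivalent of following-sibling::li[1]/a[1]
            for sibling in children[index + 1:]:
                if sibling.tag == "li":
                    next_link = sibling.find("a")
                    return None if next_link is None else next_link.get("href")
            return None  # the current page is the last one
    return None  # no link marked as current

print(next_page_href(SAMPLE_NAV))
```

Returning `None` on the last page gives the pagination loop a natural stopping condition.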
I am writing to seek help to display custom results in a SEF URL on Joomla CMS.
Example: This is a page with a customized search, https://jobwalkins.in/search.html?search=IT&exf_5=1&exf_4=-1&option=com_jomclassifieds&view=search&Itemid=147
I would like to display this link as https://jobwalkins.in/today-walkins-in-hyderabad.html
I am using https://extensions.joomla.org/extension/jom-classifieds/ as the extension.
Any helpful inputs will be greatly appreciated. I am looking forward to hearing from you soon.
Best Regards,
Syed H
I was able to get the desired output using https://extensions.joomla.org/extension/sh404sef/. The website in question https://jobwalkins.in/ now shows the predefined search results in custom URLs.
Here are a few of them which I was able to achieve:
https://jobwalkins.in/jobs-in-bangalore.html where the actual link was https://jobwalkins.in/search.html?search=&exf_5=2&exf_4=-1&option=com_jomclassifieds&view=search&Itemid=147
https://jobwalkins.in/today-walkins-in-hyderabad.html where the actual link was https://jobwalkins.in/search.html?search=&exf_5=1&exf_4=-1&option=com_jomclassifieds&view=search&Itemid=147
It even works for links where a keyword is searched, for example:
I searched for the keyword "fresher" and set the page to render at the custom URL https://jobwalkins.in/fresher-jobs.html where the actual link was https://jobwalkins.in/search.html?search=fresher&exf_5=-1&exf_4=-1&option=com_jomclassifieds&view=search&Itemid=147
The sh404SEF https://extensions.joomla.org/extension/sh404sef/ worked great and helped me address my concern very well.
Hope this post is useful for someone who may have a similar issue.
First time posting here, and I'm a newbie at Google Apps. I am putting together a URL in a spreadsheet for each LinkedIn company, for example: http://www.linkedin.com/company/National-Renewable-Energy-Laboratory
Can I use =importXML from a Google spreadsheet plus XPath to get the website URL that is listed on each company page?
I have gotten to a point where I can extract all the hrefs from the page, and the link that I need is among them, but I just want the website URL.
Here is what I am using so far:
=importXML(R2, "//*[@href]")
Here is a link to my spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AheVK6uxf6AvdHhILTFrR1k4Wl9tWW5OVWpRRUJKMlE
The code is in S2
Appreciate your response.
//*[@href] matches elements that have an href, not the href attributes themselves. Try //@href instead.
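The difference between the two expressions can be seen in a small Python sketch. The `SAMPLE` markup is hypothetical and is not LinkedIn's real page structure. The standard library's ElementTree can only select elements, so `.//*[@href]` returns nodes and the attribute values have to be read off them separately, which is what `//@href` does in one step in a full XPath engine such as the one behind importXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified company page.
SAMPLE = """<html><body>
<a href="/company/example">Profile</a>
<link href="/static/site.css" />
<a href="http://www.example.org">Website</a>
<p>No link here</p>
</body></html>"""

root = ET.fromstring(SAMPLE)

# //*[@href]: the ELEMENTS that carry an href attribute.
elements_with_href = root.findall(".//*[@href]")

# //@href: the attribute VALUES themselves.
href_values = [element.get("href") for element in elements_with_href]

# Assumption for illustration: the company website is the only absolute link.
external_links = [href for href in href_values if href.startswith("http")]

print([element.tag for element in elements_with_href])
print(href_values)
print(external_links)
```

On a real page the final filter would need a stricter rule, since many other absolute links are likely to be present.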
It's more complicated, but a good solution would be to use the LinkedIn API, which you can access using UrlFetchApp.
I have built a site based on Ajax navigation.
I built it so that whenever someone without JavaScript visits my site, the nav links, which usually load content via Ajax, act like normal links and the user can browse through the pages as usual.
Since Googlebot doesn't run JavaScript, it should theoretically be able to go through all links and corresponding pages as usual, right? They are valid links with the href attribute pointing to the corresponding page.
Now I was wondering whether that's sufficient, or whether I also need to implement this method from Google to make sure Google sees all my content?
Thanks for your insights and excuse my poor English!
If you can navigate your site by viewing the source (Ctrl+U in Chrome), Google can also crawl your site. Yes, it's that simple.
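The "view source" test can be automated with a rough Python sketch that lists the links visible in the raw HTML, without running any script. The `RAW_HTML` sample, the `ajax-nav` class, and the `loadPage` function are hypothetical, and this only approximates what a crawler that ignores JavaScript would see.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip empty and fragment-only hrefs: they lead to no new page.
            if href and not href.startswith("#"):
                self.links.append(href)

def links_without_javascript(raw_html):
    collector = LinkCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.links

# Hypothetical page: two real links, one JS-only link, one script-injected link.
RAW_HTML = """
<nav>
  <a href="/about.html" class="ajax-nav">About</a>
  <a href="/contact.html" class="ajax-nav">Contact</a>
  <a href="#" onclick="loadPage('news')">News</a>
</nav>
<script>document.write('<a href="/hidden.html">Hidden</a>');</script>
"""

print(links_without_javascript(RAW_HTML))
```

Links that only exist after a script runs, or that have no real href, do not show up in the result, which is the situation the fallback links in the question are meant to avoid.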
I found this problem all over the net but no answer yet, so maybe here someone solved it ...?
I built a page relying heavily on jquery.address. It's got one index page, and the rest loads dynamically via Ajax following Google's /#!/ scheme for crawlable pages. Now I want to add Facebook's Like or Share button, but I can't get it to grab the actual page title or URL.
Whatever I do, it always falls back to the title and URL of the index page. I tried:
(obviously) changing the title and Open Graph meta tags on load of the new parts.
"linking" the crawler page (?_escaped_fragment_=xyx) but specifying the #! page in meta
"sharing" with a given title and url.
I never get anything but a link to the index page or a blank "share" to the right url with title and thumbnail ignored.
Has anyone got a similar setup working?
Thanks for any hints,
thomas
Facebook is actually using #! now and it works! If you build your site so that http://site.de/?_escaped_fragment_=/something is identical to http://site.de/#!/something, all you have to do is "share" the #! URL and it'll display the info from the escaped fragment page.
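As a rough illustration of the mapping between the two URL forms, here is a Python sketch that rewrites a #! URL into its `_escaped_fragment_` counterpart. The function name is made up for this example, and the percent-encoding is an assumption: it keeps only `/` and `=` unescaped, which is more conservative than the crawling scheme strictly requires.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_escaped_fragment_url(hashbang_url):
    """Rewrite http://host/#!state as http://host/?_escaped_fragment_=state."""
    parts = urlsplit(hashbang_url)
    if not parts.fragment.startswith("!"):
        return hashbang_url  # not a #! URL, nothing to rewrite
    # Everything after "#!" becomes the value of the query parameter.
    value = quote(parts.fragment[1:], safe="/=")
    extra = "_escaped_fragment_=" + value
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(to_escaped_fragment_url("http://site.de/#!/something"))
```

The server then has to answer the rewritten URL with the same title and Open Graph tags that the Ajax state would show.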
Use this URL to check: http://developers.facebook.com/tools/debug
But: A much cleaner solution to the problem can be found here: http://github.com/browserstate/history.js/wiki/Intelligent-State-Handling
My guess would be that Facebook's crawler doesn't run JavaScript and will always display whatever's actually in the page it gets from the server.
Facebook share has a BRUTAL cache, last time I checked it was impossible to change the title / description data once it was scraped :(
The issue I had was the og:url and the actual url of the page did not match. I also read a number of comments about the og data being just after the title element, but I don't think that solved anything.
With regard to issues of caching, it is true that Facebook's caching is "brutal", but it does not cache anything for the lint tool: http://developers.facebook.com/tools/debug.
I use URLs without a hash bang when sharing links. I process the hard links and redirect them to a hash-bang URL client side using JavaScript. That way, if a crawler goes to the hard-linked page, it will display the information just as it would if JavaScript were enabled.
Compare:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Flikeapage.com%2F%23!%2FChristmas%2Fvs%2FBacon
and
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Flikeapage.com%2FChristmas%2Fvs%2FBacon
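The two debugger links above differ only in the percent-encoded page URL passed as `q`. A small Python sketch of how they can be built, assuming the debugger endpoint shown in those links and treating `debugger_url` as a made-up helper name:

```python
from urllib.parse import quote

# Endpoint copied from the two links above.
DEBUGGER = "http://developers.facebook.com/tools/debug/og/object?q="

def debugger_url(page_url):
    """Percent-encode the page URL and append it as the q parameter."""
    # "!" is left as-is so the output matches the links above.
    return DEBUGGER + quote(page_url, safe="!")

print(debugger_url("http://likeapage.com/#!/Christmas/vs/Bacon"))
print(debugger_url("http://likeapage.com/Christmas/vs/Bacon"))
```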
Hope this helps.