Say I wanted to create a Ruby script that would access Google, search for 'dogs', and then return the links of the top 5 results. How would this be implemented in Ruby?
Thanks.
To clarify, I'm not looking for a way to search Google specifically. I want this to work on other sites too, such as amazon.com, dictionary.com, etc.
See the answer to this question: Using Google Search REST API in Ruby
You could hack through it with Hpricot, if a Google API doesn't already exist. Or here is a script
I would use cURL to actually GET the page contents (it's a simple GET request to google.com/?q=stuff). Then you'll need to use regular expressions and intuition to parse the DOM, extract the links, and display :)
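A minimal sketch along those lines, using open-uri and Nokogiri instead of regexes or the now-unmaintained Hpricot. The URL, the User-Agent header, and the link filter are assumptions; Google's markup changes often and scraping it may violate its terms of service:

    # Fetch a search results page and print the first five outgoing links.
    require "open-uri"
    require "nokogiri"

    html = URI.open("https://www.google.com/search?q=dogs",
                    "User-Agent" => "Mozilla/5.0").read
    doc = Nokogiri::HTML(html)

    # Collect hrefs from all anchors, keep absolute links, take the top 5.
    links = doc.css("a")
               .map { |a| a["href"] }
               .compact
               .select { |href| href.start_with?("http") }
               .uniq
               .first(5)
    puts links

The same skeleton works for the other sites you mention (amazon.com, dictionary.com, etc.); only the URL and the link filter change.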
I have been struggling to get any XPath technique to work in Octoparse and similar software. I'm now trying Google Sheets after reading posts here and can't get it to work either.
Input: a SlideShare presentation URL (e.g. https://www.slideshare.net/carologic/ai-and-machine-learning-demystified-by-carol-smith-at-midwest-ux-2017)
Intended output: the SlideShare embed URL (in this case: https://www.slideshare.net/slideshow/embed_code/key/wZudqqTdctjWXA)
I think this would be the way to get the output using Google Sheets: =IMPORTXML(A1, "//meta[@itemprop='embedURL']/@content")
It is not working for me (failure to fetch the URL). With Octoparse etc. I just got a blank value.
I'm being daft here, no doubt. Any help would be useful.
It doesn't work because SlideShare is owned by LinkedIn, and they have put in a lot of effort to ensure they can't be scraped, including by Google Sheets. It used to be possible, but I believe they eventually caught on to the workaround.
I am building a search component that allows users to filter by type of response. You can see all responses, just the PDFs, or just the webpages. I have the first two parts down: all responses is a basic search, and you can filter for PDFs using &fileType=pdf in the query, but I'm not sure how to exclude the PDFs and only return web pages.
I can't find a similar "exclude" param such as -fileType, which seems to be supported in other similar APIs. Maybe I just need to format the URL the right way... If anyone has insight into how to accomplish something like this, I would appreciate it.
You can try adding -inurl:pdf to your query.
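For example, with the Custom Search JSON API you can fold the operator into the q parameter. A hedged sketch: YOUR_KEY and YOUR_CX are placeholders, and -filetype:pdf is an alternative operator worth trying if -inurl:pdf misses PDFs served without .pdf in the path:

    # Exclude PDFs by appending the operator to the search terms themselves.
    require "net/http"
    require "json"
    require "cgi"

    q = CGI.escape("dogs -inurl:pdf")
    url = URI("https://www.googleapis.com/customsearch/v1?key=YOUR_KEY&cx=YOUR_CX&q=#{q}")
    results = JSON.parse(Net::HTTP.get(url))
    (results["items"] || []).each { |item| puts item["link"] }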
I've got a web app which heavily uses AngularJS / AJAX and I'd like it to be crawlable by Google and other search engines. My understanding is that I need to do something special to make it work, as described here: https://developers.google.com/webmasters/ajax-crawling
Unfortunately, that looks quite nasty and I'd rather not introduce the hash tags. What I'd like to do is to serve a static page to Googlebot (based on the User-Agent), either directly or by sending it a 302 redirect. That way, the web app can be the same, and the whole Googlebot workaround is nicely isolated until it is no longer necessary.
My worry is that Google may mistakenly assume that I'm trying to trick Googlebot, while my goal is to help it. What do you guys think about this approach, and what would you recommend?
Recently I came upon this excellent post from yearofmoo, explaining in detail how to make your Angular app SEO friendly. In essence, when bots see a URI with a hash tag they will know it's an ajaxed page and will try to reach the same URI by replacing '#!' in your URI with '?_escaped_fragment_='. This alternative URI instructs bots that they should expect to find a definitive static version of the page they were accessing.
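For example, a crawler that encounters the first URL below will request the second and index whatever static HTML comes back (the /products/42 path is just an illustration):

    http://example.com/#!/products/42
    http://example.com/?_escaped_fragment_=/products/42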
Of course, to achieve this you'd have to introduce hash tags into your URIs. I don't see why you are trying to avoid them. Isn't Gmail using hash tags?
Yeah, unfortunately, if you want to be indexed you have to adhere to the scheme :( If you're running a Ruby app, there's a gem that implements the crawling scheme for any Rack app:
gem install google_ajax_crawler
A writeup of how to use it is at http://thecodeabode.blogspot.com.au/2013/03/backbonejs-and-seo-google-ajax-crawling.html, and the source code is at https://github.com/benkitzelman/google-ajax-crawler.
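If you'd rather see the mechanics than pull in the gem, here's a bare-bones hypothetical sketch of the scheme as Rack middleware. This is not the google_ajax_crawler API, and render_snapshot is a placeholder for whatever headless-browser rendering you'd wire in:

    # Intercept _escaped_fragment_ requests and serve a static snapshot;
    # everything else falls through to the Angular app.
    require "rack"

    class EscapedFragmentMiddleware
      def initialize(app)
        @app = app
      end

      def call(env)
        request = Rack::Request.new(env)
        fragment = request.params["_escaped_fragment_"]
        if fragment
          html = render_snapshot("#!#{fragment}")  # placeholder for your renderer
          [200, { "Content-Type" => "text/html" }, [html]]
        else
          @app.call(env)
        end
      end
    end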
Have a look at these links; they will give you a good direction:
1. Set up your own prerender service using Prerender.io's open source code:
https://prerender.io/
2. Use an existing service such as BromBone, Seo.js, or SEO4AJAX:
http://www.brombone.com/
http://getseojs.com/
http://www.seo4ajax.com/
3. Create your own service for rendering and serving snapshots to search engines. Read this article for the big picture:
http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io
As of May 2014, Googlebot executes JavaScript. Check Webmaster Tools to see how Google sees your site.
http://googlewebmastercentral.blogspot.no/2014/05/understanding-web-pages-better.html
Edit: Note that this does not mean other crawlers (Bing, Facebook, etc.) will execute JavaScript. You may still need to take additional steps to ensure that these crawlers can see your site.
I want to scrape a few websites, and many people suggested Scrapy. It is Python-based, and since I am very familiar with PHP, I looked for alternatives.
I found a crawler, PHPCrawl, but I am not sure if it is just a crawler or if it provides scraping facilities as well. If it can be used for scraping, does it support XPath or regular expressions?
How does it compare with Scrapy, which is written in Python?
Please suggest which is best to use for scraping websites.
Thanks
PHPCrawl is a pure crawler; it delivers found pages and their source code to users "as they are" (together with some context information). Therefore it's fast, it's able to use multiple processes, and it has tons of options to configure.
Can't say much about Scrapy, since I haven't used it so far.
Yes, of course.
But as I said, PHPCrawl delivers the page sources, and you have to extract the data you want from them yourself.
I'm trying to figure out how to access the Google Documents List API from Ruby.
I've looked at the google-api-ruby-client but that doesn't seem to support that particular API. I've also looked at the gdata-ruby-util client but that looks like it's out of date and no longer active.
It seems odd that there's no ruby client for accessing such a popular API, so can anyone help with a solution?
Here is a library that lets you read/write files. It also has methods to read/write spreadsheet cells.
https://github.com/gimite/google-drive-ruby
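A minimal sketch based on the gem's README. Here SPREADSHEET_KEY is a placeholder, config.json holds your OAuth credentials, and method names may differ between gem versions:

    require "google_drive"

    # Authenticate with OAuth credentials stored in config.json.
    session = GoogleDrive::Session.from_config("config.json")

    # Download a file by its title.
    file = session.file_by_title("my-document")
    file.download_to_file("/tmp/my-document.txt")

    # Read and write spreadsheet cells.
    ws = session.spreadsheet_by_key("SPREADSHEET_KEY").worksheets[0]
    puts ws[1, 1]
    ws[2, 1] = "hello"
    ws.save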
http://code.google.com/p/gdata-ruby-util/ is the correct library.
I would say it is more "stable" than "no longer active".