I have this website https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974 and I am trying to extract the href link behind 'View' under 'Full Profile'.
I'd like to know how to scrape this. I tried //dl[1]/dd[contains(a/text(),'View')]/@href but it didn't return any data.
I'd also like to get an expert opinion on what the most efficient way to scrape websites is: is it better to run importXML directly from Google Docs, or is there a better way of doing it using Scripts?
You are trying to query the <dd>'s @href attribute (which is not present there). Try
//dd/a[. = 'View']/@href
instead. Or, staying closer to your original expression:
//dl[1]/dd/a[contains(text(),'View')]/@href
Is it better to directly run importXML from Google Docs or is there a better way to doing it using Scripts?
Depends on how complex things will get. If you just want to read some tabular data, you're probably better off with plain Spreadsheets; if it's more complicated, writing your own script might be reasonable.
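If you do end up writing your own script, here is a minimal sketch of the same extraction outside of Spreadsheets; Ruby with open-uri and Nokogiri is just one option (assumed here because it is used elsewhere on this page), and the XPath is the corrected expression from above:
require 'open-uri'
require 'nokogiri'
# Fetch the profile page and parse it (assumes the page is publicly reachable without a login)
url = "https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974"
doc = Nokogiri::HTML(URI.open(url))
# Same corrected expression as above: the @href of the <a> whose text is 'View'
link = doc.at_xpath("//dd/a[. = 'View']/@href")
puts link.value if link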
I'm a beginner in shell scripting.
I'm trying to write a script in which one part involves reading a value from a webpage. In this case, the shell script tries to fetch the IMDB rating of a movie by going to the movie's IMDB page.
Can someone suggest how I can achieve this, and what topics I need to learn?
Thank you.
You can use wget or curl to get the page. Then you'll need to use regex or some other string manipulation to pull the information you need out of it. It would be a lot easier to use a library to do some of these things for you.
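If you do go the library route rather than pure shell, a minimal sketch in Ruby with Nokogiri (the library used in other answers on this page) could look like this; the movie URL is just an example, and the CSS selector for the rating is an assumption you would need to verify against the actual page source:
require 'open-uri'
require 'nokogiri'
# Example title page; any IMDB movie URL would do
url = "http://www.imdb.com/title/tt0111161/"
doc = Nokogiri::HTML(URI.open(url))
# Hypothetical selector -- inspect the actual page source to find where the rating lives
rating = doc.at_css('span[itemprop="ratingValue"]')
puts(rating ? rating.text : "rating not found")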
I would like to get a structured version of a Wikiquote page via JSON (basically, I need all the phrases).
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get all the HTML source code. I need each phrase as an element of an array.
How could I achieve that with DBpedia?
For one thing, I am not sure whether you can query Wikiquote using DBpedia; and secondly, DBpedia only gives you infobox data in a structured way, it does not give you the article content in any structured form. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying gives you text, so this makes things easier, but not completely.
Try this piece of code in your console:
require 'json'
require 'open-uri'
require 'nokogiri'
# Fetch the parsed page via the MediaWiki API and pull the rendered HTML out of the JSON response
content = JSON.parse(URI.open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*']
# Parse that HTML and collect the text of every list item (quotes are rendered as <ul>/<li>)
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer. Of course, this is not completely right, because you will get a lot of unnecessary data. But if you dig into Nokogiri and XPath and find out how to pinpoint the nodes you need, you can get a solution which will give you correct quotes at least 90% of the time.
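For example, one way to cut down on that noise, assuming the quotes live inside the article body rather than in navigation lists, is to narrow the XPath; the div class below is an assumption about the rendered markup, so check it against the HTML you actually get back:
# Restrict the query to list items inside the article body (the class name is an assumption)
quotes = xpath_data.xpath("//div[@class='mw-parser-output']//ul/li").map { |node| node.text.strip }
quotes.reject(&:empty?)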
Just change the format to JSON. Look up the MediaWiki API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'crawl web' operator in RapidMiner).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and extract the XPath matches from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them) since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need to have some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files.' I've looked at 'process documents from web' but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages locally using RapidMiner is a two-step process:
Step 1 Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan with the following difference:
Instead of the Crawl Web operator, use the Process Documents from Web operator. There will not be an option to specify the output directory, because the results will be loaded into the ExampleSet. The ExampleSet will contain links matching the crawling rules.
Step 2 Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html but only from 7:40 with the following difference:
Put the Extract Information subprocess inside the Process Documents from Web operator created previously.
The ExampleSet will contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)
So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly, btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html, and manually program the script to extract text between certain tags (e.g. h3, p, etc.)?
If I take that approach, then I will have to look at each individual source for each chapter/article and do that for each one. Kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that can tell the difference between JS and other code and the actual 'text', and dump just the text (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.
I did some extensive scraping for work at one point, and had to switch to Nokogiri because Hpricot would crash on some pages inexplicably.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
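As a rough sketch of the theory (fetch the table of contents, follow each article link, keep only the parts you want, and append them to one page), something like the following could be a starting point; the CSS selectors are assumptions about Blogger's markup and will need adjusting once you inspect the actual pages:
require 'open-uri'
require 'nokogiri'
# Fetch the table of contents and collect the article links (selector is an assumption)
index = Nokogiri::HTML(URI.open("http://boxerbiography.blogspot.com/"))
article_links = index.css("h3.post-title a").map { |a| a["href"] }.uniq
# Visit each article, keep only the heading and body, and append them to one HTML string
book = "<html><body>"
article_links.each do |link|
  page  = Nokogiri::HTML(URI.open(link))
  title = page.at_css("h3.post-title")   # assumed heading selector
  body  = page.at_css("div.post-body")   # assumed article-body selector
  next unless title && body
  book << "<h1>#{title.text.strip}</h1>" << body.inner_html
end
book << "</body></html>"
# One printable / Kindle-friendly page
File.write("book.html", book)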
I am looking to create an application that is effectively a very simple blog, with another article being added every few days. In order to control the appearance of the articles, I would like to write them directly in HTML so pictures, links, etc. can be put in appropriate places and formatted as required. However, storing these articles in a database would seem to make maintenance a lot easier and provide additional searching capabilities. Is it sensible to store such HTML / ERB code in a database in this way? If not, what are the alternatives?
The standard way to do this is to use a markup library, such as RedCloth. You use something similar even here on SO - it looks like Markdown, if you read the help link. The content can certainly be put in a database. It can be done with HTML and ERB, but the reason it is not often done is safety.
If you are the only one using it, it may not be an issue, but if you allow anyone else to insert data, you open yourself up to XSS attacks with HTML, or even code-execution exploits if you allowed raw ERB. Markup languages exist to limit the set of markup allowed and to remove the ability for scripting attacks.
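For illustration, a minimal sketch of that approach with RedCloth and a Rails-style model (the Article class and its body column are hypothetical names, not anything from your app):
require 'redcloth'
# Store the article body as Textile markup rather than raw HTML/ERB
# (Article and :body are hypothetical names for your model and column)
article = Article.create(
  title: "First post",
  body:  "h3. Hello\n\nSome *formatted* text and a \"link\":http://example.com"
)
# Render the stored markup to HTML at display time
html = RedCloth.new(article.body).to_html
# html now contains <h3>, <strong>, and <a> tags generated from the limited Textile syntax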
see also: Better ruby markdown interpreter?
Update: What great timing, there was a railscast released today about this: http://railscasts.com/episodes/272-markdown-with-redcarpet