How to read/search specific content in a webpage using shell scripting - bash

I'm a beginner in shell scripting.
I'm trying to write a script, part of which involves reading a value from a webpage. In this case, the script fetches the IMDB rating of a movie from the movie's IMDB page.
Can someone suggest how I can achieve this, and which topics I need to learn?
Thank you.

You can use wget or curl to fetch the page. Then you'll need a regex or some other string manipulation to pull the information you need out of the HTML. It would be a lot easier to use a library to do some of this for you.
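For example, here is a minimal sketch using curl and sed. It assumes the rating appears in the page source as "ratingValue":"8.8" (true of IMDB's JSON-LD markup at the time of writing, but this can change, which is why a proper parsing library is more robust):
#!/bin/bash
# Fetch an IMDB title page and extract the rating from its HTML source.
url="https://www.imdb.com/title/tt0133093/"    # example title page (The Matrix)

# -s silent, -L follow redirects, -A send a browser-like User-Agent
html=$(curl -s -L -A "Mozilla/5.0" "$url")

# Pull the first number that follows "ratingValue":" in the source.
rating=$(printf '%s' "$html" | sed -n 's/.*"ratingValue":"\([0-9.]*\)".*/\1/p' | head -n 1)

echo "IMDB rating: $rating"
Topics worth reading up on for this kind of task: curl/wget options, command substitution, pipes, and regular expressions with grep/sed.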

Related

sentiWordNet in rapidminer

I am trying to integrate SentiWordNet into RapidMiner using the Extract Sentiment operator. I cannot find a way to provide the dictionary input; in fact, even if I use the OpenWordnetDictionary operator I get a "Map failed" error.
Has anyone ever (successfully) performed the same operation, or do you know how I can make it work?
Thank you
There are some examples here. The basic trick is to put the SentiWordNet text file into the same folder as the WordNet dictionary.

How to Scrape a Website Using Google Spreadsheet?

I have this website https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974 and I am trying to extract the href link behind 'View' under 'Full Profile'.
I'd like to know how to scrape this. I tried //dl[1]/dd[contains(a/text(),'View')]/@href but it didn't return any data.
I'd also like to get an expert opinion on what the most efficient way to scrape websites is: is it better to run importXML directly from Google Docs, or is there a better way to do it using Scripts?
You are querying for an @href attribute on the <dd> element itself, which is not present. Try
//dd/a[. = 'View']/@href
instead. Or, staying closer to your original expression:
//dl[1]/dd/a[contains(text(),'View')]/@href
Is it better to directly run importXML from Google Docs or is there a better way to doing it using Scripts?
Depends on how complex things will get. If you just want to read some tabular data, you're probably better off with plain Spreadsheets; if it is more complicated, writing your own script might be reasonable.
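For example, in a Google Sheets cell (assuming the profile page is publicly readable, since IMPORTXML cannot log in for you):
=IMPORTXML("https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974", "//dd/a[. = 'View']/@href")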

Can RapidMiner extract XPath matches from a list of URLs, instead of first saving the HTML pages?

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'crawl web' operator in RapidMiner).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler also lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them) since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files'. I've looked at 'process documents from web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages locally is a two-step process in RapidMiner:
Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference: instead of the Crawl Web operator, use the Process Documents from Web operator. There will not be an option to specify the output directory, because the results will be loaded into the ExampleSet. The ExampleSet will contain the links matching the crawling rules.
Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 onward, with the following difference: put the Extract Information subprocess inside the Process Documents from Web operator created previously. The ExampleSet will contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)

Retrieving data from a PDF using Perl

I would like to know how to retrieve data from a PDF using Perl. I have used PDF::API2, but I'm looking for something other than that. I am expecting output like a PDF-to-DOC conversion. I would appreciate it if anyone could help me.
Thanks!
Writing a general PDF converter is not a simple task. There are at least two modules on CPAN which can help:
CAM::PDF
PDF::API2
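For plain text extraction, a minimal sketch with CAM::PDF might look like this (it pulls out raw text only; layout, fonts and images are lost, so it is nothing like a full PDF-to-DOC conversion):
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Print the text of every page of the given PDF to STDOUT.
my $file = shift or die "Usage: $0 file.pdf\n";
my $pdf  = CAM::PDF->new($file) or die "Cannot open $file\n";

for my $page (1 .. $pdf->numPages()) {
    print $pdf->getPageText($page), "\n";
}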

Ruby XML Parsing with Nokogiri/XPath

I have a Shopify store for which I want to automatically update the product variants' inventory levels, using a live XML feed from the wholesaler I use.
I'm learning to program (Ruby) and this is my first project, but after researching here is how I think it should work.
Use Ruby/Nokogiri to parse the XML feed from the wholesaler, then use XPath to locate both the unique product variant SKU code and the stock level.
Somehow I need to use this SKU to refer back to my Shopify store's product XML list and pull out the variant's unique ID using the SKU code.
Then use something like the builder gem to build the XML format that Shopify needs, and then use curl to PUT the changes. I'm guessing I loop this process for every product?
I know Shopify only has a 300-call limit, so I've got the article on putting a delay in the script, but I get the feeling the above method isn't the easiest way to go about this?
With Shopify you need to apply the variant stock level update against each variant's unique XML file, so I need to build the unique XML file/code and PUT it against /admin/variants/#{thevariantid}.xml
I'm looking forward to trying to put this together and learning in the process, but am I on the right track with this? Are there simpler gems I should be looking at?
N.B. I've only recently started learning Ruby, and will head to Rails afterwards. I know a bit about XML and its structure, so I should be OK finding what I need with XPath.
You’re on the right track, but I’d use the shopify_api gem to do the talking to Shopify instead of having to form the XML and URIs yourself: https://github.com/Shopify/shopify_api
There’s an article on our wiki that might also help you out with regards to the API call limit but just let me know if you need more space – we’re pretty flexible and the limit is really just there to keep scripts from going wild and affecting service for everyone else.
Your proposed path seems good, except that there's no need to use the 'builder' gem, as Nokogiri has some very nice XML-building built into it.
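As a starting point for step 1, a minimal Nokogiri sketch for reading the wholesaler feed might look like this (the feed URL and the element names <product>, <sku> and <stock> are hypothetical; substitute whatever your wholesaler's XML actually uses):
require 'open-uri'
require 'nokogiri'

# Parse the wholesaler's live XML feed.
feed = Nokogiri::XML(URI.open('https://example.com/wholesaler_feed.xml'))

# Build a SKU => stock level lookup table with XPath.
stock_by_sku = {}
feed.xpath('//product').each do |product|
  sku   = product.at_xpath('sku').text
  stock = product.at_xpath('stock').text.to_i
  stock_by_sku[sku] = stock
end

# stock_by_sku can then be matched against your Shopify variants
# (for example via the shopify_api gem) and each variant updated in turn.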
