ruby + save web page - ruby

To save the HTML of a web page using Ruby, it's very easy.
One way to do is by using rio:
require 'rubygems'
require 'rio'
rio('http://www.google.com') > rio('google.html')
Is it possible to do the same for by parsing the html, requesting again the different images, javascript, css and then save each of them?
I think it is not very efficient.
So, is there a way to save a web page + all the images, css, and javascript that are related to that page, and all this automatically?

what about system("wget -r -l 1 http://google.com")

Most time we can use the system's tools. Like dimus said, you can use the wget to download page.
And there are many useful api for solving the Net problem. Such as net/ftp, net/http or net/https.
You can see the document for detail.
Net/HTTP
.But these methods only get the response, what we need do more is parsing the HTML document. Even more using the mozilla's lib is a good way.

url = "docs.zillabyte.com"
output_dir = "/tmp/crawl"
# -E = adjust malformed extensions (e.g. /some_image/ -> /some_image.gif)
# -H = span hosts (e.g. include assets from other domains)
# -p = download all assets associated with the page
# -P = output prefix (a.k.a the directory to dump the assets)
system("wget -E -H -p '#{url}' -P '#{output_dir}'")
# read files from 'output_dir'

Related

How to modify documentation built with sphinx via a script in readthedocs build

I am trying to run a script to modify the documentation built by sphinx hosted by Read the Docs (because some links are not properly handled). The script works when I try to build it locally, but either fails on the Read the Docs build or the changes do not get propagated to the web site.
The script I'm trying to run is super simple, it replaces some html links that are not properly converted by sphinx-markdown-tables:
#!/bin/bash
# fix_table_links.sh
FILE="_build/html/api_reference.html"
if [[ "$1" != "" ]]; then
FILE="$1"
fi
sed -E 's/a href="(.*)\.md"/a href="\1\.html"/g' -i ${FILE}
My readthedocs.yml looks like this:
# Required
version: 2
# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py
# Optionally build your docs in additional formats such as PDF and ePub
formats: all
# Optionally set the version of Python and requirements required to build your docs
python:
install:
- requirements: docs/requirements.readthedocs.txt
build:
os: ubuntu-20.04
tools:
python: "3.8"
jobs:
post_build:
- echo "Running post-build commands."
- bash docs/fix_table_links.sh _readthedocs/html/api_reference.html
There are two cases:
Case 1) Using the readthedocs.yml as above, the build fails because _readthedocs/html/api_reference.html does not exist, despite this directory being the place the documentation claims will get uploaded from here. An example failure of this run is here.
Case 2) If I change the final of readthedocs.yml to bash docs/fix_table_links.sh docs/_build/html/api_reference.html, then the build passes (example here). But the links are not updated on the Read the Docs site: they still point to markdown pages rather than their corresponding HTML pages, so it must not be the version that gets uploaded to the Read the Docs web site.
Wading through documentation, I can't figure out how do this. Has anybody done this before or have a better grasp on how Read the Docs builds work? Thanks!
If you're willing to rewrite the script as a Python function, then you can do this super easily by connecting it as an event handler for the build-finished event.
I've done something similar in one of my own repos, except I post-process a .rst file. It's not actually used on RTD, but I can see it works in the build logs. So it should work to post-process your HTML files as well, since the build-finished event would occur after they've been generated.
First, define the script as a function in your conf.py. It needs to have app and exception as parameters.
def replace_html_links(app, exception):
with open(FILE, 'r') as f:
html = f.read()
# stuff to edit and save the html
Then either define or add to your setup function:
def setup(app):
app.connect('build-finished', replace_html_links)
And that's it!

How to extract the source of a webpage without tags using bash?

We can download the source of the page using wget or curl , but I want to extract the source of the page without tags.
I mean extract it as text.
You can pipe to a simple sed command :
curl www.gnu.org | sed 's/<\/*[^>]*>//g'
Using Curl, Wget and Apache Tika Server (locally) you can parse HTML into simple text directly from the command line.
First, you have to download the tika-server jar from the Apache site:
https://tika.apache.org/download.html
Then, run it as a local server:
$ java -jar tika-server-1.12.jar
After that, you can start parsing text using the following url:
http://localhost:9998/tika
Now, to parse the HTML of webpage into simple text:
$ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika
That should return the webpage text without tags.
This way you're using wget to download and save your desired webpage to "test.html" and then you use curl to send a request to the tika server in order to extract the text. Notice that it's necessary to send the header "Accept: text/plain" because tika can return several formats, not just plain text.
Create a Ruby script that uses Nokogiri to parse the HTML:
require 'nokogiri'
require 'open-uri'
html = Nokogiri::HTML(open 'https://stackoverflow.com/questions/6129357')
text = html.at('body').inner_text
puts text
Source
It would probably be simple to do with Javascript or Python if you're more comfortable with that, or search for a html-to-text utility. I imagine it would be very difficult to do this purely in bash.
See also: bash command to covert html page to a text file

How can I determine what the current stable version of Ruby is?

I want to write a Ruby method that does two things:
Determine what the current stable version of Ruby is. My first thought is to get the response from https://www.ruby-lang.org/en/downloads/ and use RegEx to isolate the phrase The current stable version is [x]. Is there is an API I'm not aware of?
Get the URL to download the .tar.gz of that release. For this I was thinking the same thing, get it from the output of the site URL.
I'm looking for advice about the best way to go about it, or direction if there's something in place I might use to determine my desired results.
Ruby code to fetch the download page, then parse the current version and the link URL:
html = Net::HTTP.get(URI("https://www.ruby-lang.org/en/downloads/"))
vers = html[/http.*ruby-(.*).tar.gz/,1]
link = html[/http.*ruby-.*.tar.gz/]
GitHub code: ruby-stable-version.rb
Shell code:
ruby-stable-version
If you are using rbenv you can use ruby-build to get a list of ruby versions and then grep against that.
ruby-build --definitions | tail -r | grep -x -G -m 1 '[0-9]\.[0-9].[0-9]\-*[p0-9*]*'
You can then use that within your code like so:
version = `ruby-build --definitions | tail -r | grep -x -G -m 1 '[0-9]\.[0-9].[0-9]\-*[p0-9*]*'`.strip
You can then use this value to get the download URL.
url = "http://cache.ruby-lang.org/pub/ruby/#{version[0..2]}/ruby-#{version}.tar.gz"
And then download the file:
require 'open-uri'
open("ruby-#{version}.tar.gz", 'wb') do |file|
file << open(url).read
end
Learn more about rbenv here and ruby-build here.
Another possibility would be to use the Ruby source repository. Check version.h in every branch, filter by RUBY_PATCHLEVEL > -1 (-1 is used for -dev versions), sort by RUBY_VERSION and take the latest one.
You can use:
Ruby's built-in OpenURI, and Nokogiri, to read a page, parse it, search for certain tags, extract a parameter such as a "src" or "href".
OpenURI to read the URL, or curl or wget at the command-line to retrieve the file.
Nokogiri's tutorials including showing how to use OpenURI to retrieve the page and hand it off to Nokogiri.
OpenURI's docs show how to "open" URLs and retrieve their content using read. Once you've done that, the data will be easy to save to disk using something like this for text files:
File.write('some_file', open('http://www.example.com/').read)
or for binary:
File.open('some_file', 'wb') { |fo| fo.write(open('http://www.example.com/').read) }
There are examples of using both Nokogiri and OpenURI for this all over Stack Overflow.

Open a local file with open-uri

I am doing data scraping with Ruby and Nokogiri. Is it possible to download and parse a local file in my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives error as:
No such file or directory # rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html
If you want to parse a local file with Nokogiri you can do it like this.
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)
When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.
This is how I do it as according to the documentation.
f = File.open("//home/nav/Desktop/Scraping/scrap1.html")
doc = Nokogiri::HTML(f)
f.close
I would make use of Mechanize and save the file locally, then parse it with Nokogiri like so:
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
file.save!("#{Rails.root}/tmp/")
# Read the file
page = Nokogiri::HTML::Reader(File.open(file))
Hope that helps!

Explain to me a Sinatra setup for dummy's for a preview of html/css code

A word of warning up front: I do not know even the ruby basics, but I'm trying to learn more and more of the world of shell scripting this year.
I saw this Vimeo video of Ben Schwarz and immediately thought that I'd like to use such a tool to debug my sass and haml files.
So this is a call to help me to grasp the concept of Sinatra.
What I want is a simple way to output the code of my index.html to check if all the haml magic was applied correctly - so it should function as a source viewer that gives me live updates. I'd prefer it if Sinatra simply looked at the files that LiveReload already rendered (c.f. index.html) in my project folder.
Update: This is a screenshot of the Vimeo Video. It shows just the raw CSS output of a Sass file. This is what I'd like to have for my Haml and Sass code, or better for the output files that are already rendered by LiveReload as HTML and CSS.
I looked at the source file from #benschwarz at his github, but I wasn't even with his own example I'm only getting the standard: "Sinatra doesn’t know this ditty." So transferring this to work with html is still out of my reach.
What I did so far:
I setup my project as usual in ~/Sites/projectname
I setup RVM and install all the gems I need
Sass, Compass, Haml - the output gets compiled via LiveReload
Sinatra
I created myapp.rb in ~/Sites/projectname with the following content:
# myapp.rb
require 'sinatra'
set :public_folder, '/'
get '/' do
File.read(File.join('public', 'index.html'))
end
Whatever, I fired up Sinatra and checked http://localhost:4567/ – this didn't work because I do not know how to set the public_folder to ~/Sites/projectname.
Afterthoughts:
So I went on to search the net, but my limited knowledge of Ruby put my attempt of an successful research to an immediate halt.
Here are some sites I stumpled upon which are obvioulsy close to the solution I need, but… like I told you in the first sentence: if the solution was a book, I'd need the "For Dummies" version.
https://bitbucket.org/sulab/genelist_store/src/30fc0ba390b9/idea8/idea8.rb
Serving static files with Sinatra
http://www.sinatrarb.com/intro
Obvioulsy the reference documentation of Sinatra would help me, but I don't speak the language, hence, I don't get the lingo.
About public folder:
The public_folder is relative to your app file myapp.rb. If you have a public folder inside the projectname folder, this is your public folder. If you have your css, js and image files in another folder, say, includes under project_name, then you need to change the line:
# Actually, you need to remove the line above from myapp.rb as it is.
# The line means that the public folder which is used to have css, js and
# image files under '/' and that means that even myapp.rb is visible to everyone.
set :public_folder, '/'
# to:
set :public_folder, File.dirname(__FILE__) + '/includes'
And that will serve up css, js and/or image files from the project_name/includes folder instead of project_name/public folder.
Reading the html file:
Reading the html files does not depend on the public folder settings. These need not be inside the public folder.
get '/' do
File.read(File.dirname(__FILE__) + '/index.html')
# This says that the app should read the index.html
# Assuming that both myapp.rb and index.html are in the same folder.
# incase the html files are inside a different directory, say, html,
# change that line to:
# File.read(File.dirname(__FILE__) + '/html/index.html')
# Directory structure sample:
# project_name
# | - myapp.rb
# | - index.html (and not html/index.html etc.)
# | /public (or includes incase the css, js assets have a different location)
# | | /css
# | | /js
# | | /images
end
To get the html output inside the browser
After the file is read, typically, this will be your string: "<html><head></head><body></body></html>"
Without escaping the string, the browser renders the html string as html (no pun) and that's why you won't see any text. To escape the html, you can use the CGI class provided by Ruby (hat tip). So, in the end, this will be your snippet:
get '/' do
CGI::escapeHTML(File.read(File.dirname(__FILE__) + 'index.html'))
end
But that will spit out the html file in a single line. To clean it up,
# myapp.rb
get '/' do
#raw_html = CGI::escapeHTML(File.read(File.dirname(__FILE__) + 'index.html'))
end
# Using inline templates to keep things simple.
# get '/' do...end gets the index.erb file and hence, in the inline template,
# we need to use the ## index representation. If we say get '/home' do...end,
# then the inline template will come under ## home. All the html/erb between
# two "##"s will be rendered as one template (also called as view).
# The <%= #raw_html %>spews out the entire html string read inside the "get" block
__END__
## index
<html>
<head></head>
<body>
<pre>
<%= #raw_html %>
</pre>
</body>
</html>
If you're trying to render an index.html file, I would try storing it in the /views directory with an .erb extension. Or use an inline template. Here is a great resource

Resources