How to get all resources of a page in ruby - ruby

There are many HTTP request tools in Ruby: httparty, rest-client, etc. But most of them fetch only the page itself. Is there a tool that gets the HTML, JavaScript, CSS and images of a page just like a browser does?

Anemone comes to mind, but it's not designed to do a single page. It's capable if you have the time to set it up though.
It's not hard to retrieve the content of a page using something like Nokogiri, which is an HTML parser. You can iterate over the tags that are of interest, grab their "SRC" or "HREF" attributes and request those files, storing their content on disk.
A simple, untested and written-on-the-fly, example using Nokogiri and OpenURI would be:
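require 'nokogiri'
require 'open-uri'

base = 'http://www.example.com'
html = URI.open(base).read # Kernel#open no longer opens URLs on Ruby 3+
File.write('www.example.com.html', html)
page = Nokogiri::HTML(html)
page.search('img[src]').each do |img|
  url = URI.join(base, img['src']) # resolve relative src values against the page URL
  # save under the file's base name, since the raw src is not a usable local path
  File.binwrite(File.basename(url.path), URI.open(url.to_s).read)
end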
Getting CSS and JavaScript is a bit more difficult because you have to determine whether they are embedded in the page or are external resources that need to be retrieved from their sources.
Merely downloading the HTML and content is easy. Creating a version of the page that is stand-alone and reads the content from your local cache is much more difficult. You have to rewrite all the "SRC" and "HREF" parameters to point to the file on your disk.
If you want to be able to locally cache a site, it's even worse, because you have to re-jigger all the anchors and links in the pages to point to the local cache. In addition you have to write a full site spider which is smart enough to stay within a site, not follow redundant links, obey a site's ROBOTS file, and not consume all your, or their, bandwidth and get you banned, or sued.
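The "stay within a site, don't follow redundant links" part can be sketched with the standard library alone. This is only the link filter, under the assumption that "same site" means "same host"; robots handling and rate limiting are not shown, and `links_to_follow` is a made-up name.

```ruby
require 'uri'
require 'set'

# Resolve each href against the page URL, keep only links on the same host,
# and skip anything already seen. Fragments are dropped so that
# /c.html and /c.html#top count as one page.
def links_to_follow(page_url, hrefs, seen = Set.new)
  base = URI.parse(page_url)
  hrefs.filter_map do |href|
    target = URI.join(page_url, href)
    target.fragment = nil
    next unless target.host == base.host # off-site, mailto:, javascript: ...
    next unless seen.add?(target.to_s)   # add? returns nil for duplicates
    target.to_s
  end
end

hrefs = ['b.html', '/c.html#top', 'http://other.example.org/x', '/c.html']
puts links_to_follow('http://www.example.com/a/', hrefs)
# prints:
# http://www.example.com/a/b.html
# http://www.example.com/c.html
```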
As the task grows you also have to consider how you are going to organize all the files. Storing one page's resources in one folder is sloppy, but the easy way to do it. Storing resources for two pages in one folder becomes a problem because you can have filename collisions for different images or scripts or CSS. At that point you have to use multiple folders, or switch to using a database to track the locations of the resources, and rename them with unique identifiers, and rewrite those back to your saved HTML, or write an app that can resolve those requests and return the correct content.
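One possible way to get unique identifiers is to derive the local file name from a hash of the full URL, so two different `logo.png` files never collide. This is a sketch, not the only scheme; `local_name_for` is a hypothetical helper and truncating the digest to 16 hex characters is an arbitrary choice.

```ruby
require 'digest'
require 'uri'

# Build a collision-resistant local file name from the resource URL,
# keeping the original extension so the file is still recognizable.
def local_name_for(url)
  ext = File.extname(URI.parse(url).path)
  "#{Digest::SHA256.hexdigest(url)[0, 16]}#{ext}"
end

a = local_name_for('http://www.example.com/images/logo.png')
b = local_name_for('http://www.example.com/blog/images/logo.png')
puts a
puts b
puts a == b # prints "false": same base name, different stored names
```

The same URL always maps to the same name, so the mapping can be recomputed when rewriting the saved HTML instead of being stored separately.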

Related

Ruby Sinatra Embed erb Partial External HTML File

I have a need to hold a "Purchase Contract" type of report in my website. I am using Sinatra using erb files to deliver content. I would like to email the current report (the versions will change) out when people sign up for various items.
I'm thinking I can house it in the database, or an external file, in some kind of format, so I can do both:
import it into an erb file for presentation on the web
use it in an email so it's readable in text format
So basically I need it in a format that's basic as possible, but it has to translate into HTML (erb) and text.
What are my options with the format of this file? And how can I translate that into HTML? I've looked at markdown and it's not very pretty with the gems that I find that translate to text. Seeing that it needs plain text as well as HTML I'm a bit lost as to how to get this done.
File Snippet
Privacy Policy
Updated Feb 20, 2019
Website.com (“Website”) is a private business. In this Privacy Statement the terms “we” and “our” refer to Website. This Privacy Statement explains Website’s practices regarding personal information of our users and visitors to this website (the “Website”), as well as those who have transactions with us through telephone, Internet, faxes and other means of communications.
Website’s Commitment to Privacy
At Website, we are committed to respecting the privacy of our members and our Website visitors. For that reason we have taken, and will continue to take, measures to help protect the privacy of personal information held by us.
This Privacy Statement provides you with details regarding: (1) how and why we collect personal information; (2) what we do with that information; (3) the steps that we take to help ensure that access to that information is secure; (4) how you can access personal information pertaining to you; and (5) who you should contact if you have questions and concerns about our policies or practices.
Solution: Save the file as HTML and use this gem for conversion into text:
https://github.com/soundasleep/html2text_ruby
Works fine if the HTML is simple enough.
Remaining: Still have the issue of using the HTML file as a partial.
Solved:
@text = markdown File.read('views/privacy.md')
So park the source file as a markdown file, which can translate to HTML. When I need the email version, I need to translate to HTML then to text using the HTML2text gem. https://rubygems.org/gems/html2text
As I understand it, you have a portion of text (stored in a database or a file, it doesn't really matter where) and you want to:
serve this up formatted as HTML via a webpage
send it plain via email
Assuming a standard Sinatra project layout where the views directory lives in the project dir, e.g.
project-root/
app.rb
views/
and a route to deliver the text in app.rb:
get "/sometext" do
end
If you put the erb template in the views directory and as the last line of the route make a call to the erb template renderer you should get the output in HTML. e.g.
project-root/
app.rb
views/
sometext.erb # this is the erb template
In the Sinatra app
# app.rb
# I'm assuming you've some way to differentiate
# bits of text, e.g.
get "/sometext/:id" do |id|
  @text = DB.sometext.getid id # my fake database call
  erb :sometext # <- this will render it, make it the final statement of the block
  # make sure @text is in the template
  # else use locals, e.g.
  # erb :sometext, :locals => { text: @text }
end
Now when a user visits http://example.org/sometext/485995 they will receive HTML. Emailing the text to the user could be triggered via the website or some other method of your choice.
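To illustrate the "one source, two outputs" idea outside of Sinatra, here is a small sketch using only Ruby's bundled ERB library. It assumes the stored text is available as an array of paragraphs; `render_html` and `render_text` are hypothetical helpers, and a real app would more likely use Sinatra's `erb` helper with template files in views/.

```ruby
require 'erb'

# HTML version: wrap each paragraph in <p> and escape special characters.
def render_html(paragraphs)
  template = ERB.new(
    "<% paragraphs.each do |para| %><p><%= ERB::Util.html_escape(para) %></p>\n<% end %>"
  )
  template.result_with_hash(paragraphs: paragraphs)
end

# Plain-text version for the email body: paragraphs separated by blank lines.
def render_text(paragraphs)
  paragraphs.join("\n\n") + "\n"
end

paragraphs = ['Privacy Policy', 'Updated Feb 20, 2019']
puts render_html(paragraphs)
# prints:
# <p>Privacy Policy</p>
# <p>Updated Feb 20, 2019</p>
puts render_text(paragraphs)
```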

Yahoo Pipes to loop through all pages

I am looking to pull job postings from a site that has multiple pages of postings. I can pull the content from one page
On a simple example I can get it to iterate and grab page content (this is a simple example site base)
However, when I take the first example and try to clean the data, I can't use the XPath filter to grab the HTML id and I can't seem to find a way to limit the scope elsewhere. Here is what I am trying (regex, rename...):
http://pipes.yahoo.com/pipes/pipe.edit?_id=3619ea93d66e47442659a1976746ba6c
Any thoughts?

What is the most efficient way to write headers and footers, 'global' header/footer or 'local' ones?

I'm about to start coding a website, and because this is my first time writing a code for a webpage, there is something I've been wondering about.
Writing separate header.php and footer.php files is probably the fastest and easiest way to do this.
The problem is, what if for some pages I'd like to use specific javascript files and codes and for some I would like to use others?
Loading every script on every page would result in more HTTP requests and would eventually impact the performance of the site.
I thought about using if statements in the header and just give every page exactly what it needs, and nothing more.
But which way is more efficient?:
Coding global header.php and footer.php files and separating the codes using if statements OR add the whole header+footer code to every single file (ie local header/footer)?
P.S global and local header/footer is something I just made up, didn't really know how to call it, lol.
The advantage of your "global" header and footer is that 1) they are consistent and changes are "global" and 2) they are included in the pages in server code. So there isn't a lot of HTTP traffic if you do the include on the server side.
You can (and should) do page-specific includes on the server side if at all possible using logic that determines what to load at the time of the Request.
There are other ways to accomplish this but with straight up PHP, what you are considering is the best way.
If you are using a framework like Yii, you can do this sort of thing in layouts but with simple PHP, you are on the right track.
Defining the header and footer in each page (local), causes you to repeat a lot of code and causes maintenance headaches going forward. You have a lot of pages to update for simple header/footer changes.

How to differentiate from the server side, between the first request of the browser (HTML file) and the following (images, CSS, scripts...)?

I'm programming a website with SEO friendly links, ie, put the page title or other descriptive text in the link, separated by slashes. For example: h*tp://www.domain.com/section/page-title-bla-bla-bla/.
I redirect the request to the main script with mod_rewrite, but links in script, img and link tags are not resolved correctly. For example: assuming you are visiting the above link, the tag request the file at the URL h*tp://www.domain.com/section/page-title-bla-bla-bla/js/file.js, but the file is actually http://www.domain.com/js/file.js
I do not want to use a variable or constant in all HTML file URLs.
I'm trying to redirect client requests to one directory or another on the server. Is it possible to distinguish the first request for a page from the ones that come after it? Is it possible to do this with mod_rewrite for Apache, or with PHP?
I hope I explained well:)
Thanks in advance.
Using rewrite rules to fix the problem of relative paths is unwise and has numerous downsides.
Firstly, it makes things more difficult to maintain because there are hundreds of different links in your system.
Secondly and more seriously, you destroy cacheability. A resource requested from here:
http://www.domain.com/section/page-title-bla-bla-bla/js/file.js
will be regarded as a different resource from
http://www.domain.com/section/some-other-page-title/js/file.js
and loaded two times, causing the number of requests to grow dozenfold.
What to do?
Fix the root cause of the problem instead: Use absolute paths
<script src="/js/file.js">
or a constant, or if all else fails the <base> tag.
This is an issue of resolving relative URIs. Judging by your description, it seems that you reference the other resources using relative URI paths: In /section/page-title-bla-bla-bla a URI reference like js/file.js or ./js/file.js would be resolved to /section/page-title-bla-bla-bla/js/file.js.
To always reference /js/file.js independent of the actual base URI path, use the absolute path /js/file.js. Another solution would be to set the base URI explicitly to / using the BASE element (but note that this will affect all relative URIs).
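The resolution rules are easy to check with Ruby's standard `URI.join`, which resolves a reference against a base URI the way a browser does. The page URL below is the one from the question:

```ruby
require 'uri'

page = 'http://www.domain.com/section/page-title-bla-bla-bla/'

# A relative path is resolved against the page's own "directory".
relative = URI.join(page, 'js/file.js').to_s
# An absolute path ignores the page's path and starts from the host root.
absolute = URI.join(page, '/js/file.js').to_s

puts relative # prints http://www.domain.com/section/page-title-bla-bla-bla/js/file.js
puts absolute # prints http://www.domain.com/js/file.js
```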

Why do sites like twitter, gawker use #! instead of simple URL? [duplicate]

Closed 12 years ago.
Possible Duplicate:
What's the shebang (#!) in Facebook and new Twitter URLs for?
Twitter's profiles now have URL in the form of:
http://twitter.com/#!/username
instead of the simpler structure:
http://twitter.com/username
What does #! do? What is the advantage of using #!? I read that it's related to Google's web crawler, but I don't understand how exactly that works.
There are two parts to this:
Why a fragment identifier instead of a real page?
Because they are overusing Ajax. Instead of linking to a new page, they link to a non-existent or dynamically generated fragment of the current page and then use JavaScript to change the content.
Why start the fragment identifier with !
Because Google will map it onto a different URL so you can serve up a special alternative version just for them. This allows the content to be indexed by search engines.
In a URL, the content after the hash mark (#) is not sent to the server, but is instead visible to JavaScript on the page. So, using a # basically allows the page "http://twitter.com/" to handle it (for example, by opening up background connections to load up additional data). This also means that the content that doesn't change from one page to another (think the general layout of the page) can be cached and served immediately (since the effective URL is still "http://twitter.com/"), whereas putting it in the path of the URL (without the hash) would require a full separate fetch to get that layout.
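You can see the split by parsing the URL with Ruby's standard `uri` library. The request target that goes to the server is built from the path and query only; the fragment stays on the client side:

```ruby
require 'uri'

url = URI.parse('http://twitter.com/#!/username')

puts url.request_uri # prints "/"          (what the server is asked for)
puts url.fragment    # prints "!/username" (only the browser sees this)

plain = URI.parse('http://twitter.com/username')
puts plain.request_uri # prints "/username"
```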
