PDF hosting and disk space

While I can't leak too much information about this, there is a website I am thinking about building that would be large and would have A LOT of PDF files. However, disk space doesn't come cheap. How can I host all these PDFs (users would upload them too) without eating up all that disk space?

Maybe do some checks on the markup of the PDF? Or set a maximum file size, maybe a set of maximum file sizes depending on the number of pages?
I'm using pdf2html to handle (i.e. extract text from) PDF files.
It's a bit difficult to say... do the PDFs contain images or other data that could be shrunk to save a few KB?
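To make the per-page limit idea concrete, here is a minimal sketch of an upload check in Python, assuming the pypdf library is available; the actual limits (2 MB base plus 200 KB per page) are made-up numbers you would tune yourself:

```python
import os
from pypdf import PdfReader  # assumes the pypdf library is available

BASE_ALLOWANCE = 2 * 1024 * 1024   # made-up limit: 2 MB regardless of length
MAX_BYTES_PER_PAGE = 200 * 1024    # made-up limit: 200 KB per page

def pdf_within_limits(path: str) -> bool:
    """Reject uploads whose file size is out of proportion to their page count."""
    size = os.path.getsize(path)
    pages = len(PdfReader(path).pages)
    return size <= BASE_ALLOWANCE + pages * MAX_BYTES_PER_PAGE

print(pdf_within_limits("upload.pdf"))  # hypothetical uploaded file
```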

That is kind of like asking how you store trillions of gallons of salt water without creating an ocean. It really can't be done. At best, you can manage your resources to cut down on the space a bit. For instance, if your PDFs can be turned into forms, then just store the data and links to the embedded images in an XFDF file; that might cut the size down by a percentage. But there are of course caveats: 1) this won't work if your PDFs aren't forms and can't be standardized; 2) if your PDFs are mostly just images in PDF format, this won't really help at all.
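To illustrate the forms idea: instead of a full PDF per submission you keep one shared template and a tiny XFDF data file per record. A rough sketch in Python of what writing such a file could look like; the field names and invoice-template.pdf are hypothetical, and this only applies if your PDFs really are standardized forms:

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"

def write_xfdf(record: dict, template_href: str, out_path: str) -> None:
    """Store only the form field values plus a link to the shared PDF template."""
    ET.register_namespace("", XFDF_NS)
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in record.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", name=name)
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = str(value)
    ET.SubElement(root, f"{{{XFDF_NS}}}f", href=template_href)
    ET.ElementTree(root).write(out_path, xml_declaration=True, encoding="utf-8")

# hypothetical usage: one small .xfdf per submission, one shared template PDF
write_xfdf({"customer": "Jane Doe", "total": "42.00"},
           "invoice-template.pdf", "submission-0001.xfdf")
```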


exist-db: how to access a PDF

I am sure it is very simple ... I just cannot get my head around this...
The exist-db documentation is a bit fuzzy on content extraction...
http://exist-db.org/exist/apps/doc/contentextraction.
I have a PDF file containing about 162 high-res images (the PDF is quite big...) and I do not know how to access any of the images that are presumably created ...
Please do not destroy me! I am just starting to build a database (for an edition at my university). I'd love to have a facsimile edition (so one tab with the image file and one tab with the transcribed texts).
I aim to do something similar to what Heidelberg University did with the "Welsche Gast Digital": http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image
(the chosen image is just an example!)
When clicking on Faksimile the scan opens, and when clicking on Transkription the transcribed text opens!
I am quite new to XQuery, XPath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcription etc. I fear I'll have to spend quite some time on this issue ...
(it is not about doing my job for me, it's just about pointing me in the right direction)
I'm afraid the short answer is: simply don't.
Storing a PDF in your db and then trying to extract images from it is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the PDF) and store them individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.
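As a sketch of what storing those scans could look like, assuming eXist-db's default REST interface on localhost and the Python requests library (the collection path and credentials are placeholders), you simply PUT each image into the collection:

```python
import requests  # assumes the requests library is available

EXIST_REST = "http://localhost:8080/exist/rest"    # default eXist-db REST endpoint
COLLECTION = "/db/apps/myedition/resources/img"    # placeholder collection path
AUTH = ("admin", "")                               # placeholder credentials

def store_scan(filename: str) -> None:
    """Upload one page scan as a binary resource in the image collection."""
    with open(filename, "rb") as fh:
        resp = requests.put(
            f"{EXIST_REST}{COLLECTION}/{filename}",
            data=fh,
            headers={"Content-Type": "image/jpeg"},
            auth=AUTH,
        )
    resp.raise_for_status()

store_scan("page_001.jpg")  # hypothetical scan of one facsimile page
```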
You might want to take a look at tei-publisher for creating digital editions in eXist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I'm afraid it's all a bit more involved than just opening a PDF in a browser, but so is the Welsche Gast Digital.

Preload & site performance: x PDFs vs. x * (number of pages) PNGs

I have a question.
I am currently developing a website and I am nearly finished with the gallery section. When a user clicks on one of the gallery pictures, the gallery section will display either a) a PDF file or b) several PNGs (these PNGs will be displayed one below the other, so it will look just like the PDF).
As said, I have the option of choosing between the PDF or the PNGs.
The problem is that the PDF files are up towards 5 MB and I cannot do anything about that. I was thinking of preloading the PDF files or PNGs in advance.
I am looking for one thing: site performance.
Is it better to preload the single PDF per item (there will be up towards 20-30 PDFs in this gallery, so I would be preloading 20-30 PDFs), or is it better to go with the PNG approach (20-30 * x pages)?
Or does anyone have a better idea?
The best approach is to try it, measure, and see for yourself.
In any case, you can try to reduce the load by finding a balance between the amount of preloaded data and the added complexity.
In the case of PNGs you might split the images into smaller parts and preload only the images required to display the first one or two pages. This lets the visitor open those pages quickly (they might not go any further, after all).
In the case of PDFs you might linearize the documents (optimize them for Fast Web View). For linearized documents you could preload only the beginning of the file, although you would still have to work out how much to preload. The Adobe plugin opens linearized files faster than non-linearized ones; I am not aware of other PDF plugins that make use of this optimization, though.
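If you go the PDF route, linearization can be done as a batch step, for example with the qpdf command-line tool. A minimal sketch driving it from Python; the file names are placeholders:

```python
import subprocess

def linearize(src: str, dst: str) -> None:
    """Rewrite a PDF optimized for Fast Web View, so the first page can be shown
    before the whole file has been downloaded."""
    subprocess.run(["qpdf", "--linearize", src, dst], check=True)

linearize("gallery-item.pdf", "gallery-item-linearized.pdf")  # placeholder names
```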

Best way to deal with image thumbnails on a news web site (custom CMS based on CodeIgniter)?

I've been thinking for a while about the best solution, and the more I read the more confused I get. There are a lot of different libraries and helpers (most of them outdated or for CI 1.x) and I really need your help.
I have a custom CMS based on CodeIgniter 2.1.3 for a news site that has about 40-50 images on the home page. About 80% of them are really small thumbnails in 3 different sizes, the other 20% come in 2 sizes, and the inner pages that list the news from a category use 1 thumbnail size. So in total I will need the original image for each news story plus 5-6 thumbnail sizes.
What's the smartest way to deal with this? There will be, let's say, 10-50 new stories per day.
Is it still better to create 5-6 thumbnails per image during the upload?
What about generating them "on the fly"? I'm leaning towards this method: from what I've read, only the first visitor triggers the library/helper to generate the thumbnails, and for everyone else the thumbnails already exist, so it doesn't waste CPU. Is this good practice?
What caching techniques should I use for this?
Also, I forgot to ask: how do other CMS systems deal with generating thumbnails? I mean WordPress, Drupal, Joomla, etc.
Do they store predefined sizes or generate them on the fly?
I guess their logic should be the best, or maybe not, but I want to implement something smart in my CodeIgniter CMS.
I didn't mention it, but I don't think it matters here: I use Grocery CRUD for the admin panel.
Any help is appreciated.
Your best bet is to create images on the fly and use a CDN like Amazon CloudFront to cache the resized versions of your source image.
I've been using CodeIgniter for a number of years to build websites where lots of different image sizes are used throughout the site. At the beginning I used to create every size needed from the original image during the upload process (easily ending up with more than 5 thumbnails). This delivered the best performance - whenever you need an image of a certain size you just include it, with no additional PHP processing. However, I noticed that I ended up with a huge number of images on the server, where the older ones may not even be used that often (e.g. articles older than a year). Plus, developing this way takes longer.
Then I started creating images on the fly, first using 3rd-party libraries and later developing my own interface for CodeIgniter. This saves a lot of time, because during the upload process you save only the original version of the image without worrying about resized versions. When displaying an image on the front end, all you normally need to do is pass the required dimensions. This way you can get not just 5-6 versions of the image but as many as you need. It's also a solution for the future, when you redesign your website and differently sized images might be needed. What would you do when none of your 5 thumbnail options fit any more and you need different sizes?
You're right that resizing an image on the fly can be a really CPU-consuming operation (especially when large images are involved), therefore caching is a must. You can cache images right on your server or put a CDN on top of that.
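The on-the-fly approach boils down to: look for a cached derivative first, and only generate it on a miss. A minimal sketch of that logic in Python with Pillow (the cache directory and naming scheme are made up; in CodeIgniter you would do the same with its Image Manipulation class):

```python
import os
from PIL import Image  # assumes the Pillow library is available

CACHE_DIR = "cache/thumbs"  # made-up cache location

def thumbnail(src: str, width: int, height: int) -> str:
    """Return the path of a cached thumbnail, creating it only on a cache miss."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name, ext = os.path.splitext(os.path.basename(src))
    cached = os.path.join(CACHE_DIR, f"{name}_{width}x{height}{ext}")
    if not os.path.exists(cached):      # only the first request pays the CPU cost
        img = Image.open(src)
        img.thumbnail((width, height))  # shrinks in place, keeping the aspect ratio
        img.save(cached)
    return cached

# the first call resizes and caches; later calls just return the cached file
print(thumbnail("uploads/news-photo.jpg", 300, 200))
```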
To keep the server tidy I normally run a cron job to delete on-the-fly images older than, let's say, a week. That saves space and does no harm - whenever an image is needed for display, it will simply get recreated.
Check out timthumb; it's a script that resizes images on the fly and stores them in a cache. It's as simple as including an image tag with parameters in the URL.
Also check this link, which looks promising: http://www.jenssegers.be/blog/31/Codeigniter-resizing-and-cropping-images-on-the-fly-continued
I love the way Drupal manages this. In Drupal 6 there was a module called imagecache (it is now in core in Drupal 7, with very similar functionality), which basically stores presets for images (image sizes, transformations, effects...); when a visitor asks for an image, the module generates the different versions based on the presets and serves them. This way you upload one image but have different images for different purposes.
The module has a really useful feature: if you change a preset, you can "flush" all the images related to that preset, so visitors see the changes.
Of course there are many other Drupal modules related to imagecache or image styles that add other effects, like watermarks...
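Outside Drupal, the preset-plus-flush idea can be mimicked with very little code. A rough sketch in Python, assuming the same width-by-height naming scheme as the resize sketch above; the preset names and directory are made up:

```python
import glob
import os

# made-up presets, analogous to Drupal's image styles
PRESETS = {
    "teaser": (300, 200),
    "hero": (1200, 600),
}

def flush_preset(name: str, cache_dir: str = "cache/thumbs") -> None:
    """Delete every cached derivative for one preset so it gets regenerated."""
    width, height = PRESETS[name]
    for path in glob.glob(os.path.join(cache_dir, f"*_{width}x{height}.*")):
        os.remove(path)

flush_preset("teaser")  # visitors now get freshly generated teaser images
```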
More information:
http://drupal.org/node/949222
http://drupal.org/node/163561

Image size guidelines

This may well be a bit of an open-ended question.
The site I am working on needs to be optimised for performance. One of the key areas is optimising the file sizes of the images used on the site.
Unfortunately these images are created by employees who do not have the knowledge required to create images for the web, and it is my job to produce a set of guidelines for them to use.
I was wondering whether there is any resource/guidelines/literature regarding typical file sizes for images of different dimensions - I would like to include something like this to help them ensure their images are created properly.
Any info would be greatly appreciated.
Thanks in advance
I can't answer the opinion question, but I can suggest some guidelines that will keep your images smaller.
First off, if they're using Photoshop to edit their images, it's likely they're storing a whole bunch of crap in the headers (digital paper trail, EXIF data, and such). Also, folks will frequently save at too high a bit depth.
For novice users, trying to explain why they need to use "save for web" is more likely to confuse them. Instead, just point them at:
http://www.smushit.com/ysmush.it/
This site is rather handy - it will compress all the images on a page you specify, or you can upload the images.
You should strongly consider writing some guidelines about where images are stored as well. It's frequently very beneficial to have your static image content stored on several servers, separate from your dynamic content. Most browsers will only download a limited number of files at a time from any given website (usually it's 2).
Unless there's a good reason, all your images should be cached using one of the HTTP cache techniques (expires, etags, etc).
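As an illustration of that last point, here is a sketch of serving an image with a far-future Cache-Control header and an ETag, using Python/Flask purely for demonstration (the framework and paths are my assumption, not part of the original answer):

```python
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/img/<path:filename>")
def serve_image(filename):
    # send_from_directory adds an ETag and, with conditional=True, answers
    # revalidation requests with 304 Not Modified instead of resending the bytes
    resp = send_from_directory("static/img", filename, conditional=True)
    resp.headers["Cache-Control"] = "public, max-age=31536000"  # cache for a year
    return resp
```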
Good luck.
A resolution of 72 dpi and either JPEG or PNG format work best.
Try to use images at the exact pixel size at which they will be displayed, as specified by the image's height and width attributes.
You can set the output quality of a JPEG image, which will also save file size, although there is a trade-off against image quality.
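For the quality trade-off, a quick illustration with Pillow in Python; the quality value of 80 is just a common starting point, not a rule:

```python
from PIL import Image  # assumes the Pillow library is available

img = Image.open("photo.jpg")           # hypothetical source image
img = img.resize((800, 600))            # match the displayed size exactly
img.save("photo-web.jpg", quality=80,   # lower quality = smaller file
         optimize=True)                 # extra encoder pass to shrink it further
```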
I hope this is of use.

How would you optimise/simulate 'random' loading of large image files?

We use large background images (hi-res photos, up to 700 KB) for our page design.
It's part of the experience of the site that as you browse around, you see different images.
At the moment a different (random) image is loaded on each page request, from a pool of ~15 images, which could grow over time.
I'm looking for a sane way to optimize this:
To avoid the user having to download a big image file on every page view
To reduce load on the server (is this an issue, will the server keep the images in memory?)
The ideas I have so far include:
A timer which loads a different image at set intervals
Progressively loading other images in the background with ajax
Associating images with specific content (pages, tags)
The question is, how to keep it feeling somewhat random, while minimizing page load times and server hit?
I usually avoid sites with huge images; I am very impatient. I would rethink your design.
As a first step you should make sure that the images can be properly cached:
use sane URLs (no session IDs, etc.)
set appropriate HTTP headers (ETag, etc.)
Firstly, hearing that the background images alone are 700 KB astounds me. On top of the content on screen... that is a pretty heavy site.
For starters, I would try to use image compression tools. Two tools come to mind: ImageMagick and pngcrush. pngcrush is excellent at removing all the extraneous metadata attached to photos without compromising photo quality.
I only recommend this because compressing the images lets the user download less content, which means quicker load times, which... at the end of the day... is what users want.
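As a concrete example of the kind of batch compression step this suggests, assuming the ImageMagick command-line tools are installed (the file names and quality setting are placeholders):

```python
import subprocess

def strip_and_compress(src: str, dst: str) -> None:
    """Drop EXIF/profile metadata and recompress the image without resizing it."""
    subprocess.run(["convert", src, "-strip", "-quality", "85", dst], check=True)

strip_and_compress("background-original.jpg", "background-web.jpg")  # placeholders
```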
I would also cache the images, such that when a user re-visits the site, the image is already cached on their end. This minimises the HTTP requests that are made each time a user visits your site.
An example of where this technique is used on a commercial site is www.reactive.com. If you look at the /js/headerImages.js file, they make use of image caching. Funnily enough, you will find the same source code at: http://javascript.internet.com/miscellaneous/random-image.html
Considering that you have mentioned the images are randomly loaded, I am assuming you are using a JavaScript library such as jQuery to create the effect.
If you are, you can minimize page load times by using a CDN instead of referencing a local copy of the jQuery library stored on your server. I performed performance testing on a site I made for a client and, averaged over 20 hits, saved 1.6 seconds with this technique!
Hope that helps for now :)
