Data Structure for large and detailed maps - algorithm

Does anyone have recommendations for data structures for relatively large maps with high resolution, something like 400 miles x 400 miles at 10-15 ft resolution? 400 miles is roughly 2 million feet, so a 2D array would be roughly 140K-210K cells per side.
The map only needs to store the elevation and terrain (earth, water, rock, etc.), and I don't think storing tiles is a good strategy.
Thank you!

It depends on what you need to do with it: view it, store it, analyze it, etc...
One thing I can say, however, is that that file will be HUGE at your stated resolution, and you should consider splitting it up into at least a few tiles, even better at 1x1 mile tiles.
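To put rough numbers on that, here is a back-of-the-envelope sketch. The 3 bytes per cell (2 for elevation, 1 for a terrain code) and the helper names are my own assumptions, not something from the question.

```python
FEET_PER_MILE = 5280

def cells_per_side(extent_miles, resolution_ft):
    """Number of grid cells along one side of a square map."""
    return (extent_miles * FEET_PER_MILE) // resolution_ft

def locate(row, col, tile_cells):
    """Map a global cell index to (tile index, index inside that tile)."""
    return (row // tile_cells, col // tile_cells), (row % tile_cells, col % tile_cells)

side = cells_per_side(400, 10)        # cells per side at 10 ft resolution
total_bytes = side * side * 3         # assumed 3 bytes per cell, uncompressed
tile_cells = cells_per_side(1, 10)    # cells per side of a 1x1 mile tile
tiles_per_side = side // tile_cells

print(side, tile_cells, tiles_per_side, round(total_bytes / 1e9, 1), "GB")
print(locate(1000, 530, tile_cells))
```

With tiles, only the tiles you actually touch need to be in memory, and uniform tiles (open water, say) can be stored very cheaply or skipped.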
The list of raster formats supported by GDAL could serve as a good starting point for exploring various formats, keeping in mind that many software packages (GRASS, ArcGIS, etc.) use GDAL to read and write most raster formats. Note also that some file formats have maximum sizes which may prevent you from using them with your very large file.
For analysis and non-viewable storage, HDF5 format might be of interest.

If you want people to see the data as a map over the web, then creating small image tile overlays will be the fastest approach to sharing such a large dataset.


Processing raster files as arrays?

I need to process a raster file by scanning every point and radiating around it to find other points of interest. As you may know, a GIS raster file contains millions, if not tens or hundreds of millions, of pixel points. I need a platform that will allow me to process this data efficiently. I am slightly familiar with VBA, but from what I see such image processing capabilities are beyond its scope.
You probably do not want to start by coding this from scratch; raster processing in GIS is almost as old as the hills. I suggest you look at something like QGIS as a starting point. If neither it nor its many raster plugins do what you want, it's time to start learning how to write your own, likely not in VBA.

Some options for making augmented local maps with d3

I am new to d3 geo. My task is to make a map of Boston and add some interactive features to it.
So far I've been able to get an outline of Boston. But the base map should be comparable to something you'd see in Google Maps - it should have buildings, roads, street names and city names, rivers, etc. A basic geography that makes the region more familiar.
For now, I don't need to pan, and may have just two or three zoom states.
All the visualizations I've seen that overlay interactive features onto maps like this seem to use images for the underlying maps: windhistory, polymaps, google maps and more. So I guess my questions are:
Why do some maps use images for the "backdrop"? Is it just the easiest way to build on top of existing maps? Is it more performant?
If I go with the images approach, are there any limitations to the features I can add? I'm hoping to do things like windmaps, animations, heatmaps, etc.
What are the copyright implications for using images? I imagine the answer to this is, "depends on which images I use," but are there some standard libraries that have no strings attached? For example I know if I use Google Maps, I have to display their logo, there's an API limit, etc. Are there any standard sources that are completely open?
Are there any examples where geography is added purely through TopoJSON?
Sorry if some of these seem obvious, but I am completely new to maps and just don't know the standard practices. Thanks for any help!
A quick take on your questions. Hopefully someone with more mapping experience can give you more detail:
Why do some maps use images for the "backdrop"?
File size and computation time, mostly. Drawing complete maps with buildings, roads, and topography requires a lot of data and a lot of time for the browser to render it. If your browser DOM gets too complicated, it can slow down all interactions even after the original drawing.
If I go with the images approach, are there any limitations to the features I can add?
There's a reason most interactive maps use multiple layers. The background images are best for the underlying "lay of the land" type imagery; anything you want to be interactive should be on top with SVG.
What are the copyright implications for using images?
If you're using someone's images, you have to follow their licence. You might want to look at the OpenStreetMap project.
Are there any examples where geography is added purely through TopoJSON?
I suppose that depends on what you mean by "geography"; Mike Bostock has generated TopoJSON for a variety of features based on US Atlas data.
As for whether it makes sense: TopoJSON encodes paths/boundaries directly, and encodes regions as the area enclosed by a set of boundaries. You could use it to encode streets and rivers and even building outlines, but you're not saving any file size relative to regular GeoJSON because those paths generally aren't duplicated the way that region boundaries are. Relative to using image tiles, any improvement in file size would be countered by increased processing time.

Find duplicate images of different sizes

I am wondering if there is a pre-existing algorithm/library/framework to compare two images to see if one is a re-sized version of the other? The programming language doesn't matter at this stage.
If there is nothing out there, I'd need to write something up. What I have thought of so far:
(Expensive) Resize the larger to the smaller and compare pixel by pixel.
Better yet, just resize a few random "areas" on the picture and compare. If they match, convert more, etc...
Break the image into a number of rows and columns and do some sort of parity math on the color values.
The problem I see with the first two ideas especially, is that there are different ways to re-size a picture in the first place, so the math will likely not work out the same at all. Some re-sizing adds blur, etc....
If anyone could point me to some good literature on this subject, that would be great. My googling turns up mostly shareware applications which is not what I want.
The goal is to have this running in the back of a webserver.
The best approach depends on the characteristics of the images you are comparing: how likely is it that two images are the same, and when they differ, are they typically off by a lot, or could the difference be as small as a single pixel?
If the answer to the above is that the images you need to compare will be completely random, then going with the expensive solution, or some available package, might be the best bet.
If you know that the images are different more often than not, and that they typically differ quite a lot, and you really want to hand-roll a solution, you could implement some initial 'quick compare' steps that would be less expensive and that would quickly identify a lot of the cases where the images are different.
For example, you could resize the larger image and then compare, pixel by pixel (or by calculating a hash of the pixel values), only a 'diagonal line' of the image (top-left pixel to bottom-right pixel). That excludes differing images cheaply, and you only do the more expensive comparison for those that pass this test.
Or take a pre-set number of points at whatever is a 'good distribution' depending on the type of image and only do the more expensive comparison for those that pass this test.
If you know a lot about the images you will be comparing, they have known characteristics and they are different more often than they are the same, implementing a cheap 'quick elimination compare' along the lines of the above could be worthwhile.
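A minimal sketch of such a 'quick elimination compare', assuming both images have already been resized to the same dimensions and converted to grayscale (a list of rows of 0-255 values). The function names, the 16 sample points and the tolerance of 8 are arbitrary choices of mine.

```python
def diagonal(img, samples=16):
    """Sample pixels along the top-left to bottom-right diagonal."""
    h, w = len(img), len(img[0])
    return [img[(k * (h - 1)) // (samples - 1)][(k * (w - 1)) // (samples - 1)]
            for k in range(samples)]

def close(a, b, tol):
    """True if every pair of values differs by at most tol."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def probably_same(img_a, img_b, tol=8):
    """Cheap diagonal test first; full pixel comparison only if it passes."""
    if not close(diagonal(img_a), diagonal(img_b), tol):
        return False  # rejected without touching most pixels
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return close(flat_a, flat_b, tol)

# Synthetic test images: a pattern, a slightly brightened copy, and a negative.
base = [[(x * 7 + y * 3) % 256 for x in range(32)] for y in range(32)]
brighter = [[min(255, p + 2) for p in row] for row in base]
negative = [[255 - p for p in row] for row in base]
print(probably_same(base, brighter), probably_same(base, negative))
```

The tolerance is there because different resizing methods will not produce bit-identical pixels.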
You need to look into the dHash algorithm for this.
I wrote a pure Java library just for this a few days back. You can feed it a directory path (subdirectories included), and it will list the duplicate images, with absolute paths, that you may want to delete. Alternatively, you can use it to find all unique images in a directory too.
It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading a lot of newer image types, it uses the TwelveMonkeys jar internally.
Jar with dependencies bundled internally can be downloaded from,
The API can find duplicates among images of different sizes too.
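For reference, the dHash (difference hash) idea itself is short. This is a pure-Python sketch of the general technique, not the code of the Java library mentioned above: shrink the grayscale image to 9x8, then record for each pair of horizontally adjacent pixels whether the left one is brighter. The box-average `shrink` helper is my own stand-in for a proper resize and assumes the image is at least 9x8.

```python
def shrink(img, out_w, out_h):
    """Box-average a grayscale image (list of rows) down to out_w x out_h."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(out_h):
        y0, y1 = j * h // out_h, (j + 1) * h // out_h
        row = []
        for i in range(out_w):
            x0, x1 = i * w // out_w, (i + 1) * w // out_w
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def dhash(img, size=8):
    """64-bit difference hash: one bit per adjacent pixel pair in a 9x8 thumbnail."""
    bits = 0
    for row in shrink(img, size + 1, size):
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits

def hamming(a, b):
    """Number of differing bits; small distances suggest the same picture."""
    return bin(a ^ b).count("1")

# Synthetic example: a horizontal gradient and a pixel-doubled copy of it.
small = [[255 - 3 * x for x in range(72)] for _ in range(64)]
large = [[small[y // 2][x // 2] for x in range(144)] for y in range(128)]
print(hamming(dhash(small), dhash(large)))
```

Because the hash only records relative brightness of neighbours, it survives resizing and mild recompression; you compare hashes by Hamming distance rather than equality.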

Self-describing file format for gigapixel images?

In medical imaging, there appear to be two ways of storing huge gigapixel images:
Use lots of JPEG images (either packed into files or individually) and cook up some bizarre index format to describe what goes where. Tack on some metadata in some other format.
Use TIFF's tile and multi-image support to cleanly store the images as a single file, and provide downsampled versions for zooming speed. Then abuse various TIFF tags to store metadata in non-standard ways. Also, store tiles with overlapping boundaries that must be individually translated later.
In both cases, the reader must understand the format well enough to understand how to draw things and read the metadata.
Is there a better way to store these images? Is TIFF (or BigTIFF) still the right format for this? Does XMP solve the problem of metadata?
The main issues are:
Storing images in a way that allows for rapid random access (tiling)
Storing downsampled images for rapid zooming (pyramid)
Handling cases where tiles are overlapping or sparse (scanners often work by moving a camera over a slide in 2D and capturing only where there is something to image)
Storing important metadata, including associated images like a slide's label and thumbnail
Support for lossy storage
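To make the tiling and pyramid requirements concrete, here is a small sketch of the bookkeeping involved, independent of any file format. The 256-pixel tile size, the halving rule and the function names are assumptions for illustration.

```python
import math

def pyramid(width, height, tile=256):
    """List (level, width, height, tiles_x, tiles_y), halving until one tile fits."""
    levels = []
    level = 0
    while True:
        tiles_x = math.ceil(width / tile)
        tiles_y = math.ceil(height / tile)
        levels.append((level, width, height, tiles_x, tiles_y))
        if tiles_x == 1 and tiles_y == 1:
            return levels
        width, height = (width + 1) // 2, (height + 1) // 2
        level += 1

def tiles_covering(x0, y0, x1, y1, tile=256):
    """Tile (column, row) indexes intersecting the pixel rectangle [x0, x1) x [y0, y1)."""
    return [(tx, ty)
            for ty in range(y0 // tile, (y1 - 1) // tile + 1)
            for tx in range(x0 // tile, (x1 - 1) // tile + 1)]

levels = pyramid(100000, 80000)
for entry in levels:
    print(entry)
print(tiles_covering(500, 200, 800, 300))
```

A viewer picks the level closest to the requested zoom and reads only the tiles returned by something like `tiles_covering`. A sparse scan would simply have no entry for tiles that were never captured.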
What kind of (hopefully non-proprietary) formats do people use to store large aerial photographs or maps? These images have similar properties.
It seems like starting with TIFF or BigTIFF and defining a useful subset of tags + XMP metadata might be the way to go. FITS is no good since it is basically for lossless data and doesn't have a very appropriate metadata mechanism.
The problem with TIFF is that it just allows too much flexibility, but a subset of TIFF should be acceptable.
The solution may very well be and
It looks like DICOM now has support:
You probably want FITS.
Arbitrary size
1--3 dimensional data
Extensive header
Widely used in astronomy and endorsed by NASA and the IAU
I'm a pathologist (and hobbyist programmer) so virtual slides and digital pathology are a huge interest of mine. You may be interested in the OpenSlide project. They have characterized a number of the proprietary formats from the large vendors (Aperio, BioImagene, etc). Most seem to consist of a pyramidal zoomed (scanned at different microscopic objectives, of course), large TIFF file containing multiple tiled TIFFs or compressed (JPEG or JPEG2000) images.
The industry standard is DICOM Sup 145. Vendor adoption has been sluggish, but inventing yet another format would probably not be helpful.
PNG might work for you. It can handle large images, metadata, and the PNG format can have some interlacing, so you can get up to (down to?) an n/8 x n/8 downsampled image pretty easily.
I'm not sure if PNG can do rapid random access. It is chunked, but that might not be enough.
You could represent sparse data with the transparency channel.
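On the interlacing point: the first pass of PNG's Adam7 interlacing carries the pixel at every 8th column of every 8th row, which is where the n/8 x n/8 preview comes from. This toy sketch only mimics that sampling on an in-memory grid; it does not read a PNG file, and the function name is made up.

```python
def adam7_first_pass(img):
    """Pixels delivered by Adam7 pass 1: columns 0, 8, 16, ... of rows 0, 8, 16, ..."""
    return [row[::8] for row in img[::8]]

# A 20-wide, 17-tall grid whose values encode their own position.
grid = [[y * 100 + x for x in range(20)] for y in range(17)]
preview = adam7_first_pass(grid)
print(len(preview[0]), "x", len(preview))
```

Note that this is plain subsampling, not an averaged downsample, so the preview will alias on detailed images.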
JPEG2000 might be worth a look; there are some interesting efforts from national libraries in this space.

Detecting if two images are visually identical

Sometimes two image files may be different on a file level, but a human would consider them perceptively identical. Given that, now suppose you have a huge database of images, and you wish to know if a human would think some image X is present in the database or not. If all images had a perceptive hash / fingerprint, then one could hash image X and it would be a simple matter to see if it is in the database or not.
I know there is research around this issue, and some algorithms exist, but is there any tool, like a UNIX command line tool or a library I could use to compute such a hash without implementing some algorithm from scratch?
edit: relevant code from findimagedupes, using ImageMagick
try $image->Sample("160x160!");
try $image->Modulate(saturation=>-100);
try $image->Blur(radius=>3,sigma=>99);
try $image->Normalize();
try $image->Equalize();
try $image->Sample("16x16");
try $image->Threshold();
try $image->Set(magick=>'mono');
($blob) = $image->ImageToBlob();
edit: Warning! ImageMagick $image object seems to contain information about the creation time of an image file that was read in. This means that the blob you get will be different even for the same image, if it was retrieved at a different time. To make sure the fingerprint stays the same, use $image->getImageSignature() as the last step.
findimagedupes is pretty good. You can run "findimagedupes -v fingerprint images" to let it print "perceptive hash", for example.
Cross-correlation or phase correlation will tell you if the images are the same, even with noise, degradation, and horizontal or vertical offsets. Using the FFT-based methods will make it much faster than the algorithm described in the question.
The usual algorithm doesn't work for images that are not the same scale or rotation, though. You could pre-rotate or pre-scale them, but that's really processor intensive. Apparently you can also do the correlation in a log-polar space and it will be invariant to rotation, translation, and scale, but I don't know the details well enough to explain that.
MATLAB example: Registering an Image Using Normalized Cross-Correlation
Wikipedia calls this "phase correlation" and also describes making it scale- and rotation-invariant:
The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation.
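A tiny sketch of plain phase correlation (translation only, no log-polar step), written with a naive DFT so it has no dependencies. It is only practical for very small images; real code would use an FFT library. The function names are mine, and the shift it recovers is circular (wrap-around).

```python
import cmath
import random

def dft(seq, inverse=False):
    """Naive 1D discrete Fourier transform, O(n^2)."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def dft2(img, inverse=False):
    """2D DFT: transform every row, then every column."""
    rows = [dft(list(r), inverse) for r in img]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def phase_correlation(a, b):
    """Return ((row_shift, col_shift), peak) such that b is a shifted by that amount."""
    fa, fb = dft2(a), dft2(b)
    cross = []
    for ra, rb in zip(fa, fb):
        row = []
        for x, y in zip(ra, rb):
            p = x.conjugate() * y
            row.append(p / abs(p) if abs(p) > 1e-12 else 0)  # keep phase only
        cross.append(row)
    corr = dft2(cross, inverse=True)
    peak, where = max((v.real, (r, c)) for r, row in enumerate(corr)
                      for c, v in enumerate(row))
    return where, peak

rng = random.Random(0)
a = [[rng.randrange(256) for _ in range(8)] for _ in range(8)]
b = [[a[(r - 2) % 8][(c - 3) % 8] for c in range(8)] for r in range(8)]  # shift by (2, 3)
shift, peak = phase_correlation(a, b)
print(shift, round(peak, 3))
```

A sharp peak close to 1 means the two images are the same content up to an offset; unrelated images give a flat, noisy correlation surface with no clear peak.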
Colour histogram is good for the same image that has been resized, resampled etc.
If you want to match different people's photos of the same landmark it's trickier - look at Haar classifiers. OpenCV is a great free library for image processing.
I don't know the algorithm behind it, but Microsoft Live Image Search just added this capability. Picasa also has the ability to identify faces in images, and groups faces that look similar. Most of the time, it's the same person.
Some machine learning technology like a support vector machine, neural network, naive Bayes classifier or Bayesian network would be best at this type of problem. I've written one each of the first three to classify handwritten digits, which is essentially image pattern recognition.
Resize the image to a 1x1 pixel image... if they are exact, there is a small probability they are the same picture...
Now resize it to a 2x2 pixel image; if all 4 pixels are exact, there is a larger probability they are exact...
Then 3x3; if all 9 pixels are exact... good chance, etc.
Then 4x4; if all 16 pixels are exact... better chance.
Doing it this way, you can make efficiency improvements... if the 1x1 pixel grid is off by a lot, why bother checking the 2x2 grid? etc.
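A rough sketch of that coarse-to-fine check, assuming grayscale images given as lists of rows and at least `max_n` pixels per side. "Exact" is loosened to a tolerance, since resized copies rarely match bit for bit; the names and the tolerance value are my own choices.

```python
def shrink(img, n):
    """Box-average a grayscale image (list of rows) down to n x n."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(n):
        y0, y1 = j * h // n, (j + 1) * h // n
        row = []
        for i in range(n):
            x0, x1 = i * w // n, (i + 1) * w // n
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def first_mismatch(img_a, img_b, max_n=4, tol=4.0):
    """Smallest grid size (1x1, 2x2, ...) at which the images differ, or None."""
    for n in range(1, max_n + 1):
        for row_a, row_b in zip(shrink(img_a, n), shrink(img_b, n)):
            for pa, pb in zip(row_a, row_b):
                if abs(pa - pb) > tol:
                    return n  # stop early: no need to look at finer grids
    return None

# Left half dark / right half bright, its mirror image, and a pixel-doubled copy.
picture = [[0] * 12 + [200] * 12 for _ in range(24)]
mirror = [row[::-1] for row in picture]
doubled = [[picture[y // 2][x // 2] for x in range(48)] for y in range(48)]
print(first_mismatch(picture, mirror), first_mismatch(picture, doubled))
```

The mirror image has the same 1x1 average, so it is only caught at 2x2, which is exactly why passing the coarse levels raises the probability of a match without proving it.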
If you have lots of images, a color histogram could be used to get rough closeness of images before doing a full image comparison of each image against each other one (i.e. O(n^2)).
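A minimal sketch of that histogram pre-filter, assuming each image is available as a flat list of (r, g, b) tuples. The 4 bins per channel and the use of histogram intersection as the similarity measure are my choices, not the only options.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Normalised histogram over coarse RGB buckets, independent of image size."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(v, h2.get(bucket, 0.0)) for bucket, v in h1.items())

red_blue = [(255, 0, 0)] * 6 + [(0, 0, 255)] * 2
red_blue_4x = [p for p in red_blue for _ in range(4)]   # same picture, 4x the pixels
green = [(0, 255, 0)] * 8

print(intersection(color_histogram(red_blue), color_histogram(red_blue_4x)))
print(intersection(color_histogram(red_blue), color_histogram(green)))
```

Histograms can be computed once per image and stored, so the expensive full comparison only has to run on pairs whose histograms are already close.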
There is DPEG, "The" Duplicate Media Manager, but its code is not open. It's a very old tool - I remember using it in 2003.
You could use diff to see if they are REALLY different. I guess it will remove lots of useless comparisons. Then, for the algorithm, I would use a probabilistic approach: what are the chances that they look the same? I'd base that on the amount of RGB in each pixel. You could also use other metrics such as luminosity and the like.
