Format for representing GIS data - format

Is there an open data format for representing GIS data such as roads, localities, sublocalities, countries, buildings, etc.?
I expect such a format would define an address structure and names for the components of an address.
What I need is a data format to return in response to reverse geocoding requests.
I looked for one on the Internet, but it seems that every geocoding provider defines its own format.
Should I design my own format?
Does my question make any sense at all? (I'm a newbie to GIS.)
In case I have not made myself clear: I'm not looking for data formats such as GeoJSON, GML or WKT, since they define geometry and don't define any address structure.
UPD. I'm experimenting with different geocoding services and trying to isolate them in a separate module. I need to provide one common interface for all of them, and I don't want to make up one more data format (because on the one hand I don't fully understand the domain, and on the other hand the field itself seems to be well studied). The module's responsibility is to take a partial address (or coordinates) like "96, Dubininskaya, Moscow" and to return a data structure containing the house number (96), street name (Dubininskaya), sublocality (Danilovsky rn), city (Moscow), administrative area (Moskovskaya oblast), and country (Russia). The problem is that different countries may have more or fewer levels of division (more or fewer address components), and I need to unify these components across countries.

No, unfortunately there is not.
Why, you may ask?
Because different nations and countries have vastly different formats and requirements for storing addresses.
Here in the UK, for example, a postcode follows quite a complex set of rules, whereas a ZIP code in the US is a 5-digit number (optionally extended with 4 more digits), usually written after a simple 2-letter state abbreviation.
Then you have to consider the question of what exactly constitutes an address. Again, this differs not just from country to country, but sometimes drastically within the same territory.
For example (here in the UK):
Smith and Sons Butchers, 10 High Street, Some Town
Mr Smith, 10 High Street, Some Town
The Occupier, 10 High Street, Some Town
Smith and Sons Butchers, High Street, Some Town
These are all valid addresses in the UK, and in all cases the post would arrive at the correct destination; a GPS, however, may have trouble.
A GPS database might be set up so that each building is a square bit of geometry, with the ID being the house number.
That would give us the ability to say exactly where number 10 is, which means the last lookup, which has no house number, is immediately going to fail.
Plots may be indexed by name of business; again, that's fine until you start using personal names or generic titles.
There's so much variation that it's simply not possible to create one unified format that can encompass every possible rule required to allow any application on the planet to format any geocoded address correctly.
So how do we solve the problem?
Simple, by narrowing your scope.
Deal ONLY with a specific set of defined entities that you need to work with.
Hold only the information you need to describe what you need to describe. (Always remember YAGNI* here.)
Use standard data transmission formats such as JSON, XML and CSV; this reduces the work needed in code you don't control for it to read your data output.
(* YAGNI = You ain't gonna need it)
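As a rough illustration of the three points above, here is a minimal sketch of a narrowly scoped address record serialized to JSON. The `Address` class and its field names are assumptions made up for this example (they are not part of any standard); the sample values come from the question.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Address:
    # Hypothetical record: only the components this application needs.
    # Everything except the country is optional, because not every
    # country (or every lookup result) has every component.
    country: str
    city: Optional[str] = None
    street: Optional[str] = None
    house_number: Optional[str] = None
    sublocality: Optional[str] = None
    admin_area: Optional[str] = None

    def to_json(self) -> str:
        # Drop unset components so consumers only see what is known.
        known = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(known, ensure_ascii=False)


addr = Address(country="Russia", city="Moscow",
               street="Dubininskaya", house_number="96")
print(addr.to_json())
```

Keeping the house number as a string is a deliberate choice in this sketch, since house numbers such as "10A" are not numeric.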
Now, to dig in deeper however:
When it comes to actual GIS data, there are a lot of standard file formats; the 3 most common are:
Esri Shapefiles (*.shp)
Keyhole Markup Language (*.kml)
Comma-separated values (*.csv)
All of the mainstream GIS packages, free and paid-for, can work with any of these 3 file types, and many more.
Shapefiles are by far the most common ones you're going to come across; just about every bit of geospatial data I've come across in my years in I.T. has been in a shapefile. I would, however, NOT recommend storing your data in them for processing: they are quite a complex format, and often slow and sequential to access.
If your geometry files are to be consumed in other systems, however, you can't go wrong with them.
They also have the added bonus that you can attach attributes to each item of data too, such as address details, names etc.
The problem is, there is no standard as to what you would call the attribute columns, or what you would include, and, probably more drastically, the column names are conventionally UPPERCASE and limited to 10 characters in length (a restriction of the underlying dBASE .dbf attribute file).
KML files are another format that's quite universally recognized, and because they're XML-based and used by Google, you can include a lot of extra data in them that is technically self-describing to the machine reading it.
Unfortunately, file sizes can be incredibly bulky even for just a handful of simple geometries; the trade-off, though, is that they are pretty easy to handle in just about any programming language on the planet.
And that brings us to the humble CSV.
The mainstay of data transfer (not just geospatial) ever since time began.
If you can put your data in a database table or a spreadsheet, then you can put it in a CSV file.
Again, there are no standards, other than how columns may or may not be quoted and what the separators are, so readers have to know ahead of time what each column represents.
Also, there's no "pre-made" geographic storage element (in fact there are no data types at all), so your reading application will also need to know ahead of time what the column data types are meant to be so it can parse them appropriately.
On the plus side, however, EVERYTHING can read them; whether they can make sense of them is a different story.
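To make the "reader must know the column types ahead of time" point concrete, here is a small sketch using Python's standard `csv` module. The `SCHEMA` mapping, the column names and the sample row are all invented for illustration; in practice the schema would be agreed between producer and consumer out of band.

```python
import csv
import io

# Hypothetical schema agreed on out of band: column name -> parser.
SCHEMA = {"name": str, "lat": float, "lon": float, "house_number": int}


def read_typed_csv(text, schema):
    """Parse CSV text, converting each column with the parser in schema."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # CSV itself carries only strings; the schema supplies the types.
        rows.append({col: parse(raw[col]) for col, parse in schema.items()})
    return rows


sample = 'name,lat,lon,house_number\n"Smith and Sons Butchers",51.5,-0.12,10\n'
rows = read_typed_csv(sample, SCHEMA)
print(rows)
```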


Algorithm for Matching Hospital Names

I work in a health care company and I am having trouble with the hospitalization report data. The data come from various sources: Excel reports, plain text files, and in some cases paper. I managed to get all the data into an Excel file, but I am running into a problem where each person spelled and referred to the same hospital differently.
For example, for New York Presbyterian Hospital I have seen more than 10 variations:
New York Presbyterian Hospital
NY Presbyterian Hospital
Presbyterian Hospital
Presb Hospital
Columbia Presbyterian Medical Center
NYP/Columbia University Medical Center
New York Presbyterian Hospital Columbia University Medical
Many more cases where the hospital name is misspelled
A few of the different systems have a string limit and cut off the string in random places, or maybe someone copied and pasted incorrectly.
Different nurses refer to the hospital differently
In my effort I am trying to create a true database that can store all the members' information, but I am running into a wall because each staff member/department names the hospital in a different way. (There is a provider ID unique to each hospital, but most of the reports I received only included the name.) I have over 2000 members with about 100-150 hospitals, but 3 or 4 times that number of different names.
I know Levenshtein distance could be used, but in such an extreme case, is there a strategy to build a match? There is too much data to do by hand (it is time consuming), since this is one of dozens of reports I am assigned. Any suggestions would be appreciated.
This is a pretty standard and pretty difficult problem. Entire companies exist to solve it for big data.
The usual strategy is to encode what is known about the data domain in a heuristic algorithm to classify the data before putting it in the database.
A standard classification method would be to create a set of pattern strings for each hospital. The examples you gave might go in the pattern set initially.
Then for each incoming string and each pattern, calculate a metric that's the difference between the string and pattern. Levenshtein is a good starting point. The set containing the least distance pattern (in this case Columbia Presbyterian) wins. An excessive least distance means your pattern set is no good. (You get to tweak what "excessive" means.) More than one low distance (you get to define "low," too) means the pattern set has inadvertent overlaps.
Both problems may be handled in various ways, usually involving human intervention either to classify the data or enhance the pattern sets or both.
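The pattern-set approach described above could be sketched roughly as follows. This is only an illustration: the provider IDs ("NYP", "MSH"), the pattern lists and the `max_distance` threshold are invented, and a real pattern set would be built from the actual data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical pattern sets keyed by an invented provider ID.
PATTERNS = {
    "NYP": ["new york presbyterian hospital", "ny presbyterian hospital",
            "presbyterian hospital", "presb hospital"],
    "MSH": ["mount sinai hospital", "mt sinai hospital"],
}


def classify(name, patterns, max_distance=5):
    """Return (provider_id, distance); provider_id is None if too far."""
    name = name.lower().strip()
    best_id, best = None, None
    for provider_id, variants in patterns.items():
        for variant in variants:
            d = levenshtein(name, variant)
            if best is None or d < best:
                best_id, best = provider_id, d
    if best is None or best > max_distance:
        return None, best  # "excessive" distance: hand over to a human
    return best_id, best


print(classify("New York Presbyterain Hospital", PATTERNS))
```

Names that come back with `None` are the ones that need human review, and each reviewed name can then be added to the pattern set so it matches exactly next time.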
A second possibility is to use regexes as patterns. Then a match is equivalent to distance zero above, and a non-match is distance infinity. As you might expect, this makes the algorithm less flexible. Yet for some kinds of data - probably not yours though - it's the best choice.
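A regex-based variant might look like the sketch below. The two expressions and the "NYP" provider ID are assumptions chosen to fit the sample names in the question, not a tested rule set.

```python
import re

# Hypothetical regex patterns, one list per invented provider ID.
REGEX_PATTERNS = {
    "NYP": [re.compile(r"\bpresb", re.IGNORECASE),
            re.compile(r"\bnyp\b", re.IGNORECASE)],
}


def classify_regex(name, patterns):
    """Return every provider ID that has at least one matching regex."""
    return [pid for pid, regexes in patterns.items()
            if any(rx.search(name) for rx in regexes)]


print(classify_regex("Columbia Presbyterian Medical Center", REGEX_PATTERNS))
print(classify_regex("NYP/Columbia University Medical Center", REGEX_PATTERNS))
```

An empty result corresponds to "distance infinity" for every pattern set, and a result with more than one ID signals overlapping patterns.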
You should look for specific patterns which your data forms. What I have observed is that, out of the strings you've shown, "Presb" is a substring used in nearly all of them (the variations of the hospital name you have been given). #M-ohem's comment is a nice approach as well. But for starters, you can put together a regular expression which checks whether an input string contains the pattern "Presb".

abstract data types in algorithms

The data structures that we use in applications often contain a great
deal of information of various types, and certain pieces of
information may belong to multiple independent data structures. For
example, a file of personnel data may contain records with names,
addresses, and various other pieces of information about employees;
and each record may need to belong to one data structure for searching
for particular employees, to another data structure for answering
statistical queries, and so forth.
Despite this diversity and complexity, a large class of computing
applications involve generic manipulation of data objects, and need
access to the information associated with them for a limited number of
specific reasons. Many of the manipulations that are required are a
natural outgrowth of basic computational procedures, so they are
needed in a broad variety of applications.
The text above is from the discussion of abstract data types in Algorithms in C++ by Robert Sedgewick.
My question is: what does the author mean by the first paragraph of the text above?
Data structures are combinations of data storage and algorithms that work on those organisations of data to provide implementations of certain operations (searching, indexing, sorting, updating, adding, etc) with particular constraints. These are the building blocks (in a black box sense) of information representation in software. At the most basic level, these are things like queues, stacks, lists, hash maps/associative containers, heaps, trees etc.
Different data structures have different tradeoffs. You have to use the right one in the right situation. This is key.
In this light, you can use multiple (or "compound") data structures in parallel that allow different ways of querying and operating on the same logical data, hence filling each other's tradeoffs (strengths/weaknesses e.g. one might be presorted, another might be good at tracking changes, but be more costly to delete entries from, etc), usually at the cost of some extra overhead since those data structures will need to be kept synchronised with each other.
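A small sketch of that idea, with two structures over the same personnel records: a dictionary for looking up one employee and a per-department grouping for statistical queries. The class name, fields and sample data are made up for illustration.

```python
from collections import defaultdict


class EmployeeStore:
    """Keeps two structures over the same records, synchronised on insert."""

    def __init__(self):
        self.by_name = {}                       # lookup of one employee
        self.by_department = defaultdict(list)  # grouping for statistics

    def add(self, name, department, salary):
        record = {"name": name, "department": department, "salary": salary}
        # The same record object is referenced from both structures,
        # so each record "belongs" to two independent data structures.
        self.by_name[name] = record
        self.by_department[department].append(record)

    def total_salary(self, department):
        return sum(r["salary"] for r in self.by_department[department])


store = EmployeeStore()
store.add("John", "Engineering", 5000)
store.add("Mary", "Engineering", 6000)
store.add("Ann", "Sales", 4000)
print(store.by_name["Mary"]["salary"], store.total_salary("Engineering"))
```

The extra overhead mentioned above shows up in `add`: every insertion (and, in a fuller version, every deletion or rename) has to update both structures.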
It would help if one knew what the conclusion of all this is, but from what I gather:
Employee record:
Name Address Phone-Number Salary Bank-Account Department Superior
As you can see, the employee database has information for each employee that by itself is "subdivided" into chunks of more-or-less independent pieces: the contact information for an employee has little or nothing to do with the department he works in, or the salary he gets.
EDIT: As such, depending on what kind of stuff needs to be done, different parts of this larger record need to be looked at, possibly in different fashion. If you want to know how much salary you're paying in total you'll need to do different things than for looking up the phone number of an employee.
An object may be a part of another object/structure, and the association is not unique; one object may be a part of multiple different structures, depending on context.
Say, there's a corporate employee John. His "employee record" will appear in the list of members of his team, in the salaries list, in the security clearances list, parking places assignment etc.
Not all data contained within his "employee record" will be needed in all of these contexts. His expertise fields are not necessary for parking place allotment, and his marital status should not play a role in meeting room assignment - separate subsystems and larger structures his entry is a part of don't require all of the data contained within his entry, just specific parts of it.

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users, and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model.
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "Latent Variables" because then you can use the to hillclimb to promising solutions for this problem, from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
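Steps 1 to 4 can be sketched as below, assuming that expectation-maximization (EM) is used as the hill-climbing method; that choice, the function name `em_true_bogus`, the fixed starting values and the toy report matrix are all assumptions of this example rather than something the steps above specify. Each row is one event, each column one source, and a 1 means the source lists the event.

```python
def em_true_bogus(reports, iterations=50, eps=1e-6):
    """Estimate P(event is true) for each row of a 0/1 report matrix."""
    n_sources = len(reports[0])

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    # Fixed asymmetric start so "true" stays the better-reported class;
    # the text suggests random starts, which this sketch does not do.
    p_true = 0.5
    hit = [0.7] * n_sources         # P(source lists event | event true)
    false_rate = [0.3] * n_sources  # P(source lists event | event bogus)
    post = []
    for _ in range(iterations):
        # E-step: posterior probability that each event is true.
        post = []
        for row in reports:
            lt, lb = p_true, 1.0 - p_true
            for j, r in enumerate(row):
                lt *= hit[j] if r else 1.0 - hit[j]
                lb *= false_rate[j] if r else 1.0 - false_rate[j]
            post.append(lt / (lt + lb))
        # M-step: re-estimate the parameters from the soft labels.
        w_true = sum(post)
        w_bogus = len(post) - w_true
        p_true = clamp(w_true / len(post))
        for j in range(n_sources):
            listed_true = sum(w for w, row in zip(post, reports) if row[j])
            listed_bogus = sum(1.0 - w for w, row in zip(post, reports) if row[j])
            hit[j] = clamp(listed_true / max(w_true, eps))
            false_rate[j] = clamp(listed_bogus / max(w_bogus, eps))
    return post, p_true, hit, false_rate


# Toy data: sources 0 and 1 mostly agree; source 2 adds events nobody else has.
reports = [[1, 1, 1]] * 4 + [[1, 1, 0]] * 2 + [[0, 0, 1]] * 3
post, p_true, hit, false_rate = em_true_bogus(reports)
print([round(w, 3) for w in post])
```

On this toy matrix the events listed only by the third source end up with a low probability of being true, while the events the first two sources agree on end up with a high one. With real data the result depends on the starting point, which is why several starts are worth trying.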
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but have a list with both dates and whatever probabilities you think appropriate...)
beware assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
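Counting how many sources agree on a show could be done along these lines. The tuple layout `(source, act, city, date)` and the sample listings are invented for this sketch, and the matching key is deliberately naive (lower-cased act and city plus the date).

```python
from collections import defaultdict


def confidence_by_agreement(listings):
    """Map (act, city, date) -> number of distinct sources listing it."""
    sources = defaultdict(set)
    for source, act, city, date in listings:
        key = (act.strip().lower(), city.strip().lower(), date)
        sources[key].add(source)  # a set, so a source only counts once
    return {key: len(names) for key, names in sources.items()}


# Hypothetical listings from three made-up sources.
listings = [
    ("site_a", "The Band", "Austin", "2024-08-04"),
    ("site_b", "the band ", "austin", "2024-08-04"),
    ("site_b", "the band", "Austin", "2024-08-04"),
    ("users", "Other Act", "Austin", "2024-08-09"),
]
counts = confidence_by_agreement(listings)
print(counts)
```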

Optimizing Data Translation

Our business deals with houses and over the years we have created several business objects to represent them. We also receive lots of data from outside sources, and send data to external consumers. Every one of these represents the house in a different way and we spend a lot of time and energy translating one format into another. I'm looking for some general patterns or best practices on how to deal with this situation. How can I write a universal data translator that is flexible, extensible, and fast?
Background: A house generally has 30-40 attributes such as size, number of bedrooms, roof type, construction material, siding material, etc. These are typically represented as key/value pairs. A typical translation problem is that one vendor will represent the number of bedrooms as a single key/value pair: NumBedrooms=3, while a different vendor will have a key/value pair per bedroom: Bedroom=master, Bedroom=small, Bedroom=small.
There's nothing particularly hard about the translation, but we spend a lot of time and energy writing and testing translations. How can I optimize this?
(My environment is .Net)
The best place to start is by creating an "internal representation", which is the representation that your processing will always use. Then create translators from and to "external representations" as needed. I'd imagine that this is what you are already doing, but it should be mentioned for completeness. The optimization comes from being able to selectively write import and export only when you need them.
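As an illustration of the internal-representation idea, using the bedroom example from the question: the function names and the internal `bedrooms` field are assumptions of this sketch, and it is written in Python rather than .NET purely for brevity.

```python
def from_vendor_a(pairs):
    """Vendor A sends a single pair such as ("NumBedrooms", "3")."""
    data = dict(pairs)
    return {"bedrooms": int(data["NumBedrooms"])}


def from_vendor_b(pairs):
    """Vendor B sends one ("Bedroom", kind) pair per bedroom."""
    return {"bedrooms": sum(1 for key, _ in pairs if key == "Bedroom")}


def to_vendor_a(house):
    """Export the internal representation in vendor A's layout."""
    return [("NumBedrooms", str(house["bedrooms"]))]


a = from_vendor_a([("NumBedrooms", "3")])
b = from_vendor_b([("Bedroom", "master"), ("Bedroom", "small"),
                   ("Bedroom", "small")])
print(a, b, to_vendor_a(b))
```

With N external formats this needs at most 2N translators (one import and one export each) instead of a translator for every pair of formats. Note that the export to a per-bedroom format would lose the room kinds unless the internal representation keeps them.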
A good implementation strategy is to externalize the transformation if you can. If you can get your inputs and outputs into XML documents, then you can write XSLT transforms between your internal and external representations. The goal is to be able to set up a pipeline of transformations from an input XML document to your internal representation. If everything is represented in XML and using a common protocol (say... hmm... HTTP), then the process can be controlled using configuration. BTW - this is essentially the Pipes and Filters design pattern.
Take a look at Yahoo pipes, Apache Cocoon, XML pipeline, and NetKernel for inspiration.
My employer back in the 90s faced this problem. We had a standard format we converted the customers' data to and from, as D.Shawley suggests.
I went further and designed a simple format-description language; we described our standard format in that language and then, for a new dataset, we'd write up its format too. Then a program would take both descriptions and convert the data from one format to the other, with automatic type conversions, safety checks, etc. (This came in handy for some other operations as well, not just these initial/final conversions.)
The particulars probably won't help you -- chances are you deal with completely different kinds of data. You can likely profit from the general principle, though. The "data definition language" needn't necessarily be a fancy thing with a parser and scanner; you might define it directly with a data structure in IronPython, say.
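A format description defined directly as a data structure might look roughly like this. The `STANDARD` and `VENDOR_X` descriptions, their keys and the `convert` function are hypothetical; the original system's language is not described in enough detail to reproduce.

```python
# Hypothetical format descriptions: internal field -> (external key, type).
STANDARD = {"size_sqft": ("size_sqft", int), "roof": ("roof", str)}
VENDOR_X = {"size_sqft": ("SqFt", int), "roof": ("RoofType", str)}


def convert(record, source, target):
    """Convert a record between two described formats with type checks."""
    out = {}
    for field, (src_key, src_type) in source.items():
        if src_key not in record:
            raise KeyError(f"missing field {src_key!r}")  # safety check
        value = src_type(record[src_key])  # automatic type conversion
        dst_key, dst_type = target[field]
        out[dst_key] = dst_type(value)
    return out


incoming = {"SqFt": "1800", "RoofType": "gable"}
print(convert(incoming, VENDOR_X, STANDARD))
```

Adding a new vendor then means writing one more description rather than one more translator.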

Regional Proximity UI

I'm developing a UI (AJAX-enabled; LAMP server) which will allow a user to designate regions in which a company operates. A "region" in this case may be a state (if dealing with the US), a province (Canada), or an entire country (everyone else).
As there are 195 countries in the world, I would like to avoid a multi-select box or list of checkboxes. In the workflow leading to this particular screen, the user will have already entered the full address of the company, so I have a starting region to work from.
Since the majority of companies only operate out of their own region, and those covering multiple regions tend not to branch out too far, I am considering displaying the list of regions gradually based on proximity. I realize at some point (I'm using 3 passes for now) the full list will need to be displayed; I'm just trying to delay the user from reaching that point as it's a definite edge case.
Here is a PNG mockup that explains this concept a bit more clearly. (196kb)
What suggestions do you have for the actual form interaction? This has not been presented to representative end users yet, but I'm open to all suggestions during the prototyping stage.
Do you think 'rolling up' US states and/or Canadian provinces between transitions will negatively affect the user's spatial memory?
More clearly: after the 3rd pass, the company will operate in every US state - so convert those 50 inputs into one.
Are there any existing applications that have utilized this approach to use as a baseline or demo?
And, since I know my developer will want to know - what would be the easiest way to store each region's proximity? Lat/long of the center? Lat/long of each corner of a 'bounding box' (more accurate)? I'm assuming we will end up writing some proximity calculations based on the lat/long of the company's actual address.
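For reference, the centre-point option could be implemented with a great-circle (haversine) distance, sketched below in Python for brevity even though the server is LAMP. The region centre coordinates are rough, illustrative values and the function names are made up for this sketch.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical, approximate region centre points as (lat, lon).
CENTERS = {
    "Pennsylvania": (41.0, -77.6),
    "New Jersey": (40.1, -74.7),
    "Indiana": (39.9, -86.3),
    "Idaho": (44.4, -114.6),
}


def regions_by_proximity(lat, lon, centers):
    """Return region names sorted by distance from the given point."""
    return sorted(centers,
                  key=lambda name: haversine_km(lat, lon, *centers[name]))


# A company address in Pittsburgh (approximate coordinates).
print(regions_by_proximity(40.44, -80.0, CENTERS))
```

A centre point treats large or oddly shaped regions poorly, which is the trade-off against storing a bounding box per region.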
Are you expecting users to read the map in order to know what list of checkboxes to go to? If your users have that level of geographic ability, then it’s less work for them to select the regions directly from the map, rather than have them make the map-to-Proximity-Level cognitive transfer, followed by a Proximity-Level-to-region transfer.
If some users do not have that level of geographic expertise (you may be surprised how many Americans cannot find their own state on a US map), then I’d try, perhaps in addition to the map, no more than two lists, one proximal (the default) with regions close to the home address, and one exhaustive. I can’t see users with weak geographic abilities being able to handle multiple arbitrary levels of proximity. People who can’t read maps well are not going to be able to estimate the proximity level of one region to another. So the idea is to try a proximal list and if that doesn’t work, then forget about proximity and go exhaustive – don’t send your users wandering among proximity levels looking for Idaho (“I swear it’s near Indiana”).
By default, show the proximal list with regions likely to satisfy most of your users based on research of your likely clients. A “more” button displays the exhaustive list. Both lists should be sorted alphabetically, except first subdivide the exhaustive list into States of the US, Provinces & Territories of Canada, and Country (which includes the US (all) and Canada (all)).
You can provide some command buttons to select multiple regions (e.g., “All 48 contiguous US states”, “All of South America”), allowing users to de-select some regions afterward. For this reason, I wouldn’t roll anything up until the user commits the input.
As an example of someone using a map plus list (all in HTML, no less), see
I am not really clear what it is that you are trying to achieve from the current UI (are you looking for branch offices? other companies? etc?)
I am not a big fan of using pure geographical proximity to define regions. For example, if one company operates in NYC, it could have an office in NJ which could well be as far as the moon. On the other hand, for a company in Anchorage, an office in Vancouver could still be within the region. Unfortunately, state boundaries are fairly meaningless too. For example, I live in western PA, and can tell you that while Pittsburgh and Philly are in the same state, they could be different countries for all that matters, and most companies have offices in each.
If your project is LAMP-based, why not just let a user click a point on the map, and based on that ask him what he means (e.g., nearest city, entire county, entire state, entire country)? If you then need to define the entire region, you can perhaps use some sort of a grab tool to click or delineate all the other regions that could be part of it.
Either way, present your offices as pushpins on the map, and then maybe have a list on the side the way that standard Google Maps handles searches.
It may be a lot of work, but if it's an important form, users may prefer that over manual text entry or selections from a list.
