Abstract data types in algorithms

The data structures that we use in applications often contain a great
deal of information of various types, and certain pieces of
information may belong to multiple independent data structures. For
example, a file of personnel data may contain records with names,
addresses, and various other pieces of information about employees;
and each record may need to belong to one data structure for searching
for particular employees, to another data structure for answering
statistical queries, and so forth.
Despite this diversity and complexity, a large class of computing
applications involve generic manipulation of data objects, and need
access to the information associated with them for a limited number of
specific reasons. Many of the manipulations that are required are a
natural outgrowth of basic computational procedures, so they are
needed in a broad variety of applications.
The above text appears in the discussion of abstract data types by Robert Sedgewick in Algorithms in C++.
My question is: what does the author mean by the first paragraph of the above text?

Data structures are combinations of data storage and algorithms that work on those organisations of data to provide implementations of certain operations (searching, indexing, sorting, updating, adding, etc) with particular constraints. These are the building blocks (in a black box sense) of information representation in software. At the most basic level, these are things like queues, stacks, lists, hash maps/associative containers, heaps, trees etc.
Different data structures have different tradeoffs. You have to use the right one in the right situation. This is key.
In this light, you can use multiple (or "compound") data structures in parallel that allow different ways of querying and operating on the same logical data, so that they complement each other's tradeoffs (e.g. one might be presorted, another might be good at tracking changes but more costly to delete entries from), usually at the cost of some extra overhead, since those data structures need to be kept synchronised with each other.
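As a rough sketch of that idea (the record layout and container choices are my own, not from the book), the same employee records can be reachable through several containers at once, each suited to a different kind of query:

#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type, just for illustration.
struct Employee {
    std::string name;
    std::string address;
    double      salary;
};

int main() {
    // One owning container holds the actual records...
    std::vector<std::shared_ptr<Employee>> all;
    all.push_back(std::make_shared<Employee>(Employee{"Alice", "1 Main St", 50000}));
    all.push_back(std::make_shared<Employee>(Employee{"Bob", "2 Oak Ave", 60000}));

    // ...while two independent "views" index the same objects:
    std::unordered_map<std::string, std::shared_ptr<Employee>> byName; // fast exact lookup
    std::multimap<double, std::shared_ptr<Employee>> bySalary;         // ordered, good for range queries

    for (auto& e : all) {
        byName[e->name] = e;
        bySalary.emplace(e->salary, e);
    }
    // byName answers "find Bob", bySalary answers "who earns between X and Y",
    // but both views must be updated together whenever `all` changes.
}

Each extra container pays its own overhead, which is exactly the synchronisation cost mentioned above.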

It would help if one knew what the conclusion of all this is, but from what I gather:
Employee record:
Name Address Phone-Number Salary Bank-Account Department Superior
As you can see, the employee database has information for each employee that by itself is "subdivided" into chunks of more-or-less independent pieces: the contact information for an employee has little or nothing to do with the department he works in, or the salary he gets.
EDIT: As such, depending on what needs to be done, different parts of this larger record need to be looked at, possibly in different ways. If you want to know how much salary you're paying in total, you'll need to do different things than for looking up the phone number of an employee.
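For example (a toy sketch; the field names are made up): the salary total is a scan over one field of every record, while the phone lookup touches a single record found by key.

#include <iostream>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

struct EmployeeRecord {                 // made-up layout, for illustration only
    std::string name, address, phone;
    double salary;
};

int main() {
    std::vector<EmployeeRecord> records = {
        {"Alice", "1 Main St", "555-0101", 50000},
        {"Bob",   "2 Oak Ave", "555-0102", 60000},
    };

    // Statistical query: reads the salary field of every record.
    double totalSalary = std::accumulate(records.begin(), records.end(), 0.0,
        [](double sum, const EmployeeRecord& e) { return sum + e.salary; });

    // Point lookup: an index by name, then only the phone field of one record.
    std::unordered_map<std::string, const EmployeeRecord*> byName;
    for (const auto& e : records) byName[e.name] = &e;

    std::cout << "total salary: " << totalSalary << "\n"
              << "Bob's phone:  " << byName["Bob"]->phone << "\n";
}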

An object may be a part of another object/structure, and the association is not unique; one object may be a part of multiple different structures, depending on context.
Say there's a corporate employee, John. His "employee record" will appear in the list of members of his team, in the salaries list, in the security clearances list, in the parking place assignments, etc.
Not all data contained within his "employee record" will be needed in all of these contexts. His expertise fields are not necessary for parking place allotment, and his marital status should not play a role in meeting room assignment: the separate subsystems and larger structures his entry is a part of don't require all of the data contained within his entry, just specific parts of it.
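A small sketch of that (all names hypothetical): one shared record, several independent subsystem structures that each hold a reference to it and read only the fields they care about.

#include <map>
#include <memory>
#include <string>
#include <vector>

struct EmployeeRecord {                 // hypothetical fields
    std::string name;
    std::string expertise;
    std::string maritalStatus;
    int         salary = 0;
};

int main() {
    auto john = std::make_shared<EmployeeRecord>(
        EmployeeRecord{"John", "networking", "married", 55000});

    // Each subsystem keeps its own structure around the *same* record,
    // and none of them needs every field of it.
    std::vector<std::shared_ptr<EmployeeRecord>> teamMembers{john};                      // team list: uses name
    std::multimap<int, std::shared_ptr<EmployeeRecord>> salaries{{john->salary, john}};  // payroll: uses salary
    std::map<int, std::shared_ptr<EmployeeRecord>> parkingBySlot{{42, john}};            // parking: only needs identity
}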

Related

Should I use multiple fact tables for each grain or just aggregate from lowest grain?

Fairly new to data warehouse design and star schemas. We have designed a fact table which stores various measures about Memberships; our grain is daily, and some of the measures in this table are things like qty sold new, qty sold renewing, qty active, and qty cancelled.
My question is this: the business will want to see the measures at other grains such as monthly, quarterly, yearly, etc., so would the typical approach here be to just aggregate the day-level data for whatever time period was needed, or would you recommend creating separate fact tables for the "key" time periods of our business requirements, e.g. monthly, quarterly, yearly? I have read some mixed information on this, which is mainly why I'm seeking others' views.
Some information I read had people embedding a hierarchy in the fact table to designate different grains, identified via a "level" type column. That was advised against by quite a few people and didn't seem good to me either; those advising against it were suggesting separate fact tables per grain. To be honest, I don't see why we wouldn't just aggregate from the daily entries we have. What benefits would we get from a fact table for each grain, other than maybe some slight performance improvements?
Each DataMart will have its own "perspective", which may require an aggregated fact grain.
Star schema modeling is a "top-down" process, where you start from a set of questions or use cases and build a schema that makes those questions easy to answer, not a "bottom-up" process where you start with the source data and figure out the schema design from there.
You may end up with multiple data marts that share the same granular fact table, but which need to aggregate it in different ways, either for performance, or to have a grain at which to calculate and store a measure that only makes sense at the aggregated grain.
E.g.
SalesFact(store, day, customer, product, quantity, price, cost)
and
StoreSalesFact(store, week, revenue, payroll_expense, last_year_revenue)
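To illustrate the "aggregate from the lowest grain" option (the field names follow the example above; the rollup itself is just a sketch of mine), a weekly store-level revenue figure can be derived from the daily rows by grouping on (store, week):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Daily-grain fact rows, mirroring SalesFact above (simplified).
struct SalesFact {
    std::string store;
    int         day = 0;                  // day number, for simplicity
    double      quantity = 0, price = 0, cost = 0;
};

// Weekly store-level rollup, mirroring StoreSalesFact (revenue only here).
struct StoreWeekRevenue {
    std::string store;
    int         week = 0;
    double      revenue = 0;
};

std::vector<StoreWeekRevenue> rollUpWeekly(const std::vector<SalesFact>& daily) {
    std::map<std::pair<std::string, int>, double> byStoreWeek;
    for (const auto& row : daily) {
        int week = row.day / 7;           // toy week bucketing
        byStoreWeek[{row.store, week}] += row.quantity * row.price;
    }
    std::vector<StoreWeekRevenue> out;
    for (const auto& [key, revenue] : byStoreWeek)
        out.push_back({key.first, key.second, revenue});
    return out;
}

A separate monthly or quarterly fact table only starts to pay off when this kind of rollup becomes too slow to do on demand, or when the aggregated grain carries measures (like payroll_expense in the example above) that don't exist at the daily grain.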

DWH and ETL explained

In this post I am not asking for tutorials or how to do something; I am asking for your help: could someone explain to me, in simple words, what a DWH (data warehouse) is and what ETL is?
Of course, I googled and searched YouTube a lot; I found many articles and videos, but I am still not very sure what they are.
Why am I asking?
I need to know this very well before applying for a job.
This answer by no means should be treated as a complete definition of a data warehouse. It's only my attempt to explain the term in layman's terms.
Transactional (operational, OLTP) and analytical (data warehouses) systems can both use the same RDBMS as the back-end and they may contain exactly the same data. However, their data models will be completely different, because they are optimized for different access patterns.
In transactional systems you usually work with a single row (e.g. a customer or an invoice) and write consistency is crucial, so the data model is normalized. In contrast, data warehouses are optimized for reading large numbers of rows (e.g. all invoices from the previous year) and aggregating data, so dimensional models are flattened (star schema, Kimball's dimensions and facts).
Transactional systems store only the current version of entities (i.e. current customer's address), while data warehouses may use slowly changing dimensions (SCD) to preserve history (e.g. all addresses of the customer with date ranges to indicate when each of them was valid).
ETL stands for extract, transform, load, and it is the procedure of:
extracting data from a transactional system,
transforming it into a dimensional format,
loading it into a data warehouse.
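A toy sketch of those three steps (the types and field names below are mine, not a real ETL tool): rows come out of the normalized transactional schema and are reshaped into flattened fact rows before loading.

#include <string>
#include <vector>

// Extract: rows as they come out of the normalized transactional system.
struct InvoiceRow {                     // hypothetical OLTP shape
    int         invoiceId = 0;
    int         customerId = 0;
    std::string customerName;
    std::string date;                   // e.g. "2024-01-31"
    double      amount = 0;
};

// Load target: a flattened fact row in the warehouse's dimensional model.
struct SalesFactRow {                   // hypothetical star-schema shape
    std::string date;
    int         customerKey = 0;
    double      amount = 0;
};

// Transform: reshape extracted rows into the dimensional format.
std::vector<SalesFactRow> transform(const std::vector<InvoiceRow>& extracted) {
    std::vector<SalesFactRow> facts;
    for (const auto& row : extracted)
        facts.push_back({row.date, row.customerId, row.amount});
    return facts;
}

// In a real pipeline, the extract step would read from the OLTP database
// and the load step would bulk-insert the fact rows into the warehouse.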

Format for representing GIS data

Is there an open data format for representing such GIS data as roads, localities, sublocalities, countries, buildings, etc.?
I expect that such a format would define an address structure and names for the components of an address.
What I need is a data format to return in response to reverse geocoding requests.
I looked for it on the Internet, but it seems that every geocoding provider defines its own format.
Should I design my own format?
Does my question make any sense at all? (I'm a newbie to GIS).
In case I have not made myself clear: I am not looking for data formats such as GeoJSON, GML or WKT, since they define geometry and don't define any address structure.
UPD. I'm experimenting with different geocoding services and trying to isolate them into a separate module. I need to provide one common interface for all of them, and I don't want to make up one more data format (because, on the one hand, I don't fully understand the domain, and on the other hand, the field itself seems to be well studied). The module's responsibility is to take a partial address (or coordinates) like "96, Dubininskaya, Moscow" and to return a data structure containing house number (96), street name (Dubininskaya), sublocality (Danilovsky rn), city (Moscow), administrative area (Moskovskaya oblast), and country (Russia). The problem is that in different countries there might be more or less division (more or fewer address components), and I need to unify these components across countries.
Nope, there is not, unfortunately.
Why, you may ask?
Because different nations and countries have vastly different formats and requirements for storing addresses.
Here in the UK, for example, defining a postcode has quite a complex set of rules, whereas ZIP codes in the US are simple 5-digit numbers.
Then you have to consider the question of what exactly constitutes an address. Again, this differs not just from country to country, but sometimes drastically within the same territory.
For example (here in the UK):
Smith and Sons Butchers
10 High street
Some town

Mr smith
10 High street
Some town

The Occupier
10 High Street
Some Town

Smith and Sons Butchers
High Street
Some Town
are all valid addresses in the UK, and in all cases the post would arrive at the correct destination; a GPS, however, may have trouble.
A GPS database might be set up so that each building is a square bit of geometry, with the ID being the house number.
That would give us the ability to say exactly where number 10 is, which means the last lookup above is immediately going to fail.
Plots may be indexed by name of business; again, that's fine until you start using personal names or generic titles.
There's so much variation, that it's simply not possible to create one unified format that can encompass every possible rule required to allow any application on the planet to format any geo-coded address correctly.
So how do we solve the problem?
Simple, by narrowing your scope.
Deal ONLY with a specific set of defined entities that you need to work with.
Hold only the information you need to describe what you need to describe (Always remember YAGNI* here)
Use standard data transmission formats such as JSON, XML and CSV; this increases your chances of having to do less work on code you don't control to get it to read your data output (see the sketch after this list)
(* YAGNI = You ain't gonna need it)
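For instance, a narrowly scoped address record for a reverse-geocoding response might look like this (a sketch with made-up field names, not any standard; components a country doesn't use simply stay unset, and the JSON emitter below does no escaping):

#include <iostream>
#include <optional>
#include <sstream>
#include <string>

// A deliberately narrow address structure: only the components this
// hypothetical application cares about; missing ones stay unset.
struct Address {
    std::optional<std::string> houseNumber;   // "96"
    std::optional<std::string> street;        // "Dubininskaya"
    std::optional<std::string> sublocality;   // "Danilovsky rn"
    std::optional<std::string> city;          // "Moscow"
    std::optional<std::string> adminArea;     // "Moskovskaya oblast"
    std::optional<std::string> country;       // "Russia"
};

// Emit only the components that are present, as a small JSON object.
std::string toJson(const Address& a) {
    std::ostringstream out;
    out << "{";
    bool first = true;
    auto field = [&](const char* name, const std::optional<std::string>& v) {
        if (!v) return;
        if (!first) out << ",";
        out << "\"" << name << "\":\"" << *v << "\"";
        first = false;
    };
    field("house_number", a.houseNumber);
    field("street", a.street);
    field("sublocality", a.sublocality);
    field("city", a.city);
    field("admin_area", a.adminArea);
    field("country", a.country);
    out << "}";
    return out.str();
}

int main() {
    Address a{"96", "Dubininskaya", "Danilovsky rn", "Moscow", "Moskovskaya oblast", "Russia"};
    std::cout << toJson(a) << "\n";
}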
Now, to dig in deeper however:
When it comes to actual GIS data, there are a lot of standard file formats; the 3 most common are:
Esri Shape Files (*.shp)
Keyhole Markup Language (*.kml)
Comma separated values (*.csv)
All of the mainstay GIS packages, free and paid-for, can work with any of these 3 file types, and many more.
Shapefiles are by far the most common ones you're going to come across; just about every bit of geospatial data I've come across in my years in IT has been in a shapefile. I would, however, NOT recommend storing your data in them for processing: they are quite a complex format, often slow and sequential to access.
If your geometry files are to be consumed by other systems, however, you can't go wrong with them.
They also have the added bonus that you can attach attributes to each item of data too, such as address details, names etc.
The problem is, there is no standard as to what you would call the attribute columns, or what you would include, and probably more drastically, the column names are restricted to UPPERCASE and limited to 32 chars in length.
KML files are another format that's quite universally recognized, and because they're XML-based and used by Google, you can include a lot of extra data in them that is technically self-describing to the machine reading it.
Unfortunately, file sizes can be incredibly bulky even for just a handful of simple geometries; the trade-off, though, is that they are pretty easy to handle in just about any programming language on the planet.
And that brings us to the humble CSV.
The mainstay of data transfer (not just geospatial) since time began.
If you can put your data in a database table or a spreadsheet, then you can put it in a CSV file.
Again, there are no standards, other than how columns may or may not be quoted and what the separators are, and readers have to know ahead of time what each column represents.
Also, there's no "pre-made" geographic storage element (in fact there are no data types at all), so your reading application will also need to know ahead of time what the column data types are meant to be, so it can parse them appropriately.
On the plus side, however, EVERYTHING can read them; whether they can make sense of them is a different story.
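To make that concrete (a minimal sketch; the id,lat,lon,name column layout is an assumption, not any standard): the reader has to hard-code which columns exist and what types they hold.

#include <iostream>
#include <sstream>
#include <string>

// The reader must know, out of band, that each line is: id,lat,lon,name
struct Point {
    long        id = 0;
    double      lat = 0, lon = 0;
    std::string name;
};

Point parseLine(const std::string& line) {
    std::istringstream in(line);
    std::string field;
    Point p;
    std::getline(in, field, ','); p.id  = std::stol(field);
    std::getline(in, field, ','); p.lat = std::stod(field);
    std::getline(in, field, ','); p.lon = std::stod(field);
    std::getline(in, p.name);               // rest of the line (no quoting handled)
    return p;
}

int main() {
    Point p = parseLine("42,51.5074,-0.1278,London");
    std::cout << p.name << " at " << p.lat << "," << p.lon << "\n";
}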

How to design database to store and retrieve large item/skill lists in ruby

I am planning a role-playing game where characters are supposed to carry/use items and train skills. When it comes to storing the (possibly numerous) items/skills possessed by characters, I can't think of a better way than adding a row for every possible item and skill for each character instantiated. However, this seems like overkill to me.
To be clear, if this were an exercise or a small game where the total number of items/skills is ~30, I would add an items hash and a skills hash to the character class, plus methods to add and remove them, like:
def initialize
  @inventory = Hash.new(0)  # item => count, defaults to 0
  @skills = Hash.new(0)     # skill => level, defaults to 0
end

def add_item(item, number)
  @inventory[item] += number
end
Given that I would like to store the number of items and the levels of the skills, what else can I try in order to handle ~1000 possible items (with maybe ~150 in an inventory) and possibly 100 skills?
Plan for Data Retrieval
Generally, it's a good idea to design your database around how you plan to look up and retrieve your data, rather than how you want to store it. A bad design makes your data very expensive to collect from the database.
In your example, having a separate model for each inventory item or skill would be hugely expensive in terms of lookups whenever you want to load a character. Do you really want to do 1,000 lookups every time you load someone's inventory? Probably not.
Denormalize for Speed
You typically want to normalize data that needs to be consistent, and denormalize data that needs to be retrieved/updated quickly. One option might be to serialize your character attributes.
For example, it should be faster to store a serialized Character#inventory_items field than to update 100 separate records with a has_many :through or has_and_belongs_to_many relationship. There are certainly trade-offs involved with denormalization in general and serialization in particular, but it might be a good fit for your specific use case.
Consider a Document Database
Character sheets are documents. Unless you need the relational power of a SQL database, a document-oriented database might be a better fit for the data you want to manage. CouchDB seems particularly well-suited for this example, but you should certainly evaluate all your NoSQL options to see if any offer the features you need. Your mileage will definitely vary.
Always Benchmark
Don't take my word for what's optimal. Try a design. Benchmark it. See what the design does with your data. In the end, that's the only thing that matters.
I can't think of a better way than putting a row for every possible item and skill to each character instantiated.
Do characters evolve independently?
Assuming yes, there is no other choice but to have each and every relevant combination physically represented in the database.
If not, then you can "reuse" the same set of items/skills for multiple characters, but this is probably not what is going on here.
In any case, relational databases are very good at managing huge amounts of data and the numbers you mentioned don't even qualify as "huge". By correctly utilizing techniques such as clustering, you can ensure that a lookup of all items/skills for a given character is done in a minimal number of I/O operations, i.e. very fast.

Efficient searching in huge multi-dimensional matrix

I am looking for a way to search in an efficient way for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
class Combination
{
public:
    Product  *product;
    Customer *customer;
    Day      *day;
    Employee *employee;
    Payment  *payment;

    // Needed so Combination can be used as a std::map key:
    bool operator<(const Combination &other) const;
};
std::map<Combination, quantity> data;
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
Are there any data structures that could help in handling this kind of efficient searches?
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save this data into my database or not. So the data itself changes all the time. While performing these simulations, as the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine having any other possible choice besides running qsort() on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries.
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes; just use a collection that can hold any of the objects.
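A rough in-memory sketch of that idea (the field and index names below are mine): keep one flat collection of detailed sale records and maintain a hash-based aggregate per query pattern, updating them as the simulation inserts or removes records.

#include <string>
#include <unordered_map>
#include <vector>

struct Sale {                     // one detailed record, as in the question
    int  productId = 0, customerId = 0, dayId = 0, employeeId = 0, paymentId = 0;
    long quantity = 0;
};

// Assumes ids are non-negative and fit in 32 bits, so two of them
// can be packed into one 64-bit key.
static long long pack(int a, int b) {
    return (static_cast<long long>(a) << 32) | static_cast<unsigned int>(b);
}

struct Indexes {
    std::unordered_map<long long, long> qtyByEmployeeDay;    // "sold by employee X on day Y"
    std::unordered_map<int, long>       qtyByCustomer;       // "bought by customer X"
    std::unordered_map<long long, long> qtyByProductPayment; // "product X paid by method Y"

    void add(const Sale& s) {
        qtyByEmployeeDay[pack(s.employeeId, s.dayId)]       += s.quantity;
        qtyByCustomer[s.customerId]                         += s.quantity;
        qtyByProductPayment[pack(s.productId, s.paymentId)] += s.quantity;
    }
    void remove(const Sale& s) {  // keep the aggregates in sync when the simulation changes data
        qtyByEmployeeDay[pack(s.employeeId, s.dayId)]       -= s.quantity;
        qtyByCustomer[s.customerId]                         -= s.quantity;
        qtyByProductPayment[pack(s.productId, s.paymentId)] -= s.quantity;
    }
};

Each additional query pattern costs one more index and a little more bookkeeping on every insert/delete, which is essentially the same trade-off a database index makes.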
