Accurately averaging data over time - reporting

I'm looking for a way to quickly obtain an accurate average inventory for reporting purposes.
BACKGROUND:
I'm looking for a way to determine Gross Margin Return On Investment (GMROI) for inventory items where the inventory levels are not constant over time (i.e. some items may be out of stock and then overstocked, whilst others will be constant and never out of stock):
GMROI = GrossProfit / AverageInventory
say, over 1 year.
These need to be obtained on the fly, batch processing is not an option.
THE PROBLEM:
The relational database used only holds current stock levels.
I can calculate back to a historic stock level, say:
HistoricStock = CurrentStock - Purchases + Sales
But I really want an average inventory, not just a single point in time.
I could calculate back a series of points and then average them, but I'm worried about the calculation overhead (and, to a lesser extent, the accuracy), given I want to do this on the fly.
I could create a data warehouse and bank the data, but I'm concerned about blowing out the database size (i.e. stock holding per barcode per location per day for, say, 2 years).
From memory, the integral of the inventory/time graph divided by the time interval gives the average inventory, but how do you integrate real-world data without a formula, or without lots of small time strips?
Any ideas or references would be appreciated.
Thanks B.

In general this seems like a good case for developing an inventory fact table, but exactly how you would implement it depends a lot on your data and source systems.
If you haven't already, I would get the Data Warehouse Toolkit; chapter 3 is about inventory data management. As you mentioned, you can create an inventory fact table and load a daily snapshot of inventory levels from the source system, then you can easily calculate whatever averages you need from the data warehouse, not from the source system.
You mentioned that you're concerned about the volume of data, although you didn't say how many rows per day you would add. But data warehouses can be designed to handle very large tables using table partitioning or similar techniques, and you could also calculate "running averages" after adding each day's data if the calculation takes a very long time for any reason.
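If some items still have to be computed on the fly straight from the source system, it may help that inventory is a step function between stock movements: the integral mentioned in the question collapses to a sum of level-times-duration terms, so no small time strips are needed. Below is a minimal C++ sketch of that idea; the StockMovement type, its field names, and the backwards walk from the current level are illustrative assumptions, not something from either post.

#include <algorithm>
#include <vector>

// Illustrative sketch: time-weighted average inventory over a reporting
// window, reconstructed backwards from the current level and the dated
// stock movements inside that window.
struct StockMovement {
    double time;   // days since the start of the window, 0 <= time <= windowLength
    double delta;  // +quantity purchased, -quantity sold
};

double averageInventory(double currentLevel,
                        std::vector<StockMovement> movements,
                        double windowLength) {
    // Inventory is a step function, so the integral is the sum of
    // (level * duration that level was held).
    std::sort(movements.begin(), movements.end(),
              [](const StockMovement& a, const StockMovement& b) {
                  return a.time > b.time;  // latest movement first
              });
    double level = currentLevel;
    double t = windowLength;
    double area = 0.0;
    for (const auto& m : movements) {
        area += level * (t - m.time);  // level held from m.time until t
        level -= m.delta;              // undo the movement to step back in time
        t = m.time;
    }
    area += level * t;                 // level held from the window start to the first movement
    return area / windowLength;
}

With this approach the accuracy depends only on capturing every movement in the window, not on how many sample points you take.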


Oracle partitioning recommendations

Due to being locked down by Corona, I don't have easy access to my more knowledgeable colleagues, so I'm hoping for a few possible recommendations here.
We do quarterly and yearly "freezes" of a number of statistical entities with a large number (1-200) of columns. Everyone then uses these "frozen" versions as a common basis for all statistical releases in Denmark. Currently, we simply create a new table for each version.
There's a demand to test if we can consolidate these several hundred tables to 26 entity-based tables to make programming against them easier, while not harming performance too much.
A "freeze" is approximately 1 million rows and consists of: Year + Period + Type + Version.
For example:
2018_21_P_V1 = Preliminary Data for 2018 first quarter version 1
2019_41_F_V2 = Final Data for 2019 yearly version 2
I am simply not very experienced in the world of partitions. My initial thought was to partition on Year + Period and subpartition on Type + Version, but I am no longer sure this is the right approach, nor do I have a clear picture of which partitioning type would solve the problem best.
I am hoping someone can recommend an approach as it would help me tremendously and save me a lot of time "brute force" testing a lot of different combinations.
Based on the situation you explained, I highly recommend that you use partitioning. No doubt about it.
It's highly effective and easy to use. You can read the Oracle documentation on partitioning, or search the web, to understand how to get started.
In general, when you partition a table, Oracle treats each partition as a separate segment, so as long as your queries can be pruned to the relevant partitions you don't need to worry about the speed of fetching data.
The most important step is to choose the best field(s) to partition on. I have used a numeric date key such as 20190506 for daily partitions, or 201907 for monthly ones. You should design and test it.
The next step is to decide on sub-partitions. In some cases you don't really need them; it depends on your data structure and your expectations of the data. What do you want to do with the data? Which fields are the most important (used in WHERE clauses, ...)?
Then create local indexes for each partition. This is very important.
Another important point is that partitioning may change the way you write your PL/SQL. A query whose predicates pin down the partition key can be pruned to a single partition, whereas a query that spans many partitions has to visit each of them, so filter on the partition key wherever you can.
And don't worry about 1 million rows. I have used partitioning for tables far larger than that and it works fine.
Good luck!

Business Intelligence Datasource Performance - Large Table

I use Tableau and have a table with 140 fields. Due to the size/width of the table, the performance is poor. I would like to remove fields to increase reading speed, but my user base is so large that at least one person uses each of the fields, while 90% use the same ~20 fields.
What is the best solution to this issue? (Tableau is our BI tool, BigQuery is our database)
What I have done thus far:
In Tableau, it isn't clear how to use dynamic data sources that change based on the field selected. Ideally, I would like to have smaller views OR denormalized tables. As users make their selections in Tableau, the underlying data source would switch to the table or view containing that field.
I have tried a simple version of a large view, but it performed worse than my large table and read significantly more data (remember, I am on BigQuery, so I care very much about bytes read due to costs).
Suggestion 1: Extract your data.
Especially when it comes to data sources that charge per byte queried (BigQuery, Athena, etc.), extracts make a great deal of sense. How well this works depends on how 'fresh' the data must be for the users. (Of course all users will say 'live is the only way to go', but dig into this a little and see what it might actually be.) Refreshes can be scheduled for as little as every 15 minutes. The real power of refreshes comes in the form of 'incremental refreshes', whereby only new records are added (along an index of int or date). This is a great way to reduce costs, provided your BigQuery database is partitioned (which it should be). Since Tableau extracts are contained within .hyper files, a structure of Tableau's own design and under its control, they are extremely fast and optimized perfectly for use in Tableau.
Suggestion 2: Create 3 data sources (or more). Certify these data sources after validating that they provide correct information. Provide users with clear descriptions.
1. Original large dataset.
2. Subset of ~20 fields for the 90%.
3. Remainder of fields for the 10%.
4. Extract of 1.
5. Extract of 2.
6. Extract of 3.
Importantly, if field names match in each data source (i.e. they are never changed manually), then it should be easy for a user to 'scale up' to a larger dataset as needed. This means they could generally start out with a small subset of data for their exploration, and then use the 'replace data source' feature to switch to a different data source while keeping the same views. (This wouldn't work as well, if at all, for scaling down, though.)

Practical importance of efficient sorting algorithms

I've been looking around trying to learn of any practical applications where sorting is needed and its efficiency matters, but couldn't find anything.
The only examples I could find or think of either did not need total sorting (like when looking for the 100 top results or for the median) or sorting efficiency was hardly important (like when sorting a spreadsheet of student names or past transactions once a year).
When sorting web search results, only a few dozens of top ranked results need to be found and sorted, not all of the Internet, so classical sorting algorithms are not needed or practical.
When sorting a spreadsheet, it hardly matters whether it is sorted by a triple-pivot Las Vegas randomised quicksort or by insertion sort.
Using sorted arrays as sets or associative arrays seems to be practically less efficient than using hash tables.
So my question is: what are practical ("real-life") examples where a total sorting is required and its efficiency is a bottleneck? I am particularly curious about applications for comparison sorting.
Update.
I've stumbled upon this phrase in lecture notes by Steven Skiena:
Computers spend more time sorting than anything else, historically 25% on mainframes.
With some details, that could make a perfect answer to my question. Where can I find the source for this statistic, ideally with some details about the kinds and applications of sorting done on mainframes?
In some graphics rendering algorithms, objects need to be drawn in back to front order. A good example is transparent particles: there can be hundreds of thousands of them, and because of the transparency, traditional depth buffering doesn't work. So you need to sort these particles by distance from the camera, and keep them sorted, at 60 frames per second.
Interestingly, if the order of the particles doesn't change much (relatively slow particle motion, little camera movement), then the array of particles will already be "mostly sorted" in the next frame, and a simple bubble sort or insertion sort can actually work fine. But on frames where many particles are created, or the camera moves quickly, sort performance can become important, simply because there are so many other things to do each frame.
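As a rough illustration of the "mostly sorted" case, here is a minimal insertion sort over a particle array, ordered back to front by distance from the camera; the Particle type and its field name are placeholders, not taken from any specific engine.

#include <vector>

// Illustrative sketch: re-sorting a mostly sorted particle array each frame.
// Insertion sort is close to O(n) when only a few elements are out of place.
struct Particle {
    float distanceFromCamera;  // recomputed each frame
};

void sortBackToFront(std::vector<Particle>& particles) {
    for (size_t i = 1; i < particles.size(); ++i) {
        Particle p = particles[i];
        size_t j = i;
        // Back-to-front drawing means larger distances come first.
        while (j > 0 && particles[j - 1].distanceFromCamera < p.distanceFromCamera) {
            particles[j] = particles[j - 1];
            --j;
        }
        particles[j] = p;
    }
}

When only a few particles have crossed each other since the previous frame, the inner loop rarely runs and the whole pass stays close to linear.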
Imagine you have a daily list of transactions (deposits and withdrawals) for bank accounts. There are millions of accounts and millions of transactions per day. Each night, you have to update the accounts to reflect those transactions, compute the interest accrued that day, and print a report, ordered by account, that shows each account with its daily activity.
One way to do that is to go through the list sequentially, reading a transaction and updating the account in the database. That will work, but it has several drawbacks, including:
1. If there are many transactions for a single account, you pay the price of retrieving and updating the account for every transaction. Considering that a business account can have thousands of transactions per day, those costs add up.
2. The typical rule is that deposits are recorded before withdrawals, so as to prevent an overdraft. If an account's balance is 0 and the transactions list has a withdrawal of $5 ahead of a $10 deposit, the system will record an overdraft when it shouldn't.
3. Printing the report would require a separate scan of the database after all transactions are recorded.
The solution to those problems is to sort the transactions list by account and type (deposits first). Then, the update is a simple merge operation. You read the database and the transactions list in account number order, apply any transactions for that account, compute interest, print the output line, and write the updated record to the database.
The result is much faster than doing a read-update-write for every single transaction, and it eliminates problems #2 and #3 outlined above. Sort-and-merge makes the difference between the update taking all night and the update taking a few hours.
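A stripped-down sketch of that sort-and-merge pass, assuming the account master is already stored in account-number order and every transaction belongs to an account in the master; the Account and Transaction types are illustrative, and the interest calculation and report printing are elided.

#include <algorithm>
#include <tuple>
#include <vector>

// Illustrative sort-and-merge batch update: one sort of the transaction
// list, then a single sequential pass over the account master.
struct Transaction {
    long accountId;
    int  type;       // 0 = deposit, 1 = withdrawal, so deposits are applied first
    double amount;
};

struct Account {
    long accountId;  // assumed stored in ascending account-number order
    double balance;
};

void nightlyUpdate(std::vector<Account>& accounts,
                   std::vector<Transaction>& transactions) {
    std::sort(transactions.begin(), transactions.end(),
              [](const Transaction& a, const Transaction& b) {
                  return std::tie(a.accountId, a.type) <
                         std::tie(b.accountId, b.type);
              });
    size_t t = 0;
    for (Account& acct : accounts) {
        // All transactions for this account are now contiguous, deposits first.
        while (t < transactions.size() &&
               transactions[t].accountId == acct.accountId) {
            const Transaction& tx = transactions[t++];
            acct.balance += (tx.type == 0 ? tx.amount : -tx.amount);
        }
        // ...compute the day's interest and print the report line for acct here...
    }
}

The single O(n log n) sort up front is what makes the rest of the run a pair of sequential scans.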
Also, MapReduce (and Hadoop), used for processing big data, makes good use of sorting. Those programming models simply would not be possible without high-performance sorting algorithms.
Any time you need to merge multiple large data streams into a single output stream (and those applications are legion), the sort-and-merge approach is useful. There are times when other techniques might be faster, but the sort-and-merge is reliable and durable, and, as shown by MapReduce, scales well.

Creating the N most accurate sparklines for M sets of data

I recently constructed a simple name popularity tool (http://names.yafla.com) that allows users to select names and investigate their popularity over time and by state. This is just a fun project and has no commercial or professional value, but solved a curiosity itch.
One improvement I would like to add is the display of simple sparklines beside each name in the select list, showing the normalized national popularity trends since 1910.
Doing an image request for every single name -- where hypothetically I've preconstructed the sparklines for every possible variant -- would slow the interface too much and yield a lot of unnecessary traffic as users quickly scroll and filter past hundreds of names they aren't interested in. Building sprites with sparklines for sets of names is a possibility, but again, with tens of thousands of names, the user's cache would end up burdened with a lot of unnecessary information.
My goal is absolutely tuned minimalism.
Which got me contemplating the interesting challenge of taking M sets of data (occurrences over time) and distilling them into the N closest representative sparklines. For this purpose they don't have to be exact, but they should be a general match, and I could tune N to hit a certain accuracy target.
Essentially a form of sparkline lossy compression.
I feel like this most certainly is a solved problem, but can't find or resolve the heuristics that would yield the algorithms that would shorten the path.
What you describe seems to be cluster analysis - e.g. shoving that into Wikipedia will give you a starting point. Particular methods for cluster analysis include k-means and single linkage. A related topic is Latent Class Analysis.
If you do this, another option is to look at the clusters that come out, give them descriptive names, and then display the cluster names rather than inaccurate sparklines - or I guess you could draw not just a single line in the sparkline, but two or more lines showing the range of popularities seen within that cluster.
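If k-means looks like a fit, a bare-bones sketch over the normalized popularity curves could look like the following; the Curve alias, the seeding choice, and the fixed iteration count are all simplifying assumptions for illustration.

#include <limits>
#include <vector>

// Illustrative k-means over normalized popularity curves: each of the M
// names is assigned to one of N representative sparklines (the centroids).
using Curve = std::vector<double>;  // one popularity value per year, normalized

static double squaredDistance(const Curve& a, const Curve& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Returns the representative index chosen for each curve and fills
// `centroids` with the N representative curves. Assumes n <= curves.size()
// and that all curves span the same years. Seeding with the first n curves
// is a simplification; k-means++ would be a better choice.
std::vector<size_t> kMeansSparklines(const std::vector<Curve>& curves,
                                     size_t n, size_t iterations,
                                     std::vector<Curve>& centroids) {
    centroids.assign(curves.begin(), curves.begin() + n);
    std::vector<size_t> assignment(curves.size(), 0);
    for (size_t iter = 0; iter < iterations; ++iter) {
        // Assignment step: nearest centroid for each curve.
        for (size_t i = 0; i < curves.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (size_t c = 0; c < n; ++c) {
                const double d = squaredDistance(curves[i], centroids[c]);
                if (d < best) { best = d; assignment[i] = c; }
            }
        }
        // Update step: each centroid becomes the mean of its members.
        std::vector<Curve> sums(n, Curve(curves[0].size(), 0.0));
        std::vector<size_t> counts(n, 0);
        for (size_t i = 0; i < curves.size(); ++i) {
            for (size_t j = 0; j < curves[i].size(); ++j)
                sums[assignment[i]][j] += curves[i][j];
            ++counts[assignment[i]];
        }
        for (size_t c = 0; c < n; ++c) {
            if (counts[c] == 0) continue;  // keep the old centroid if its cluster emptied
            for (size_t j = 0; j < sums[c].size(); ++j)
                centroids[c][j] = sums[c][j] / counts[c];
        }
    }
    return assignment;
}

The sparkline shown next to each name is then simply its cluster's centroid, and increasing N trades cache size for accuracy.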

Efficient searching in huge multi-dimensional matrix

I am looking for a way to search in an efficient way for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
class Combination
{
public:
    Product  *product;
    Customer *customer;
    Day      *day;
    Employee *employee;
    Payment  *payment;

    // std::map needs a strict weak ordering over its keys, so Combination
    // must either define operator< or the map must be given a comparator.
    bool operator<(const Combination &other) const;
};
std::map<Combination, quantity> data;
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
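To make the "maps with different comparison functions" idea concrete, here is a rough sketch using that integer-id variant of the key; CombinationIds, the quantity alias, and the two orderings are illustrative only.

#include <map>
#include <tuple>

// Illustrative sketch: one key type, several orderings. Each map sorts on a
// different sequence of dimensions, so range scans on that map's leading
// fields are cheap.
using quantity = double;  // stand-in for the real quantity type

struct CombinationIds {
    int product, customer, day, employee, payment;
};

struct ByEmployeeDay {
    bool operator()(const CombinationIds& a, const CombinationIds& b) const {
        return std::tie(a.employee, a.day, a.product, a.customer, a.payment) <
               std::tie(b.employee, b.day, b.product, b.customer, b.payment);
    }
};

struct ByCustomer {
    bool operator()(const CombinationIds& a, const CombinationIds& b) const {
        return std::tie(a.customer, a.product, a.day, a.employee, a.payment) <
               std::tie(b.customer, b.product, b.day, b.employee, b.payment);
    }
};

std::map<CombinationIds, quantity, ByEmployeeDay> totalsByEmployeeDay;
std::map<CombinationIds, quantity, ByCustomer>    totalsByCustomer;

Each extra ordering costs another copy of the keys, which is exactly the memory concern above.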
Are there any data structures that could help in handling this kind of efficient search?
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save the data into my database or not. So the data itself changes all the time. While performing these simulations, as the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine any other possible choice besides running qsort() on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries.
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes; just use a collection that can hold any of the objects.
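If the data does have to stay in plain in-memory structures, one possible sketch (entirely illustrative, not from either answer above) is to keep every sale in one flat vector and maintain a secondary index per dimension, so each aggregate question only scans the relevant bucket.

#include <unordered_map>
#include <vector>

// Illustrative sketch: one flat store of sales plus one secondary index per
// dimension (integer ids, as suggested in the question). An aggregate over a
// single dimension scans only that dimension's bucket and filters on the
// remaining fields.
struct Sale {
    int product, customer, day, employee, payment;
    double quantity;
};

class SalesIndex {
public:
    void add(const Sale& s) {
        const size_t pos = sales_.size();
        sales_.push_back(s);
        byProduct_[s.product].push_back(pos);
        byCustomer_[s.customer].push_back(pos);
        byEmployee_[s.employee].push_back(pos);
    }

    // e.g. "the total quantity sold by an employee on a certain day"
    double totalByEmployeeOnDay(int employee, int day) const {
        const auto it = byEmployee_.find(employee);
        if (it == byEmployee_.end()) return 0.0;
        double total = 0.0;
        for (const size_t pos : it->second)
            if (sales_[pos].day == day) total += sales_[pos].quantity;
        return total;
    }

private:
    std::vector<Sale> sales_;
    std::unordered_map<int, std::vector<size_t>> byProduct_;
    std::unordered_map<int, std::vector<size_t>> byCustomer_;
    std::unordered_map<int, std::vector<size_t>> byEmployee_;
};

Deletions during a simulation would need tombstones or index maintenance, which is roughly the point at which an embedded in-memory database or a multi-index container library starts to pay for itself.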
