Time Calendar Data Structure - data-structures

We are looking at updating (rewriting) our system, which stores information about when people can reserve rooms etc. during the day. Right now we store the start and end times and the date a room is available in one table, and in another we store the individual appointment times.
On the surface it seemed like a logical way to store the information, but as time progressed and the system came under heavy load, we began to realize that this data structure is inefficient: it becomes an intensive operation to search all rooms for available times, calculate when each room is free, and, if a room is free at a given time, decide whether that free period is long enough to accommodate the requested appointment.
We have gone around in circles about how to make the system more efficient, and we feel there has to be a better way to approach this. Does anyone have suggestions, or know of good resources on how to build something like this?

I found this book to be inspiring and a must-read for any kind of database involving time management/constraints:
Developing Time-Oriented Database Applications in SQL
(Added by editor: the book is available online via Richard Snodgrass's home page. It is a good book.)

#Radu094 has pointed you to a good source of information, but it will be tough going to work through it.
At a horribly pragmatic level, have you considered recording appointments and availability information in a single table, rather than in two tables? For each day, slice the time up into 'never available' (before the office opens, after the office closes - if such a thing happens), 'available - can be allocated', and 'not available'. These (two or) three classes of bookings would be recorded as contiguous intervals, with a start and end time for each interval in a single record.
For each room and each date, it is necessary to create a set of 'not in use' bookings (depending on whether you go with 'never available', the set might be one 'available' record or it might include the early shift and late shift 'never available' records too).
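To make that single-table idea a bit more concrete, here is a minimal sketch (in Python rather than SQL, with invented names such as `Interval` and `slice_day`, so purely illustrative) of one room's day held as contiguous, classified intervals:

```python
from dataclasses import dataclass
from datetime import time

# Each row of the combined table becomes one record like this: a contiguous
# interval for one room on one date, classified by status.
@dataclass
class Interval:
    room_id: int
    status: str          # 'never_available', 'available', or 'booked'
    start: time
    end: time            # half-open: [start, end)

def slice_day(room_id, open_at, close_at, bookings):
    """Build the full day for one room: 'never_available' outside office
    hours, 'booked' for each appointment, 'available' for the gaps."""
    intervals = [Interval(room_id, 'never_available', time(0, 0), open_at)]
    cursor = open_at
    for b_start, b_end in sorted(bookings):
        if cursor < b_start:
            intervals.append(Interval(room_id, 'available', cursor, b_start))
        intervals.append(Interval(room_id, 'booked', b_start, b_end))
        cursor = b_end
    if cursor < close_at:
        intervals.append(Interval(room_id, 'available', cursor, close_at))
    # 24:00 cannot be represented by datetime.time, so 23:59 stands in for
    # end-of-day in this sketch.
    intervals.append(Interval(room_id, 'never_available', close_at, time(23, 59)))
    return intervals

# Example: room 1 is open 09:00-17:00 and has one 10:00-11:00 booking.
for iv in slice_day(1, time(9), time(17), [(time(10), time(11))]):
    print(iv)
```

Each `Interval` would correspond to one row of the combined table, so availability questions become interval lookups rather than joins between two tables.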
Then you have to work out what questions you are asking. For example:
Can I book Room X on Day Y between T1 and T2?
Is there any room available on Day Y between T1 and T2?
At what times on Day Y is Room X still available?
At what times on Day Y is a room with audio-visual capabilities and capacity for 12 people available?
Who has Room X booked during the morning of Day Y?
This is only a small subset of the possibilities. But with some care and attention to detail, the queries become manageable. Validating the constraints in the DBMS will be harder. That is, ensuring that if the time [T1..T2) is booked, then no-one else books [T1+00:01..T2-00:01) or any other overlapping period. See Allen's Interval Algebra at Wikipedia and other places (including this one at uci.edu).
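To make the overlap constraint concrete, assuming half-open intervals [start, end) as above: two bookings collide exactly when each one starts before the other ends. A tiny sketch (the function name `overlaps` is mine, not from the answer above):

```python
from datetime import time

def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open intervals [a_start, a_end) and [b_start, b_end)
    share any time at all (this one test covers Allen's 'overlaps',
    'during', 'starts', 'finishes' and 'equals' relations)."""
    return a_start < b_end and b_start < a_end

# [10:00, 12:00) is booked; [11:59, 13:00) must therefore be rejected,
# while [12:00, 13:00) is fine because the intervals are half-open.
print(overlaps(time(10), time(12), time(11, 59), time(13)))  # True
print(overlaps(time(10), time(12), time(12), time(13)))      # False
```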

Related

Implementing Trending in Elasticsearch

I'm building a project that indexes celebrity-related content across sites (tmz, people, etc) because I always thought that it would be funny to "bet" on people (and maybe shows, directors, etc) like horse racing or the stock market -- only, you know, not with real money -- where the value of the person changes day to day and hour to hour and even minute to minute if we can figure this out together, stack overflow denizens.
I assign traffic values to users based on mentions in social media. I have some scrapers (probably violating some TOSes) and access to Twitter's API to get relative counts for search results for a time, so I have known "numbers" to associate w/ users outside of elasticsearch for periods of time to build the trends. Now to be clear, I am not looking to implement trending based on the number of documents in the system, that actually stays pretty consistent, but I need to rank documents that already exist based on trends.
So that's what I've got: a few hundred thousand articles with pre-determined associations to individual celebrities. Data for on-the-minute associations of a score to those celebrities which are then merged and applied to each article so that each article has a few scores associated (there's some complexity here that does not matter, but the bottom line is that I have 10 or so values that I want to assign to content to sort it when you're on the market page and I want to sort those w/ a function or script score).
So the question: How the heck do I assign these values without making elasticsearch go crazy with re-indexing? I need to use these values to sort dozens of requests per second coming from feeds on the site, but I am running this on a raspberry pi... literally, I've maxed the poor thing out for memory.
We're real write heavy, but if for some reason the celebrity stock market takes off, we're also real read heavy at the same time. I swear I remember a plugin that had metadata associated with content, but I cannot find it.
I've tried enable=false and index=false, but they seem to still thrash the read times while writing the updates. The best I've gotten to is slowing down the refresh_interval, but that's still pretty expensive and starts to affect the "real-time" nature of the app.
I believe that this is impossible as you've laid it out. Any updates to a field will update _source and fire the full update process.
There are some alternatives that you might consider:
Replication, if another cluster is available
A separate write index on the same cluster, space allowing
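If neither of those is available, one further workaround, which is my own suggestion rather than anything from the answer above, is to keep the per-celebrity scores out of the index entirely and re-rank the already-matched articles in application code, so a score change never touches Elasticsearch. A rough sketch with invented names (`articles`, `scores`, `rank`):

```python
# Hypothetical: articles already fetched from Elasticsearch (IDs plus the
# celebrity each one is associated with); scores live in a plain dict that
# is refreshed every minute without touching the index.
articles = [
    {"id": "a1", "celebrity": "celeb_x"},
    {"id": "a2", "celebrity": "celeb_y"},
    {"id": "a3", "celebrity": "celeb_x"},
]
scores = {"celeb_x": 0.82, "celeb_y": 0.95}  # updated out-of-band

def rank(hits, score_table):
    """Sort hits by an externally maintained score instead of a field
    stored in the index, so score updates never trigger re-indexing."""
    return sorted(hits, key=lambda h: score_table.get(h["celebrity"], 0.0),
                  reverse=True)

for hit in rank(articles, scores):
    print(hit["id"], scores[hit["celebrity"]])
```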

Can I use a spreadsheet to calculate profit/loss and current position for trading of shares?

I have data about some trades of a share (currently only 20 transactions as a learning subset, with the full set covering around 50,000 transactions for different shares).
I'm trying to work out how to determine the profit for each trade, as well as the current position. Not all the held shares are sold in 1 trade, so there is sometimes a carry over (whole and fractional values). Sometimes several buys and sells occur, rather than buy/sell/buy/sell.
I'm NOT the one trading! (Just to make that clear).
I suppose, what I'm trying to get to is "what am I actually asking about?".
I suspect there is a perfectly standardised way of doing this, but I don't know what that is.
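For what it's worth, one widely used convention for exactly this carry-over problem is FIFO matching: each sale is matched against the oldest remaining purchased lot, partial and fractional lots included. A minimal sketch (invented trade data, and FIFO is only one possible convention) that could equally be translated into spreadsheet formulas:

```python
from collections import deque

def fifo_profits(trades):
    """trades: list of ('BUY'|'SELL', quantity, price_per_share).
    Returns realised profit per sell and the remaining open position."""
    lots = deque()          # open buy lots as [remaining_qty, price]
    realised = []
    for side, qty, price in trades:
        if side == 'BUY':
            lots.append([qty, price])
            continue
        profit = 0.0
        remaining = qty
        while remaining > 1e-9:                 # fractional shares allowed
            lot = lots[0]
            used = min(remaining, lot[0])
            profit += used * (price - lot[1])   # sale price minus cost basis
            lot[0] -= used
            remaining -= used
            if lot[0] <= 1e-9:
                lots.popleft()
        realised.append(profit)
    position = sum(q for q, _ in lots)
    return realised, position

# Two buys, one partial sell: 15 shares sold out of the 25 held.
print(fifo_profits([('BUY', 10, 2.0), ('BUY', 15, 2.5), ('SELL', 15, 3.0)]))
# -> ([12.5], 10.0)
```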

Algorithm for Scheduling One Appointment in Already Full Schedule

I'm building a calendar scheduling application for, let's say a plumbing company. The company has one or more plumbers, who each have a schedule of appointments at different times throughout the day. So Josh's schedule on May 30th might include a 30-minute appointment at 10 AM, a 45-minute appointment at 1 PM, and an hour-long appointment at 3 PM, while Maria has a completely different schedule that day. Now say a customer wants to book an appointment with this company, and my program has already calculated the time this new appointment will take. I want my program to return a list of possible appointment times for any plumber(s). Is there a standard algorithm for this type of problem?
I'd prefer language-agnostic, general steps just to be more helpful to anyone who might be in a similar situation with a different language, though I'm using PHP and PostgreSQL if there's a specific language feature suited to this.
Here's what I've tried so far:
Get all available shifts for every plumber on the requested day
Get all appointments already made on that day
Do a sort of boolean subtraction to cut the appointments out of the shifts, leaving gaps in each plumber's schedule (sketched in code just after this list)
Get rid of all schedule gaps that are smaller than the requested appointment length (I also calculate drive times here so I know how far appointments need to be from one another)
Return those gaps to the customer, trimmed to the appointment length, as appointment possibilities
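Here is the promised sketch of the subtraction step (names such as `subtract_busy` are mine; times are minutes since midnight, and drive-time padding is left out to keep it short):

```python
def subtract_busy(shift, appointments):
    """Cut booked appointments out of one shift, returning the free gaps.
    shift: (start, end); appointments: list of (start, end); all values are
    minutes since midnight, intervals half-open."""
    gaps = []
    cursor = shift[0]
    for a_start, a_end in sorted(appointments):
        if a_start > cursor:
            gaps.append((cursor, a_start))
        cursor = max(cursor, a_end)
    if cursor < shift[1]:
        gaps.append((cursor, shift[1]))
    return gaps

# Josh works 08:00-17:00 with bookings 10:00-10:30, 13:00-13:45, 15:00-16:00.
shift = (8 * 60, 17 * 60)
booked = [(600, 630), (780, 825), (900, 960)]
print(subtract_busy(shift, booked))
# -> [(480, 600), (630, 780), (825, 900), (960, 1020)]
```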
I've learned that the problem with this approach is that it doesn't understand what to do with a gap much larger than the requested appointment. If you have a gap from 10 AM to 6 PM but you want an hour-long appointment, it will only suggest 10 AM to 11 AM. This approach doesn't allow for time-of-day choices, either. In that same scenario, what if the customer wants a morning appointment? Then it should only suggest 10-11 and 11-12. Or if they want an evening appointment, it should only suggest 5-6 PM. This approach also doesn't consider two plumbers working together. If we assume that two workers = half the time, then maybe the algorithm should look for the same 30 minutes available in both Josh and Maria's schedules along with 60-minute gaps in either plumber's schedule. Lastly, this algorithm feels very inefficient.
By the way, I've looked at several other questions here and around the Internet about how to solve similar situations, but I'm finding that most (if not all) of those questions involve optimizing a schedule. That might be valuable for other parts of this program, but for now, let's assume that the existing appointments are fixed and unchangeable. We're just looking to fit a new appointment into an existing schedule. I know this is possible because applications like Calendly have similar inputs and outputs.
In short, is there a better way of meeting these goals:
Suggest available gaps in one plumber's schedule given a time interval
If possible, only return appointment possibilities in the given time of day (morning = 4-12, afternoon = 12-5, evening = 5-10, night = 10-4, or any), and if not possible, continue with the algorithm as if no time of day had been specified
Suggest smaller gaps where n plumbers might do the job in 1/n time (there aren't that many plumbers, so setting a limit on this isn't necessary). This isn't as important as the other criteria, so if this isn't possible or would make the algorithm far more complex, then don't worry about it.
Split big appointment gaps into smaller gaps so we can suggest, for example, four hour-long slots between 10 AM and 2 PM. Obviously we can't suggest all possible hour-long segments of that gap because they'd be infinite
Thank you in advance.
There is no need for any sophisticated algorithm. There is only a small number of possible appointment times throughout a day, say every 30 minutes or so. Iterate over all possible times: 06:00, 06:30, 07:00, ..., 20:00. Check whether each time matches the requirements; that check can either return a yes/no result or a number that says how good a match that time is. You end up with a list of possible appointment times; pick the best one, or return all of them.
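That idea translates almost directly into code. A minimal sketch (my own names, reusing the minutes-since-midnight convention from the earlier sketch) that enumerates candidate start times and keeps those that fit inside a free gap and inside the requested time-of-day window:

```python
def candidate_slots(gaps, duration, step=30, window=(0, 24 * 60)):
    """Enumerate candidate start times every `step` minutes and keep those
    where the appointment fits entirely inside one free gap and inside the
    requested time-of-day window. Times are minutes since midnight."""
    slots = []
    for start in range(window[0], window[1] - duration + 1, step):
        end = start + duration
        if any(g_start <= start and end <= g_end for g_start, g_end in gaps):
            slots.append((start, end))
    return slots

# Free gaps from the subtraction step; the customer wants 60 minutes,
# mornings only (here taken as 04:00-12:00).
gaps = [(480, 600), (630, 780), (825, 900), (960, 1020)]
morning = (4 * 60, 12 * 60)
print(candidate_slots(gaps, 60, step=30, window=morning))
# -> [(480, 540), (510, 570), (540, 600), (630, 690), (660, 720)]
```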

Recommender: Log user actions & datamine it – good solution [closed]

I am planning to log all user actions like viewed page, tag etc.
What would be a good lean solution to data-mine this data to get recommendations?
Say like:
Figure out all the interests from the viewed URLs (assuming I know the associated tags)
Find out people who have similar interests, e.g. John & Jane viewed URLs related to cars, etc.
Edit:
It’s really my lack of knowledge in this domain that’s a limiting factor to get started.
Let me rephrase.
Let's say a site like Stack Overflow or Quora. All my browsing history going through different questions is recorded, and Quora does a data-mining job of looking through it and populating my stream with related questions. I go through questions relating to parenting and the next time I log in I see streams of questions about parenting. Ditto with Amazon shopping: I browse watches and mixers, and two days later they send me a mail of related shopping items that I might be interested in.
My question is, how do they efficiently store this data and then mine it to show the next relevant set of items?
Data mining is a method that needs really enormous amounts of storage space and also enormous amounts of computing power.
Let me give you an example:
Imagine you are the boss of a big chain of supermarkets like Wal-Mart, and you want to find out how to place your products in your shops so that consumers spend lots of money when they enter them.
First of all, you need an idea. Your idea is to find products from different product groups that are often bought together. If you have such a pair of products, you should place them as far apart as possible. If a customer wants to buy both, he or she has to walk through your whole shop, and along that route you place other products that might go well with one of the pair but are not sold as often. Some customers will see such a product and buy it, and the revenue from these additional products is the payoff of your data-mining process.
So you need lots of data. You have to store all the data you get from every purchase by every customer in every one of your shops. When a person buys a bottle of milk, a sausage and some bread, you need to store what goods were sold, in what amounts, and at what price. Every purchase needs its own ID if you want to be able to tell that the milk and the sausage were bought together.
So you have a huge number of purchase records. And you have a lot of different products. Let's say you sell 10,000 different products in your shops. Every product can be paired with every other, which makes 10,000 * 10,000 / 2 = 50,000,000 (50 million) pairs. And for each of these possible pairs you have to find out whether it occurs in a purchase. But maybe you think you have different customers on a Saturday afternoon than on a Wednesday late morning, so you also have to store the time of the purchase. Maybe you define 20 time slices across a week; that makes 50M * 20 = 1 billion records. And because people in Memphis might buy different things than people in Beverly Hills, you need the place in your data too. Let's say you define 50 regions, so you end up with 50 billion records in your database.
And then you process all your data. If a customer bought 20 products in one purchase, you have 20 * 19 / 2 = 190 pairs. For each of these pairs you increase the counter for the time and place of this purchase in your database. But by what should you increase the counter? Just by 1? Or by the quantity of the bought products? But you have a pair of two products: should you take the sum of both, or the maximum? Better to use more than one counter, so you can count it in every way you can think of.
And you have to do something else: customers buy much more milk and bread than champagne and caviar. So if they choose products arbitrarily, the pair milk-bread will of course have a higher count than the pair champagne-caviar. When you analyze your data, you must take care of effects like that too.
Then, when you have done all this, you run your data-mining query. You select the pair with the highest ratio of factual count to estimated count, from a database table with many billions of records. This might take hours to process, so think carefully about whether your query is really what you want to know before you submit it!
You might find out that in rural areas, people on a Saturday afternoon buy much more beer together with diapers than you expected. So you place beer at one end of the shop and diapers at the other end, and this makes lots of people walk through your whole shop, where they see (and hopefully buy) many other things they wouldn't have seen (and bought) if beer and diapers were placed close together.
And remember: the costs of your data-mining process are covered only by the additional purchases of your customers!
Conclusion:
You must store pairs, triples, or even bigger tuples of items, which will need a lot of space. Because you don't know what you will find in the end, you have to store every possible combination!
You must count those tuples
You must compare counted values with estimated values
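A toy version of the counting-and-comparing steps above (the code structure is mine; a real system would also partition the counters by time slice and region, as described):

```python
from collections import Counter
from itertools import combinations

# Each basket is one purchase: the set of products bought together.
baskets = [
    {"milk", "bread", "sausage"},
    {"milk", "bread"},
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # every unordered pair in the basket gets its counter increased by 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Compare the factual pair count against the count you would expect if the
# two items were bought independently of each other.
n = len(baskets)
for (a, b), count in pair_counts.most_common():
    expected = item_counts[a] * item_counts[b] / n
    print(f"{a}+{b}: seen {count}, expected {expected:.2f}, ratio {count / expected:.2f}")
```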
Store each transaction as a vector of tags (i.e. the visited pages containing these tags). Then do association analysis on this data (I can recommend Weka) to find associations using the "Associate" algorithms available. Effectiveness depends on a lot of different things, of course.
One thing a guy at my uni told me is that often you can simply create a vector of all the products that one person has bought, compare it with other people's vectors, and get decent recommendations. That is, represent users as the products they buy or the pages they visit and do, e.g., Jaccard similarity calculations. If the "people" are similar, then look at products they bought that this person didn't (probably the ones most common in the population of similar people).
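A small sketch of that idea (variable names are mine), representing each user as the set of products they bought and recommending items from the most similar other users:

```python
def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

users = {
    "john": {"car_wax", "tyres", "seat_covers"},
    "jane": {"car_wax", "tyres", "floor_mats"},
    "joe":  {"diapers", "beer"},
}

def recommend(target, users, top_n=3):
    """Items bought by the most similar other users that `target` lacks."""
    mine = users[target]
    neighbours = sorted(
        (u for u in users if u != target),
        key=lambda u: jaccard(mine, users[u]),
        reverse=True,
    )
    suggestions = []
    for u in neighbours:
        suggestions.extend(users[u] - mine)
    return suggestions[:top_n]

print(recommend("john", users))  # e.g. ['floor_mats', 'beer', 'diapers']
```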
Storage is a whole different ballgame; there are many good indices for vector data, such as KD-trees, implemented in various RDBMSs.
Take a course in data mining :) or just read one of the excellent textbooks available (I have read Introduction to Data Mining by Pang-Ning Tan et al. and it's good).
And regarding storing all the pairs of products etc.: of course this is not done in practice; more efficient algorithms based on support and confidence are used to prune the search space.
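For reference, support and confidence for a candidate rule like "people who view car pages also view racing pages" reduce to two ratios; a tiny sketch:

```python
def support_confidence(baskets, a, b):
    """support(a -> b): fraction of baskets containing both a and b;
    confidence(a -> b): of the baskets containing a, the fraction that
    also contain b. Rules below minimum support/confidence get pruned."""
    n = len(baskets)
    with_a = sum(1 for s in baskets if a in s)
    with_both = sum(1 for s in baskets if a in s and b in s)
    support = with_both / n
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence

views = [{"cars", "racing"}, {"cars"}, {"parenting"}, {"cars", "racing"}]
print(support_confidence(views, "cars", "racing"))  # (0.5, 0.666...)
```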
I should say recommendation is a machine learning problem.
How to store the data depends on which algorithm you choose.

How to manage transactions, debt, interest and penalty?

I am making a BI system for a bank-like institution. This system should manage credit contracts, invoices, payments, penalties and interest.
Now, I need to make a method that builds an invoice. I have to calculate how much the customer has to pay right now. He has a debt, which he has to pay off. He also has to pay interest. If he was ever late with a due payment, penalties are applied for each day he's late.
I thought there were 2 ways of doing this:
By keeping only one original state, the contract's original state, and computing the monthly payment the customer has to make each time it is needed, taking into account the actual payments made.
By constantly creating intermediary states, going from the last intermediary state and considering only the events that took place between these two intermediary states. This means having a job that runs periodically (daily, monthly), takes the last saved state, applies the changes (due payments, actual payments, changes in global constants like the penalty rate, which is controlled by the Central Bank), and saves the resulting state.
The benefits of the first variant:
Always up to date. If changes were made with a date in the past (a guy comes in with a paid invoice 5 days after he made the payment to the bank), they will be correctly reflected in the results.
The flaws of the first variant:
Takes a long time to compute
Documents printed with the results current at the time may later differ from the recomputed results if the data changes due to operations entered with a back date.
The benefits of the second variant:
Works fast, and aggregated data is always available for search and reports.
Simpler to compute
The flaws of the second variant:
Vulnerable to failed jobs.
Errors in the past propagate until the end, to the final results.
An intermediary result cannot be changed if new data about past transactions arrives (it can, but it's hard and has many implications, so I'd rather mark it as taboo)
Jobs cannot run successfully and without problems if an unfinished transaction exists (an issued invoice that hasn't been paid yet)
Is there any other way? Can I combine the benefits from these two? Which one is used in other similar systems you've encountered? Please share any experience.
Problems of this nature are always more complicated than they first appear. This is a consequence of what I like to call the Rumsfeldian problem of the unknown unknown.
Basically, whatever you do now, be prepared to make adjustments for arbitrary future rules. This is a tough proposition. Some future possibilities that may have a significant impact on your calculation model are back-dated payments, adjustments and charges. Forgiven interest periods may also become an issue (particularly if back dated). So may requirements to provide various point-in-time (PIT) calculations, based either on what was "known" at that PIT (a past view of the past) or taking into account transactions occurring after the reference PIT that were back dated to a PIT before the reference (a current view of the past). Calculations of this nature can be a real pain in the head.
My advice would be to calculate from "scratch" (i.e. the first variant). Implement optimizations (e.g. the second variant) only when necessary to meet performance constraints. Doing calculations from the beginning is a compute-intensive model but is generally more flexible with respect to accommodating unexpected left turns.
If performance is a problem but the frequency of complicating factors (e.g. back-dated transactions) is relatively low, you could explore a hybrid model employing the best of both variants: store the current state and calculate forward using only those transactions that posted since the last stored state, to create a new current state. If you hit a "complication", re-do the entire account from the beginning to re-establish the current state. (A rough sketch of this hybrid follows below.)
Being able to accommodate the unexpected without triggering a rewrite is probably more important in the long run than shaving calculation time right now. Do not place restrictions on your computation model until you have to. Saving current state often brings with it a number of built-in assumptions and restrictions that reduce the wiggle room for accommodating future requirements.
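Here is the promised rough sketch of that hybrid (the names and the trivial `apply` rule are mine; real interest and penalty logic would replace it): keep a stored state, roll it forward with newly posted transactions, and fall back to a full recomputation from the contract's original state when a back-dated item shows up.

```python
from dataclasses import dataclass

@dataclass
class State:
    as_of: int        # day number up to which this state is valid
    balance: float    # outstanding debt including interest/penalties

def apply(state, txn):
    """Roll one transaction (a dict with 'day' and 'amount'; negative
    amounts are customer payments) onto a state. Interest and penalty
    rules would go here; omitted to keep the sketch small."""
    return State(as_of=txn["day"], balance=state.balance + txn["amount"])

def recompute(original, transactions, up_to):
    """First variant: replay the complete transaction history from the
    contract's original state."""
    state = original
    for txn in sorted(transactions, key=lambda t: t["day"]):
        if txn["day"] <= up_to:
            state = apply(state, txn)
    return state

def roll_forward(original, stored, transactions, up_to):
    """Hybrid: extend the stored state with transactions posted after it,
    unless a back-dated transaction invalidates it; then start over."""
    back_dated = any(t["day"] <= stored.as_of for t in transactions
                     if t.get("new", False))
    if back_dated:
        return recompute(original, transactions, up_to)
    state = stored
    for txn in sorted(transactions, key=lambda t: t["day"]):
        if stored.as_of < txn["day"] <= up_to:
            state = apply(state, txn)
    return state

original = State(as_of=0, balance=1000.0)
stored = State(as_of=30, balance=900.0)
txns = [{"day": 15, "amount": -100.0},              # already in `stored`
        {"day": 25, "amount": -50.0, "new": True},  # back-dated payment just arrived
        {"day": 40, "amount": -100.0, "new": True}]
print(roll_forward(original, stored, txns, up_to=45))
# back-dated item found -> full recompute -> State(as_of=40, balance=750.0)
```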

Resources