Merge 5200 hotels and 5200 flights at run time - algorithm

I already have a solution, but it doesn't seem like the best one to me. I think I'm missing something.
Imagine that you have to merge flights and hotels into packages for half a year in advance, for every week (26 weeks), for 200 destinations. That means we have 200 * 26 = 5200 flights and 5200 hotels.
At the end you need the best combination price (hotel price + flight price) for every destination, which means you end up with 200 best packages.
I currently have two indices in Elasticsearch (hotel and flight), but Elasticsearch is not a requirement.
The most performant solution is to generate all the packages in advance in a third index. It's actually not a good solution, though, because it takes a lot of server resources and time. Imagine that you have 10 different filters and every filter has many values: if every filter has 6 values, that is 200 * 26 * 6^10 = 314,424,115,200 combinations to precompute. We were thinking about the cloud, which might be an option, but I think there should be an algorithm to merge the 5200 hotels and flights at run time instead.
Performance is very important.
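To illustrate the kind of run-time merge I have in mind, here is a rough sketch (the field names are made up; it assumes the filters apply to hotels and flights independently, so the cheapest package is just the cheapest matching hotel plus the cheapest matching flight per destination and week):

    from collections import defaultdict

    def best_packages(hotels, flights, hotel_filter, flight_filter):
        """hotels/flights: iterables of dicts with 'destination', 'week', 'price'.
        Returns {destination: best hotel+flight price over all 26 weeks}."""
        cheapest_hotel = defaultdict(lambda: float('inf'))    # (dest, week) -> price
        cheapest_flight = defaultdict(lambda: float('inf'))
        for h in hotels:
            if hotel_filter(h):
                key = (h['destination'], h['week'])
                cheapest_hotel[key] = min(cheapest_hotel[key], h['price'])
        for f in flights:
            if flight_filter(f):
                key = (f['destination'], f['week'])
                cheapest_flight[key] = min(cheapest_flight[key], f['price'])
        best = {}
        for (dest, week), hotel_price in cheapest_hotel.items():
            flight_price = cheapest_flight.get((dest, week))
            if flight_price is not None:
                best[dest] = min(best.get(dest, float('inf')), hotel_price + flight_price)
        return best

That is a single pass over 2 * 5200 records per query, instead of precomputing every filter combination.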
Thanks in advance.

Related

How to optimize employee transport service?

Gurus,
I am in the process of writing some code to optimize employee transport for a corporation. I need all you experts' advice on how this can be achieved. Here is my scenario.
There are 100 pick-up points all over the city from which employees need to be brought to the company in multiple vehicles. Each vehicle can carry, say, 4 or 6 employees. My objective is to write some code that will group people from nearby areas and bring them to the company. The master data will have addresses and their latitude/longitude. I want to build an algorithm that optimizes vehicle occupancy as well as distance and time. Could you please give some direction on how this can be achieved? I understand I may need to use the Google Maps or Directions API for this, but I'm looking for a logic hint/advice on how it can be done.
Some more inputs: these are company vehicles with drivers. Travel time should not be more than 1.5 hours.
Thanks in advance.
Your problem description is a more complicated version of the travelling salesman problem (specifically, it is close to the vehicle routing problem). You can look it up and find different examples and how they are implemented.
One point that needs to be clarified: will the vehicles be employees' cars that are carshared, or company vehicles with a driver?
You also need to define some time constraints. For example: 50 employees should have under 30 min of travel, 40 employees under 1 hour, and 10 employees under 1.5 hours.
You also need to define the travel time for each road depending on the time of day, because traffic jams come and go over the course of the day.
You also need to define groups within the employees: people in a company (admin clerks vs. the CEO, say) usually don't commute at the same time; there can be a range of an hour or more.
Finally, don't forget to account for the roughly 10% of employees who will be 2 to 5 minutes late to their meeting point.
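As a first cut before reaching for specialised routing software, you could greedily group nearby pick-up points into vehicle-sized clusters. A rough sketch (straight-line haversine distance as a stand-in for the real road travel times you would get from a directions API):

    import math

    def haversine_km(a, b):
        """Great-circle distance between two (lat, lon) points, in km."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))

    def greedy_groups(points, capacity, office):
        """Seed each vehicle with the unassigned point farthest from the office,
        then fill it with that point's nearest unassigned neighbours."""
        unassigned = list(points)
        groups = []
        while unassigned:
            unassigned.sort(key=lambda p: haversine_km(office, p))
            seed = unassigned.pop()                 # farthest remaining point
            unassigned.sort(key=lambda p: haversine_km(seed, p))
            groups.append([seed] + unassigned[:capacity - 1])
            unassigned = unassigned[capacity - 1:]
        return groups

Each group then becomes one vehicle's route; you would still validate the 1.5-hour limit against real directions data and re-split any group that exceeds it.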

Algorithm to find the best possible available times

Here is my scenario,
I run a massage place which offers various types of massages, say a 30-minute massage, a 45-minute massage, a 1-hour massage, etc. I have 50 rooms, 100 employees and 30 pieces of equipment. When a customer books a massage appointment, the appointment requires 1 room, 1 employee and 1 piece of equipment to be available.
What is a good algorithm to find available resources for 10 guests on a given day?
Resources:
Room – 50
Staff – 100
Equipment – 30
Business Hours: 9 AM - 6 PM
Staff Hours: 9 AM - 6 PM
No of guests: 10
Services
5 Guests- (1 hour massages)
3 Guests - (45mins massages)
2 Guests - (1 hour massage).
They are coming at around the same time. Assume there are no other appointments on that day.
What is the best way to get:
Top 10 results - the fastest search that meets all conditions returns the top 10 result set. "Top ten" is defined by earliest available time: 9-11 AM is a good result set; 9 AM - 5 PM is not as good.
Exhaustive search (find all combinations) - every possible combination.
First available match (only return the first match) - stop as soon as one combination that meets the conditions is found.
I would appreciate your help.
Thanks
Nick
First, the exact numbers of employees, rooms, and pieces of equipment are mostly irrelevant: you only care about whichever of them is lowest. That is your inventory. So in your case, inventory = 30.
Next, it sounds like you can service all 10 people at the same time within the first hour of business. In fact, you can service 30 people at the same time.
So, no algorithm is necessary to figure that out; it's a static solution. If you take @Mario The Spoon's advice and weight the different duration massages with their corresponding profits, then you can start optimizing once you have more than 30 customers at a time.
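A minimal sketch of that reasoning, with the greedy extension for when requests exceed the inventory of 30:

    import heapq

    def schedule(durations_min, rooms, staff, equipment, open_min=9 * 60):
        """Give each requested massage the earliest possible start time.
        Capacity is bounded by the scarcest resource (the 'inventory')."""
        inventory = min(rooms, staff, equipment)    # 30 in the question
        free_at = [open_min] * inventory            # when each slot frees up
        heapq.heapify(free_at)
        plan = []
        for d in durations_min:
            start = heapq.heappop(free_at)          # earliest free slot
            plan.append((start, d))
            heapq.heappush(free_at, start + d)
        return plan

    # 5 one-hour, 3 forty-five-minute and 2 one-hour guests:
    # all ten start at minute 540, i.e. 9 AM, since 10 <= 30.
    print(schedule([60] * 5 + [45] * 3 + [60] * 2, 50, 100, 30))

With more than 30 simultaneous requests the heap starts pushing appointments later, which is where the profit-weighting mentioned above would come in.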
It looks like you are trying to solve a problem for which there are quite specialized software applications. If your problem is small enough, you could try a brute-force approach using some looping and backtracking, but as soon as the problem gets too big, it will take too much time to iterate through all possibilities.
If the problem starts to get big, look for more specialized software. Things to look for are "constraint based optimization" and "constraint programming".
E.g. the ECLiPSe tool is an open-source constraint programming environment. You can find some examples at http://eclipseclp.org/examples/index.html. One nice example you can find there is the SEND+MORE=MONEY problem, where you have the following equation:
  S E N D
+ M O R E
---------
M O N E Y
Replace every letter by a digit so that the sum is correct.
This also illustrates that although you can solve this by brute force, there are more intelligent ways to solve it (see http://eclipseclp.org/examples/sendmore.pl.txt).
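For comparison, the brute-force route in plain Python looks like this (ECLiPSe's constraint propagation prunes the search far more intelligently):

    from itertools import permutations

    def send_more_money():
        # Try every assignment of distinct digits to the 8 letters.
        for s, e, n, d, m, o, r, y in permutations(range(10), 8):
            if s == 0 or m == 0:                    # no leading zeros
                continue
            send = 1000 * s + 100 * e + 10 * n + d
            more = 1000 * m + 100 * o + 10 * r + e
            money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
            if send + more == money:
                return dict(zip('SENDMORY', (s, e, n, d, m, o, r, y)))

    print(send_more_money())    # S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2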
Just an idea to find a solution:
You might want to try to solve it with a constraint satisfaction problem (CSP) algorithm. That's what some people use to solve timetabling problems in general (e.g. room reservation at a university).
There are several tricks to improve CSP performance, like forward checking, building a DAG and then doing a topological sort, and so on...
Just let me know, if you need more information about CSP :)

Programmatically determine the relative "popularities" of a list of items (books, songs, movies, etc)

Given a list of (say) songs, what's the best way to determine their relative "popularity"?
My first thought is to use Google Trends. This list of songs:
Subterranean Homesick Blues
Empire State of Mind
California Gurls
produces the following Google Trends report: (to find out what's popular now, I restricted the report to the last 30 days)
http://s3.amazonaws.com/instagal/original/image001.png?1275516612
Empire State of Mind is marginally more popular than California Gurls, and Subterranean Homesick Blues is far less popular than either.
So this works pretty well, but what happens when your list is 100 or 1000 songs long? Google Trends only allows you to compare 5 terms at once, so absent a huge round-robin, what's the right approach?
Another option is to just do a Google search for each song and see which has the most results, but this doesn't really measure the same thing.
Excellent question - one song by Britney Spears might be phenomenally popular for 2 months and then (thankfully) forgotten, while another song by Elvis might have sustained popularity for 30 years. How do you quantitatively distinguish the two? We want to say that sustained popularity is more important than a "flash in the pan", but how do we get that result?
First, I would normalize around the release date - Subterranean Homesick Blues might be unpopular now (not in my house, though), but normalizing back to 1965 might yield a different result.
Since most songs climb in popularity, level off, then decline, let's choose the period when they level off. One might assume that during that period the two series are stationary, uncorrelated, and normally distributed. Now you can just apply a test to determine whether the means are different.
There are probably less restrictive tests to determine the magnitude of the difference between two time series, but I haven't run across them yet.
Anyone?
You could search for the item on Twitter and see how many times it is mentioned. Or look it up on Amazon to see how many people have reviewed it and what rating they gave it. Both Twitter and Amazon have APIs.
There is an unofficial Google Trends API. See http://zoastertech.com/projects/googletrends/index.php?page=Getting+Started - I have not used it, but perhaps it is of some help.
I would certainly treat Google's API as "restricted".
In general, comparison functions used for sorting algorithms are very "binary":
input: 2 elements
output: true/false
Here you have:
input: 5 elements
output: relative weights of each element
Therefore you will only need a linear number of calls to the API (whereas sorting usually requires O(N log N) calls to comparison functions).
You will need exactly ceil((N-1)/4) calls. Those can be parallelized, though do read the usage guide closely for the number of requests you are authorized to submit.
Then, once all of them are "rated", you can do a simple local sort.
Intuitively, in order to gather them properly you would:
Shuffle your list
Pop the first 5 elements
Call the API
Insert them, sorted, into the result (use insertion sort here)
Pick the median
Pop the next 4 elements (or fewer if fewer are available)
Call the API with the median and those 4
Go back to "Insert" until you run out of elements
If your list is 1000 songs long, that's 250 calls to the API, nothing too scary.
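A sketch of that loop, where rate_batch is a hypothetical wrapper around the 5-term Trends comparison that returns a relative weight per term:

    import bisect
    import random

    def rank_songs(songs, rate_batch):
        """Rank all songs with an API that can only compare 5 terms at a time.
        The running median is re-submitted with each new batch so every call
        can be rescaled onto the same axis."""
        todo = list(songs)
        random.shuffle(todo)
        ranked = []                                  # sorted list of (weight, song)

        batch, todo = todo[:5], todo[5:]
        for song, w in rate_batch(batch).items():    # first call sets the scale
            bisect.insort(ranked, (w, song))

        while todo:
            median_w, median = ranked[len(ranked) // 2]
            batch, todo = todo[:4], todo[4:]
            weights = rate_batch([median] + batch)
            scale = median_w / weights[median]       # align with the scale so far
            for song in batch:
                bisect.insort(ranked, (weights[song] * scale, song))
        return [song for _, song in reversed(ranked)]

This makes 1 + ceil((N-5)/4) = ceil((N-1)/4) calls, matching the count above (it assumes the median never scores zero within a batch).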

Finding a subset of numbers that equals a single number

The reason I am posting this is that I am looking to reconcile customer accounts receivable where "payments" are posted to the account instead of being matched with the open invoices and cleared. So here is my issue:
I have a single number (a payment) that should equal the sum of a subset of a given set of numbers (invoice amounts). A simple example:
Payment $10,002
Invoices values:
5001
2932
876
98
21
9923
2069
123
432
765
I would want a way to pull out 5001, 2932 and 2069 from this set (they sum to 10,002).
As a non-programmer, an Excel spreadsheet application is the easiest thing for me to create. Ideas?
You're talking about an NP-Complete problem called Subset-sum.
Basically, this means that in general it is computationally very hard to find the subset of prices that sums to your grand total. It is, however, very easy to check an answer, since you merely sum it up.
My guess is that if you want to examine N prices, you're going to need about 2^N cells in Excel to calculate this. The Wikipedia article on subset sum gives some heuristics for approximating it.
Bottom line: if you need to do this on a large scale (N in, say, the hundreds or thousands), you should rethink why you need to do it at all.
If you can find a way to do it very efficiently, there may be a prize involved.
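That said, if you (or a helpful colleague) can step outside Excel, the classic dynamic-programming version is short. A sketch in Python using the figures above:

    def find_subset(invoices, payment):
        """Return a subset of invoice amounts summing to payment, or None.
        reachable[s] holds one subset whose sum is exactly s."""
        reachable = {0: []}
        for amount in invoices:
            # Snapshot the dict so each invoice is used at most once.
            for s, subset in list(reachable.items()):
                t = s + amount
                if t <= payment and t not in reachable:
                    reachable[t] = subset + [amount]
        return reachable.get(payment)

    invoices = [5001, 2932, 876, 98, 21, 9923, 2069, 123, 432, 765]
    print(find_subset(invoices, 10002))    # -> [5001, 2932, 2069]

For real accounts-receivable data, work in cents to keep everything in integers; the table size is bounded by the payment amount, so this stays practical for typical invoice batches.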
I worked on a very similar Java application that mapped receipts to accounts receivable transactions. We did not try to programmatically link summed receipts to a single transaction, or vice versa, for a number of reasons. However, we did allow users to do that mapping manually. We just mapped receipt figures to transaction figures that matched; if there were multiple receipts and transactions with the same amount, we only matched when there were the same number of duplicate amounts.

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking of just a point system, with something like: same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often: 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than that those numbers "feel" about right.
Other than arbitrarily picked numbers, does anyone have any input? Other numbers are also welcome if you can give a reason why you think they're better than mine.
Edit: This would primarily be in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between, say, someone you talked to 80 times versus 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list such that whenever a node is selected, its frequency count is incremented and the node is bubbled towards the front of the list, so that the most frequently accessed node is at the head.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight for each and every item. A self-organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. That means you have to have the entire email history available when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed - that is what gives us the good expected performance.
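A sketch of the frequency heuristic with that grouped move (a plain Python list standing in for the linked list):

    class AutoCompleteList:
        """Self-organizing list kept in descending frequency order. On access,
        an entry's count is bumped and it is moved in front of every entry
        whose count is now smaller or equal (the front of its group)."""
        def __init__(self):
            self.entries = []                       # [count, address] pairs

        def access(self, address):
            for i, entry in enumerate(self.entries):
                if entry[1] == address:
                    entry[0] += 1
                    j = i
                    while j > 0 and self.entries[j - 1][0] <= entry[0]:
                        j -= 1                      # bubble past the tied group
                    self.entries.insert(j, self.entries.pop(i))
                    return
            self.entries.append([1, address])       # first use of this address

        def suggest(self, prefix):
            return [a for _, a in self.entries if a.startswith(prefix)]

Note that work is only done on access, which is exactly the property described above.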
This seems similar to what Firefox does when suggesting the site you are typing.
Unfortunately I don't know exactly how Firefox does it. A point system seems good as well; maybe you'll just need to balance your points :)
I'd go for something similar to:
NoM = Number of Mails
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you have not written to during the last month (the cutoff could be changed) will have 0 points. You could sort those by total NoM sent (since they are on the contact list :). They will be shown after the contacts with points > 0.
It's just an idea; the point is to give different weight to frequently mailed and recently mailed contacts.
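That scoring as code (the per-contact counts are assumed to come from whatever mail store you have):

    def score(today, last_week, last_month):
        """Points for one contact: NoM today, plus damped weekly and monthly rates."""
        return today + 0.5 * last_week / 7 + (1 / 3) * last_month / 30

    contacts = {                        # address -> (today, last week, last month)
        'alice@example.com': (2, 6, 20),
        'bob@example.com':   (0, 1, 3),
        'carol@example.com': (0, 0, 0),     # 0 points: shown after the others
    }
    print(sorted(contacts, key=lambda a: score(*contacts[a]), reverse=True))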
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" works very well. "Most recent" can be a real pain if you accidentally mis-type something once. "Most used", on the other hand, doesn't evolve much over time: if you had a lot of contact with somebody last year but your job has since changed, for example, they will keep crowding out your new contacts.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0, it performs exactly like an LFU cache; when lambda is 1, it performs exactly like an LRU cache. In between, it combines recency and frequency information in a natural way.
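I don't have the paper's exact update rule in front of me, but the flavour of such a blend is roughly this (an illustrative sketch, not the paper's formula):

    class RecencyFrequencyScore:
        """Each access contributes (1/2) ** (lam * age). With lam = 0 this just
        counts accesses (LFU-like); as lam grows, the latest access dominates
        (LRU-like). Sketch only - see the paper for the exact policy."""
        def __init__(self, lam):
            self.lam = lam
            self.value = 0.0
            self.last_access = None

        def access(self, now):
            if self.last_access is not None:
                age = now - self.last_access
                self.value *= 0.5 ** (self.lam * age)   # decay the old score
            self.value += 1.0                           # add the new access
            self.last_access = now

Ranking contacts by this value gives you a single recency/frequency knob (lambda) to tune.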
Although an answer has already been chosen, I want to submit my approach for consideration and feedback.
I would account for frequency by incrementing a counter on each use, but by some larger-than-one value, like 10 (to preserve precision through the decay step below).
I would account for recency by multiplying all counters at regular intervals (say, every 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo@bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency into one field, avoid the need to keep a detailed history to derive {last day, last week, last month}, and keep the math (mostly) in integers.
The increment and diminisher would have to be adjusted to preference, of course.
