I see that for k-means, we have Lloyd's algorithm, Elkan's algorithm, and we also have hierarchical version of k-means.
For all these algorithms, I see that Elkan's algorithm could provide a boost in term of speed. But what I want to know, is the quality from all these k-means algorithms. Each time, we run these algorithms, the result would be different, due to their heuristic and probabilistic nature. Now, my question is, when it comes to clustering algorithm like k-means, if we want to have a better quality result (as in lesser distortion, etc.) between all these k-means algorithms, which algorithm would be able to give you the better quality? Is it possible to measure such thing?
A better solution is usually one that has a better (lower) J(x,c) value, where:
J(x,c) = 1/|x| * Sum(distance(x(i),c(centroid(i)))) for each i in [1,|x|]
Wherre:
x is the list of samples
|x| is the size of x (number of elements)
[1,|x|] all the numbers from 1 to |x| (inclusive)
c is the list of centroids (or means) of clusters (i.e., for k clusters |c| = k)
distance(a,b) (sometimes denoted as ||a-b|| is the distance between "point" a to "point" b (In euclidean 2D space it is sqrt((a.x-b.x)^2 + (a.y-b.y)^2))
centroid(i) - the centroid/mean which is closest to x(i)
Note that this approach does not require switching to supervised technique and can be fully automated!
As I understand it, you need some data with labels to cross-validate you clustering algorithm.
How about the pathological case of the two-moons dataset? unsupervised k-means will fail badly. A high quality method I am aware of employs a more probabilistic approach using mutual information and combinatorial optimization. Basically you cast the clustering problem as the problem of finding the optimal [cluster] subset of the full point-set for the case of two clusters.
You can find the relevant paper here (page 42) and the corresponding Matlab code here to play with (checkout the two-moons case). If you are interested in a C++ high-performance implementation of that with a speed up of >30x then you can find it here HPSFO.
To compare the quality, you should have a labeled dataset and measure the results by some criteria like NMI
We know that Genetic Algorithms (or evolutionary computation) work with an encoding of the points in our solution space Ω rather than these points directly. In the literature, we often find that GAs have the drawback : (1) since many chromosomes are coded into a similar point of Ω or similar chromosomes have very different points, the efficiency is quite low. Do you think that is really a drawback ? because these kind of algorithms uses the mutation operator in each iteration to diversify the candidate solutions. To add more diversivication we simply increase the probability of crossover. And we mustn't forget that our initial population ( of chromosones ) is randomly generated ( another more diversification). The question is, if you think that (1) is a drawback of GAs, can you provide more details ? Thank you.
Mutation and random initialization are not enough to combat the problem that is known as genetic drift which is the major problem of genetic algorithms. Genetic drift means that the GA may quickly lose most of its genetic diversity and the search proceeds in a way that is not beneficial for crossover. This is because the random initial population quickly converges. Mutation is a different thing, if it is high it will diversify, true, but at the same time it will prevent convergence and the solutions will remain at a certain distance to the optimum with higher probability. You will need to adapt the mutation probability (not the crossover probability) during the search. In a similar manner the Evolution Strategy, which is similar to a GA, adapts the mutation strength during the search.
We have developed a variant of the GA that is called OffspringSelection GA (OSGA) which introduces another selection step after crossover. Only those children will be accepted that surpass their parents' fitness (the better, the worse or any linearly interpolated value). This way you can even use random parent selection and put the bias on the quality of the offspring. It has been shown that this slows the genetic drift. The algorithm is implemented in our framework HeuristicLab. It features a GUI so you can download and try it on some problems.
Other techniques that combat genetic drift are niching and crowding which let the diversity flow into the selection and thus introduce another, but likely different bias.
EDIT: I want to add that the situation of having multiple solutions with equal quality might of course pose a problem as it creates neutral areas in the search space. However, I think you didn't really mean that. The primary problem is genetic drift, ie. the loss of (important) genetic information.
As a sidenote, you (the OP) said:
We know that Genetic Algorithms (or evolutionary computation) work with an encoding of the points in our solution space Ω rather than these points directly.
This is not always true. An individual is coded as a genotype, which can have any shape, such as a string (genetic algorithms) or a vector of real (evolution strategies). Each genotype is transformed into a phenotype when assessing the individual, i.e. when its fitness is calculated. In some cases, the phenotype is identical to the genotype: it is called direct coding. Otherwise, the coding is called indirect. (you may find more definitions here (section 2.2.1))
Example of direct encoding:
http://en.wikipedia.org/wiki/Neuroevolution#Direct_and_Indirect_Encoding_of_Networks
Example of indirect encoding:
Suppose you want to optimize the size of a rectangular parallelepiped dened by its length, height and width. To simplify the example, assume that these three quantities are integers between 0 and 15. We can then describe each of them using a 4-bit binary number. An example of a potential solution may be to genotype 0001 0111 01010. The corresponding phenotype is a parallelepiped of length 1, height 7 and width 10.
Now back to the original question on diversity, in addition to what DonAndre said you could read you read chapter 9 "Multi-Modal Problems and Spatial Distribution" of the excellent book Introduction to Evolutionary Computing written by A. E. Eiben and J. E. Smith. as well as a research paper on that matter such as Encouraging Behavioral Diversity in Evolutionary Robotics: an Empirical Study. In a word, diversity is not a drawback of GA, it is "just" an issue.
I have asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?
Can someone please clarify the some (or all) of the above questions?
I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.
I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical
distribution of your data;
the relationship among the features
that comprise your data (are they
independent--i.e., what does the
covariance matrix look like); and
the coordinate space from which your
data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
YEuclidean metric used in mega-scale Web Recommendation Engines as well as in current academic research. Distances calculated by Euclidean have intuitive meaning and the computation scales--i.e., Euclidean distance is calculated the same way, whether the two points are in two dimension or in twenty-two dimension space.
It has only failed for me a few times, each of those cases Euclidean distance failed because the underlying (cartesian) coordinate system was a poor choice. And you'll usually recognize this because for instance path lengths (distances) are no longer additive--e.g., when the metric space is a chessboard, Manhattan distance is better than Euclidean, likewise when the metric space is Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is is 2.5 hours, Vienna to St. Petersburg is another 3 hrs, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours, instead, is a little over 3 hrs.)
But apart from those cases in which your data belongs in a non-cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student, comparing several distance metrics by examining their effect on kNN classifier--chi square give the best results, but the differences are not large; A more comprehensive study is in the academic paper, Comparative Study of Distance Functions for Nearest Neighbors--Mahalanobis (essentially Euclidean normalized by to account for dimension covariance) was the best in this study.
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data--rarely is it possible to build a kNN model to generate accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your expectation variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution by bodyfat % will be almost negligible. Put another way, if instead, the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 86.1, would be 86,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd refer calculated separately for each column, or feature in that data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
II. The Data Structure
If you are concerned about performance of the kd-tree structure, A Voronoi Tessellation is a conceptually simple container but that will drastically improve performance and scales better than kd-Trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index) then you ought to find a library to perform VT. I know in Python and R, there are multiple options for each language (e.g., the voronoi package for R available on CRAN)
Using a VT for kNN works like this::
From your data, randomly select w points--these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to each center. Imagine if you assign a different color to each of Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have a sufficient density, doing this will nicely show the boundaries of each Voronoi center (as the boundary that separates two colors.
How to select the Voronoi Centers? I use two orthogonal guidelines. After random selecting the w points, calculate the VT for your training data. Next check the number of data points assigned to each Voronoi center--these values should be about the same (given uniform point density across your data space). In two dimensions, this would cause a VT with tiles of the same size.That's the first rule, here's the second. Select w by iteration--run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points..... If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple million distance calculations for each new data points whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a V/T, the nearest-neighbor search is performed in two steps one after the other, against two different populations of data--first against the Voronoi centers, then once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations) Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tesselate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far lesss, on average just 125 + 2,000.
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient, which is just the 1/(dist * K), or the inverse of the distance from that neighbor to the test instance often multiplied by some empirically derived constant, K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A must better weighting function, which substantially avoids this limitation is the gaussian function, which in python, looks like this:
def weight_gauss(dist, sig=2.0) :
return math.e**(-dist**2/(2*sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict ('test instance'), then call the weight_gauss function, once for each of the n neighbors, passing in the distance between each neighbor the the test point.This function will return the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted average calculation.
What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions and possibly find a linear transformation which would allow you to use less than 21 with approximately the same result quality.
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique and I believe you should be able to find its R or other software implementation. PCA is performed by solving linear equations iteratively. I've done it too long ago to remember how. = )
The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
Top answers are good but old, so I'd like to add up a 2016 answer.
As said, in a high dimensional space, the curse of dimensionality lurks around the corner, making the traditional approaches, such as the popular k-d tree, to be as slow as a brute force approach. As a result, we turn our interest in Approximate Nearest Neighbor Search (ANNS), which in favor of some accuracy, speedups the process. You get a good approximation of the exact NN, with a good propability.
Hot topics that might be worthy:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
or in a more recent approach I was part of, kd-GeRaF.
LOPQ which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko+Lemptitsky's approach.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow
To answer your questions one by one:
No, euclidean distance is a bad metric in high dimensional space. Basically in high dimensions, data points have large differences between each other. That decreases the relative difference in the distance between a given data point and its nearest and farthest neighbour.
Lot of papers/research are there in high dimension data, but most of the stuff requires a lot of mathematical sophistication.
KD tree is bad for high dimensional data ... avoid it by all means
Here is a nice paper to get you started in the right direction. "When in Nearest Neighbour meaningful?" by Beyer et all.
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.
Cosine similarity is a common way to compare high-dimension vectors. Note that since it's a similarity not a distance, you'd want to maximize it not minimize it. You can also use a domain-specific way to compare the data, for example if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.
kd-trees indeed won't work very well on high-dimensional data. Because the pruning step no longer helps a lot, as the closest edge - a 1 dimensional deviation - will almost always be smaller than the full-dimensional deviation to the known nearest neighbors.
But furthermore, kd-trees only work well with Lp norms for all I know, and there is the distance concentration effect that makes distance based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
I'm not convinced there is a lot use to just blindly approximating Euclidean nearest neighbors e.g. using LSH or random projections. It may be necessary to use a much more fine tuned distance function in the first place!
A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.
I think cosine on tf-idf of boolean features would work well for most problems. That's because its time-proven heuristic used in many search engines like Lucene. Euclidean distance in my experience shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.
iDistance is probably the best for exact knn retrieval in high-dimensional data. You can view it as an approximate Voronoi tessalation.
I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric, however it's computationally more expensive than the Manhattan distance, and sometimes yields slightly poorer results, thus, I'd choose the later.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
KD Trees work fine for 21 dimensions, if you quit early,
after looking at say 5 % of all the points.
FLANN does this (and other speedups)
to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric,
and the fast and solid
scipy.spatial.cKDTree
does only Lp metrics;
these may or may not be adequate for your data.)
There is of course a speed-accuracy tradeoff here.
(If you could describe your Ndata, Nquery, data distribution,
that might help people to try similar data.)
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245
You could try a z order curve. It's easy for 3 dimension.
I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random
f = 40
t = AnnoyIndex(f, 'angular') # Length of item vector that will be indexed
for i in range(1000):
v = [random.gauss(0, 1) for z in range(f)]
t.add_item(i, v)
t.build(10) # 10 trees
t.save('test.ann')
# ...
u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
They provide different distance measurements. Which distance measurement you want to apply depends highly on your individual problem. Also consider prescaling (meaning weighting) certain dimensions for importance first. Those dimension or feature importance weights might be calculated by something like entropy loss or if you have a supervised learning problem gini impurity gain or mean average loss, where you check how much worse your machine learning model performs, if you scramble this dimensions values.
Often the direction of the vector is more important than it's absolute value. For example in the semantic analysis of text documents, where we want document vectors to be close when their semantics are similar, not their lengths. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as a distance measurement.
Hope this is helpful.
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are calculated to find the most relevant dimensions. You can use these weights when using euclidean distance, for example. See curse of dimensionality for common problems and also this article can enlighten you somehow:
A k-means type clustering algorithm for subspace clustering of mixed numeric and
categorical datasets
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 12 years ago.
Improve this question
I read over the wikipedia article, but it seems to be beyond my comprehension. It says it's for optimization, but how is it different than any other method for optimizing things?
An answer that introduces me to linear programming so I can begin diving into some less beginner-accessible material would be most helpful.
The answers so far have given an algebraic definition of linear programming, and an operational definition. But there is also a geometric definition. A polytope is an n-dimensional generalization of a polygon (in two dimensions) or a polyhedron (in three dimensions). A convex polytope is a polytope which is also a convex set. By definition, linear programming is an optimization problem in which you want to maximize or minimize a linear function on a convex polytope.
For example: Suppose that you want to buy some combination of red sand and blue sand. Suppose also:
You can't buy a negative amount of either kind.
The depot only has 300 pounds of red sand and 400 pounds of blue sand.
Also your jeep has a weight limit of 500 pounds.
If you draw a picture in the plane of how much you can buy with these constraints, it's a convex pentagon. Then, whatever you want to optimize (say, the total amount of gold in the sand), you can know that an optimum (not necessarily the only optimum) is at one of the vertices of the polytope. In fact, there is a much stronger result: Even in high dimensions, any such linear programming problem can be solved in polynomial time, in the number of constraints, or putative sides of the polytope. Note that not every constraint corresponds to a side. If the constraint is an equality, it might reduce the dimension of the polytope. Or if the constraint is an inequality, it might be not create a side if it is already implied by all of the other constraints.
There are a lot of practical optimization problems that are linear programming. One of the first examples was the "diet problem": Given a menu of a bunch of kinds of food, what is the cheapest possible balanced diet? It's a linear programming problem because the cost is linear, and because all of the constraints (vitamins, calories, the assumption that you can't buy a negative amount of food, etc.) are linear.
But, linear programming is even more important for a theoretical reason. It is one of the most powerful polynomial-time algorithms for optimization or for any other purpose. As such, it is very important as a substitute for approximately solving other optimization problems, and as a subroutine for exactly solving them.
Yes, two generalizations are convex programming and integer programming. With some qualificiations, convex programming can work just as well as linear programming, provided that the objective (the thing to maximize) is linear. It turns out that convexity, not flat sides, is the main reason that linear programming has a good algorithm.
Integer programming, on the other hand, is usually hard. For instance, suppose in the example problem you have to buy the sand in fixed-size bags rather than in bulk; that is then integer programming. There is a theorem that it can be NP-hard. How hard it is in practice depends on how close it is to linear programming. There are some celebrated examples of integer programming problems in which, miraculously, all of the vertices of the linear program are integer points. Then you can solve the linear program and the solution will happen to be integral. One example of such a problem is the marriage problem, how to marry n men and n women to each other to maximize total happiness. (Or, n cities to n factories, n jobs to n applicants, n computers to n printers, etc.)
Linear programming is a topic of 'mathematical programming', which is also called 'mathematical optimization'. Linear programs differ from general mathematical programs in that for a Linear Program (LP) all constraint functions and the objective function are linear with respect to their variables.
A good place to start would be here if you want the original work by Dantzig, or if you want to get a textbook, I recommend this one. If you want to look up your own resources, start with looking up the Simplex method--it is a very common technique to solve these programs, or the less common but definitely polynomial time Ellipsoid method. Though I haven't read it all, looking it over quickly also suggests this PDF may be a good place to start. Make sure whatever you end up reading covers duality (and perhaps specifically the Farkas' lemma) as it's a central idea in most LP solvers.
The most natural extensions are either Integer programs (similar to LP's, but all variables must take on integer values--that is, no fractional components) or Convex programming (perhaps a more general extension). A good convex optimization text book is available in PDF form here.
As everyone else has said, Linear Programming is a way to solve optimization problems, in which the terms are linear.
It might help to understand what types of problems LP's solve
One example where I've used Linear Programming is building a restaurant schedule. In a restaurant you have skill sets:
Cooks
Servers
Dishwashers
Hosts
Bussers
Manager
etc
And you have employees, each with one or more skill sets. Each employee also has a specific availability. For example, Bob can't work Sunday mornings because he's a pastor at a local church. Employees also have an associated cost. Bob might be $10.50/hr while Suzy is $5.15/hr. Finally Employees might have minimum guaranteed hours. Since Bob has been an employee for 15 years, the boss says he'll always get at least 35 hours.
The restaurant itself has demands. For example it has 3 shifts: Morning, Afternoon, and Night, and each of these shifts has a set of staffing requirements: We need 1 cook, 1 server, 1 manager in the morning, 3 cooks, 2 servers, 2 hosts, 2 managers in the afternoon, and 4 cooks, 4 servers, 3 hosts, 2 managers, 2 bussers in the evening. Each shift will have a duration, and you can figure out the cost of each shift by multiplying the duration by the employee's hourly wage.
Finally we have state and federal laws, and some basic "business" rules: No employee can work more than 8-hours without going into overtime. No employee can be scheduled for less than 2 hours (because it would suck to make a 30 minute commute for a 2 hour shift), employees can't work two overlapping shifts, etc.
Now given all these requirements, give me a schedule that has meets all requirements, and produces the lowest labor cost.
This is an example of a linear programming optimization problem.
A linear program typically consists of:
An objective function, variables, variable bounds, and constraints.
Since we want to minimize cost, our objective function is going to involve shifts that employees work, and the associated costs (shift duration * wage).
The variables in this case, are the shifts that each employee can work.
The bounds on these variables integers between 0 and 1, because either an employee is working the shift (1), or the employee is not working the shift (0). This, by the way, is a special program, called a Binary Integer Program or BIP for short, because all the variables are integers (no fractional values) and all values are either 0 or 1.
The constraints are equality/inequality constraints based on the requirements above.
For example, if Bob and Suzy can both work as cooks in the morning, then Bob_Morning_Cook1_Shift + Suzy_Morning_Cook1_Shift = 1, with Bob_Morning_Cook_Shift = {0,1} and Suzy_Morning_Cook_Shift = {0,1} due to the bounds mentioned above. These three pieces of information specify that at most only a single employee can be assigned as the first morning cook.
So once you've defined all the constraints that model your problem, you can begin to solve the problem. If a solution can be found (and depending on the constraints the problem may be infeasible), it will give you the employee assignments that produce the lowest weekly labor cost.
Linear programming is an optimization technique that involves linear constraints and a linear objective function. The constraints are written to bound the problem space, while the objective function is something that you are attempting to minimize (or possibly maximize) that satisfies the constraints. The simplex algorithm is typically used to walk along edges of the constraint intersection to find the minimum (or maximum) value of the objective function satisfying the constraints.
When setting up an LP problem, it's important to make sure that the constraints properly bound the objective function. It's possible to define constraints which result in no possible solution (e.g. x > 1 and -x > 1). This is over constrained. It's also possible to under-constrain a problem (e.g. find min x such that x < 1).
One big difference (or at least, distinguishing feature) of linear programming is that the constraints are modeled as linear equations, i.e. they're all of the form c_1 x_1 + c_2 x_2.... The standard form section of the wikipedia article gives a pretty good overview of this.
Another difference/feature is that linear programming is seeking to maximize (or minimize) ONE function - you can't effectively do multi-objective optimization.