Simple where condition does not show expected output in Hive - hadoop

While learning Hive, I'm uploading census data (income data of people from different countries working in the USA) into an S3 bucket.
I am able to run other queries, but the following simple query returns nothing.
I'm trying to list people from different countries with an income level >50K USD.
I have created a table in Hive and imported the data from the AWS S3 bucket. The income column is defined as a string, and its possible values are '<=50K' and '>50K'.
The following query returns an empty result set. What could be the problem here? The same SQL statement runs fine on a normal MySQL console. Why does it not return the expected result set in Hive?
hive> select country, income from census_income_data where income = '>50K';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312281227_0011, Tracking URL = http://ip-172-31-44-80.us-west-2.compute.internal:9100/jobdetails.jsp?jobid=job_201312281227_0011
Kill Command = /home/hadoop/bin/hadoop job -kill job_201312281227_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-28 13:21:05,086 Stage-1 map = 0%, reduce = 0%
2013-12-28 13:21:26,279 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:27,289 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:28,299 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:29,310 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:30,321 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:31,334 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:32,369 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.74 sec
MapReduce Total cumulative CPU time: 7 seconds 740 msec
Ended Job = job_201312281227_0011
Counters:
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.74 sec HDFS Read: 219 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 740 msec
OK
Time taken: 56.559 seconds
The following is sample data from the dataset used above:
30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K
23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K
32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K
34, Private, 245487, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 45, Mexico, <=50K
25, Self-emp-not-inc, 176756, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 35, United-States, <=50K
32, Private, 186824, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K
38, Private, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K
43, Self-emp-not-inc, 292175, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, >50K
40, Private, 193524, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K
54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K
35, Federal-gov, 76845, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, Black, Male, 0, 0, 40, United-States, <=50K
43, Private, 117037, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2042, 40, United-States, <=50K
59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K
56, Local-gov, 216851, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K
19, Private, 168294, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
39, Private, 367260, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K
49, Private, 193366, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K
23, Local-gov, 190709, Assoc-acdm, 12, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K

First, run select * from census_income_data limit 20 to verify that the expected values exist in the expected column.
There might also be other characters, like spaces, that cause the query to return 0 results.
Try the following:
select country, income from census_income_data where income like '%50%';
If that returns nothing, you have probably misplaced the data when creating the table.
If it works, try:
select country, income from census_income_data where income like '%>50K%';
If that works too, you probably have extra characters in the field; try running:
select concat('INCOME:',income,'.') from census_income_data where income like '%>50K%';
and see whether you get the string INCOME:>50K. exactly.
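If the concat output shows extra characters around the value, note that with this dataset's ', ' separators each field likely keeps a leading space, so the stored value would be ' >50K' rather than '>50K'. Assuming Hive's built-in trim() function, a direct fix is:
select country, income from census_income_data where trim(income) = '>50K';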

Your SQL code
select country, income from census_income_data where income = '>50K';
uses the '=' operator to compare two strings. As far as I know, that operator takes the character set, surrounding whitespace, etc. into account. You may have more luck with the LIKE operator.
select country, income from census_income_data where income LIKE ">50K";
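Note that in Hive, LIKE without any wildcards behaves as an exact match, so the query above only helps if the stored value really has nothing around it. A hedged variant with a leading wildcard also tolerates the leading space that the ', ' delimiter in the sample data would leave behind:
select country, income from census_income_data where income LIKE '%>50K';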

Related

Similarity between two ordered lists of numbers with fuzziness

I have ordered lists of numbers (like barcode positions, spectral lines) that I am trying to compare for similarity. Ideally, I would like to compare two lists to get a value from 1.0 (match) degrading gracefully to 0.
The lists could be offset by an arbitrary amount, and that should not degrade the match. The diffs between adjacent items are the most applicable characterization.
Due to noise in the system, some items may be missing (alternatively, extra items may be inserted, depending on point of view).
The diff values may be reordered.
The diff values may be scaled.
Multiple transformations above may be applied and each should reduce similarity proportionally.
Here is some test data:
# deltas
d = [100+(i*10) for i in xrange(10)] # [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
d_swap = d[:4] + [d[5]] + [d[4]] + d[6:] # [100, 110, 120, 130, 150, 140, 160, 170, 180, 190]
# absolutes
a = [1000+j for j in [0]+[sum(d[:i+1]) for i in xrange(len(d))]] # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [i+3000 for i in a] # [4000, 4100, 4210, 4330, 4460, 4600, 4750, 4910, 5080, 5260, 5450]
a_rm = a[:2] + a[3:] # [1000, 1100, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_add = a[:7] + [(a[6]+a[7])/2] + a[7:] # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1830, 1910, 2080, 2260, 2450]
a_swap = [1000+j for j in [0]+[sum(d_swap[:i+1]) for i in xrange(len(d_swap))]] # [1000, 1100, 1210, 1330, 1460, 1610, 1750, 1910, 2080, 2260, 2450]
a_stretch = [1000+j for j in [0]+[int(sum(d[:i+1])*1.1) for i in xrange(len(d))]] # [1000, 1110, 1231, 1363, 1506, 1660, 1825, 2001, 2188, 2386, 2595]
a_squeeze = [1000+j for j in [0]+[int(sum(d[:i+1])*0.9) for i in xrange(len(d))]] # [1000, 1090, 1189, 1297, 1414, 1540, 1675, 1819, 1972, 2134, 2305]
Sim(a, a_offs) should be 1.0 since offset is not considered a penalty.
Sim(a, a_rm) and Sim(a, a_add) should be about 0.91 because 10 of 11 or 11 of 12 match.
Sim(a, a_swap) should be about 0.96 because one diff is out of place (possibly with a further penalty based on distance if moved more than one position).
Sim(a, a_stretch) and Sim(a, a_squeeze) should be about 0.9 because diffs were scaled by about 1 part in 10.
I am thinking of something like difflib.SequenceMatcher, but one that works on numeric values with fuzziness instead of hard-compared hashables. It would also need to retain some awareness of the diff (first derivative) relationship.
This seems to be a dynamic programming problem, but I can't figure out how to construct an appropriate cost metric.
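One possible starting point (my own sketch, not a full solution: it handles offsets and missing/extra items, but not reordering or scaling) is an edit-distance-style dynamic program over the diff sequences, where the substitution cost is the relative difference between two diffs instead of a hard equality test:
def diffs(xs):
    # First derivative: constant offsets cancel out.
    return [b - a for a, b in zip(xs, xs[1:])]

def fuzzy_sim(xs, ys, gap=1.0):
    da, db = diffs(xs), diffs(ys)
    n, m = len(da), len(db)

    def cost(u, v):
        # Relative difference in [0, 1]; 0.0 for identical diffs.
        return min(1.0, abs(u - v) / max(abs(u), abs(v), 1e-9))

    # Standard edit-distance table over the two diff sequences.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + cost(da[i - 1], db[j - 1]),
                           dp[i - 1][j] + gap,   # item missing from ys
                           dp[i][j - 1] + gap)   # extra item in ys
    # Normalize total cost into a 0..1 similarity.
    return 1.0 - dp[n][m] / max(n, m, 1)
With the test data above, fuzzy_sim(a, a_offs) gives 1.0 because the diff sequences are identical, while fuzzy_sim(a, a_rm) drops below 1.0 in rough proportion to the one missing item.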

genetic algorithm for classification and fitness evaluation

I have been reading Tom Mitchell's Machine Learning book, specifically the part on Genetic Algorithms for classification. The example given is fairly simple; they say that if I have the following:
then the fitness function could be defined as:
I would like to apply this approach to the classification of the census-income data, which has the following form:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
In this dataset the attributes are the following:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country
In the end, what I want is a classifier that, given some attributes, can predict whether a person's income is less than or greater than 50,000. How could I model the fitness function for this case?
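For what it's worth, Mitchell's own example defines fitness as the squared fraction of training examples a hypothesis classifies correctly. A minimal sketch of that idea for this dataset (classify() is a hypothetical helper that applies an evolved rule set to one record):
def fitness(hypothesis, training_data):
    # Sketch: accuracy-squared fitness in the style of Mitchell's example.
    # classify(hypothesis, row) is a hypothetical helper; row[-1] is the
    # income label ('<=50K' or '>50K').
    correct = sum(1 for row in training_data
                  if classify(hypothesis, row) == row[-1])
    return (correct / float(len(training_data))) ** 2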
Usually, Genetic Programming is used for this purpose. Here is a paper describing such a scenario: http://web.cs.mun.ca/~banzhaf/papers/ieee_taec.pdf
If you are looking for source code, you can use TinyGP by Riccardo Poli: http://cswww.essex.ac.uk/staff/rpoli/TinyGP/ However, first of all you must transform all attributes to numerical values.
You can also use other GP variants. I did an implementation of Multi Expression Programming, which is here: http://www.mepx.org/source_code.html
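As a sketch of that transformation (my own illustration, not from TinyGP), one can map each categorical value to an integer index per column and pass numeric fields through unchanged:
def encode(rows):
    # rows: list of records, each a list of string fields (e.g. line.split(',')).
    # One value-to-index mapping per column, built lazily.
    mappings = [{} for _ in rows[0]]
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            value = value.strip()
            try:
                out.append(float(value))            # continuous attribute
            except ValueError:                      # categorical attribute
                m = mappings[col]
                out.append(m.setdefault(value, len(m)))
        encoded.append(out)
    return encoded, mappings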

Hive table created successfully but data from S3 bucket is not imported

I created a table and want to load the data from an S3 bucket.
The table is created, but the data is not imported from S3.
What could be the problem? Please help me out, thanks in advance.
Following is the series of commands and the respective output:
hive> CREATE TABLE contraceptive_usage_data( wife_age int, wife_edu int, husb_edu int,no_of_children_born int, wife_religion int,
> wife_now_working int, husb_occu int, stand_living int, media_exposure int, contraceptive_method_used int) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' location 's3://emr.learnings/contraceptive_data/contraceptive_usage_data_indonesia_1988';
OK
Time taken: 16.452 seconds
hive> select * from contraceptive_usage_data limit 10;
OK
Time taken: 1.966 seconds
hive>
Sample data in the S3 bucket
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
Try using the EXTERNAL keyword:
CREATE EXTERNAL TABLE contraceptive_usage_data( wife_age int, wife_edu int, husb_edu int,no_of_children_born int, wife_religion int,
wife_now_working int, husb_occu int, stand_living int, media_exposure int, contraceptive_method_used int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://emr.learnings/contraceptive_data/contraceptive_usage_data_indonesia_1988';
I think without the EXTERNAL keyword, Hive will try to create a new empty table at the location instead of loading the existing data there.
I think it's because of the space after every ',' in the values.
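If the space after each comma is indeed the problem, one hedged workaround (assuming you can preprocess a local copy of the file before uploading it to S3) is to normalize the delimiter first, for example with sed:
sed 's/, /,/g' input.data > cleaned.data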

Mathematica lists in a weird format

Hi, I have the following simple program:
joint = Table[0, {i, Length[labelnames]}, {j, 16}];
For[time = 1,
 time < Length[topics], time++
 Do[
  joint[[l, t]]++, {l, labelsForTime[time]}, {t, topics[[time]]}
 ]
]
As a result, joint is:
{{0, 1267, 90, 0, 0, 58, 1358, 2, 25, 1, 0, 0, 6, 0, 2585,
0}, (7507 + List)[111, 773, 3302, 8092, 405, 1776, 4203, 153, 9551,
118, 9, 2260, 17, 665, 5586, 0], (3288 + List)[0, 43, 46, 716, 0,
120, 20, 2, 576, 0, 0, 246, 0, 0, 118, 0], (382 + List)[7, 80, 191,
87, 1, 38, 2887, 3, 1967, 0, 5, 72
....
Notice the (7507 + List), (3288 + List), and other similar elements in the output. I just can't figure out what these are, or how they got into joint, which should be a simple list of lists.
Aren't you missing a comma after time++? (I can't run your code because there are too many unknown variables...)
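For reference, here is the loop with the suggested comma added after time++, so that the increment and the body are separate arguments of For:
For[time = 1, time < Length[topics], time++,
 Do[
  joint[[l, t]]++,
  {l, labelsForTime[time]}, {t, topics[[time]]}
 ]
]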

fitting n variable height images into 3 (similar length) column layout

I'm looking to make a 3-column layout similar to that of piccsy.com. Given a number of images of the same width but varying height, what is an algorithm to order them so that the difference in column lengths is minimal? Ideally in Python or JavaScript...
Thanks a lot for your help in advance!
Martin
How many images?
If you limit the maximum page size, and have a value for the minimum picture height, you can calculate the maximum number of images per page. You would need this when evaluating any solution.
I think there were 27 pictures on the link you gave.
The following uses the first_fit algorithm mentioned by Robin Green earlier but then improves on this by greedy swapping.
The swapping routine finds the column that is furthest away from the average column height then systematically looks for a swap between one of its pictures and the first picture in another column that minimizes the maximum deviation from the average.
I used a random sample of 30 pictures with heights in the range five to 50 'units'. The convergence was swift in my case and improved significantly on the first_fit algorithm.
The code (Python 3.2):
def first_fit(items, bincount=3):
    items = sorted(items, reverse=1)  # New - improves first fit.
    bins = [[] for c in range(bincount)]
    binsizes = [0] * bincount
    for item in items:
        minbinindex = binsizes.index(min(binsizes))
        bins[minbinindex].append(item)
        binsizes[minbinindex] += item
    average = sum(binsizes) / float(bincount)
    maxdeviation = max(abs(average - bs) for bs in binsizes)
    return bins, binsizes, average, maxdeviation

def swap1(columns, colsize, average, margin=0):
    'See if you can do a swap to smooth the heights'
    colcount = len(columns)
    maxdeviation, i_a = max((abs(average - cs), i)
                            for i, cs in enumerate(colsize))
    col_a = columns[i_a]
    for pic_a in set(col_a):  # use set as if same height then only do once
        for i_b, col_b in enumerate(columns):
            if i_a != i_b:  # Not same column
                for pic_b in set(col_b):
                    if abs(pic_a - pic_b) > margin:  # Not same heights
                        # new heights if swapped
                        new_a = colsize[i_a] - pic_a + pic_b
                        new_b = colsize[i_b] - pic_b + pic_a
                        if all(abs(average - new) < maxdeviation
                               for new in (new_a, new_b)):
                            # Better to swap (in-place)
                            colsize[i_a] = new_a
                            colsize[i_b] = new_b
                            columns[i_a].remove(pic_a)
                            columns[i_a].append(pic_b)
                            columns[i_b].remove(pic_b)
                            columns[i_b].append(pic_a)
                            maxdeviation = max(abs(average - cs)
                                               for cs in colsize)
                            return True, maxdeviation
    return False, maxdeviation

def printit(columns, colsize, average, maxdeviation):
    print('columns')
    pp(columns)
    print('colsize:', colsize)
    print('average, maxdeviation:', average, maxdeviation)
    print('deviations:', [abs(average - cs) for cs in colsize])
    print()

if __name__ == '__main__':
    ## Some data
    #import random
    #heights = [random.randint(5, 50) for i in range(30)]
    ## Here's some from the above, but 'fixed'.
    from pprint import pprint as pp
    heights = [45, 7, 46, 34, 12, 12, 34, 19, 17, 41,
               28, 9, 37, 32, 30, 44, 17, 16, 44, 7,
               23, 30, 36, 5, 40, 20, 28, 42, 8, 38]
    columns, colsize, average, maxdeviation = first_fit(heights)
    printit(columns, colsize, average, maxdeviation)
    while 1:
        swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
        printit(columns, colsize, average, maxdeviation)
        if not swapped:
            break
        #input('Paused: ')
The output:
columns
[[45, 12, 17, 28, 32, 17, 44, 5, 40, 8, 38],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 9, 37, 44, 30, 20, 28]]
colsize: [286, 267, 248]
average, maxdeviation: 267.0 19.0
deviations: [19.0, 0.0, 19.0]
columns
[[45, 12, 17, 28, 17, 44, 5, 40, 8, 38, 9],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 34, 37, 44, 30, 20, 28, 32]]
colsize: [263, 267, 271]
average, maxdeviation: 267.0 4.0
deviations: [4.0, 0.0, 4.0]
columns
[[45, 12, 17, 17, 44, 5, 40, 8, 38, 9, 34],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 37, 44, 30, 20, 28, 32, 28]]
colsize: [269, 267, 265]
average, maxdeviation: 267.0 2.0
deviations: [2.0, 0.0, 2.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
columns
[[45, 12, 17, 17, 44, 5, 8, 38, 9, 34, 37],
[7, 34, 12, 19, 41, 30, 16, 7, 23, 36, 42],
[46, 44, 30, 20, 28, 32, 28, 40]]
colsize: [266, 267, 268]
average, maxdeviation: 267.0 1.0
deviations: [1.0, 0.0, 1.0]
Nice problem.
Here's the info on the reverse-sorting mentioned in my separate comment below.
>>> h = sorted(heights, reverse=1)
>>> h
[46, 45, 44, 44, 42, 41, 40, 38, 37, 36, 34, 34, 32, 30, 30, 28, 28, 23, 20, 19, 17, 17, 16, 12, 12, 9, 8, 7, 7, 5]
>>> columns, colsize, average, maxdeviation = first_fit(h)
>>> printit(columns, colsize, average, maxdeviation)
columns
[[46, 41, 40, 34, 30, 28, 19, 12, 12, 5],
[45, 42, 38, 36, 30, 28, 17, 16, 8, 7],
[44, 44, 37, 34, 32, 23, 20, 17, 9, 7]]
colsize: [267, 267, 267]
average, maxdeviation: 267.0 0.0
deviations: [0.0, 0.0, 0.0]
With the reverse-sorting in place, this extra code, appended to the bottom of the above code (inside the 'if __name__ == ...' block), will run extra trials on random data (note that the commented-out import random near the top of the data section must be uncommented):
    for trial in range(2, 11):
        print('\n## Trial %i' % trial)
        heights = [random.randint(5, 50) for i in range(random.randint(5, 50))]
        print('Pictures:', len(heights))
        columns, colsize, average, maxdeviation = first_fit(heights)
        print('average %7.3f' % average, '\nmaxdeviation:')
        print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
        swapcount = 0
        while maxdeviation:
            swapped, maxdeviation = swap1(columns, colsize, average, maxdeviation)
            if not swapped:
                break
            print('%5.2f%% = %6.3f' % ((maxdeviation * 100. / average), maxdeviation))
            swapcount += 1
        print('swaps:', swapcount)
The extra output shows the effect of the swaps:
## Trial 2
Pictures: 11
average 72.000
maxdeviation:
9.72% = 7.000
swaps: 0
## Trial 3
Pictures: 14
average 118.667
maxdeviation:
6.46% = 7.667
4.78% = 5.667
3.09% = 3.667
0.56% = 0.667
swaps: 3
## Trial 4
Pictures: 46
average 470.333
maxdeviation:
0.57% = 2.667
0.35% = 1.667
0.14% = 0.667
swaps: 2
## Trial 5
Pictures: 40
average 388.667
maxdeviation:
0.43% = 1.667
0.17% = 0.667
swaps: 1
## Trial 6
Pictures: 5
average 44.000
maxdeviation:
4.55% = 2.000
swaps: 0
## Trial 7
Pictures: 30
average 295.000
maxdeviation:
0.34% = 1.000
swaps: 0
## Trial 8
Pictures: 43
average 413.000
maxdeviation:
0.97% = 4.000
0.73% = 3.000
0.48% = 2.000
swaps: 2
## Trial 9
Pictures: 33
average 342.000
maxdeviation:
0.29% = 1.000
swaps: 0
## Trial 10
Pictures: 26
average 233.333
maxdeviation:
2.29% = 5.333
1.86% = 4.333
1.43% = 3.333
1.00% = 2.333
0.57% = 1.333
swaps: 4
This is the offline makespan minimisation problem, which I think is equivalent to the multiprocessor scheduling problem. Instead of jobs you have images, and instead of job durations you have image heights, but it's exactly the same problem. (The fact that it involves space instead of time doesn't matter.) So any algorithm that (approximately) solves either of them will do.
Here's an algorithm (called First Fit Decreasing) that will get you a very compact arrangement, in a reasonable amount of time. There may be a better algorithm but this is ridiculously simple.
Sort the images in order from tallest to shortest.
Take the next image and place it in the shortest column.
(If multiple columns are the same height (and shortest) pick any one.)
Repeat step 2 until no images remain.
When you're done, you can rearrange the elements in each column however you choose if you don't like the tallest-to-shortest look.
Here's one:
// Create initial solution
<run First Fit Decreasing algorithm first>

// Calculate "error", i.e. maximum height difference after running FFD
err = (maximum_height - minimum_height)
minerr = err

// Run simple greedy optimization and random search
repeat for a number of steps: // e.g. 1000 steps
    <find any two random images a and b from two different columns such that
     swapping a and b decreases the error>
    if <found>:
        swap a and b
        err = (maximum_height - minimum_height)
        if (err < minerr):
            <store as best solution so far> // X
    else:
        swap two random images from two columns
        err = (maximum_height - minimum_height)

<output the best solution stored on line marked with X>
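A hedged Python rendering of this pseudocode (my own simplification, not the answerer's code: it proposes random swaps, accepts any improving one, occasionally accepts a worsening one as the random-search escape from local optima, and remembers the best layout seen, i.e. the line marked X):
import copy
import random

def optimize(columns, colsize, steps=1000, p_random=0.1):
    # error = tallest column minus shortest column
    err = max(colsize) - min(colsize)
    best = (err, copy.deepcopy(columns))
    for _ in range(steps):
        # Pick two random images from two different columns.
        i_a, i_b = random.sample(range(len(columns)), 2)
        if not columns[i_a] or not columns[i_b]:
            continue
        a = random.choice(columns[i_a])
        b = random.choice(columns[i_b])
        trial = colsize[:]
        trial[i_a] += b - a
        trial[i_b] += a - b
        new_err = max(trial) - min(trial)
        # Greedy step, with an occasional random (non-improving) swap.
        if new_err < err or random.random() < p_random:
            columns[i_a].remove(a)
            columns[i_a].append(b)
            columns[i_b].remove(b)
            columns[i_b].append(a)
            colsize[:] = trial
            err = new_err
            if err < best[0]:
                best = (err, copy.deepcopy(columns))  # line marked X
    return best
Starting from the first_fit() result above, it would be called as best_err, best_columns = optimize(columns, colsize).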
