genetic algorithm for classification and fitness evaluation - genetic-algorithm

I have been reading the part on genetic algorithms for classification in Tom Mitchell's Machine Learning book. The example given there is fairly simple: it says that if I have the following:
then the fitness function could be defined as:
I would like to apply this approach for the classification of the census-income data that has the following form:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
In this dataset the attributes are the following:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country
At the end what I want is to have a classifier that given some attributes could predict if the income of the person would be less or greater than 50000. How could I model the fitness function for this case?

Usually, Genetic Programming is used for this purpose. Here is a paper describing such a scenario: http://web.cs.mun.ca/~banzhaf/papers/ieee_taec.pdf
If you are looking for source code, you can use TinyGP by Riccardo Poli: http://cswww.essex.ac.uk/staff/rpoli/TinyGP/ However, you must first transform all attributes to numerical values.
You can also use other GP variants. I wrote an implementation of Multi Expression Programming, which is available here: http://www.mepx.org/source_code.html

Related

Exact orthogonalization of vectors in Wolfram

I have a matrix, and I need to orthogonalize its eigenvectors.
That is basically all I need, but in exact form.
Here is my Wolfram input:
(orthogonolize(eigenvectors({{146, 112, 78, 17, 122}, {112, 86, 60, 13, 94}, {78, 60, 42 , 9, 66}, {17, 13, 9, 2, 14}, {122, 94, 66, 14, 104}})))
That gives me floating-point numbers, while I need the exact forms.
Is there any way to fix this?
In Wolfram Mathematica (not WolframAlpha, which is a completely different product with different rules and different results), this input
FullSimplify[Orthogonalize[Eigenvectors[{
{146, 112, 78, 17, 122}, {112, 86, 60, 13, 94}, {78, 60, 42 , 9, 66},
{17, 13, 9, 2, 14}, {122, 94, 66, 14, 104}}]]]
returns this exact form
{{Sqrt[121/342 + 52/(9*Sqrt[35587])], Sqrt[5/38 + 18/Sqrt[35587]],
Sqrt[25/342 + 64/(9*Sqrt[35587])], Sqrt[7/38 - 26/Sqrt[35587]]/3,
2*Sqrt[2/19 - 7/Sqrt[35587]]},
{-1/3*Sqrt[121/38 - 52/Sqrt[35587]], -Sqrt[5/38 - 18/Sqrt[35587]],
Sqrt[25/38 - 64/Sqrt[35587]]/3, -1/3*Sqrt[7/38 + 26/Sqrt[35587]],
Sqrt[8/19 + 28/Sqrt[35587]]},
{3/Sqrt[35], -Sqrt[5/7], 0, 0, 1/Sqrt[35]},
{-11/Sqrt[5110], -Sqrt[5/1022], 0, Sqrt[70/73], 4*Sqrt[2/2555]},
{-17/(3*Sqrt[2774]), -7/Sqrt[2774], Sqrt[146/19]/3, Sqrt[2/1387]/3, -9*Sqrt[2/1387]}}
Think of at least two different ways you can check that for correctness before you depend on that.
The last three of those can be simplified somewhat
1/Sqrt[35]*{3,-5,0,0,1},
1/Sqrt[5110]*{-11,-5,0,70,8},
1/(3*Sqrt[2774])*{-17,-21,146,2,-54}
but I cannot yet see a way to simplify the first two to a third of their current size. Can anyone else see a way to do that? Please check these results very carefully.
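One way to check the three simplified vectors, sketched in Python with exact integer arithmetic so that no rounding is involved. The helper `dot` and the dictionary `candidates` are names introduced here for illustration. The sketch checks that each integer vector has the squared length implied by its normalisation factor, that the vectors are pairwise orthogonal, and that the matrix maps each of them to the zero vector (which would make them eigenvectors for eigenvalue 0). It does not cover the first two vectors, whose entries contain nested square roots.

```python
from itertools import combinations

M = [
    [146, 112, 78, 17, 122],
    [112, 86, 60, 13, 94],
    [78, 60, 42, 9, 66],
    [17, 13, 9, 2, 14],
    [122, 94, 66, 14, 104],
]

# Integer parts of the three simplified vectors, keyed by the square of the
# normalisation factor: 1/Sqrt[35] -> 35, 1/(3*Sqrt[2774]) -> 9*2774.
candidates = {
    35: [3, -5, 0, 0, 1],
    5110: [-11, -5, 0, 70, 8],
    9 * 2774: [-17, -21, 146, 2, -54],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Dividing by Sqrt[norm_sq] gives a unit vector exactly when dot(v, v) == norm_sq.
unit_length = all(dot(v, v) == norm_sq for norm_sq, v in candidates.items())
orthogonal = all(dot(u, v) == 0 for u, v in combinations(candidates.values(), 2))
in_null_space = all(dot(row, v) == 0 for v in candidates.values() for row in M)

print(unit_length, orthogonal, in_null_space)
```

All three checks should print True for the vectors listed above.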

Looking for an algorithm for a perfect "snake" from the center of a field?

I'm looking for a piece of code that does the following:
Start from the middle of a rectangle and move outward in a "circular" (spiral) way until the edges are reached. When the path runs past the boundary on one side, it should just skip the pixels that fall outside.
I have already tried some crazy nested for loops, but that was too much code.
Does anyone have an idea for a simple/ingenious way?
It's like starting the game Snake from the center and continuing until the full field is used. I'll use this to scan a picture (starting from the middle, to find the pixel nearest the center that has a different color).
Maybe a picture could describe it better:
Taken from this link; it requires Python and NumPy, of course.
import numpy as np

a = np.arange(7*7).reshape(7,7)

def spiral_ccw(A):
    A = np.array(A)
    out = []
    while(A.size):
        out.append(A[0][::-1])    # first row reversed
        A = A[1:][::-1].T         # cut off first row and rotate clockwise
    return np.concatenate(out)

def base_spiral(nrow, ncol):
    return spiral_ccw(np.arange(nrow*ncol).reshape(nrow, ncol))[::-1]

def to_spiral(A):
    A = np.array(A)
    B = np.empty_like(A)
    B.flat[base_spiral(*A.shape)] = A.flat
    return B

to_spiral(a)
array([[42, 43, 44, 45, 46, 47, 48],
       [41, 20, 21, 22, 23, 24, 25],
       [40, 19,  6,  7,  8,  9, 26],
       [39, 18,  5,  0,  1, 10, 27],
       [38, 17,  4,  3,  2, 11, 28],
       [37, 16, 15, 14, 13, 12, 29],
       [36, 35, 34, 33, 32, 31, 30]])
What do you think about running from the edge to the center instead? It is really easy to code: start at (0, 0), and whenever you hit an edge or an already visited pixel, turn right by 90°.
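For the center-outward order asked for in the question, a minimal sketch without NumPy could look like this. The function name `spiral_from_center` is hypothetical, and the choice of starting cell for even dimensions (`nrows // 2`, `ncols // 2`) is an assumption. The walk uses arm lengths 1, 1, 2, 2, 3, 3, ... and simply skips every position that falls outside the rectangle, so it also works for non-square fields.

```python
def spiral_from_center(nrows, ncols):
    """Yield (row, col) pairs spiralling outward from the centre cell."""
    r, c = nrows // 2, ncols // 2
    dr, dc = 0, 1                  # start by moving right
    step = 1                       # current arm length
    remaining = nrows * ncols      # cells inside the field still to visit
    if remaining > 0:
        yield r, c
        remaining -= 1
    while remaining > 0:
        for _ in range(2):         # two arms share each arm length
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < nrows and 0 <= c < ncols:
                    yield r, c     # positions outside the field are skipped
                    remaining -= 1
            dr, dc = dc, -dr       # turn: right -> down -> left -> up
        step += 1

for row, col in spiral_from_center(3, 3):
    print(row, col)
```

To find the first pixel of another color, iterate over the generator and stop at the first coordinate whose pixel differs from the center pixel.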

Hive table created successfully but data from S3 bucket is not imported

I created a table and want to load data into it from an S3 bucket.
The table is created, but the data is not imported from S3.
What could be the problem? Please help me out, thanks in advance.
The following is the series of commands and their output:
hive> CREATE TABLE contraceptive_usage_data( wife_age int, wife_edu int, husb_edu int,no_of_children_born int, wife_religion int,
> wife_now_working int, husb_occu int, stand_living int, media_exposure int, contraceptive_method_used int) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' location 's3://emr.learnings/contraceptive_data/contraceptive_usage_data_indonesia_1988';
OK
Time taken: 16.452 seconds
hive> select * from contraceptive_usage_data limit 10;
OK
Time taken: 1.966 seconds
hive>
Sample data in the S3 bucket
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
Try using the EXTERNAL keyword:
CREATE EXTERNAL TABLE contraceptive_usage_data( wife_age int, wife_edu int, husb_edu int,no_of_children_born int, wife_religion int,
wife_now_working int, husb_occu int, stand_living int, media_exposure int, contraceptive_method_used int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://emr.learnings/contraceptive_data/contraceptive_usage_data_indonesia_1988';
I think without the EXTERNAL keyword, Hive will try to create a new empty table at the location instead of loading the existing data there.
I think it's because of the space after every comma in the values.

Simple where condition does not show expected output in Hive

While trying to master Hive, I uploaded census data ("income data of people from different countries working in the USA") into an S3 bucket.
I am able to run other queries, but not the following simple one.
I'm trying to list people from different countries with an income level of >50K USD.
I have created a table in Hive and imported the data from the AWS S3 bucket. The income column is defined as a string, and its possible values are '<=50K' and '>50K'.
The following query returns an empty result set. What could be the problem here? The same SQL statement runs fine on a normal MySQL console. Why does it not return the expected result set in Hive?
hive> select country, income from census_income_data where income = '>50K';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312281227_0011, Tracking URL = http://ip-172-31-44-80.us-west-2.compute.internal:9100/jobdetails.jsp?jobid=job_201312281227_0011
Kill Command = /home/hadoop/bin/hadoop job -kill job_201312281227_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-28 13:21:05,086 Stage-1 map = 0%, reduce = 0%
2013-12-28 13:21:26,279 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:27,289 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:28,299 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:29,310 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:30,321 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:31,334 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.74 sec
2013-12-28 13:21:32,369 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.74 sec
MapReduce Total cumulative CPU time: 7 seconds 740 msec
Ended Job = job_201312281227_0011
Counters:
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.74 sec HDFS Read: 219 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 740 msec
OK
Time taken: 56.559 seconds
Following is the sample data from the dataset used in the above code
30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K
23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K
32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K
34, Private, 245487, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 45, Mexico, <=50K
25, Self-emp-not-inc, 176756, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 35, United-States, <=50K
32, Private, 186824, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K
38, Private, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K
43, Self-emp-not-inc, 292175, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, >50K
40, Private, 193524, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K
54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K
35, Federal-gov, 76845, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, Black, Male, 0, 0, 40, United-States, <=50K
43, Private, 117037, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2042, 40, United-States, <=50K
59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K
56, Local-gov, 216851, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K
19, Private, 168294, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
39, Private, 367260, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K
49, Private, 193366, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K
23, Local-gov, 190709, Assoc-acdm, 12, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K
First, run select * from census_income_data limit 20 on your table to verify that the expected values do exist in the expected column.
There might also be other characters, such as spaces, that cause the query to return 0 results.
Try the following:
select country, income from census_income_data where income like '%50%';
If it doesn't work, then the data probably ended up in the wrong columns when the table was created.
If it works try :
select country, income from census_income_data where income like '%>50K%';
If it works, you probably have other characters in the field. Try running:
select concat('INCOME:',income,'.') from census_income_data where income like '%>50K%';
and see if you get this string INCOME:>50K. exactly.
Your SQL code
select country, income from census_income_data where income = '>50K';
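uses the '=' operator to compare two strings. As far as I know, that operator takes the character set, surrounding whitespace, etc. into account, and the sample rows have a space after every comma, so the stored value is probably ' >50K' with a leading space. A LIKE pattern without wildcards matches the same way as '=', so trimming the column first is more likely to help:
select country, income from census_income_data where trim(income) = '>50K';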
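The whitespace effect can be illustrated outside Hive. The Python sketch below splits one of the sample rows on a bare comma, which is what a ',' field delimiter amounts to if the loader does no trimming (an assumption about how the table was defined, since the CREATE TABLE statement is not shown).

```python
line = ("30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, "
        "Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K")

fields = line.split(",")          # split on the bare delimiter, no trimming
income = fields[-1]

print(repr(income))               # ' >50K' (note the leading space)
print(income == ">50K")           # False: exact comparison fails
print(income.strip() == ">50K")   # True: comparison after trimming succeeds
```

This is why a wildcard pattern such as '%>50K%' or a trimmed comparison can match while a plain equality test returns nothing.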

What are some algorithms for finding a closed form function given an integer sequence?

I'm looking for a programmatic way to take an integer sequence and spit out a closed form function. Something like:
Given: 1,3,6,10,15
Return: n(n+1)/2
Samples could be useful; the language is unimportant.
This touches an extremely deep, sophisticated and active area of mathematics. The solution is damn near trivial in some cases (linear recurrences) and damn near impossible in others (think 2, 3, 5, 7, 11, 13, ....) You could start by looking at generating functions for example and looking at Herb Wilf's incredible book (cf. page 1 (2e)) on the subject but that will only get you so far.
But I think your best bet is to give up, query Sloane's comprehensive Encyclopedia of Integer Sequences when you need to know the answer, and instead spend your time reading the opinions of one of the most eccentric personalities in this deep subject.
Anyone who tells you this problem is solvable is selling you snake oil (cf. page 118 of the Wilf book (2e).)
There is no one function in general.
For the sequence you specified, The On-Line Encyclopedia of Integer Sequences finds 133 matches in its database of interesting integer sequences. I've copied the first 5 here.
A000217 Triangular numbers: a(n) = C(n+1,2) = n(n+1)/2 = 0+1+2+...+n.
0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561, 595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128, 1176, 1225, 1275, 1326, 1378, 1431
A130484 Sum {0<=k<=n, k mod 6} (Partial sums of A010875).
0, 1, 3, 6, 10, 15, 15, 16, 18, 21, 25, 30, 30, 31, 33, 36, 40, 45, 45, 46, 48, 51, 55, 60, 60, 61, 63, 66, 70, 75, 75, 76, 78, 81, 85, 90, 90, 91, 93, 96, 100, 105, 105, 106, 108, 111, 115, 120, 120, 121, 123, 126, 130, 135, 135, 136, 138, 141, 145, 150, 150, 151, 153
A130485 Sum {0<=k<=n, k mod 7} (Partial sums of A010876).
0, 1, 3, 6, 10, 15, 21, 21, 22, 24, 27, 31, 36, 42, 42, 43, 45, 48, 52, 57, 63, 63, 64, 66, 69, 73, 78, 84, 84, 85, 87, 90, 94, 99, 105, 105, 106, 108, 111, 115, 120, 126, 126, 127, 129, 132, 136, 141, 147, 147, 148, 150, 153, 157, 162, 168, 168, 169, 171, 174, 178, 183
A104619 Write the natural numbers in base 16 in a triangle with k digits in the k-th row, as shown below. Sequence gives the leading diagonal.
1, 3, 6, 10, 15, 2, 1, 1, 14, 3, 2, 2, 5, 12, 4, 4, 4, 13, 6, 7, 11, 6, 9, 9, 10, 7, 12, 13, 1, 0, 1, 10, 5, 1, 12, 8, 1, 1, 14, 1, 9, 7, 1, 4, 3, 1, 2, 2, 1, 3, 4, 2, 7, 9, 2, 14, 1, 2, 8, 12, 2, 5, 10, 3, 5, 11, 3, 8, 15, 3, 14, 6, 3, 7, 0, 4, 3, 13, 4, 2, 13, 4, 4, 0, 5, 9, 6, 5, 1, 15, 5, 12, 11, 6
A037123 a(n) = a(n-1) + Sum of digits of n.
0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 46, 48, 51, 55, 60, 66, 73, 81, 90, 100, 102, 105, 109, 114, 120, 127, 135, 144, 154, 165, 168, 172, 177, 183, 190, 198, 207, 217, 228, 240, 244, 249, 255, 262, 270, 279, 289, 300, 312, 325, 330, 336, 343, 351, 360, 370, 381
If you restrict yourself to polynomial functions, this is easy to code up, and only mildly tedious to solve by hand.
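Let $f(n) = a_0 + a_1 n + \dots + a_k n^k$, for some unknown coefficients $a_0, \dots, a_k$.
Now solve the equations
$f(1) = s_1$
$f(2) = s_2$
…
where $s_1, s_2, \dots$ are the given terms, which is simply a system of linear equations in the $a_i$.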
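The polynomial approach can be sketched in Python with exact rational arithmetic. This is only an illustration under assumptions: the function name `fit_polynomial` is made up, the sequence is assumed to be indexed from `start=1`, and the polynomial degree is taken to be one less than the number of terms. It builds the Vandermonde system and solves it by Gauss-Jordan elimination.

```python
from fractions import Fraction

def fit_polynomial(values, start=1):
    """Return [a_0, ..., a_k] such that f(start + i) == values[i]."""
    k = len(values)
    # Augmented matrix: row i is [1, n, n^2, ..., n^(k-1) | values[i]].
    rows = [[Fraction(start + i) ** j for j in range(k)] + [Fraction(v)]
            for i, v in enumerate(values)]
    for col in range(k):
        # Pick a row with a nonzero entry in this column and normalise it.
        pivot = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][k] for i in range(k)]

coeffs = fit_polynomial([1, 3, 6, 10, 15])
print(coeffs)
```

For the example sequence the coefficients come out as 0, 1/2, 1/2, 0, 0, that is f(n) = n/2 + n^2/2 = n(n+1)/2.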
If your data is guaranteed to be expressible as a polynomial, I think you would be able to use R (or any suite that offers regression fitting of data). If your correlation is exactly 1, then the line is a perfect fit to describe the series.
There's a lot of statistics that goes into regression analysis, and I am not familiar enough with even the basics of calculation to give you much detail.
But, this link to regression analysis in R might be of assistance
The Axiom computer algebra system includes a package for this purpose. You can read its documentation here.
Here's the output for your example sequence in FriCAS (a fork of Axiom):
(3) -> guess([1, 3, 6, 10, 15])
                 2
                n  + 3n + 2
(3) [[function= -----------,order= 0]]
                     2
Type: List(Record(function: Expression(Integer),order: NonNegativeInteger))
I think your problem is ill-posed. Given any finite number of integers in a sequence with no generating function, the next element can be anything.
You need to assume something about the sequence. Is it geometric? Arithmetic?
If your sequence comes from a polynomial then divided differences will find that polynomial expressed in terms of the Newton basis or binomial basis. See this.
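A small Python sketch of the difference-table idea: for terms at consecutive integer positions, divided differences reduce (up to factorials) to plain forward differences, and the leading entry of each difference row is the coefficient of the binomial basis term C(n, k). The function names `newton_coefficients` and `evaluate` are hypothetical, and the sequence is assumed to be indexed from n = 0.

```python
from math import comb

def newton_coefficients(values):
    """Leading entries of the successive forward-difference rows."""
    coeffs = []
    row = list(values)
    while row:
        coeffs.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return coeffs

def evaluate(coeffs, n):
    """f(n) = sum over k of coeffs[k] * C(n, k)."""
    return sum(c * comb(n, k) for k, c in enumerate(coeffs))

coeffs = newton_coefficients([1, 3, 6, 10, 15])
print(coeffs)               # [1, 2, 1, 0, 0]
print(evaluate(coeffs, 5))  # 21, the next term if the sequence is polynomial
```

Here f(n) = 1 + 2n + n(n-1)/2 = (n^2 + 3n + 2)/2 with n starting at 0, which agrees with the FriCAS output quoted earlier.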
There is no general answer; a simple method can be implemented by using Padé approximants. In short: assume your sequence is the sequence of coefficients of the Taylor expansion of an unknown function, then apply an algorithm (similar to the continued-fraction algorithm) in order to "simplify" this Taylor expansion (more precisely: find a rational function very close to the initial, truncated function). The Maxima program can do it: look at "pade" on the page: http://maxima.sourceforge.net/docs/manual/maxima_28.html
Another answer mentions the "guess" package in the FriCAS fork of Axiom (see the previous answer by jmbr). If I am not mistaken, this package is itself inspired by the Rate program by Christian Krattenthaler; you can find it here: http://www.mat.univie.ac.at/~kratt/rate/rate.html Looking at its source may show you other methods.
