I'm looking for an algorithm that classifies differently formatted 10-digit (mostly) integer keys. The training data set looks like this:
+------------+----------------+
| key | classification |
+------------+----------------+
| 1000012355 | US |
| 1000045331 | US |
| 0000123101 | DE |
| 0003453202 | DE |
| 000K213411 | ES |
| 000K243221 | ES |
+------------+----------------+
The keys originate from different systems and are created in different manners. There is a large training data set available. While I assume that some parts of these keys are random, the overall structure is not.
Any help will be appreciated.
Before building models, training, and predicting, it's better to analyze the problem first. You assumed that some parts of the keys are random while the structure is not; you need to explore the data set to test this hypothesis and, according to the distribution of the data, determine which model to use.
Convert each string to a vector by treating each character as a categorical feature and one-hot encoding it; you will get a sparse, high-dimensional matrix. After this step you can compute statistics, analyze, model, and so on with the training data.
Then you need to analyze the data. One simple and effective method is visual analysis. For high-dimensional data you can use Andrews curves, parallel coordinates, and so on. You can also use dimensionality-reduction methods such as PCA or ICA and then visualize the low-dimensional data.
Depending on your visualization results, you can choose your model. If, based on the feature distribution, the different categories are easily separated, you can use almost any classification algorithm, such as LR or SVM, or even clustering. If it's a multi-class problem, you can use OVO or OVR. If the visualization is poor and the distinction between classes is not obvious, you may need to do some feature engineering, or try tree models and ensemble learning methods.
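As a rough sketch of that exploration step (assuming the training data sits in a hypothetical keys.csv with key and classification columns, and using pandas, scikit-learn and matplotlib):

# Minimal sketch: one-hot encode the key characters, then look at a 2-D PCA
# projection colored by class. "keys.csv" is a hypothetical file name.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

df = pd.read_csv("keys.csv", dtype={"key": str})        # keep leading zeros
chars = df["key"].apply(list).tolist()                  # one categorical feature per character position
X = OneHotEncoder(handle_unknown="ignore").fit_transform(chars).toarray()

coords = PCA(n_components=2).fit_transform(X)           # project the sparse features to 2-D
for label in df["classification"].unique():
    mask = (df["classification"] == label).to_numpy()
    plt.scatter(coords[mask, 0], coords[mask, 1], label=label, s=10)
plt.legend()
plt.show()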
You could do a one-hot encoding of each character, and concatenate these.
That is, say you have 20 possible characters that each of these 10 characters in the key can take on. You could then convert each character to a 20-length vector of zeros, with a one in the position corresponding to the particular character. You would then have an overall feature vector of length 10 * 20 = 200. You could then feed this into any classification algorithm as inputs, with the target outputs being the possible countries.
If this is truly deterministic, and the keys can be separated, a decision tree might find the perfect solution. Or even logistic regression? If there is some 'fuzziness' then something like Random Forest might work better.
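A minimal sketch of that pipeline with scikit-learn, assuming keys and labels are already loaded as Python lists (e.g. keys = ["1000012355", ...]):

from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X = [list(k) for k in keys]                         # one categorical feature per character position
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),         # 10 positions x alphabet size -> sparse features
    DecisionTreeClassifier(),                       # swap in LogisticRegression / RandomForest as needed
)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))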
Related
I have a dataset where each record can contain a different number of features.
There are 56 features in total, and each record can contain from 1 to 56 of them.
Each feature is like a flag: it either exists in the record or not, and if it exists there is an associated double value.
An example of the dataset is this:
I would like to know if it is possible to train my kNN algorithm using different features for each record, so that for example one record has 3 features plus a label, another has 4 features plus a label, etc.
I am trying to implement this in Python, but I have no idea how to do it.
Yes it is definitely possible. The one thing you need to think about is the distance measure.
The default distance used for kNN classifiers is usually Euclidean distance. However, Euclidean distance requires records (vectors) with an equal number of features (dimensions).
The distance measure you use, highly depends on what you think should make records similar.
If you have a correspondence between features of two records, so you know that feature i of record x describes the same feature as feature i from record y you can adapt Euclidean distance. For example you could either ignore missing dimensions (such that they don't add to the distance if a feature is missing in one record) or penalize missing dimensions (such that a certain penalty value is added whenever a feature is missing in a record).
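A minimal sketch of that adapted distance with a tiny hand-rolled kNN, assuming each record is stored as a fixed-length NumPy vector of the 56 possible features with NaN where a feature is absent (the names and the penalty parameter are illustrative):

import numpy as np

def masked_euclidean(a, b, penalty=0.0):
    # Distance over the dimensions present in both records; each missing
    # dimension either contributes nothing (penalty=0) or a fixed penalty.
    both = ~np.isnan(a) & ~np.isnan(b)
    dist = np.sqrt(np.sum((a[both] - b[both]) ** 2))
    return dist + penalty * np.sum(~both)

def knn_predict(X_train, y_train, x_new, k=3, penalty=0.0):
    # X_train: 2-D array of records, y_train: 1-D array of labels.
    dists = [masked_euclidean(row, x_new, penalty) for row in X_train]
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote among the k nearest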
If you do not have a correspondence between the features of two records, then you would have to look at set distances, e.g., minimum matching distance or Hausdorff distance.
Every instance in your dataset should be represented by the same number of features. If you have data with a variable number of features (e.g. each data point is a vector of x and y where each instance has a different number of points), then you should treat the absent features as missing values.
Therefore you need to deal with missing values. For example:
Replace missing values with the mean value for each column.
Select an algorithm that is able to handle missing values such as Decision trees.
Use a model that is able to predict missing values.
EDIT
First of all you need to bring the data into a better format. Currently, each feature is represented by two columns, which is not a very nice technique. Therefore I would suggest restructuring the data as follows:
+----+----------+----------+----------+-------+
| ID | Feature1 | Feature2 | Feature3 | Label |
+----+----------+----------+----------+-------+
| 1  | 15.12    | ?        | 56.65    | True  |
| 2  | ?        | 23.6     | ?        | True  |
| 3  | ?        | 12.3     | ?        | False |
+----+----------+----------+----------+-------+
Then you can either replace missing values (denoted with ?) with 0 (this depends on the "meaning" of each feature) or use one of the techniques that I've already mentioned before.
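For instance, a minimal sketch of that with pandas and scikit-learn (the DataFrame below just mirrors the small example table above):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Restructured data: one column per feature, NaN where a feature is missing.
df = pd.DataFrame({
    "Feature1": [15.12, np.nan, np.nan],
    "Feature2": [np.nan, 23.6, 12.3],
    "Feature3": [56.65, np.nan, np.nan],
    "Label":    [True, True, False],
})

X = df.drop(columns="Label")
X_zero = X.fillna(0)                                        # option 1: fill with a constant
X_mean = SimpleImputer(strategy="mean").fit_transform(X)    # option 2: column-mean imputation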
I built a machine learning model to predict a value Y'. For this, I used the log of Y for data scaling.
Since the predicted Y' and the actual Y are on the log scale, I have to convert the log values of Y and Y' back with the exponential function.
BUT there is huge distortion for values above exp(7) ≈ 1096 (i.e. log values above 7)... It produces a large MSE (error).
How can I avoid this huge distortion? (Generally, I need to get values over 1000.)
Thanks!!
For this, I used the log of Y for data scaling.
Not for scaling, but to make the target variable's distribution closer to normal.
If your error grows as the real target value grows, it means the model simply can't fit large values well enough. Usually this can be addressed by cleaning the data (removing outliers) or by trying another ML model.
UPDATE
You can run KFold and, for each fold, calculate the MSE/MAE between the predicted and real values. Then take the cases with big errors and look at which parameter/feature values they have.
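A minimal sketch of that procedure with scikit-learn, assuming X and the log-transformed target y_log are already loaded as NumPy arrays (the regressor here is just a placeholder):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

model = RandomForestRegressor()                 # or whatever regressor you are using
worst = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y_log[train_idx])
    pred_log = model.predict(X[test_idx])
    # Compare on the original scale, where the distortion shows up.
    err = np.abs(np.exp(pred_log) - np.exp(y_log[test_idx]))
    print("fold MSE (original scale):", mean_squared_error(np.exp(y_log[test_idx]), np.exp(pred_log)))
    worst.extend(test_idx[np.argsort(err)[-10:]])   # indices of the 10 worst rows in this fold

# Now inspect the feature values of the worst-predicted rows, e.g. X[worst].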
You can eliminate cases with big errors, but it's usually dangerous.
In general, a bad fit on large values means that you did not remove outliers from your original dataset. Plot histograms and scatter plots and make sure that you don't have any.
Check categorical variables: maybe some categories are rare (<= 5% of the data). If so, group them.
Or you need to create 2 models: one for small values, one for big ones.
For example, our input file in.txt:
naturalistic 10
coppering 20
artless 30
After the command sort in.txt:
artless 30
coppering 20
naturalistic 10
After the command sort -n -k 2 in.txt:
naturalistic 10
coppering 20
artless 30
My question: how can I keep each line intact while sorting according to a column?
I want the whole line to stay the same while its position in the output changes.
Which algorithm or piece of code is useful here? Is this about file reading or about the sorting facility?
Standard UNIX sort doesn't document which algorithm it uses. It may even choose a different algorithm depending on such things as the size of the input or the sort options.
The Wikipedia page on sorting algorithms lists many sorting algorithms you can choose from.
If you want a stable sort, there are plenty of options (the comparison table on the same Wikipedia page lists which ones are stable), but in fact any sorting algorithm can be made stable by tagging each data item with its original position in the input and breaking ties in the key comparison function according to that position.
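For the example file above, a minimal Python sketch of that decorate-and-sort idea (Python's built-in sort is already stable, but carrying the original position makes the tie-breaking explicit):

with open("in.txt") as f:
    lines = f.readlines()

# Tag each line with its numeric 2nd column and its original position;
# ties on the number are broken by the original position.
decorated = [(int(line.split()[1]), pos, line) for pos, line in enumerate(lines)]
decorated.sort()
print("".join(line for _, _, line in decorated))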
Other than that, it's not exactly clear what you're asking. In your question you demonstrate the use of sort with and without -n and -k options, but it's not clear why this should influence the actual choice of sort algorithm...
I would just create a hash table of the strings with the number as key and the string as value (I'm assuming they are unique), and then for the plain sort command I'd sort based on the values, and for -n -k 2 I'd sort based on the keys.
The POSIX standard does not dictate which algorithm to use, so different Unix flavours may use different algorithms. GNU sort uses merge sort: http://en.wikipedia.org/wiki/Merge_sort
From a series of MIDI notes stored in array (with MIDI note number), does an algorithm exist to get the most likely key or scale implied by these notes?
If you're using Python you can use the music21 toolkit to do this:
import music21
score = music21.converter.parse('filename.mid')
key = score.analyze('key')
print(key.tonic.name, key.mode)
if you care about specific algorithms for key finding, you can use them instead of the generic "key":
key1 = score.analyze('Krumhansl')
key2 = score.analyze('AardenEssen')
etc. Any of these methods will work for chords also.
(Disclaimer: music21 is my project, so of course I have a vested interest in promoting it; but you can look at the music21.analysis.discrete module to take ideas from there for other projects/languages. If you have a MIDI parser, the Krumhansl algorithm is not hard to implement).
The algorithm by Carol Krumhansl is the best-known. The basic idea is very straightforward. A reference sample of pitches is drawn from music in a known key and transposed to the other 11 keys. Major and minor keys must be handled separately. Then a sample of pitches is drawn from the music in an unknown key. This yields a 12-component pitch vector for each of the 24 reference samples and one for the unknown sample, something like:
[ I, I#, II, II#, III, IV, IV#, V, V#, VI, VI#, VII ]
[ 0.30, 0.02, 0.10, 0.05, 0.25, 0.20, 0.03, 0.30, 0.05, 0.13, 0.10, 0.15 ]
Compute the correlation coefficient between the unknown pitch vector and each reference pitch vector and choose the best match.
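If you want to roll this yourself from a MIDI note array, here is a minimal sketch of that correlation step; the reference vectors are the commonly cited Krumhansl-Kessler profiles (values taken from the literature, so double-check them), and the unknown sample is reduced to a pitch-class histogram:

import numpy as np

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(midi_notes):
    # Pitch-class histogram of the unknown sample.
    hist = np.bincount(np.array(midi_notes) % 12, minlength=12)
    best = None
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            # np.roll transposes the reference profile to the candidate tonic.
            r = np.corrcoef(hist, np.roll(profile, tonic))[0, 1]
            if best is None or r > best[0]:
                best = (r, NAMES[tonic], mode)
    return best                                  # (correlation, tonic, mode)

print(estimate_key([60, 62, 64, 65, 67, 69, 71, 72]))   # C major scale -> (r, 'C', 'major')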
Craig Sapp has written (copyrighted) code, available at http://sig.sapp.org/doc/examples/humextra/keycor/
David Temperley and Daniel Sleator developed a different, more difficult algorithm as part of their (copyrighted) Melisma package, available at
http://www.link.cs.cmu.edu/music-analysis/ftp-contents.html
A (free) Matlab version of the Krumhansl algorithm is available from T. Eerola and P. Toiviainen in their Midi Toolbox:
https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/miditoolbox
There are a number of key-finding algorithms around, in particular those of Carol Krumhansl (most papers that I've seen cite Krumhansl's methods).
Assuming no key changes, a simple algorithm could be based on a pitch-class histogram (an array with 12 entries, one for each pitch class, i.e. each note in an octave). When you get a note, you add one to the corresponding entry. At the end, the two most frequent notes will very likely be 7 semitones (entries) apart, representing the tonic and the dominant; the tonic is the note you're looking for, and the dominant is 7 semitones above it (or 5 semitones below).
The good thing about this approach is that it's scale-independent, it relies on the tonic and the dominant being the two most important notes and occurring more often. The algorithm could probably be made more robust by giving extra weight to the first and last notes of large subdivisions of a piece.
As for detecting the scale: once you have the key, you can generate the list of notes that are above a certain threshold in your histogram, expressed as offsets from that root note. Say you detect a key of A (from A and E occurring most often) and the notes you have are A C D E G; you would obtain the offsets 0 3 5 7 10, which, searched in a database like this one, would give you "Minor Pentatonic" as the scale name.
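A minimal sketch of that histogram heuristic, assuming the input is a plain array of MIDI note numbers:

import numpy as np

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def guess_tonic(midi_notes):
    # Count pitch classes, then look for the two most frequent notes a fifth apart.
    hist = np.bincount(np.array(midi_notes) % 12, minlength=12)
    a, b = np.argsort(hist)[-2:]                 # two most frequent pitch classes
    if (a - b) % 12 == 7:                        # a is 7 semitones above b -> b is the tonic
        return NAMES[b]
    if (b - a) % 12 == 7:                        # b is 7 semitones above a -> a is the tonic
        return NAMES[a]
    return NAMES[np.argmax(hist)]                # fall back to the single most frequent note

print(guess_tonic([69, 72, 74, 76, 79, 69, 76]))   # A-minor-pentatonic-ish sample -> "A"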
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elegant solution to this for a while and I was wondering if anyone here had some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer, I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.
I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable:
Feet|Inches|12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: the vertices are Feet, Inches, and Centimeters, and the edges are the 12 and 2.54 conversion factors. The solution in this case is the two edges 12 and 2.54 (combined via multiplication, of course). You can get fancier with the graph parameters if you want to.
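Here is a minimal sketch of that graph search in Python (a breadth-first search over a small dictionary of pairwise factors; the table contents are just the example above):

from collections import deque

factors = {("feet", "inches"): 12.0, ("inches", "centimeters"): 2.54}

def build_graph(pairs):
    graph = {}
    for (a, b), f in pairs.items():
        graph.setdefault(a, []).append((b, f))
        graph.setdefault(b, []).append((a, 1.0 / f))   # reverse edge with inverted factor
    return graph

def convert(value, src, dst, pairs):
    graph, queue, seen = build_graph(pairs), deque([(src, 1.0)]), {src}
    while queue:                                       # breadth-first search over units
        unit, factor = queue.popleft()
        if unit == dst:
            return value * factor
        for nxt, f in graph.get(unit, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, factor * f))
    raise ValueError(f"no conversion path from {src} to {dst}")

print(convert(2, "feet", "centimeters", factors))      # 2 ft -> 60.96 cm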
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the key is understanding that we are talking about units of length, so why don't we insert a new column into the table for unit type, which can be "length" or "area". This will help performance even in the earlier cases, as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).
Whatever structure you choose (and your choice may well be directed by your preferred implementation: OO? functional? DBMS table?), I think you need to identify the structure of the units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what base set of units you are going to use, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
+---------------+-------------------------+--------------------+
| unit name     | dimension               | conversion to base |
+---------------+-------------------------+--------------------+
| foot          | Length                  | 0.3048             |
| gallon (UK)   | Length^3                | 4.546092 x 10^(-3) |
| kilowatt-hour | Mass.Length^2.Time^(-2) | 3.6 x 10^6         |
+---------------+-------------------------+--------------------+
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mebi-, etc.) into multiplying factors, and a table of base units for each of the dimensions (i.e. meter is the base unit for Length, second for Time, etc). You'll also have to cope with units such as feet which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
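As a rough illustration (not a full implementation), here is a minimal sketch in Python of units stored as a factor-to-base plus a tuple of dimension exponents, with the commensurability check described above; the unit names and the small table are illustrative:

UNITS = {
    # name: (factor to the base unit, (Length, Mass, Time) exponents)
    "meter":  (1.0,             (1, 0, 0)),
    "foot":   (0.3048,          (1, 0, 0)),
    "second": (1.0,             (0, 0, 1)),
    "hour":   (3600.0,          (0, 0, 1)),
    "m/s":    (1.0,             (1, 0, -1)),
    "km/h":   (1000.0 / 3600.0, (1, 0, -1)),
}

def convert(value, src, dst):
    f_src, dim_src = UNITS[src]
    f_dst, dim_dst = UNITS[dst]
    if dim_src != dim_dst:            # dimension check: only commensurate units convert
        raise ValueError(f"cannot convert {src} {dim_src} to {dst} {dim_dst}")
    return value * f_src / f_dst

print(convert(1.0, "m/s", "km/h"))    # -> 3.6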
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions, so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication. I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, I guess most sensible applications with units will implement only a selection.
I would start by choosing a standard unit for every quantity (e.g. meters for length, newtons for force, etc.) and then storing all the conversion factors to that unit in a table.
Then, to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to find the answer.
For ambiguities, each unit could be associated with all the types of quantities it measures, and to determine which conversion to do, you would take the intersection of those two sets of types (and if you're left with zero or more than one, you would report an error).
I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use a hash function to map the unit structure to a number in O(1), and then put all multipliers in a matrix (you only have to remember the upper-right half, because the mirrored entry is the same factor inverted).
For complex cases:
Example 1. m/s to km/h. You check (m,km) in the matrix, then the (s,h), then multiply the results.
Example 2. m^3 to km^3. You check (m,km) and take it to the third power.
Of course, report an error when the types don't match, e.g. an area and a volume.
You can make a class for Units that takes the conversion factor and the exponents of all basic units (I'd suggest using metric units for this; it makes your life easier). E.g. in pseudo-Java:
public class Unit {
    public Unit(double factor, int meterExp, int secondExp, int kilogrammExp ... [other base units]) {
        ...
    }
}

// you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0, ...);
I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches