I'm really struggling to find a way to organize multiple values inside a single key.
ex: {'A':[(2000,1),(1999,1)]}
I'm trying to cause the values to sort based on the 0 position of each value (2000 and 1999 in this case). Any help to which pieces of code could do this would be much appreciated.
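A minimal sketch of one way to do this in Python, assuming ascending order is wanted and that every value is a tuple whose position 0 is the sort key (the variable names are only illustrative):

```python
from operator import itemgetter

data = {'A': [(2000, 1), (1999, 1)]}

# Build a new dict whose lists are sorted by element 0 of each tuple.
sorted_data = {key: sorted(values, key=itemgetter(0))
               for key, values in data.items()}

# Alternatively, sort each list in place instead of building a new dict.
for values in data.values():
    values.sort(key=itemgetter(0))
```

Passing reverse=True to sorted() or list.sort() would give descending order instead.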
I have a pretty straightforward survey dataset. Each row is a respondent, and each column is a question. Responses have a value that is a whole number, and each number has a label.
Now, I need to replace all of those values with fake data to use in a training. I need something that looks and feels like the original dataset, but isn't actually client data.
I started by replacing my variables with random number values:
COMPUTE Q1=RV.UNIFORM(1,2).
EXECUTE.
COMPUTE Q2=RV.UNIFORM(1,36).
EXECUTE.
COMPUTE Q3=RV.NORMAL(50, 13).
EXECUTE.
(rv.normal/rv.uniform depending on what kind of data I'm trying to fake - age versus multiple-choice question, for example).
This works, but then when I try to generate crosstabs, export the dataset with value labels, etc., the labels aren't applied to the columns with fake data. As far as I can tell, my fake numbers are in the exact same format they were in before: numeric, no decimals, width of 2, nominal. The labels still appear in the variable view, but they aren't actually being applied.
I'd really prefer not to have to manually re-label every one of these columns, because there's quite a few of them. Any ideas for how to get around this issue? Or is there a smarter way to generate fake data?
Your problem is that the RV.UNIFORM and RV.NORMAL functions do not generate integers; they generate decimal numbers. Setting 0 decimals in the variable view only hides the decimals in the display, but they are still there (you can check this by adding decimals in the variable view).
So you need another step to turn your decimals into integers. For example, the following are two ways to get a random 1 or 2 (integers):
COMPUTE Q1=rnd(RV.UNIFORM(1,2)).
or
COMPUTE Q1=trunc(RV.UNIFORM(1,3)).
Once the numbers generated are integers corresponding to the value labels definition, you should be able to see the labels in the output.
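For questions with more than two categories, the two approaches are not equivalent, and a small simulation makes the difference visible. The sketch below is in Python rather than SPSS syntax, and the helper names are made up for illustration; it only mimics the arithmetic of trunc(RV.UNIFORM(1,k+1)) and rnd(RV.UNIFORM(1,k)).

```python
import math
import random
from collections import Counter

def fake_by_truncating(rng, k):
    # Mimics trunc(RV.UNIFORM(1, k+1)): each integer 1..k covers an interval
    # of width 1. min() guards against the upper endpoint being returned.
    return min(k, math.trunc(rng.uniform(1, k + 1)))

def fake_by_rounding(rng, k):
    # Mimics rnd(RV.UNIFORM(1, k)): rounding to the nearest integer, so the
    # end categories 1 and k each cover only half an interval.
    return math.floor(rng.uniform(1, k) + 0.5)

rng = random.Random(0)  # fixed seed so the run is repeatable
k = 5                   # hypothetical five-option question
truncated = Counter(fake_by_truncating(rng, k) for _ in range(100_000))
rounded = Counter(fake_by_rounding(rng, k) for _ in range(100_000))

for value in range(1, k + 1):
    print(value, truncated[value], rounded[value])
```

With truncation the five categories come out roughly equally often, while with rounding categories 1 and 5 appear about half as often as the middle ones. For a two-option question such as Q1 both approaches give an even split.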
I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is imported correctly. It has 2 columns and about 1870 rows, all numeric.
The system recognises only 1 column as numeric, but not the other.
The column with the problem has 5 digits in each row, separated by spaces.
Is there anything I should do so that the system properly recognises the data as numeric?
Thanks in advance for your help
The issue is with the definition of the Numeric data type: a numeric value needs to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore losing the "information" in those numbers, you have several options.
For example:
1. Create an array of numbers by putting [ ] around each entry of the second column. Keep in mind the relative weighted approach that AutoML Tables applies to the Array data type, as it may affect the "information" extracted from the sequence.
2. Create an additional column for every number in the second column, so that each one is a single number and hence truly numeric.
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, take into consideration that after training, the model should deduce by itself the importance of the position and any other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
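As a rough illustration of the second option, the CSV could be preprocessed before import so that each of the space-separated numbers gets its own column. This is only a sketch using the Python standard library; the column names (target, seq, seq_1, ...) and the sample rows are invented, since the real file layout is not shown in the question.

```python
import csv
import io

def split_sequence_column(rows, column, prefix):
    """Replace one space-separated column with one integer column per position."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for position, part in enumerate(row[column].split(), start=1):
            new_row[f"{prefix}{position}"] = int(part)
        result.append(new_row)
    return result

# Hypothetical sample standing in for the real two-column file.
sample = "target,seq\n10,2 4 7 1 9\n12,1 5 7 3 3\n"
rows = list(csv.DictReader(io.StringIO(sample)))
wide_rows = split_sequence_column(rows, "seq", "seq_")

# Write the widened table back out as CSV text.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(wide_rows[0]))
writer.writeheader()
writer.writerows(wide_rows)
print(output.getvalue())
```

Each new column then holds a single number per row, which is what the Numeric type expects.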
I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sort of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criterion, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
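A minimal sketch of that idea with Python's built-in sqlite3 module. The table layout, column names and parameter values are all made up for illustration, and it assumes each row stores the lower boundary of the range it covers (with 0 as the lowest possible value):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute(
    "CREATE TABLE params (c1 TEXT, c2 TEXT, lower_bound REAL, param TEXT)"
)
# A composite index lets SQLite seek directly instead of scanning every row.
conn.execute("CREATE INDEX idx_params ON params (c1, c2, lower_bound)")
conn.executemany(
    "INSERT INTO params VALUES (?, ?, ?, ?)",
    [
        ("A", "A", 0, "AA_low"), ("A", "A", 50, "AA_high"),
        ("A", "B", 0, "AB_low"), ("A", "B", 100, "AB_high"),
    ],
)

def lookup(c1, c2, value):
    """Return the parameter whose boundary is the closest one at or below value."""
    row = conn.execute(
        "SELECT param FROM params "
        "WHERE c1 = ? AND c2 = ? AND lower_bound <= ? "
        "ORDER BY lower_bound DESC LIMIT 1",
        (c1, c2, value),
    ).fetchone()
    return row[0] if row else None

print(lookup("A", "A", 101))  # boundary 50 applies here, not 100
print(lookup("A", "B", 101))  # boundary 100 applies here
```

Whether this beats a hand-written loop on tables of 10 to 200 rows would need measuring, since every call pays some overhead for the SQL layer.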
As I understand it, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to keep the entries sorted by each xi (i = 1, 2, ...) in separate sets, and to keep a separate 'world' for each combination of A, B, ...
The search works as follows:
1. Select the proper world by the Boolean criteria combination.
2. For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log(N))).
3. Select the i where the population is the lowest.
4. While iterating over the instances in that lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i).
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
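The search steps above can be sketched in Python with the standard bisect module. This is one possible reading of the description, not a tuned implementation; the function names, the (key, xs, payload) entry layout and the sample data are assumptions made for the example.

```python
from bisect import bisect_left

def build_index(entries, n_numeric):
    """entries: iterable of (key, xs, payload), where key is the combination of
    Boolean criteria and xs is a tuple of n_numeric numeric values."""
    worlds = {}
    for key, xs, payload in entries:
        world = worlds.setdefault(key, [[] for _ in range(n_numeric)])
        for i in range(n_numeric):
            world[i].append((xs[i], xs, payload))
    index = {}
    for key, world in worlds.items():
        index[key] = []
        for items in world:
            items.sort(key=lambda item: item[0])  # one sorted set per xi
            index[key].append(([item[0] for item in items], items))
    return index

def query(index, key, bounds):
    """bounds: one (lower, upper) pair per xi; lower inclusive, upper exclusive."""
    world = index.get(key)                      # select the world
    if world is None:
        return []
    spans = []
    for (sort_keys, _), (lower, upper) in zip(world, bounds):
        # two binary searches give the population of each range
        spans.append((bisect_left(sort_keys, lower), bisect_left(sort_keys, upper)))
    # pick the criterion with the smallest population
    best = min(range(len(spans)), key=lambda i: spans[i][1] - spans[i][0])
    start, stop = spans[best]
    _, items = world[best]
    # scan only that range, checking the other bounds
    return [payload for _, xs, payload in items[start:stop]
            if all(lower <= x < upper for x, (lower, upper) in zip(xs, bounds))]

entries = [
    (("A", "B"), (5, 20), "p1"),
    (("A", "B"), (15, 25), "p2"),
    (("A", "A"), (5, 20), "p3"),
    (("A", "B"), (7, 90), "p4"),
]
index = build_index(entries, n_numeric=2)
print(query(index, ("A", "B"), [(0, 10), (10, 30)]))
```

The index is built once up front; each query then costs a few binary searches plus a scan of the smallest matching range.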
Up to what extent should we fill the missing values for a feature in a dataset so that it doesn't become redundant?
I have a dataset which has a max of 42000 observations. There are three features which have around 20000, 35000 and 7000 values missing. Should I still use them by filling these missing values or dump these three features?
How do we decide the threshold for keeping or dumping a feature given the number of missing values of that feature?
Generally, you can interpolate missing values from the nearest samples in the dataset. I like this pandas manual on missing values, http://pandas.pydata.org/pandas-docs/stable/missing_data.html, which lists many possible techniques for interpolating missing values from the known part of the dataset.
But in your case, I think it's better to just remove the first two features, because I doubt there could be any good interpolation when so many values are missing: roughly half of them or more.
But you may try to fix the third feature by filling in its missing values.
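To put numbers on that, here is the arithmetic for the counts given in the question, as a small Python sketch. The feature names are placeholders, and the 0.4 cutoff is an arbitrary value chosen only so that the outcome matches the suggestion above; it is not an established rule.

```python
n_rows = 42_000
missing_counts = {"feature_1": 20_000, "feature_2": 35_000, "feature_3": 7_000}

# Arbitrary illustrative cutoff: drop a feature when more than 40% is missing.
cutoff = 0.4

shares = {name: count / n_rows for name, count in missing_counts.items()}
decisions = {name: ("drop" if share > cutoff else "impute")
             for name, share in shares.items()}

for name in missing_counts:
    print(f"{name}: {shares[name]:.1%} missing -> {decisions[name]}")
```

That works out to roughly 48%, 83% and 17% missing, which is why only the third feature looks like a reasonable candidate for filling.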
I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (eg 3 values per object, colour/taste/age, though it could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ more than 1), and achieve the objective of as fair a distribution of values per set as possible (eg try to have close to as many red in each set, and same for other colours, as well as tastes and ages, etc).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, ie each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem http://en.wikipedia.org/wiki/Cluster_analysis .
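To make the objective in the question concrete, here is a small Python sketch. Note that it is a plain greedy heuristic, not a clustering method and not guaranteed to be optimal: each object goes to one of the currently smallest sets, choosing the one that so far holds the fewest objects sharing its values. The function name and sample objects are invented for the example.

```python
from collections import Counter

def distribute(objects, n_sets):
    """Greedily spread objects (dicts of attribute -> value) over n_sets sets."""
    sets = [[] for _ in range(n_sets)]
    counts = [Counter() for _ in range(n_sets)]  # per-set (attribute, value) tallies
    for obj in objects:
        # Only the currently smallest sets are candidates, so set sizes
        # never differ by more than one.
        smallest = min(len(s) for s in sets)
        candidates = [i for i in range(n_sets) if len(sets[i]) == smallest]

        def overlap(i):
            # How many objects already in set i share each of this object's values.
            return sum(counts[i][pair] for pair in obj.items())

        best = min(candidates, key=overlap)
        sets[best].append(obj)
        counts[best].update(obj.items())
    return sets

objects = [
    {"colour": "red", "taste": "sweet"},
    {"colour": "red", "taste": "sour"},
    {"colour": "blue", "taste": "sweet"},
    {"colour": "blue", "taste": "sour"},
    {"colour": "red", "taste": "sweet"},
]
for group in distribute(objects, 2):
    print(group)
```

The result depends on the order in which objects are processed, so shuffling the input and keeping the best of several runs is a cheap way to improve on it.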