Is there a way to generate random data (e.g., a triangular matrix, or an array with some restrictions on its values) using Interactive CPLEX?
CPLEX doesn't include any feature that would generate random data for you.
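You'd need to generate the data outside CPLEX. As a minimal sketch, here is one way to do it with NumPy; the size, value range, and lower-triangular shape are assumptions for illustration, and you would write the result out in whatever format your CPLEX model reads:

import numpy as np

# Build a random lower-triangular matrix with entries in [0, 10).
rng = np.random.default_rng(42)  # fixed seed, for reproducibility
n = 5
matrix = np.tril(rng.uniform(0.0, 10.0, size=(n, n)))
print(matrix)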
I want to construct a LightGBM Dataset object from very large X and y that cannot be loaded into memory. Is there any method to construct the Dataset in batches? E.g., something like:
import lightgbm as lgb
ds = lgb.Dataset()
for X, y in data_generator():
    ds.add_new_data(data=X, label=y)
Regarding the data, there are a few hacks. For example, if your data has numeric features, make sure the precision isn't longer than necessary; two decimal digits may well be enough (it depends on your data). And if you have categorical features, make sure you store them as integer codes. But you are probably looking for a better approach.
There is a concept called incremental learning. Basically, you build a model (a tree ensemble) in your first iteration using the first batch of data. Then, for your next model, you use that model as a starting point and only update its values (you can also allow for shrinkage). You can use keep_training_booster for such a scenario (a sketch follows below); please read up on the mechanism yourself.
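A minimal sketch of that idea, assuming data_generator yields (X, y) batches that each fit in memory (the objective and parameter values are placeholders):

import lightgbm as lgb

params = {"objective": "regression", "learning_rate": 0.05}
booster = None
for X, y in data_generator():
    batch = lgb.Dataset(X, label=y)
    booster = lgb.train(
        params,
        batch,
        num_boost_round=10,
        init_model=booster,          # continue from the previous batch's model
        keep_training_booster=True,  # keep the booster usable as the next init_model
    )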
The third technique is to make multiple models: say you divide your data into N pieces and make N models, then use an ensemble approach (sketched below). This way you have used your entire dataset across the N models.
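A sketch of that ensemble variant, under the same data_generator assumption, averaging the chunk models' predictions at inference time:

import numpy as np
import lightgbm as lgb

# Train one model per data chunk.
params = {"objective": "regression"}
models = [
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
    for X, y in data_generator()
]

def predict(X_new):
    # Average the predictions of the N chunk models.
    return np.mean([m.predict(X_new) for m in models], axis=0)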
I have a problem where I look at a row of elements, and if there are no non-zero elements in the row, I want to set one to a random value. My difficulty is the update strategy. I just attempted to get a version working that used slice from Clatrix, but that does not work with Incanter matrices. So should I construct another matrix of "modifications" and perform elementwise addition? Or build some other form of matrix and perform a multiplication (not sure how to build this one yet)? Or maybe I could somehow map the rows? My trouble there is that assoc does not work on Incanter matrices, which is apparently what sel returns.
It seems that there are no easy ways to do this in Incanter, so I ended up directly using set from Clatrix and avoiding Incanter usage.
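For what it's worth, the underlying logic is simple once the matrix is mutable. Here is the same idea sketched in NumPy rather than Incanter/Clatrix, purely for illustration:

import numpy as np

rng = np.random.default_rng()

def fix_zero_rows(matrix):
    # For every row with no non-zero elements, set one randomly
    # chosen element to a random value.
    for i, row in enumerate(matrix):
        if not row.any():
            matrix[i, rng.integers(matrix.shape[1])] = rng.random()
    return matrix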
I have a list of many users (over 10 million), each of which is represented by a userid followed by 10 floating-point numbers indicating their preferences. I would like to efficiently calculate the user similarity matrix using cosine similarity with MapReduce. However, since the values are floating-point numbers, it is hard to determine a key in the MapReduce framework. Any suggestions?
I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.
The first is Mahout's ItemSimilarityJob, which is part of its recommender system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which for your case would be SIMILARITY_COSINE) along with any additional optimizations. Since you are looking to calculate user-user similarity based on a preference vector of ten floating-point values, what you could do is assign a simple 1-to-10 numeric hash for the indices of the vector and generate a simple .csv file of vectorIndex, userID, decimalValue as input for the Mahout item-similarity job (the userID being a numeric Int or Long value); a sketch of generating that file follows below. The resulting output should be a tab-separated text file of userID,userID,similarity.
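A minimal sketch of writing that input file, assuming users is an iterable of (user_id, vector) pairs where each vector holds the ten preference floats:

import csv

def write_similarity_input(users, path="similarity_input.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for user_id, vector in users:
            for index, value in enumerate(vector, start=1):
                # One row per (vectorIndex, userID, decimalValue) triple.
                writer.writerow([index, user_id, value])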
A second solution might be Mahout's RowSimilarityJob included in its math library. I've never used it myself, but some info can be found here and in this previous stackoverflow thread. Rather than a .csv as input, you would need to translate your input data as a DistributedRowMatrix, the userIDs being the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
I suppose which solution is better depends on what input/output format you prefer. All the best.
Description of problem:
I'm in the process of working with a highly sensitive dataset that contains people's phone numbers as one of the columns. I need to apply an encryption or hash function to them to convert them into encoded values and then do my analysis. It can be a one-way hash, i.e., after processing the encoded data we won't be converting it back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Please suggest the best way to go about this process. Recommendations on the best algorithms to use are welcome.
Update: size of the dataset
My dataset is really huge: hundreds of GB in size.
Update: Sensitive
By sensitive, I meant that the phone number should not be part of our analysis. So, basically, I need a one-way hashing function, but without redundancy: each phone number should map to a unique value, and two phone numbers should not map to the same value.
Update: Implementation ?
Thanks for your answers. I am looking for an elaborate implementation. I was going through Python's hashlib library for hashing; does it necessarily do the same set of steps that you suggested? Here is the link.
Can you give me some example code to achieve this, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Use HMAC-SHA1 on your data with this key, then base64-encode the result, and you have a random unique string per phone number that isn't reversible (without the key). A standard-library sketch follows the Keyczar example below.
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
from keyczar import keyczar

def anonymize(phone_num):
    # Read the key set created above and sign (HMAC) the phone number.
    signer = keyczar.Signer.Read("path_to_key_set")
    return signer.Sign(phone_num)
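Since you asked for example code with the standard library: a minimal sketch of the same construction using Python's hmac and hashlib modules. The key handling here is simplified; persist the key securely if you need stable output across runs:

import base64
import hashlib
import hmac
import os

SECRET_KEY = os.urandom(32)  # 32 random bytes; keep this secret

def anonymize(phone_number):
    # HMAC-SHA1 of the phone number under the secret key, base64-encoded.
    digest = hmac.new(SECRET_KEY, phone_number.encode("utf-8"), hashlib.sha1).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")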
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not, on their own, provide the right security guarantees. Really, though, are there so many distinct phone numbers that you can't just incrementally construct a map representing a genuinely random function?
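A sketch of that truly-random-map idea; the mapping itself is the secret, so discard it (or lock it away) once the analysis is done:

import secrets

_mapping = {}

def anonymize(phone_number):
    # Assign each distinct phone number a fresh random token on first sight.
    if phone_number not in _mapping:
        _mapping[phone_number] = secrets.token_hex(16)  # 128-bit random token
    return _mapping[phone_number]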
Sort your data by the phone-number column and count distinct values, then replace each actual value with its counter value. Collision-free and one-way.
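A one-line sketch of that counter approach (note that it preserves the sort order of the original values, which may or may not matter for your analysis):

def build_counter_map(phone_numbers):
    # Map each distinct phone number to its rank among the sorted distinct values.
    return {value: i for i, value in enumerate(sorted(set(phone_numbers)))}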
"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."
This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best-known examples and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.
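For completeness, a sketch of that unkeyed-hash approach; unlike the keyed HMAC suggested above, anyone who knows the scheme can recompute these digests:

import hashlib

def anonymize(phone_number):
    # SHA-1 digest of the phone number, hex-encoded.
    return hashlib.sha1(phone_number.encode("utf-8")).hexdigest()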
Is there a way to set the seed for the random number generator in Apex? And if so, which function do I use for it?
It likely isn't possible to seed the RNG in Apex. If you need a repeatable sequence of random numbers, you'll have to implement a seeded pseudo-random number generator yourself (see the sketch after the next paragraph).
On the Apex platform, I'm sure they have a huge source of entropy available to generate random numbers, and there's no need for you to seed the generator.
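A minimal sketch of such a seeded generator: the 48-bit linear congruential recurrence that java.util.Random uses, shown here in Python for brevity; the same arithmetic ports directly to Apex Long math:

class SeededRandom:
    MULTIPLIER = 0x5DEECE66D
    INCREMENT = 0xB
    MASK = (1 << 48) - 1

    def __init__(self, seed):
        self.state = (seed ^ self.MULTIPLIER) & self.MASK

    def next_bits(self, bits):
        # Advance the LCG state and return its top `bits` bits.
        self.state = (self.state * self.MULTIPLIER + self.INCREMENT) & self.MASK
        return self.state >> (48 - bits)

    def next_double(self):
        # 53 random bits scaled into [0.0, 1.0).
        return ((self.next_bits(26) << 27) + self.next_bits(27)) / float(1 << 53)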
There is no way to seed the built-in random number generator in Salesforce. I was in the same boat as you. I wanted to be able to use a seed, so that I could create repeatable random numbers.
So, I thought I'd attempt to write my own RNG. I spent a number of days scouring the Internet for algorithms. I was able to piece together a pretty comprehensive library of functions borrowing from various sources. The classes are: "Random.cls", which is the main RNG class, and "Random_Test.cls", which is the test code.
It has the following methods:
nextInteger(upperLimit)
nextLong(upperLimit)
nextDouble(upperLimit)
nextUniform() - Same function as Math.Random() to return a Double between 0.0 and 1.0.
nextIntegerInRange(lowerLimit, upperLimit)
nextLongInRange(lowerLimit, upperLimit)
nextDoubleInRange(lowerLimit, upperLimit)
shuffle(List<Object>) - destroys the order of the original list
shuffleWithCopy(List<Object>) - return a shuffled copy of the list, in case you wish to preserve the list's original order (less efficient than "shuffle(List<Object>)")
The "Random.cls" documents the sources that I borrowed from in case you want to read more about random number generators.
I put the code out on GitHub for anyone who wants it: https://github.com/DeviousBard/Salesforce/tree/master