For-loop / simple data extraction and comparison in R

In this example dataset I have created a column called 'Var'. This is the result I would like from the code. The pseudo-code to produce Var is: for each ID_Survey, compare the Distances in sequence; if the difference between sequential Distances is 10, then Var = 1, otherwise Var = 0. Var should be 1 for both elements of the pair whose difference is 10.
# Generate data
ID_Survey = rep(seq(1, 3, 1), each = 4)
Distance  = c(0, 25, 30, 40, 50, 160, 170, 190, 200, 210, 1000, 1010)
Var       = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1)
TestData  = data.frame(ID_Survey, Distance, Var)
TestData
I can use a simple for-loop like this, which nearly works, but it trips up when moving between ID_Survey groups.
for (i in 1:(nrow(TestData) - 1)) {
  TestData$Var2[i] = (TestData$Distance[i + 1] == TestData$Distance[i] + 10)
}
I need to incorporate the above into a function which splits the data.frame into groups based on ID_Survey. I'm trying to build something like the following...
New6 = do.call(rbind, by(TestData, list(TestData$ID_Survey),
  FUN = function(x)
    for (i in nrow(x)) {  # loop must build an argument then return it.
      # conditional statements in here.
      return(x[i, ])      # loop appears to return 1st argument only.
    }))
... but I can't get the for-loop to operate inside the by-statement.
Any guidance much appreciated. Many thanks.

Using data.table's .SD symbol manages splitting the data.frame into chunks (as defined by ID_Survey), sending each chunk to a function, and collating the results. No doubt someone else will have a more elegant solution (one vectorized alternative is sketched below), but this seems to do the job:
library(data.table)
ComPair = function(a, b) a == b - 10  # TRUE when b is exactly 10 beyond a
TestFunction = function(FData) {
  FData$Var2 = 0  # initialise, so rows in unmatched pairs are 0 rather than NA
  if (nrow(FData) > 1) {
    for (i in 1:(nrow(FData) - 1)) {
      V = ComPair(FData$Distance[i], FData$Distance[i + 1])
      if (V) { FData$Var2[i] = 1; FData$Var2[i + 1] = 1 }
    }
  }
  return(FData)
}
TestData_dt=data.table(TestData)
TestData2=TestData_dt[,TestFunction(.SD),ID_Survey]
TestData2
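For the record, a vectorized alternative in base R (no loop or data.table required) should also work, using ave() and diff() on the TestData built above; Var3 is just an illustrative column name, not from the original post:

# Within each ID_Survey, flag a row when the gap to the next or the
# previous Distance is exactly 10.
TestData$Var3 <- ave(TestData$Distance, TestData$ID_Survey,
                     FUN = function(d) {
                       gap <- diff(d) == 10                       # gap i sits between d[i] and d[i+1]
                       as.numeric(c(gap, FALSE) | c(FALSE, gap))  # flag both ends of each gap
                     })
TestData  # Var3 should match Var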

Related

Lua - Create a nested table using for loop

I'm very new to Lua, so am happy to read material if it will help with tables.
I've decoded a json object and would like to build a table properly using its data, rather than writing 64 lines of the below:
a = {}
a[decode.var1[1].aId] = {decode.var2[1].bId, decode.var3[1].cId}
a[decode.var1[2].aId] = {decode.var2[2].bId, decode.var3[2].cId}
a[decode.var1[3].aId] = {decode.var2[3].bId, decode.var3[3].cId}
...etc
Because the numbers are consecutive 1-64, I presume I should be able to build it using a for loop.
Unfortunately despite going through table building ideas I cannot seem to find a way to do it, or find anything on creating nested tables using a loop.
Any help or direction would be appreciated.
Lua for-loops are, at least in my opinion, pretty easy to understand:
for i = 1, 10 do
print(i)
end
This loop inclusively prints the positive integers 1 through 10.
Lua for-loops also take an optional third argument (which defaults to 1) that indicates the step of the loop:
for i = 1, 10, 2 do
print(i)
end
This loop prints the numbers 1 through 10 with a step of 2, skipping every other number; it will print 1 3 5 7 9.
In the case of your example, if I understand it correctly, it seems that you know the minimum and maximum bounds of your for loops, which are 1 and 64, respectively. You could write a loop to decode the values and put them in a table like so:
local a = {}
for i = 1, 64 do
a[decode.var1[i].aId] = {decode.var2[i].bId, decode.var3[i].cId}
end
What you can do is generate a new table with all the contents of the decoded JSON using a for loop.
For example,
function jsonParse(jsonObj)
  local tbl = {}
  for i = 1, 64 do
    -- fill the table we actually return, using the function's own parameter
    tbl[jsonObj.var1[i].aId] = {jsonObj.var2[i].bId, jsonObj.var3[i].cId}
  end
  return tbl
end
To deal with nested cases, you can recursively call that method as follows
function jsonParse(jsonObj)
  local tbl = {}
  for i = 1, 64 do
    if type(jsonObj.var2[i].bId) == "table" then
      -- recurse into the nested table
      tbl[jsonObj.var1[i].aId] = {jsonParse(jsonObj.var2[i].bId), jsonObj.var3[i].cId}
    else
      tbl[jsonObj.var1[i].aId] = {jsonObj.var2[i].bId, jsonObj.var3[i].cId}
    end
  end
  return tbl
end
By the way, I can't understand why you are trying to create a table from a table that has already done the job you want. I assume the names are just placeholders, and you may have to adapt the code to the actual structure of your decoded variable.

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170-odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input and output has its own distinct set of values, e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n), and then the entropy is -sum(p[i] * log(p[i]) for i in outputs). (Note, since p[i] < 1, log(p[i]) is a negative number, and therefore the entropy is positive. Also note, if p[i] = 0 then we take p[i] * log(p[i]) to be zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
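For concreteness, here is a minimal sketch of that calculation in Python (the pseudo-code above is already Python-flavoured); samples, entropy, and gain_ratio are my names, not from any original code, and the per-value entropies are averaged with frequency weights:

import math
from collections import Counter, defaultdict

def entropy(outputs):
    # Shannon entropy of a list of observed output values.
    n = len(outputs)
    return -sum((c / n) * math.log(c / n) for c in Counter(outputs).values())

def gain_ratio(samples, v):
    # Information gained about the output by knowing variable v, divided by
    # the information needed to specify v. samples is a list of
    # (inputs, output) pairs, where inputs maps variable name -> value.
    total = entropy([out for _, out in samples])
    by_value = defaultdict(list)
    for inputs, out in samples:
        by_value[inputs[v]].append(out)
    remainder = sum(len(outs) / len(samples) * entropy(outs)
                    for outs in by_value.values())
    cost = math.log(len(by_value))  # information needed to specify v
    return (total - remainder) / cost if cost > 0 else 0.0

# Greedy choice: switch first on the variable with the best ratio, then
# recurse on each of its values to build the cascading nested ifs.
# best = max(variables, key=lambda v: gain_ratio(samples, v))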
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
Tricky approach for the validation. If you can find fuzzing software written for your language, run the fuzzing software with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for it. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that do not matter for that output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
    InputPossibleValues = new object[][]
    {
        new object[] { 1, 2, 3 },       // input #0
        new object[] { 'a', 'b', 'c' }, // input #1
        new object[] { "foo", "bar" },  // input #2
    },
    Converters = new System.Func<object[], object>[]
    {
        input => input[0],                                  // output #0
        input => (int)input[0] + (int)(char)input[1],       // output #1
        input => (string)input[2] == "foo" ? 1 : 42,        // output #2
        input => input[2].ToString() + input[1].ToString(), // output #3
        input => (int)input[0] % 2,                         // output #4
    }
};
(Sample output image omitted.)
The heart of the backward conversion is below; the full code, as a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
    // Generate all possible input vectors and record the result for each case;
    // then for each output we can figure out which inputs matter.
    object[][] inputs = schema.GenerateInputVectors();

    // Reversal path.
    for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
    {
        List<int> inputsThatDoNotMatter = new List<int>();
        for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
        {
            // Find all groups of input vectors where all other inputs (excluding
            // the current one) are the same. If the outputs are identical across
            // each such group, the current input does not matter for this output.
            bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }),
                                               input => schema.Convert(input)[outputIdx],
                                               ObjectsByValuesComparer.Instance)
                .Where(x => x.Distinct().Count() > 1)
                .Any();
            if (!inputMatters)
            {
                inputsThatDoNotMatter.Add(inputIdx);
                Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
            }
        }

        // Mapping table (only the inputs that matter).
        var mapping = new List<dynamic>();
        foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter),
                                                  ObjectsByValuesComparer.Instance))
        {
            dynamic record = new ExpandoObject();
            object[] sampleInput = inputGroup.First();
            object output = schema.Convert(sampleInput)[outputIdx];
            for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
            {
                if (inputsThatDoNotMatter.Contains(inputIdx))
                    continue;
                AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
            }
            AddProperty(record, $"Output #{outputIdx}", output);
            mapping.Add(record);
        }

        // The mapping is in "input x, ..., input y, output z" form.
        mapping.Dump();
    }
}

Boost Confidence of Overlapping Observations In Apache Spark

I'm fairly new to scala/spark, so forgive me if my question is elementary but I've searched everywhere and can't find the answer.
Problem
I'm trying to boost the confidence scores of a bunch of network router observations (observations of probable router types at different network junctions).
I have a type NetblockObservation that combines device types seen on a network with an associated netblock and a confidence. The confidence is the confidence that we accurately identified the device we saw.
case class NetblockObservation(
  device_type: String,
  ip_start: Long,
  ip_end: Long,
  confidence_score: Double
)
If the confidence is above some threshold thresh, then I want that observation to be in the returned dataset. If it's below thresh, it should not be.
In addition, if I have two observations with the same device_type and one contains the other, the containee should have its confidence increased by the confidence of the container.
Example
Let's say I have 3 Netblock Observations
// 0.0.0.0/28
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 15, confidence_score: .4)
// 0.0.0.0/29
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 7, confidence_score: .4)
// 0.0.0.0/30
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: .4)
With a confidence threshold of 1, I would expect a single output of NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: 1.2)
Explanation: I am allowed to add the confidence scores of NetblockObservations together when one is contained in the other and they share the same device_type.
I was allowed to add the confidence score of the 0.0.0.0/29 to the confidence of the 0.0.0.0/30 because it's contained within it.
I was not allowed to add the confidence score of 0.0.0.0/30 to the 0.0.0.0/29 because 0.0.0.0/29 is not contained within 0.0.0.0/30.
My (pitiful) Attempt
Failure reason: Too slow / never completed
I attempted to implement this while simultaneously learning scala/spark so I'm not sure if it's the idea or the implementation which is wrong. I think it would eventually work but after an hour, it hadn't completed on a dataset of size 300,000 (small compared to production scale) so I gave up on it.
The idea is to find the largest netblock and separate the data into netblocks which are contained by it and netblocks which are not. The netblocks which are not contained are recursively passed back into the same function. If the largest netblock has a confidence_score of 1, the entire contained dataset is disregarded and the largest is added to the return dataset. If the confidence_score is less than 1, then its confidence_score is added to everything in the contained dataset and that group is recursively passed back to the same function. Eventually, you should only be left with the data which has a confidence_score greater than 1. This algorithm also has the issue of not taking device_type into account.
def handleDataset(largestInNetData: Option[NetblockObservation],
                  netData: RDD[NetblockObservation]): RDD[NetblockObservation] = {
  if (netData.isEmpty) spark.sparkContext.emptyRDD else largestInNetData match {
    case Some(largest) =>
      val grouped = netData.groupBy(item =>
        if (item.ip_start >= largest.ip_start && item.ip_end <= largest.ip_end) largestInNetData
        else None)
      def lookup(k: Option[NetblockObservation]) = grouped.filter(_._1 == k).flatMap(_._2)
      val nos = handleDataset(None, lookup(None))
      // Threshold is assumed to be 1
      val next =
        if (largest.confidence_score >= 1) spark.sparkContext.parallelize(Seq(largest))
        else handleDataset(None, lookup(largestInNetData)
          .filter(x => x != largest)
          .map(x => x.copy(confidence_score = x.confidence_score + largest.confidence_score)))
      nos ++ next
    case None =>
      val largest = netData.reduce((a: NetblockObservation, b: NetblockObservation) =>
        if ((a.ip_end - a.ip_start) > (b.ip_end - b.ip_start)) a else b)
      handleDataset(Option(largest), netData)
  }
}
It is a fairly involved bit of code, so here is a general algorithm (with a rough sketch after the list) that I hope will help:
Forget about Spark for a moment and write a Scala function, probably in the companion object for NetblockObservation, that takes a collection of them and returns the subset of that collection that is contained. You should unit test the heck out of this function, and again this is pure Scala.
Moving now to Spark. Do a groupBy on your RDD[NetblockObservation] with device_type as the key producing essentially a map of String to Iterable[NetblockObservation].
Filter out all the entries in the map that have a value of size 1 and have a confidence below thresh.
For the entries that remain, apply your function from the first step to the collections of NetblockObservations with a mapValues.
Do a reduceByKey or similar to simply add up the confidence_scores of the contained values.
Enjoy a refreshing beverage.
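A rough sketch of how those steps might fit together, under assumptions rather than as a tested implementation: it folds the containment boost of step 1 and the summing of step 5 into one pure function, and observations, thresh, and boostContained are placeholder names of mine.

object NetblockObservation {
  // Step 1, pure Scala: add to each observation the confidence_scores of
  // every observation in the same collection that contains it.
  def boostContained(obs: Seq[NetblockObservation]): Seq[NetblockObservation] =
    obs.map { o =>
      val boost = obs
        .filter(c => c != o && c.ip_start <= o.ip_start && o.ip_end <= c.ip_end)
        .map(_.confidence_score)
        .sum
      o.copy(confidence_score = o.confidence_score + boost)
    }
}

val thresh = 1.0
val boosted = observations                // RDD[NetblockObservation]
  .groupBy(_.device_type)                 // step 2: one group per device_type
  .flatMap { case (_, group) => NetblockObservation.boostContained(group.toSeq) }
  .filter(_.confidence_score >= thresh)   // keep only scores that pass thresh

On the example above, the /30 collects 0.4 from each of the /29 and /28 for a score of 1.2 and survives the filter; the /29 (0.8) and /28 (0.4) do not.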

Multiple unique random values in a single request in JMeter

I am trying to make an HTTP request in JMeter that contains multiple random numbers within a fixed range (specifically 0-50). With each request, I need to send out about 45 different integers, so on any given request there are six integers within said range that are not included. Obviously ${__Random()} doesn't work, as it will inevitably generate some duplicate values. My idea, and please bear with me because I am very new to this, was to create an array with the integers, such as:
String line = "0, 1, 2, 3, 4, 5.....";
String[] numbers = line.split(",");
and then assign them fixed variable names to include in the request. I can do this with a counter and CSV data, but I'm unsure how to do it with an array.
vars.put("VAR_" + counter, line);
VAR_1 = 1
VAR_2 = 2
and so on...
then shuffle the array (which I do not know how to do in Beanshell) and generate something like:
VAR_1 = 16
VAR_2 = 27
...
to send with the next request.
If anyone could help me with this, or suggest a simpler way, I would great appreciate it. Thanks.
To shuffle the list, just use the Collections.shuffle() method.
Consider using JSR223 Test Elements and Groovy language instead of Beanshell as it is:
More Java compliant
Has better performance
Has built-in support of JSON, XML and some "syntax sugar" which minimises and simplifies code
Check out Groovy Is the New Black article for more details
I figured it out. It's kind of ugly and cumbersome, but fairly simple and does exactly what I needed it to do. In JSR223 PreProcessor, my code is
def list = [0,1,2,3,4,5,.....];
Collections.shuffle(list);
String VAR_1 = Integer.toString(list.getAt(0));
vars.put("VAR_1", VAR_1);
String VAR_2 = Integer.toString(list.getAt(1));
vars.put("VAR_2", VAR_2);
String VAR_3 = Integer.toString(list.getAt(2));
and so on.....
I had to input the 50 variables manually. I'm sure there was a simpler way, but I'm quite satisfied. Thanks for the suggestions.
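For what it's worth, the 50 manual vars.put() calls could be collapsed into a loop; a minimal JSR223/Groovy sketch, assuming the 0-50 range from the question and the same VAR_n naming:

def list = (0..50).toList()   // the fixed range from the question
Collections.shuffle(list)
// the first 45 shuffled values become VAR_1 .. VAR_45
list.take(45).eachWithIndex { value, i ->
    vars.put("VAR_${i + 1}", value.toString())
}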

understanding custom partitioner in hadoop

I am learning the partitioner concept now. Can anyone explain the below piece of code to me? It is hard for me to understand.
public class TaggedJoiningPartitioner extends Partitioner<TaggedKey, Text> {
    @Override
    public int getPartition(TaggedKey taggedKey, Text text, int numPartitions) {
        return taggedKey.getJoinKey().hashCode() % numPartitions;
    }
}
How does taggedKey.getJoinKey().hashCode() % numPartitions determine which reducer executes for a key?
Can anyone explain this to me?
It's not as complex as you think once you break things down a little bit.
taggedKey.getJoinKey().hashCode() will simply return an integer. Every object has a hashCode() function that returns a number that will hopefully be unique to that object. You could look into the source code of TaggedKey to see how it works if you'd like, but all you need to know is that it returns an integer based on the contents of the object.
The % operator performs modulus division, which is where you return the remainder after performing division. (8 % 3 = 2, 15 % 7 = 1, etc.).
So let's say you have 3 partitions (numPartitions = 3). Every time you do modulus division by 3, you'll get 0, 1, or 2, no matter what number is passed in. This determines which of the 3 reducers will get the data.
The whole idea of partitioners is that you can use them to group data to be sorted. If you wanted to sort by month, you could pass every piece of data with the string "January" to the first partition, "December" to the 12th, etc. From the outside your case looks a bit confusing, but really they just want to spread the data out (hopefully) evenly, so they're using a simple hash/modulus function that assigns each key a partition, deterministically but effectively at random; a tiny standalone example follows.
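To make that concrete, here is a small self-contained example (mine, not from the original code) showing how hashCode() % numPartitions buckets keys. One caveat worth knowing: hashCode() can be negative, which the snippet in the question does not guard against; Hadoop's own HashPartitioner masks the hash with Integer.MAX_VALUE first.

public class PartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 3;
        String[] joinKeys = { "alpha", "beta", "gamma", "delta" };
        for (String key : joinKeys) {
            // Mask to keep the result non-negative, as Hadoop's HashPartitioner does.
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            System.out.println(key + " -> reducer " + partition);
        }
    }
}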
