tapply and lapply using RStudio

What is the difference between tapply and lapply when you are trying to split up data from a population rather than a sample? I'm not sure which code to use when splitting up data for a population instead of a sample.
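For context, here is a minimal sketch of how the two functions are typically called, using the built-in iris data as a stand-in; whether your column holds a population or a sample does not change which function you call, only which statistic you compute with it:
# tapply() applies a function to a vector, split by a grouping factor
tapply(iris$Sepal.Length, iris$Species, mean)
# lapply() applies a function to each element of a list (a data frame is a list of columns)
lapply(iris[, 1:4], mean)
# to split a whole column by group and then work on each piece, split() plus lapply() is common
lapply(split(iris$Sepal.Length, iris$Species), mean)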

Related

Is there a way to generate the random data using Interactive CPLEX?

Is there a way to generate random data (e.g. a triangular matrix, or an array with some restrictions on its values) using Interactive CPLEX?
CPLEX doesn't include any feature that would generate random data for you.
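If it helps, a common workaround is to generate the data in another tool and pass it to your model as ordinary input. A minimal sketch in R (the bounds, size, and file name here are illustrative assumptions):
set.seed(1)
n <- 5
M <- matrix(0, n, n)
# fill only the lower triangle (including the diagonal) with values restricted to [10, 20]
idx <- lower.tri(M, diag = TRUE)
M[idx] <- runif(sum(idx), min = 10, max = 20)
# write the matrix out so it can be read as data by the CPLEX model
write.csv(M, "random_triangular.csv", row.names = FALSE)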

R Studio query definition

I'm using RStudio and building a decision tree with the C5.0 algorithm.
What does train.indices <- sample(1:nrow(iris), 100) do?
Thanks.
This just chooses a random sample of 100 row indices, drawn from all the rows available in your dataset, to use for the training set. You could then get your training data with iris.train <- iris[train.indices, ].
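A minimal sketch of the whole idea (set.seed() is optional and only added here so the sample is reproducible):
set.seed(42)                                # optional: make the random sample reproducible
train.indices <- sample(1:nrow(iris), 100)  # 100 random row numbers out of nrow(iris) = 150
iris.train <- iris[train.indices, ]         # the sampled rows become the training set
iris.test  <- iris[-train.indices, ]        # the remaining 50 rows can serve as a test set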

Comparing Hadoop performance with and without HBase

I'm building an application that finds all images similar to a user's input image, using Hadoop.
I'm implementing it in two ways:
Way 1:
My image collection is converted to a SequenceFile to be used as input for the map function.
Then, in the map function, I use the OpenCV library to compare those images with the user's input image, which includes these steps:
- Extract keypoints
- Compute descriptors
- Calculate the distance between each pair to find the similarity
In the reduce function, I just copy the images that are similar to the output folder.
Way 2:
Similar to way 1, except:
I use HBase to store the image features (keypoints, descriptors) first. Because OpenCV doesn't support converting the keypoint and descriptor data types to byte[] directly (to insert data into HBase, we have to convert them to byte[]), I have to use the trick described in: OpenCV Mat object serialization in java
Then, in the map function, I just query the image features from HBase and compare them with the features of the user's input image.
Intuitively, saving all image features to a database and then simply querying them for comparison against the user's input image should be faster than extracting those features inside every map call before comparing.
But in fact, when I run and test both ways in my virtual machine (standalone mode), way 2 runs slower than way 1, and the running time is not acceptable. In my opinion, way 2 runs slowly because, in the map function, converting the byte[] values from HBase back into OpenCV keypoint and descriptor data types takes a lot of time, and that degrades the performance of the whole map function.
My image collection contains only 240 images in JPG format.
So my question is: besides the reason I give above, is there any other reason that could make way 2 run slower than way 1, such as:
Running in standalone mode is not recommended when using HBase?
The input size is not big enough to warrant using HBase?
Thanks.

Is it possible to get averages from multiple rows in Excel?

How can I produce averages across multiple columns of data in Excel?
If I have something like the following:
How would I calculate the average for each row of the table without manually entering a formula in each cell under "AvgBest" and "AvgMean"? If it makes it any easier, I'm happy to compute the best and mean values in separate spreadsheets, as it can all be put back together afterwards.
I'm trying to produce the averages of all of the sample data.
Is this possible in Excel?
To calculate the simple means, you could do the following. For the average best, enter
=AVERAGE(A2,D2,G2,J2)
in cell M2 and copy it down to the end of your data.
For the average mean, enter
=AVERAGE(B2,E2,H2,K2)
in cell N2 and copy it down to the end of your data.

Divide a dataset into chunks

I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
chunkData <- function(Data, chunkSize){
  # integer label 0, 0, ..., 1, 1, ... marking which chunk each row belongs to
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
  # pull out the rows for each label into its own data frame
  lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
chunkData(iris,100)
I would like to make this function more efficient, so that it runs faster on large datasets.
You can do this easily using split from base R. For example, split(iris, 1:3) will split the iris dataset into a list of three data frames by row. You can modify the arguments to specify a chunk size.
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you specify more information on what you are trying to achieve in your function, people at SO might be able to help.
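A small sketch of that, using chunks of at most 50 rows on iris (50 rather than 1000 only so the toy example yields several chunks):
chunkSize <- 50
groups <- ceiling(seq_len(nrow(iris)) / chunkSize)  # 1,1,...,2,2,...,3,3
chunks <- split(iris, groups)                       # list of three 50-row data frames
results <- lapply(chunks, summary)                  # then process each chunk, e.g. with lapply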
Replace the lapply() call with a call to split():
split(Data, Chunks)
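In other words, the whole function might reduce to something like this sketch:
chunkData <- function(Data, chunkSize){
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)  # same chunk labels as before
  split(Data, Chunks)                              # one data frame per chunk label
}
chunkData(iris, 100)  # a list of two data frames: 100 rows and 50 rows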
You should also take a look at ddply from the plyr package; this package is built around the split-apply-combine principle. The paper about the package explains how this works and what is available in plyr.
The general strategy I would take here is to add a new column to the dataset called chunkid, which cuts the data up into chunks of 1000 rows; take a look at the rep function to create this column. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
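A hypothetical sketch of that approach (chunkid, functionToPerform, and the chunk size of 50 are illustrative; with data like yours you would use 1000):
library(plyr)
dat <- iris
chunkSize <- 50
# rep() builds the chunkid column: 1 repeated 50 times, then 2, then 3
dat$chunkid <- rep(seq_len(ceiling(nrow(dat) / chunkSize)),
                   each = chunkSize, length.out = nrow(dat))
functionToPerform <- function(chunk) data.frame(rows = nrow(chunk))  # stand-in for your real function
result <- ddply(dat, .(chunkid), functionToPerform)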
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices instead of data.frames...
