Compare performance between using Hadoop with and without Hbase

Compare performance between using Hadoop with and without Hbase - hadoop

I'm building an application that finding all similar images from user's input image using Hadoop.
I'm implementing it in two ways:
Way 1:
My collection images is converted to SequenceFile to be used as input for map function.
Then in map function, use OpenCV libary for compare similarity between those images with user's input image which include these steps:
- Extract keypoints
- Compute descriptors
- Calculate distance between each pairs to find the similarity
In Reduce function, I just copy images that is similar to output folder.
Way 2:
Similar with way 1 except:
I use Hbase to store image features first (keypoints, descriptors). To do that, because OpenCV doesnt support the way to convert keypoints, descriptors data type to bytes[] directly (in order to insert data to Hbase, we have to convert to bytesl[]) so I have to use a trick that I refer in this: OpenCV Mat object serialization in java
Then in map function, I will just query image features from Hbase to compare with user'input image feature.
In normal thought, we can see that saving all image features to a database then just query them to compare with user's input image will be faster than in each map function we have to start extract these feature to do comparison.
But in fact, when I do and test two ways in my virtual machine (standalone mode), I see that the way 2 run slower than way 1 and running time is not acceptable. In my opinion, I think that the way 2 run slowly because in the map function, it takes much time to convert from bytes[] value in Hbase to keypoints, descriptors datatype in openCV to do comparison. That why it degrades the performance of whole map function.
My collection images are just include 240 images in jpg format.
So my question here that, beside the reason I think as above that make way 2 run slower than way 1. Is there any reason else that make way 2 run slower than way 1 such as:
Run in standalone mode is not recommended for using Hbase?
Input size is not enough big to use Hbase?
Thanks.

Related

How to find average of two lines in NiFi?

I need to find average of two values in separate lines.
My CSV file looks like this
Name,ID,Marks
Mahi,1,90
Mahi,1,100
Andy,2,85
Andy,2,95
Now I need to store that average of 2 marks in database.
"Average" column should add two marks and divide with 2 and store that result in SQL query
Table:
Name,ID,Average
Mahi,2,95
Andy,2,90
Is it possible to find the average of two values in separate rows using NiFi?

Given a lot of assumptions, this is doable. You are definitely better off pre-processing the data in NiFi and exporting it to a tool better suited to this, like Apache Spark using the NiFi Spark Receiver library (instructions here), because this solution will not scale well.
However, you could certainly use a combination of SplitText processors to get the proper data into individual flowfiles (i.e. all Mahi rows in one, all Andy rows in another). Once you have a record that looks like:
Andy,1,85
Andy,1,95
you can use ExtractText with regular expressions to get 85 and 95 into attributes marks.1 and marks.2 (a good example of where scaling will break down -- doing this with 2 rows is easy; doing this with 100k is ridiculous). You can then use UpdateAttribute with the Expression Language to calculate the average of those two attributes (convert toNumber() first) and populate a third attribute marks.average (either through chaining plus() and divide() functions or with the math advanced operation (uses Java Reflection)). Once you have the desired result in an attribute, use ReplaceText to update the flowfile content, and MergeContent to merge the individual flowfiles back into a single instance.
If this were me, I'd first evaluate how static my incoming data format was, and if it was guaranteed to stay the same, probably just write a Groovy script that parsed the data and calculated the averages in place. I think that would even scale better (within reason) because of the flexibility of having written domain-specific code. If you need to offload this to cluster operations, Spark is the way to go.

Faster alternative for saving images in MATLAB? [duplicate]

This question already has an answer here:
save high resolution figures with parfor in matlab
(1 answer)
Closed 8 years ago.
I've got a ~1600 line program that reads in images (either tiff or raw), performs a whole bunch of different mathematical and statistical analyses, and then outputs graphs and data tables at the end.
Almost two-thirds of my processing time is due to looping 16 times over the following code:
h = figure('Visible','off','units','normalized','outerposition',[0 0 1 1]);
set(h,'PaperPositionMode','auto');
imagesc(picdata); colormap(hot);
imgtmp = hardcopy(h,'-dzbuffer','-r0');
imwrite(imgtmp,hot,'picname.png');
Naturally, 'picname.png' and picdata are changing each time around.
Is there a better way to invisibly plot and save these pictures? The processing time mostly takes place inside imwrite, with hardcopy coming second. The whole purpose of the pictures is just to get a general idea of what the data looks like; I'm not going to need to load them back into Matlab to do future processing of any sort.

Try to place the figure off-screen (e.g., Position=[-1000,-1000,500,500]). This will make it "Visible" and yet no actual rendering will need to take place, which should make things faster.
Also, try to reuse the same figure for all images - no need to recreate the figure and image axes and colormap every time.
Finally, try using my ScreenCapture utility rather than hardcopy+imwrite. It uses a different method for taking a "screenshot" which may possibly be faster.

mapreduce way to calculate user similarity matrix

I have a list of many users (over 10 million) each of which is represented by a userid followed by 10 floating-point numbers indicating their preference. I would like to efficiently calculate the user similarity matrix using cosine similarity based on mapreduce. However, since the values are floating-point numbers, it is hard to determine a key in the mapreduce framework. Any suggestions?

I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.
The first is Mahout's ItemSimilarityJob that is part of its recommender system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which for your case would be SIMILARITY_COSINE) along with any additional optimizations. Since you are looking to calculate user-user similarity based on a preference vector of ten floating point value, what you could do is assign a simple 1-to-10 numeric hash for the indices of the vector and generate a simple .csv file of vectorIndex, userID, decimalValue as input for the Mahout item-similarity job (the userID being a numeric Int or Long value). The resulting output should be a tab separated text file of userID,userID,similarity.
A second solution might be Mahout's RowSimilarityJob included in its math library. I've never used it myself, but some info can be found here and in this previous stackoverflow thread. Rather than a .csv as input, you would need to translate your input data as a DistributedRowMatrix, the userIDs being the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
I suppose which solution is better depends on what input/output format you prefer. All the best.

Best way of storing differences between 2 images

I am doing motion detection.
I compare 2 images at a time.
The differences are compared at the pixel level.
I want to store the differences in a file.
I have tried saving the hex value into 2 dimensional string and using the binary formatter serializing it out to a file. But the size is 495kb and the original image size is only 32kb.
What is the most efficient way of storing differences?
I am using C#
Thanks

There are many ways. Maybe took a look how bdiff is doing it. In general, compare the binary value, not a hex representation. Maybe also the binary formatter serialization adds some overhead.

Divide a dataset into chunks

I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
chunkData <- function(Data,chunkSize){
Chunks <- floor(0:(nrow(Data)-1)/(chunkSize))
lapply(unique(Chunks),function(x) Data[Chunks==x,])
}
chunkData(iris,100)
I would like to make this function more efficient, so that it runs faster on large datasets.

You can do this easily using split from base R. For example, split(iris, 1:3), will split the iris dataset into a list of three data frames by row. You can modify the arguments to specify a chunk size.
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you specify more information on what you are trying to achieve in your function, people at SO might be able to help.

Replace the lapply() call with a call to split():
split(Data, Chunks)

You should also take a look at ddply fom the plyr package, this package is built around the split-apply-combine principle. This paper about the package explains how this works and what things are available in plyr.
The general strategy I would take here is to add a new data to the dataset called chunkid. This cuts up the data in chunks of 1000 rows, look at the rep function to create this row. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices in stead of data.frames...

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio