Clustering using Representatives (CURE) - algorithm

I need a numerical example that demonstrates how clustering with the CURE algorithm works.
https://www.cs.ucsb.edu/~veronika/MAE/summary_CURE_01guha.pdf

The pyclustering library has a number of clustering algorithms with examples, and example code on their GitHub. Here is a link to the CURE example.
Googling "CURE algorithm example" also turned up a fair bit.
Hopefully that helps!

Using the pyclustering library, you can extract the representative points and the means with the corresponding methods (link to the generated pyclustering documentation for CURE):
from pyclustering.cluster.cure import cure

# create an instance of the algorithm
cure_instance = cure(<algorithm parameters>)
# start processing
cure_instance.process()
# get the allocated clusters
clusters = cure_instance.get_clusters()
# get the representative points
representative = cure_instance.get_representors()
# get the mean of each cluster
means = cure_instance.get_means()
You can also modify the source code of the CURE algorithm to display the changes after each step, for example by printing them to the console or visualizing them. Here is an example of how to modify the code to display the changes at each clustering step (after line 219). In the plots, stars are representative points, small points are the data points themselves, and big points are the means:
# The new cluster and the updated clusters should be relocated in the queue
self.__insert_cluster(merged_cluster)
for item in cluster_relocation_requests:
    self.__relocate_cluster(item)

#
# ADD THE FOLLOWING PIECE OF CODE TO DISPLAY CHANGES ON EACH STEP
#
temp_clusters = [cure_cluster_unit.indexes for cure_cluster_unit in self.__queue]
temp_representors = [cure_cluster_unit.rep for cure_cluster_unit in self.__queue]
temp_means = [cure_cluster_unit.mean for cure_cluster_unit in self.__queue]

visualizer = cluster_visualizer()
visualizer.append_clusters(temp_clusters, self.__pointer_data)
for cluster_index in range(len(temp_clusters)):
    # representative points as stars, the cluster mean as a big point
    visualizer.append_cluster_attribute(0, cluster_index, temp_representors[cluster_index], '*', 7)
    visualizer.append_cluster_attribute(0, cluster_index, [temp_means[cluster_index]], 'o')
visualizer.show()
You will see a sequence of images, one for each step.
In this way you can display any information that you need.
I would also like to add that you can use the C++ implementation of the algorithm for visualization (it is also part of pyclustering): https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/cure.cpp
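Since the question asks for a numerical example, here is a minimal sketch of the step that gives CURE its name: picking a few well-scattered points in a cluster and shrinking them toward the cluster mean. It is plain Python, not pyclustering's code. The function names, the toy coordinates, and the shrink factor of 0.5 are all assumptions chosen for illustration.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def scattered_points(points, count):
    """Pick `count` well-scattered points: the one farthest from the mean
    first, then repeatedly the one farthest from those already chosen."""
    center = centroid(points)
    chosen, candidates = [], list(points)
    while candidates and len(chosen) < count:
        if not chosen:
            best = max(candidates, key=lambda p: math.dist(p, center))
        else:
            best = max(candidates, key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

def shrink(points, center, alpha):
    """Move every point a fraction `alpha` of the way toward `center`."""
    return [tuple(x + alpha * (c - x) for x, c in zip(p, center)) for p in points]

def cluster_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

# Toy data (hypothetical): a square with its center, and a distant pair.
cluster_a = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
cluster_b = [(10.0, 0.0), (12.0, 0.0)]

mean_a = centroid(cluster_a)  # (2.0, 2.0)
reps_a = shrink(scattered_points(cluster_a, 2), mean_a, alpha=0.5)
reps_b = shrink(scattered_points(cluster_b, 2), centroid(cluster_b), alpha=0.5)

print(reps_a)  # [(1.0, 1.0), (3.0, 3.0)]
print(reps_b)  # [(10.5, 0.0), (11.5, 0.0)]
print(cluster_distance(reps_a, reps_b))
```

For cluster_a the two scattered points are the opposite corners (0, 0) and (4, 4). Shrinking them halfway toward the mean (2, 2) gives the representatives (1, 1) and (3, 3). At each step CURE merges the two clusters whose representatives are closest, then recomputes the representatives of the merged cluster.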

Related

How to include an array of weights to adjust importance of observed data in sm.tsa.UnobservedComponents?

I have used the following 5 lines to build a Kalman filter with your work for a smoothed pricing model, and it worked great.
mod = sm.tsa.UnobservedComponents(obs, 'local level')
lm = sm.OLS(obs, xlm, missing='drop').fit()
obs_noise = abs(lm.resid).mean()
params = [obs_noise, obs_noise / obs_noise_level]
mod_filter, mod_smooth = mod.filter(params), mod.smooth(params)
However, I would now like to adjust the filtering smoothness at certain times. For example, when the unemployment rate or interest rate surges, I would like the Kalman filtered/smoothed output to be closer to the observed value, while at most other times I would keep the model's output as it is. So I have created an array in which a few items are greater than 1 and the others are exactly 1.
e.g.: ir_coeff = np.array([1,1,1,1,1.345,1.23,1.78,1,1,1])
What would be the best approach to achieve this? Thank you very much in advance.
I have tried to include it in the output file with a dot product operation, but that is not a reasonable way to do this.
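To make the goal concrete, here is a minimal sketch of one possible interpretation: divide the observation noise variance at time t by the weight for that time, so a weight above 1 pulls the filtered level toward the observation. This is a hand-written local level filter in plain Python, not the statsmodels API. The function name, the simple initialisation, the noise variances, and the observations are all assumptions. Whether and how such a time-varying variance can be passed to sm.tsa.UnobservedComponents is not shown here.

```python
def weighted_local_level_filter(obs, obs_var, level_var, weights):
    """Kalman filter for a local level model in which the observation
    noise variance at step t is obs_var / weights[t] (an assumption)."""
    level, level_uncertainty = obs[0], obs_var  # simple initialisation
    filtered = []
    for y, w in zip(obs, weights):
        predicted_var = level_uncertainty + level_var       # predict step
        gain = predicted_var / (predicted_var + obs_var / w)
        level = level + gain * (y - level)                   # update step
        level_uncertainty = (1.0 - gain) * predicted_var
        filtered.append(level)
    return filtered

# Hypothetical observations with a surge at positions 4 to 6.
obs = [10.0, 10.2, 9.9, 10.1, 14.0, 13.5, 15.0, 10.0, 10.1, 9.9]
ir_coeff = [1, 1, 1, 1, 1.345, 1.23, 1.78, 1, 1, 1]

plain = weighted_local_level_filter(obs, 1.0, 0.1, [1.0] * len(obs))
weighted = weighted_local_level_filter(obs, 1.0, 0.1, ir_coeff)

# At the first surge the weighted filter sits closer to the observation.
print(abs(plain[4] - obs[4]), abs(weighted[4] - obs[4]))
```

A larger weight means a smaller assumed observation noise and therefore a larger Kalman gain, which is what moves the filtered value toward the observed one at that step.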

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace, see reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary consists only of coherent sentences with complete thoughts and remains concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but to have the BART model itself produce complete sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.
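For reference, the post-processing fallback mentioned above (the approach I would like to avoid) would look roughly like this minimal sketch. The function name is hypothetical, and the rule it uses (cut after the last ".", "!" or "?" that is followed by whitespace or the end of the text) is an assumption that will not handle every abbreviation.

```python
import re

def trim_to_last_sentence(text):
    """Drop any trailing fragment after the last sentence-ending mark.
    Returns the text unchanged if no sentence end is found."""
    ends = list(re.finditer(r'[.!?](?=\s|$)', text))
    if not ends:
        return text
    return text[:ends[-1].end()]

ex1 = ("The opacity at the left lung base appears stable from prior exam. "
       "There is elevation of the left hemidi")
print(trim_to_last_sentence(ex1))
# The opacity at the left lung base appears stable from prior exam.
```

This only hides the truncation. It does not make the model plan its summary to fit the length budget, which is what I am actually after.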

Parameters for dlib::find_min_bobyqa

I'm working on a C++ version of Matt Zucker's page dewarping. So far everything works fine, but I have a problem with the optimization. In line 748 of the GitHub repo, Matt uses the optimize function from SciPy. My C++ equivalent is find_min_bobyqa from dlib.net. The code is:
auto f = [&](const column_vector& ppts) { return objective(dstpoints, ppts, keypoint_index); };
dlib::find_min_bobyqa(f,
    params,
    2 * params.nr() + 1,  // npt - number of interpolation points: x.size() + 2 <= npt && npt <= (x.size()+1)*(x.size()+2)/2
    dlib::uniform_matrix<double>(params.nr(), 1, -2),  // lower bound constraint
    dlib::uniform_matrix<double>(params.nr(), 1, 2),   // upper bound constraint
    1,     // initial trust region radius
    1e-5,  // stopping trust region radius
    4000   // max number of objective function evaluations
);
In my concrete example params is a dlib::column_vector of 189 double values. Every element of params is less than 2.0 and greater than -2.0. The function objective() returns a double, and on its own it works properly, because I get the same value as in the Python version. But after running find_min_bobyqa I usually get the message:
Terminate called after throwing an instance of 'dlib::bobyqa_failure', return from BOBYQA because the objective function has been called max_f_evals times.
I set max_f_evals to quite a big value to see if it optimizes at all, but it doesn't. I tried tweaking the parameters, but without good results. How should I set the parameters of find_min_bobyqa to get the right solution?
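As a quick sanity check on the numbers above, this small sketch works out the npt range from the constraint quoted in the code comment for a vector of length 189. The helper name is hypothetical. The remark about the evaluation budget assumes that BOBYQA spends about npt objective evaluations on building its initial interpolation model before it starts iterating.

```python
def bobyqa_npt_bounds(n):
    """Allowed range for npt given n variables:
    n + 2 <= npt <= (n + 1) * (n + 2) / 2."""
    return n + 2, (n + 1) * (n + 2) // 2

n = 189                      # length of params in the question
lower, upper = bobyqa_npt_bounds(n)
npt = 2 * n + 1              # the value passed to find_min_bobyqa
max_f_evals = 4000

print(lower, upper)          # 191 18145
print(npt)                   # 379
print(max_f_evals - npt)     # 3621 evaluations left after initialisation
```

So npt = 379 is inside the allowed range, and roughly 3600 evaluations remain for the actual search over 189 variables.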
I am very interested in this issue as well. Zucker's work, with very minor tweaks, is ideal for straightening sheet music images, and I was looking for ways to implement it in a mobile platform when I came across your question.
My research so far suggests that BOBYQA is not the equivalent of Powell's method in scipy. BOBYQA is constrained, and the one in scipy is not.
See these links for more information, and a possible way to compile the right supporting library - I would try UOBYQA or NEWUOA.
https://github.com/jacobwilliams/PowellOpt
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#rdd2e1855725e-3
(See the Notes section)
EDIT: see C version here:
https://github.com/emmt/Algorithms/tree/master/newuoa
I wanted to post this as a comment, but I don't have enough points for that.
I am very interested in your progress. If you're willing, please keep me posted.
I finally solved this problem. I used the PRAXIS library, because it doesn't need derivative information and is fast.
I modified the code a little to fit my needs, and now it is a few seconds faster than the original version written in Python.

Updating Weights from Caffe and DIGITS

I've built a model using DIGITS by Nvidia, but when I try to run it using Caffe, I don't know where the weights are. Any idea how I'd find them? I have the architecture, because that is provided right on the output model screen.
The weights are not accessible from any of the output models given in the DIGITS UI. However, they are accessible!
I use NVIDIA's DGX, which can run Python code. To pull the weights on that platform (where I route the models to be saved), I use this bit of code:
import caffe

net = caffe.Net('../models/bvlc_reference_caffenet/deploy.prototxt',
                '../models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel',
                caffe.TEST)
params = ['fc6', 'fc7', 'fc8']
# map each layer name to its (weights, biases) arrays
fc_params = {pr: (net.params[pr][0].data, net.params[pr][1].data) for pr in params}
for fc in params:
    print('{} weights are {} dimensional and biases are {} dimensional'.format(fc, fc_params[fc][0].shape, fc_params[fc][1].shape))
Good Luck!

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine, but I am looking at a data set of 13000 reviews, and looping through this code for each review is going to take forever (if my computer doesn't run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to Python, so I was hoping a more experienced programmer could help with condensing it, or perhaps point me towards a library that contains the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This converts what was an O(n) traversal through dictionary for every bigram into a hash lookup, which is O(1) on average.
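A minimal sketch of the whole refactoring, starting from the list you have now, might look like this. The helper names and the toy bigrams are made up for illustration, and it assumes every bigram appears only once in your original list.

```python
def build_index(bigrams):
    """Map each bigram to its position in the feature array."""
    return {bigram: position for position, bigram in enumerate(bigrams)}

def featurize(scored_bigrams, index):
    """Build one feature array for a review from (bigram, score) pairs."""
    features = [0] * len(index)
    for bigram, score in scored_bigrams:
        position = index.get(bigram)   # hash lookup instead of a scan
        if position is not None:       # skip bigrams not in the dictionary
            features[position] = int(score)
    return features

# Toy data (hypothetical): the existing list of bigrams and one review.
dictionary = [("good", "food"), ("bad", "service"), ("great", "value")]
index = build_index(dictionary)        # build once, reuse for all reviews
new_scored = [(("bad", "service"), 2), (("not", "listed"), 5)]

features = featurize(new_scored, index)
print(features)  # [0, 2, 0]
```

Build the index once, outside the loop over the 13000 reviews, and call featurize for each review.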
Hope this helps.
