SPSS - Split file according to 3 string questions - filter

I have data from a 2x2x3 experiment. For this dataset I have 3 manipulation questions. I created three embedded data fields in the survey for these questions with the labels "Correct" and "Wrong".
Now, I would like to split my file in four versions to see if answering the 3 manipulation questions has an influence on my outcome.
To capture the correct/ wrong manipulation questions (ResponseReason, Attribution, Measure), I tried to create a filter variable - without success.
My code for the filter variable:
Do
if (ResponseAttribution = "Correct" and ResponseMeasure = "Correct" and ResponseReason = "Correct").
FilterVar = 3.
ELSE if ((ResponseAttribution = "Correct" and ResponseMeasure = "Correct") or (ResponseAttribution = "Correct" and ResponseReason = "Correct") or (ResponseMeasure = "Correct" and ResponseReason = "Correct")).
FilterVar = 2.
Else if ResponseAttribution = "Correct" or ResponseMeasure = "Correct" or ResponseReason = "Correct".
FilterVar = 1.
else.
FilterVar = 0.
end if.
EXECUTE.

Try this:
compute FilterVar = sum(ResponseAttribution="Correct",
ResponseMeasure="Correct", ResponseReason="Correct").
Alternatively you can create new variables and use them to calculate the filter as follows (this way will make it easier if you later want to make alterations to the filter based on other combinations):
recode ResponseAttribution ResponseMeasure ResponseReason (convert) ("Correct"=1)
into Att Msr Rsn.
compute FilterVar=sum(Att, Msr, Rsn).

Related

How to rename many images after processing with different parameters

Hello dear programmers,
I have a sequence of images and I would like to perform dilation on each of them with different dilation parameters. Then, I would like to save the processed images with new name including both the old name and the corresponding dilation parameter. My codes are as follows.
Input_folder =
Output_folder =
D = dir([Input_folder '*.jpg']);
Inputs = {D.name}';
Outputs = Inputs; % preallocate
%print(length(Inputs));
for k = 1:length(Inputs)
X = imread([Input_folder Inputs{k}]);
dotLocations = find(Inputs{k} == '.');
name_img = Inputs{k}(1:dotLocations(1)-1);
image1=im2bw(X);
vec = [3;6;10];
vec_l = length(vec);
for i = 1:vec_l
%se = strel('disk',2);
fprintf(num2str(vec(i)));
se = strel('diamond',vec(i)); %imdilate
im = imdilate(image1,se);
image2 = im - image1;
Outputs{k} = regexprep(Outputs{k}, name_img, strcat(name_img,'_', num2str(vec(i))));
imwrite(image2, [Output_folder Outputs{k}])
end
end
As it can be seen, I would like to apply dilation with parameters 3,6 and 10. Let us assume that an image has as name "image1", after processing it, I would like to have "image1_3", "image1_6" and "image1_10". However, I am getting as results "image1_3", "image1_6_3" and "image1_10_6_3". Please, how can I modify my codes to fix this problem?
This is because you rewrite each item of the Outputs variable three times, each time using the previous value to create a new value. Instead, you should use the values stored in Inputs new names. Another mistake in your code is that the size of Inputs and Outputs are equal, while for every file in the input folder, three files must be stored in the output folder.
I also suggest using fileparts function, instead of string processing, to get different parts of a file path.
vec = [3;6;10];
vec_l = length(vec);
Outputs = cell(size(Inputs, 1)*vec_l, 1); % preallocate
for k = 1:length(Inputs)
[~,name,ext] = fileparts([Input_folder Inputs{k}]);
% load image
for i = 1:vec_l
% modify image
newName = sprintf('%s_%d%s', name, vec(i), ext);
Outputs{(k-1)*vec_l+i} = newName;
imwrite(image2, [Output_folder newName])
end
end

Graphlab: How to avoid manually duplicating functions that has only a different string variable?

I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
if 'awesome' in product:
return product['awesome']
else:
return 0;
products['awesome'] = products['word_count'].apply(awesome_count)
so far so good, but I need to manually create other functions for each of the selected words in this way, e.g., great_count, etc. How to avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review':['this book is awesome',
'I hate this book']})
products['word_count'] = \
graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
col_name = 'word_count.{}'.format(word)
products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually find out an easier way do do this:
def wordCount_select(wc,selectedWord):
if selectedWord in wc:
return wc[selectedWord]
else:
return 0
for word in selected_words:
products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))

MATLAB parfor slicing a 3D array

I'm trying to speed up my code using parfor. The purpose of the code is to slide a 3D square window on a 3D image and for each block of mxmxm apply a function.
I wrote this code:
function [ o_image ] = SlidingWindow( i_image, i_padSize, i_fun, i_options )
%SLIDINGWINDOW Summary of this function goes here
% Detailed explanation goes here
o_image = zeros(size(i_image,1),size(i_image,2),size(i_image,3));
i_image = padarray(i_image,i_padSize,'symmetric');
i_padSize = num2cell(i_padSize);
[m,n,p] = deal(i_padSize{:});
[row,col,depth] = size(i_image);
windowShape = i_options.windowShape;
mask = i_options.mask;
parfor (i = m+1:row-m,i_options.cores)
temp = i_image(i-m:i+m,:,:);
for j = n+1:col-n
for h = p+1:depth-p
ii = i-m;
jj = j-n;
hh = h-p;
temp = temp(:,j-n:j+n, h-p:h+p);
o_image(ii,jj,hh) = parfeval(i_fun, temp, windowShape, mask);
end
end
end
end
I get one warning and one error that I don't understand how to solve.
The warning says:
the entire array or structure 'i_image' is a broadcast variable.
The error says:
the PARFOR loop can not run due to the way variable 'o_image' is used.
I don't understand how to fix these two things. Any help is greatly appreciated!
As far as I understand, parfeval takes care of running your function on the available number of workers, which is why it doesn't need to be surrounded by parfor. Assuming you already have an active parpool, changing the external parfor into for eliminates both problems.
Unfortunately, I can't support my answer with a benchmark or suggest a more fitting solution because your inputs are unknown.
It seems to me that the code can be optimized in other ways, mainly by vectorization. I would suggest you looked into the following resources:
This question, for additional info on parfeval.
Examples on how to use bsxfun and permute and benchmarks thereof: ex1, ex2, ex3.
P.S.: The 2nd part of (i = m+1:row-m,i_options.cores) seems out of place...

Employ early bail-out in MATLAB

There is a example for Employ early bail-out in this book (http://www.amazon.com/Accelerating-MATLAB-Performance-speed-programs/dp/1482211297) (#YairAltman). for speed improvement we can convert this code:
data = [];
newData = [];
outerIdx = 1;
while outerIdx <= 20
outerIdx = outerIdx + 1;
for innerIdx = -100 : 100
if innerIdx == 0
continue % skips to next innerIdx (=1)
elseif outerIdx > 15
break % skips to next outerIdx
else
data(end+1) = outerIdx/innerIdx;
newData(end+1) = process(data);
end
end % for innerIdx
end % while outerIdx
to this code:
function bailableProcessing()
for outerIdx = 1 : 5
middleIdx = 10
while middleIdx <= 20
middleIdx = middleIdx + 1;
for innerIdx = -100 : 100
data = outerIdx/innerIdx + middleIdx;
if data == SOME_VALUE
return
else
process(data);
end
end % for innerIdx
end % while middleIdx
end % for outerIdx
end % bailableProcessing()
How we did this conversion? Why we have different middleIdx range in new code? Where is checking for innerIdx and outerIdx in new code? what is this new data = outerIdx/innerIdx + middleIdx calculation?
we have only this information for second code :
We could place the code segment that should be bailed-out within a
dedicated function and return from the function when the bail-out
condition occurs.
I am sorry that I did not clarify within the text that the second code segment is not a direct replacement of the first. If you reread the early bail-out section (3.1.3) perhaps you can see that it has two main parts:
The first part of the section (which includes the top code segment) illustrates the basic mechanism of using break/continue in order to bail-out from a complex processing loop, in order to save processing time in computing values that are not needed.
In contrast, the second part of the section deals with cases when we wish to break out of an ancestor loop that is not the direct parent loop. I mention in the text that there are three alternatives that we can use in this case, and the second code segment that you mentioned is one of them (the other alternatives are to use dedicated flags with break/continue and to use try/catch blocks). The three code segments that I provided in this second part of the section should all be equivalent to each other, but they are NOT equivalent to the code-segment at the top of the section.
Perhaps I should have clarified this in the text, or maybe I should have used the same example throughout. I will think about this for the second edition of the book (if and when it ever appears).
I have used a variant of these code segments in other sections of the book to illustrate various other aspects of performance speedups (for example, 3.1.4 & 3.1.6) - in all these cases the code segments are NOT equivalent to each other. They are merely used to illustrate the corresponding text.
I hope you like my book in general and think that it is useful. I would be grateful if you would place a positive feedback about it on Amazon (direct link).
p.s. - #SamRoberts was correct to surmise that mention of my name would act as a "bat-signal", attracting my attention :-)
it's all far more simple than you think!
How we did this conversion?
Irrationally. Those two codes are completely different.
Why we have different middleIdx range in new code?
Randomness. The point of the author is something different.
Where is checking for innerIdx and outerIdx in new code?
dont need that, as it's not intended to be the same code.
what is this new data = outerIdx/innerIdx + middleIdx calculation?
a random calculation as well as data(end+1) = outerIdx/innerIdx; in the original code.
i suppose the author wants to illustrate something far more profoundly: that if you wrap your code that does (possibly many) loops (fors/whiles, doesnt matter) inside a function and you issue a return statement if you somehow detect that you're done, it will result in an effectively "bailable" computation, e.g. the method that does the work returns earlier than it would normally do. that is illustrated here by the condition that checks on data == SOME_VALUE; you can have your favourite bailout condition there instead :-)
moreover, the keywords [continue/break] inside the first example are meant to illustrate that you can [skip the rest of/leave] the inner-most loop from whereever you call them. in principal, you can implement a bailout using these by e.g.
bailing = false;
for outer = 1:1000
for inner = 1:1000
if <somebailingcondition>
bailing = true;
break;
else
<do stuff>
end
end
if bailing
break;
end
end
but that would be very clumsy as that "cascade" of breaks will need to be as long as you have nested loops and messes up the code.
i hope that could clarify your issues.

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in matlab. Suppose, for argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second) so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
stock(i).return = sort(stock(i).return);
stock(i).returnLength = length(stock(i).return);
stock(i).medianReturn = median(stock(i).return);
stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
for j = length(stock(i+1).volume):-1:1
if(isnan(stock(i+1).volume(j)))
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end
end
stock(i+1).volume = sort(stock(i+1).volume);
stock(i+1).volumeLength = length(stock(i+1).volume);
stock(i+1).medianVolume = median(stock(i+1).volume);
stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
for j=stock(i).returnLength:-1:1
if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end;
end;
end;
for i=numStock:-1:1
for j=stock(i+1).volumeLength:-1:1
if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end;
end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(#str2double,returns);
sp500=cellfun(#str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
if (abs(returns(i) - medianReturn) > 3*madReturn)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
if (abs(volume(i) - medianVolume) > 3*madVolume)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep with the for-loops to keep with the structure array, since reparsing this data will be way worse time-wise). The new code snipet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
for j=1:length(stock(i).return)
stock_return(i,j) = stock(i).return(j);
end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle structures, therefore I would restructure the code so that instead of structures it uses just arrays.
For example instead of stock(i).return(j) I would do stock_returns(i,j).
I show you on a part of your code how to get rid of for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from medianVolume, you can write a similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (eg. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, stock_return_length, 1);
is_bad = abs(stock_return - median_return) > 3.* stock_mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so you stick to stock(i).return, you can speed-up your code using essentially the same scheme but you can only get rid of one less for-loop, meaning that your program will be substantially slower.

Resources