Graphlab: How to avoid manually duplicating functions that has only a different string variable? - sentiment-analysis

I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
if 'awesome' in product:
return product['awesome']
else:
return 0;
products['awesome'] = products['word_count'].apply(awesome_count)
so far so good, but I need to manually create other functions for each of the selected words in this way, e.g., great_count, etc. How to avoid this manual effort and write cleaner code?

I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review':['this book is awesome',
'I hate this book']})
products['word_count'] = \
graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
col_name = 'word_count.{}'.format(word)
products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.

I actually find out an easier way do do this:
def wordCount_select(wc,selectedWord):
if selectedWord in wc:
return wc[selectedWord]
else:
return 0
for word in selected_words:
products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))

Related

Office Script - Split strings in vector & use dynamic cell address - Run an excel script from Power Automate

I'm completely new on Office Script (with only old experience on Python and C++) and I'm trying to run a rather "simple" Office Script on excel from power automate. The goal is to fill specific cells (always the same, their position shouldn't change) on the excel file.
The power Automate part is working, the problem is managing to use the information sent to Excel, in excel.
The script take three variables from Power automate (all three strings) and should fill specific cells based on these. CMQ_Name: string to use as is.
Version: string to use as is.
PT_Name: String with names separated by a ";". The goal is to split it in as much string as needed (I'm stocking them in an Array) and write each name in cells on top of each other, always starting on the same position (cell A2).
I'm able to use CMQ_Names & Version and put them in the cell they're supposed to go in, I've already make it works.
However, I cannot make the last part (in bold above, part 2 in the code below) work.
Learning on this has been pretty frustrating as some elements seems to sometime works and sometimes not. Newbie me is probably having syntax issues more than anyting...  
function main(workbook: ExcelScript.Workbook,
CMQ_Name: string,
Version: string,
PT_Name: string )
{
// create reference for each sheet in the excel document
let NAMES = workbook.getWorksheet("CMQ_NAMES");
let TERMS = workbook.getWorksheet("CMQ_TERMS");
//------Part 1: Update entries in sheet CMQ_NAMES
NAMES.getRange("A2").setValues(CMQ_Name);
NAMES.getRange("D2").setValues(Version);
//Update entries in sheet CMQ_TERMS
TERMS.getRange("A2").setValues(CMQ_Name);
//-------Part 2: work with PT_Name
//Split PT_Name
let ARRAY1: string[] = PT_Name.split(";");
let CELL: string;
let B: string = "B"
for (var i = 0; i < ARRAY1.length; i++) {
CELL = B.concat(i.toString());
NAMES.getRange(CELL).setValues(ARRAY1[i]);
}
}
  I have several problems:
Some parts (basically anything with red are detected as a problem and I have no idea why. Some research indicated it could be false positive, other not. It's not the biggest problem either as it seems the code sometimes works despite these warnings.
Argument of type 'string' is not assignable to parameter of type '(string | number | boolean)[ ][ ]'.
I couldn't find a way to use a variable as address to select a specific cell to write in, which is preventing the for loop at the end from working.  I've been bashing my head against this for a week now without solving it.
Could you kindly take a look?
Thank you!!
I tried several workarounds and other syntaxes without much success. Writing the first two strings in cells work, working with the third string doesn't.
EDIT: Thanks to the below comment, I managed to make it work:
function main(
workbook: ExcelScript.Workbook,
CMQ_Name: string,
Version: string,
PT_Name: string )
{
// create reference for each table
let NAMES = workbook.getWorksheet("CMQ_NAMES");
let TERMS = workbook.getWorksheet("CMQ_TERMS");
//------Part 0: clear previous info
TERMS.getRange("B2:B200").clear()
//------Part 1: Update entries in sheet CMQ_NAMES
NAMES.getRange("A2").setValue(CMQ_Name);
NAMES.getRange("D2").setValue(Version);
//Update entries in sheet CMQ_TERMS
TERMS.getRange("A2").setValue(CMQ_Name);
//-------Part 2: work with PT_Name
//Split PT_Name
let ARRAY1: string[] = PT_Name.split(";");
let CELL: string;
let B: string = "B"
for (var i = 2; i < ARRAY1.length + 2; i++) {
CELL = B.concat(i.toString());
//console.log(CELL); //debugging
TERMS.getRange(CELL).setValue(ARRAY1[i - 2]);
}
}
You're using setValues() (plural) which accepts a 2 dimensional array of values that contains the data for the given rows and columns.
You need to look at using setValue() instead as that takes a single argument of type any.
https://learn.microsoft.com/en-us/javascript/api/office-scripts/excelscript/excelscript.range?view=office-scripts#excelscript-excelscript-range-setvalue-member(1)
As for using a variable to retrieve a single cell (or set of cells for that matter), you really just need to use the getRange() method to do that, this is a basic example ...
function main(workbook: ExcelScript.Workbook) {
let cellAddress: string = "A4";
let range: ExcelScript.Range = workbook.getWorksheet("Data").getRange(cellAddress);
console.log(range.getAddress());
}
If you want to specify multiple ranges, just change cellAddress to something like this ... A4:C10
That method also accepts named ranges.

Dynamic Nested Ruby Loops

So, What I'm trying to do is make calls to a Reporting API to filter by all possible breakdowns (breakdown the reports by site, avertiser, ad type, campaign, etc...). But, one issue is that the breakdowns can be unique to each login.
Example:
user1: alice123's reporting breakdowns are ["site","advertiser","ad_type","campaign","line_items"]
user2: bob789's reporting breakdowns are ["campaign","position","line_items"]
When I first built the code for this reporting API, I only had one login to test with, so I hard coded the loops for the dimensions (["site","advertiser","ad_type","campaign","line_items"]). So what I did was pinged the API for a report by sites. Then for each site, pinged for advertisers, and each advertiser, I pinged for the next dimension and so on..., leaving me with a nested loop of ~6 layers.
basically what I'm doing:
sites = mechanize.get "#{base_ur}/report?dim=sites"
sites = Yajl::Parser.parse(sites.body) # json parser
sites.each do |site|
advertisers = mechanize.get "#{base_ur}/report?site=#{site.fetch("id")}&dim=advertiser"
advertisers = Yajl::Parser.parse(advertisers.body) # json parser
advertisers.each do |advertiser|
ad_types = mechanize.get "#{base_ur}/report?site=#{site.fetch("id")}&advertiser=#{advertiser.fetch("id")}&dim=ad_type"
ad_types = Yajl::Parser.parse(ad_types.body) # json parser
ad_types.each do |ad_type|
...and so on...
end
end
end
GET <api_url>/?dim=<dimension to breakdown>&site=<filter by site id>&advertiser=<filter by advertiser id>...etc...
At the end of the nested loop, I'm left with a report that's broken down as much granularity as possible.
This works now since I only thought that there was one path of breaking down, but apparently each account could have different dimensions breakdowns.
So what I'm asking is if given an array of breakdowns, how can I set up a nested loop to traverse down dynamically do the granularity singularity?
Thanks.
I'm not sure what your JSON/GET returns exactly but for a problem like this you would need recursion.
Something like this perhaps? It's not very elegant and can definitely be optimised further but should hopefully give you an idea.
some_hash = {:id=>"site-id", :body=>{:id=>"advertiser-id", :body=>{:id=>"ad_type-id", :body=>{:id=>"something-id"}}}}
#breakdowns = ["site", "advertiser", "ad_type", "something"]
def recursive(some_hash, str = nil, i = 0)
if #breakdowns[i+1].nil?
str += "#{#breakdowns[i]}=#{some_hash[:id]}"
else
str += "#{#breakdowns[i]}=#{some_hash[:id]}&dim=#{#breakdowns[i + 1]}"
end
p str
some_hash[:body].is_a?(Hash) ? recursive(some_hash[:body], str.gsub(/dim.*/, ''), i + 1) : return
end
recursive(some_hash, 'base-url/report?')
=> "base-url/report?site=site-id&dim=advertiser"
=> "base-url/report?site=site-id&advertiser=advertiser-id&dim=ad_type"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&dim=something"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&something=something-id"
If you are just looking to map your data, you can recursively map to a hash as another user pointed out. If you are actually looking to do something with this data while within the loop and want to dynamically recreate the loop structure you listed in your question (though I would advise coming up with a different solution), you can use metaprogramming as follows:
require 'active_support/inflector'
# Assume we are given an input of breakdowns
# I put 'testarr' in place of the operations you perform on each local variable
# for brevity and so you can see that the code works.
# You will have to modify to suit your needs
result = []
testarr = [1,2,3]
b = binding
breakdowns.each do |breakdown|
snippet = <<-END
eval("#{breakdown.pluralize} = testarr", b)
eval("#{breakdown.pluralize}", b).each do |#{breakdown}|
END
result << snippet
end
result << "end\n"*breakdowns.length
eval(result.join)
Note: This method is probably frowned upon, and as I've said I'm sure there are other methods of accomplishing what you are trying to do.

Difficulty Accessing Members of Tuple in Apache Pig

I have a variable titled F.
Describe F returns:
F: {group: bytearray,indexkey: {(indexkey: chararray)}}
Dump F returns:
(321,{(CHOW),(DREW)})
(5011,{(CHOW),(DREW)})
(5825,{(TANNER),(SPITZENBERGER)})
(16631,{(CHOW),(DREW)})
(34299,{(CHOW),(DREW)})
(35044,{(TANNER),(SPITZENBERGER)})
(65623,{(CHOW),(DREW)})
(74597,{(SPITZENBERGER),(TANNER)})
(83499,{(SPITZENBERGER),(TANNER)})
(90257,{(SPITZENBERGER),(TANNER)})
What I need is to produce an output that looks like this (only 1st row as an example):
(321,DREW,{(CHOW)})
I've tried using deference to pull out the first element by using this:
G = FOREACH F generate indexkey.$0;
But, this still returns the whole tuple.
Can anyone suggest a method for doing this? I was under the impression that the deference operator should allow me to do this.
Thanks in advance!
Daniel
You can't index into bags like that. The reason for that is bags don't have any notion of ordering. Selecting the first item in a bag should be treated as picking a random one.
Either way, if you want only one item instead of all of them you can used a nested FOREACH to pull a LIMIT of 1:
first = FOREACH F {
lim = LIMIT indexkey 1;
GENERATE group, lim;
}
(disclaimer: I can't test this code right now, if it doesn't work let me know. Hopefully you can get the gist)
You can take this a bit further and FLATTEN it to remove the bag of one item entirely, but be careful in that if the bag is empty i think you throw away the entire record in this case.
first = FOREACH F {
lim = LIMIT indexkey 1;
GENERATE group, FLATTEN(lim);
}

LotusScript: how to iterate numbered fields with for loop

I'm using LotusScript to clean and export values from a form to a csv file. In the form there are multiple date fields with names like enddate_1, enddate_2, enddate_3, etc.
These date fields are Data Type: Text when empty, but Data Type: Time/Date when filled.
To get the values as string in the csv without errors, I did the following (working):
If Isdate(doc.enddate_1) Then
enddate_1 = Format(doc.enddate_1,"dd-mm-yyyy")
Else
enddate_1 = doc.enddate_1(0)
End If
But to do such a code block for each date field didnt feel right.
Tried the following, but that isnt working.
For i% = 1 To 9
If Isdate(doc.enddate_i%) Then
enddate_i% = Format(doc.enddate_i%,"dd-mm-yyyy")
Else
enddate_i% = doc.enddate_i%(0)
End If
Next
Any suggestions how to iterate numbered fields with a for loop or otherwise?
To iterate numbered fields with a for loop or otherwise?
valueArray = notesDocument.GetItemValue( itemName$ )
however do you know that there is a possibility to export documents in CSV format using Notes Menu?
File\Exort
Also there is a formula:
#Command([FileExport]; "Comma Separated Value"; "c:\document.csv")
Combined solution of Dmytro, clarification of Richard Schwartz with my block of code to a working solution. Tried it as an edit on solution of Dmytro, but was rejected.
My problem was not only to iterate the numbered fields, but also store the values in an iterative way to easily retrieve them later. This I found out today trying to implement the solution of Dmytro combined with the clarification of Richard Schwartz. Used a List to solve it completely.
The working solution for me now is:
Dim enddate$ List
For i% = 1 To 9
itemName$ = "enddate_" + CStr(i%)
If Isdate(doc.GetItemValue(itemName$)) Then
enddate$(i%) = Format(doc.GetItemValue(itemName$),"dd-mm-yyyy")
Else
enddate$(i%) = doc.GetItemValue(itemName$)(0)
End If
Next

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in matlab. Suppose, for argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second) so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
stock(i).return = sort(stock(i).return);
stock(i).returnLength = length(stock(i).return);
stock(i).medianReturn = median(stock(i).return);
stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
for j = length(stock(i+1).volume):-1:1
if(isnan(stock(i+1).volume(j)))
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end
end
stock(i+1).volume = sort(stock(i+1).volume);
stock(i+1).volumeLength = length(stock(i+1).volume);
stock(i+1).medianVolume = median(stock(i+1).volume);
stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
for j=stock(i).returnLength:-1:1
if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end;
end;
end;
for i=numStock:-1:1
for j=stock(i+1).volumeLength:-1:1
if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end;
end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(#str2double,returns);
sp500=cellfun(#str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
if (abs(returns(i) - medianReturn) > 3*madReturn)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
if (abs(volume(i) - medianVolume) > 3*madVolume)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep with the for-loops to keep with the structure array, since reparsing this data will be way worse time-wise). The new code snipet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
for j=1:length(stock(i).return)
stock_return(i,j) = stock(i).return(j);
end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle structures, therefore I would restructure the code so that instead of structures it uses just arrays.
For example instead of stock(i).return(j) I would do stock_returns(i,j).
I show you on a part of your code how to get rid of for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from medianVolume, you can write a similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (eg. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, stock_return_length, 1);
is_bad = abs(stock_return - median_return) > 3.* stock_mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so you stick to stock(i).return, you can speed-up your code using essentially the same scheme but you can only get rid of one less for-loop, meaning that your program will be substantially slower.

Resources