How to rename Pipe fields in Cascading? - hadoop

On two separate occasions, I've had to rename all the fields in a Pipe in order to join (using Merge or CoGroup). What I have done recently is:
// These two pipes contain similar values but different field names
Pipe papa = new Retain(papa, fieldsFrom);
Pipe pepe = new Retain(pepe, fieldsTo);

// Where fieldsFrom.size() == fieldsTo.size() and the field positions match
for (int i = 0; i < fieldsFrom.size(); i++) {
    pepe = new Rename(pepe, fieldsFrom.select(new Fields(i)),
                      fieldsTo.select(new Fields(i)));
}

// This allows me to do this:
Pipe retVal = new Merge(papa, pepe);
Obviously this is pretty fragile, since I need to ensure that the field positions in fieldsFrom and fieldsTo stay aligned, that they are the same size, and so on.
Is there a better, less fragile way to merge without going through all the ceremony above?

You can eliminate some ceremony by utilizing Rename's ability to handle aligned from/to fields like this:
pepe = new Rename(pepe, fieldsFrom, fieldsTo);
But this only eliminates the for loop; yes, you must ensure fieldsFrom and fieldsTo are the same size and aligned to correctly express the rename.
cascading.jruby addresses this by wrapping renaming in a function that accepts a mapping rather than aligned from/to fields.
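For example, a small plain-Java helper in that spirit could accept a map of old-to-new names and apply Rename one entry at a time; the helper itself is hypothetical and not part of the Cascading API, only Rename and Fields are Cascading classes:
import java.util.Map;

import cascading.pipe.Pipe;
import cascading.pipe.assembly.Rename;
import cascading.tuple.Fields;

public final class RenameUtil {
    // Hypothetical helper: apply a from -> to field-name mapping as a series of Renames,
    // so callers never have to keep two positionally aligned Fields instances in sync.
    public static Pipe renameAll(Pipe pipe, Map<String, String> mapping) {
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            pipe = new Rename(pipe, new Fields(entry.getKey()), new Fields(entry.getValue()));
        }
        return pipe;
    }
}
With that, the rename loop collapses to something like pepe = RenameUtil.renameAll(pepe, fieldMapping), where fieldMapping is any Map<String, String> you build once.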
It is also the case that Merge requires the incoming pipes to declare the same fields, whereas CoGroup only requires that you provide declaredFields to ensure there are no name collisions on the output (all fields propagate through, including the grouping keys from every input).
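For illustration, a CoGroup with explicit declaredFields might look roughly like the sketch below, assuming papa declares ("id", "name") and pepe declares ("id", "value"); the field names are made up, and declaredFields has to cover and rename every field arriving from both sides:
// Hypothetical field layout, for illustration only
Fields declared = new Fields("papa_id", "papa_name", "pepe_id", "pepe_value");
Pipe joined = new CoGroup(papa, new Fields("id"), pepe, new Fields("id"), declared);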

Related

How to modify this kind of file by Ruby

I have a file like:
Fruit.Store={
#order:123, order:345, order:456
#order:789
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 10;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Except for the comments, each line has the same format: customer-id and item:store/ are fixed, and customer-id is a 5-digit number. There are about 1000 unique lines in the file.
When a new order is placed for the same customer-id and fruit type with a different quantity, I want the order id to be added to the comment line above and the quantity updated. For example, if a new order 001 is placed with the information "customer-id:23456,item:store/watermelon" = 5; then we should have a new file:
Fruit.Store={
#order:123, order:345, order:456
#order:789, order:001
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 5;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Is it possible to do this in an efficient way? Because the file has to be read and written line by line, how can we detect the matching line and go back to the previous line to modify it? Thank you.
In short: no, it is not possible to do so in an efficient way. Your best bet is to open separate files for reading and writing, but even then you'll effectively be rewriting the entire file over and over.
Ultimately, you should be using some sort of relational database, such as SQL. Those databases were practically invented for this exact use case.
I really don't like to tell people that the only solution is to do something entirely different from what they're doing, but in this case I can't stress enough how poorly text files scale for managing data.

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case); the generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from using one:
some characters are disallowed, e.g. I, l and 1 are lookalikes, and so are O and 0
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;
    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);
        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }
    return $uuid;
}
Please spare me the critique; we live in an agile world, and this implementation is functional and quick to code.
With a small set of data it works beautifully. However, if I have 10 million entries in the blacklist and try to create 1000 more, it falls flat, taking 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community could point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate it so it fits within the length of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
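A minimal sketch of that first option, shown in Java for concreteness and reusing the question's allowed alphabet (the voucher string is just an example input; a database check and retry is still needed):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class HashCode {
    private static final String ALPHABET =
        "ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789";

    // Hash something unique and map the digest bytes onto the allowed alphabet,
    // truncated to 8 characters. Collisions remain possible, so check the DB and retry.
    static String codeFrom(String uniqueInput) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(uniqueInput.getBytes(StandardCharsets.UTF_8));
        StringBuilder code = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            code.append(ALPHABET.charAt((digest[i] & 0xFF) % ALPHABET.length()));
        }
        return code.toString();
    }
}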
Generate five of the characters from an incrementing counter and three at random. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
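A rough sketch of that second option in Java, assuming the incrementing counter comes from the database (for example an AUTO_INCREMENT id) and encoding it in the question's 56-character alphabet:
import java.security.SecureRandom;

public final class CounterCode {
    private static final String ALPHABET =
        "ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // counterValue is assumed to come from the database, so the 5-character prefix never repeats.
    static String buildCode(long counterValue) {
        char[] prefix = new char[5];
        for (int i = 4; i >= 0; i--) {           // encode the counter as 5 base-56 "digits"
            prefix[i] = ALPHABET.charAt((int) (counterValue % ALPHABET.length()));
            counterValue /= ALPHABET.length();
        }
        StringBuilder code = new StringBuilder(new String(prefix));
        for (int i = 0; i < 3; i++) {            // 3 random characters make codes hard to guess
            code.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return code.toString();
    }
}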
I ended up modifying the approach: instead of checking for uuid existence on every loop (e.g. 50K DB checks), I now split the generated codes into chunks of 1000 and issue an INSERT IGNORE batch query inside a transaction.
If the number of affected rows equals the number of items (1000 in this case), I know there was no collision and I can commit the transaction. Otherwise I roll back the chunk and generate another 1000 codes.
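The original code is Laravel/PHP; purely as an illustration of that chunked INSERT IGNORE pattern, a JDBC sketch (assuming MySQL and a UNIQUE index on codes.uuid) might look like:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public final class ChunkInserter {
    // Returns true if the whole chunk was inserted without collisions, false if it must be regenerated.
    static boolean insertChunk(Connection conn, List<String> chunk) throws SQLException {
        StringBuilder sql = new StringBuilder("INSERT IGNORE INTO codes (uuid) VALUES ");
        for (int i = 0; i < chunk.size(); i++) {
            sql.append(i == 0 ? "(?)" : ",(?)");
        }
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < chunk.size(); i++) {
                ps.setString(i + 1, chunk.get(i));
            }
            int affected = ps.executeUpdate();   // duplicate uuids are ignored and not counted
            if (affected == chunk.size()) {
                conn.commit();                   // no collisions in this chunk
                return true;
            }
            conn.rollback();                     // at least one collision: discard and regenerate the chunk
            return false;
        }
    }
}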

Exporting data into excel using iterative loop

I am doing an iterative calculation in Maple and I want to store the resulting data (which comes as a column matrix) from each iteration in a specific column of an Excel file. For example, my data is
mydat||1:= <<11,12,13,14>>:
mydat||2:= <<21,22,23,24>>:
mydat||3:= <<31,32,33,34>>:
and so on.
I am trying to export each of them to an Excel file, and I want each data set to be stored in consecutive columns of the same file. For example, mydat||1 goes to column A, mydat||2 goes to column B, and so on. I tried something like the following.
with(ExcelTools):
for k from 1 to 3 do
Export(mydat||k, "data.xlsx", "Sheet1", "A:C"): #The problem is selecting the range.
end do:
How do I select the range appropriately here? Is there any other method to export the data and store in the way that I explained above?
There are a couple of ways to do this. The easiest is certainly to put all of your data into one data structure and then export that. For example:
mydat1:= <<11,12,13,14>>:
mydat2:= <<21,22,23,24>>:
mydat3:= <<31,32,33,34>>:
mydata := Matrix( < mydat1 | mydat2 | mydat3 > );
This stores your data in a Matrix where mydat1 is the first column, mydat2 is the second column, etc. With the data in this form, either ExcelTools:-Export or the more generic Export command will work:
ExcelTools:-Export( mydata, "data.xlsx" );
Export( "data.xlsx", mydata );
Now, since you mention that you are doing an iterative calculation, you may want to write the results out column by column. Here's another method that doesn't involve creating another data structure to hold the results. It does assume that the mydat1, mydat2, mydat3 data has been created before the loop.
for i to 3 do
ExcelTools:-Export( cat(`mydat`,i), "data.xlsx", 1, ["A1","B1","C1"][i] );
end do;
If you want to write the data out to a file as you are building it, then just do the Export call after the creation of each of the columns, i.e.
ExcelTools:-Export( mydat1, "data.xlsx", 1, "A1" );
Note that I removed the "||" characters. These are used in Maple for concatenation and caused some issues with the second method.

Process an input file having multiple name value pairs in a line

I am writing a Kettle transformation.
My input file looks like the following:
sessionId=40936a7c-8af9|txId=40936a7d-8af9-11e|field3=val3|field4=val4|field5=myapp|field6=03/12/13 15:13:34|
Now, how do I process this file? I am completely at a loss.
The first step is a CSV file input with | as the delimiter.
My analysis will be based on the "value" part of each name-value pair.
Has anyone processed such files before?
Since you have already split the records into fields of the form 'key=value', you could use an expression transform to cut each string in two by locating the position of the = character, creating two output ports where one holds the key and the other the value.
From there it depends on what you want to do with the information: if you want to store them as key/value pairs, route them through a union, or use a router transform to send them to different targets.
An example expression would find the = character and return the substring before it as the key and the substring after it as the value.
You could use the Modified Javascript Value step; add it after the step that splits the line on the pipe delimiter.
Then do some parsing JavaScript like this:
var mainArr = new Array();
var sessionIdSplit = sessionId.toString().split("|");
for (var y = 0; y < sessionIdSplit.length; y++) {
    // each entry looks like "key=value"; split on the first "=" to keep the value part
    var pair = sessionIdSplit[y].toString();
    var pos = pair.indexOf("=");
    mainArr[y] = pos >= 0 ? pair.substring(pos + 1) : pair;
}
Alert("mainArr: " + mainArr);

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in MATLAB. For argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name, structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (a very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? Or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second), so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
    for j=length(stock(i).return):-1:1
        if(isnan(stock(i).return(j)))
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end
    end
    stock(i).return = sort(stock(i).return);
    stock(i).returnLength = length(stock(i).return);
    stock(i).medianReturn = median(stock(i).return);
    stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
    for j = length(stock(i+1).volume):-1:1
        if(isnan(stock(i+1).volume(j)))
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end
    end
    stock(i+1).volume = sort(stock(i+1).volume);
    stock(i+1).volumeLength = length(stock(i+1).volume);
    stock(i+1).medianVolume = median(stock(i+1).volume);
    stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
    for j=stock(i).returnLength:-1:1
        if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end;
    end;
end;
for i=numStock:-1:1
    for j=stock(i+1).volumeLength:-1:1
        if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end;
    end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(@str2double,returns);
sp500=cellfun(@str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
    if (abs(returns(i) - medianReturn) > 3*madReturn)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
    if (abs(volume(i) - medianVolume) > 3*madVolume)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep the for-loops in order to keep the structure array, since reparsing this data would be way worse time-wise). The new code snippet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
    for j=1:length(stock(i).return)
        stock_return(i,j) = stock(i).return(j);
    end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle the structures, so I would restructure the code to use plain arrays instead of structures.
For example, instead of stock(i).return(j) I would use stock_return(i,j).
I will show, on part of your code, how to get rid of the for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
    if(isnan(stock(i).return(j)))
        for k=numStock+1:-1:1
            stock(k).return(j) = [];
        end
    end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from the median return, you can write similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (e.g. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, 1, stock_return_length);
mad_return = repmat(stock_mad_return, 1, stock_return_length);
is_bad = abs(stock_return - median_return) > 3 .* mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure and stick with stock(i).return, you can speed up your code using essentially the same scheme, but you will be able to eliminate one for-loop fewer, meaning that your program will be substantially slower.
