How to sort comma-separated keys in Reducer output? - sorting

I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=BigInteger, F=BigInteger, M=BigDecimal, and the value is also a Text representing the Customer_ID. I know that Hadoop sorts output based on keys, but my final result is a bit weird. I want the output keys to be sorted by R first, then F, and then M. But I am getting the following output sort order for unknown reasons:
545,1,7652 100000
545,23,390159.402343750 100001
452,13,132586 100002
452,4,32202 100004
452,1,9310 100007
452,1,4057 100018
452,3,18970 100021
But I want the following output:
545,23,390159.402343750 100001
545,1,7652 100000
452,13,132586 100002
452,4,32202 100004
452,3,18970 100021
452,1,9310 100007
452,1,4057 100018
NOTE: The customer_ID was the key in the Map phase, and all the RFM values belonging to a particular Customer_ID are brought together at the Reducer for aggregation.

So after a lot of searching, I found some useful material, a compilation of which I am posting now:
You have to start with your custom data type. Since I had three comma-separated values which needed to be sorted in descending order, I had to create a TextQuadlet.java data type in Hadoop. The reason I am creating a quadlet is that the first part of the key will be the natural key and the remaining three parts will be the R, F, M:
import java.io.*;
import org.apache.hadoop.io.*;

public class TextQuadlet implements WritableComparable<TextQuadlet> {
    private String customer_id;
    private long R;
    private long F;
    private double M;

    public TextQuadlet() {
    }

    public TextQuadlet(String customer_id, long R, long F, double M) {
        set(customer_id, R, F, M);
    }

    public void set(String customer_id2, long R2, long F2, double M2) {
        this.customer_id = customer_id2;
        this.R = R2;
        this.F = F2;
        this.M = M2;
    }

    public String getCustomer_id() {
        return customer_id;
    }

    public long getR() {
        return R;
    }

    public long getF() {
        return F;
    }

    public double getM() {
        return M;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.customer_id);
        out.writeLong(this.R);
        out.writeLong(this.F);
        out.writeDouble(this.M);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.customer_id = in.readUTF();
        this.R = in.readLong();
        this.F = in.readLong();
        this.M = in.readDouble();
    }

    // hashCode is kept consistent with equals. Note that the custom
    // partitioner below partitions on the natural key (customer_id) rather
    // than on this composite hash, so that all the composite keys belonging
    // to one customer reach the same reducer.
    @Override
    public int hashCode() {
        return (int) (customer_id.hashCode() * 163 + R + F + M);
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextQuadlet) {
            TextQuadlet tp = (TextQuadlet) o;
            return customer_id.equals(tp.customer_id) && R == tp.R && F == tp.F && M == tp.M;
        }
        return false;
    }

    @Override
    public String toString() {
        return customer_id + "," + R + "," + F + "," + M;
    }

    // compareTo defines the sort order of the composite key. Returning a
    // negative value means this key sorts before tp, a positive value means
    // it sorts after, and 0 means the two keys tie.
    @Override
    public int compareTo(TextQuadlet tp) {
        // The natural key, customer_id, is deliberately not taken into
        // consideration here: only R, F and M are compared, each in
        // descending order.
        if (this.R != tp.R) {
            return this.R < tp.R ? 1 : -1;
        }
        if (this.F != tp.F) {
            return this.F < tp.F ? 1 : -1;
        }
        if (this.M != tp.M) {
            return this.M < tp.M ? 1 : -1;
        }
        return 0;
    }

    public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
        return tp1.compareTo(tp2);
    }

    public static int compare(Text customer_id1, Text customer_id2) {
        return customer_id1.compareTo(customer_id2); // was comparing customer_id1 to itself
    }
}
Next, you'll need a custom partitioner so that all the keys which share the same natural key (the customer_id) end up at the same reducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
    @Override
    public int getPartition(TextQuadlet key, Text value, int numPartitions) {
        // Partition on the natural key (customer_id) only, so that every
        // composite key for a given customer lands in the same partition;
        // masking with Integer.MAX_VALUE keeps the result non-negative.
        return (key.getCustomer_id().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Thirdly, you'll need a custom group comparator so that the values are grouped by their natural key, which is the customer_id, and not by the composite key customer_id,R,F,M:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class GroupComparator_RFM_N extends WritableComparator {
    protected GroupComparator_RFM_N() {
        super(TextQuadlet.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        TextQuadlet ip1 = (TextQuadlet) w1;
        TextQuadlet ip2 = (TextQuadlet) w2;
        // Here we tell Hadoop to group the keys by their natural key.
        return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
    }
}
Fourthly, you'll need a key comparator which again sorts the keys by R, F, M in descending order, implementing the same sort technique that is used in TextQuadlet.java. Since I got lost while coding, I slightly changed the way I compared the data types in this function, but the underlying logic is the same as in TextQuadlet.java:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class KeyComparator_RFM extends WritableComparator {
    protected KeyComparator_RFM() {
        super(TextQuadlet.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        TextQuadlet ip1 = (TextQuadlet) w1;
        TextQuadlet ip2 = (TextQuadlet) w2;
        // Same ordering as TextQuadlet.compareTo(): R, then F, then M, each
        // in descending order. A negative return value means w1 sorts before
        // w2; a positive value means it sorts after.
        if (ip1.getR() != ip2.getR()) {
            return ip1.getR() < ip2.getR() ? 1 : -1;
        }
        if (ip1.getF() != ip2.getF()) {
            return ip1.getF() < ip2.getF() ? 1 : -1;
        }
        if (ip1.getM() != ip2.getM()) {
            return ip1.getM() < ip2.getM() ? 1 : -1;
        }
        return 0;
    }
}
And finally, in your driver class, you'll have to include our custom classes. Here I have used TextQuadlet,Text as the k-v pair, but you can choose other classes depending on your needs:
job.setPartitionerClass(FirstPartitioner_RFM.class);
job.setSortComparatorClass(KeyComparator_RFM.class);
job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
job.setMapOutputKeyClass(TextQuadlet.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(TextQuadlet.class);
job.setOutputValueClass(Text.class);
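For context, here is a minimal sketch of a mapper/reducer pair (each in its own file) that could drive these classes. The input format, lines of the form customer_id,R,F,M, and the class names are my assumptions, not part of the original job:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RFMSortMapper extends Mapper<LongWritable, Text, TextQuadlet, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: customer_id,R,F,M
        String[] parts = line.toString().split(",");
        TextQuadlet key = new TextQuadlet(parts[0],
                Long.parseLong(parts[1]),      // R
                Long.parseLong(parts[2]),      // F
                Double.parseDouble(parts[3])); // M
        context.write(key, new Text(parts[0]));
    }
}

public class RFMSortReducer extends Reducer<TextQuadlet, Text, TextQuadlet, Text> {
    @Override
    protected void reduce(TextQuadlet key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive sorted by R, F, M descending; grouping is by customer_id.
        for (Text customerId : values) {
            context.write(key, customerId);
        }
    }
}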
Do correct me if I am technically going wrong somewhere in the code or in the explanation; I have based this answer purely on my personal understanding of what I read on the internet, and it works perfectly for me.

How to read numeric value from excel file using spring batch excel

I am reading values from a .xlsx file using spring batch excel and POI. I see that numeric values are printed in a different format than the original values in the .xlsx file.
Please suggest how I can print the values exactly as they are in the .xlsx file. Below are the details.
In my Excel file, the values are as follows.
The values are printed as below.
My code is as below:
public ItemReader<DataObject> fileItemReader(InputStream inputStream) {
    PoiItemReader<DataObject> reader = new PoiItemReader<>();
    reader.setLinesToSkip(1);
    reader.setResource(new InputStreamResource(inputStream)); // was: new InputStreamResource(DataObject)
    reader.setRowMapper(excelRowMapper());
    reader.open(new ExecutionContext());
    return reader;
}

private RowMapper<DataObject> excelRowMapper() {
    return new MyRowMapper();
}

public class MyRowMapper implements RowMapper<DataObject> {
    @Override
    public DataObject mapRow(RowSet rowSet) throws Exception { // was: DataRecord
        DataObject dataObj = new DataObject();
        dataObj.setFieldOne(rowSet.getColumnValue(0));
        dataObj.setFieldTwo(rowSet.getColumnValue(1));
        dataObj.setFieldThree(rowSet.getColumnValue(2));
        dataObj.setFieldFour(rowSet.getColumnValue(3));
        return dataObj;
    }
}
I had this same problem, and its root is the class org.springframework.batch.item.excel.poi.PoiSheet inside PoiItemReader.
The problem happens in the method public String[] getRow(final int rowNumber), which takes an org.apache.poi.ss.usermodel.Row object and converts it to an array of Strings after detecting the type of each column in the row. In this method, we have the code:
switch (cellType) {
    case NUMERIC:
        if (DateUtil.isCellDateFormatted(cell)) {
            Date date = cell.getDateCellValue();
            cells.add(String.valueOf(date.getTime()));
        } else {
            cells.add(String.valueOf(cell.getNumericCellValue()));
        }
        break;
    case BOOLEAN:
        cells.add(String.valueOf(cell.getBooleanCellValue()));
        break;
    case STRING:
    case BLANK:
        cells.add(cell.getStringCellValue());
        break;
    case ERROR:
        cells.add(FormulaError.forInt(cell.getErrorCellValue()).getString());
        break;
    default:
        throw new IllegalArgumentException("Cannot handle cells of type '" + cell.getCellTypeEnum() + "'");
}
Here, the treatment for a cell identified as NUMERIC is cells.add(String.valueOf(cell.getNumericCellValue())). In this line, the cell value is converted to a double (cell.getNumericCellValue()) and this double is converted to a String (String.valueOf()). The problem is in the String.valueOf() method, which will generate scientific notation if the number is too big (>= 10000000) or too small (< 0.001), and which will put a ".0" on integer values.
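To see this behavior in isolation, here is a small standalone example contrasting String.valueOf with the DecimalFormat approach shown further below:

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class NumericFormatDemo {
    public static void main(String[] args) {
        double big = 12345678d;  // >= 1e7: Double.toString uses scientific notation
        double small = 0.0005d;  // < 1e-3: scientific notation as well
        double integral = 25d;   // integral value: gets a trailing ".0"

        System.out.println(String.valueOf(big));      // 1.2345678E7
        System.out.println(String.valueOf(small));    // 5.0E-4
        System.out.println(String.valueOf(integral)); // 25.0

        DecimalFormat df = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ENGLISH));
        df.setMaximumFractionDigits(340); // enough digits for any double
        System.out.println(df.format(big));      // 12345678
        System.out.println(df.format(small));    // 0.0005
        System.out.println(df.format(integral)); // 25
    }
}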
As an alternative to the line cells.add(String.valueOf(cell.getNumericCellValue())), you could use
DataFormatter formatter = new DataFormatter();
cells.add(formatter.formatCellValue(cell));
which will return the exact values of the cells as a String. However, this also means that your decimal numbers will be locale dependent (you'll receive the string "2.5" from a document saved on an Excel configured for the UK or India, and the string "2,5" from France or Brazil).
To avoid this dependency, we can use the solution presented on https://stackoverflow.com/a/25307973/9184574:
DecimalFormat df = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ENGLISH));
df.setMaximumFractionDigits(340);
cells.add(df.format(cell.getNumericCellValue()));
That will convert the cell value to a double and then format it to the English pattern, without scientific notation and without adding ".0" to integers.
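As an aside, if you did want to keep DataFormatter, it can also be constructed with an explicit locale (POI provides a DataFormatter(Locale) constructor), which pins the decimal separator regardless of the default locale of the machine running the code:

// Sketch: a locale-pinned DataFormatter; formatting no longer depends
// on the default locale of the machine the code runs on.
DataFormatter formatter = new DataFormatter(Locale.ENGLISH);
cells.add(formatter.formatCellValue(cell));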
My implementation of CustomPoiSheet (a small adaptation of the original PoiSheet) was:
class CustomPoiSheet implements Sheet {
    protected final org.apache.poi.ss.usermodel.Sheet delegate;
    private final int numberOfRows;
    private final String name;
    private FormulaEvaluator evaluator;

    /**
     * Constructor which takes the delegate sheet.
     *
     * @param delegate the Apache POI sheet
     */
    CustomPoiSheet(final org.apache.poi.ss.usermodel.Sheet delegate) {
        super();
        this.delegate = delegate;
        this.numberOfRows = this.delegate.getLastRowNum() + 1;
        this.name = this.delegate.getSheetName();
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getNumberOfRows() {
        return this.numberOfRows;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getName() {
        return this.name;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String[] getRow(final int rowNumber) {
        final Row row = this.delegate.getRow(rowNumber);
        if (row == null) {
            return null;
        }
        final List<String> cells = new LinkedList<>();
        final int numberOfColumns = row.getLastCellNum();
        for (int i = 0; i < numberOfColumns; i++) {
            Cell cell = row.getCell(i);
            CellType cellType = cell.getCellType();
            if (cellType == CellType.FORMULA) {
                FormulaEvaluator evaluator = getFormulaEvaluator();
                if (evaluator == null) {
                    cells.add(cell.getCellFormula());
                } else {
                    cellType = evaluator.evaluateFormulaCell(cell);
                }
            }
            switch (cellType) {
                case NUMERIC:
                    if (DateUtil.isCellDateFormatted(cell)) {
                        Date date = cell.getDateCellValue();
                        cells.add(String.valueOf(date.getTime()));
                    } else {
                        // Format the numeric value as closely as possible to its original
                        // value and its displayed string, using the English format.
                        // Integer values come out without decimal places, doubles come out
                        // without trailing zeros, and scientific notation is suppressed.
                        // Thanks to https://stackoverflow.com/a/25307973/9184574
                        DecimalFormat df = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ENGLISH));
                        df.setMaximumFractionDigits(340);
                        cells.add(df.format(cell.getNumericCellValue()));
                        //DataFormatter formatter = new DataFormatter();
                        //cells.add(formatter.formatCellValue(cell));
                        //cells.add(String.valueOf(cell.getNumericCellValue()));
                    }
                    break;
                case BOOLEAN:
                    cells.add(String.valueOf(cell.getBooleanCellValue()));
                    break;
                case STRING:
                case BLANK:
                    cells.add(cell.getStringCellValue());
                    break;
                case ERROR:
                    cells.add(FormulaError.forInt(cell.getErrorCellValue()).getString());
                    break;
                default:
                    throw new IllegalArgumentException("Cannot handle cells of type '" + cell.getCellTypeEnum() + "'");
            }
        }
        return cells.toArray(new String[0]);
    }

    private FormulaEvaluator getFormulaEvaluator() {
        if (this.evaluator == null) {
            this.evaluator = delegate.getWorkbook().getCreationHelper().createFormulaEvaluator();
        }
        return this.evaluator;
    }
}
And my implementation of CustomPoiItemReader (a small adaptation of the original PoiItemReader), which uses CustomPoiSheet:
public class CustomPoiItemReader<T> extends AbstractExcelItemReader<T> {
    private Workbook workbook;

    public CustomPoiItemReader() {
        super();
    }

    @Override
    protected Sheet getSheet(final int sheet) {
        return new CustomPoiSheet(this.workbook.getSheetAt(sheet));
    }

    @Override
    protected int getNumberOfSheets() {
        return this.workbook.getNumberOfSheets();
    }

    @Override
    protected void doClose() throws Exception {
        super.doClose();
        if (this.workbook != null) {
            this.workbook.close();
        }
        this.workbook = null;
    }

    /**
     * Open the underlying file using the {@code WorkbookFactory}. We keep track of the used {@code InputStream} so that
     * it can be closed cleanly at the end of reading the file, to be able to release the resources used by Apache POI.
     *
     * @param inputStream the {@code InputStream} pointing to the Excel file.
     * @throws Exception is thrown for any errors.
     */
    @Override
    protected void openExcelFile(final InputStream inputStream) throws Exception {
        this.workbook = WorkbookFactory.create(inputStream);
        this.workbook.setMissingCellPolicy(Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
    }
}
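For completeness, wiring the adapted reader looks the same as with the original PoiItemReader; a minimal sketch reusing the question's excelRowMapper() (the setter and open() calls are the ones already used in the question, inherited from AbstractExcelItemReader):

public ItemReader<DataObject> fileItemReader(InputStream inputStream) {
    CustomPoiItemReader<DataObject> reader = new CustomPoiItemReader<>();
    reader.setLinesToSkip(1); // skip the header row
    reader.setResource(new InputStreamResource(inputStream));
    reader.setRowMapper(excelRowMapper());
    reader.open(new ExecutionContext());
    return reader;
}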
Just change your code like this while reading the data from Excel:
dataObj.setField(Float.valueOf(rowSet.getColumnValue(idx)).intValue());
This only works for columns A, B, C.

Java8 Method chaining for Single object without Stream/Optional?

I felt it easiest to capture my question with the example below. I would like to apply multiple transformations to an object (in this case they all return the same class, Number, but not necessarily). With an Optional (method 3) or a Stream (method 4), I can use .map elegantly and legibly. However, when working with a single object, I have to either make an Optional just to use the .map chaining (with a .get() at the end), or use Stream.of() with a findAny() at the end, which seems like unnecessary work.
[My Preference]: I prefer methods 3 & 4, as they seem better for readability than the pre-Java 8 options (methods 1 & 2).
[Question]: Is there a better/neater/more elegant way of achieving the same result than the methods used here? If not, which method would you use?
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Tester {

    static class Number {
        private final int value;

        private Number(final int value) {
            this.value = value;
        }

        public int getValue() {
            return value;
        }

        @Override
        public String toString() {
            return String.valueOf(value);
        }
    }

    private static Number add(final Number number, final int val) {
        return new Number(number.getValue() + val);
    }

    private static Number multiply(final Number number, final int val) {
        return new Number(number.getValue() * val);
    }

    private static Number subtract(final Number number, final int val) {
        return new Number(number.getValue() - val);
    }

    public static void main(final String[] args) {
        final Number input = new Number(1);
        System.out.println("output1 = " + method1(input)); // 100
        System.out.println("output2 = " + method2(input)); // 100
        System.out.println("output3 = " + method3(input)); // 100
        System.out.println("output4 = " + method4(input)); // 100
        processAList();
    }

    // Processing an object - Method 1
    private static Number method1(final Number input) {
        return subtract(multiply(add(input, 10), 10), 10);
    }

    // Processing an object - Method 2
    private static Number method2(final Number input) {
        final Number added = add(input, 10);
        final Number multiplied = multiply(added, 10);
        return subtract(multiplied, 10);
    }

    // Processing an object - Method 3 (Contrived use of Optional)
    private static Number method3(final Number input) {
        return Optional.of(input)
                .map(number -> add(number, 10))
                .map(number -> multiply(number, 10))
                .map(number -> subtract(number, 10)).get();
    }

    // Processing an object - Method 4 (Contrived use of Stream)
    private static Number method4(final Number input) {
        return Stream.of(input)
                .map(number -> add(number, 10))
                .map(number -> multiply(number, 10))
                .map(number -> subtract(number, 10))
                .findAny().get();
    }

    // Processing a list (naturally uses the Stream advantage)
    private static void processAList() {
        final List<Number> inputs = new ArrayList<>();
        inputs.add(new Number(1));
        inputs.add(new Number(2));
        final List<Number> outputs = inputs.stream()
                .map(number -> add(number, 10))
                .map(number -> multiply(number, 10))
                .map(number -> subtract(number, 10))
                .collect(Collectors.toList());
        System.out.println("outputs = " + outputs); // [100, 110]
    }
}
The solution is to build your methods into your Number class. For example:
static class Number {
    // instance variable, constructor and getter unchanged

    public Number add(final int val) {
        return new Number(getValue() + val);
    }

    // multiply() and subtract() in the same way

    // toString() unchanged
}
Now your code becomes very simple and readable:
private static Number method5(final Number input) {
    return input
            .add(10)
            .multiply(10)
            .subtract(10);
}
You may even write the return statement on one line if you prefer:
return input.add(10).multiply(10).subtract(10);
Edit: If you can't change the Number class, my personal taste would be method 2. Using Optional or Stream here would be a misuse, or at least misplaced, and could easily confuse your reader. If you insist, write your own Mandatory class: like Optional, except that it always holds a value, which makes it simpler. For my part, I wouldn't bother.
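For illustration, here is a minimal sketch of such a hypothetical Mandatory class (the name and API are assumptions, not an existing JDK type):

import java.util.Objects;
import java.util.function.Function;

// A value holder that, unlike Optional, always contains a non-null value,
// so chaining needs no isPresent()/orElse() ceremony.
public final class Mandatory<T> {
    private final T value;

    private Mandatory(T value) {
        this.value = Objects.requireNonNull(value);
    }

    public static <T> Mandatory<T> of(T value) {
        return new Mandatory<>(value);
    }

    public <R> Mandatory<R> map(Function<? super T, ? extends R> mapper) {
        return new Mandatory<>(mapper.apply(value));
    }

    public T get() {
        return value;
    }
}

With it, method3 could be written as Mandatory.of(input).map(n -> add(n, 10)).map(n -> multiply(n, 10)).map(n -> subtract(n, 10)).get(), without the empty-value ceremony of Optional.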

Generating all the elements of a power set

The power set is just the set of all subsets of a given set.
It includes all subsets (with the empty set).
It is well known that there are 2^N elements in this set, where N is the count of elements in the original set.
To build the power set, the following approach can be used:
Create a loop which iterates over all integers from 0 till 2^N-1
Take the binary representation of each integer
Each binary representation is a set of N bits (for smaller numbers, add leading zeros)
Each bit indicates whether the corresponding member of the original set is included in the current subset
import java.util.NoSuchElementException;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

public class PowerSet<E> implements Iterator<Set<E>>, Iterable<Set<E>> {
    private final E[] ary;
    private final int subsets;
    private int i;

    public PowerSet(Set<E> set) {
        ary = (E[]) set.toArray();
        subsets = (int) Math.pow(2, ary.length) - 1;
    }

    public Iterator<Set<E>> iterator() {
        return this;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("Cannot remove()!");
    }

    @Override
    public boolean hasNext() {
        return i++ < subsets;
    }

    @Override
    public Set<E> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Set<E> subset = new TreeSet<E>();
        BitSet bitSet = BitSet.valueOf(new long[] { i });
        if (bitSet.cardinality() == 0) {
            return subset;
        }
        for (int e = bitSet.nextSetBit(0); e != -1; e = bitSet.nextSetBit(e + 1)) {
            subset.add(ary[e]);
        }
        return subset;
    }

    // Unit Test
    public static void main(String[] args) {
        Set<Integer> numbers = new TreeSet<Integer>();
        for (int i = 1; i < 4; i++) {
            numbers.add(i);
        }
        PowerSet<Integer> pSet = new PowerSet<Integer>(numbers);
        for (Set<Integer> subset : pSet) {
            System.out.println(subset);
        }
    }
}
The output I am getting is:
[2]
[3]
[2, 3]
java.util.NoSuchElementException
at PowerSet.next(PowerSet.java:47)
at PowerSet.next(PowerSet.java:20)
at PowerSet.main(PowerSet.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at edu.rice.cs.drjava.model.compiler.JavacCompiler.runCommand(JavacCompiler.java:272)
So, the problems are:
I am not getting all the elements (debugging shows me that next is called only for even values of i).
The exception should not have been thrown.
The problem is in your hasNext. You have i++ < subsets there. Since hasNext is called once from next() and once more during the iteration for (Set<Integer> subset : pSet), you increment i by 2 each time. You can see this since
for (Set<Integer> subset : pSet) {
}
is actually equivalent to:
Iterator<Set<Integer>> it = pSet.iterator();
while (it.hasNext()) {
    Set<Integer> subset = it.next();
}
Also note that
if (bitSet.cardinality() == 0) {
return subset;
}
is redundant: when no bits are set, the for loop simply never executes and the empty subset is returned anyway. Try instead:
@Override
public boolean hasNext() {
    return i <= subsets;
}

@Override
public Set<E> next() {
    if (!hasNext()) {
        throw new NoSuchElementException();
    }
    Set<E> subset = new TreeSet<E>();
    BitSet bitSet = BitSet.valueOf(new long[] { i });
    for (int e = bitSet.nextSetBit(0); e != -1; e = bitSet.nextSetBit(e + 1)) {
        subset.add(ary[e]);
    }
    i++;
    return subset;
}
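With these two changes, i runs from 0 through 7 and the test in main prints all 2^3 = 8 subsets, including the empty one:
[]
[1]
[2]
[1, 2]
[3]
[1, 3]
[2, 3]
[1, 2, 3]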

Lossless hierarchical run length encoding

I want to summarize rather than compress, in a manner similar to run length encoding but in a nested sense.
For instance, I want ABCBCABCBCDEEF to become (2A(2BC))D(2E)F.
I am not concerned that an option is picked between two identical possible nestings, e.g. ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which are of the same compressed length despite having different structures.
However, I do want the choice to be the MOST greedy. For instance, ABCDABCDCDCDCD should pick (2ABCD)(3CD), which is six original symbols long and therefore shorter than ABCDAB(4CD), which is eight original symbols long.
In terms of background, I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying: symbol A for 3 occurrences, followed by symbols XYZ for 20 occurrences, and so on, and this can be displayed in a nested sense visually.
Ideas are welcome.
I'm pretty sure this isn't the best approach, and depending on the length of the patterns it might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB; you would have to judge for yourself whether single-symbol runs like that should be encoded as a repeated symbol or not, or whether a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
For each position in the sequence, try to find the longest match (actually, it doesn't: it takes the first match that repeats at least twice; I left the rest as an exercise for you, since I have to leave my computer for a few hours now)
It then tries to encode that sequence, the one that repeats, recursively, and spits out a X*seq type of object
If it can't find a repeating sequence, it spits out the single symbol at that location
It then skips what it encoded, and continues from #1
Anyway, here's the code:
void Main()
{
    string[] examples = new[]
    {
        "ABCBCABCBCDEEF",
        "ABBABBABBABA",
        "ABCDABCDCDCDCD",
    };
    foreach (string example in examples)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var r in Encode(example))
            sb.Append(r.ToString());
        Debug.WriteLine(example + " = " + sb.ToString());
    }
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
    return Encode<T>(values, EqualityComparer<T>.Default);
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
    List<T> sequence = new List<T>(values);
    int index = 0;
    while (index < sequence.Count)
    {
        var bestSequence = FindBestSequence<T>(sequence, index, comparer);
        if (bestSequence == null || bestSequence.Length < 1)
            throw new InvalidOperationException("Unable to find sequence at position " + index);
        yield return bestSequence;
        index += bestSequence.Length;
    }
}

private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
    int sequenceLength = 1;
    while (startIndex + sequenceLength * 2 <= sequence.Count)
    {
        if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
        {
            bool atLeast2Repeats = true;
            for (int index = 0; index < sequenceLength; index++)
            {
                if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
                {
                    atLeast2Repeats = false;
                    break;
                }
            }
            if (atLeast2Repeats)
            {
                int count = 2;
                while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
                {
                    bool anotherRepeat = true;
                    for (int index = 0; index < sequenceLength; index++)
                    {
                        if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
                        {
                            anotherRepeat = false;
                            break;
                        }
                    }
                    if (anotherRepeat)
                        count++;
                    else
                        break;
                }
                List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
                var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
                return new SequenceRepeat<T>(count, repeatedSequence);
            }
        }
        sequenceLength++;
    }
    // fall back, we could not find anything that repeated at all
    return new SingleSymbol<T>(sequence[startIndex]);
}

public abstract class Repeat<T>
{
    public int Count { get; private set; }

    protected Repeat(int count)
    {
        Count = count;
    }

    public abstract int Length
    {
        get;
    }
}

public class SingleSymbol<T> : Repeat<T>
{
    public T Value { get; private set; }

    public SingleSymbol(T value)
        : base(1)
    {
        Value = value;
    }

    public override string ToString()
    {
        return string.Format("{0}", Value);
    }

    public override int Length
    {
        get
        {
            return Count;
        }
    }
}

public class SequenceRepeat<T> : Repeat<T>
{
    public Repeat<T>[] Values { get; private set; }

    public SequenceRepeat(int count, Repeat<T>[] values)
        : base(count)
    {
        Values = values;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
    }

    public override int Length
    {
        get
        {
            int oneLength = 0;
            foreach (var value in Values)
                oneLength += value.Length;
            return Count * oneLength;
        }
    }
}

public class GroupRepeat<T> : Repeat<T>
{
    public Repeat<T> Group { get; private set; }

    public GroupRepeat(int count, Repeat<T> group)
        : base(count)
    {
        Group = group;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, Group);
    }

    public override int Length
    {
        get
        {
            return Count * Group.Length;
        }
    }
}
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context-free grammar which generates (only) the string, except that in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABABCD
s->tt
t->ABu
u->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The problem of the smallest grammar is known to be hard, and is a well-studied problem. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.

Hadoop seems to modify my key object during an iteration over values of a given reduce call

Hadoop Version: 0.20.2 (On Amazon EMR)
Problem: I have a custom key that I write during the map phase, which I have added below. During the reduce call, I do some simple aggregation on the values for a given key. The issue I am facing is that during the iteration of values in the reduce call, my key gets changed and I get the values of that new key.
My key type:
class MyKey implements WritableComparable<MyKey>, Serializable {
    private MyEnum type; // MyEnum is a simple enumeration.
    private TreeMap<String, String> subKeys = new TreeMap<>(); // initialized so readFields can clear() it

    MyKey() {} // for hadoop

    public MyKey(MyEnum t, Map<String, String> sK) {
        type = t;
        subKeys = new TreeMap<>(sK);
    }

    public void readFields(DataInput in) throws IOException {
        Text typeT = new Text();
        typeT.readFields(in);
        this.type = MyEnum.valueOf(typeT.toString());
        subKeys.clear();
        int i = WritableUtils.readVInt(in);
        while (0 != i--) {
            Text keyText = new Text();
            keyText.readFields(in);
            Text valueText = new Text();
            valueText.readFields(in);
            subKeys.put(keyText.toString(), valueText.toString());
        }
    }

    public void write(DataOutput out) throws IOException {
        new Text(type.name()).write(out);
        WritableUtils.writeVInt(out, subKeys.size());
        for (Entry<String, String> each : subKeys.entrySet()) {
            new Text(each.getKey()).write(out);
            new Text(each.getValue()).write(out);
        }
    }

    public int compareTo(MyKey o) {
        if (o == null) {
            return 1;
        }
        int typeComparison = this.type.compareTo(o.type);
        if (typeComparison == 0) {
            if (this.subKeys.equals(o.subKeys)) {
                return 0;
            }
            int x = this.subKeys.hashCode() - o.subKeys.hashCode();
            return (x != 0 ? x : -1);
        }
        return typeComparison;
    }
}
Is there anything wrong with this implementation of the key? Following is the code where I am facing the mix-up of keys in the reduce call:
protected void reduce(MyKey k, Iterable<MyValue> values, Context context) {
    Iterator<MyValue> iterator = values.iterator();
    int sum = 0;
    while (iterator.hasNext()) {
        MyValue value = iterator.next();
        // When I come here in the 2nd iteration, if I print k,
        // it is different from what it was in iteration 1.
        sum += value.getResult();
    }
    // write sum to context
}
Any help in this would be greatly appreciated.
This is expected behavior (with the new API, at least).
When the next method of the underlying iterator of the values Iterable is called, the next key/value pair is read from the sorted mapper/combiner output and checked to see whether the key is still part of the same group as the previous key.
Because Hadoop re-uses the objects passed to the reduce method (by just calling the readFields method of the same object), the underlying contents of the key parameter 'k' will change with each iteration of the values Iterable.
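A practical consequence: if you need to keep the key (or a value) around beyond the current iteration, take a defensive copy first. A minimal sketch using WritableUtils.clone, which serializes and deserializes the Writable to produce an independent copy:

// Inside reduce(): snapshot the re-used key object before advancing the iterator.
MyKey snapshot = WritableUtils.clone(k, context.getConfiguration());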
