Lossless hierarchical run length encoding - algorithm

I want to summarize, rather than compress, a sequence in a manner similar to run-length encoding, but nested.
For instance, I want ABCBCABCBCDEEF to become (2A(2BC))D(2E)F.
I don't mind which option is picked between two equally good nestings. E.g.
ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which have the same compressed length despite having different structures.
However, I do want the choice to be the MOST greedy. For instance:
ABCDABCDCDCDCD would pick (2ABCD)(3CD), of length 6 in original symbols, over ABCDAB(4CD), which is of length 8 in original symbols.
As background: I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying: symbol A for 3 occurrences, followed by symbols XYZ for 20 occurrences, etc., and this can be displayed visually in a nested sense.
Ideas welcome.

I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB. You will have to decide for yourself whether single-symbol runs like that should be encoded as a repeated symbol, or whether a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
1. For each position in the sequence, try to find the longest match (actually, it doesn't; it takes the first 2+ match it finds. I left the rest as an exercise for you since I have to leave my computer for a few hours now).
2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object.
3. If it can't find a repeating sequence, it spits out the single symbol at that location.
4. It then skips what it encoded, and continues from #1.
Anyway, here's the code:
void Main()
{
string[] examples = new[]
{
"ABCBCABCBCDEEF",
"ABBABBABBABA",
"ABCDABCDCDCDCD",
};
foreach (string example in examples)
{
StringBuilder sb = new StringBuilder();
foreach (var r in Encode(example))
sb.Append(r.ToString());
Debug.WriteLine(example + " = " + sb.ToString());
}
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
return Encode<T>(values, EqualityComparer<T>.Default);
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
List<T> sequence = new List<T>(values);
int index = 0;
while (index < sequence.Count)
{
var bestSequence = FindBestSequence<T>(sequence, index, comparer);
if (bestSequence == null || bestSequence.Length < 1)
throw new InvalidOperationException("Unable to find sequence at position " + index);
yield return bestSequence;
index += bestSequence.Length;
}
}
private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
int sequenceLength = 1;
while (startIndex + sequenceLength * 2 <= sequence.Count)
{
if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
{
bool atLeast2Repeats = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
{
atLeast2Repeats = false;
break;
}
}
if (atLeast2Repeats)
{
int count = 2;
while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
{
bool anotherRepeat = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
{
anotherRepeat = false;
break;
}
}
if (anotherRepeat)
count++;
else
break;
}
List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
return new SequenceRepeat<T>(count, repeatedSequence);
}
}
sequenceLength++;
}
// fall back, we could not find anything that repeated at all
return new SingleSymbol<T>(sequence[startIndex]);
}
public abstract class Repeat<T>
{
public int Count { get; private set; }
protected Repeat(int count)
{
Count = count;
}
public abstract int Length
{
get;
}
}
public class SingleSymbol<T> : Repeat<T>
{
public T Value { get; private set; }
public SingleSymbol(T value)
: base(1)
{
Value = value;
}
public override string ToString()
{
return string.Format("{0}", Value);
}
public override int Length
{
get
{
return Count;
}
}
}
public class SequenceRepeat<T> : Repeat<T>
{
public Repeat<T>[] Values { get; private set; }
public SequenceRepeat(int count, Repeat<T>[] values)
: base(count)
{
Values = values;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
}
public override int Length
{
get
{
int oneLength = 0;
foreach (var value in Values)
oneLength += value.Length;
return Count * oneLength;
}
}
}
public class GroupRepeat<T> : Repeat<T>
{
public Repeat<T> Group { get; private set; }
public GroupRepeat(int count, Repeat<T> group)
: base(count)
{
Group = group;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, Group);
}
public override int Length
{
get
{
return Count * Group.Length;
}
}
}
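For comparison, here is a minimal Python sketch of the same four steps. It is a sketch under assumptions rather than a port: the names encode, first_repeat and render are made up for this illustration, and like the C# code it takes the first repeating unit it finds instead of searching for the best one.

```python
def first_repeat(seq, start):
    """Return (token, consumed) for the first repeating unit found at start.

    A token is either a single symbol or a (count, [tokens]) tuple.
    """
    n = len(seq)
    length = 1
    while start + 2 * length <= n:
        unit = seq[start:start + length]
        count = 1
        # Count how many times the unit repeats back to back.
        while seq[start + count * length:start + (count + 1) * length] == unit:
            count += 1
        if count >= 2:
            # Encode the repeated unit itself recursively.
            return (count, encode(unit)), count * length
        length += 1
    # Fall back: nothing repeats here, emit the single symbol.
    return seq[start], 1


def encode(seq):
    """Encode a string (or list) into a list of nested tokens."""
    tokens = []
    index = 0
    while index < len(seq):
        token, consumed = first_repeat(seq, index)
        tokens.append(token)
        index += consumed  # skip what was just encoded
    return tokens


def render(tokens):
    """Format tokens in the (count + body) notation used in the question."""
    parts = []
    for token in tokens:
        if isinstance(token, tuple):
            count, inner = token
            parts.append("(%d%s)" % (count, render(inner)))
        else:
            parts.append(str(token))
    return "".join(parts)
```

Calling render(encode("ABCDABCDCDCDCD")) should give (2ABCD)(3CD), matching the output listed above. Making the search prefer the longest or cheapest match would go inside first_repeat.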

Looking at the problem theoretically, it seems similar to the problem of finding the smallest context free grammar which generates (only) the string, except in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABCD
s->ABtt
t->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The smallest grammar problem is known to be NP-hard, and it is well studied. I don't know how much the "direct sequence" restriction adds to or subtracts from the complexity.
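To make the "count terminals" measure concrete, here is a small Python sketch that counts the original symbols in a nested run-length string. It assumes every symbol is a single character that is neither a digit nor a parenthesis; symbol_length is a name made up for this example.

```python
def symbol_length(encoded):
    """Count original symbols in a string such as "(2ABCD)(3CD)".

    Assumption: symbols are single characters other than digits and
    parentheses, so repeat counts and brackets are simply skipped.
    """
    return sum(1 for ch in encoded if ch not in "()" and not ch.isdigit())


# The two candidate encodings from the question.
greedy = symbol_length("(2ABCD)(3CD)")
alternative = symbol_length("ABCDAB(4CD)")
```

Here greedy is 6 and alternative is 8, the same numbers the question uses. For (2A(2BC))D(2E)F the count is 6, which equals the number of terminals on the right-hand sides of the first grammar above (D and F in s, A in t, B and C in v, E in u).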

Related

Dart - Best sort Algorithm for pushing a list of objects to a list of First Class Lists

I have a list of objects that are retrieved from a DB. The object looks like this:
class MonthlyFinancePlan {
final int id;
final DateTime date;
final double incomeAfterTax;
final double totalToPayOut;
final double totalRemainingForMonth;
final Map<String, dynamic> items;
MonthlyFinancePlan({ this.id, this.date, this.incomeAfterTax, this.totalToPayOut, this.totalRemainingForMonth, this.items });
MonthlyFinancePlan.fromEntity(MonthlyFinancePlanEntity monthlyFinancePlanEntity):
this.id = monthlyFinancePlanEntity.id,
this.date = DateTime.parse(monthlyFinancePlanEntity.date),
this.incomeAfterTax = monthlyFinancePlanEntity.incomeAfterTax.toDouble(),
this.totalToPayOut = monthlyFinancePlanEntity.totalToPayOut.toDouble(),
this.totalRemainingForMonth = monthlyFinancePlanEntity.moneyRemainingForMonth.toDouble(),
this.items = monthlyFinancePlanEntity.items != null ? json.decode(monthlyFinancePlanEntity.items) : Map();
}
I need to sort these by date.year and then pass them into a first-class list. I'd like to create a List of these first-class lists, so that all the MonthlyFinancePlan objects from the year 2020 are sorted and contained within one first-class list, the same for 2021, etc.
The first class list looks like this:
class YearlyFinancePlan {
List<MonthlyFinancePlan> _monthlyFinancePlanList;
int _year;
double _totalIncomeForYear;
double _totalOutGoingsForYear;
List<MonthlyFinancePlan> get items {
return this._monthlyFinancePlanList;
}
int get year {
return this._year;
}
double get totalIncomeForYear {
return this._totalIncomeForYear;
}
double get totalOutgoingsForYear {
return this._totalOutGoingsForYear;
}
YearlyFinancePlan(this._monthlyFinancePlanList) {
this._year = this._monthlyFinancePlanList.first.date.year;
this._totalIncomeForYear = this._setTotalIncomeFromList(this._monthlyFinancePlanList);
this._totalOutGoingsForYear = this._setTotalOutGoingsForYear(this._monthlyFinancePlanList);
}
double _setTotalIncomeFromList(List<MonthlyFinancePlan> monthlyFinancePlanList) {
double totalIncome = 0; // must be initialised, otherwise += operates on null
monthlyFinancePlanList.forEach((plan) => totalIncome += plan.incomeAfterTax);
return totalIncome;
}
double _setTotalOutGoingsForYear(List<MonthlyFinancePlan> monthlyFinancePlanList) {
double totalOutgoings = 0; // must be initialised, otherwise += operates on null
monthlyFinancePlanList.forEach((plan) => totalOutgoings += plan.totalToPayOut);
return totalOutgoings;
}
}
My question is, what sort algorithm would be best suited for what I need? I don't have any code to show as I don't know what sort algorithm to use. I'm not looking for anyone to write my code, but more to guide me through it.
Any help would be greatly appreciated
I've created a mapper that checks whether MonthlyFinancePlan.date.year exists as a key in a standard Dart Map and adds it if it doesn't. Once the check is complete, it calls the addMonthlyPlan method to add the MonthlyFinancePlan to the correct YearlyFinancePlan, like so:
class FinancePlanMapper {
static Map<int, YearlyFinancePlan> toMap(List<MonthlyFinancePlan> planList) {
Map<int, YearlyFinancePlan> planMap = Map();
planList.forEach((monthlyPlan) {
planMap.putIfAbsent(monthlyPlan.date.year, () => YearlyFinancePlan(List()));
planMap[monthlyPlan.date.year].addMonthlyPlan(monthlyPlan);
});
return planMap;
}
}
I'm not too sure whether it's the most efficient way of sorting, but I plan to refactor it as much as possible. I've also updated the YearlyFinancePlan object so that it no longer initialises any fields on construction, since doing so would throw an error when the object is constructed with an empty list:
class YearlyFinancePlan {
List<MonthlyFinancePlan> _monthlyFinancePlanList;
List<MonthlyFinancePlan> get items {
return this._monthlyFinancePlanList;
}
int get year {
return this.items.first.date.year;
}
double get totalIncomeForYear {
return this._setTotalIncomeFromList(this._monthlyFinancePlanList);
}
double get totalOutgoingsForYear {
return this._setTotalOutGoingsForYear(this._monthlyFinancePlanList);
}
YearlyFinancePlan(this._monthlyFinancePlanList);
void addMonthlyPlan(MonthlyFinancePlan plan) {
this._monthlyFinancePlanList.add(plan);
}
double _setTotalIncomeFromList(List<MonthlyFinancePlan> monthlyFinancePlanList) {
double totalIncome = 0;
monthlyFinancePlanList.forEach((plan) => totalIncome += plan.incomeAfterTax);
return totalIncome;
}
double _setTotalOutGoingsForYear(List<MonthlyFinancePlan> monthlyFinancePlanList) {
double totalOutgoings = 0;
monthlyFinancePlanList.forEach((plan) => totalOutgoings += plan.totalToPayOut);
return totalOutgoings;
}
}
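The grouping itself does not need a special sort algorithm: one pass into a map keyed by year, followed by an ordinary sort of each bucket, is enough. The Python sketch below illustrates that idea only. MonthlyPlan, group_by_year and yearly_totals are hypothetical stand-ins for the Dart classes above, reduced to the fields the grouping needs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class MonthlyPlan:
    # Hypothetical, trimmed-down stand-in for MonthlyFinancePlan.
    day: date
    income_after_tax: float
    total_to_pay_out: float


def group_by_year(plans):
    """Bucket plans by year, then order years and the plans inside each year."""
    buckets = defaultdict(list)
    for plan in plans:
        buckets[plan.day.year].append(plan)
    return {
        year: sorted(buckets[year], key=lambda p: p.day)
        for year in sorted(buckets)
    }


def yearly_totals(plans):
    """Return (total income, total outgoings) for one year's plans."""
    income = sum(p.income_after_tax for p in plans)
    outgoings = sum(p.total_to_pay_out for p in plans)
    return income, outgoings
```

Bucketing is linear in the number of plans, and the per-year sort only has to order at most twelve entries if there is one plan per month.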

How to get the text position from the pdf page in iText 7

I am trying to find the position of a piece of text on a PDF page.
What I have tried so far is to get the text of the PDF page with PdfTextExtractor, using the simple text extraction strategy. I loop over the words to check whether my word exists, splitting the text using:
var Words = pdftextextractor.Split(new char[] { ' ', '\n' });
What I wasn't able to do is find the location of the text. All I need is the y coordinate of the word in the PDF file.
I was able to manipulate it with my previous version for iText 5. I don't know if you are looking for C#, but that is what the code below is written in.
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class TextLocationStrategy : LocationTextExtractionStrategy
{
private List<textChunk> objectResult = new List<textChunk>();
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
TextRenderInfo renderInfo = (TextRenderInfo)data;
string curFont = renderInfo.GetFont().GetFontProgram().ToString();
float curFontSize = renderInfo.GetFontSize();
IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
foreach (TextRenderInfo t in text)
{
string letter = t.GetText();
Vector letterStart = t.GetBaseline().GetStartPoint();
Vector letterEnd = t.GetAscentLine().GetEndPoint();
Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));
if (letter != " " && !letter.Contains(' '))
{
textChunk chunk = new textChunk();
chunk.text = letter;
chunk.rect = letterRect;
chunk.fontFamily = curFont;
chunk.fontSize = curFontSize;
chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;
objectResult.Add(chunk);
}
}
}
}
public class textChunk
{
public string text { get; set; }
public Rectangle rect { get; set; }
public string fontFamily { get; set; }
public float fontSize { get; set; } // float, to match TextRenderInfo.GetFontSize()
public float spaceWidth { get; set; }
}
I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.
You can implement this by adding a few lines to grab the data from your pdf.
PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextLocationStrategy());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));
Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.
@Joris' answer explains how to implement a completely new extraction strategy / event listener for the task. Alternatively, one can try and tweak an existing text extraction strategy to do what you require.
This answer demonstrates how to tweak the existing LocationTextExtractionStrategy to return both the text and its characters' respective y coordinates.
Beware, this is but a proof-of-concept which in particular assumes text to be written horizontally, i.e. using an effective transformation matrix (ctm and text matrix combined) with b and c equal to 0.
Furthermore the character and coordinate retrieval methods of TextPlusY are not at all optimized and might take long to execute.
As the OP did not express a language preference, here is a solution for iText 7 for Java:
TextPlusY
For the task at hand one needs to be able to retrieve characters and y coordinates side by side. To make this easier I use a class representing both the text and its characters' respective y coordinates. It implements CharSequence, a generalization of String, which allows it to be used in many String-related functions:
public class TextPlusY implements CharSequence
{
final List<String> texts = new ArrayList<>();
final List<Float> yCoords = new ArrayList<>();
//
// CharSequence implementation
//
@Override
public int length()
{
int length = 0;
for (String text : texts)
{
length += text.length();
}
return length;
}
@Override
public char charAt(int index)
{
for (String text : texts)
{
if (index < text.length())
{
return text.charAt(index);
}
index -= text.length();
}
throw new IndexOutOfBoundsException();
}
@Override
public CharSequence subSequence(int start, int end)
{
TextPlusY result = new TextPlusY();
int length = end - start;
for (int i = 0; i < yCoords.size(); i++)
{
String text = texts.get(i);
if (start < text.length())
{
float yCoord = yCoords.get(i);
if (start > 0)
{
text = text.substring(start);
start = 0;
}
if (length > text.length())
{
result.add(text, yCoord);
}
else
{
result.add(text.substring(0, length), yCoord);
break;
}
}
else
{
start -= text.length();
}
}
return result;
}
//
// Object overrides
//
@Override
public String toString()
{
StringBuilder builder = new StringBuilder();
for (String text : texts)
{
builder.append(text);
}
return builder.toString();
}
//
// y coordinate support
//
public TextPlusY add(String text, float y)
{
if (text != null)
{
texts.add(text);
yCoords.add(y);
}
return this;
}
public float yCoordAt(int index)
{
for (int i = 0; i < yCoords.size(); i++)
{
String text = texts.get(i);
if (index < text.length())
{
return yCoords.get(i);
}
index -= text.length();
}
throw new IndexOutOfBoundsException();
}
}
(TextPlusY.java)
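The lookups in TextPlusY walk the fragment list on every call, which is what makes them slow. One way to speed them up, sketched here in Python rather than Java, is to keep the cumulative end offset of each fragment and find the right fragment by binary search. TextWithY and y_at are hypothetical names for this illustration and are not part of iText.

```python
from bisect import bisect_right


class TextWithY:
    """Text fragments plus one y coordinate per fragment."""

    def __init__(self):
        self.texts = []
        self.ys = []
        self.ends = []  # ends[i] = total length of fragments 0..i

    def add(self, text, y):
        if text:  # empty fragments carry no characters, so skip them
            previous_end = self.ends[-1] if self.ends else 0
            self.texts.append(text)
            self.ys.append(y)
            self.ends.append(previous_end + len(text))
        return self

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __str__(self):
        return "".join(self.texts)

    def y_at(self, index):
        """Return the y coordinate of the character at index."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First fragment whose end offset lies beyond index.
        return self.ys[bisect_right(self.ends, index)]
```

With this layout a single y_at call costs O(log n) in the number of fragments instead of O(n), at the price of one extra list.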
TextPlusYExtractionStrategy
Now we extend the LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need for that is to generalize the method getResultantText.
Unfortunately the LocationTextExtractionStrategy has hidden some methods and members (private or package protected) which need to be accessed here; thus, some reflection magic is required. If your framework does not allow this, you'll have to copy the whole strategy and manipulate it accordingly.
public class TextPlusYExtractionStrategy extends LocationTextExtractionStrategy
{
static Field locationalResultField;
static Method sortWithMarksMethod;
static Method startsWithSpaceMethod;
static Method endsWithSpaceMethod;
static Method textChunkSameLineMethod;
static
{
try
{
locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
locationalResultField.setAccessible(true);
sortWithMarksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("sortWithMarks", List.class);
sortWithMarksMethod.setAccessible(true);
startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class);
startsWithSpaceMethod.setAccessible(true);
endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
endsWithSpaceMethod.setAccessible(true);
textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
textChunkSameLineMethod.setAccessible(true);
}
catch(NoSuchFieldException | NoSuchMethodException | SecurityException e)
{
// Reflection failed
}
}
//
// constructors
//
public TextPlusYExtractionStrategy()
{
super();
}
public TextPlusYExtractionStrategy(ITextChunkLocationStrategy strat)
{
super(strat);
}
@Override
public String getResultantText()
{
return getResultantTextPlusY().toString();
}
public TextPlusY getResultantTextPlusY()
{
try
{
List<TextChunk> textChunks = new ArrayList<>((List<TextChunk>)locationalResultField.get(this));
sortWithMarksMethod.invoke(this, textChunks);
TextPlusY textPlusY = new TextPlusY();
TextChunk lastChunk = null;
for (TextChunk chunk : textChunks)
{
float chunkY = chunk.getLocation().getStartLocation().get(Vector.I2);
if (lastChunk == null)
{
textPlusY.add(chunk.getText(), chunkY);
}
else if ((Boolean)textChunkSameLineMethod.invoke(chunk, lastChunk))
{
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (isChunkAtWordBoundary(chunk, lastChunk) &&
!(Boolean)startsWithSpaceMethod.invoke(this, chunk.getText()) &&
!(Boolean)endsWithSpaceMethod.invoke(this, lastChunk.getText()))
{
textPlusY.add(" ", chunkY);
}
textPlusY.add(chunk.getText(), chunkY);
}
else
{
textPlusY.add("\n", lastChunk.getLocation().getStartLocation().get(Vector.I2));
textPlusY.add(chunk.getText(), chunkY);
}
lastChunk = chunk;
}
return textPlusY;
}
catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e)
{
throw new RuntimeException("Reflection failed", e);
}
}
}
(TextPlusYExtractionStrategy.java)
Usage
Using these two classes you can extract text with coordinates and search therein like this:
try ( PdfReader reader = new PdfReader(YOUR_PDF);
PdfDocument document = new PdfDocument(reader) )
{
TextPlusYExtractionStrategy extractionStrategy = new TextPlusYExtractionStrategy();
PdfPage page = document.getFirstPage();
PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
parser.processPageContent(page);
TextPlusY textPlusY = extractionStrategy.getResultantTextPlusY();
System.out.printf("\nText from test.pdf\n=====\n%s\n=====\n", textPlusY);
System.out.print("\nText with y from test.pdf\n=====\n");
int length = textPlusY.length();
float lastY = Float.MIN_NORMAL;
for (int i = 0; i < length; i++)
{
float y = textPlusY.yCoordAt(i);
if (y != lastY)
{
System.out.printf("\n(%4.1f) ", y);
lastY = y;
}
System.out.print(textPlusY.charAt(i));
}
System.out.print("\n=====\n");
System.out.print("\nMatches of 'est' with y from test.pdf\n=====\n");
Matcher matcher = Pattern.compile("est").matcher(textPlusY);
while (matcher.find())
{
System.out.printf("from character %s to %s at y position (%4.1f)\n", matcher.start(), matcher.end(), textPlusY.yCoordAt(matcher.start()));
}
System.out.print("\n=====\n");
}
(ExtractTextPlusY test method testExtractTextPlusYFromTest)
For my test document, the output of the test code above is:
Text from test.pdf
=====
Ein Dokumen t mit einigen
T estdaten
T esttest T est test test
=====
Text with y from test.pdf
=====
(691,8) Ein Dokumen t mit einigen
(666,9) T estdaten
(642,0) T esttest T est test test
=====
Matches of 'est' with y from test.pdf
=====
from character 28 to 31 at y position (666,9)
from character 39 to 42 at y position (642,0)
from character 43 to 46 at y position (642,0)
from character 49 to 52 at y position (642,0)
from character 54 to 57 at y position (642,0)
from character 59 to 62 at y position (642,0)
=====
My locale uses the comma as decimal separator, you might see 666.9 instead of 666,9.
The extra spaces you see can be removed by fine-tuning the base LocationTextExtractionStrategy functionality further. But that is the focus of other questions...
First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).
Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.
Possible implementation:
1. Implement IEventListener.
2. Get notified of all events that render text, and store the corresponding TextRenderInfo object.
3. Once you're finished with the document, sort these objects based on their position on the page.
4. Loop over this list of TextRenderInfo objects; they offer both the text being rendered and the coordinates.
How to:
1. Implement ITextExtractionStrategy (or extend an existing implementation).
2. Use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1.
3. Your strategy should be set up to keep track of locations for the text it processed.
ITextExtractionStrategy has the following method in its interface:
@Override
public void eventOccurred(IEventData data, EventType type) {
// you can first check the type of the event
if (!type.equals(EventType.RENDER_TEXT))
return;
// now it is safe to cast
TextRenderInfo renderInfo = (TextRenderInfo) data;
}
Important to keep in mind is that rendering instructions in a pdf do not need to appear in order.
The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:
render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"
You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).
Once that's done, it should be easy.
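The sorting and merging step can be sketched independently of iText. The Python example below assumes each rendered piece of text has already been reduced to a hypothetical (text, x, y) tuple, groups pieces whose y coordinates lie within a tolerance into one line, and orders each line from left to right. merge_chunks and line_tolerance are names invented for this sketch.

```python
def merge_chunks(chunks, line_tolerance=2.0):
    """Merge (text, x, y) tuples into lines in reading order.

    Assumptions: horizontal text, y grows upward as in PDF user space,
    and chunks on one line differ in y by at most line_tolerance.
    """
    lines = []  # each entry: [y of the line's first chunk, [(x, text), ...]]
    # Walk the chunks from the top of the page to the bottom.
    for text, x, y in sorted(chunks, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the pieces from left to right and join them.
    return [(y, "".join(t for _, t in sorted(parts))) for y, parts in lines]


# The out-of-order rendering from the example above, with made-up coordinates.
rendered = [
    ("Ipsum Do", 40.0, 700.0),
    ("Lorem ", 10.0, 700.0),
    ("lor Sit Amet", 80.0, 700.0),
]
merged = merge_chunks(rendered)
```

Here merged holds a single line at y = 700.0 with the text "Lorem Ipsum Dolor Sit Amet". A real strategy would also have to decide when to insert a space between neighbouring chunks, which this sketch leaves out.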
For anyone looking for a simple Rectangle object this worked for me. I made these two classes, and call the static method "GetTextCoordinates" with your page and desired text.
public class PdfTextLocator : LocationTextExtractionStrategy
{
public string TextToSearchFor { get; set; }
public List<TextChunk> ResultCoordinates { get; set; }
/// <summary>
/// Returns a rectangle with a given location of text on a page. Returns null if not found.
/// </summary>
/// <param name="page">Page to Search</param>
/// <param name="s">String to be found</param>
/// <returns></returns>
public static Rectangle GetTextCoordinates(PdfPage page, string s)
{
PdfTextLocator strat = new PdfTextLocator(s);
PdfTextExtractor.GetTextFromPage(page, strat);
foreach (TextChunk c in strat.ResultCoordinates)
{
if (c.Text == s)
return c.ResultCoordinates;
}
return null;
}
public PdfTextLocator(string textToSearchFor)
{
this.TextToSearchFor = textToSearchFor;
ResultCoordinates = new List<TextChunk>();
}
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
TextRenderInfo renderInfo = (TextRenderInfo)data;
IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
for (int i = 0; i < text.Count; i++)
{
if (text[i].GetText() == TextToSearchFor[0].ToString())
{
string word = "";
for (int j = i; j < i + TextToSearchFor.Length && j < text.Count; j++)
{
word = word + text[j].GetText();
}
float startX = text[i].GetBaseline().GetStartPoint().Get(0);
float startY = text[i].GetBaseline().GetStartPoint().Get(1);
// Rectangle takes (x, y, width, height); the height comes from the y component (index 1).
ResultCoordinates.Add(new TextChunk(word, new Rectangle(startX, startY, text[i].GetAscentLine().GetEndPoint().Get(0) - startX, text[i].GetAscentLine().GetEndPoint().Get(1) - startY)));
}
}
}
}
public class TextChunk
{
public string Text { get; set; }
public Rectangle ResultCoordinates { get; set; }
public TextChunk(string s, Rectangle r)
{
Text = s;
ResultCoordinates = r;
}
}

How to sort comma separated keys in Reducer ouput?

I am running an RFM analysis program using MapReduce. The OutputKeyClass is Text.class, and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=BigInteger, F=BigInteger, M=BigDecimal, and the value is also a Text representing Customer_ID. I know that Hadoop sorts output based on keys, but my final result is a bit weird. I want the output keys to be sorted by R first, then F, and then M, but I am getting the following output sort order for unknown reasons:
545,1,7652 100000
545,23,390159.402343750 100001
452,13,132586 100002
452,4,32202 100004
452,1,9310 100007
452,1,4057 100018
452,3,18970 100021
But I want the following output:
545,23,390159.402343750 100001
545,1,7652 100000
452,13,132586 100002
452,4,32202 100004
452,3,18970 100021
452,1,9310 100007
452,1,4057 100018
NOTE: The customer_ID was the key in Map phase and all the RFM values belonging to a particular Customer_ID are brought together at the Reducer for aggregation.
So after a lot of searching I found some useful material, which I have compiled and am posting now:
You have to start with your custom data type. Since I had three comma-separated values which needed to be sorted in descending order, I had to create a TextQuadlet.java data type in Hadoop. The reason I am creating a quadlet is that the first part of the key will be the natural key and the remaining three parts will be R, F, M:
import java.io.*;
import org.apache.hadoop.io.*;
public class TextQuadlet implements WritableComparable<TextQuadlet> {
private String customer_id;
private long R;
private long F;
private double M;
public TextQuadlet() {
}
public TextQuadlet(String customer_id, long R, long F, double M) {
set(customer_id, R, F, M);
}
public void set(String customer_id2, long R2, long F2, double M2) {
this.customer_id = customer_id2;
this.R = R2;
this.F = F2;
this.M=M2;
}
public String getCustomer_id() {
return customer_id;
}
public long getR() {
return R;
}
public long getF() {
return F;
}
public double getM() {
return M;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.customer_id);
out.writeLong(this.R);
out.writeLong(this.F);
out.writeDouble(this.M);
}
@Override
public void readFields(DataInput in) throws IOException {
this.customer_id = in.readUTF();
this.R = in.readLong();
this.F = in.readLong();
this.M = in.readDouble();
}
// This hashcode function is important as it is used by the custom
// partitioner for this class.
@Override
public int hashCode() {
return (int) (customer_id.hashCode() * 163 + R + F + M);
}
@Override
public boolean equals(Object o) {
if (o instanceof TextQuadlet) {
TextQuadlet tp = (TextQuadlet) o;
return customer_id.equals(tp.customer_id) && R == (tp.R) && F==(tp.F) && M==(tp.M);
}
return false;
}
@Override
public String toString() {
return customer_id + "," + R + "," + F + "," + M;
}
// compareTo returns a negative value when this key should sort before tp,
// a positive value when it should sort after tp, and 0 when they tie.
// The comparisons below are inverted on purpose so that larger values
// come first (descending order).
@Override
public int compareTo(TextQuadlet tp) {
// Here my natural key is customer_id and I don't even take it into
// consideration.
// So as you might have concluded, I am sorting R, F, M in descending order.
if (this.R != tp.R) {
if(this.R < tp.R) {
return 1;
}
else{
return -1;
}
}
if (this.F != tp.F) {
if(this.F < tp.F) {
return 1;
}
else{
return -1;
}
}
if (this.M != tp.M){
if(this.M < tp.M) {
return 1;
}
else{
return -1;
}
}
return 0;
}
public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
int cmp = tp1.compareTo(tp2);
return cmp;
}
public static int compare(Text customer_id1, Text customer_id2) {
int cmp = customer_id1.compareTo(customer_id2);
return cmp;
}
}
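The ordering that compareTo is meant to produce can be checked outside Hadoop. The Python sketch below mirrors its logic with a comparison function and applies it to the sample rows from the question. compare_rfm is a hypothetical name, and the rows are plain (R, F, M, customer_id) tuples rather than TextQuadlet objects.

```python
from functools import cmp_to_key


def compare_rfm(a, b):
    """Order (r, f, m, customer_id) tuples by R, F, M, all descending.

    Like compareTo above, the customer id is ignored and the usual
    result is inverted so that larger values sort first.
    """
    for left, right in zip(a[:3], b[:3]):
        if left != right:
            return 1 if left < right else -1
    return 0


rows = [
    (545, 1, 7652.0, "100000"),
    (545, 23, 390159.40234375, "100001"),
    (452, 13, 132586.0, "100002"),
    (452, 4, 32202.0, "100004"),
    (452, 1, 9310.0, "100007"),
    (452, 1, 4057.0, "100018"),
    (452, 3, 18970.0, "100021"),
]
ordered = sorted(rows, key=cmp_to_key(compare_rfm))
```

The customer ids in ordered come out as 100001, 100000, 100002, 100004, 100021, 100007, 100018, which is the order asked for in the question.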
Next you'll need a custom partitioner so that all the values which have the same key end up at one reducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
@Override
public int getPartition(TextQuadlet key, Text value, int numPartitions) {
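// Mask off the sign bit: hashCode() may be negative, and a negative
// partition index is rejected by Hadoop.
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;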
}
}
Thirdly, you'll need a custom group comparator so that all the values are grouped together by their natural key, which is customer_id, and not by the composite key, which is customer_id,R,F,M:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class GroupComparator_RFM_N extends WritableComparator {
protected GroupComparator_RFM_N() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// Here we tell hadoop to group the keys by their natural key.
return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
}
}
Fourthly, you'll need a key comparator which will again sort the keys by R, F, M in descending order, implementing the same sort technique that is used in TextQuadlet.java. Since I got lost while coding, I slightly changed the way I compared data types in this function, but the underlying logic is the same as in TextQuadlet.java:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class KeyComparator_RFM extends WritableComparator {
protected KeyComparator_RFM() {
super(TextQuadlet.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
TextQuadlet ip1 = (TextQuadlet) w1;
TextQuadlet ip2 = (TextQuadlet) w2;
// compare returns a negative value when w1 should sort before w2,
// a positive value when it should sort after w2, and 0 when they tie.
// The results below are inverted on purpose so that larger R, F, M
// values come first (descending order).
if(ip1.getR() == ip2.getR()){
if(ip1.getF() == ip2.getF()){
if(ip1.getM() == ip2.getM()){
return 0;
}
else{
if(ip1.getM() < ip2.getM())
return 1;
else
return -1;
}
}
else{
if(ip1.getF() < ip2.getF())
return 1;
else
return -1;
}
}
else{
if(ip1.getR() < ip2.getR())
return 1;
else
return -1;
}
}
}
And finally, in your driver class, you'll have to include our custom classes. Here I have used TextQuadlet, Text as the k-v pair, but you can choose any other classes depending on your needs:
job.setPartitionerClass(FirstPartitioner_RFM.class);
job.setSortComparatorClass(KeyComparator_RFM.class);
job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
job.setMapOutputKeyClass(TextQuadlet.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(TextQuadlet.class);
job.setOutputValueClass(Text.class);
Please correct me if anything in the code or the explanation is technically wrong. I have based this answer purely on my own understanding of what I read on the internet, and it works perfectly for me.

BASH brace expansion algorithm

I am stuck on this algorithmic question:
Design an algorithm that parses an expression like this:
"((a,b,cy)n,m)" should give:
an - bn - cyn - m
The expression can nest, therefore:
"((a,b)o(m,n)p,b)" parses to:
aomp - aonp - bomp - bonp - b.
I thought of using stacks, but it is too complicated.
Thanks.
You can parse it with a Recursive Descent Parser.
Let's say the comma separated strings are components, so for an expression ((a, b, cy)n, m), (a, b, cy)n and m are two components. a, b and cy are also components. So this is a recursive definition.
For a component (a, b, cy)n, let's say (a, b, cy) and n are two component parts of the component. Component parts will later be combined to produce final result (i.e., an - bn - cyn).
Let's say an expression is comma separated components, for example, (a, cy)n, m is an expression. It has two components (a, cy)n and m, and the component (a, cy)n has two component parts (a, cy) and n, and component part (a, cy) is a brace expression containing a nested expression: a, cy, which also has two components a and cy.
With these definitions (you might use other terms), we can write down the grammar for your expression:
expression = component, component, ...
component = component_part component_part ...
component_part = letters | (expression)
One line is one grammar rule. The first line means an expression is a list of comma separated components. The second line means a component can be constructed with one or more component parts. The third line means a component part can be either a continuous sequence of letters or a nested expression inside a pair of braces.
Then you can use a Recursive Descent Parser to solve your problem with the above grammar.
We will define one method/function for each grammar rule. So basically we will have three methods ParseExpression, ParseComponent, ParseComponentPart.
Algorithm
As I stated above, an expression is a list of comma-separated components, so our ParseExpression method simply calls ParseComponent and then checks whether the next char is a comma, like this (I'm using C#, but you can easily convert it to other languages):
private List<string> ParseExpression()
{
var result = new List<string>();
while (!Eof())
{
// Parsing a component will produce a list of strings,
// they are added to the final string list
var items = ParseComponent();
result.AddRange(items);
// If next char is ',' simply skip it and parse next component
if (Peek() == ',')
{
// Skip comma
ReadNextChar();
}
else
{
break;
}
}
return result;
}
You can see that, when we are parsing an expression, we recursively call ParseComponent (it will then recursively call ParseComponentPart). It's a top-down approach, that's why it's called Recursive Descent Parsing.
ParseComponent is similar, like this:
private List<string> ParseComponent()
{
List<string> leftItems = null;
while (!Eof())
{
// Parse a component part will produce a list of strings (rightItems)
// We need to combine already parsed string list (leftItems) in this component
// with the newly parsed 'rightItems'
var rightItems = ParseComponentPart();
if (rightItems == null)
{
// No more parts, return current result (leftItems) to the caller
break;
}
if (leftItems == null)
{
leftItems = rightItems;
}
else
{
leftItems = Combine(leftItems, rightItems);
}
}
return leftItems;
}
The Combine method builds every concatenation of an item from the left list with an item from the right list:
// Combine two lists of strings and return the combined string list
private List<string> Combine(List<string> leftItems, List<string> rightItems)
{
var result = new List<string>();
foreach (var leftItem in leftItems)
{
foreach (var rightItem in rightItems)
{
result.Add(leftItem + rightItem);
}
}
return result;
}
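To make the combining step concrete, here is a small Java sketch of the same idea (CombineSketch is an illustrative name, not part of the answer's code). Each item on the left is concatenated with each item on the right, so the result has left.size() * right.size() entries:

```java
import java.util.ArrayList;
import java.util.List;

final class CombineSketch {
    // Concatenate every left string with every right string, left item first.
    static List<String> combine(List<String> left, List<String> right) {
        List<String> result = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                result.add(l + r);
            }
        }
        return result;
    }
}
```

For the component (a,b)o(m,n)p the parts are [a, b], [o], [m, n] and [p]. Combining them left to right gives [ao, bo], then [aom, aon, bom, bon], then [aomp, aonp, bomp, bonp].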
Finally, here is ParseComponentPart:
private List<string> ParseComponentPart()
{
var nextChar = Peek();
if (nextChar == '(')
{
// Skip '('
ReadNextChar();
// Recursively parse the inner expression
var items = ParseExpression();
// Skip ')'
ReadNextChar();
return items;
}
else if (char.IsLetter(nextChar))
{
var letters = ReadLetters();
return new List<string> { letters };
}
else
{
// Fail to parse a part, it means a component is ended
return null;
}
}
Full Source Code (C#)
The other parts are mostly helper methods, full C# source code is listed below:
using System;
using System.Collections.Generic;
using System.Text;
namespace Examples
{
public class BashBraceParser
{
private string _expression;
private int _nextCharIndex;
/// <summary>
/// Parse the specified BASH brace expression and return the result string list.
/// </summary>
public IList<string> Parse(string expression)
{
_expression = expression;
_nextCharIndex = 0;
return ParseExpression();
}
private List<string> ParseExpression()
{
// ** This part is already posted above **
}
private List<string> ParseComponent()
{
// ** This part is already posted above **
}
private List<string> ParseComponentPart()
{
// ** This part is already posted above **
}
// Combine two lists of strings and return the combined string list
private List<string> Combine(List<string> leftItems, List<string> rightItems)
{
// ** This part is already posted above **
}
// Peek next char without moving the cursor
private char Peek()
{
if (Eof())
{
return '\0';
}
return _expression[_nextCharIndex];
}
// Read next char and move the cursor to next char
private char ReadNextChar()
{
return _expression[_nextCharIndex++];
}
private void UnreadChar()
{
_nextCharIndex--;
}
// Check if the whole expression string is scanned.
private bool Eof()
{
return _nextCharIndex == _expression.Length;
}
// Read a continuous sequence of letters.
private string ReadLetters()
{
if (!char.IsLetter(Peek()))
{
return null;
}
var str = new StringBuilder();
while (!Eof())
{
var ch = ReadNextChar();
if (char.IsLetter(ch))
{
str.Append(ch);
}
else
{
UnreadChar();
break;
}
}
return str.ToString();
}
}
}
Use The Code
var parser = new BashBraceParser();
var result = parser.Parse("((a,b)o(m,n)p,b)");
var output = String.Join(" - ", result);
// Result: aomp - aonp - bomp - bonp - b
Console.WriteLine(output);
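If you would rather work in Java, the same three grammar rules can be sketched roughly as below. This is a loose port and not the code above: the class name BraceParserSketch is made up, and the explicit check for a missing ')' and the handling of empty components are my own assumptions:

```java
import java.util.ArrayList;
import java.util.List;

final class BraceParserSketch {
    private final String text;
    private int pos = 0;

    private BraceParserSketch(String text) {
        this.text = text;
    }

    static List<String> parse(String expression) {
        return new BraceParserSketch(expression).parseExpression();
    }

    // expression = component, component, ...
    private List<String> parseExpression() {
        List<String> result = new ArrayList<>();
        while (pos < text.length()) {
            result.addAll(parseComponent());
            if (pos < text.length() && text.charAt(pos) == ',') {
                pos++; // skip the comma and parse the next component
            } else {
                break;
            }
        }
        return result;
    }

    // component = component_part component_part ...
    private List<String> parseComponent() {
        List<String> left = null;
        while (pos < text.length()) {
            List<String> right = parseComponentPart();
            if (right == null) {
                break; // no more parts in this component
            }
            left = (left == null) ? right : combine(left, right);
        }
        // Assumption: an empty component (as in "a,,b") contributes nothing.
        return left == null ? new ArrayList<>() : left;
    }

    // component_part = letters | (expression)
    private List<String> parseComponentPart() {
        char c = text.charAt(pos);
        if (c == '(') {
            pos++; // skip '('
            List<String> inner = parseExpression();
            if (pos >= text.length() || text.charAt(pos) != ')') {
                throw new IllegalArgumentException("Expected ')' at index " + pos);
            }
            pos++; // skip ')'
            return inner;
        }
        if (Character.isLetter(c)) {
            int start = pos;
            while (pos < text.length() && Character.isLetter(text.charAt(pos))) {
                pos++;
            }
            List<String> single = new ArrayList<>();
            single.add(text.substring(start, pos));
            return single;
        }
        return null; // ',' or ')' ends the current component
    }

    private static List<String> combine(List<String> left, List<String> right) {
        List<String> result = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                result.add(l + r);
            }
        }
        return result;
    }
}
```

BraceParserSketch.parse("((a,b)o(m,n)p,b)") returns [aomp, aonp, bomp, bonp, b], and parse("((a,b,cy)n,m)") returns [an, bn, cyn, m].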
import java.util.ArrayList;
public class BASHBraceExpansion {
public static ArrayList<StringBuilder> parse_bash(String expression, WrapperInt p) {
ArrayList<StringBuilder> elements = new ArrayList<StringBuilder>();
ArrayList<StringBuilder> result = new ArrayList<StringBuilder>();
elements.add(new StringBuilder(""));
while(p.index < expression.length())
{
if (expression.charAt(p.index) == '(')
{
p.advance();
ArrayList<StringBuilder> temp = parse_bash(expression, p);
ArrayList<StringBuilder> newElements = new ArrayList<StringBuilder>();
for(StringBuilder e : elements)
{
for(StringBuilder t : temp)
{
StringBuilder s = new StringBuilder(e);
newElements.add(s.append(t));
}
}
elements = newElements;
}
else if (expression.charAt(p.index) == ',')
{
result.addAll(elements);
elements.clear();
elements.add(new StringBuilder(""));
p.advance();
}
else if (expression.charAt(p.index) == ')')
{
p.advance();
result.addAll(elements);
return result;
}
else
{
for(StringBuilder sb : elements)
{
sb.append(expression.charAt(p.index));
}
p.advance();
}
}
return elements;
}
public static void print(ArrayList<StringBuilder> list)
{
for(StringBuilder s : list)
{
System.out.print(s + " * ");
}
System.out.println();
}
public static void main(String[] args) {
WrapperInt p = new WrapperInt();
ArrayList<StringBuilder> list = parse_bash("((a,b)o(m,n)p,b)", p);
//ArrayList<StringBuilder> list = parse_bash("(a,b)", p);
WrapperInt q = new WrapperInt();
ArrayList<StringBuilder> list1 = parse_bash("((a,b,cy)n,m)", q);
ArrayList<StringBuilder> list2 = parse_bash("((a,b)dr(f,g)(k,m),L(p,q))", new WrapperInt());
System.out.println("*****RESULT : ******");
print(list);
print(list1);
print(list2);
}
}
public class WrapperInt {
public WrapperInt() {
index = 0;
}
public int advance()
{
index ++;
return index;
}
public int index;
}
// aomp - aonp - bomp - bonp - b.

Partition/split/section IEnumerable<T> into IEnumerable<IEnumerable<T>> based on a function using LINQ?

I'd like to split a sequence in C# to a sequence of sequences using LINQ. I've done some investigation, and the closest SO article I've found that is slightly related is this.
However, this question only asks how to partition the original sequence based upon a constant value. I would like to partition my sequence based on an operation.
Specifically, I have a list of objects which contain a decimal property.
public class ExampleClass
{
public decimal TheValue { get; set; }
}
Let's say I have a sequence of ExampleClass, and the corresponding sequence of values of TheValue is:
{0,1,2,3,1,1,4,6,7,0,1,0,2,3,5,7,6,5,4,3,2,1}
I'd like to partition the original sequence into an IEnumerable<IEnumerable<ExampleClass>> with values of TheValue resembling:
{{0,1,2,3}, {1,1,4,6,7}, {0,1}, {0,2,3,5,7}, {6,5,4,3,2,1}}
I'm just lost on how this would be implemented. SO, can you help?
I have a seriously ugly solution right now, but have a "feeling" that LINQ will increase the elegance of my code.
Okay, I think we can do this...
public static IEnumerable<IEnumerable<TElement>>
PartitionMonotonically<TElement, TKey>
(this IEnumerable<TElement> source,
Func<TElement, TKey> selector)
{
// TODO: Argument validation and custom comparisons
Comparer<TKey> keyComparer = Comparer<TKey>.Default;
using (var iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
{
yield break;
}
TKey currentKey = selector(iterator.Current);
List<TElement> currentList = new List<TElement> { iterator.Current };
int sign = 0;
while (iterator.MoveNext())
{
TElement element = iterator.Current;
TKey key = selector(element);
int nextSign = Math.Sign(keyComparer.Compare(currentKey, key));
// Haven't decided a direction yet
if (sign == 0)
{
sign = nextSign;
currentList.Add(element);
}
// Same direction or no change
else if (sign == nextSign || nextSign == 0)
{
currentList.Add(element);
}
else // Change in direction: yield current list and start a new one
{
yield return currentList;
currentList = new List<TElement> { element };
sign = 0;
}
currentKey = key;
}
yield return currentList;
}
}
Completely untested, but I think it might work...
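For anyone who wants to check the logic by hand, here is a rough Java sketch of the same direction-tracking idea. It is eager and list-based rather than a lazy iterator, it compares elements directly instead of going through a key selector, and the name MonotonicPartition is my own:

```java
import java.util.ArrayList;
import java.util.List;

final class MonotonicPartition {
    // Split the input into maximal runs that never change direction.
    static <T extends Comparable<T>> List<List<T>> partition(List<T> source) {
        List<List<T>> runs = new ArrayList<>();
        if (source.isEmpty()) {
            return runs;
        }
        List<T> current = new ArrayList<>();
        current.add(source.get(0));
        T previous = source.get(0);
        int direction = 0; // 0 means the current run has no direction yet
        for (int i = 1; i < source.size(); i++) {
            T value = source.get(i);
            int step = Integer.signum(previous.compareTo(value));
            if (direction == 0) {
                // Haven't decided a direction yet
                direction = step;
                current.add(value);
            } else if (step == direction || step == 0) {
                // Same direction or no change
                current.add(value);
            } else {
                // Change in direction: close this run and start a new one
                runs.add(current);
                current = new ArrayList<>();
                current.add(value);
                direction = 0;
            }
            previous = value;
        }
        runs.add(current);
        return runs;
    }
}
```

Tracing it on the sequence from the question, {0,1,2,3,1,1,4,6,7,0,1,0,2,3,5,7,6,5,4,3,2,1}, gives [[0, 1, 2, 3], [1, 1, 4, 6, 7], [0, 1], [0, 2, 3, 5, 7], [6, 5, 4, 3, 2, 1]], which matches the partition asked for. Note that the element at a turning point stays with the earlier run, so 1,3,2,1 becomes [1, 3] and [2, 1].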
Alternatively, with LINQ operators and some abuse of .NET closures capturing variables by reference:
public static IEnumerable<IEnumerable<T>> Monotonic<T>(this IEnumerable<T> enumerable)
{
var comparator = Comparer<T>.Default;
int i = 0;
T last = default(T);
return enumerable
.GroupBy(value =>
{
i = comparator.Compare(value, last) > 0 ? i : i + 1;
last = value;
return i;
})
.Select(group => group.Select(item => item));
}
Taken from some utility code for partitioning IEnumerables into a makeshift table for logging. If I recall correctly, the odd trailing Select is there to prevent ambiguity when the input is an enumeration of strings.
Here's a custom LINQ operator which splits a sequence according to just about any criteria. Its parameters are:
xs: the input element sequence.
func: a function which accepts the "current" input element and a state object, and returns as a tuple:
a bool stating whether the input sequence should be split before the "current" element; and
a state object which will be passed to the next invocation of func.
initialState: the state object that gets passed to func on its first invocation.
Here it is, along with a helper class (required because yield return apparently cannot be nested):
public static IEnumerable<IEnumerable<T>> Split<T, TState>(
this IEnumerable<T> xs,
Func<T, TState, Tuple<bool, TState>> func,
TState initialState)
{
using (var splitter = new Splitter<T, TState>(xs, func, initialState))
{
while (splitter.HasNext)
{
yield return splitter.GetNext();
}
}
}
internal sealed class Splitter<T, TState> : IDisposable
{
public Splitter(IEnumerable<T> xs,
Func<T, TState, Tuple<bool, TState>> func,
TState initialState)
{
this.xs = xs.GetEnumerator();
this.func = func;
this.state = initialState;
this.hasNext = this.xs.MoveNext();
}
private readonly IEnumerator<T> xs;
private readonly Func<T, TState, Tuple<bool, TState>> func;
private bool hasNext;
private TState state;
public bool HasNext { get { return hasNext; } }
public IEnumerable<T> GetNext()
{
while (hasNext)
{
Tuple<bool, TState> decision = func(xs.Current, state);
state = decision.Item2;
if (decision.Item1) yield break;
yield return xs.Current;
hasNext = xs.MoveNext();
}
}
public void Dispose() { xs.Dispose(); }
}
Note: Here are some of the design decisions that went into the Split method:
It should make only a single pass over the sequence.
State is made explicit so that it's possible to keep side effects out of func.
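The explicit-state idea carries over to other languages too. Below is a rough Java sketch of it, with some differences from the operator above that are my own choices: it builds the groups eagerly instead of lazily, it calls func exactly once per element, and it ignores a split request before the first element of a group so that no empty groups are produced. SplitSketch and Decision are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class SplitSketch {
    // Hypothetical result type: split before the current element or not, plus the next state.
    static final class Decision<S> {
        final boolean splitBefore;
        final S nextState;

        Decision(boolean splitBefore, S nextState) {
            this.splitBefore = splitBefore;
            this.nextState = nextState;
        }
    }

    // Single pass over xs; the state is threaded through func instead of being captured.
    static <T, S> List<List<T>> split(Iterable<T> xs,
                                      BiFunction<T, S, Decision<S>> func,
                                      S initialState) {
        List<List<T>> groups = new ArrayList<>();
        List<T> current = new ArrayList<>();
        S state = initialState;
        for (T x : xs) {
            Decision<S> decision = func.apply(x, state);
            state = decision.nextState;
            if (decision.splitBefore && !current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(x);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

For example, using the previous value as the state (null at the start) and splitting whenever the current value is smaller than the previous one turns 1, 2, 3, 1, 2, 0 into [[1, 2, 3], [1, 2], [0]].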

Resources