C# Sort array of Objects by object type (Icomparable?) - sorting

I want to sort anarray by the object type. So all songs are together, all book are together, and all movies are together.
I am reading a file and determine what each object should be. then creating the object and adding it to the array.
EDIT: Here is the actual code.
static Media[] ReadData()
{
List<Media> things = new List<Media>();
try
{
String filePath = "resources/Data.txt";
string Line;
int counter = 0; // used to check if you have reached the max
number of allowed objects (100)
using (StreamReader File = new System.IO.StreamReader(filePath))
{
while ((Line = File.ReadLine()) != null)
{
This is where each object is created. The file
search for a key word in the beginning of the line, then
creates the corresponding object. It will split the
information on the first line of the object and will
read each line until a "-" character is found and
pass each line into the summary. The summary is then
decrypted and the created object is passed into an
array List. Finally if the array list reaches 100, it
will print "You have reach max number of objects" and
stop reading the file.
if (Line.StartsWith("BOOK"))
{
String[] tempArray = Line.Split('|');
//foreach (string x in tempArray){Console.WriteLine(x);} //This is used
for testing
Book tempBook = new Book(tempArray[1],
int.Parse(tempArray[2]), tempArray[3]);
while ((Line = File.ReadLine()) != null)
{
if (Line.StartsWith("-")){break;}
tempBook.Summary = tempBook.Summary + Line;
}
tempBook.Summary = tempBook.Decrypt();
things.Add(tempBook);
counter++;
}
else if (Line.StartsWith("SONG"))
{
String[] tempArray = Line.Split('|');
//foreach (string x in tempArray)
{Console.WriteLine(x);} //This is used for testing
Song tempSong = new Song(tempArray[1],
int.Parse(tempArray[2]), tempArray[3], tempArray[4]);
things.Add(tempSong);
counter++;
}
else if (Line.StartsWith("MOVIE"))
{
String[] tempArray = Line.Split('|');
//foreach (string x in tempArray)
{Console.WriteLine(x);} //This is used for testing
Movie tempMovie = new Movie(tempArray[1],
int.Parse(tempArray[2]), tempArray[3]);
while ((Line = File.ReadLine()) != null)
{
if (Line.StartsWith("-")) { break; }
tempMovie.Summary = tempMovie.Summary + Line;
}
tempMovie.Summary = tempMovie.Decrypt();
things.Add(tempMovie);
counter++;
}
if (counter == 100)
{
Console.WriteLine("You have reached the maximum number of media
objects.");
break;
}
}
File.Close();
}
}
return things.ToArray(); // Convert array list to an Array and return the
array.
}
In the main code, I have this:
Media[] mediaObjects = new Media[100];
Media[] temp = ReadData();
int input; // holds the input from a user to determin which action to take
for (int i = 0; i<temp.Length;i++){ mediaObjects[i] = temp[i]; }
I want the array of mediaObjects to be sorted by the what type of objects.
I have also used Icomparable to do an arrayList.sort() but still no luck.
public int CompareTo(object obj)
{
if (obj == null)
{
return 1;
}
Song temp = obj as Song;
if (temp != null)
{
//Type is a string
return this.Type.CompareTo(temp.Type);
}
else
{
return 1;
}
}

So I see you have BOOK, SONG and MOVIE types.
This is a classic case of implementing the IComparable interface - although you are correct with the interface name to implement, you are not using this correctly.
Create a base class by the name MediaObject - this will be the main one for your other types of objects you create.
Add the correct properties you need. In this case, the media type is the one in need.
Let this class implement IComparable to help you with the comparison.
override the CompareTo() method in the PROPER way
public class MediaObject : IComparable
{
private string mediaType;
public string MediaType
{
get {return mediaType;}
set {mediaType=value;}
}
public MediaObject(string mType)
{
MediaType = mType;
}
int IComparable.CompareTo(object obj)
{
MediaObject mo = (MediaObject)obj;
return String.Compare(this.MediaType,mo.MediaType); //implement a case insensitive comparison if needed (for your research)
}
}
You can now compare the MediaObject objects In your main method directly.

Thank for the advice. I ended up just reformating how I was creating the list while reading it. I made multiple lists for each object then just happened each one on to the master list so It showed up sorted. I wish I figured that out before I made this long post.
As far as I could find, you can't sort by the type of object in a generic array. If anyone has a better solution feel free to post it.

Related

How to get the text position from the pdf page in iText 7

I am trying to find the text position in PDF page?
What I have tried is to get the text in the PDF page by PDF Text Extractor using simple text extraction strategy. I am looping each word to check if my word exists. split the words using:
var Words = pdftextextractor.Split(new char[] { ' ', '\n' });
What I wasn't able to do is to find the text position. The problem is I wasn't able to find the location of the text. All I need to find is the y co-ordinates of the word in the PDF file.
I was able to manipulate it with my previous version for Itext5. I don't know if you are looking for C# but that is what the below code is written in.
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class TextLocationStrategy : LocationTextExtractionStrategy
{
private List<textChunk> objectResult = new List<textChunk>();
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
TextRenderInfo renderInfo = (TextRenderInfo)data;
string curFont = renderInfo.GetFont().GetFontProgram().ToString();
float curFontSize = renderInfo.GetFontSize();
IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
foreach (TextRenderInfo t in text)
{
string letter = t.GetText();
Vector letterStart = t.GetBaseline().GetStartPoint();
Vector letterEnd = t.GetAscentLine().GetEndPoint();
Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));
if (letter != " " && !letter.Contains(' '))
{
textChunk chunk = new textChunk();
chunk.text = letter;
chunk.rect = letterRect;
chunk.fontFamily = curFont;
chunk.fontSize = curFontSize;
chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;
objectResult.Add(chunk);
}
}
}
}
public class textChunk
{
public string text { get; set; }
public Rectangle rect { get; set; }
public string fontFamily { get; set; }
public int fontSize { get; set; }
public float spaceWidth { get; set; }
}
I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.
You can implement this by adding a few lines to grab the data from your pdf.
PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextExtractionStrat());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));
Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.
#Joris' answer explains how to implement a completely new extraction strategy / event listener for the task. Alternatively one can try and tweak an existing text extraction strategy to do what you required.
This answer demonstrates how to tweak the existing LocationTextExtractionStrategy to return both the text and its characters' respective y coordinates.
Beware, this is but a proof-of-concept which in particular assumes text to be written horizontally, i.e. using an effective transformation matrix (ctm and text matrix combined) with b and c equal to 0.
Furthermore the character and coordinate retrieval methods of TextPlusY are not at all optimized and might take long to execute.
As the OP did not express a language preference, here a solution for iText7 for Java:
TextPlusY
For the task at hand one needs to be able to retrieve character and y coordinates side by side. To make this easier I use a class representing both text its characters' respective y coordinates. It is derived from CharSequence, a generalization of String, which allows it to be used in many String related functions:
public class TextPlusY implements CharSequence
{
final List<String> texts = new ArrayList<>();
final List<Float> yCoords = new ArrayList<>();
//
// CharSequence implementation
//
#Override
public int length()
{
int length = 0;
for (String text : texts)
{
length += text.length();
}
return length;
}
#Override
public char charAt(int index)
{
for (String text : texts)
{
if (index < text.length())
{
return text.charAt(index);
}
index -= text.length();
}
throw new IndexOutOfBoundsException();
}
#Override
public CharSequence subSequence(int start, int end)
{
TextPlusY result = new TextPlusY();
int length = end - start;
for (int i = 0; i < yCoords.size(); i++)
{
String text = texts.get(i);
if (start < text.length())
{
float yCoord = yCoords.get(i);
if (start > 0)
{
text = text.substring(start);
start = 0;
}
if (length > text.length())
{
result.add(text, yCoord);
}
else
{
result.add(text.substring(0, length), yCoord);
break;
}
}
else
{
start -= text.length();
}
}
return result;
}
//
// Object overrides
//
#Override
public String toString()
{
StringBuilder builder = new StringBuilder();
for (String text : texts)
{
builder.append(text);
}
return builder.toString();
}
//
// y coordinate support
//
public TextPlusY add(String text, float y)
{
if (text != null)
{
texts.add(text);
yCoords.add(y);
}
return this;
}
public float yCoordAt(int index)
{
for (int i = 0; i < yCoords.size(); i++)
{
String text = texts.get(i);
if (index < text.length())
{
return yCoords.get(i);
}
index -= text.length();
}
throw new IndexOutOfBoundsException();
}
}
(TextPlusY.java)
TextPlusYExtractionStrategy
Now we extend the LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need for that is to generalize the method getResultantText.
Unfortunately the LocationTextExtractionStrategy has hidden some methods and members (private or package protected) which need to be accessed here; thus, some reflection magic is required. If your framework does not allow this, you'll have to copy the whole strategy and manipulate it accordingly.
public class TextPlusYExtractionStrategy extends LocationTextExtractionStrategy
{
static Field locationalResultField;
static Method sortWithMarksMethod;
static Method startsWithSpaceMethod;
static Method endsWithSpaceMethod;
static Method textChunkSameLineMethod;
static
{
try
{
locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
locationalResultField.setAccessible(true);
sortWithMarksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("sortWithMarks", List.class);
sortWithMarksMethod.setAccessible(true);
startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class);
startsWithSpaceMethod.setAccessible(true);
endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
endsWithSpaceMethod.setAccessible(true);
textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
textChunkSameLineMethod.setAccessible(true);
}
catch(NoSuchFieldException | NoSuchMethodException | SecurityException e)
{
// Reflection failed
}
}
//
// constructors
//
public TextPlusYExtractionStrategy()
{
super();
}
public TextPlusYExtractionStrategy(ITextChunkLocationStrategy strat)
{
super(strat);
}
#Override
public String getResultantText()
{
return getResultantTextPlusY().toString();
}
public TextPlusY getResultantTextPlusY()
{
try
{
List<TextChunk> textChunks = new ArrayList<>((List<TextChunk>)locationalResultField.get(this));
sortWithMarksMethod.invoke(this, textChunks);
TextPlusY textPlusY = new TextPlusY();
TextChunk lastChunk = null;
for (TextChunk chunk : textChunks)
{
float chunkY = chunk.getLocation().getStartLocation().get(Vector.I2);
if (lastChunk == null)
{
textPlusY.add(chunk.getText(), chunkY);
}
else if ((Boolean)textChunkSameLineMethod.invoke(chunk, lastChunk))
{
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (isChunkAtWordBoundary(chunk, lastChunk) &&
!(Boolean)startsWithSpaceMethod.invoke(this, chunk.getText()) &&
!(Boolean)endsWithSpaceMethod.invoke(this, lastChunk.getText()))
{
textPlusY.add(" ", chunkY);
}
textPlusY.add(chunk.getText(), chunkY);
}
else
{
textPlusY.add("\n", lastChunk.getLocation().getStartLocation().get(Vector.I2));
textPlusY.add(chunk.getText(), chunkY);
}
lastChunk = chunk;
}
return textPlusY;
}
catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e)
{
throw new RuntimeException("Reflection failed", e);
}
}
}
(TextPlusYExtractionStrategy.java)
Usage
Using these two classes you can extract text with coordinates and search therein like this:
try ( PdfReader reader = new PdfReader(YOUR_PDF);
PdfDocument document = new PdfDocument(reader) )
{
TextPlusYExtractionStrategy extractionStrategy = new TextPlusYExtractionStrategy();
PdfPage page = document.getFirstPage();
PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
parser.processPageContent(page);
TextPlusY textPlusY = extractionStrategy.getResultantTextPlusY();
System.out.printf("\nText from test.pdf\n=====\n%s\n=====\n", textPlusY);
System.out.print("\nText with y from test.pdf\n=====\n");
int length = textPlusY.length();
float lastY = Float.MIN_NORMAL;
for (int i = 0; i < length; i++)
{
float y = textPlusY.yCoordAt(i);
if (y != lastY)
{
System.out.printf("\n(%4.1f) ", y);
lastY = y;
}
System.out.print(textPlusY.charAt(i));
}
System.out.print("\n=====\n");
System.out.print("\nMatches of 'est' with y from test.pdf\n=====\n");
Matcher matcher = Pattern.compile("est").matcher(textPlusY);
while (matcher.find())
{
System.out.printf("from character %s to %s at y position (%4.1f)\n", matcher.start(), matcher.end(), textPlusY.yCoordAt(matcher.start()));
}
System.out.print("\n=====\n");
}
(ExtractTextPlusY test method testExtractTextPlusYFromTest)
For my test document
the output of the test code above is
Text from test.pdf
=====
Ein Dokumen t mit einigen
T estdaten
T esttest T est test test
=====
Text with y from test.pdf
=====
(691,8) Ein Dokumen t mit einigen
(666,9) T estdaten
(642,0) T esttest T est test test
=====
Matches of 'est' with y from test.pdf
=====
from character 28 to 31 at y position (666,9)
from character 39 to 42 at y position (642,0)
from character 43 to 46 at y position (642,0)
from character 49 to 52 at y position (642,0)
from character 54 to 57 at y position (642,0)
from character 59 to 62 at y position (642,0)
=====
My locale uses the comma as decimal separator, you might see 666.9 instead of 666,9.
The extra spaces you see can be removed by fine-tuning the base LocationTextExtractionStrategy functionality further. But that is the focus of other questions...
First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest.
Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.
Possible implementation:
implement IEventListener
get notified for all events that render text, and store the corresponding TextRenderInfo object
once you're finished with the document, sort these objects based on their position in the page
loop over this list of TextRenderInfo objects, they offer both the text being rendered and the coordinates
how to:
implement ITextExtractionStrategy (or extend an existing
implementation)
use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
your strategy should be set up to keep track of locations for the text it processed
ITextExtractionStrategy has the following method in its interface:
#Override
public void eventOccurred(IEventData data, EventType type) {
// you can first check the type of the event
if (!type.equals(EventType.RENDER_TEXT))
return;
// now it is safe to cast
TextRenderInfo renderInfo = (TextRenderInfo) data;
}
Important to keep in mind is that rendering instructions in a pdf do not need to appear in order.
The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:
render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"
You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are), and sorting (to get all the TextRenderInfo objects in the proper reading order.
Once that's done, it should be easy.
For anyone looking for a simple Rectangle object this worked for me. I made these two classes, and call the static method "GetTextCoordinates" with your page and desired text.
public class PdfTextLocator : LocationTextExtractionStrategy
{
public string TextToSearchFor { get; set; }
public List<TextChunk> ResultCoordinates { get; set; }
/// <summary>
/// Returns a rectangle with a given location of text on a page. Returns null if not found.
/// </summary>
/// <param name="page">Page to Search</param>
/// <param name="s">String to be found</param>
/// <returns></returns>
public static Rectangle GetTextCoordinates(PdfPage page, string s)
{
PdfTextLocator strat = new PdfTextLocator(s);
PdfTextExtractor.GetTextFromPage(page, strat);
foreach (TextChunk c in strat.ResultCoordinates)
{
if (c.Text == s)
return c.ResultCoordinates;
}
return null;
}
public PdfTextLocator(string textToSearchFor)
{
this.TextToSearchFor = textToSearchFor;
ResultCoordinates = new List<TextChunk>();
}
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
TextRenderInfo renderInfo = (TextRenderInfo)data;
IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
for (int i = 0; i < text.Count; i++)
{
if (text[i].GetText() == TextToSearchFor[0].ToString())
{
string word = "";
for (int j = i; j < i + TextToSearchFor.Length && j < text.Count; j++)
{
word = word + text[j].GetText();
}
float startX = text[i].GetBaseline().GetStartPoint().Get(0);
float startY = text[i].GetBaseline().GetStartPoint().Get(1);
ResultCoordinates.Add(new TextChunk(word, new Rectangle(startX, startY, text[i].GetAscentLine().GetEndPoint().Get(0) - startX, text[i].GetAscentLine().GetEndPoint().Get(0) - startY)));
}
}
}
}
public class TextChunk
{
public string Text { get; set; }
public Rectangle ResultCoordinates { get; set; }
public TextChunk(string s, Rectangle r)
{
Text = s;
ResultCoordinates = r;
}
}

Kinect Record Only When One Body/Skeleton Is in the Frame

I need to write a program that records the frames, but only when one skeleton/body is in the frame. I looked at the bodyCount method, but it always gives a value of 6 (useless). One thing I tried to do is shown below. This code basically shows the index at which the body being tracked is stored. But, I still can't figure out how to know if there is one or more bodies in the frame.
I would really appreciate any help.
Thanks in advance.
private void Reader_FrameArrived(object sender, BodyFrameArrivedEventArgs e){
using (BodyFrame bodyFrame=e.FrameReference.AquireFrame()){
if (bodyFrame!=null){
if(this.bodies==null){
this.bodies=new Body[bodyFrame.BodyCount];
}
body.Frame.GetAndRefreshBodyData(this.bodies)
for(int i=0; i<6;i++){
if(this.bodies[i].IsTracked){
Console.WriteLine("Tracked"+i)
}
}
}
}
}
Just check the IsTracked property of each body, and store the number of tracked skeleton in a single variable. If this number is equal to 1, there is just one single skeleton tracked, and you can start your recording.
private Body[] bodies = null;
[...]
private void Reader_FrameArrived(object sender, BodyFrameArrivedEventArgs e)
{
bool dataReceived = false;
using (BodyFrame bodyFrame = e.FrameReference.AcquireFrame())
{
if (bodyFrame != null)
{
if (this.bodies == null)
{
this.bodies = new Body[bodyFrame.BodyCount];
}
// The first time GetAndRefreshBodyData is called, Kinect will allocate each Body in the array.
// As long as those body objects are not disposed and not set to null in the array,
// those body objects will be re-used.
bodyFrame.GetAndRefreshBodyData(this.bodies);
dataReceived = true;
}
}
if (dataReceived)
{
int trackedBodyCount = 0;
for (int i=0; i<this.bodies.Length; ++i)
{
if(this.bodies[i].IsTracked) {
trackedBodyCount += 1;
}
}
if (trackedBodyCount == 1)
{
// One skeleton is tracked
}
}
}

Trying to save comma-separated list

Trying to save selections from a CheckBoxList as a comma-separated list (string) in DB (one or more choices selected). I am using a proxy in order to save as a string because otherwise I'd have to create separate tables in the DB for a relation - the work is not worth it for this simple scenario and I was hoping that I could just convert it to a string and avoid that.
The CheckBoxList uses an enum for it's choices:
public enum Selection
{
Selection1,
Selection2,
Selection3
}
Not to be convoluted, but I use [Display(Name="Choice 1")] and an extension class to display something friendly on the UI. Not sure if I can save that string instead of just the enum, although I think if I save as enum it's not a big deal for me to "display" the friendly string on UI on some confirmation page.
This is the "Record" class that saves a string in the DB:
public virtual string MyCheckBox { get; set; }
This is the "Proxy", which is some sample I found but not directly dealing with enum, and which uses IEnumerable<string> (or should it be IEnumerable<Selection>?):
public IEnumerable<string> MyCheckBox
{
get
{
if (String.IsNullOrWhiteSpace(Record.MyCheckBox)) return new string[] { };
return Record
.MyCheckBox
.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
.Select(r => r.Trim())
.Where(r => !String.IsNullOrEmpty(r));
}
set
{
Record.MyCheckBox = value == null ? null : String.Join(",", value);
}
}
To save in the DB, I am trying to do this in a create class:
proxy.MyCheckBox = record.MyCheckBox; //getting error here
but am getting the error:
Cannot implicitly convert 'string' to System.Collections.Generic.IEnumerable'
I don't know, if it's possible or better, to use Parse or ToString from the API for enum values.
I know that doing something like this will store whatever I put in the ("") into the DB, so it's just a matter of figuring out how to overcome the error (or, if there is an alternative):
proxy.MyCheckBox = new[] {"foo", "bar"};
I am not good with this stuff and have just been digging and digging to come up with a solution. Any help is much appreciated.
You can accomplish this using a custom user type. The example below uses an ISet<string> on the class and stores the values as a delimited string.
[Serializable]
public class CommaDelimitedSet : IUserType
{
const string delimiter = ",";
#region IUserType Members
public new bool Equals(object x, object y)
{
if (ReferenceEquals(x, y))
{
return true;
}
var xSet = x as ISet<string>;
var ySet = y as ISet<string>;
if (xSet == null || ySet == null)
{
return false;
}
// compare set contents
return xSet.Except(ySet).Count() == 0 && ySet.Except(xSet).Count() == 0;
}
public int GetHashCode(object x)
{
return x.GetHashCode();
}
public object NullSafeGet(IDataReader rs, string[] names, object owner)
{
var outValue = NHibernateUtil.String.NullSafeGet(rs, names[0]) as string;
if (string.IsNullOrEmpty(outValue))
{
return new HashSet<string>();
}
else
{
var splitArray = outValue.Split(new[] {Delimiter}, StringSplitOptions.RemoveEmptyEntries);
return new HashSet<string>(splitArray);
}
}
public void NullSafeSet(IDbCommand cmd, object value, int index)
{
var inValue = value as ISet<string>;
object setValue = inValue == null ? null : string.Join(Delimiter, inValue);
NHibernateUtil.String.NullSafeSet(cmd, setValue, index);
}
public object DeepCopy(object value)
{
// return new ISet so that Equals can work
// see http://www.mail-archive.com/nhusers#googlegroups.com/msg11054.html
var set = value as ISet<string>;
if (set == null)
{
return null;
}
return new HashSet<string>(set);
}
public object Replace(object original, object target, object owner)
{
return original;
}
public object Assemble(object cached, object owner)
{
return DeepCopy(cached);
}
public object Disassemble(object value)
{
return DeepCopy(value);
}
public SqlType[] SqlTypes
{
get { return new[] {new SqlType(DbType.String)}; }
}
public Type ReturnedType
{
get { return typeof(ISet<string>); }
}
public bool IsMutable
{
get { return false; }
}
#endregion
}
Usage in mapping file:
Map(x => x.CheckboxValues.CustomType<CommaDelimitedSet>();

Lossless hierarchical run length encoding

I want to summarize rather than compress in a similar manner to run length encoding but in a nested sense.
For instance, I want : ABCBCABCBCDEEF to become: (2A(2BC))D(2E)F
I am not concerned that an option is picked between two identical possible nestings E.g.
ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA which are of the same compressed length, despite having different structures.
However I do want the choice to be MOST greedy. For instance:
ABCDABCDCDCDCD would pick (2ABCD)(3CD) - of length six in original symbols which is less than ABCDAB(4CD) which is length 8 in original symbols.
In terms of background I have some repeating patterns that I want to summarize. So that the data is more digestible. I don't want to disrupt the logical order of the data as it is important. but I do want to summarize it , by saying, symbol A times 3 occurrences, followed by symbols XYZ for 20 occurrences etc. and this can be displayed in a nested sense visually.
Welcome ideas.
I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB, you would have to make that judgment yourself, if single-symbol sequences like that should be encoded as a repeated symbol or not, or if a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
For each position in the sequence, try to find the longest match (actually, it doesn't, it takes the first 2+ match it finds, I left the rest as an exercise for you since I have to leave my computer for a few hours now)
It then tries to encode that sequence, the one that repeats, recursively, and spits out a X*seq type of object
If it can't find a repeating sequence, it spits out the single symbol at that location
It then skips what it encoded, and continues from #1
Anyway, here's the code:
void Main()
{
string[] examples = new[]
{
"ABCBCABCBCDEEF",
"ABBABBABBABA",
"ABCDABCDCDCDCD",
};
foreach (string example in examples)
{
StringBuilder sb = new StringBuilder();
foreach (var r in Encode(example))
sb.Append(r.ToString());
Debug.WriteLine(example + " = " + sb.ToString());
}
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
return Encode<T>(values, EqualityComparer<T>.Default);
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
List<T> sequence = new List<T>(values);
int index = 0;
while (index < sequence.Count)
{
var bestSequence = FindBestSequence<T>(sequence, index, comparer);
if (bestSequence == null || bestSequence.Length < 1)
throw new InvalidOperationException("Unable to find sequence at position " + index);
yield return bestSequence;
index += bestSequence.Length;
}
}
private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
int sequenceLength = 1;
while (startIndex + sequenceLength * 2 <= sequence.Count)
{
if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
{
bool atLeast2Repeats = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
{
atLeast2Repeats = false;
break;
}
}
if (atLeast2Repeats)
{
int count = 2;
while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
{
bool anotherRepeat = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
{
anotherRepeat = false;
break;
}
}
if (anotherRepeat)
count++;
else
break;
}
List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
return new SequenceRepeat<T>(count, repeatedSequence);
}
}
sequenceLength++;
}
// fall back, we could not find anything that repeated at all
return new SingleSymbol<T>(sequence[startIndex]);
}
public abstract class Repeat<T>
{
public int Count { get; private set; }
protected Repeat(int count)
{
Count = count;
}
public abstract int Length
{
get;
}
}
public class SingleSymbol<T> : Repeat<T>
{
public T Value { get; private set; }
public SingleSymbol(T value)
: base(1)
{
Value = value;
}
public override string ToString()
{
return string.Format("{0}", Value);
}
public override int Length
{
get
{
return Count;
}
}
}
public class SequenceRepeat<T> : Repeat<T>
{
public Repeat<T>[] Values { get; private set; }
public SequenceRepeat(int count, Repeat<T>[] values)
: base(count)
{
Values = values;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
}
public override int Length
{
get
{
int oneLength = 0;
foreach (var value in Values)
oneLength += value.Length;
return Count * oneLength;
}
}
}
public class GroupRepeat<T> : Repeat<T>
{
public Repeat<T> Group { get; private set; }
public GroupRepeat(int count, Repeat<T> group)
: base(count)
{
Group = group;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, Group);
}
public override int Length
{
get
{
return Count * Group.Length;
}
}
}
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context free grammar which generates (only) the string, except in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABABCD
s->ABtt
t->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The problem of the smallest grammar is known to be hard, and is a well-studied problem. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.

Partition/split/section IEnumerable<T> into IEnumerable<IEnumerable<T>> based on a function using LINQ?

I'd like to split a sequence in C# to a sequence of sequences using LINQ. I've done some investigation, and the closest SO article I've found that is slightly related is this.
However, this question only asks how to partition the original sequence based upon a constant value. I would like to partition my sequence based on an operation.
Specifically, I have a list of objects which contain a decimal property.
public class ExampleClass
{
public decimal TheValue { get; set; }
}
Let's say I have a sequence of ExampleClass, and the corresponding sequence of values of TheValue is:
{0,1,2,3,1,1,4,6,7,0,1,0,2,3,5,7,6,5,4,3,2,1}
I'd like to partition the original sequence into an IEnumerable<IEnumerable<ExampleClass>> with values of TheValue resembling:
{{0,1,2,3}, {1,1,4,6,7}, {0,1}, {0,2,3,5,7}, {6,5,4,3,2,1}}
I'm just lost on how this would be implemented. SO, can you help?
I have a seriously ugly solution right now, but have a "feeling" that LINQ will increase the elegance of my code.
Okay, I think we can do this...
public static IEnumerable<IEnumerable<TElement>>
PartitionMontonically<TElement, TKey>
(this IEnumerable<TElement> source,
Func<TElement, TKey> selector)
{
// TODO: Argument validation and custom comparisons
Comparer<TKey> keyComparer = Comparer<TKey>.Default;
using (var iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
{
yield break;
}
TKey currentKey = selector(iterator.Current);
List<TElement> currentList = new List<TElement> { iterator.Current };
int sign = 0;
while (iterator.MoveNext())
{
TElement element = iterator.Current;
TKey key = selector(element);
int nextSign = Math.Sign(keyComparer.Compare(currentKey, key));
// Haven't decided a direction yet
if (sign == 0)
{
sign = nextSign;
currentList.Add(element);
}
// Same direction or no change
else if (sign == nextSign || nextSign == 0)
{
currentList.Add(element);
}
else // Change in direction: yield current list and start a new one
{
yield return currentList;
currentList = new List<TElement> { element };
sign = 0;
}
currentKey = key;
}
yield return currentList;
}
}
Completely untested, but I think it might work...
alternatively with linq operators and some abuse of .net closures by reference.
public static IEnumerable<IEnumerable<T>> Monotonic<T>(this IEnumerable<T> enumerable)
{
var comparator = Comparer<T>.Default;
int i = 0;
T last = default(T);
return enumerable.GroupBy((value) => { i = comparator.Compare(value, last) > 0 ? i : i+1; last = value; return i; }).Select((group) => group.Select((_) => _));
}
Taken from some random utility code for partitioning IEnumerable's into a makeshift table for logging. If I recall properly, the odd ending Select is to prevent ambiguity when the input is an enumeration of strings.
Here's a custom LINQ operator which splits a sequence according to just about any criteria. Its parameters are:
xs: the input element sequence.
func: a function which accepts the "current" input element and a state object, and returns as a tuple:
a bool stating whether the input sequence should be split before the "current" element; and
a state object which will be passed to the next invocation of func.
initialState: the state object that gets passed to func on its first invocation.
Here it is, along with a helper class (required because yield return apparently cannot be nested):
public static IEnumerable<IEnumerable<T>> Split<T, TState>(
this IEnumerable<T> xs,
Func<T, TState, Tuple<bool, TState>> func,
TState initialState)
{
using (var splitter = new Splitter<T, TState>(xs, func, initialState))
{
while (splitter.HasNext)
{
yield return splitter.GetNext();
}
}
}
internal sealed class Splitter<T, TState> : IDisposable
{
public Splitter(IEnumerable<T> xs,
Func<T, TState, Tuple<bool, TState>> func,
TState initialState)
{
this.xs = xs.GetEnumerator();
this.func = func;
this.state = initialState;
this.hasNext = this.xs.MoveNext();
}
private readonly IEnumerator<T> xs;
private readonly Func<T, TState, Tuple<bool, TState>> func;
private bool hasNext;
private TState state;
public bool HasNext { get { return hasNext; } }
public IEnumerable<T> GetNext()
{
while (hasNext)
{
Tuple<bool, TState> decision = func(xs.Current, state);
state = decision.Item2;
if (decision.Item1) yield break;
yield return xs.Current;
hasNext = xs.MoveNext();
}
}
public void Dispose() { xs.Dispose(); }
}
Note: Here are some of the design decisions that went into the Split method:
It should make only a single pass over the sequence.
State is made explicit so that it's possible to keep side effects out of func.

Resources