Data Structure to Implement Text Editor?

Recently I was asked this question in an interview. The exact question was:
What data structures would you use to implement a text editor? The size of the editor can change, and you also need to save styling information (italic, bold, etc.) for all the text.
At the time, I tried to convince him with several different approaches: a stack, a doubly linked list, and so on.
The question has been bugging me ever since.

It looks like they wanted to know whether you were aware of the flyweight pattern and how to use it correctly.
A text editor is a common example used when describing that pattern.
Maybe your interviewer was a fan of the GoF book. :-)
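For illustration, here is a minimal Ruby sketch of the flyweight idea (all names here are made up): every character references a shared style object instead of carrying its own copy.

```ruby
# Minimal flyweight sketch (hypothetical names): each distinct style is
# created once and shared by every character that uses it.
class StyleFactory
  def initialize
    @styles = {}
  end

  # Returns a shared, frozen style object for the given attributes.
  def style_for(font:, bold: false, italic: false)
    key = [font, bold, italic]
    @styles[key] ||= { font: font, bold: bold, italic: italic }.freeze
  end
end

factory = StyleFactory.new
# A character stores its glyph plus a reference to a shared style, so a
# million bold characters share a single style object.
char = { glyph: 'a', style: factory.style_for(font: 'Serif', bold: true) }
puts char[:style].equal?(factory.style_for(font: 'Serif', bold: true)) # => true (shared)
```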

In addition to the previous answers, I would like to add that in order to choose the data structures, you first need to know your design - otherwise the set of options is too broad.
As an example, let's assume that you'll need editing with undo functionality. Here the State and Memento design patterns are a good fit. A very suitable structure is the rope (also called a cord), since it is composed of smaller strings and is used for efficiently storing and manipulating a very long string. In our case, the text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.
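To make that concrete, here is a toy rope sketch in Ruby (not a production implementation; a real rope also supports splitting and rebalancing): leaves hold short strings, and internal nodes cache the length of their left subtree so indexing can skip whole subtrees.

```ruby
# Toy rope: leaves hold short strings, internal nodes cache the length
# of their left subtree so char_at can skip whole subtrees.
class Leaf
  def initialize(text)
    @text = text
  end

  def length
    @text.length
  end

  def char_at(i)
    @text[i]
  end
end

class Node
  def initialize(left, right)
    @left, @right = left, right
    @weight = left.length # cached length of the left subtree
  end

  def length
    @weight + @right.length
  end

  def char_at(i)
    i < @weight ? @left.char_at(i) : @right.char_at(i - @weight)
  end
end

# Concatenation is just creating a new node; no copying of the strings.
rope = Node.new(Leaf.new('Hello, '), Leaf.new('world'))
puts rope.char_at(7) # => "w"
```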

An open-ended question like this is designed more to see if you can think cogently about making a design that hangs together well, rather than having one, specific answer.
One specialized answer to the question is to use DOM/XML ("Document Object Model"). Markup "languages" are intended to solve this exact problem. You could store the data for the editor in a DOM. One of the advantages of using a DOM is that there are libraries like Xerces that have extensive support for building and managing DOMs, so a lot of your work is done for you. It is possible the interviewer intended this to be the ideal answer.
A more general answer is that any nested sequence structure can be used. The text can be seen as a sequence of strings. Each element of the sequence, like a row in a database, can have multiple attributes (font type, font size, italic, bold, strikethrough, etc.). Nesting (hierarchy) is useful because the document might have structure such as chapters, sections, and paragraphs. For example, if a paragraph has its own styling (indent), then it may need its own level. So you have something like this:
Document
  Chapter
    Paragraph
      Text
To implement this, you would use a tree and each node of the tree would have multiple attributes. You would require different kinds of nodes (Chapter nodes, Paragraph nodes, etc). So, for example, a document like a paper would have multiple Section nodes and a Notes node inside a Document node, but a book-like document might have Chapter nodes inside a document node. The advantage of this approach is that it is more specific and hand-tailored to the problem than using a DOM, which is a more flexible approach.
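A minimal Ruby sketch of such a node tree (the node kinds and attributes here are illustrative):

```ruby
# Illustrative document tree: every node carries a kind, styling
# attributes, and children; :text nodes carry the actual text.
DocNode = Struct.new(:kind, :attributes, :children, :text) do
  # Walks the tree and yields each text fragment with its attributes.
  def each_text(&block)
    block.call(text, attributes) if kind == :text
    children.each { |c| c.each_text(&block) }
  end
end

para = DocNode.new(:paragraph, { indent: 2 }, [
  DocNode.new(:text, { bold: true }, [], 'A bold sentence.')
], nil)
doc = DocNode.new(:document, {}, [DocNode.new(:chapter, {}, [para], nil)], nil)

doc.each_text { |t, attrs| puts "#{t.inspect} #{attrs.inspect}" }
```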
You could also combine the two approaches, using a DOM as your base structure and the hierarchical structure described above as your DOM implementation.
(Note: in the future you should post questions like this to https://softwareengineering.stackexchange.com/)

Related

Custom dendrogram in D3

Maybe this is not the place for this question, but perhaps someone here is an experienced user of D3.js.
I would like to create a dendrogram where I initially show nodes from different levels (precomputed) and nodes are colored differently. The nodes have different tooltips for the colored part and for the grey part.
I would also like to place a heatmap beside it.
Do you think combining those things is possible in D3?
Since this would be quite a lot of work, I would like to know whether it is reasonable to even start.
Part of the result I'm aiming for is here:
The short answer to your question is yes.
I'm looking into the same sort of problem/challenge and found a very nice example that almost exactly does what you describe: https://github.com/MaayanLab/clustergrammer
Since the solution involves 10k+ lines of code and this is not a simple 'use this to do this' case, I'm not providing code excerpts (for details see their GitHub). In short: it uses D3 libraries plus JavaScript code for dynamic plotting, zooming, and sorting of the heatmap and a collapsed dendrogram. It loads (meta)data from a pre-computed JSON file that contains the information on clusters and some metadata.
I understand from your question that you would rather avoid pre-computed input. This is also the case for the application that I am building. I'm looking into generalising the generation of the JSON file from an SQL query, which can then hook up to the clustergrammer.js code. I will update this thread if I find out more or arrive at a different, working solution that does everything on the fly.

Grammar parsing in Ruby

I have a task ahead of me which relies on interpreting the structure of a text - to be precise, a monolingual dictionary. The dictionary has quite complex entries: up to 29 unique elements, and some are nested within others. I am designing my own XML schema for the dictionary, but I would like to write a program that parses the plain text I have automatically.
I have some basic skills in Ruby and I am a rather experienced regex user, but I think creating lots of if-trees and extremely long regex formulas is probably not the best idea. I have found some information on Parsing Expression Grammars, Backus Normal Form, and W-grammars, but it is somewhat unclear which applies best to what.
My question is: what is the best way to interpret the structure of a text written in a natural language? I don't want to interpret the language itself, but rather to divide each entry into segments based on the characters and keywords used, as well as their neighborhood. What gems and resources would you suggest?
Edit: here's an example of a moderately simple entry from the dictionary (in Polish). What I want to do is to tag each element (senses, explanations, collocation, label markers etc.). As you can see, I am looking for an efficient way to encompass a large number of cases in a tree-like form.
Another problem is that I want to have lots of captures, as I want to tag the segments in XML from largest to smallest.
This looks like a problem that would be well suited to Treetop. I don't think I have enough information to be sure it will work, but being able to combine regular expressions into a larger structure, where each of the 29 elements can be managed and its information extracted and represented using any of Ruby's features as appropriate, seems like the sort of feature set you need.
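For instance, a tiny Treetop grammar for a made-up 'headword: definition' entry format might look like this (the format is invented for illustration; each of your 29 elements would get its own rule):

```ruby
require 'treetop' # gem install treetop

# A minimal grammar for a hypothetical entry format "headword: definition".
Treetop.load_from_string(<<-'GRAMMAR')
  grammar DictEntry
    rule entry
      headword ':' space definition
    end

    rule headword
      [a-zA-Z]+
    end

    rule definition
      [^\n]+
    end

    rule space
      ' '*
    end
  end
GRAMMAR

parser = DictEntryParser.new
result = parser.parse('kot: a domesticated cat')
puts result ? 'parsed' : parser.failure_reason
```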

Organizing large collection of things - usability and UI point of view

In our application we have a repository that contains things (they are called methods and queries, but this is not particularly relevant for this question). Each thing has a title, description (though some may lack both) and some other data. Users save things to repository and load and use things from repository.
I wonder what the best way is to organize the repository from a usability point of view. There seem to be two major approaches. The first approach is to put things in folders, subfolders and so on, in a hierarchical structure similar to a filesystem. The second approach (which has become fashionable) is to have a flat space and assign zero or more tags to each thing, so that users can view a list of things for a particular tag.
Currently we use flat space, tags and search. It appears to be somewhat unmanageable. I am not sure if switching to folders/subfolders will make it better.
I would like to learn more about the pros and cons of each approach and what properties of the collection and the things themselves suggest using one or another approach or a combination of both. If anybody can point me to some studies or discussions of those, I would really appreciate that.
There is no reason you can't use both methods. To some extent, finding things depends on what the thing is and why it is being looked for. A hierarchical design can work well when somebody knows what they are looking for, and a tag/keyword-based system can work better when the structure is less obvious.
A network structure that links similar things can also be very good, as you can see with the Internet or Wikipedia.
I use the law of symmetry to help me in this situation.
First you build the tree like structure in the back end and then build the tagging system for the front end.
You use both to organize your data collection.
A tag cloud works better than a hierarchy if
the taxonomy is uncertain
("Now is this a small car or a large truck?")
there is no central authority for classification
there is no obvious or natural order between the classes
(cars can be classified by color or by size, there is no obvious rank between color and size)
new categories may be created on the fly
Otherwise, a hierarchy gives more confidence in completeness, as every item has exactly one obviously correct location: did I find all documents about birds? Is there really no document about five-story houses?
Tag clouds need some maintenance, I am not sure if this can be completely user-provided:
Dealing with synonyms, tag synonyms, merging tags, clarifying tags (e.g. is "blue" a feeling or a color?)
Another option is attribute-value pairs. They can be built on top of a well-maintained tag cloud, e.g. grouping "red / black / blue" tags under "color". They can also work with continuous values, and search can be extended to similar values when there are not enough exact results (such as age, date, even multidimensional values like color); see the sketch below.
However, this requires knowing ahead of time what search criteria users will need. If you need to introduce a new category, you have to re-tag the entire body of documents.
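A rough sketch of the attribute-value idea over an in-memory collection (the attributes here are invented):

```ruby
# Sketch: attribute-value search, mixing an exact match with a range match.
items = [
  { name: 'query A',  color: 'red',  created: 2019 },
  { name: 'method B', color: 'blue', created: 2021 },
]

hits = items.select { |i| i[:color] == 'blue' && (2020..2022).cover?(i[:created]) }
hits.each { |i| puts i[:name] } # prints "method B"
```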
See also my request for clarification: what are the problems? Not enough tagging? Tagging too fine-grained? Users not finding what they are looking for? Users not confident in the search results?

What should I do with an over-bloated select-box/drop-down

All web developers run into this problem when the amount of data in their project grows, and I have yet to see a definitive, intuitive best practice for solving it. When you start a project, you often create forms with select tags to help pick related objects for one-to-many relationships.
For instance, I might have a system with Neighbors, where each Neighbor belongs to a Neighborhood. In version 1 of the application I create an edit-user form with a drop-down that simply lists the 5 possible neighborhoods in my geographically limited application.
In the beginning, this works great. As long as I have maybe 100 records or fewer, my select box loads quickly and is fairly easy to use. However, let's say my application takes off and goes national. Instead of 5 neighborhoods I have 10,000. Suddenly my little drop-down takes forever to load, and once it loads, it's hard to find your neighborhood in the massive alphabetically sorted list.
Now, in this particular situation, having hierarchical data and letting users drill down using several dynamically generated drop-downs would probably work okay. However, what is the best solution when the objects/records being selected are not hierarchical in nature? In the past, I've done this with a popup with a search box and a list, but this seems clunky and dated. In today's web 2.0 world, what is a good way to find one object amongst many for one's forms?
I've considered using an Ajaxified search box, but this seems to work best for free text, and falls apart a little when the data to be saved is just a reference to another object or record.
Feel free to cite specific libraries with generic solutions to this problem, or simply share what you have done in your projects in a more general way.
I think an auto-completing text box is a good approach in this case. Here on SO, they also use an auto-completing box for tags where the entry already needs to exist, i.e. not free-text but a selection. (remember that creating new tags requires reputation!)
I personally prefer this anyway, because I can type faster than I can select something with the mouse, but that is programmer's disease, I guess :)
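To sketch how the saved value can stay a reference rather than free text: the autocomplete endpoint returns id/label pairs and the form submits only the id. Sinatra is used here purely for illustration; any framework works the same way.

```ruby
require 'sinatra' # gem install sinatra
require 'json'

# Hypothetical data source; in practice this would be a database query
# with a LIMIT, so 10,000 neighborhoods never reach the browser at once.
NEIGHBORHOODS = [
  { id: 1, label: 'Old Town'  },
  { id: 2, label: 'Riverside' },
]

get '/neighborhoods' do
  q = params['q'].to_s.downcase
  content_type :json
  # Return id/label pairs; the form saves the id, not the typed text.
  NEIGHBORHOODS.select { |n| n[:label].downcase.include?(q) }
               .first(10)
               .to_json
end
```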
Auto-complete is usually the best solution in my experience for searches, but only where the user is able to provide text tokens easily, either as part of the object name or taxonomy that contains the object (such as a product category, or postcode).
However this doesn't always work, particularly where 'browse' behavior would be more suitable - to give a real example, I once wrote a page for a community site that allowed a user to send a message to their friends. We used auto-complete there, allowing multiple entries separated by commas.
It works great when you know the names of the people you want to send the message to, but we found during user acceptance that most people didn't really know who was on their friend list and couldn't use the page very well - so we added a list popup with friend icons, and that was more successful.
(this was quite some time ago - everyone just copies Facebook now...)
Different methods of organizing large amounts of data:
Hierarchies
Spatial (geography/geometry)
Tags or facets
Different methods of searching large amounts of data:
Filtering (including autocomplete)
Sorting/paging (alphabetically-sorted data can also be paged by first letter)
Drill-down (assuming the data is organized as above)
Free-text search
Hierarchies are easy to understand and (usually) easy to implement. However, they can be difficult to navigate and lead to ambiguities. Spatial visualization is by far the best option if your data is actually spatial or can be represented that way; unfortunately this applies to less than 1% of the data we normally deal with day-to-day. Tags are great, but - as we see here on SO - can often be misused, misunderstood, or otherwise rendered less effective than expected.
If it's possible for you to reorganize your data in some relatively natural way, then that should always be the first step. Whatever best communicates the natural ordering is usually the best answer.
No matter how you organize the data, you'll eventually need to start providing search capabilities, and unlike organization of data, search methods tend to be orthogonal - you can implement more than one. Filtering and sorting/paging are the easiest, and if an autocomplete textbox or paged list (grid) can achieve the desired result, go for that. If you need to provide the ability to search truly massive amounts of data with no coherent organization, then you'll need to provide a full textual search.
If I could point you to some list of "best practices", I would, but human interface design is rarely so clear-cut. Use the aforementioned options as a starting point and see where that takes you.

How do I data mine text?

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? In general, I need two things:
I don't know what I'm looking for, so I need a program to find the most used words and multi-word phrases ("Jacob Smith" or "bluewater inn" or "arrow").
Then, knowing the keywords, I need a program to help me search for related paragraphs, then sort and refine the results (manually, by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made up of many components:
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
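The component-pipeline idea itself is easy to sketch. The toy below is not UIMA (UIMA components are written in Java or C++); it only illustrates the decomposition and data flow:

```ruby
# Toy analysis pipeline: each stage is a callable that annotates a
# shared document hash and passes it on, mimicking UIMA-style data flow.
stages = [
  ->(doc) { doc[:language]  = 'en'; doc },                                        # language identification (stubbed)
  ->(doc) { doc[:sentences] = doc[:text].split(/(?<=\.)\s+/); doc },              # sentence boundary detection
  ->(doc) { doc[:entities]  = doc[:text].scan(/[A-Z][a-z]+ [A-Z][a-z]+/); doc },  # crude person-name detection
]

doc = { text: 'Jacob Smith stayed at the bluewater inn. He left early.' }
result = stages.reduce(doc) { |d, stage| stage.call(d) }
p result[:entities] # => ["Jacob Smith"]
```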
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty -- create an inverted index that stores all locations of words (basically a big map of words to all file ids in which they occur, paragraphs in those files, lines in the paragraphs, etc). Also index tuples so that given a fileid and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
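A rough Ruby sketch of such an inverted index, mapping each word to the (file, paragraph) locations where it occurs:

```ruby
# Quick-and-dirty inverted index: word => list of [file_id, para_no].
index = Hash.new { |h, k| h[k] = [] }

# Stand-in corpus; in practice, read each file and split into paragraphs.
documents = {
  'file1.txt' => ['Jacob Smith owns the bluewater inn.',
                  'The inn is near the river.'],
}

documents.each do |file_id, paragraphs|
  paragraphs.each_with_index do |para, para_no|
    para.downcase.scan(/[a-z]+/).uniq.each do |word|
      index[word] << [file_id, para_no]
    end
  end
end

p index['inn'] # => [["file1.txt", 0], ["file1.txt", 1]]
```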
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
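For the frequency count itself, a one-pass Ruby sketch with a small stop-word list (the stop-word set here is illustrative, not complete):

```ruby
require 'set'

# A few illustrative stop words; real lists are much longer.
STOP_WORDS = %w[the a an of and to in is].to_set

text = 'the bluewater inn near the arrow and the river' # stand-in for a file's contents
counts = Hash.new(0)
text.downcase.scan(/[a-z]+/) { |word| counts[word] += 1 unless STOP_WORDS.include?(word) }

# Most frequent words first.
counts.sort_by { |_, n| -n }.first(20).each { |word, n| puts "#{n}\t#{word}" }
```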
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
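That intuition, that rare words carry the signal, is exactly what TF-IDF captures: a term's weight grows with its frequency within one paragraph and shrinks with the number of paragraphs containing it. A minimal sketch:

```ruby
# Minimal TF-IDF over paragraphs: rare-but-repeated terms score highest.
paras = [
  'the bluewater inn is by the water',
  'the water was cold',
  'the arrow missed',
]
tokenized = paras.map { |p| p.downcase.scan(/[a-z]+/) }

# Document frequency: in how many paragraphs does each word appear?
df = Hash.new(0)
tokenized.each { |words| words.uniq.each { |w| df[w] += 1 } }

n = paras.size
tokenized.each_with_index do |words, i|
  # tf * log(N / df): "the" scores 0, "bluewater" scores highest.
  scores = words.tally.map { |w, tf| [w, tf * Math.log(n.to_f / df[w])] }.to_h
  top = scores.max_by { |_, s| s }
  puts "para #{i}: #{top[0]} (#{top[1].round(2)})"
end
```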
Why don't you tell us more about your application and process, and then perhaps we can help you better?
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysing postings from 2006 on the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/).
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages listed in the Natural Language Processing task view on CRAN.
