Finding personal information in documents (hard problem) - algorithm

I am tasked with trying to create an automated system that removes personal information from text documents.
Emails and phone numbers are relatively easy to remove. Names are not. The problem is hard because there are names in the documents that need to be kept (e.g., references, celebrities, characters). The author name needs to be removed from the content (there may also be more than one author).
I have currently thought of the following:
Quite often personal names are located at the beginning of a document
Look at how frequently the name is used in the document (personal names tend to be written just once)
Search for words around the name to find patterns (mentions of university and so on...)
Any ideas? Has anyone solved this problem already?

With current technology, doing what you are describing in a fully automated way with a low error rate is impossible.
It might be possible to come up with an approximate solution, but it would still make a lot of errors: either false positives or false negatives or some combination of the two.
If you are still really determined to try, I think your best approach would be Bayesian filtering (as used in spam filtering). The reason for this is that it is quite good at assigning probabilities based on relative positions and frequencies of words, and could also learn which names are more likely / less likely to be celebrities etc.
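To make the Bayesian idea a bit more concrete, here is a minimal sketch of a naive Bayes classifier over the words surrounding a name. The training examples, the labels "remove" and "keep", and the function names are all made up for illustration; a real system would need a large annotated corpus and would still make mistakes.

```python
import math
from collections import Counter

# Hypothetical training data: words seen near a name, labelled by whether the
# name was an author ("remove") or a reference/celebrity ("keep").
TRAINING = [
    ("remove", "written by student of university department"),
    ("remove", "author submitted by university course"),
    ("keep", "according to the theory proposed by as cited in"),
    ("keep", "the film starring directed by famous actor"),
]


def train(examples):
    """Count word occurrences per label."""
    word_counts = {}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(context, model):
    """Return the most probable label for the words surrounding a name."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in context.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("student at the university", model))  # remove
print(classify("the film directed by", model))       # keep
```

The classifier only looks at context words here; position in the document and name frequency could be added as extra features in the same way.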

The area of machine learning that you would need to learn about to make an attempt at this would be natural language processing. There are a few different approaches that could be used: Bayesian networks (something better than a naive Bayes classifier), support vector machines, or neural nets would be areas to research. Whatever system you end up building would probably need to use an annotated corpus (labeled set of data) to learn where names should be. Even with a large corpus, whatever you build will not be 100% accurate, so you would probably be better off setting flags at the names for deletion instead of just deleting all of the words that might be names.

This is a common problem in basic cryptography courses (my first programming job).
If you generated a word histogram of your entire document corpus (each bin on the x-axis is a word, and its height on the y-axis is that word's frequency), words like "this", "the", "and" and so forth would be easy to identify because of their large y-values. With the bins sorted by decreasing frequency, surnames should be at the far right of your histogram (very infrequent), and given names a little to the left of them, but not by much.
Does this technique definitively identify the names in each document? No, but it could be used to substantially constrain your search, by eliminating all words whose frequency is larger than X. Likewise, there should be other attributes that constrain your search, such as author names only appear once on the documents they authored and not on any other documents.
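A rough sketch of that filtering step, assuming the documents are available as plain strings. The function name, the tokenising regex and the `max_count` threshold (the "X" above) are my own choices for illustration. It keeps, per document, only words that are rare across the corpus and appear in no other document.

```python
import re
from collections import Counter


def candidate_names(documents, max_count=1):
    """For each document, return the words that survive the frequency filter."""
    tokenised = [re.findall(r"[a-z']+", text.lower()) for text in documents]
    total = Counter()     # occurrences across the whole corpus
    doc_freq = Counter()  # number of documents containing the word
    for words in tokenised:
        total.update(words)
        doc_freq.update(set(words))
    return [
        {w for w in set(words) if total[w] <= max_count and doc_freq[w] == 1}
        for words in tokenised
    ]


docs = [
    "Essay by Alice Smith. The cat sat on the mat.",
    "Essay by Bob Jones. The dog sat on the mat.",
]
print(candidate_names(docs))
```

On this toy input "cat" and "dog" survive alongside the names, which shows the point made above: the filter narrows the search, it does not identify names by itself.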


How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Beside other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was thinking of using some metric which can calculate the similarity between two terms, e.g. in percent, and then clustering everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking a Hamming distance divided by word length will be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the approach of Gabrilovich and Markovitch.
A quicker way will be consisting of the following steps:
download wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are written similarly to each other) then you can settle for Levenshtein distance, but it won't be able to give you semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
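For the spelling-only case, a small sketch of Levenshtein distance turned into a 0..1 similarity, plus a simple threshold-based grouping. The normalisation by the longer word's length, the greedy grouping and the 0.75 threshold are illustrative assumptions, not a recommendation from any library. It is written in Python for brevity; the same few functions port directly to C#.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 for unrelated ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(terms, threshold=0.75):
    """Group terms; a term joins (and merges) every group it is close to."""
    clusters = []
    for term in terms:
        merged, rest = [term], []
        for group in clusters:
            if any(similarity(term, other) >= threshold for other in group):
                merged.extend(group)
            else:
                rest.append(group)
        clusters = rest + [merged]
    return clusters


print(cluster(["mouse", "house", "rodent", "hematopoiesis", "haematopoiesis"]))
```

With these inputs "mouse" and "house" end up together, "rodent" stays alone, and the two spellings of hematopoiesis are grouped, which is the behaviour the question asks for. The comparison is quadratic in the number of terms, which should be fine for about 3500 entries.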

Fuzzy logic text transformation methodologies?

I have a large set of data (several hundred thousand records) that are unique entries in a CSV. These entries are essentially products that are being listed in a store from a vendor that offers these products. The problem is that while they offer us rights to copy these verbatim or to change wording, I don't want to list them verbatim obviously since Google will slap the ranking for having "duplicate" content. And then, also obviously, manually editing 500,000 items would take a ridiculous amount of time.
The solution, it would seem, would be to leverage fuzzy logic that would take certain phraseology and transform it to something different that would not then be penalized by Google. I have hitherto been unable to find any real library to address this or a solid solution that addresses such a situation.
I am thinking through my own algorithms to perhaps accomplish this, but I hate to reinvent the wheel or, worse, be beaten down by the big G after a failed attempt.
My idea is to simply search for various phrases and words (sans stop words) and then essentially map those to phrases and words that can be randomly inserted that still have equivalent meaning, but enough substance to hopefully not cause a deranking situation.
A solution for Ruby would be optimal, but absolutely not necessary as any language can be used.
Are there any existing algorithms, theories or implementations of a similar scenario that could be used to model or solve such a scenario?
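To illustrate the phrase-mapping idea described above, here is a minimal sketch that swaps words for randomly chosen equivalents. The synonym table is entirely hypothetical and would have to be built for the product domain. The sketch is in Python rather than Ruby, but the approach is the same in any language.

```python
import random
import re

# Hypothetical synonym table; a real one would be far larger and domain-specific.
SYNONYMS = {
    "durable": ["sturdy", "long-lasting"],
    "cheap": ["affordable", "inexpensive"],
    "fast": ["quick", "rapid"],
}


def reword(text, synonyms, rng):
    """Replace each known word with a randomly chosen equivalent."""
    def swap(match):
        word = match.group(0)
        options = synonyms.get(word.lower())
        if not options:
            return word  # unknown words (including stop words) stay as they are
        replacement = rng.choice(options)
        # Preserve a leading capital letter.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


print(reword("Cheap and durable bag", SYNONYMS, random.Random(0)))
```

Passing in the random generator makes the output reproducible per record, for example by seeding it with the product ID.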

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01], ["grow", 0.1], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is tough than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate words popularity that could indicate its difficulty. However, both solutions give different results depending on the form of entered words. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves the ambiguity problem (and makes it easy to create dictionary links), however it's not a simple task and its sense could be found arguable. Obviously fond should be taken instead of fonder, however investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a moment when modifying a basic word creates a new one, like -> likely) and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be considered a basic word can be found here on Wikipedia but maybe a simpler way of determining it would be to use a dictionary. For instance according to unbelievable is a basic word whereas fonder comes from fond but then grow is not the same as growing
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might be (probably is) a simpler and better way or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put in practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kinds of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes, and keep the derivational ones, since derivational affixes can change how common the word is considered to be. (Think about it this way: if I tell you a new word in English, you will always know how to make it plural or 3rd-person singular; however, you may not know some of the other words you can derive from it.) English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through Google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural `-s': "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: they take -ed for both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
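The suffix list above can be sketched as a small rule-based stripper. Everything named here is hypothetical: the tiny `BASE_FORMS` dictionary stands in for a real word list, and a stripped form is accepted only if it is found there. Note that spelling alone cannot tell comparative "-er" from derivational "-er", so "driver" only survives because it is listed in the dictionary itself. Irregular forms are left untouched.

```python
# Hypothetical mini-dictionary of known base forms; use a real word list in practice.
BASE_FORMS = {"absence", "make", "the", "heart", "grow", "fond",
              "drive", "driver", "walk", "wug"}

# Inflectional suffixes from the list above.
SUFFIXES = ["est", "ing", "ed", "er", "es", "s"]


def strip_inflection(word, base_forms):
    """Return the base form if stripping one inflectional suffix yields a known word."""
    word = word.lower()
    if word in base_forms:
        return word
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        bare = word[:-len(suffix)]
        # Try the bare form ("walked" -> "walk") and the form with a restored
        # final "e" ("drives" -> "driv" -> "drive").
        for candidate in (bare, bare + "e"):
            if candidate in base_forms:
                return candidate
    return word  # unknown or irregular: leave unchanged


sentence = "Absence makes the heart grow fonder"
print([strip_inflection(w, BASE_FORMS) for w in sentence.split()])
# ['absence', 'make', 'the', 'heart', 'grow', 'fond']
```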
These are the main inflectional affixes in English: I may be forgetting a few that you could discover by picking up an introductory Linguistics book. Also there are going to be borderline cases, such as "un-" which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides Google you could use various freely-available text corpora. The Wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
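One possible way to turn such a frequency count into the 0-to-1 difficulty score the question asks for, sketched under my own assumptions: the log scaling and the rule that unseen stems get 1.0 are arbitrary choices, and the function names are made up.

```python
import math
from collections import Counter


def difficulty_scores(corpus_stems):
    """Map each stem to a score in [0, 1]; the most frequent stem gets 0.0."""
    counts = Counter(corpus_stems)
    if not counts:
        return {}
    max_log = math.log(1 + max(counts.values()))
    # Log scaling keeps very common words from flattening everything else.
    return {stem: 1.0 - math.log(1 + n) / max_log for stem, n in counts.items()}


def difficulty(stem, scores):
    """Stems never seen in the corpus are treated as maximally difficult."""
    return scores.get(stem, 1.0)


# Toy "corpus" of already-stemmed words, purely for illustration.
corpus = ["the"] * 50 + ["make"] * 10 + ["fond"] * 2 + ["absence"]
scores = difficulty_scores(corpus)
print(sorted(scores, key=scores.get))  # easiest first
```

Treating unseen stems as difficult matches the stated preference for erring on the side of "tough".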
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, which covered language analysis. You need some database from which you can get them.
At the same time, please take note that the number of words people use in everyday language is not that big. You can always ask a user what the base form is for a word you have not seen before (unless this is your homework, which will be automatically checked).
Eventually, if you don't care about covering all words, you can create simple database, which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note however, I'm no specialist, I'm simply trying to help :-)

Latent Semantic Indexing

It is said that through LSI, the matrices that are produced (U, A and V) bring together documents which have synonyms. E.g., if we search for "car", we also get documents which have "automobile". But LSI is nothing but manipulations of matrices. It only takes into account the frequency, not semantics. So what's the thing behind this magic that I am missing? Please explain.
LSI basically creates a frequency profile of each document, and looks for documents with similar frequency profiles. If the remainder of the frequency profile is enough alike, it'll classify two documents as being fairly similar, even if one systematically substitutes some words. Conversely, if the frequency profiles are different, it can/will classify documents as different, even if they share frequent use of a few specific terms (e.g., "file" being related to a computer in some cases, and a thing that's used to cut and smooth metal in other cases).
LSI is also typically used with relatively large groups of documents. The other documents can help in finding similarities as well -- even if document A and B look substantially different, if document C uses quite a few terms from both A and B, it can help in finding that A and B are really fairly similar.
According to the Wikipedia article, "LSI is based on the principle that words that are used in the same contexts tend to have similar meanings." That is, if two words seem to be used interchangeably, they might be synonyms.
It's not infallible.
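A toy illustration of why this works, using a made-up term-document matrix in which "car" and "automobile" never occur in the same document but both occur alongside "engine" and "wheel". The sketch computes a rank-2 truncated SVD with plain power iteration so that it needs no external library (real implementations would use a linear algebra package), then compares term vectors in the reduced space. The matrix, the function names and the choice of k = 2 are all assumptions for the example.

```python
import math

# Hypothetical toy data: rows are terms, columns are four documents.
MATRIX = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [1, 1, 0, 0],  # wheel
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
]


def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def top_right_singular_vectors(matrix, k, iterations=200):
    """Power iteration on A^T A, orthogonalised against vectors already found."""
    n_docs = len(matrix[0])
    found = []
    for _ in range(k):
        v = normalise([float(i + 1) for i in range(n_docs)])  # deterministic start
        for _ in range(iterations):
            av = [sum(row[j] * v[j] for j in range(n_docs)) for row in matrix]
            v = [sum(matrix[i][j] * av[i] for i in range(len(matrix)))
                 for j in range(n_docs)]
            for u in found:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            v = normalise(v)
        found.append(v)
    return found


def term_vectors(matrix, k):
    """Project each term (row) onto the top-k latent dimensions."""
    basis = top_right_singular_vectors(matrix, k)
    return [[sum(row[j] * v[j] for j in range(len(row))) for v in basis]
            for row in matrix]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


reduced = term_vectors(MATRIX, k=2)
print(cosine(MATRIX[0], MATRIX[1]))    # raw rows: car vs automobile share no document
print(cosine(reduced[0], reduced[1]))  # reduced space: car vs automobile
print(cosine(reduced[0], reduced[4]))  # reduced space: car vs flower
```

In the raw matrix the similarity between "car" and "automobile" is zero; after the reduction they point in essentially the same direction because they share the same neighbours, while "car" and "flower" stay unrelated. No semantics is involved, only co-occurrence patterns.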

how to get the similar texts from a lot of pages?

I want to get the x texts most similar to one given text, out of a lot of texts.
Maybe converting the pages to plain text first is better.
The solution should not compare the text to every other text, because that is too slow.
The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. This topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.
By describing the specific problem at hand and associated requirements, it may be possible to provide you more guidance. In the meantime the following provides a few general ideas.
Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem and the level of tolerance desired for noise.
Some of the simpler metrics are:
length of the longest common sequence of words
number of common words
number of common sequences of words of more than n words
number of common words for the top n most frequent words within each document.
length of the document
Some of the metrics above work better when normalized (for example to avoid favoring long pages which, through their sheer size have more chances of having similar words with other pages)
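As an example of such normalization, the raw "number of common words" can be divided by the number of distinct words in either text, which gives the Jaccard index. This is a sketch; the tokenising regex is an assumption.

```python
import re


def words(text):
    """Set of lower-cased words in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def common_word_count(a, b):
    """Raw metric: grows with document length."""
    return len(words(a) & words(b))


def jaccard(a, b):
    """Normalised metric in [0, 1]: shared words over all distinct words."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


a = "the cat sat on the mat"
b = "the cat lay on the rug"
print(common_word_count(a, b), jaccard(a, b))
```

Here the two sentences share 3 of 7 distinct words. A much longer page containing the same three words would have the same raw count but a lower normalised score.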
More complicated and/or computationally expensive measurements are:
Edit distance (which is in fact a generic term as there are many ways to measure the Edit distance. In general, the idea is to measure how many [editing] operations it would take to convert one text to the other.)
Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters)
Linear algebra-based measurements
Statistical methods such as Bayesian filters
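For the Ratcliff/Obershelp-style measurement on words, Python's standard `difflib.SequenceMatcher` (whose documentation describes its algorithm as similar to Ratcliff/Obershelp "gestalt pattern matching") can be fed lists of words instead of characters. A minimal sketch, with the whitespace tokenisation as an assumption:

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Ratio in [0, 1]: twice the number of matched words over total words."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


print(word_similarity("the quick brown fox", "the quick red fox"))  # 0.75
```

This is a pairwise measure, so it is better suited to re-ranking a short list of candidates than to scanning a whole collection.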
In general, we can distinguish measurements/algorithms where most of the calculation can be done once for each document, followed by an extra pass aimed at comparing or combining these measurements (with relatively little extra computation), as opposed to the algorithms that require dealing with the documents to be compared in pairs.
Before choosing one (or indeed several such measures, along with some weighing coefficients), it is important to consider additional factors, beyond the similarity measurement per se. For example, it may be beneficial to...
normalize the text in some fashion (in the case of web pages, in particular, similar page contents, or similar paragraphs are made to look less similar because of all the "decorum" associated with the page: headers, footers, advertisement panels, different markup etc.)
exploit markup (ex: giving more weight to similarities found in the title or in tables, than similarities found in plain text).
identify and eliminate domain-related (or even generally known) expressions. For example two completely different documents may appear similar if they have in common two "boiler plate" paragraphs pertaining to some legal disclaimer or some general purpose description, not truly associated with the essence of each document's content.
Tokenize texts, remove stop words and arrange in a term vector. Calculate tf-idf. Arrange all vectors in a matrix and calculate distances between them to find similar docs, using for example Jaccard index.
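A compact sketch of that pipeline. Two things here are my own substitutions: the stop-word list is a tiny made-up one, and the vectors are compared with cosine similarity rather than the Jaccard index, since Jaccard is defined on sets while tf-idf vectors are weighted.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "on", "of", "and", "is", "in"}  # hypothetical list


def tokenize(text):
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(documents):
    """One sparse {term: weight} vector per document."""
    token_lists = [tokenize(d) for d in documents]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n = len(documents)
    return [
        {t: count * math.log(n / doc_freq[t]) for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def most_similar(index, documents, top=1):
    """Indices of the documents closest to documents[index]."""
    vectors = tfidf_vectors(documents)
    scored = [(cosine(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scored, reverse=True)[:top]]


docs = [
    "the cat sat on the mat",
    "a cat sat on a rug",
    "stock markets fell sharply",
    "markets and stock prices fell",
]
print(most_similar(0, docs))  # [1]
print(most_similar(2, docs))  # [3]
```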
All depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the text that contains them, and put all trigrams from all of your texts into that map. Then when you get your text to be matched, look up all its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
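A sketch of that trigram map, assuming word trigrams (character trigrams would work the same way) and whitespace tokenisation. The lookup only touches texts that share at least one trigram with the query, so the query is never compared against every text.

```python
from collections import Counter, defaultdict


def trigrams(text):
    """Set of consecutive three-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def build_index(texts):
    """Map each trigram to the ids of the texts that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def most_similar(query, index, top=3):
    """Ids of the texts sharing the most trigrams with the query, best first."""
    hits = Counter()
    for gram in trigrams(query):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common(top)]


texts = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox ran away",
    "completely unrelated sentence about cooking pasta",
]
index = build_index(texts)
print(most_similar("the quick brown fox jumps high", index))  # [0, 1]
```

Dividing each hit count by the number of trigrams in the matched text would give the length normalization mentioned above.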
I don't know what you mean by similar, but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts so it can quickly find the most similar ones (by its lights) at query-time, as you asked.
You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions, one of which you have to choose for your domain:
Difference of Keyword Sets - You can prune the document of the most common words in the dictionary, and then end up with a list of unique keywords per document. The difference function would then calculate the difference as the difference of the sets of keywords per document.
Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another using a text diffing algorithm (see Text Difference Algorithm).
Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.
