Similarity between two strings - Oracle

I have a table with more than 3 million names (name, surname, father's name), and I want to find entries whose similarity is greater than 90%.
I have used many fuzzy algorithms, as well as the UTL_MATCH similarity functions (JARO_WINKLER, EDIT_DISTANCE), but their performance is not good (more than 20 seconds).
I also want to check with the fields swapped around, but that takes too long. Like:
Name Surname Fathername,
Name Fathername Surname,
Surname Name Fathername,
.........
I couldn't find any algorithm with good performance; this has to work on a transactional system.

You don't need to check against all 3 million names every time, because you have duplicates in your database. What you can also do is pre-cluster your entries around a set of well-separated representatives and then use those as entry points.
So in the first step create your entry points:
Miller
Smith
Yang
...
Then check against these entry points with a lower threshold (for example > 70%) and only descend into the clusters where you have a reasonably good match. This should prune most of the comparisons and make your algorithm much faster.
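As a rough, non-Oracle illustration of that pruning idea, here is a minimal Python sketch; it uses difflib.SequenceMatcher from the standard library as a stand-in for Jaro-Winkler, and the clusters and names are invented for the example:

from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for UTL_MATCH.JARO_WINKLER; returns a ratio in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# hypothetical pre-built clusters: surname entry point -> full names assigned to it
clusters = {
    "Miller": ["Miller John Adam", "Millar John Adam", "Mueller Jon Adam"],
    "Smith":  ["Smith Anna Lee", "Smyth Anna Lee"],
    "Yang":   ["Yang Wei Min"],
}

def find_matches(query_surname, query_full_name, entry_threshold=0.7, match_threshold=0.9):
    matches = []
    for entry_point, members in clusters.items():
        # cheap first pass: compare only against the cluster representative
        if similarity(query_surname, entry_point) < entry_threshold:
            continue
        # expensive second pass: full comparison, but only inside promising clusters
        matches.extend(m for m in members if similarity(query_full_name, m) >= match_threshold)
    return matches

print(find_matches("Miller", "Miller John Adam"))  # ['Miller John Adam', 'Millar John Adam']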

Related

Get a list of users matching different terms with a specified ratio

Let's say I have the following simple document structure.
{
username: string,
hobby: string
}
I want to get, in one request, a list of users containing 80% of users with football as their hobby, 10% with rugby, 5% with volleyball, and 5% with tennis.
Is this possible? How can you achieve that?
If so, is it also possible to say that I want a given percentage of users with a random hobby value?
Thanks a lot,
Julien
No. Elasticsearch does not give partially calculated results.
Another flaw is that the numbers might not match the exact percentage (in any database).
For example, if you have 4 users in total, one with each hobby you specified, you cannot build the desired list with those exact percentages, and there are countless such combinations.
One more suggestion: if your documents really have this flat structure, consider a relational (SQL) database instead.

How to filter a very, very large file

I have a very large unsorted file, 1000GB, of ID pairs
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABA122
ID:ABC124 ID:ABC123
ID:ABC124 ID:ABC126
I would like to filter the file for
1) duplicates
example
ABC123 ABC124
ABC123 ABC124
2) reverse pairs (discard the second occurrence)
example
ABC123 ABC124
ABC124 ABC123
After filtering, the example file above would look like
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABA122
ID:ABC124 ID:ABC126
Currently, my solution is this
my %hash;
while (my $line = <FH>) {
    chomp $line;                                   # remove \n
    my ($id1, $id2) = split / /, $line;
    if (exists $hash{"$id1 $id2"} || exists $hash{"$id2 $id1"}) {
        next;                                      # duplicate or reverse pair already seen
    }
    else {
        $hash{"$id1 $id2"} = undef;                # store the pair in a hash
        print "$line\n";
    }
}
which gives me the desired results for smaller lists, but takes up too much memory for larger lists, as I am storing the hash in memory.
I am looking for a solution that will take less memory to implement.
Some thoughts I have are
1) save the hash to a file, instead of memory
2) multiple passes over the file
3) sorting and uniquing the file with unix sort -u -k1,2
After posting on Stack Exchange CS, an external sort algorithm was suggested (a small sketch of that route is below).
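For what it's worth, ideas 2) and 3) can be combined: canonicalize each pair so that duplicates and reverse pairs become identical lines, then let GNU sort do the external, disk-based deduplication. A minimal sketch, assuming the input file is called pairs.txt (the file names are placeholders); note that, like the answers below, it does not preserve the original order within a pair:

# canonicalize.py
# usage: python canonicalize.py < pairs.txt | LC_ALL=C sort -u > unique_pairs.txt
# (sort -u performs an external merge sort on disk, so memory stays bounded;
#  tune it with -S for buffer size and --parallel for threads)
import sys

for line in sys.stdin:
    id1, id2 = line.split()
    if id2 < id1:
        id1, id2 = id2, id1   # fixed order, so "A B" and "B A" become the same line
    print(id1, id2)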
You could use Map-Reduce for this task.
Map-Reduce is a framework for batch processing that allows you to easily distribute your work among several machines and use parallel processing without having to take care of synchronization and fault tolerance yourself.
def map(id1, id2):
    # emit each pair with its IDs in a fixed (sorted) order
    if id1 < id2:
        yield (id1, id2)
    else:
        yield (id2, id1)

def reduce(id1, ids):
    # ids: every id2 that was paired with this id1
    unique_ids = set(ids)   # fairly small per id
    for id2 in unique_ids:
        yield (id1, id2)
The map-reduce implementation will allow you to distribute your work on several machines with really little extra programming work required.
This algorithm also requires a linear (and fairly small) number of passes over the data, with a fairly small amount of extra memory needed, assuming each ID is associated with a small number of other IDs.
Note that this will alter the order within pairs (the first ID becomes the second in some cases).
If the order of original ids does matter, you can pretty easily solve it with an extra field.
Also note that the order of data is altered, and there is no way to overcome it when using map-reduce.
For better efficiency, you might want to add a combiner, which will do the same job as the reducer in this case, but whether it actually helps depends a lot on the data.
Hadoop is an open source library that implements Map-Reduce, and is widely used in the community.
Depending on the details of your data (see my comment on the question) a Bloom filter may be a simple way to get away with two passes. In the first pass insert every pair into the filter after ordering the first and the second value and generate a set of possible duplicates. In the second pass filter the file using the set of possible duplicates. This obviously requires that the set of (possible) duplicates is not itself large.
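To make that two-pass idea concrete, here is a toy Python sketch of a Bloom filter and the two passes (standard library only; the file names are placeholders, and the input is assumed to already have each pair in canonical order). For the real data set the filter would of course have to be far larger, as the next paragraph points out:

import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        # with one SHA-1 digest (20 bytes) this scheme supports up to 5 hash positions
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], 'big') % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # no false negatives, but false positives are possible
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# First pass: collect pairs that *might* occur more than once.
bloom = BloomFilter(size_bits=8 * 10**8, num_hashes=5)   # ~100 MB of bits; size to taste
possible_dupes = set()
with open('canonical_pairs.txt') as f:
    for line in f:
        key = line.strip()
        if bloom.might_contain(key):
            possible_dupes.add(key)
        else:
            bloom.add(key)

# Second pass: only the possible duplicates need exact tracking.
seen = set()
with open('canonical_pairs.txt') as f, open('output.txt', 'w') as out:
    for line in f:
        key = line.strip()
        if key in possible_dupes:
            if key in seen:
                continue
            seen.add(key)
        out.write(line)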
Given the characteristics of the data set - up to around 25 billion unique pairs and roughly 64 bits per pair - the result will be on the order of 200 GB. So you either need a lot of memory, many passes or many machines. Even a Bloom filter will have to be huge to yield an acceptable error rate.
sortbenchmark.org can provide some hints on what is required, because the task is not too different from sorting. The 2011 winner used 66 nodes, each with 2 quad-core processors, 24 GiB of memory and 16 disks of 500 GB, and sorted 1,353 GB in 59.2 seconds.
As an alternative to rolling your own clever solution, you could add the data into a database and then use SQL to get the subset that you need. Many great minds have already solved the problem of querying 'big data', and 1000 GB is not really that big, all things considered...
Your approach is almost fine, you just need to move your hashes to disk instead of keeping them in memory. But let's go step by step.
Reorder IDs
It's inconvenient to work with records whose IDs may appear in either order. So, if possible, reorder the IDs, or, if not, create an additional key for each record that holds the correct order. I will assume you can reorder the IDs (I'm not very good at Bash, so my code will be in Python):
with open('input.txt') as file_in, open('reordered.txt', 'w') as file_out:
    for line in file_in:
        reordered = ' '.join(sorted(line.split()))  # reorder IDs within the pair
        file_out.write(reordered + '\n')
Group records by hash
You cannot filter all records at once, but you can split them into a reasonable number of parts. Each record is assigned to a part by its hash, e.g.:
N_PARTS = 1000
with open('reordered.txt') as file_in:
    for line in file_in:
        part_id = hash(line) % N_PARTS  # part_id will be between 0 and (N_PARTS-1)
        with open('part-%04d.txt' % part_id, 'a') as part_file:
            part_file.write(line)       # line already ends with '\n'
The choice of hash function is important here. I used Python's built-in hash() (modulo N_PARTS), but you may need another function that distributes the records across parts close to uniformly. If the hash function works more or less OK, instead of 1 large file of 1 TB you will get 1000 small files of ~1 GB each. And the most important thing is that you have a guarantee that no two identical records end up in different parts.
Note that opening and closing a part file for every line isn't really a good idea, since it generates countless system calls. A better approach would be to keep the files open (you may need to increase your open-file limit, ulimit -n), to batch the writes, or even to write to a database; this is an implementation detail, and I will keep the code simple for the purpose of demonstration.
Filter each group
Files of ~1 GB are much easier to work with, aren't they? You can load them into memory one part at a time and easily remove duplicates with a hash set:
with open('output.txt', 'w') as file_out:
    for i in range(N_PARTS):                      # for each part
        unique = set()                            # only one part needs to fit in memory
        with open('part-%04d.txt' % i) as part_file:
            for line in part_file:                # for each line
                unique.add(line)
        for record in unique:
            file_out.write(record)
This approach uses some heavy I/O and 3 passes, but it is linear in time and uses a configurable amount of memory (if your parts are still too large for a single machine, just increase N_PARTS).
So if this were me, I'd take the database route as described by @Tom in another answer. I'm using Transact-SQL here, but it seems that most of the major SQL databases have similar windowing/ranking row_number() implementations (except MySQL).
I would probably run a two sweep approach, first rewriting the id1 and id2 columns into a new table so that the "lowest" value is in id1 and the highest in id2.
This means that the subsequent task is to find the dupes in this rewritten table.
Initially, you would need to bulk-copy your source data into the database, or generate a whole bunch of insert statements. I've gone for the insert here, but would favour a bulk insert for big data. Different databases have different means of doing the same thing.
CREATE TABLE #TestTable
(
id int,
id1 char(6) NOT NULL,
id2 char(6) NOT NULL
)
insert into
#TestTable (id, id1, id2)
values
(1, 'ABC123', 'ABC124'),
(2, 'ABC123', 'ABC124'),
(3, 'ABC123', 'ABA122'),
(4, 'ABC124', 'ABC123'),
(5, 'ABC124', 'ABC126');
select
id,
(case when id1 <= id2
then id1
else id2
end) id1,
(case when id1 <= id2
then id2
else id1
end) id2
into #correctedTable
from #TestTable
create index idx_id1_id2 on #correctedTable (id1, id2, id)
;with ranked as
(select
ROW_NUMBER() over (partition by id1, id2 order by id) dupeRank,
id,
id1,
id2
from #correctedTable)
select id, id1, id2
from ranked where dupeRank = 1
drop table #correctedTable
drop table #TestTable
Which gives us the result:
3 ABA122 ABC123
1 ABC123 ABC124
5 ABC124 ABC126
I'm not trying to answer the question, merely adding my 0.02€ to the other answers.
A must-do, in my view, is to split the task into multiple smaller tasks, as was already suggested: both the control flow and the data structures.
Think of the way merge sort was used with tape drives to sort data volumes larger than memory and larger than random-access disk. In today's terms it would mean that the storage is distributed across multiple (networked) disks or networked disk sectors.
There are already languages and even operating systems that support this kind of distribution at different granularities. Some 10 years ago I had my hot candidates for this kind of task, but I don't remember the names, and things have changed since then.
One of the first was the distributed Linda operating system, with parallel processors attached/detached as needed. The basic coordination structure was a huge distributed tuple space data structure in which processors read tasks and wrote results.
A more recent approach with a similar distribution of work is multi-agent systems (the Czech Wikipedia article perhaps contains more links).
Related Wikipedia articles are Parallel Computing, Supercomputer Operating Systems and the List of concurrent and parallel programming languages.
I don't mean to say that you should buy some processor time on a supercomputer and run the computation there. I'm listing them as algorithmic concepts to study.
There will often be free or open-source software solutions available that allow you to do the same on a small scale, starting with cheap software and available hardware. For example, back at university in 1990 we used the night time in the computer lab to calculate ray-traced 3D images. It was a very computationally expensive process, as for every pixel you must cast a "ray" and calculate its collisions with the scene model. On one machine, with a scene containing some glass and mirrors, it ran at about 1 pixel per second (C++ and optimized assembly language code). At the lab we had some ~15 PCs available, so the final time could be reduced ~15 times (i386, i486, image of 320x200 in 256 colors). The image was split into standalone tasks, computed in parallel and then merged into one. The approach scaled well at that time, and a similar approach would help you today as well.
There always was, and always will be, something like "big data": data so big that it does not fit into RAM, does not fit on disk, and can't be computed on one computer in finite time.
Such tasks have been solved successfully since the very first days of computing. Terms like B-tree, tape drive, seek time, Fortran, COBOL and IBM AS/400 come from that era. If you're like the engineers of those times, then you'll surely come up with something smart :)
EDIT: actually, you are probably looking for External Sorting

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issue:
I have a huge list of street names in Indonesia (> 100k rows) stored in a database.
Each street name may have more than one word. For example: "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legitimate street names.
I also have a bunch of texts (> 1 million rows) in databases, which I split into sentences. Now, the feature (function, to be exact) that I need to build is a test for whether a sentence contains a street name or not, so just a true/false test.
I have tried to solve it with these steps:
a. Put the street names into a key-value hash
b. Split each sentence into words
c. Test whether the words are in the hash
This is fast, but will not work with multi-word street names.
Another alternative that I thought of is to do these steps:
a. Split each sentence into words
b. Query the database with a LIKE statement (i.e. SELECT #### FROM street_table WHERE name LIKE '%word%')
c. If the query returns a row, it means that the sentence contains a street name
This solution, however, is going to be very I/O intensive.
So my question is: what is the most efficient way to do this test, regardless of the programming language? I mainly work in Python, but any language will do as long as I can grasp the concepts.
============EDIT 1 =================
Will this be periodic?
Yes, I will call this feature/function at an interval of 1 minute. Each call will take at least 100 rows of text and test them against the street name database.
A simple solution would be to create a dictionary/multimap with first-word-of-street-name => full-street-name(s). As you iterate over the words in a sentence, you look up potential street names and check whether you have a match (by looking at the next words).
This algorithm should be fairly easy to implement and should perform pretty well too.
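A minimal Python sketch of that idea (the in-memory list and the sample sentences are just for illustration; in practice the street names would come from the database):

from collections import defaultdict

street_names = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]

# first word (lowercased) -> street names starting with it, as word tuples
index = defaultdict(list)
for name in street_names:
    words = tuple(name.lower().split())
    index[words[0]].append(words)

def contains_street_name(sentence):
    # note: punctuation handling is omitted for brevity
    words = sentence.lower().split()
    for i, word in enumerate(words):
        for candidate in index.get(word, ()):
            # check whether the following words complete the street name
            if tuple(words[i:i + len(candidate)]) == candidate:
                return True
    return False

print(contains_street_name("Kantor kami ada di Gatot Subroto nomor 12"))  # True
print(contains_street_name("Tidak ada nama jalan di kalimat ini"))        # False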
Using NLP, you can determine the proper nouns in a sentence. Please refer to the link below.
http://nlp.stanford.edu/software/lex-parser.shtml
The Stanford parser is accurate in its analysis. Once you have the proper nouns, you can decide which approach to follow.
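As a rough illustration only (this uses NLTK's tagger as a lightweight stand-in for the Stanford parser, and an English model, so treat it as an assumption rather than a recipe for Indonesian text), extracting candidate proper nouns could look like this:

import nltk  # requires the NLTK tokenizer and POS-tagger models (see nltk.download())

def proper_nouns(sentence):
    # NNP / NNPS are the Penn Treebank tags for proper nouns
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [word for word, tag in tagged if tag in ('NNP', 'NNPS')]

print(proper_nouns("The office is located on Gatot Subroto near the river"))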
So you have a document and want to search whether it contains any of the street names in your list?
Turbo Boyer-Moore is a good starting point for doing that.
Here is more information on Turbo Boyer-Moore.
But I strongly believe you will have to do something about the organisation of your list of street names. There should be some bucketed access to it, i.e. you can easily filter the street names:
Here an example:
Street name: Asia-Pacific-street
You can access your list by:
A (getting a starting point for all that start with an A)
AS (getting a starting point for all that start with an AS)
and so on...
I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)
more information about bucketing
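A rough sketch of such bucketed prefix access, using a sorted list and binary search from Python's standard library (the street names are made up, and the two-letter prefixes are just the example from above):

import bisect

street_names = sorted(["Asia-Pacific-street", "Asia-Minor-street", "Atlantic-street", "Baker-street"])

def bucket(prefix):
    # all street names starting with the given prefix (e.g. "A", "As")
    lo = bisect.bisect_left(street_names, prefix)
    hi = bisect.bisect_left(street_names, prefix + '\uffff')
    return street_names[lo:hi]

print(bucket("As"))  # ['Asia-Minor-street', 'Asia-Pacific-street']
print(bucket("A"))   # every name starting with 'A'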
The Aho-Corasick algorithm could be pretty useful. One of its advantages is that its run time is independent of how many words you are searching for (it depends only on the length of the text you are searching through). It will be especially useful if your list of street names does not change frequently.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
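If Python is an option, a short sketch using the pyahocorasick package (assuming its Automaton API; the street names and sentence are made up):

import ahocorasick  # pip install pyahocorasick

street_names = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]

automaton = ahocorasick.Automaton()
for name in street_names:
    automaton.add_word(name.lower(), name)   # the value is returned on a match
automaton.make_automaton()

def contains_street_name(sentence):
    # iter() yields (end_index, value) for every occurrence in the text;
    # note these are substring matches, so add a word-boundary check if needed
    return any(True for _ in automaton.iter(sentence.lower()))

print(contains_street_name("Kantor kami ada di Gatot Subroto nomor 12"))  # True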

Privacy and Anonymization "Algorithm"

I read this problem in a book (as an interview question) and wanted to discuss it in detail here. Kindly shed some light on it.
The problem is as follows:
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address and social security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data across multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example data set, and to see what it actually takes to do this anonymization. I hope the question is clear.
I don't have an experienced person who can help me deal with this kind of problem. Kindly don't vote to close this question, as I would be helpless if that happens.
Thanks, and if any more explanation of the question is required, please shoot me your questions.
I just copy-pasted part of your text and stumbled upon this, which helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a suitable database and you need to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminating information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two possible values: M and F.
But if you also keep the birthdate, then your total number of possible combinations becomes more or less 2*365*80 ~= 58.000 (I chose 80 years). Even if your database contains 500.000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with such an entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more sophisticated, look into correlated information and PCA.
Edit: Let's give an example. Suppose I'm working with medical data. If I keep only:
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total of 2*4*2*50*12*10 = 96.000 categories. Thus, if your database contains 200.000.000 entries (a rough approximation of the number of inhabitants of the USA in your database), it is practically impossible to identify any individual.
This also assumes that you do not give out any further information (no ZIP code, etc.). With only the 6 attributes given, you can compute some nice statistics (do people born in December live longer?), but no identification is possible, because 96.000 is much smaller than 200.000.000.
However, if you only have the database of the city you live in, which has for example 200.000 inhabitants, then you cannot guarantee anonymization, because 200.000 is "not much bigger" than 96.000. ("Not much bigger" is of course a highly scientific term that requires knowledge of probabilities :P)
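To make the "mapped to at least k individuals" criterion concrete, here is a small Python sketch (the toy records are invented) that computes the k of a table: the size of the smallest group of records sharing the same combination of attributes:

from collections import Counter

# toy "anonymized" records: (sex, state, birth month, age category)
records = [
    ("M", "MA", 12, "30-39"),
    ("M", "MA", 12, "30-39"),
    ("F", "MA",  3, "30-39"),
    ("F", "CA",  3, "20-29"),
    ("F", "CA",  3, "20-29"),
]

def k_anonymity(records):
    # k is the size of the smallest equivalence class; a class of size 1
    # means that record is uniquely identifiable from these attributes alone
    return min(Counter(records).values())

print(k_anonymity(records))  # 1: the ("F", "MA", 3, "30-39") record is unique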
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one alone, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CDQQFjAA&url=https%3A%2F%2Fwww.infoway-inforoute.ca%2Findex.php%2Fcomponent%2Fdocman%2Fdoc_download%2F624-tools-for-de-identification-of-personal-health-information&ei=QiO0VL72J-3nsATkl4CQBg&usg=AFQjCNF3YUE2cl9QZTuw-L4PYtWnzmwlIQ&sig2=JE8bYkqg04auXstgF0f7Aw&bvm=bv.83339334,d.cWc
The document above was written within the rules of the Canadian public health system, but its concepts are applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.

Matching people based on names, DoB, address, etc

I have two databases that are differently formatted. Each database contains person data such as name, date of birth and address. They are both fairly large, one is ~50,000 entries the other ~1.5 million.
My problem is to compare the entries and find possible matches. Ideally generating some sort of percentage representing how close the data matches. I have considered solutions involving generating multiple indexes or searching based on Levenshtein distance but these both seem sub-optimal. Indexes could easily miss close matches and Levenshtein distance seems too expensive for this amount of data.
Let's try to put a few ideas together. The general situation is too broad, and these will be just guidelines/tips/whatever.
Usually what you'll want is not a true/false match relationship, but a score for each candidate match. That is because you can never be completely sure whether a candidate is really a match.
The scoring is a one-to-many relation: you should be prepared to rank each record of your small DB against several records of the master DB.
Each kind of match should be assigned a weight and a score, to be added up into the overall score for that pair.
You should try to compare fragments as small as possible in order to detect partial matches. Instead of comparing [address], try to compare [city] [state] [street] [number] [apt].
Some fields require special treatment, but this issue is too broad for this answer. Just a few tips: middle initials and name prefixes could add some score, but should be weighted lightly (as they are often omitted). Phone numbers may have variable prefixes and suffixes, so sometimes substring matching is needed. Depending on the data quality, names and surnames may need to be converted to Soundex or similar. Street names are usually normalized, but they may lack prefixes or suffixes.
Be prepared for long runtimes if you need a high quality output.
A percentage threshold is usually set, so that if, after partially processing a pair, the score is less than x out of a maximum of y, the pair is discarded.
If you KNOW that some field MUST match in order to consider a pair as a candidate, that usually speeds the whole thing up a lot.
The data structures used for comparing are critical, but I don't feel my particular experience will serve you well, as I always did this kind of thing on a mainframe: very high speed disks, a lot of memory, and massive parallelism. I could think about what is relevant to the general situation, if you feel some help with it may be useful.
HTH!
PS: Almost a joke: in a big project I managed quite a few years ago, we had the mother's maiden surname in both databases, and we assigned a heavy score to the fact that both surnames matched (the individual's and his mother's). Moral: all Smith->Smith pairs are the same person :)
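A minimal sketch of that weighted, field-by-field scoring (the field names, weights, blocking rule and similarity helper are illustrative assumptions, not a prescribed design):

from difflib import SequenceMatcher

def sim(a, b):
    # crude string similarity in [0, 1]; in practice you might use
    # Soundex, Jaro-Winkler or per-field comparators instead
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# field -> weight; weights reflect how discriminating each field is
WEIGHTS = {"surname": 3.0, "name": 2.0, "dob": 4.0, "city": 1.0, "street": 2.0}

def match_score(rec_a, rec_b):
    # blocking rule: require an exact match on date of birth to keep runtimes sane
    if rec_a["dob"] != rec_b["dob"]:
        return 0.0
    total = sum(w * sim(str(rec_a[f]), str(rec_b[f])) for f, w in WEIGHTS.items())
    return total / sum(WEIGHTS.values())   # normalized overall score in [0, 1]

a = {"name": "John", "surname": "Smith", "dob": "1980-02-11", "city": "Boston", "street": "Main St 12"}
b = {"name": "Jon",  "surname": "Smyth", "dob": "1980-02-11", "city": "Boston", "street": "12 Main Street"}
print(round(match_score(a, b), 2))   # roughly 0.86 with these toy records; keep pairs above a threshold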
You could maybe try using the full-text search feature, if your DBMS supports it? Full-text search builds its own indices and can find similar words.
Would that work for you?
