developing a db schema for efficient searching - algorithm

I want to implement a search similar to the one on http://maps.google.com/. If I type the name of a place or something similar, I see matching places as I type. I know it uses AJAX.
But my major concern is fast retrieval of matching data from the database, because the user can type almost anything: the name of a popular shop, the name of a place, or a shop name followed by a place name.
How can I design a database structure to support such a search? I just need pointers.
So, any pointers about search algorithms?

There's a whole field called spatial databases, or GIS (geographic information systems). Some major players are:
Oracle Spatial
PostGIS
ESRI
MapInfo
As for data structures, k-d trees are the typical spatial data structure. Lecture 3 at http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-851-advanced-data-structures-spring-2010/lecture-notes/ describes k-d trees nicely, if briefly.
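To make that concrete, here is a minimal 2-d k-d tree sketch in Java (class names, method names, and coordinates are illustrative; a real system would lean on one of the spatial databases above):

    // A minimal 2-d k-d tree for point data (e.g. longitude/latitude).
    public class KdTree {
        static class Node {
            double[] pt;        // {x, y}
            Node left, right;
            Node(double[] pt) { this.pt = pt; }
        }

        private Node root;

        // Insert a point, alternating the splitting axis at each level.
        public void insert(double[] pt) { root = insert(root, pt, 0); }

        private Node insert(Node n, double[] pt, int depth) {
            if (n == null) return new Node(pt);
            int axis = depth % 2;
            if (pt[axis] < n.pt[axis]) n.left = insert(n.left, pt, depth + 1);
            else                       n.right = insert(n.right, pt, depth + 1);
            return n;
        }

        // Nearest-neighbor search: descend toward the query point, then
        // backtrack into the far subtree only when the splitting plane is
        // closer than the best match found so far.
        public double[] nearest(double[] q) { return nearest(root, q, 0, null); }

        private double[] nearest(Node n, double[] q, int depth, double[] best) {
            if (n == null) return best;
            if (best == null || dist2(q, n.pt) < dist2(q, best)) best = n.pt;
            int axis = depth % 2;
            Node near = q[axis] < n.pt[axis] ? n.left : n.right;
            Node far  = (near == n.left) ? n.right : n.left;
            best = nearest(near, q, depth + 1, best);
            double d = q[axis] - n.pt[axis];
            if (d * d < dist2(q, best)) best = nearest(far, q, depth + 1, best);
            return best;
        }

        private static double dist2(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return dx * dx + dy * dy;
        }

        public static void main(String[] args) {
            KdTree t = new KdTree();
            t.insert(new double[]{2, 3});
            t.insert(new double[]{5, 4});
            t.insert(new double[]{9, 6});
            double[] hit = t.nearest(new double[]{8, 7});
            System.out.println(hit[0] + "," + hit[1]);  // 9.0,6.0
        }
    }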
Hope that helps.

Related

What structure and algorithm should I use for building a genealogy tree?

I found in one book that a DAG (directed acyclic graph) with topological sorting is good for representing a genealogy (family) tree, but that algorithm depends on the order of the input data.
Genealogy databases typically use what's called a lineage-linked structure.
This means that partners (husbands/wives) are linked together and called a family, and a family is linked to its children, with a link back from each child to its parent family.
I do not know of a specific graph type that represents this. Most programs implement it with a family table and an individual table, with the appropriate links between them.
Genealogy databases generally follow this structure to match the GEDCOM (Genealogy Data Communications) standard that was developed to allow transfer of data between programs.
In that standard, you'll specifically see FAM and INDI records. FAM records are connected to INDI records with HUSB, WIFE and CHIL links. INDI records are connected to FAM records with FAMS (spouse) and FAMC (parent) links.
Using this data structure will allow you to easily read a GEDCOM file and import data from other genealogy software, and also to export your data to a GEDCOM file so that other genealogy programs can read it.
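A minimal sketch of that lineage-linked structure in Java, mirroring GEDCOM's INDI and FAM records (class and field names are illustrative, not part of any standard API):

    import java.util.ArrayList;
    import java.util.List;

    class Individual {
        String name;
        List<Family> spouseIn = new ArrayList<>();  // FAMS links
        Family childOf;                             // FAMC link
        Individual(String name) { this.name = name; }
    }

    class Family {
        Individual husband;                              // HUSB link
        Individual wife;                                 // WIFE link
        List<Individual> children = new ArrayList<>();   // CHIL links

        Family(Individual husband, Individual wife) {
            this.husband = husband;
            this.wife = wife;
            husband.spouseIn.add(this);
            wife.spouseIn.add(this);
        }

        void addChild(Individual child) {
            children.add(child);
            child.childOf = this;  // back-link from child to parent family
        }
    }

    public class Lineage {
        public static void main(String[] args) {
            Family fam = new Family(new Individual("John"), new Individual("Mary"));
            Individual kid = new Individual("Kid");
            fam.addChild(kid);
            System.out.println(kid.childOf.husband.name);  // John
        }
    }

Navigating the tree is then just link-chasing: a person's paternal grandfather, for instance, is childOf.husband.childOf.husband.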
In genealogy, so-called Ahnentafel indexing (German for "ancestor table") is used to represent the ancestors of a single person; basically, this is a suitable linearization of a binary tree.
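Concretely, the linearization works like the array layout of a binary heap: the root person gets index 1, and for a person at index n, the father is at 2n and the mother at 2n + 1. A tiny Java sketch (names are, of course, illustrative):

    public class Ahnentafel {
        public static void main(String[] args) {
            // Root person is index 1; father of n is 2n, mother is 2n + 1,
            // so the ancestor binary tree linearizes into a flat array.
            String[] a = new String[8];
            a[1] = "me";
            a[2] = "father";                 // 2 * 1
            a[3] = "mother";                 // 2 * 1 + 1
            a[4] = "paternal grandfather";   // 2 * 2
            a[5] = "paternal grandmother";   // 2 * 2 + 1
            a[6] = "maternal grandfather";   // 2 * 3
            a[7] = "maternal grandmother";   // 2 * 3 + 1
            System.out.println(a[2 * 2]);    // the father's father
        }
    }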
To present the relations between people found in historic records, Open Archives uses a flexible force-directed graph layout implementation. In this graph every node is a person, and there are two types of edges: one depicting a marriage (orange) and one depicting a parent relation (the red 'blood' line). An example of such a graph can be seen here.
DAGs will not work. You might look at a prior post using the GEDCOM model in Neo4j.
Lineages can have complex relationships such as double cousins, step-sibling marriages, consanguinity, etc. These are easily managed in a NoSQL database such as Neo4j.

Algorithm to recognize keywords' categories in a One-search-box-for-all model query

I'm aiming to provide a one-search-box-for-everything model in a search engine project, like LinkedIn.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I have to identify those dimensions and the corresponding keywords in a big query before I can retrieve the final result from the database.
Points
I have access to all the possible values for all the dimensions. For example, I have all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension, with a dedicated field called "dimension" and another field holding the actual value (a sketch of this indexing scheme follows below).
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as is.
Iterate through the results up to some cutoff and recognize the first document with each new dimension.
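A minimal sketch of that indexing scheme, assuming Lucene 8.x (exact class names shift between Lucene versions); the field names follow the examples above:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class DimensionIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // One document per known keyword, tagged with its dimension.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                writer.addDocument(keyword("IEEE", "conference"));
                writer.addDocument(keyword("ooad", "topic"));
                writer.addDocument(keyword("xyz", "author"));
            }

            // Search the "name" field with the raw user query; each hit's
            // "dimension" field says which category the term belongs to.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                QueryParser parser = new QueryParser("name", analyzer);
                ScoreDoc[] hits = searcher.search(parser.parse("ieee xyz"), 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    Document doc = searcher.doc(hit.doc);
                    System.out.println(doc.get("name") + " -> " + doc.get("dimension"));
                }
            }
        }

        private static Document keyword(String name, String dimension) {
            Document doc = new Document();
            doc.add(new TextField("name", name, Store.YES));
            doc.add(new StringField("dimension", dimension, Store.YES));
            return doc;
        }
    }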
Problems
Not sure when to stop recognizing dimensions from the result set. For example, the query may contain only two dimensions, but the results may match three dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers or articles, or pointers to the right terminology that describes my problem domain, would certainly help.
Any guidance is highly appreciated.
Solution 1: How about solving your problem using Natural Language Processing, specifically Named Entity Recognition (NER)? NER can be done with simple regular expressions (in cases where the data is fairly static), or you can use a machine learning technique like Hidden Markov Models to figure out the named entities in your sequence data. The reason I stress HMMs over other supervised machine learning algorithms is that you have sequential data, with each state dependent on the previous or next state. NER will output the dimensions along with the corresponding names. After that, your search becomes a vertical search problem: you can just search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now, coming to the implementation part: I assume you know Java, as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in, and you can train and test the model on your data set. I am also assuming you have a large data set.
Solution 2: Try to model this as a property graph problem. Check out something like Neo4j. I suggest this because your problem falls into a schemaless domain: your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
Solution 3: Since you said you have all the possible values for the dimensions, why not simply convert all your unstructured text into structured data using regular expressions (a dictionary-based sketch follows below)? Again, since you do not have a fixed schema, store the data in a NoSQL key-value database; most of them provide Lucene integrations for full-text search, so you can then simply search that database.
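Since all dimension values are known up front, even a plain dictionary (gazetteer) lookup gets you quite far before reaching for regexes or an HMM. A minimal Java sketch with hypothetical dimension values; real NER would also handle unseen and ambiguous terms:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class QueryTagger {
        public static void main(String[] args) {
            Map<String, String> gazetteer = new HashMap<>();
            gazetteer.put("ieee", "conference");
            gazetteer.put("acm", "conference");
            gazetteer.put("authorxyz", "author");
            gazetteer.put("design patterns", "topic");

            System.out.println(tag("ACM paper by authorXYZ on design patterns", gazetteer));
            // {acm=conference, authorxyz=author, design patterns=topic}
        }

        // Greedily match the longest known phrase (up to 3 tokens) at each
        // position, so multi-word values like "design patterns" are caught.
        static Map<String, String> tag(String query, Map<String, String> gazetteer) {
            String[] tokens = query.toLowerCase().split("\\s+");
            Map<String, String> found = new LinkedHashMap<>();
            for (int i = 0; i < tokens.length; ) {
                int matched = 1;
                for (int len = Math.min(3, tokens.length - i); len >= 1; len--) {
                    String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                    String dim = gazetteer.get(phrase);
                    if (dim != null) { found.put(phrase, dim); matched = len; break; }
                }
                i += matched;
            }
            return found;
        }
    }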
What you need to do is calculate the similarity between the query and the document set you are searching. Measures like cosine similarity should serve your need. A hack you can use is to calculate TF-IDF for each document and build an index on that score, from which you can choose the appropriate one. I would recommend you look into the vector space model to find a method that serves your need (a small sketch follows below).
Give this algorithm a look as well:
http://en.wikipedia.org/wiki/Okapi_BM25
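A bare-bones sketch of the vector space model in Java: TF-IDF weighting plus cosine similarity between the query and each document. The corpus is made up for illustration; libraries like Lucene implement this (and BM25) for you:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class VectorSpace {
        public static void main(String[] args) {
            List<String[]> docs = Arrays.asList(
                "information retrieval with lucene".split(" "),
                "design patterns in java".split(" "),
                "multi threaded java programming".split(" "));
            String[] query = "java design patterns".split(" ");

            for (String[] doc : docs)
                System.out.println(cosine(tfidf(query, docs), tfidf(doc, docs)));
        }

        // TF-IDF vector of one token list, with IDF computed over all docs.
        static Map<String, Double> tfidf(String[] tokens, List<String[]> docs) {
            Map<String, Double> v = new HashMap<>();
            for (String t : tokens) v.merge(t, 1.0, Double::sum);  // raw term frequency
            for (Map.Entry<String, Double> e : v.entrySet()) {
                long df = docs.stream()
                              .filter(d -> Arrays.asList(d).contains(e.getKey()))
                              .count();
                // Smoothed inverse document frequency.
                e.setValue(e.getValue() * Math.log((1.0 + docs.size()) / (1.0 + df)));
            }
            return v;
        }

        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet())
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            for (double x : a.values()) na += x * x;
            for (double x : b.values()) nb += x * x;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
    }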

NoSQL or YesSQL

I have a huge dictionary of words:
"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]
The number of words is really big.
Now I want to be able to retrieve, really fast, all the values pointed to by a word, where word is a string value.
What are the best tools to use? I thought of a simple DB solution, but the DBA guys said it would not be really fast.
So, before I open Cormen's book, are there ready-made solutions for this problem?
Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
In RDBMSs (YesSQL) you will most probably search values with LIKE or = operators over all records, i.e. the search will take O(n). What you actually need is a data structure called an inverted index, which lets you find the list of needed values in O(1). For a description of the structure and algorithms, see the Wikipedia article; for ready-to-use tools, keep reading.
There are plenty of implementations of inverted indexes in search engines like Lucene/Solr and Sphinx (which, by the way, supports several databases as data sources), and also in some key-value stores like Berkeley DB or Apache Cassandra. The distinction between search engines and key-value stores is that:
Search engines implement the inverted index more directly (AFAIK, key-value DBs use BigTable-like structures, which are much more complex than the inverted index itself).
Search engines have plenty of tools for text analysis (parsing, stemming). I don't know if you actually need this, but if you do, use a search engine.
Key-value DBs are real databases. That is, unlike search engines, they have real data types, not only strings. Moreover, some of these DBs (e.g. Berkeley DB) can store programming-language-native data types without converting them to any internal format. So, if you need a real database with all the features, use a key-value store.
Also note that an inverted index is a really simple structure, so you can easily implement it yourself if none of the previous options suits you (a minimal sketch follows below).
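For illustration, a minimal in-memory inverted index in Java: each word maps to its posting list of values (the data is hypothetical):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InvertedIndex {
        // word -> posting list of values
        private final Map<String, List<Integer>> postings = new HashMap<>();

        public void add(String word, int value) {
            postings.computeIfAbsent(word, k -> new ArrayList<>()).add(value);
        }

        // Expected O(1) lookup of all values for a word.
        public List<Integer> lookup(String word) {
            return postings.getOrDefault(word, Collections.emptyList());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add("word3", 3435);
            idx.add("word3", 3423);
            System.out.println(idx.lookup("word3"));  // [3435, 3423]
        }
    }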
It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.
If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, a type of balanced search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-tree for their lookups.
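In Java terms, the first two options look like this (data hypothetical): a HashMap gives expected O(1) exact lookup, while a TreeMap (a red-black tree in the JDK) keeps keys sorted, enabling ordered iteration and range scans:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapTradeoffs {
        public static void main(String[] args) {
            Map<String, int[]> hash = new HashMap<>();
            hash.put("word1", new int[]{1});
            hash.put("word2", new int[]{2});
            System.out.println(hash.get("word2")[0]);  // exact lookup: 2

            TreeMap<String, int[]> sorted = new TreeMap<>(hash);
            // All keys in ["word1", "word2"), i.e. an ordered range query:
            System.out.println(sorted.subMap("word1", "word2").keySet());
        }
    }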
You can use Cassandra (http://cassandra.apache.org/). It is easy to start with, has plenty of documentation, and is a really fast solution for your problem.
Hope this helps.
If you know that you will only ever search for values based on words, and not the other way around, use a simple key-value store. Maybe Redis would be best (a minimal sketch follows below).
If you think you will ever need to search based on the values, then you'll likely need secondary indices or offline MapReduce jobs. Maybe Cassandra would be best.
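A minimal Redis sketch using the Jedis client, with one set per word holding its values (key names are illustrative; assumes a local Redis server and the Jedis library on the classpath):

    import java.util.Set;
    import redis.clients.jedis.Jedis;

    public class RedisWordStore {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // One Redis set per word, holding all its values.
                jedis.sadd("word:word3", "value3", "value2");
                Set<String> values = jedis.smembers("word:word3");
                System.out.println(values);  // [value3, value2] (unordered)
            }
        }
    }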

Can anyone point me toward a content relevance algorithm?

A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full-text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human-edited directory, I can assume we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
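As a concrete baseline for the Match() idea, a minimal JDBC sketch that multiplies MATCH() relevance by a hand-edited weight column (the schema, weight column, and connection details are all hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Assumes a table like:
    //   CREATE TABLE businesses (name TEXT, description TEXT, weight FLOAT,
    //                            FULLTEXT (name, description));
    public class RelevanceSearch {
        public static void main(String[] args) throws Exception {
            String sql = "SELECT name, "
                       + "  MATCH(name, description) AGAINST (?) * weight AS score "
                       + "FROM businesses "
                       + "WHERE MATCH(name, description) AGAINST (?) "
                       + "ORDER BY score DESC LIMIT 20";
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/directory", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "organic coffee shop");
                stmt.setString(2, "organic coffee shop");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next())
                        System.out.println(rs.getString("name") + " " + rs.getDouble("score"));
                }
            }
        }
    }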
For an example of information-retrieval-based techniques, look up TF-IDF or BM25.
For machine-learning-based techniques, look up RankNet and its variants from MSR.
If you have hand-edited data, have a look at Oracle Text. In one of my previous projects we had some good results with it.
I was not directly involved in the database setup, but I know that the results were very welcome. (Before this, they had just keyword-based search.)
Use a search engine like Solr to index the data. You can still use MySQL to hold the data, but for searches, use the search engine.

Multi-dimensional data structure

Which of the following data structures:
R-tree,
R*-tree,
X-tree,
SS-tree,
SR-tree,
VP-tree,
metric-trees
provide reasonably good performance for insertion, update, and search of multidimensional data stored in the corresponding form?
Is there a better data structure out there for handling multidimensional data?
What kind of multi-dimensional data are you talking about? The R-tree wiki states that it is used for indexing multi-dimensional data, but it seems clear that it is primarily useful for data which is multi-dimensional in the same kind of feature, i.e. vertical and horizontal location, longitude and latitude, etc.
If the data is multi-dimensional simply because there are a lot of attributes for the data and it needs to be analyzed along many of these dimensions, then a relational representation is probably best.
The real issue is how you optimize the relations and indices for the types of queries you need to answer. For this, you need to do some domain analysis beforehand, and some performance analysis after the first iteration, to determine whether there are better ways to structure and index your tables.
