Keyword search over a collection of OWL ontologies - full-text-search

I have a collection of OWL ontologies. Each ontology is stored in a dataset of a triple store database (e.g., OWLIM, Stardog, AllegroGraph). Now I need to develop an application that supports searching these ontologies by keyword, i.e., given a keyword, the application should return the ontologies that contain this keyword.
I have checked OWLIM-SE and Stardog; they only provide full-text search over one dataset, not the whole database. I have also considered Solr (Lucene), but in that case the ontologies would be indexed twice (once by Lucene and once by the triple store database).
Is there any other solution for this problem?
Thanks in advance.

Stardog's full text indexing works over an entire database and can be done transparently with SPARQL which will allow you to easily access other properties of the concepts matching your search criteria in a single query. This will get you precisely what you're describing.
For some information on administering the search indexes, and Stardog in general, check out these docs

Related

How to calculate relevance in Elasticsearch based on associated documents

Main question:
I have one data type, let's call them People, and another associated data type, let's say Reports. A person can have many reports associated with them and vice versa in our relational database. These reports can be pretty long, sometimes over 1000 words mostly in English.
We want to be able to search people by a keyword, where the results are the people whose reports are most relevant to the keyword. For example if Person A's reports mention "art director" a lot more than any other person, we want Person A to show up high in the search results if someone searched "art director".
More details:
The key thing here, is that we don't want to combine all the reports together and add them as a field for the Person model. I think that with 100,000s of our People records and 1,000,000s of long reports, this would make the index super big. (And I think there might be limits on how long the text of a field can be.)
The reports are also indexed on their own, so that people can do full text searches of all the reports without considering the People records. This works great already.
To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
Is this possible, and if so how?
P.S. I am using the Searchkick Ruby gem to generate the Elasticsearch queries through an API. But I can also use the Elasticsearch DSL directly if necessary.
Answering your questions:
1.(...) we want Person A to show up high in the search results if someone searched "art director".
That's exactly what Elasticsearch does, so I would recommend starting with a simple match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
From there you can start adding up more complexity.
Elasticsearch ranks by relevance using BM25 by default, a refinement of TF-IDF, which rests on two ideas:
TF (Term Frequency): The more frequent a term is within a document, the more relevant that document is.
IDF (Inverse Document Frequency): The more frequent a term is across the entire dataset, the less relevant it is.
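The two factors above can be made concrete with a toy calculation. This is a minimal sketch of one common textbook TF-IDF variant (with smoothing), not Elasticsearch's actual scoring code; the sample corpus and the function name are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one tokenized document against a tokenized corpus."""
    # TF: share of the document's tokens that are this term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF: terms found in fewer documents get a larger weight (smoothed).
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return tf * idf


# Hypothetical mini-corpus: three "reports", already split into tokens.
corpus = [
    "art director hired art director".split(),
    "sales report for the director".split(),
    "quarterly sales report".split(),
]

for term in ("art", "director"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

Both terms occur twice in the first report, but "art" appears in fewer reports overall, so it scores higher than "director" for that report.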
2.(...) To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
You are right. The recommendation is not to index a whole book as a single field, but to index the individual chapters/pages/etc. as separate documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html
There are some structures you can use. Which one to use will depend on the scale of your data and on how you want to show this data to your users.
The structures are:
Join field type (parent=author child=report pages)
Nested field type (array of report pages within an author)
Collapsed results (each doc being a book page, collapse by author)
We could discuss at length which one is best, but I invite you to try them yourself.
Some guidelines:
If the reports greatly outnumber the authors, you can use the join field type.
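Two of the structures listed above can be sketched as search request bodies, built here as plain Python dicts. This is only an illustration under assumptions: the relation name `report`, the fields `body` and `person_id`, and the function names are hypothetical and would have to match your own mapping. `has_child` with `score_mode` and `collapse` are part of the Elasticsearch query DSL.

```python
import json


def people_by_report_text(keyword):
    """Body for an index that uses a join field (parent: person, child: report).

    Assumes a hypothetical child relation named `report` with a text
    field `body`. Each person is scored by summing the scores of their
    matching reports.
    """
    return {
        "query": {
            "has_child": {
                "type": "report",
                "score_mode": "sum",
                "query": {"match": {"body": keyword}},
            }
        }
    }


def reports_collapsed_by_person(keyword):
    """Body for a reports index: one hit per person, ranked by best report.

    Assumes each report stores a hypothetical keyword or numeric field
    `person_id`.
    """
    return {
        "query": {"match": {"body": keyword}},
        "collapse": {"field": "person_id"},
    }


print(json.dumps(people_by_report_text("art director"), indent=2))
```

The first body ranks people by the combined relevance of all their reports; the second keeps the existing reports index and simply groups hits per person, which ranks each person by their single best report.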

ElasticSearch vs Relational Database

I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also search if a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls; if they don't exist yet, we create the contact by asking for all their personal information. The second time they call, we search for matches by name, last name, and email to detect that the contact already exists in our DB.
What I thought of is to use MongoDB as primary storage and ElasticSearch to perform the query, but I don't know if there is really a big difference between this and querying a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to quickly check (by name, email, last name) whether that person is in our DB. Wouldn't ElasticSearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better at read-what-was-just-written performance. Search engines are better at really quick search with additional tricks such as all kinds of normalization: lowercasing, ä->a or ae, prefix matches, n-gram matches (if indexed accordingly). Whether it's 1 million or 10 million entries in the store is no big deal nowadays, but what is your query load? There are only so many service center workers, so your query load is likely far less than 1 QPS. That is no problem at all for a relational DB. A search engine would start to make sense if you want some normalization, as described above, or if you start indexing free-text comments or descriptions of customers.
If you don't have a problem with performance, then keep it simple and use 1 single datastore (maybe with some caching in your application).
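The normalization tricks mentioned above (lowercasing, folding accented letters, prefix and n-gram matching) can be sketched in plain Python. This is only an illustration of what a search engine's analyzers do at index time, not how any particular engine implements them; the function names are made up.

```python
import unicodedata


def fold(text):
    """Lowercase and strip diacritics so accented and plain spellings match."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_len=2):
    """All prefixes of a token, as indexed for prefix/autocomplete matching."""
    return [token[:n] for n in range(min_len, len(token) + 1)]


def ngrams(token, n=3):
    """All substrings of length n, as indexed for partial matching."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


name = fold("Jäger")
print(name)               # jager
print(edge_ngrams(name))  # ['ja', 'jag', 'jage', 'jager']
print(ngrams(name))       # ['jag', 'age', 'ger']
```

Applying the same folding to the query means a search for "jager", "Jager", or "Jäger" hits the same indexed terms.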
Elasticsearch is not meant to be a primary datastore, so my advice is to use a simple relational database like Postgres with plain SQL queries or an ORM. If the dataset is not really large, it should be fast enough.
If you do run into performance issues on searches, you can use a combination of a relational DB and Elasticsearch. You can use Elasticsearch feeders to update ES with the data in your relational DB.
Indexed RDBMS works well for search
If your data is structured i.e. columns are clearly defined, searching 1 million records will also not be a problem in RDBMS.
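As a rough illustration of the indexed lookup described above, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for whichever RDBMS you choose. The table, column, and index names and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, email TEXT, phone TEXT)"
)
# Indexes on the columns the call center searches by.
conn.execute("CREATE INDEX idx_contacts_email ON contacts (email)")
conn.execute("CREATE INDEX idx_contacts_name ON contacts (last_name, first_name)")
conn.executemany(
    "INSERT INTO contacts (first_name, last_name, email, phone) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Ada", "Lovelace", "ada@example.com", "555-0100"),
        ("Alan", "Turing", "alan@example.com", "555-0101"),
    ],
)


def find_contact(conn, email=None, last_name=None):
    """Return contacts whose email or last name matches exactly."""
    return conn.execute(
        "SELECT id, first_name, last_name, email FROM contacts "
        "WHERE email = ? OR last_name = ?",
        (email, last_name),
    ).fetchall()


print(find_contact(conn, email="ada@example.com"))
print(find_contact(conn, last_name="Turing"))
```

Exact-match lookups like these are where a relational index is sufficient; fuzzy or accent-insensitive matching is where a search engine starts to pay off.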
When to use Elastic
Text Search: Searching words across multiple properties (e.g. description, name etc.)
JSON Store and search: If data being stored is in json format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions
Elastic as an application data provider
Elastic should not be seen as a data store, even if you are storing data in it. It is about how you perceive Elastic: it should be used to store and set up data for the application, and it is the application that decides how and when to use Elastic (search and suggestions). Elastic is not a NoSQL storage alternative to an RDBMS; for that, use a NoSQL database instead.
This perception puts elastic in line with redis and kafka. These tools are key components of an application design and they are used to serve as events stores, search engines and cache etc. to the applications.
Database with Elastic
Your design should use both. For storing the contacts use the database, index the contacts for querying. Also make the data available in elastic for searching, autocomplete and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you actually going to use the data?
If it's just something simple like checking if a customer exists and then creating a new customer, then use the RDBMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for BigData), but you have transactions and data integrity is important, then an RDBMS will be the right fit. Some examples could be tax, leasing, or financial reporting systems.
However, if you have a large dataset and you need a wide range of query capabilities, such as fuzzy search or searches where the user can select multiple filters on the data, or you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on a web-based "find a doctor" application with a large customer base: 11 million, with 200+ hits per second at peak time. The customer could tick checkboxes to select specialty, spoken languages, ratings, hospitals, etc., with all results sorted by distance from the user's location and a response time of 2 seconds or less. It would be very difficult for an RDBMS to match that.

How to search for multiple strings in very large database

I want to search for multiple strings in a very large database. These strings are part of different attributes of database table. I have tried string search using LIKE in sql query. But it is taking a lot of time to get results. I have used Oracle database.
Should I use indexing of database? I found that Lucene can be used for it.
I also got some suggestions of using big data concepts. Which approach should I use?
The easiest way is:
1.) adding an index to the columns you want to search through
2.) using Oracle Text, as #lalitKumarB wrote
The most powerful way is:
3.) using a separate search engine (Solr, Elasticsearch).
But you will probably have to change your application so that it explicitly uses the search index for searching through the data.
I was in the same situation some years ago, trying to search text in a big database. After a while I found out that database-based search will never reach the performance of a dedicated search engine. Also, you get many more search features out of the box if you use Solr (for example), such as spelling correction, "More like this", ...
One option is to hold the data in Oracle, search in Solr, and return only the ID of the document, so that you load from Oracle just the one row that is referenced by that ID.
A second option is to keep Oracle as the base data pool for your search engine and search in Solr (or Elasticsearch), returning the whole document/row from Solr, not only the ID. Then you don't need to load the data from the database any more.
The best option depends on your needs.
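The first option (search returns only IDs, the database supplies the rows) can be sketched as follows. This is a toy illustration under loud assumptions: `sqlite3` stands in for Oracle, a plain dictionary stands in for the Solr or Elasticsearch index, and all table, column, and function names are made up.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the Oracle table that holds the full rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (1, "Invoice", "overdue invoice for acme"),
        (2, "Memo", "acme quarterly memo"),
    ],
)

# Stand-in for the search engine: term -> set of row IDs.
index = defaultdict(set)
for doc_id, body in db.execute("SELECT id, body FROM docs").fetchall():
    for term in body.split():
        index[term].add(doc_id)


def search_ids(term):
    """Ask the 'search engine' for matching IDs only."""
    return sorted(index.get(term, set()))


def load_rows(ids):
    """Fetch the referenced rows from the database by primary key."""
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    sql = f"SELECT id, title FROM docs WHERE id IN ({marks}) ORDER BY id"
    return db.execute(sql, ids).fetchall()


print(load_rows(search_ids("acme")))
```

With the second option, the index would store the title and body as well, and the `load_rows` step would disappear.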
You have the choice between elasticsearch, solr or lucene

Elastic search and "databases"

Sorry for the ambiguous title, I couldn't think of anything more fitting.
I'm exploring Elastic Search and it looks very cool. My question is conceptual, since I'm used to SQL.
In SQL, you have different databases and you store the data for each application in its own one. Does the same concept exist in ES? Or is all the data from all my applications going to end up in the same place? In that case, what are the best practices to avoid unwanted results from unrelated data?
Schemaless doesn't mean structureless:
In elastic search you can organize your data into document collections
A top-level document collection is roughly equivalent to a database
You can also hierarchically create new document collections inside top-level collections, which is a very rough equivalent of a database table
When you search documents, you search for documents inside specific document collections (such as search for all posts inside blog1)
Individual documents can be viewed as equivalent to rows in a database table
Also please note that I say roughly equivalent -- data in SQL is often normalized into tables by relations, while documents (in ES) often hold large entities of data. For instance, it generally makes sense to embed all comments inside a blog post document, whereas in SQL you would normalize comments and blogposts into individual tables.
For a nice tutorial, I recommend taking look at "ElasticSearch in 5 minutes" tutorial.
Switching from SQL to a search engine can be challenging at times. Elasticsearch has a concept of an index, which can be roughly mapped to a database, and a type, which can, again very roughly, be mapped to a table. Elasticsearch has a very powerful mechanism for selecting records (rows) of a single type and for combining results from different types and indices (union). However, there is no support for joins at the moment. The only relationship that Elasticsearch supports is has_child, but it's not suitable for modeling many-to-many relationships. So, in most cases, you need to be prepared to denormalize your data so that it can be stored in a single table.

apache cassandra query/full text search

I've been playing around with Apache's Cassandra project. I've done a fair bit of reading and have worked through some fairly complex examples, including inserting single and batch sets of data and retrieving single and multiple data sets based on keys.
Some of the articles i've looked at include
http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example
http://github.com/digg/lazyboy
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://www.sodeso.nl/?p=80
I've got a fairly good grasp of the concepts explained and have even implemented a simple app.
None of the articles describe how one would go about performing a query where, for example, the query is a search term a user has typed in.
Does anyone know how or can suggest how i'd go about performing such a query?
Or perhaps a way to create a searchable index, full text search or anything even remotely close?
You will probably split the text into words and then use these words as keys to your "index". Each word will contain a timestamp-ordered column family with a list of IDs of your articles, messages, etc. This way you can only perform simple searches over keys (words).
When searching more than one word, use intersection over these column families.
This is a very simple approach; if you need more complex queries, look at Lucandra - http://github.com/tjake/Lucandra - Lucandra is a full-text search engine with Cassandra as backend storage.
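The word-as-key index with intersection described above can be sketched in memory as follows. This is a minimal illustration of the idea only: a Python dictionary stands in for the Cassandra column family, and the class and method names are made up.

```python
from collections import defaultdict


class WordIndex:
    """Toy inverted index: each word is a key mapping to a set of document IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Split the text into words and record the document under each word.
        for word in text.lower().split():
            self._postings[word].add(doc_id)

    def search(self, query):
        # A multi-word query is the intersection of the per-word ID sets.
        words = query.lower().split()
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result


index = WordIndex()
index.add(1, "Cassandra data model explained")
index.add(2, "Full text search with Cassandra")
index.add(3, "Full text search basics")

print(index.search("cassandra"))
print(index.search("full text cassandra"))
```

This only supports exact word matches; stemming, ranking, and phrase queries are the kind of features a dedicated full-text engine adds on top.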
