DBpedia indexing for named entity linking (chatbot) - Elasticsearch

I'm working on a project for a chatbot. The chatbot must answer users' questions using DBpedia, and was initially built with the IBM Watson Assistant service. However, that service requires manually filling in dictionaries that define the DBpedia entities and their synonyms. Only the entities defined in these dictionaries are recognized in users' natural-language questions.
For example, in the question "Who is the director of spiderman?" the chatbot recognizes the dbo:director property and the Spiderman entity, because both are defined in the dictionary.
Manually inserting all DBpedia entities into the dictionaries does not scale, so for the moment the chatbot recognizes only the few entities the dictionary contains.
I therefore want to recognize the DBpedia entities present in users' natural-language questions by indexing the DBpedia RDF datasets in something like Elasticsearch or Lucene and then using full-text search. My idea is to index entities using only DBpedia's literal properties (so that full-text search applies to them). Before continuing, I would like to know whether this is a sound approach, and to get some advice on how to set up the indexes and how to exploit full-text search effectively.
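Concretely, I was imagining something like the minimal sketch below. The index layout, field names, and use of the Python Elasticsearch client (8.x style) are just my assumptions; rdfs:label and dbo:abstract are the literal properties I would start from.

```python
# Sketch only: index each DBpedia entity as one document, keeping just its
# literal properties, then run full-text search over those fields.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Hypothetical mapping: one document per DBpedia resource.
es.indices.create(
    index="dbpedia_entities",
    mappings={
        "properties": {
            "uri":      {"type": "keyword"},  # e.g. http://dbpedia.org/resource/Spider-Man
            "label":    {"type": "text"},     # rdfs:label literals
            "abstract": {"type": "text"},     # dbo:abstract literals
        }
    },
)

# Full-text search of a user question against the literal properties,
# weighting label matches above abstract matches.
hits = es.search(
    index="dbpedia_entities",
    query={
        "multi_match": {
            "query": "Who is the director of spiderman?",
            "fields": ["label^3", "abstract"],
        }
    },
)
```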
Thank you

Related

How to calculate relevance in Elasticsearch based on associated documents

Main question:
I have one data type, let's call them People, and another associated data type, let's say Reports. In our relational database a person can have many reports associated with them, and vice versa. These reports can be pretty long, sometimes over 1,000 words, mostly in English.
We want to be able to search people by a keyword, where the results are the people whose reports are most relevant to that keyword. For example, if Person A's reports mention "art director" far more than anyone else's, we want Person A to rank high when someone searches for "art director".
More details:
The key thing here is that we don't want to concatenate all the reports and add them as a field on the Person model. With hundreds of thousands of People records and millions of long reports, I think this would make the index enormous. (And I believe there may be limits on how long a field's text can be.)
The reports are also indexed on their own, so that people can do full text searches of all the reports without considering the People records. This works great already.
To avoid these large and somewhat redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
Is this possible, and if so how?
P.S. I am using the Searchkick Ruby gem to generate the Elasticsearch queries through an API. But I can also use the Elasticsearch DSL directly if necessary.
Answering your questions:
1.(...) we want Person A to show up high in the search results if someone searched "art director".
That's exactly what Elasticsearch does, so I would recommend starting with a simple match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
From there you can start adding more complexity.
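For illustration, a minimal match query using the Python client; the index name reports and the field name text are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Score each report by its relevance to the keyword phrase.
response = es.search(
    index="reports",                             # placeholder index name
    query={"match": {"text": "art director"}},   # placeholder field name
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("person_id"))
```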
Elasticsearch scores results with TF-IDF (more precisely, recent versions default to BM25, a refinement of it), which means:
TF (term frequency): the more frequent a term is within a document, the more relevant that document is.
IDF (inverse document frequency): the more frequent a term is across the entire dataset, the less relevant it is.
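In the classic TF-IDF formulation (BM25 layers term-frequency saturation and document-length normalization on top of this), the score of a term t in a document d is roughly:

```latex
\mathrm{score}(t, d) \;=\; \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is how often t occurs in d, N is the total number of documents, and df(t) is the number of documents containing t.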
2.(...) To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
You are right. The recommendation is not to index a whole book as a single field, but to index the individual chapters/pages/etc. as documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html
There are several structures you can use. Which one to choose will depend on the scale of your data and on how you want to show this data to your users.
The structures are:
Joined field type (parent=author child=report pages)
Nested field type (array of report pages within an author)
Collapsed results (each doc being a book page, collapse by author)
We could discuss at length which one is best, but I invite you to try them yourself.
Some guidelines:
If the reports greatly outnumber the authors, you can use the joined field type; a minimal sketch follows below.
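Here is that sketch of the joined field type with the Python client (index and field names are assumptions): a person is the parent, each report a child, and has_child folds the children's relevance back into the parent hit.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# One index holds both people and reports, linked by a join field.
es.indices.create(
    index="people_reports",
    mappings={
        "properties": {
            "name":     {"type": "text"},
            "text":     {"type": "text"},
            "relation": {"type": "join", "relations": {"person": "report"}},
        }
    },
)

# A parent (person) and a child (report) routed to the parent's shard.
es.index(index="people_reports", id="p1",
         document={"name": "Person A", "relation": "person"})
es.index(index="people_reports", id="r1", routing="p1",
         document={"text": "An experienced art director ...",
                   "relation": {"name": "report", "parent": "p1"}})
es.indices.refresh(index="people_reports")

# People ranked by the summed relevance of their matching reports.
response = es.search(
    index="people_reports",
    query={
        "has_child": {
            "type": "report",
            "query": {"match": {"text": "art director"}},
            "score_mode": "sum",
        }
    },
)
```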

Elastic Enterprise Search - Is it a best practice to index data of two different JSON schemas in a single index?

Hi, I'm trying out Elastic Enterprise Search with Elasticsearch, and I have a couple of questions about data indexing.
Reading the Elasticsearch documentation, I saw that there is a limit to the number of fields an Elasticsearch index can have. Since Elasticsearch underlies Elastic Enterprise Search, I believe the same limit applies here. In that case, let's say I have multiple document types with various fields, for example Person.json and Dog.json, each with different properties. When indexing, I use a single search engine in Elastic Enterprise Search to index both Person and Dog, so that when I query through the Elastic Enterprise Search API I get both Person and Dog results, depending on the search term.
Is this the way to go, or should I specify a separate search engine for each schema type?
I am assuming that your Person.json and Dog.json contain different fields, as your heading suggests. Whether to create a separate index for these entities or keep them in a single index depends on the various use-cases in your application; you will not find Elasticsearch declaring one approach better than the other, only pros and cons in a particular context (relevance, performance, management, etc.).
Please refer to this SO answer of mine, where I discuss the various pros and cons of both approaches, and to the chat discussion there for more context on why the OP chose one approach for their use-case after weighing those pros and cons.
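One concrete detail behind the field-count concern: in plain Elasticsearch (which Enterprise Search runs on) the ceiling is the index.mapping.total_fields.limit setting, 1000 fields by default, and it can be raised per index. A rough sketch of the two layouts, with placeholder names, using the Elasticsearch API directly rather than the Enterprise Search engine API:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Option A: one shared index for both schemas, with the field limit raised.
es.indices.create(
    index="entities",
    settings={"index.mapping.total_fields.limit": 2000},
)

# Option B: one index per schema, queried together at search time.
es.indices.create(index="person")
es.indices.create(index="dog")
response = es.search(index=["person", "dog"],
                     query={"match": {"name": "rex"}})
```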

NLP and context based search using Elastic search

I have been using ES for regular text/keyword search. Is there a way to use Elasticsearch for context-based search? That is, when a user enters a search text like "articles between 10 august and 24 September" (and similar scenarios), ES should be able to identify what the user is asking for and present the results. I suppose ML is needed to handle such scenarios; if any NLP or ML integrations are required, where should I start in order to improve the search experience?
Any insight into this is much appreciated.
This is called semantic parsing. What you need to do is map the sentence to a logical form. This is a challenging task, since the computer needs to understand your sentence. You may build your own semantic parser (e.g., SEMPRE) to do the translation, or use existing methods for such translations (translating human language into Elasticsearch queries).
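As a small illustration of the target (not a full semantic parser): pull the two dates out of the sentence with a library such as dateparser, then translate them into a structured range query. The index and field names here are assumptions.

```python
from dateparser.search import search_dates
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

question = "articles between 10 august and 24 September"

# Find the date mentions in the free-text question.
matches = search_dates(question)  # [("10 august", datetime), ("24 September", datetime)]
start, end = matches[0][1], matches[1][1]

# Map the parsed "logical form" onto a structured Elasticsearch query.
response = es.search(
    index="articles",                   # placeholder index
    query={"range": {"published_at": {  # placeholder date field
        "gte": start.strftime("%Y-%m-%d"),
        "lte": end.strftime("%Y-%m-%d"),
    }}},
)
```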

Natural Language Processing Using Elasticsearch and Google Cloud Api

I want to use NLP with Elasticsearch. I have been able to achieve one level of this by using the OpenNLP plugin, as mentioned in the comments of this question: entities like person, organization, and location get indexed when documents are inserted.
My doubt concerns searching the same information, since I also need to process the terms entered by the user at query time. Here is what I have thought of:
Process the query entered by the user with Apache OpenNLP, as specified here.
Extract person, location, and organization names from the previous step, and then run a query against the entities stored in the index (a sketch of this step follows below).
I am also thinking of using the Google Knowledge Graph Search API to fetch information related to the entities extracted in the previous steps and include it in the search query as well. (The reason is that we want to show results for Delhi when someone searches for "Capital of India".) We are not going with a synonyms-based approach here because we want this information to be available dynamically.
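A sketch of step 2 in Python, using spaCy as a stand-in for Apache OpenNLP (the field names person, location, and organization are assumptions about how the entities were indexed):

```python
import spacy
from elasticsearch import Elasticsearch

nlp = spacy.load("en_core_web_sm")           # small English NER model
es = Elasticsearch("http://localhost:9200")  # assumed local instance

query_text = "Where did Barack Obama work in Chicago?"
doc = nlp(query_text)

# Map NER labels onto the entity fields produced at index time.
field_for_label = {"PERSON": "person", "GPE": "location", "ORG": "organization"}
clauses = [
    {"match": {field_for_label[ent.label_]: ent.text}}
    for ent in doc.ents if ent.label_ in field_for_label
]

# Query the pre-extracted entity fields instead of the raw text.
response = es.search(index="documents", query={"bool": {"should": clauses}})
```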
My question is this:
Is there something better we can do to achieve the same result? A lot of processing at query time is going to increase the response time.

Keyword search over a collection of OWL ontologies

I have a collection of OWL ontologies. Each ontology is stored in a dataset of a triple store database (e.g. OWLIM, Stardog, AllegroGraph). Now I need to develop an application that supports searching these ontologies by keyword, i.e., given a keyword, the application should return the ontologies that contain it.
I have checked OWLIM-SE and Stardog; they only provide full-text search over one dataset, not the whole database. I have also considered Solr (Lucene), but in that case the ontologies would be indexed twice (once by Lucene, once by the triple store database).
Is there any other solution for this problem?
Thanks in advance.
Stardog's full-text indexing works over an entire database and can be used transparently from SPARQL, which lets you access other properties of the concepts matching your search criteria in a single query. This will get you precisely what you're describing.
For information on administering the search indexes, and on Stardog in general, check out these docs.
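A rough sketch of what that looks like from Python (the endpoint URL and database name are assumptions; the textMatch predicate is the one documented for Stardog's full-text search):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed Stardog SPARQL endpoint for a database named "ontologies".
sparql = SPARQLWrapper("http://localhost:5820/ontologies/query")
sparql.setReturnFormat(JSON)

# Stardog exposes its full-text index inside SPARQL, so one query can both
# match the keyword and fetch other properties of the matching concepts.
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?s ?label
    WHERE {
      ?s rdfs:label ?label .
      ?label <tag:stardog:api:property:textMatch> "keyword" .
    }
""")
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["label"]["value"])
```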
