From the AEM documentation I can figure out how to write queries for AEM content search, but how does the search feature actually work in AEM? Which bundle or framework does the magic of searching the content and presenting it back? How is content traversed internally when I use search queries?
AEM uses Oak indexes to implement its search engine. The AEM repository is a database, and like every other database it needs indexes to perform speedy searches. You can read more at: https://docs.adobe.com/docs/en/aem/6-2/deploy/platform/queries-and-indexing.html
In general, you define indexes (in case the OOTB indexes are not enough) under the /oak:index node. These definitions, in a broad sense, contain the list of properties to index, the nature of the index (async, full-text, property, lexical rules) and the path to be indexed (or excluded from indexing).
AEM generates a lot of Lucene index data in your repository and data store, and that data is used to quickly look up the nodes for your queries. Whenever a query is fired, the AEM instance loops through the indexes and finds the one that will provide the results at the lowest traversal cost. If no such index is found, it resorts to node traversal, which is normally bad for performance but has some limited edge-case uses.
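For example, here is a minimal sketch of how such a query is typically fired from application code through the JCR API (the path and search term are made up for illustration); picking the cheapest index, or falling back to traversal, happens inside Oak's query engine:

```java
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import javax.jcr.query.RowIterator;

public class ContentSearchSketch {

    // Runs a full-text JCR-SQL2 query; Oak inspects the available index
    // definitions and executes the plan with the lowest estimated cost.
    public void search(Session session) throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        String stmt = "SELECT * FROM [cq:Page] AS page "
                    + "WHERE ISDESCENDANTNODE(page, '/content/my-site') "
                    + "AND CONTAINS(page.*, 'summer sale')";
        Query query = qm.createQuery(stmt, Query.JCR_SQL2);
        QueryResult result = query.execute();
        for (RowIterator rows = result.getRows(); rows.hasNext(); ) {
            System.out.println(rows.nextRow().getPath());
        }
    }
}
```

Prefixing the same statement with explain (in the query tooling) shows which index Oak actually picked for it.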
You can integrate Solr or Elasticsearch with your AEM instance to use other advanced features, but that is simply an extension to the built-in engine.
Search&Promote (which is more of an external search) is not related to the internal indexes and behaves more like a site crawler.
Queries and search are a very broad topic, so I suggest you read this reply as a summary; more details can be found at the link above.
I agree with the previous answer from Imran.
The question is very general. If you are interested in more details, such as how Apache Lucene works in AEM, what options exist for integrating with external search engines and how to do it, they are available here:
a GitHub repository and six write-ups, covering step by step how to use search engines in AEM.
Hi, I'm trying out Elastic Enterprise Search with Elasticsearch. I have a couple of questions on data indexing.
When referring to the Elasticsearch documentation, I read that there is a limit to the number of fields an Elasticsearch index can have. Since Elasticsearch is used with Elastic Enterprise Search, I believe there is no arguing that the same applies here. In that case, let's say I have multiple document types with various fields, for example Person.json and Dog.json, which both have different properties. When indexing, I use one search engine in Elastic Enterprise Search to index both Person and Dog, so that when I query using the Elastic Enterprise Search API I get results that are both Person and Dog, depending on the search term.
Is this the way to go, or should I specify a separate search engine for each schema type?
I am assuming that your person.json and dog.json contain different fields, as your heading suggests. Whether to create a separate index for these entities or keep them in a single index depends on the various use cases in your application; you will not find Elasticsearch marking one approach as better than the other, it will mainly explain the pros and cons in a particular context (relevance, performance, management, etc.).
Please refer to this SO answer of mine, where I talk about the pros and cons of both approaches, and to the discussion in chat for more context on why the OP chose one approach for their use case after weighing those pros and cons.
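To make the single-index option concrete, here is a rough sketch (the index name, fields and localhost endpoint are made up for illustration) that puts both document types into one index with a discriminator field, using Elasticsearch's plain REST API from Java:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SingleIndexSketch {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String ES = "http://localhost:9200";   // assumed local cluster

    static void put(String path, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(ES + path))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        System.out.println(HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body());
    }

    public static void main(String[] args) throws Exception {
        // Both entity types live in one index; a "kind" field keeps them apart.
        put("/creatures/_doc/person-1",
            "{\"kind\":\"person\",\"name\":\"Alice\",\"email\":\"alice@example.com\"}");
        put("/creatures/_doc/dog-1",
            "{\"kind\":\"dog\",\"name\":\"Rex\",\"breed\":\"beagle\"}");
        // A search on "name" now returns persons and dogs together;
        // filter on "kind" when only one type is wanted.
    }
}
```

The alternative is simply one index per entity (persons, dogs) searched together when needed; the mappings stay smaller, but you have two sets of settings to manage.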
Couchbase FTS is now an official feature in version 5. Why would one still use ElasticSearch along with Couchbase?
Quoting from the documentation:
Couchbase FTS is similar in purpose to other search software such as ElasticSearch or Solr. Couchbase FTS is not intended as a replacement for third party search software if search is at the core of your application. It is a simple and lightweight way to add search to your Couchbase data without deploying additional software and servers. If you have many queries which look like SELECT ... field1 LIKE %pattern% OR field2 LIKE %pattern, then full-text search may be right for you.
It will depend on your specific use case, but there is a reason why search is a complicated problem: some products have spent years and years working on it (and continue to do so).
Full-text search does not equal search engine. Couchbase Full Text Search does not support many of the functions that Elasticsearch provides. For example, in Elasticsearch you can set the weight of fields in the result set, do geo search, etc. Couchbase full-text search is just a full-text search implementation, i.e. a basic string-matching function over specially indexed fields only.
So, if your task is a basic substring search as part of a query, then you don't need Elasticsearch any more. That makes development quicker and the infrastructure cheaper. However, if you are building a system that needs a proper search engine, then you need Elasticsearch as much as before.
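As a concrete illustration of the gap, here is a rough sketch (the products index, the fields and the boost values are invented) of the kind of relevance-tuned query Elasticsearch supports out of the box over its REST API; plain substring matching has no way to express the per-field weighting:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BoostedSearchSketch {
    public static void main(String[] args) throws Exception {
        // multi_match query: hits in "title" count three times as much as hits in "body".
        String query = "{"
                + "\"query\": {\"multi_match\": {"
                + "  \"query\": \"red running shoes\","
                + "  \"fields\": [\"title^3\", \"body\"]"
                + "}}}";
        HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:9200/products/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());   // hits come back ranked by relevance score
    }
}
```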
I just want to know what exactly Elasticsearch is.
It is said to help with searching data, but when I watch some webinars it feels like I have to replicate my data into a kind of Elastic datastore... which does not sound very optimized to me. That way, every modification made on one side has to be propagated to the other, and the data returned by Elasticsearch may not be in the right format.
Can Elasticsearch directly search in my database?
It is to be used with a Neo4j graph database. Has somebody already done something like that? Does it only replace the Cypher queries?
Thanks for any advice helping me understand what Elasticsearch can really bring to our project.
Elasticsearch is a database; however, it's not a relational database like you may be used to. It is a NoSQL database.
You insert JSON documents into an index. You query that index to find documents that match a particular criterion.
It is also sharded and node distributed, which gives it resilience and scalability, and also - if you set it up right - performance.
This means it's really good at 'search engine' style database queries, but because it's not relational, it cannot do the equivalent of a SQL JOIN operation very easily.
One example use case is logstash and kibana - known as the ELK stack - where system event logs (syslog, httpd logs, that kind of thing) are processed by logstash to parse metadata - like log source, referrer, URL, session ID, etc. - and then inserted into elasticsearch.
As each event is a self-contained piece of information, this is what Elasticsearch does particularly well.
You can then use Kibana as a visualisation engine to display your logs, but also perform analysis - most hit pages, geographic distribution of requests, incoming referrers, time based distribution of requests, etc.
But it also collates these logs, so if you run a really large, geographically distributed website with multiple webserver nodes - or maybe you just have a lot of servers in your computer room and want to summarise the system logs - you can feed the whole lot into Elasticsearch.
Its design is such that it's good at handling near-real-time data insertion and analysis. It also works quite well for 'forum style' data models, as essentially all you're doing is querying a list of posts with a particular forum name, and finding replies to a particular parent node - but they're standalone 'documents'.
So yes, you probably could use it to search an existing database, but you'll have to think about your data model - you can't just translate a conventional relational model, you would have to flatten it. Denormalisation is something of a sin in RDBMS terms, but it's actually quite good for search engines, because you can execute queries in parallel more efficiently.
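For instance, a sketch of what that flattening might look like (the order and customer fields are hypothetical): instead of an orders table joined to a customers table, each order becomes one self-contained document that already carries the customer fields you want to search on:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DenormalisedOrderSketch {
    public static void main(String[] args) throws Exception {
        // One self-contained document per order: the customer fields are copied in
        // (denormalised), so no JOIN is needed at query time.
        String orderDoc = "{"
                + "\"order_id\": 1042,"
                + "\"total\": 59.90,"
                + "\"placed_at\": \"2016-03-01T10:15:00Z\","
                + "\"customer_name\": \"Alice Smith\","   // copied from the customer record
                + "\"customer_city\": \"Leeds\""
                + "}";
        HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:9200/orders/_doc/1042"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(orderDoc))
                .build();
        System.out.println(HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```

The price is that when the customer's city changes you have to update every order document embedding it, which is the trade-off denormalisation always carries.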
There are ways to combine both approaches. Have a look at this blog post:
http://graphaware.com/neo4j/2015/09/30/recommendations-with-neo4j-and-graph-aided-search.html
Databases cannot be optimized for all use cases, but luckily there are many databases available so we can choose the best one for each task.
Elasticsearch is optimized for:
Filtering of documents (exact match)
Search ranking of documents (relevance of search terms)
Aggregation of results (sums, distinct counts, percentiles, ...)
Neo4j is optimized for:
Graph traversal (naturally)
High performance when operated on a "local" graph neighborhood (context)
Actually both databases use the same underlying library Lucene to "index" data to be searched later.
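To illustrate the aggregation point in the list above, here is a quick sketch (the index and field names are invented) of asking Elasticsearch for a distinct count and a percentile breakdown in a single REST request:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AggregationSketch {
    public static void main(String[] args) throws Exception {
        // "size": 0 -> skip the individual hits, return only the aggregations.
        String body = "{"
                + "\"size\": 0,"
                + "\"aggs\": {"
                + "  \"unique_users\": {\"cardinality\": {\"field\": \"user_id\"}},"
                + "  \"latency_pcts\": {\"percentiles\": {\"field\": \"response_ms\"}}"
                + "}}";
        HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:9200/access-logs/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```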
ES is an open-source, distributed, RESTful, JSON-based search engine. It is easy to use, scalable and flexible. Its indexing makes retrieval of search results fast.
I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing of all documents, such as adding a field to all documents, which from my understanding means retrieving each document, performing the desired operation, deleting the original document from the index and reindexing it. This seems somewhat dangerous, and a backup of the original index seems preferable before performing this (obviously).
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or if I should keep the raw documents that make up an index stored separately, to be able to recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
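In recent Elasticsearch versions the pattern described there boils down to: create a new index with the new mapping, copy the documents across, then switch an alias atomically so clients never notice. A rough sketch over the plain REST API (the index and alias names are invented, and the _reindex call assumes a version that provides it):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AliasSwapSketch {

    static final HttpClient HTTP = HttpClient.newHttpClient();

    static String post(String path, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create("http://localhost:9200" + path))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // items_v2 has already been created with the new mapping.
        // 1. Copy the documents across.
        post("/_reindex",
             "{\"source\": {\"index\": \"items_v1\"}, \"dest\": {\"index\": \"items_v2\"}}");
        // 2. Atomically repoint the alias the application queries.
        post("/_aliases",
             "{\"actions\": ["
             + "{\"remove\": {\"index\": \"items_v1\", \"alias\": \"items\"}},"
             + "{\"add\": {\"index\": \"items_v2\", \"alias\": \"items\"}}"
             + "]}");
    }
}
```

Because the application only ever queries the items alias, the swap is the only moment the change becomes visible.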
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows you to update a document based on a provided script. The operation gets the document (collocated with the shard) from the index, runs the script (with an optional script language and parameters), and indexes the result back (it also allows you to delete the document, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note that this operation still means a full reindex of the document; it just removes some network round trips and reduces the chance of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
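A minimal sketch of such a scripted partial update over the REST API (the articles index, document id 42 and the view_count field are made up; the URL shown is the newer POST /<index>/_update/<id> form):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UpdateApiSketch {
    public static void main(String[] args) throws Exception {
        // Scripted update: Elasticsearch fetches the document, runs the script
        // against _source on the shard, and indexes the result back.
        String body = "{"
                + "\"script\": {"
                + "  \"lang\": \"painless\","
                + "  \"source\": \"ctx._source.view_count += params.by\","
                + "  \"params\": {\"by\": 1}"
                + "}}";
        HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:9200/articles/_update/42"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```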
Using Elasticsearch as a final storage solution: it depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore, or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML, etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of this kind of practice of backing up a search engine's index.
The usual setup when you use Elasticsearch, Solr or whatever search engine you have:
You have some kind of datasource (it could be a database, a legacy mainframe, Excel sheets, some REST service with data, or whatever).
You have a search engine that indexes this datasource to add search capability to your system. When the datasource changes, you can reindex it fully, or index only the changed part with the help of incremental indexing.
If something happens to the search engine index, you can easily reindex all your data.
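A rough sketch of what the incremental part of that setup could look like, assuming the datasource is a SQL table with an updated_at column (the table, columns, JDBC URL and index name are all invented):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class IncrementalIndexerSketch {
    public static void main(String[] args) throws Exception {
        Instant lastRun = Instant.parse("2016-01-01T00:00:00Z"); // persisted between runs in practice
        HttpClient http = HttpClient.newHttpClient();

        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/shop", "app", "secret");
             PreparedStatement ps = db.prepareStatement(
                     "SELECT id, title, description FROM products WHERE updated_at > ?")) {
            ps.setTimestamp(1, Timestamp.from(lastRun));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Re-index only the rows that changed since the last run.
                    String doc = String.format("{\"title\": \"%s\", \"description\": \"%s\"}",
                            rs.getString("title"), rs.getString("description"));
                    HttpRequest req = HttpRequest.newBuilder(
                                URI.create("http://localhost:9200/products/_doc/" + rs.getLong("id")))
                            .header("Content-Type", "application/json")
                            .PUT(HttpRequest.BodyPublishers.ofString(doc))
                            .build();
                    http.send(req, HttpResponse.BodyHandlers.ofString());
                }
            }
        }
    }
}
```

A real indexer would persist lastRun, JSON-escape the field values and batch the documents through the _bulk endpoint, but the shape of the loop stays the same.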
I use Hibernate as well as the Grails Searchable plugin, which is based on Lucene and Compass. I was wondering when I should use which for querying objects from the database.
Is there a rule of thumb for when to use Hibernate and when to use Searchable?
The Searchable plugin will be highly useful when you need free-form text search throughout your application.
To cite an example: suppose you are working on a banking application and building a portal with a search feature, and you want the search to be free-form over all the key elements like customer name, SSN, phone number and/or email id. Then you would index those elements using Searchable and have the search talk to Searchable to get immediate results. For this to happen you would have to index those key elements at the very least. The indexes will grow as and when you add more key search elements.
On the other hand, Hibernate will help you provide the detailed information if you do not want to index a lot of elements. To extend the above example: once you have searched on SSN and got a hit, on selecting that entry you can use Hibernate to fetch the detailed information from the underlying persistence layer.
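A sketch of that two-step flow on the Hibernate side (Customer is a hypothetical mapped domain class, and the id is assumed to come back from the Searchable/Lucene hit):

```java
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class CustomerLookupSketch {

    private final SessionFactory sessionFactory;   // configured elsewhere

    public CustomerLookupSketch(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // 'hitId' comes from the Searchable hit on the indexed SSN; Hibernate then
    // loads the full record, so only the key fields ever need to be indexed.
    public Customer loadDetail(Long hitId) {
        Session session = sessionFactory.openSession();
        try {
            return (Customer) session.get(Customer.class, hitId);
        } finally {
            session.close();
        }
    }
}
```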
Inference:
For speedy, high-performance, free-form search, Searchable is an option.
For gathering detailed information after the search, I think Hibernate is the way to go, unless you want to use Searchable for the detail info as well, in which case the size of the indexes will run into gigabytes.
Have a look here at Elasticsearch, which might help with understanding.
My point is to keep Elasticsearch/Searchable light, with the heavy-lifting part taken care of by Hibernate.
NOTE
On a side note, I would suggest using the elasticsearch plugin instead of Searchable. It also has a Groovy API, which is useful. Also note that the elastic plugin currently uses v0.20.0 of Elasticsearch, the latest being v0.90.2 I guess. If required, you can use Elasticsearch directly as a dependency and get the latest features.