I'm writing a diploma thesis about full-text search engines, and I'm struggling with one fundamental thing.
If we want to have search on our website, the reason we need dedicated full-text search engines alongside our classic MySQL/PostgreSQL/Oracle databases is that databases in general don't have the full-text search capabilities needed for quality search. However, full-text search engines need many of the features classic databases have in order to search by specific fields, sort results, scale properly, etc.
I've seen people run a classic database and additionally maintain e.g. Elasticsearch. I've been reading a lot about all of the features full-text search engines need to have, and they really overlap with the features of classic databases. So, my question is: do we even need classic databases alongside full-text search engines? Do databases have additional features that full-text search engines don't? Can we just keep all of our data in a single full-text search engine like Elasticsearch?
If DocumentDB can do its own indexing and Azure Search can also do indexing, then why would I want to use them together? Any use cases?
Also, using DocumentDB is already expensive and if I use Azure Search with it, how does this affect my DocumentDB performance and cost?
DocumentDB shines as a general purpose document database, while Azure Search shines as a full text search (FTS) engine.
For example, Azure Search provides:
Linguistically-aware indexing and search that takes into account word forms (e.g., singular vs. plural, verb tenses, and many other kinds of grammatical inflections) in ~60 languages.
High-quality lemmatization and tokenization. For example, word-breaking Chinese text is hard because white space is optional.
Synonyms
Proximity search, similar pronunciation search (soundex / metaphone), wildcard and regex search using Lucene query syntax
Customizable ranking, so that you can boost newer documents, for example
Suggestions
... dozens of other text processing and natural language-related features
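To make the proximity, fuzzy/soundalike, wildcard, and regex capabilities above concrete, here are a few illustrative queries in standard Lucene query syntax (the field name is made up):

```text
"hotel airport"~5     proximity: both terms within 5 positions of each other
roam~1                fuzzy: terms within edit distance 1 (foam, roams, ...)
te?t  test*           wildcards: single-character and multi-character
/[mb]oat/             regular expression
title:search^2        field-scoped query with a relevance boost
```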
If all you need are simple numerical filters or exact string comparisons, just use DocumentDB.
If you need natural language search for some of your content, use Azure Search together with DocumentDB. Connecting them is easy with the DocumentDB indexer.
In terms of cost implications, using Azure Search with DocumentDB doesn't change the cost of DocumentDB. If you use the DocumentDB indexer, it will consume a certain amount of Request Units; how much depends on your data, the query you use, and your indexing schedule.
Our company has several products and several teams. One team is in charge of search, and is standardizing on Elasticsearch as a NoSQL DB to store all their data, with plans to use Neo4j later to complement their searches with relationship data.
My team is responsible for the product side of a social app (people have friends, work for companies, and will be colleagues with everyone working at their companies, etc.). We're looking at graph DBs as a solution (after abandoning the burning ship that is n^2 relationships in an RDBMS), specifically Neo4j (the Cypher query language is a beautiful thing).
A subset of our data is similar to the data used by the search team, and we will need to make sure search can run over their data and our data simultaneously. The search team is pushing us to standardize on Elasticsearch for our DB instead of Neo4j or any graph DB. I believe this is for the sake of standardization and consistency.
We're obviously coming from very different places here: search concerns vs. product concerns. He asserts that Elasticsearch can cover all our use cases, including graph-like queries to find suggestions. While that's probably true, I'm really looking to stick with Neo4j and use an Elasticsearch plugin to integrate with their search.
In this situation, are there any major gotchas to choosing ElasticSearch over Neo4j for a product db (or vice versa)? Any guidelines or anecdotes from those who have been in similar situations?
We are heavy users of both technologies, and in our experience you would do better to use each for what it is good at.
Elasticsearch is a very good piece of software when it comes to search functionality, log management, and facets.
Despite its Graph plugin, if you want to model a lot of social-network-style relationships in Elasticsearch indices, you will run into two problems:
You will have to update documents every time a relationship changes, which can add up quickly when a single entity changes. For example, say you have organizations with users who make contributions on GitHub, and you want to search for organizations having the top contributors in a certain language. Every time a user makes a contribution on GitHub, you will have to reindex the whole organization, recompute the percentage of contributions per language for all users, etc. And this is a simple example.
If you intend to use nested fields and parent/child mappings, you will lose search performance. For reference, here is a quote from the "Tune for search speed" documentation (https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_document_modeling):
Documents should be modeled so that search-time operations are as cheap as possible.
In particular, joins should be avoided. nested can make queries
several times slower and parent-child relations can make queries
hundreds of times slower. So if the same questions can be answered
without joins by denormalizing documents, significant speedups can be
expected.
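To make the reindexing cost concrete, here is a toy sketch (plain Python, no Elasticsearch client; the document shape and field names are made up for illustration) of what "reindex the whole organization" means for a denormalized model like the one above:

```python
# Toy denormalized "organization" document, as it might be stored in an
# Elasticsearch index. All field names here are hypothetical.
org = {
    "name": "acme",
    "users": [
        {"login": "alice", "contributions": {"python": 40, "go": 10}},
        {"login": "bob",   "contributions": {"python": 20}},
    ],
}

def recompute_language_shares(org):
    """Recompute per-language contribution percentages for the whole org."""
    totals = {}
    for user in org["users"]:
        for lang, count in user["contributions"].items():
            totals[lang] = totals.get(lang, 0) + count
    grand_total = sum(totals.values())
    org["language_share"] = {
        lang: round(100.0 * count / grand_total, 1)
        for lang, count in totals.items()
    }

# A single new contribution by one user forces recomputation over every
# user in the org (and, with Elasticsearch, a reindex of the whole
# denormalized org document):
org["users"][1]["contributions"]["go"] = 5
recompute_language_shares(org)
```

The point is that the write cost scales with the size of the denormalized document, not with the size of the change.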
Relationships are very well handled in a graph database like Neo4j. Neo4j, on the contrary, lacks the search features Elasticsearch provides; full-text search is possible but not as performant, and it introduces some burden in your application.
A side note: when you talk about "store", Elasticsearch is a search engine, not a database (although it is often used as one), while Neo4j is a fully transactional database.
However, combining both is the winning approach. We have actually written an article describing this process, which we call Graph-Aided Search, with a set of open-source plugins for both Elasticsearch and Neo4j providing a powerful two-way integration out of the box.
You can read more about it here: http://graphaware.com/neo4j/2016/04/20/graph-aided-search-the-rise-of-personalised-content.html
I'm building a product search platform. I used the Solr search engine before, and I found its performance fine, but it doesn't generate a user interface. Recently I found that Algolia has more features, easy setup, and generates a user interface.
So if someone used Algolia before:
Is Algolia performance better than Solr?
Is there any difference between Algolia and Websolr?
I'm using Algolia and Solr in production for an e-commerce website.
You're right about what you say on Algolia. It's fast (really) and has a lot of powerful features.
You have a complete dashboard to manage your search engine.
For Solr, it's OK, but it's also a black box. You can fine-tune your search engine, but it performs poorly on semantic searches (I tested it).
If you have to make a choice, it depends on a lot of things.
With Algolia, there are no servers to manage, and configuration and integration are easy. It's fast with 20 million records for me (less than 15 ms per search).
With Solr, you can customize a little bit more, but it's a lot of work. If I had to make a choice, it would be more between Algolia and Elasticsearch. Solr is losing momentum; it's hard to imagine it growing again in the next few years.
To sum up: if you want to be fast and efficient, choose Algolia. If you want to dive deep into search engine architecture and you have a lot of time (count it in months), you can try Elasticsearch.
I hope my answer was helpful; ask me if you have more questions.
Speed is a critical part of keeping users happy. Algolia is aggressively designed to reduce latency. In a benchmarking test, Algolia returned results up to 200x faster than Elasticsearch.
Out-of-the-box, Algolia provides prefix matching for as-you-type search, typo-tolerance with intelligent result highlighting, and a flexible, powerful ranking formula. The ranking formula makes it easy to combine textual relevance with business data like prices and popularity metrics. With Lucene-based search tools like Solr and Elasticsearch, the ranking formula must be designed and built from scratch, which can be very difficult for teams without deep search experience to get right.
Algolia’s highly-optimized infrastructure is distributed across the world in 15 regions and 47 datacenters. Algolia provides a 99.99% reliability guarantee and can deliver a fast search to users wherever in the world they’re connecting from. Elasticsearch and Solr do not automatically distribute to multiple regions, and doing so can incur significant server costs and devops resources.
What are the uses of the Semantic Web in information retrieval? By Semantic Web I mean structured data sources like DBpedia and Freebase.
I have integrated information in RDF with Lucene in several projects, and I think a lot of the value you get from the integration is that you can go beyond the simple keyword search that Lucene would normally enable. That opens up possibilities for full-text search over your RDF information, but also semantically enriched full-text search.
In the former case: there is no 'like' operator in SPARQL, and the regex function, while similar in capability to SQL's LIKE, is not really tractable to evaluate against a dataset of any appreciable size. However, if you can use Lucene to do the search instead of relying on regex, you get better scale and performance out of a simple keyword search over your RDF.
In the latter case, if the query engine is integrated with the Lucene text/RDF index (think LARQ; both Jena and Stardog support this), you can do far more complex semantic searches over your full-text index, such as 'get all the genres of movies where there are at least 10 reviews and the reviews contain the phrase "two thumbs up"'. That's difficult to pull off with a Lucene index alone, but it becomes quite trivial at the intersection of Lucene and SPARQL.
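The combination can be illustrated with a toy in-memory model (plain Python; this is not LARQ or any real SPARQL engine, and all data is made up): a keyword search narrows down documents by full text, then triple patterns answer the structured part of the question.

```python
# Toy triples (subject, predicate, object) and review texts.
triples = [
    ("movie1", "genre", "comedy"),
    ("movie2", "genre", "drama"),
    ("review1", "reviews", "movie1"),
    ("review2", "reviews", "movie1"),
    ("review3", "reviews", "movie2"),
]
texts = {
    "review1": "two thumbs up, hilarious",
    "review2": "two thumbs up from me as well",
    "review3": "one thumb down, sadly",
}

def keyword_search(phrase):
    """The 'Lucene' side: documents whose text contains the phrase."""
    return {doc for doc, text in texts.items() if phrase in text}

def objects(subject, predicate):
    """The 'SPARQL' side: objects matching a triple pattern."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def genres_with_praised_reviews(phrase, min_reviews=2):
    """Genres of movies with >= min_reviews reviews containing the phrase."""
    counts = {}
    for review in keyword_search(phrase):
        for movie in objects(review, "reviews"):
            counts[movie] = counts.get(movie, 0) + 1
    result = set()
    for movie, n in counts.items():
        if n >= min_reviews:
            result |= objects(movie, "genre")
    return result

print(genres_with_praised_reviews("two thumbs up"))  # {'comedy'}
```

A real integration pushes the text match into Lucene and the pattern match into the SPARQL engine, but the shape of the query is the same.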
You can use DBpedia in information retrieval, since it contains the structured information from Wikipedia.
Wikipedia has knowledge of almost every topic of interest, in the form of articles, categories, and info-boxes, and information retrieval systems use this to extract meaningful information as triples, i.e. subject, predicate and object.
You can query that information via SPARQL using the public DBpedia endpoint.
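For illustration, a small SPARQL query against the DBpedia endpoint (https://dbpedia.org/sparql) might look like this (the chosen resource and property are just examples):

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract WHERE {
  dbr:Information_retrieval dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
```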
I want to develop a Google Desktop Search-like application, and I want to know which indexing techniques/algorithms I should use to get very fast data retrieval.
In general, what you want is an Inverted Index. You can do the indexing yourself, but it's a lot of work to get right - you need to handle stemming, stop words, extending the posting list to include positions in the document so you can handle multi-word queries, and so forth. Then, you need to store the index, probably in a B-Tree on disk - or you can make life easier for yourself by using an existing database for the disk storage, such as BDB. You also need to write a query planner that interprets user queries, performs query expansion and converts them to a series of index scans. Wikipedia's article on Search Engine Indexing provides a good overview of all the challenges, too.
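A minimal sketch of the core idea, an inverted index with positional postings, in plain Python (ignoring stemming, the on-disk B-Tree, and the query planner; the stop-word list and documents are made up):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "is", "of"}

def build_index(docs):
    """Map each term to a posting list: {doc_id: [positions in doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Find docs where the phrase's words occur at consecutive positions."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()
    results = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for pos in positions:
            if all(pos + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                results.add(doc_id)
                break
    return results

docs = {
    1: "the quick brown fox",
    2: "a quick red fox",
    3: "brown fox jumps",
}
index = build_index(docs)
print(phrase_search(index, "quick brown"))  # {1}
```

The positional postings are what make multi-word (phrase) queries possible; without them the index can only answer single-term and boolean queries.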
Or, you can leverage existing work and use ready-made full text indexing solutions like Apache Lucene and Compass (which is built on Lucene). These tools handle practically everything detailed above (and more), which just leaves you writing the tool to build and update the index by feeding all your documents into Lucene, and the UI to allow users to search it.
The Burrows-Wheeler transform, used to compress data in bzip2, can be used to make substring searching of text extremely fast (essentially constant time per pattern character).
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
I haven't seen a simple introduction online, but here is a lot of detail:
http://www.ddj.com/architect/184405504
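A minimal sketch of the transform itself in plain Python (the search structure built on top of it, the FM-index, is a separate topic; this just shows the forward transform on a toy string):

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations.

    A '$' sentinel (assumed absent from the input, and sorting before
    every other character) marks the end of the string.
    """
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```

The naive rotation sort here is O(n^2 log n); production implementations build the transform from a suffix array in linear time.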