How do websites do fulltext search and sort?

How do websites implement search and sort? (example: ecommerce search for a product and sort by price)
I've been wrestling with this for a while. I'm using MySQL, and after long discussions here it seems that MySQL can't handle this. I've also asked here whether Postgres can do this, and again it seems like the answer is no.
So how do websites do it?
EDIT: To be clear, I'm asking how websites do it in a way that uses both fulltext search and some sort of BTREE index for the sorting. To do fulltext search and sort without using one of the indexes would be easy (albeit slow).
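For example, the slow-but-easy version, assuming a hypothetical products table with a FULLTEXT index on (name, description) and a BTREE index on price, would look like this; MySQL can only use one of the two indexes per query here, so the ORDER BY falls back to a filesort over every matching row:

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(user="app", password="secret", database="shop")
    cur = conn.cursor()
    # The FULLTEXT index satisfies MATCH ... AGAINST, but the sort on price
    # cannot use the BTREE index, so MySQL filesorts all matching rows.
    cur.execute(
        "SELECT id, name, price FROM products "
        "WHERE MATCH(name, description) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY price ASC LIMIT 50",
        ("laptop",),
    )
    for row in cur.fetchall():
        print(row)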

I worked for a large ecommerce site that used SQL Server full-text search to accomplish this. Conceptually, the full-text search engine would produce a list of ids, which would be joined against the b-tree indexes to return sorted results. Performance was acceptable, but we pushed it as far as we could go with the largest hardware available at the time (80 CPUs, 512 GB RAM, etc.). With 20-25 million documents, a simple full-text query (2-3 terms) would have response times in the 3-5 second range. That was for the historical data. The live data set (around 1 million documents) would average 200 ms with a wide distribution. We were able to handle 150-200 queries per second.
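A minimal sketch of that query shape, with a hypothetical products table and hypothetical connection details (CONTAINSTABLE is the SQL Server full-text function; joining against the keys it returns is what feeds the sorted b-tree lookup):

    import pyodbc  # assumes a SQL Server ODBC driver is installed

    conn = pyodbc.connect("DSN=shop")  # hypothetical DSN
    cur = conn.cursor()
    # CONTAINSTABLE returns (KEY, RANK) pairs from the full-text index;
    # joining on KEY maps them back to rows, and the ORDER BY sorts them.
    cur.execute("""
        SELECT TOP 50 p.id, p.name, p.price
        FROM CONTAINSTABLE(products, (name, description), 'widget') AS ft
        JOIN products AS p ON p.id = ft.[KEY]
        ORDER BY p.price ASC
    """)
    for row in cur.fetchall():
        print(row.id, row.name, row.price)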
We eventually ended up moving away from SQL Server for search because we wanted additional full-text functionality that SQL Server didn't offer, specifically highly tunable relevance sorting for results. We researched various options and settled on Elasticsearch hosted on AWS.
Elasticsearch offered substantial improvements in features, and performance was great. We went live with 4 xlarge instances on AWS. Query response times were right around 150-175 ms and very, very consistent. We could easily scale the number of nodes up or down to keep performance consistent under varying load.
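For illustration, the shape of such a query through the Python client (host, index, and field names here are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://search-node:9200"])  # hypothetical host
    resp = es.search(
        index="products",
        body={
            "query": {
                "multi_match": {
                    "query": "red widget",
                    "fields": ["name^3", "description"],  # tunable boosts
                }
            },
            "sort": [{"price": "asc"}],  # omit to sort by relevance score
            "size": 50,
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["name"], hit["_source"]["price"])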
SQL Server was still the system of record. We had to develop several services to push changes from SQL Server to ES (incremental loading, bulk loading, etc.). Translating the SQL search logic to ES was straightforward.
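Conceptually, those push services boiled down to something like the bulk helper below, fed by whatever query finds changed rows (all names here are placeholders):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://search-node:9200"])  # hypothetical host

    def changed_rows(since):
        # placeholder: in reality, a query against the change-tracking tables
        yield {"id": 1, "name": "red widget", "price": 9.99}

    actions = (
        {"_index": "products", "_id": row["id"], "_source": row}
        for row in changed_rows(since="2015-01-01")
    )
    helpers.bulk(es, actions)  # a full reload is the same loop over all rows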
In conclusion, if your database can't meet your search needs, then use a tool (Elasticsearch) that does.

Related

cassandra vs elastic search vs any other design suggestions

We have a need to run analytics queries on the data stored in RDS, and that's becoming very, very slow because of GROUP BY queries and the ever-increasing size of the tables.
For example we have following 3 tables in RDS :
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
Each of the tables has a very high volume of data and is updated several times a minute as new data comes in.
Now we want to run aggregation queries like:
SELECT name FROM alm, group, con WHERE alm.group_id = group.id AND alm.con_id = con.id GROUP BY name, group.type, con.ip
We also want to let users run custom aggregation queries in the future, as opposed to the fixed queries we provide.
So far the options we are considering are moving to Cassandra, Elasticsearch, or DynamoDB so that aggregation would be faster. Can someone offer guidance on how to approach this problem, or any crumbs of experience? Does anybody know whether any of these technologies has a clear advantage over the others?
Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases with a well-designed schema can be pretty good, up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade away features such as joining and aggregating.
Elasticsearch is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums, and their estimates). It's amazing at indexing copious amounts of data, but it can't answer queries that require complex cross-index operations. It is also not as robust (rebuilding indexes may be needed from time to time).
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently
if you need online analytics, the most common approach is to pre-compute the aggregations and update them as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query (see the sketch after this list)
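A toy sketch of that second option, keeping running aggregates up to date as events arrive; in practice the counters would live in a fast store rather than a Python dict, and all names here are invented:

    from collections import defaultdict

    # running aggregate keyed by the GROUP BY columns from the question
    counts = defaultdict(int)

    def on_new_alm(name, group_type, con_ip):
        # update the pre-computed aggregate as each record arrives
        counts[(name, group_type, con_ip)] += 1

    def top_groups(n=10):
        # answering a query is now a cheap lookup, not a full table scan
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

    on_new_alm("alarm-a", "critical", "10.0.0.1")
    on_new_alm("alarm-a", "critical", "10.0.0.1")
    print(top_groups())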
Don't be afraid to mix and match, though. Relational databases have their purpose, as do non-relational ones; there is no silver bullet.
One more option is column-oriented databases. This kind of DB is more suitable for analytics cases where you have many data fields and you want to perform aggregations or extract some subset of fields over a large amount of data.
Recently Yandex's ClickHouse has become very popular, and there is a column-oriented service from Amazon: Redshift. There are several other solutions as well.
Store the data in Parquet and use Spark, partitioning efficiently; a rough sketch follows.
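A minimal PySpark sketch of that idea (paths and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alm-analytics").getOrCreate()

    # one-time (or periodic) export, partitioned by a column queries filter on
    raw = spark.read.parquet("s3://bucket/alm-raw")  # hypothetical path
    raw.write.partitionBy("group_type").parquet("s3://bucket/alm")

    # the aggregation then only scans the partitions it needs
    alm = spark.read.parquet("s3://bucket/alm")
    alm.groupBy("name", "group_type", "con_ip").count().show(10)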

Solr Range Query Slow (SolrBridge in Magento CE)

I'm developing a Magento Community Edition site using Solr via the SolrBridge extension. The system is fast when there are only a few thousand SKUs, but after importing ~100k products the searches have slowed significantly. The page load went from under a second to over two seconds, and New Relic monitoring identified this time as waiting for a response from Solr.
Noticing that search suggestions were still lightning fast, I decided to investigate the differences between the autocomplete search and the full search listings, experimenting with altering different aspects of the full search to bring it in line with the autocomplete search.
The system sped up immensely when I disabled the range-facet part of the query, which looks like the following:
facet.range=GBP_0_price_decimal&f.GBP_0_price_decimal.facet.range.start=0&f.GBP_0_price_decimal.facet.range.end=1000000&f.GBP_0_price_decimal.facet.range.gap=100&f.GBP_0_price_decimal.facet.mincount=1
With this code included, the search takes in the region of 1.7-1.8 seconds. Without it, the search takes only a few milliseconds.
This is, I believe, the schema definition for the field. It does seem to be indexed:
<dynamicField name="*_decimal" type="float" indexed="true" stored="true" />
Any idea what the slowing factor is? Solr is running a single core. It's on the same physical box as Magento and the database. The box's specs are relatively high: 64 GB of RAM and dual Xeon E5620s.
Thanks for any assistance. If you need any more information to provide assistance, let me know.
The facet gap (100) is perhaps too small.
You are generating (potentially) 10,000 buckets for the range faceting (facet.range.end divided by facet.range.gap), which is pretty heavy.
You could increase the gap size or use a smaller range end.
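For example, widening the gap from 100 to 10000 cuts the bucket count from 10,000 down to 100. A sketch of the same request with the wider gap (host and core name are placeholders):

    import requests

    params = {
        "q": "*:*",
        "wt": "json",
        "facet": "true",
        "facet.range": "GBP_0_price_decimal",
        "f.GBP_0_price_decimal.facet.range.start": 0,
        "f.GBP_0_price_decimal.facet.range.end": 1000000,
        "f.GBP_0_price_decimal.facet.range.gap": 10000,  # was 100
        "f.GBP_0_price_decimal.facet.mincount": 1,
    }
    resp = requests.get("http://localhost:8983/solr/magento/select", params=params)
    print(resp.json()["facet_counts"]["facet_ranges"])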

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like Memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative. Caching infrastructure is often measured in the single-digit millisecond range, for inserts as well as reads; these search engines are measured in tens of milliseconds at best for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good at, full-text search, in parallel with a secondary storage system. In these setups the data itself was not stored in ES (although it can be; see "store": "no"): after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES was holding a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with whatever the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
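A sketch of that two-level lookup; all names here are invented, and SQLite stands in for the real RDBMS:

    import sqlite3
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])
    db = sqlite3.connect("app.db")  # stand-in for the system of record

    # level 1: full-text search in ES, which returns only the ids
    resp = es.search(index="articles", body={
        "query": {"match": {"body": "smart cache"}},
        "_source": False,  # the documents themselves are not stored/needed
        "size": 20,
    })
    ids = [hit["_id"] for hit in resp["hits"]["hits"]]

    # level 2: fetch the actual records from the RDBMS
    # (re-order by ids afterwards if relevance order matters)
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        "SELECT id, title, body FROM articles WHERE id IN (%s)" % placeholders, ids
    ).fetchall()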
The disadvantage here is the time spent architecting the ES data structure, because ES is not as good as an RDBMS at representing relationships. It doesn't really need to be: its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalized data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
What is worth adding a cache for there is statistics for aggregated queries; "Top 100 hotels in Europe" is a good example.
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example
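Lucene itself is Java; as a rough Python stand-in for the same idea, here is a sketch of an in-memory index using Whoosh, a Lucene-like pure-Python library (the schema and data are invented):

    from whoosh.fields import ID, TEXT, Schema
    from whoosh.filedb.filestore import RamStorage
    from whoosh.qparser import QueryParser

    schema = Schema(id=ID(stored=True), body=TEXT)
    ix = RamStorage().create_index(schema)  # lives entirely in memory

    writer = ix.writer()
    writer.add_document(id="1", body="fast in-memory full text search")
    writer.add_document(id="2", body="relational databases and b-tree indexes")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("full text")
        for hit in searcher.search(query):
            print(hit["id"])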

elasticsearch vs hbase/hadoop for realtime statistics

I'm logging millions of small log documents weekly, in order to:
run ad hoc queries for data mining
join, compare, filter, and calculate values
run many, many full-text searches with Python
run these operations over all the millions of docs, sometimes every day
My first thought was to put all the docs in HBase/HDFS and run Hadoop jobs to generate the stats results.
The problem is: some of results must be near real-time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about transferring all those millions of documents and using DSL queries to generate the stats results.
Is this a good idea? Elasticsearch seems to handle millions/billions of documents with ease.
For real-time search analytics, Elasticsearch is a good choice.
It is definitely easier to set up and manage than Hadoop/HBase/HDFS.
A good Elasticsearch vs HBase comparison: http://db-engines.com/en/system/Elasticsearch%3BHBase
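For instance, a near-real-time stat over the log stream is a single aggregation query; index and field names here are assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])
    resp = es.search(index="logs", body={
        "size": 0,  # stats only, no hits returned
        "query": {"match": {"message": "error"}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "interval": "day"}
            }
        },
    })
    for bucket in resp["aggregations"]["per_day"]["buckets"]:
        print(bucket["key_as_string"], bucket["doc_count"])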

Is excessive use of lucene good?

In my project, all searching and listing of content depends on Lucene. I am not facing any performance issues, but the project is still in the development phase, with a long way to go before production.
I have to find potential performance issues before the project is completed at a larger scale.
Is such extensive use of Lucene feasible or not?
As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (millisecond response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gave some context to your concerns. This is in a production environment.
Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS, and the amount of fine-tuning you can do to improve performance is more limited than with a database engine.
You shouldn't rely only on Lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance-wise, I am seeing acceptable performance on a 400 GB index across 10 servers (a single 4 GB RAM, 2-CPU server can handle 40 GB of Lucene index, but no more; YMMV).
By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my desktop with a quad-core CPU @ 2.4 GHz.
I ran various search queries against a disk index composed of 10MM documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return in around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.
We use Lucene to enable type-ahead searching. This means that for every letter typed, it hits the Lucene index to get results. Multiply that by tens of textboxes on multiple interfaces and tens of employees typing, and we still get no complaints and extremely fast response times. (It actually works faster than any other type-ahead solution we tried.)
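The per-keystroke lookup is essentially a prefix query against the index. A rough Python analogue using Whoosh, a Lucene-like library (field names and data are invented):

    from whoosh.fields import TEXT, Schema
    from whoosh.filedb.filestore import RamStorage
    from whoosh.query import Prefix

    schema = Schema(name=TEXT(stored=True))
    ix = RamStorage().create_index(schema)
    writer = ix.writer()
    for name in ["laptop", "laptop sleeve", "lamp", "ladder"]:
        writer.add_document(name=name)
    writer.commit()

    with ix.searcher() as searcher:
        # fired on every keystroke with whatever has been typed so far
        for hit in searcher.search(Prefix("name", "lap"), limit=10):
            print(hit["name"])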
