Elasticsearch index design - performance

I am maintaining a year's worth of user activity, including browse and purchase data. Each browse/purchase entry is a JSON object: {item_id: id1, item_name: name1, category: c1, brand: b1, event_time: t1}.
I would like to compose different queries, such as getting all customers who browsed item A and/or purchased item B within time range t1 to t2. There are tens of millions of customers.
My current design is to use nested object for each customer:
customer1:
  customer_id: id1,
  name: name1,
  country: US,
  browse: [{browse_entry1_json}, {browse_entry2_json}, ...],
  purchase: [{purchase_entry1_json}, {purchase_entry2_json}, ...]
With this design, I can easily compose all kinds of queries using nested queries. The only problem is that it is hard to expire older browse/purchase data: I only want to keep, for example, a year of browse/purchase data. With this design, I would have to, at some point, read the entire index out, delete the expired browse/purchase entries, and write everything back.
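For reference, with this nested layout the "browsed item A and purchased item B within [t1, t2]" query might look roughly like the sketch below (Python over the plain REST API via requests; the index name customers, the item ids, and the dates are placeholders, and the exact DSL syntax varies slightly between Elasticsearch versions):

import requests

ES = "http://localhost:9200"  # assumption: local cluster endpoint

# Customers who browsed item A and purchased item B within the time range.
query = {
    "size": 0,  # only the count is needed
    "query": {
        "bool": {
            "must": [
                {"nested": {
                    "path": "browse",
                    "query": {"bool": {"must": [
                        {"term": {"browse.item_id": "itemA"}},
                        {"range": {"browse.event_time": {"gte": "2015-01-01", "lte": "2015-12-31"}}}
                    ]}}
                }},
                {"nested": {
                    "path": "purchase",
                    "query": {"bool": {"must": [
                        {"term": {"purchase.item_id": "itemB"}},
                        {"range": {"purchase.event_time": {"gte": "2015-01-01", "lte": "2015-12-31"}}}
                    ]}}
                }}
            ]
        }
    }
}

resp = requests.post(ES + "/customers/_search", json=query).json()
print(resp["hits"]["total"])  # number of matching customers (an object rather than an int in ES 7+)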
Another design is to use parent/child structure.
type: user is the parent of type browse and purchase.
type browse will contain each browse entry.
Although deleting old data seems easier with delete-by-query, for the above query I would have to combine multiple has_child queries with and/or logic, and it would be much less performant. In fact, I initially used the parent/child structure, but the query time seemed really long, so I gave it up and switched to nested objects.
I am also thinking about using nested objects but breaking the data into different indices (like monthly indices) so that I can easily expire old data. The problem with this approach is that I have to query across those multiple indices and aggregate the results to get the distinct users, which I assume will be much slower (I haven't tried it yet). One requirement of this project is to return the counts for these queries within an acceptable time frame (seconds), and I am afraid this approach may not be acceptable.
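For what it's worth, a distinct-customer count across monthly event indices can be expressed as a single search over an index pattern plus a cardinality aggregation. A rough Python sketch follows (the index pattern browse-*, the field names, and the dates are assumptions). It only covers one leg of the and/or (combining browse and purchase conditions across separate event indices still needs a second query or client-side intersection), but it shows the counting part:

import requests

ES = "http://localhost:9200"  # assumption: local cluster endpoint

# One request fans out over all monthly browse indices (browse-2015.01, browse-2015.02, ...).
query = {
    "size": 0,
    "query": {"bool": {"must": [
        {"term": {"item_id": "itemA"}},
        {"range": {"event_time": {"gte": "2015-01-01", "lte": "2015-12-31"}}}
    ]}},
    "aggs": {
        # Approximate distinct count of customers; precision_threshold trades memory for accuracy.
        "unique_customers": {"cardinality": {"field": "customer_id", "precision_threshold": 40000}}
    }
}

resp = requests.post(ES + "/browse-*/_search", json=query).json()
print(resp["aggregations"]["unique_customers"]["value"])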
The ES cluster has 7 machines, each with 8 cores and 32 GB of memory.
Any suggestions?
Thanks in advance!
Chen

Instead of creating a customers index, I would create "browsing" and "purchasing" indices, separated by a timespan (e.g. monthly, as you mentioned in your last paragraph).
In each document I would add the customer fields. Now you have two different approaches:
1. You can add only a reference to the customer (such as the id) and make another query to get their details.
2. If you don't have any storage problem, you can keep all of the customer's data in each document.
If this isn't enough for performance, you can combine it with routing and store each specific user's data on the same shard, so Elasticsearch won't need to fetch data across shards (you can watch the video where Shay Banon explains "user data flow").
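As an illustration of the routing idea, here is a minimal sketch, assuming monthly event indices and a customer_id field (the index and type names are placeholders, not a prescription). Documents are indexed and searched with the same routing value, so a query only touches one shard per index:

import requests

ES = "http://localhost:9200"  # assumption: local cluster endpoint

event = {"customer_id": "c42", "item_id": "itemA", "event_time": "2015-06-01T12:00:00"}

# Route by customer id so all of c42's events land on the same shard of each monthly index.
requests.post(ES + "/browse-2015.06/event?routing=c42", json=event)

# Searching with the same routing value hits only that shard instead of all of them.
query = {"query": {"term": {"customer_id": "c42"}}}
resp = requests.post(ES + "/browse-*/_search?routing=c42", json=query).json()
print(resp["hits"]["total"])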
Niv

Related

Elasticsearch - Modelling video catalogue information into one index vs multiple indexes

I need to model a video catalogue composed of movies, tv shows, episodes, TV channels and live programs information into elasticsearch. Some of these entities are correlated, some not.
The attributes of these entities are quite different, even if there are some common ones.
Now, since I may need to query across entities (imagine a customer searching for something that could be a movie, a TV channel, or a live event program), is it better to have a single index containing a generic entity marked with a logical type attribute, or is it better to have multiple indexes, one for each entity (movie, show episode, channel, program)?
In addition, some of these entities, like movies, can have metadata attributes into multiple languages.
Coming from a relational database model, I would create different indexes, one for every entity, and have a language-variant index for every language. Any suggestions or a better approach for good search performance and usability?
Whether to use several indexes or not very much depends on the application, so I cannot provide a definite answer, just a few thoughts.
From my experience, indexes are rather a means to help maintenance and operations than for data modeling. It is, for example, much easier to delete an index than delete all documents from one source from a bigger index. Or if you support totally separate search applications which do not query across each others data, different indexes are the way to go.
But when you want to query, as you do, documents across data sources, it makes sense to keep them in one index, if only to have comparable ranking across all items in your index. Make sure to re-use fields across your data that have similar meaning (title, year of production, artists, etc.). For fields unique to a source we usually use prefix-marked field names, e.g. movie_... for movie-only metadata.
As for the languages, you need to use language-specific fields, like title_en, title_es, title_de. Ideally, at query time, you know your user's language (from the browser, because they selected it explicitly, ...) and then search in the language-specific fields where available. Be sure to use the language-specific analyzers for these fields, at query time and at index time.
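A minimal mapping sketch for such language-specific fields, via the REST API from Python (the index name catalogue and the field names are just examples, and older Elasticsearch versions use string plus a type name instead of text):

import requests

ES = "http://localhost:9200"  # assumption: local cluster endpoint

mapping = {
    "mappings": {
        "properties": {
            "type":     {"type": "keyword"},                     # movie, episode, channel, program
            "title_en": {"type": "text", "analyzer": "english"},
            "title_es": {"type": "text", "analyzer": "spanish"},
            "title_de": {"type": "text", "analyzer": "german"},
            "movie_year_of_production": {"type": "integer"}      # example of a source-prefixed field
        }
    }
}
requests.put(ES + "/catalogue", json=mapping)

# At query time, search only the user's language field; the same analyzer is applied automatically.
query = {"query": {"match": {"title_de": "der pate"}}}
print(requests.post(ES + "/catalogue/_search", json=query).json()["hits"]["total"])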
I see a search engine a bit as the dual of a database: A database stores data but can also index it. A search engine indexes data but can also store it. A database tends to normalize the schema to remove redundancy, a search engine works best with denormalized data for query performance.

SQL Server Full Table Scan and Load

For the purpose of this question, let's pretend I have the following table:
Transaction:
Id
ProductId
ProductName
City
State
Country
UnitCost
SellAmount
NumberOfTimesPurchased
Profit (NumberOfTimesPurchased * (SellAmount - UnitCost))
Basically, a single de-normalized table with a million-plus rows in it. It is important to note that only two columns will ever be updated: Profit and NumberOfTimesPurchased. When a sale is made, NumberOfTimesPurchased is updated and the new profit amount is re-calculated.
Now, I need to do some minimal reporting on this table, which consists of queries that aggregate and group. As an example:
SELECT
    City, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    Transaction
GROUP BY
    City;

SELECT
    State, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    Transaction
GROUP BY
    State;

SELECT
    Country, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    Transaction
GROUP BY
    Country;

SELECT
    ProductId, ProductName, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    Transaction
GROUP BY
    ProductId, ProductName;
These queries are quick: ~1 second. However, I've noticed that under load, performance significantly drops (from 1 second up to a minute when there are 20+ concurrent requests), and I'm guessing the reason is that each query performs a full table scan.
I've attempted to use indexed views for each query; however, my update statement performance takes a beating since each view has to be maintained on every update. On the same note, I've attempted to create covering indexes for each query, but again my update statement performance is not acceptable.
Assuming full table scans are the culprit, do I have any realistic options to get the query time down while keeping update performance at acceptable levels?
Note that I cannot use column store indexes (I'm using the cheaper version of Azure SQL Database). I'd also like to stay away from any sort of roll-up implementation, as I need the data available immediately.
Finally - the example above is not a completely accurate representation of my table. I have 20 or so different columns that can be 'grouped', and 6 columns that can be updated. No inserts or deletes.
Because there are no WHERE clauses on your queries, the database engine can do nothing but a table scan (or a clustered index scan, which is really the same thing). If there were covering indexes containing all the columns from your query, then the engine would prefer those. If your real queries have WHERE clauses, then appropriate indexing, with those columns as the leading columns of the index, might help.
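To make the covering-index remark concrete, here is a hedged sketch (Python with pyodbc; the connection string is a placeholder and the column names come from the question) of an index whose leading column matches a hypothetical WHERE City = ? filter and which covers the aggregated columns via INCLUDE:

import pyodbc

# Placeholder connection string; replace with your Azure SQL Database details.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=...")
cursor = conn.cursor()

# Leading column = filter/grouping column; INCLUDE covers the aggregated columns so a
# filtered query can be answered from the index without touching the base table.
cursor.execute("""
    CREATE NONCLUSTERED INDEX IX_Transaction_City
    ON dbo.[Transaction] (City)
    INCLUDE (UnitCost, SellAmount, NumberOfTimesPurchased, Profit);
""")
conn.commit()

Whether the extra write cost is acceptable is exactly the trade-off the question already ran into.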
But I think your problem lies elsewhere. As far as concurrency goes you haven't put enough money in the meter. According to the main service tiers doc, the Basic tier for Azure SQL Database is for:
... supporting typically one single active operation at a given time. Examples include databases used for development or testing, or small-scale infrequently used applications.
Therefore you might want to think about splashing out for the Premium edition to support both your concurrency requirement and columnstore indexes, which are perfectly suited to this type of query. Just for fun, I created a test rig based on AdventureWorksDW2012 to try to recreate your problem, which is here. Query performance was atrocious (> 20 secs). I'd be surprised if you weren't getting DTU warnings on your portal.
An upgrade to Standard (S0-S2) did boost performance so you should experiment. You could look at scaling up for busy query times and down when not required.
This table also looks a bit like a fact table, so you might want to consider refactoring this as a fact / dimensional model then use Azure Analysis Services on top to bring that sub-second performance.
Coincidentally there is a feedback item you can vote for to bring columnstore to Standard tier:
https://feedback.azure.com/forums/217321-sql-database/suggestions/6878001-make-sql-column-store-feature-available-for-standa
Recent comments suggest it is "in the work queue" as of May 2017.

Compound rowkey in Azure Table storage

I want to move some of my Azure SQL tables to Table storage. As far as I understand, I can save everything in the same table, separating it using PartitionKey and keeping it unique within each partition using RowKey.
Now, I have a table with a compound key:
ParentId: (uniqueidentifier)
ReportTime: (datetime)
I also understand RowKeys have to be strings. Will I need to combine these into a single string? Or can I combine multiple keys some other way? Do I need to make a new key, perhaps?
Any help is appreciated.
UPDATE
My idea is to take data from several (three for now) database tables and put it in the same storage table, separating them with the partition key.
I will query using the ParentId and a WeekNumber (another column). This table has about 1 million rows that are deleted weekly from the db. My two other tables have about 6 million and 3.5 million rows.
This question is pretty broad and there is no right answer.
The specific question: can you use compound keys with Azure Table Storage? Yes, you can do that, but it involves manually serializing/deserializing your object's properties. You can achieve that by overriding the TableEntity's ReadEntity and WriteEntity methods. Check this detailed blog post on how you can override these methods to use your own custom serialization/deserialization.
I will further discuss my view on your more broader question.
First of all, why do you want to put data from 3 (SQL) tables into one (Azure Table)? Just have 3 Azure tables.
The second thought, as Fabrizio points out, is how you are going to query the records. The Windows Azure Table service has only one index, and that is the PartitionKey + RowKey properties (columns). If you are pretty sure you will mostly query data by a known PartitionKey and RowKey, then Azure Tables suits you perfectly. However, you say that your combination for the RowKey is ParentId + WeekNumber, which means a record is uniquely identified by this combination. If that is true, then you are even more ready to go.
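As a small illustration of such a compound key, here is a sketch using the azure-data-tables Python package (the table name, connection string, and the zero-padded week format are assumptions of mine):

from azure.data.tables import TableServiceClient

# Placeholder connection string.
service = TableServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net")
table = service.create_table_if_not_exists("Reports")

parent_id = "0b6ec0e2-7f4b-4f0a-9a3a-7f2d9c4b1c11"
week_number = 32

entity = {
    "PartitionKey": parent_id,                  # all rows of one parent share a partition
    "RowKey": "W{:02d}".format(week_number),    # zero-padded so string order matches numeric order;
                                                # append ReportTime if (ParentId, Week) is not unique
    "ReportTime": "2017-08-07T00:00:00Z",
    "Value": 42.0,
}
table.upsert_entity(entity)

# A point lookup by the full compound key is the fast path in Table storage.
row = table.get_entity(partition_key=parent_id, row_key="W32")
print(row["ReportTime"])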
Next, you say you are going to delete records every week. You should know that the DELETE operation acts on a single entity. You can use Entity Group Transactions to DELETE multiple entities at once, but there are limits: (a) all entities in a batch operation must have the same PartitionKey, (b) the maximum number of entities per batch is 100, and (c) the maximum size of a batch operation is 4 MB. Say you have 1M records, as you say. In order to delete them, you first have to retrieve them in groups of 100, then delete them in groups of 100. That is, in the best possible case, 10k operations for retrieval and 10k operations for deletion. Even if it only costs 0.002 USD, think about the time taken to execute 10k operations against a REST API.
Since you have to delete entities on a regular basis, fixed to a WeekNumber let's say, I can suggest that you dynamically create your tables and include the week number in the table name (a small sketch follows the list below). Thus you will achieve:
Even better partitioning of information
Easier and more granular information backup / delete
Deleting millions of entities requires just one operation - delete table.
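A rough sketch of that week-per-table layout (again with the azure-data-tables package; the naming scheme and connection string are illustrative, and the year roll-over is ignored for brevity). Expiring a week's worth of data becomes a single delete_table call:

from datetime import date
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net")

year, week, _ = date.today().isocalendar()
current = "Activity{}W{:02d}".format(year, week)   # table names must be alphanumeric
service.create_table_if_not_exists(current)

# Dropping an old week is one call, instead of deleting millions of entities in batches of 100.
expired = "Activity{}W{:02d}".format(year, week - 4)
service.delete_table(expired)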
There is no single solution to your problem. Yes, you can use ParentID as PartitionKey and ReportTime as RowKey (or invert the assignment). But the two big questions are: how do you query your data, and with what conditions? And how much data do you store: 1,000 items, 1 million, 1,000 million? The total storage usage is important, but it's also very important to consider the number of transactions you will generate against the storage.

How to optimize "text search" for inverted index and relational database? [closed]

Update 2022-08-12
I re-thought about it and realized I was overcomplicating it. I found that the best way to enhance this system is by using good old information retrieval techniques, i.e. using the 'location' of a word in a sentence and 'ranking' queries to display the best hits. The approach is illustrated in the following picture.
Update 2015-10-15
Back in 2012, I was building a personal online application and actually wanted to re-invent the wheel because I am curious by nature, for learning purposes, and to improve my algorithm and architecture skills. I could have used Apache Lucene and others; however, as I mentioned, I decided to build my own mini search engine.
Question: So is there really no way to enhance this architecture except by using available services like Elasticsearch, Lucene, and others?
Original question
I am developing a web application in which users search for specific titles (for example: book x, book y, etc.), whose data is in a relational database (MySQL).
I am following the principle that each record fetched from the db is cached in memory, so that the app makes fewer calls to the database.
I have developed my own mini search engine, with the following architecture:
This is how it works:
a) User searches a record name
b) The system checks which character the query starts with and whether the query is already known: if it is there, it gets the records. If not, it adds the query and gets all matching records from the database in one of two ways:
Either the query is already in the table "Queries" (which is a sort of history table), in which case the records are fetched by their IDs (fast),
Or, otherwise, a MySQL LIKE %% statement is used to get the records/ids (the query used by the user is then kept in the history table "Queries" along with the ids it maps to).
--> Then it adds the records and their ids to the cache, and only the ids to the inverted index map.
c) Results are returned to the UI (a minimal sketch of this flow follows below).
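To pin the flow down, here is a compact sketch of steps (a)-(c) in Python, with an in-memory sqlite3 database standing in for MySQL; the table names, column names, and data are made up for illustration, not your actual code:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE queries (query TEXT, record_id INTEGER)")   # the history table
db.executemany("INSERT INTO records(id, title) VALUES (?, ?)",
               [(1, "wood"), (2, "woodX"), (3, "stone")])

record_cache = {}      # id -> record row
inverted_index = {}    # query string -> list of matching record ids

def search(query):
    # (b) query already seen: serve the ids from the in-memory inverted index
    if query in inverted_index:
        ids = inverted_index[query]
    else:
        # (b) otherwise fall back to LIKE %% and remember the mapping in the history table
        rows = db.execute("SELECT id, title FROM records WHERE title LIKE ?",
                          ("%" + query + "%",)).fetchall()
        ids = [r[0] for r in rows]
        db.executemany("INSERT INTO queries(query, record_id) VALUES (?, ?)",
                       [(query, i) for i in ids])
        for r in rows:
            record_cache[r[0]] = r           # records and ids go to the cache
        inverted_index[query] = ids          # only the ids go to the inverted index
    # (c) return the records to the UI
    return [record_cache[i] for i in ids]

print(search("wood"))   # first call runs the LIKE query; a second call is served from memory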
The system works fine; however, I have two main issues that I couldn't find a good solution for (I have been trying for the past month):
First issue:
If you check point (b), in the case where no query "history" is found and it has to use the LIKE %% statement, this process becomes time-consuming when the query matches numerous records in the database (instead of one or two):
It takes some time to get the records from MySQL (this is why I used indexes on the specific columns),
Then time to save the query history,
Then time to add the records/ids to the cache and the inverted index maps.
Second issue:
The application allows users to add new records themselves, which can immediately be used by other users logged in to the application.
However, to achieve this, the inverted index map and the table "Queries" have to be updated so that any old query that matches the new word still maps to it. For example, if a new record "woodX" is added, the old query "wood" should still map to it. So in order to re-hook the query "wood" to this new record, here is what I am doing now:
new record "woodX" gets added to "records" table
then i run a Like %% statement to see which already existing query in table "queries" does map to this record(for example "wood"), then add this query with the new record id as a new row: [ wood, new id].
Then in memory, update inverted index Map's "wood" key's value (ie the list), by adding the new record Id to this list
--> Thus now if a remote user searches "wood" it will get from memory : wood and woodX
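A standalone sketch of this re-hooking step with toy in-memory structures (the names and data are illustrative assumptions, not your actual code):

# Illustrative structures: query -> record ids, and the rows of the "queries" history table.
inverted_index = {"wood": [1]}          # id 1 is the existing record "wood"
query_history = [("wood", 1)]

def on_record_added(new_id, new_title):
    # Re-hook every historical query that still matches the newly added title.
    for query, ids in inverted_index.items():
        if query.lower() in new_title.lower():
            ids.append(new_id)                      # update the in-memory map
            query_history.append((query, new_id))   # and the persisted history table

on_record_added(2, "woodX")
print(inverted_index["wood"])   # [1, 2] -> a search for "wood" now also returns woodX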
The issue here is also time consumption. Matching all query histories (in the table "queries") against the newly added word takes a lot of time (the more matching queries, the more time). The in-memory update also takes a lot of time.
What I am thinking of doing to fix this time issue is to return the desired results to the user first, then let the application POST an ajax call with the required data to perform all these UPDATE tasks. But I am not sure if this is bad practice or an unprofessional way of doing things.
So for the past month (a bit more) I have tried to think of the best optimization/modification/update for this architecture, but I am not an expert in the document retrieval field (actually it's my first mini search engine ever built).
I would appreciate any feedback or guidance on what I should do to be able to achieve this kind of architecture.
Thanks in advance.
PS:
It's a J2EE application using servlets.
I am using MySQL InnoDB (thus I cannot use the full-text search option).
I would strongly recommend Sphinx Search Server, which is well optimized for full-text searching. Visit http://sphinxsearch.com/.
It's designed to work with MySQL, so it's an addition to your current workspace.
I do not pretend to have THE solution, but here are my ideas.
First, I thought like you about the time-consuming LIKE %% queries: I would execute a query limited to a few results in MySQL, like a dozen, return that to the user, and wait to see if the user wants more matching records, or launch the full query in the background, depending on your indexation needs for future searches.
More generally, I think that storing everything in memory could lead, one day, to too much memory consumption. And although the search engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up to date when data is added or updated, and it will certainly take more and more time.
That's why I think the solution I saw one day in an open-source forum software (I can't remember its name) is not too bad for text searching in posts: each time data is inserted, a table named "Words" keeps track of every existing word, and another table (let's say "WordsLinks") keeps the link between each word and the posts it appears in.
This kind of solution has some drawbacks:
Each Insert, Delete, Update in database is a lot slower
Data selection for the search engine must be anticipated: if you later decide to keep two-letter words that you never kept before, it is too late for already-recorded data, unless you launch a complete data re-processing.
You must take care of DELETE as well as UPDATE and INSERT
But I think there are some big advantages:
Computing time is probably about the same as with the "memory solution" (eventually), but it is spread across each database Create/Update/Delete rather than incurred at query time.
Looking for a whole word, or words "starting with", is instantaneous: when indexed, searching the "Words" table is dichotomic, and a "WordsLinks" table query is also very fast with an index.
Looking for multiple words at the same time can be simple: gather a group of "WordsLinks" rows for each found word, and execute an intersection on them to keep only the "database ids" common to all these groups. For example, with the words "tree" and "leaf", the first one could give table records {1, 4, 6}, and the second one could give {1, 3, 6, 9}. With an intersection it is simple to keep only the common parts: {1, 6} (a sketch of this appears a bit further below).
A LIKE %% on a single-column table is probably faster than a lot of LIKE %% on different fields of different tables. And each database engine does some caching: the "Words" table could be small enough to be kept in memory.
I think there is a small risk of performance and memory problems if data becomes huge.
As every search is fast, you can even look for synonyms. For example, search for "network" if the user didn't find anything with "ethernet".
You can apply rules, like splitting camel-case words to generate, for example, the three words "wood", "X", and "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.
I think the solution you need could be a blend of methods: for example, you can keep lightweight UPDATE, INSERT, and DELETE statements, and feed "Words" and "WordsLinks" from a TRIGGER.
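For the record, here is a compact sketch of the "Words"/"WordsLinks" idea and the multi-word intersection. It uses an in-memory sqlite3 database purely for illustration; in MySQL the feeding would live in triggers or application code, and the schema names are assumptions:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Words      (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE WordsLinks (word_id INTEGER, post_id INTEGER);
    CREATE INDEX idx_wl_word ON WordsLinks(word_id);
""")

def index_post(post_id, text):
    # Feed Words / WordsLinks whenever a post is inserted (a trigger or app code in MySQL).
    for word in set(text.lower().split()):
        db.execute("INSERT OR IGNORE INTO Words(word) VALUES (?)", (word,))
        (word_id,) = db.execute("SELECT word_id FROM Words WHERE word = ?", (word,)).fetchone()
        db.execute("INSERT INTO WordsLinks(word_id, post_id) VALUES (?, ?)", (word_id, post_id))

index_post(1, "tree leaf bark")
index_post(3, "leaf spring")
index_post(6, "tree leaf root")

# Multi-word search = intersection of the per-word id sets ({1, 6} for "tree" AND "leaf").
terms = ["tree", "leaf"]
rows = db.execute("""
    SELECT wl.post_id
    FROM WordsLinks wl JOIN Words w ON w.word_id = wl.word_id
    WHERE w.word IN ({})
    GROUP BY wl.post_id
    HAVING COUNT(DISTINCT w.word) = ?
""".format(",".join("?" * len(terms))), terms + [len(terms)]).fetchall()
print([r[0] for r in rows])   # [1, 6]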
Just as an anecdote, I saw a piece of software developed by my company in which it was decided to keep "everything" (!) in memory. It led us to recommend that our customers buy servers with 64 GB of RAM. A little bit expensive. It explains why I am very prudent when I see solutions that could, eventually, lead to filling up memory.
I have to say, I don't think your design fits the problem very well. The issues that you see now are consequences of that. And apart from that, your current solution doesn't scale.
Here is a possible solution:
Redesign your database to only contain authoritative data, but no derived data. So all cache entries must vanish from MySQL.
Keep data only for the duration of a request in memory within your application. This makes the design of your application much simpler (think race conditions) and enables you to scale to a sensible number of clients.
Introduce a caching layer. I'd strongly recommend using an established product rather than building this yourself. This frees you of all the custom-built caching logic in your application and even does the job much better.
You can take a look at Redis or Memcached for the caching layer. I think an LRU strategy should fit here. Depending on how complex your queries become, a dedicated indexed search mechanism like Lucene might make sense as well.
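A minimal cache-aside sketch with Redis (using the redis-py package; the key naming, TTL, and fetch_from_db callback are placeholders of mine, not a prescription):

import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis, ideally configured with an LRU eviction policy

def get_record(record_id, fetch_from_db):
    """Look in the cache first; on a miss, load from MySQL and cache the result."""
    key = "record:{}".format(record_id)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    record = fetch_from_db(record_id)          # your authoritative MySQL query
    r.setex(key, 3600, json.dumps(record))     # expire after an hour; Redis evicts under memory pressure
    return record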
I'm sure this can be implemented in MySQL, but it would be a lot less effort to just use an existing search-oriented database such as Elasticsearch. It uses the Lucene library to implement the inverted index, has extensive documentation, supports horizontal scaling, has a fairly simple query language, and so forth. I guess it has been quite a lot of work to get this far, and it will be even more work to handle caches, race conditions, bugs, performance issues, etc. to make the solution "production grade".

CouchDB query performance

If the number of documents grows, will querying the data get slower in CouchDB?
Example Scenario:
I have a combobox in a form for the customer name. When the user types the customer name, I have to do autofilling.
There will be around 10k customer documents in CouchDB. I understand that I have to create a view to do this.
CouchDB database is in the local machine where the application resides.
Question:
Will it take more than 2 - 3 seconds to query the DB for matching customer names?
Will each query take more time if there are many documents in CouchDB (say around 100,000 documents)?
Any pointers on how to create views/index will be helpful.
Thanks in advance.
The view runs on every document, but only once. After that, the document's view value(s) are stored forever. Fetching a customer by name will be very fast because you would normally have only a few new documents to process in the view at query time.
Query time will not increase noticeably if you have more documents. Technically, access times grow logarithmically with the number of documents. However, in practice fetching documents is basically constant time and very unlikely to be a problem.
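Since the question asks for pointers on creating the view: here is a rough sketch of a by-name view and a prefix (startkey/endkey) query over CouchDB's HTTP API from Python, using requests. The database name, credentials, and the design-document name are assumptions:

import json
import requests

DB = "http://admin:password@localhost:5984/customers"   # placeholder URL and credentials

# One-time setup: a design document whose map function emits the lowercased customer name as the key.
design = {
    "views": {
        "by_name": {
            "map": "function (doc) { if (doc.name) { emit(doc.name.toLowerCase(), null); } }"
        }
    }
}
requests.put(DB + "/_design/autocomplete", json=design)

# Autocomplete: all names starting with the typed prefix, via a key-range query on the view index.
prefix = "smi"
params = {
    "startkey": json.dumps(prefix),
    "endkey": json.dumps(prefix + "\ufff0"),   # high Unicode sentinel closes the prefix range
    "limit": 10,
    "include_docs": "true",
}
rows = requests.get(DB + "/_design/autocomplete/_view/by_name", params=params).json()["rows"]
print([row["doc"]["name"] for row in rows])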

Resources