How to set up CosmosDB when you need "LIKE" searches on string tags - full-text-search

I have a three-table structure (Customer, Invoice, InvoiceItem) that I would like to try moving from the relational DB into CosmosDB. Currently, quite intensive queries are run against the InvoiceItem table. This table has up to 10 optional TagX columns, which are basically text that might include the brand, group, type, or anything else that groups an InvoiceItem and makes it searchable with a query like this (simplified):
SELECT * FROM InvoiceItem WHERE Tag1 LIKE '%shirt%' AND Tag2 LIKE '%training%'
A query like this on a multi-million-row table can take more than 8 minutes. We are working on an archiving strategy and indexes to speed things up, but it looked to me like CosmosDB could be worth trying in this case, since all of the data follows a write-once-read-many pattern.
Back to CosmosDB: how do I deal with those string tags there? As a start, I thought about keeping Invoice and InvoiceItem in the same partition, with a "type" property to distinguish them. But then I have nowhere to put the tags where they would be easily searchable. Any ideas on how to set it up?
Thanks!

Textbook database performance issue caused by either lack of, or inefficient indexing.
With that many rows, index cardinality becomes important. You don't want to index the entire field; index only the first n characters of each column you index, and index only the columns you search on, whether through joins or direct WHERE clauses.
The idea is to keep the indexes as small as possible, while still giving you the query performance you need.
With 18 million rows, a reasonable starting point is an index cardinality near the square root of 18 million.
That means that to reach the index segment you need, you search no more than about 5000 index rows, each of which has 400-5000 rows in its segment, at least for sub-second result times.
Indexing the first 3-4 letters would be a good starting point: the square root of 18,000,000 is about 4242, and the nearest power of 26 (assuming alphabetic characters only), 26^3 = 17,576, overshoots that. Even if the tags are alphanumeric, 3 characters is still a good starting point.
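The arithmetic behind that starting point can be checked with a short sketch. It assumes, as the answer does, 18 million rows and an alphabet of 26 letters; the function name is made up for illustration.

```python
import math

ROWS = 18_000_000      # table size assumed in the answer
ALPHABET = 26          # assumption: tags contain only case-insensitive letters

def prefix_length_for(target_cardinality, alphabet=ALPHABET):
    """Smallest prefix length whose possible distinct values reach the target."""
    length = 1
    while alphabet ** length < target_cardinality:
        length += 1
    return length

target = math.isqrt(ROWS)           # integer square root of the row count
prefix = prefix_length_for(target)  # characters to index as a starting point
print(target, prefix, ALPHABET ** prefix)
```

Real tag data is rarely spread evenly over all letter combinations, so the actual number of distinct prefixes will likely be lower than the theoretical maximum; measuring it on the real column is still needed.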
If the queries then run super fast but the index takes forever to build, drop a character. This is called "index tuning". You pick a starting point and find the lowest cardinality (fewest characters indexed) that gives you the performance you need.
If I'm way off because index performance in this DB differs greatly from that of a relational DB, you'll need to experiment.
As far as I'm concerned, a select query that takes more than a few seconds is unacceptable, except in rare cases. I once worked for a security company. Their license management system took minutes to pull large customers.
After indexing the tables correctly the largest customer took less than 2 seconds. I had to sift through a table with billions of rows for number of downloads, and some of these queries had 7 joins.
If that database can't do this with 18m rows, I'd seriously consider a migration to a better architecture, hardware, software or otherwise.
As index cardinality approaches table cardinality, the performance gain over having no index shrinks and eventually turns negative.
As in all things in life, moderation. At the other end of the spectrum, an index with a cardinality of 2 is just about useless. Half of 8 minutes is 4 minutes, assuming a nearly equal distribution, which is still useless, so indexing a boolean field usually isn't a great thing to do. There are few hard and fast rules though. Lots of edge cases. Experimentation is your friend.

Related

Cassandra Modeling for filter and range queries

I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.
I want to be able to write queries like these:
get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds
or
get all users who are men who are 6'0" are ages 31-37 and have black hair
How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.
EDIT:
Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?
IF I UNDERSTAND THINGS CORRECTLY
If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on each of three nodes. In the second case, with 3 x 2 = 6 possible (haircolor, sex) partitions, I will be putting 1/3 billion records on each node in the best case, with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?
So if you are trying to model your data inside Cassandra, the general rule is that you need to make a table per query. There are also significant restrictions on what you can filter your query by. If you want to understand some of the restrictions, I suggest you take a look at this post:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
or my long answer here:
cassandra - how to perform table query?
All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analytics on your data (it sounds like you might be), then I would look at using Spark in conjunction with Cassandra. This will give you a fast tool for in-memory processing of your data. If you look at using Datastax (Community or Enterprise), Spark also has a connector that makes reading and writing data to and from Cassandra easy.
Edited with Additional Information
Based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds" you would need to build a table with following:
CREATE TABLE user_by_haircolor_weight_height (
    haircolor text,
    weight float,
    height_in int,
    user varchar,
    -- user is part of the key so that users with identical values do not overwrite each other
    PRIMARY KEY ((haircolor), weight, height_in, user)
);
You could then query this by:
SELECT * FROM user_by_haircolor_weight_height WHERE haircolor='red' AND weight>100 AND height_in>=61 AND height_in<=72;
For the query "get all users who are men who are 6'0" are ages 31-37 and have black hair" you would need to build a similar table with a
PRIMARY KEY ((haircolor, sex), height_in, age, user)
In the end, if what you are trying to do is perform either ad-hoc analytics or a set number of analytics (i.e. work that can tolerate a bit more latency than a straight CQL query) on the data stored in your Cassandra table, then I suggest you look at using Spark. If you need something a bit more real-time to handle ad-hoc queries, you can look at using Solr to perform Lucene-powered searches on your table.
My recommendation is:
1) Keep the main table with a proper partition key so that the millions of records are spread across the cluster. Don't use any clustering column here that would push a partition past the row key limit of 2 GB, etc.
2) Depending on the query pattern, create additional tables (like indexes) as needed to hold inverted-index data, because writes are cheap.
3) Use multiple queries to get what you need.
4) As a last option, use the DSE Solr search capability.
Just to reiterate the end of the conversation:
"Your understanding is correct and you are correct in stating that partition keys should be more unique than that. Each partition has a maximum size of 2GB, but the practical limit is lower. In practice you would want your data partitioned into far smaller chunks than the table above. Given the ad-hoc nature of your queries in your example I do not think you would be able to practically do this by data modelling alone. I would suggest looking at using a Solr index on a table. This would allow you a robust search capability. If you use Datastax you are even able to query this via CQL"
Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.
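To put numbers on the partition-size concern, here is a rough sketch of the best-case arithmetic. It assumes 2 billion users spread perfectly evenly over every combination of partition key values, which real data will not do; the function name is hypothetical.

```python
from math import prod

USERS = 2_000_000_000  # user count from the question

def best_case_rows_per_partition(total_rows, distinct_values_per_key_column):
    """Partition count and rows per partition if rows spread evenly over every key combination."""
    partitions = prod(distinct_values_per_key_column)
    return partitions, total_rows // partitions

# Partition key (haircolor), assuming 3 hair colors.
by_haircolor = best_case_rows_per_partition(USERS, [3])
# Partition key (haircolor, sex), assuming 3 hair colors and 2 sexes.
by_haircolor_sex = best_case_rows_per_partition(USERS, [3, 2])
print(by_haircolor, by_haircolor_sex)
```

Even in the best case each partition holds hundreds of millions of rows, which is why the partition keys above are too coarse for a data set of this size.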

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples [[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0" from "METRIC" "METRIC" group by "METRIC"."COLOR" order by "METRIC"."COLOR" ASC NULLS LAST]
2014-06-11 23:17:58,747 DEBUG [sql] - 223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct values on a 300M row table. The day aggregate table has about 500K rows, and virtually the same query would run significantly faster against it, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question, so let me try to clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand-write a SQL query that answers it:
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the query because it does several queries that don't answer the MDX query directly, but rather get information about the dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved not by modifying anything in the Mondrian schema or the application, but by changing the database. The database in this case was Oracle, and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minute(s) to milliseconds. If you're facing an issue like this and your database is Oracle, this is a good approach to speeding up tuple resolution for degenerate dimensions with low cardinality.
It's hard to give any specific directions without knowing more about your schema, but it looks to me as if you have to make sure that the number of rows with a given colour (the count) is defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think that would be too heavy for the backing data store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be scheduled to be run/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table, it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, which makes a separate table pointless: there would be no significant storage savings, and joining the dimension would cause a lot of data to be read.
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
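As a rough illustration of the separate dimension table, here is a sketch using an in-memory SQLite database in place of the real datamart. The table names, column names and color values are invented for the example and are not taken from the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE color_dim (color_id INTEGER PRIMARY KEY, color TEXT UNIQUE);
    CREATE TABLE metric (
        metric_id INTEGER PRIMARY KEY,
        color_id  INTEGER REFERENCES color_dim(color_id),
        value     REAL
    );
""")

colors = ["BLUE", "GREEN", "ORANGE", "RED", "YELLOW"]  # hypothetical color values
conn.executemany("INSERT INTO color_dim (color) VALUES (?)", [(c,) for c in colors])

# The fact table stores only the small surrogate key (1..5), not the color text.
conn.executemany(
    "INSERT INTO metric (color_id, value) VALUES (?, ?)",
    [(i % 5 + 1, float(i)) for i in range(10_000)],
)

# Listing the dimension members reads 5 rows, however large the fact table grows.
members = [row[0] for row in conn.execute("SELECT color FROM color_dim ORDER BY color")]
fact_rows = conn.execute("SELECT COUNT(*) FROM metric").fetchone()[0]
print(members, fact_rows)
```

The member query never touches the fact table, which is the property that makes the tuple-reading step cheap.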
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index, the DB should answer that SQL query much faster.
Finally, you have a 300M row fact table. What DBMS are you using? Is it a column-oriented DB? If not, you should try one as a possible alternative for your data store. Column-oriented DBs perform significantly better than row-oriented DBs for Mondrian-like queries. There are a few good options out there; you should test drive them.

Full table scan when querying against an indexed column

I created an index for one table, a simple index just like that:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1000K records, and the STATUS column has just 5 or 6 possible values. After creating the index, I expected that the query below would perform better:
select * from transacao_nova tn where tn.status = 'XXX'
but the explain plan still shows me a full scan with a cost of 16.000.
Any help? I'm not a DBA, but I need to improve this performance.
Thanks in advance.
If there are only 5 or 6 different status values and a million records the query optimizer may be deciding it is not worth using the index to do a range scan that would still return a substantial number of all the records in the table.
You might look into using an index-clustered table for this application.
If data in the status column is skewed (not uniform: some values appear very often and others appear very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that a histogram is being calculated for the status column). This will make Oracle use the index in the cases in which it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
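A back-of-the-envelope model shows why scattered rows hurt. The sketch below assumes matching rows are spread uniformly at random and that a block holds about 100 rows; both are assumptions for illustration, not measurements from the asker's table, and this is not how Oracle's optimizer computes cost.

```python
def fraction_of_blocks_touched(selectivity, rows_per_block):
    """Chance that a block holds at least one matching row, assuming uniformly scattered matches."""
    return 1.0 - (1.0 - selectivity) ** rows_per_block

STATUS_VALUES = 6       # from the question: 5 or 6 distinct statuses
ROWS_PER_BLOCK = 100    # assumption: rows stored per table block

# A status that matches about 1 row in 6 lands in practically every block.
common = fraction_of_blocks_touched(1 / STATUS_VALUES, ROWS_PER_BLOCK)
# A rare status that matches 1 row in 100,000 touches very few blocks.
rare = fraction_of_blocks_touched(1 / 100_000, ROWS_PER_BLOCK)
print(f"common status: {common:.6f}  rare status: {rare:.6f}")
```

Under these assumptions an index lookup for a common status still visits nearly every block, one at a time, so a full scan can be the cheaper plan; for a rare status the index wins clearly.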
For a more in-depth discussion of this topic I recommend this paper.

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then perform index building?
I'd temper af's answer by saying that "index first, insert after" would probably be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting records in the natural order of that index. The reason is that for each insert, the data rows themselves have to be kept in order on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
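The difference between in-order and random-order inserts can be sketched with a toy model. A sorted Python list stands in for the clustered data here, which is a deliberate simplification: a real clustered index is a B-tree that splits pages rather than shifting every later row, so only the direction of the effect carries over, not the magnitude.

```python
import bisect
import random

def rows_shifted(keys):
    """Insert keys into a sorted list; count existing entries that sit after each insertion point."""
    ordered = []
    shifted = 0
    for key in keys:
        position = bisect.bisect_right(ordered, key)
        shifted += len(ordered) - position   # entries that must move to make room
        ordered.insert(position, key)
    return shifted

n = 2_000
sequential = list(range(n))        # like rows arriving in datetime order
scattered = sequential[:]
random.Random(42).shuffle(scattered)   # like (nearly) random GUID keys

in_order_cost = rows_shifted(sequential)
random_cost = rows_shifted(scattered)
print(in_order_cost, random_cost)
```

Keys that arrive in index order are always appended at the end and move nothing, while randomly ordered keys keep landing in the middle of existing data.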
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indices are in place causes the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indices afterwards, especially with that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.
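A measurement harness for "try both" might look like the sketch below. It uses an in-memory SQLite database purely so that it is self-contained; the table, column and index names are invented, and timings from SQLite say nothing about what SQL Server will do with 30 million rows. The same structure (time load plus index build, in both orders) is what you would run against the real server.

```python
import random
import sqlite3
import time

def load(rows, index_first):
    """Load rows into a fresh table, creating the index before or after the insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
    start = time.perf_counter()
    if index_first:
        conn.execute("CREATE INDEX idx_t_a ON t (a)")
    conn.executemany("INSERT INTO t (a, b) VALUES (?, ?)", rows)
    if not index_first:
        conn.execute("CREATE INDEX idx_t_a ON t (a)")
    conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count

rng = random.Random(0)
rows = [(rng.randrange(1_000_000), f"row-{i}") for i in range(50_000)]

for index_first in (True, False):
    elapsed, count = load(rows, index_first)
    print(f"index_first={index_first}: {count} rows in {elapsed:.3f}s")
```

Run each variant several times and compare the medians, since a single timing is easily distorted by caching and other load on the machine.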

Is a globally partitioned index better (faster) than a non-partitioned index?

I'm interested to find out if there is a performance benefit to partitioning a numeric column that is often the target of a query. Currently I have a materialized view that contains ~50 million records. When using a regular b-tree index and searching by this numeric column I get a cost of 7 and query results in about 0.8 seconds (with non-primed cache). After adding a global hash partition (with 64 partitions) for that column I get a cost of 6 and query results in about 0.2 seconds (again with non-primed cache).
My first reaction is that the partitioned index has improved the performance of my query. However, I realize that this may just be a coincidence and could be totally dependent on the values being searched on, or on other factors I'm not aware of. So my question is: is there a performance benefit to adding a global hash partition to a numeric column on a large table, or is the cost of determining which index partitions to scan outweighed by the cost of just doing a range scan on a non-partitioned index?
I'm sure this, like many Oracle questions, can be answered with an "it depends." :) I'm interested in learning what factors I should consider to determine the benefits of each approach.
Thanks!
I'm pretty sure you have found this reference in your research: Partitioned Tables and Indexes. I'm linking it anyway in case somebody else is interested, as it is very good material about partitioning.
Straight to the point: a partitioned index just decomposes the index into pieces (64 in your situation) and spreads the data across them according to the hash of the partitioning key. When you use it, Oracle calculates the hash of the key and determines which piece to continue searching in.
Knowing how index searching works, on really huge data I think it is better to choose the partitioned index in order to shrink the index tree you traverse compared with a regular index. It really depends on the data in the table (how the regular index tree is composed) and on whether hashing and jumping directly to a lower node is faster than traversing a regular tree from the root.
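A rough way to see the "smaller tree" effect is to estimate index height with and without partitioning. The sketch below assumes about 200 keys per index block, which is an invented figure; real fan-out depends on key size and block size, and the function is a simplification rather than Oracle's actual B-tree layout.

```python
def btree_levels(entries, fanout):
    """Number of levels needed so that fanout ** levels covers all entries."""
    levels = 1
    capacity = fanout
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels

ENTRIES = 50_000_000   # rows in the materialized view, from the question
PARTITIONS = 64        # hash partitions, from the question
FANOUT = 200           # assumption: keys per index block

single_index_levels = btree_levels(ENTRIES, FANOUT)
per_partition_levels = btree_levels(ENTRIES // PARTITIONS, FANOUT)
print(single_index_levels, per_partition_levels)
```

Under these assumptions each partition's tree is one level shallower than the single index, so a lookup saves roughly one block read in exchange for computing a hash. Whether that is measurable depends on caching, since the upper levels of a hot index are usually already in memory.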
Finally, trust your test results. If one technique gives better results than another on your exact data, don't hesitate to implement it.

Resources