I work at a company that is thinking about migrate his BI structure ( SSAS Olap and Power BI ) to the ElasticSearch/Kibana. Our BI work basically with aggregations as sum, max, min, count, first, last, average at aggregation in multiples hierarchy levels( organization level, customer level, unit level, etc...), and expressions between multiples measures( getting by aggregation and find first/last based on datetime ) like sum, subtraction, multiplication, division, percentage, we also do some date manipulations. We work with more than 1 billion rows at total. Before SSAS Olap we worked with SQL Server OLTP, and our queries were taking many minutes. Because that, we change to OLAP and we now have our aggregate measures in a few seconds. But the Microsoft License is bursting our new budget and now the company requires a free software. At the moment, our Business Intelligence solution doesn't have data mining, neither machine learning, only aggregations measures at multiples hierarchy levels, and operations between these aggregations.
And our solution is the entire on-premises.
I have 2 questions:
1-With the free/basic Elasticsearch license, we will be able to migrate our business intelligence solution to the ElasticEngine and do all these hierarchies, aggregation and search in multiple levels operations?
2-With the free/basic Elasticsearch engine, we will have these measures aggregations between these millions registers and operations between measures at a similar time as our OLAP cube?
Best Regards,
Luis
In general, it is possible to use ElasticSearch in the way you described. However, this has real sense only if you often use 'like'-based filtering by text columns, this use-case is perfect for ElasticSearch. All facts needed for the reports should be included into one ES document; avoid usage of sub-collections, aggregate queries on fields from sub-collections (nested queries) may kill the performance.
If all you need is damn-fast aggregations and 'exact match' filtering it might be much better to use columnar SQL-compatible databases:
MemSQL - free version should be enough for 1 billion of rows, this is good choice if you need to support both OLAP/OLTP DB usage (say, often update rows)
Yandex ClickHouse - free/open source, suitable if your data is append-only (no frequent updates/deletes). Its aggregate queries performance is really ultimate, and it can handle queries to 1 billion of rows fast enough even in the single-node configuration
You mentioned that your existing BI infrastructure uses SSAS OLAP, and I may assume that many reports are in fact pivot tables. Most free BI tools (including Kibana) don't support pivot tables at all or only very-very primitive basic pivot tables. Fortunately, good alternative to PowerBI/Excel PivotTable exists - but it is not completely free. Here I mean SeekTable: it can connect to all DBs I mentioned (MemSQL, ClickHouse) and even to ElasticSearch [Disclaimer: I'm affiliated with this BI tool].
Related
Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.
The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.
However, we also have a need to perform large scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s: Specimen)-[d: DONOR]->(d: DONOR)
A Specimen has an edge relating it to a Donor.
The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:
MATCH (s: Specimen)-[e: DONOR]->(d: Donor)
WITH d.sex AS sex, COUNT(s.id) AS count
RETURN count, sex
The performance time is very slow - the result does not return for ~9 seconds. We need sub-second return times for this solution to work.
We are running Neo4j on an EC2 instance with 32vCPU units and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.
We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.
Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?
You will more than likely need to refactor your graph model. For example, you may want to investigate if using multiple labels (e.g. something like Specimen:Male/Specimen:Female) if it is appropriate to do so, as this will act as a pre-filter before scanning the db.
You may find the following blog posts helpful:
Modelling categorical variables
Modelling relationships
Modelling flights, which talks about dealing with dense nodes
Currently, I've been involved in an warehouse based intelligent transaction analysis banking system featuring customer churn behavior, fraud detection & CRM analysis. We've been using Oracle as the database & it's completely a data warehousing project with data mining algorithms used for analysis.
We have records of about 1000 customers of a bank. For modeling, whether it is better to use the star schema or snowflake schema or constellation schema? I know the basic difference of star and snowflake schema- normalization of dimension table occurs in snowflake (a.k.a. snowflaking) schema which may be problematic for joining in case of large-sized database.
So, which schema would be better for my case? Answers from experienced programmers involved in data warehousing are highly welcomed!
Thanks in advance!
In brief, my assumption going into a project like this would be that a star schema would be appropriate. I might modify that if it appeared that a dimension was getting too large to efficiently full scan and the efficiency of queries against it could be meaningfully improved by snowflaking unless that dimension joined to the fact table on a partitioning key (due to difficulties in applying partition pruning on a predicate placed on a snowflaked dimension).
I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of a read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is made using a binary search, so the cost of a single-term lookup is O(log(t)) where t is the number of terms. On the contrary, using the index of a standard RDBMS costs O(log(d)) where d is the number of documents. When many documents share the same value for some field, this can be a big win.
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
But Solr can only do this because it doesn't have all the constraints that a RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
lineno serial PRIMARY KEY,
linetext text NOT NULL,
tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The queries for only one argument only increased in performance by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.
Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required from an RDMS. So it (or rather lucene) concentrates on purely indexing data.
As you've no doubt discovered, Solr enables the ability to both search and retrieve data from it's index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the guardian solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
This biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.
Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.
Please read this and this.
Solr (Lucene) creates an inverted index which is where retrieving data gets quite faster. I read that PostgreSQL also has similar facility but not sure if you had used that.
The performance differences that you observed can also be accounted to "what is being searched for ?", "what are the user queries ?"
I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping or a normalized technique or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic good relational db performance can be hard to achieve. That is why many bigger sites use key value stores(e.g. memcached) and other caching mechanisms.
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of stack overflow on thier podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: Define for yourself what hight-traffic means:
50.000 Page-Viewss per day?
500.000 Page-Views per day?
5.000.000 Page-Views per day?
more?
Then calculate this down to probable peak page-views per minute and per seconds.
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
Relational-database can be, if fully optimized, amazingly fast, when joining tables!
A relational-database could be hit seldom when to as a back-end, to populate a cache or fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone for the denormalized table route as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find read http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.
Neither matters if you aren't caching properly.