I'm designing a new application which will use Cassandra (I'm new to Cassandra). This database will contain only 2-4 column families. The problem is that I have to provide the ability to filter on almost every column attribute. Could you give me some suggestions that I should keep in mind while planning? What about data redundancy?
Cassandra isn't optimized for this use case. The preferred way to query data is by the primary key.
Filtering by arbitrary columns is possible by:
- using the ALLOW FILTERING query modifier
- creating a secondary index for each column (these cannot be combined in a single query)
- creating lookup tables with different primary key variants based on the columns you want to filter on
All of those options have their limitations.
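Of the three, lookup tables (i.e. denormalization) are usually the recommended pattern. As a minimal sketch using the DataStax Java driver, where the keyspace "ks" and all table and column names are hypothetical:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;
    import java.util.UUID;

    public class LookupTableSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // Same data, two tables, two different primary keys.
                session.execute("CREATE TABLE IF NOT EXISTS ks.users_by_id "
                    + "(user_id uuid PRIMARY KEY, email text, name text)");
                session.execute("CREATE TABLE IF NOT EXISTS ks.users_by_email "
                    + "(email text PRIMARY KEY, user_id uuid, name text)");

                // Every write goes to both tables: that is the data redundancy.
                UUID id = UUID.randomUUID();
                session.execute(SimpleStatement.newInstance(
                    "INSERT INTO ks.users_by_id (user_id, email, name) VALUES (?, ?, ?)",
                    id, "a@b.c", "Alice"));
                session.execute(SimpleStatement.newInstance(
                    "INSERT INTO ks.users_by_email (email, user_id, name) VALUES (?, ?, ?)",
                    "a@b.c", id, "Alice"));

                // Each read hits the table whose primary key matches the filter.
                Row row = session.execute(SimpleStatement.newInstance(
                    "SELECT user_id, name FROM ks.users_by_email WHERE email = ?",
                    "a@b.c")).one();
                System.out.println(row.getString("name"));
            }
        }
    }

The redundancy you asked about is exactly this trade: each write lands in every lookup table, spending disk space to keep every read a fast, key-based lookup.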
I'm not an absolute expert on Cassandra, but what I know (correct me if I'm wrong) is that creating a secondary index for every field in a data model is an anti-pattern.
I'm using Elassandra and my data model looks like this:
A users object that represents a user, with: userID, name, phone, e-mail, and all kinds of info about users (say these users are selling things)
A sales object that represents a sale made by a user, with: saleID, userID, product name, price, etc. (there can be a lot more fields)
Given that I want to make complex searches on users (search by phone, search by e-mail, etc.), but only on name, e-mail and phone, is it a good idea to create the following 3 tables from this data model:
"User core" table with only userID, name, phone and e-mail (fields for search) [Table fully indexed and mapped in Elasticsearch]
"User info" table with userID + the other infos [Table not indexed or mapped in Elasticsearch]
"Sales" table with userID, saleID, product name, price, etc. [Table not indexed or mapped in Elasticsearch]
I see at least one advantage: any kind of indexing (or reindexing when changes happen) and the associated costs will only occur when there is a change in the "User core" table, which should not change too frequently.
Also, if I need to get all the other info (user info or sales), I can just make 2 queries: one against "User core" to get the userID, and one against the other table (with the userID) to get the other data.
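To illustrate, here is a rough sketch of that two-query read with the DataStax Java driver (the keyspace, table and column names are my hypothetical rendering of the model above, and the e-mail lookup assumes whatever index serves the "User core" table, whether Elasticsearch in Elassandra or a plain secondary index):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;
    import java.util.UUID;

    public class TwoStepRead {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // Query 1: hit the small, indexed "user core" table to resolve the userID.
                Row core = session.execute(SimpleStatement.newInstance(
                    "SELECT userid FROM ks.user_core WHERE email = ?", "a@b.c")).one();
                UUID userId = core.getUuid("userid");

                // Query 2: fetch the bulky, unindexed data by userID (the partition key).
                for (Row sale : session.execute(SimpleStatement.newInstance(
                        "SELECT * FROM ks.sales WHERE userid = ?", userId))) {
                    System.out.println(sale.getFormattedContents());
                }
            }
        }
    }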
But I'm not sure this is a good pattern, or maybe I should not worry about secondary indexing and just index every other table as well?
To summarize: what are the key reasons to choose a secondary index like Elasticsearch in Elassandra, versus denormalizing tables and using partition and clustering keys?
Please feel free to ask if you need more examples of my use case.
You should not normalise the tables when you're using Cassandra. The most important aspect of data modelling for Cassandra is to design one table for each application query. To put it another way, you should always denormalise your tables.
After you've modelled a table for each query, use Elassandra to index whichever table contains the most columns that you need to query.
It's important to note that Elassandra is not a magic bullet. In a lot of cases, you do not need to index the tables if you have modelled them against your application queries correctly.
The use case for Elassandra is to take advantage of features such as free-form text search, faceting, boosting, etc., but it will not be as performant as a native table. The fact is that index lookups require more "steps" than a straightforward single-partition Cassandra read. Of course, YMMV depending on your use case and access patterns. Cheers!
I don't think Erick's answer is fully correct in the case of Elassandra.
It is correct that native Cassandra queries will outperform Elasticsearch, and that in pure Cassandra you should model your tables around your queries.
But if you prefer flexibility over performance (and that is mainly why you would choose Elassandra), you can use Cassandra as the primary storage, benefit from Cassandra's replication performance, and index the tables for search in Elasticsearch.
This lets you stay flexible on the search side and still be sure not to lose data in case something goes wrong on the Elasticsearch side.
In fact, in production we use a combination of both: tables have their partition/clustering keys and are indexed in Elasticsearch (when necessary). In the backend you can decide whether you can query by Cassandra keys or whether Elasticsearch is required.
We are building a DB infrastructure on top of Hadoop. We will be paying a vendor to do that, and I do not think we are getting the right answers from the first vendor. So I need help from some experts to validate whether I am right or am missing something:
1. We have about 1600 fields in the data. A unique record is identified by those 1600 fields.
2. We want to be able to search records in a particular timeframe (i.e., records for a given time frame).
3. There are some fields that change over time (monthly).
The vendor stated that the best way to go is HBase and that there are two choices: (1) optimize the search for machine learning, or (2) make ad hoc queries.
Option (1) would require a concatenated key with all the fields of interest. The key length would determine how slow or fast the search runs.
I do not think this is correct.
1. We do not need to use HBase. We can use Hive.
2. We do not need to concatenate field names. We can translate those to numbers and use a numeric key.
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on your use case. In simple terms, Hive alone is not good for interactive queries, but it is one of the best when it comes to analytics.
HBase, on the other hand, is really good for interactive queries; however, doing analytics with it is not as easy as with Hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 fields
HBase:
HBase is a NoSQL, columnar database which stores information in a map (dictionary) like format, where each row needs one column that uniquely identifies it. This is called the row key.
You can also have a key that is a combination of multiple columns if no single column uniquely identifies the row, and you can then search records using a partial key. However, this will affect performance compared to a single-column key (see the sketch below).
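For example, a partial-key search on a composite row key is a prefix scan. A minimal sketch with the HBase Java client, assuming a hypothetical records table whose row key is <customerId>_<yyyyMMdd>:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PartialKeyScan {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("records"))) {
                // Composite row key: <customerId>_<yyyyMMdd>. Searching on the
                // leading component alone is a cheap, bounded prefix scan.
                Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("cust42_"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }

Note this only stays cheap when the search component leads the key; filtering on a trailing component degenerates into a full table scan, which is the performance cost mentioned above.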
Hive:
Hive has an SQL-like language (HQL) for querying data on HDFS, which you can use for analytics. It doesn't require any primary key, however, so you can insert duplicate records if required.
The vendor stated that the best way to go is HBase and that there are two choices: (1) optimize the search for machine learning, or (2) make ad hoc queries. Option (1) would require a concatenated key with all the fields of interest. The key length would determine how slow or fast the search runs.
In a way your vendor is correct, as I explained earlier.
1. We do not need to use HBase. We can use Hive. 2. We do not need to concatenate field names; we can translate those to numbers and use a numeric key. 3. I do not think we need to choose one or the other.
Whether you should use HBase or Hive depends on your use case. However, if you are planning to use Hive, then you don't even need to generate a pseudo key (the row numbers you are talking about).
There is one more option if you have a Hortonworks deployment: consider Hive for analytics and Hive LLAP for interactive queries.
I had earlier created a project that stored daily data for a particular entity in an RDBMS, creating a single table for each day and then storing that day's data in it.
But now I want to move my database from the RDBMS to HBase. So my question is whether I should create a single table and store all days' data in it, or use my earlier approach of creating an individual table for each day. I want to compare both cases based on HBase performance.
Sorry if this question seems foolish to you. Thank you.
As you mentioned, there are 2 options:
Option 1: a single table with all days' data
Option 2: multiple tables (one per day)
I would prefer namespaces (introduced in version 0.96, a very important feature) with option 2 if you have huge data for a single day. This will also support multi-tenancy requirements...
See the HBase book:
A namespace is a logical grouping of tables analogous to a database in relational database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:
Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.
Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers, thus guaranteeing a coarse level of isolation.
Below are the shell commands related to namespaces:
alter_namespace, create_namespace, describe_namespace,
drop_namespace, list_namespace, list_namespace_tables
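The same setup can be done from the HBase Java client. A minimal sketch, assuming a hypothetical "daily" namespace with per-day tables named events_<yyyyMMdd>:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.NamespaceDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    public class NamespacePerDayTables {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // One namespace grouping all the per-day tables (option 2).
                admin.createNamespace(NamespaceDescriptor.create("daily").build());
                // One table per day inside that namespace, e.g. daily:events_20240101.
                TableDescriptor td = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("daily", "events_20240101"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();
                admin.createTable(td);
            }
        }
    }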
Advantages:
- Even if you use column filters, since there is less data (one day's worth per table), retrieval will be fast for a full table scan compared to the single-table approach (a full scan on a big table is costly).
- If you want authentication and authorization on a specific table, that can also be achieved.
Limitation: you will end up with multiple scripts to manage the tables, rather than a single script (option 1).
Note: with either of the aforementioned options, your rowkey design is very important for good performance and to prevent hotspotting.
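A common way to prevent hotspotting is to salt the rowkey. A minimal sketch (the bucket count and key layout are assumptions, not a fixed recipe):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRowKey {
        private static final int BUCKETS = 16; // assumption: 16 pre-split regions

        // Prefix the natural key with a stable hash bucket so sequential writes
        // (e.g. timestamp-led keys) spread across region servers.
        static byte[] rowKey(String entityId, long timestamp) {
            int bucket = (entityId.hashCode() & Integer.MAX_VALUE) % BUCKETS;
            return Bytes.toBytes(String.format("%02d_%s_%d", bucket, entityId, timestamp));
        }
    }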
For more details, look at hbase-series.
Suppose I have a huge database (table a) about employees in a certain department which includes the employee name in addition to many other fields. Now in a different database (or a different table, say table b) I have only two entries: the employee name and his ID. But this table (b) contains entries not only for one department but for the whole company. The raw format of both tables is text files, so I parse them with Logstash into Elasticsearch and then visualize the results with Kibana.
Now, after I have created several visualizations from table (a) in Kibana where the x-axis shows the employee name, I realize it would be nicer to have the employee IDs instead. Since I know I have this information in table (b), I am searching for some way to tell Kibana to translate the employee names in the graphs generated from table (a) into employee IDs based on table (b). My questions are as follows:
1) Is there a way to do this directly in Kibana? If yes, can we do it if each table is saved in a separate index, or do we have to save them both in the same index?
2) If this cannot be done directly in Kibana and has to be done when indexing the data, is there a way to still parse both text files separately with Logstash?
I know Elasticsearch is a non-relational database and is therefore not designed for SQL-like functionality (joins). However, there should be an equivalent or a workaround. This is just a simple use case, but of course the generic question is how to correlate data from different sources. Otherwise Elasticsearch would honestly not be that powerful.
Similar questions have been asked and answered.
Basically the answer is: no, you can't do joins in Kibana; you have to do them at indexing time. Space is cheap and Elasticsearch handles duplicate data nicely, so just create any fields you need to display at indexing time.
You might want to give Kibi a try.
Unfortunately, the only answer that I know of is to either write your own plug-in, or, as we have had to do, downgrade to ES 2.4.1 and install Kibi (https://siren.solutions/new-release-siren-join-2-4-1-compatible-with-es-2-4-1/) and then install the Kibi join plugin (http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/).
This will allow you to get the joins you seek from a relational DB.
In my case, we defined the row key for the initial set of queries; we query against the row key and leave the column family and columns alone.
e.g. the row key is something like:
%userid%_%timestamp%
we are doing some queries like
select columnFamily{A,B,C} from userid=blabla and blabla < timestamp < blabla
The performance is pretty OK, because that's what HBase is built for: row key lookups.
But as new requirements build up, we will need to query against more fields, i.e. the columns, like:
select * from userid=blabla and blabla < timestamp < blabla and A=blabla and B=blabla and c=blabla
We started using HBase filters. We tried an EqualFilter on one of the columns (A), and it works OK from a functionality point of view.
I have a general concern here. Given the row key we have:
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
2) How dramatic is the impact, if there is one?
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
Yes you can do this. It will affect performance depending on the size of the data set and what filters you are using.
2) How dramatic is the impact, if there is one?
The less data you return the better. You don't want to fetch data that you don't need. Filters help you return only the data that you need.
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
It is best to use filters such as prefix filters, filters that match a specific value exactly (for a qualifier, column, etc.), or ones that do a greater-than/less-than style comparison on the data. These types of filters do not need to look at all the data in each row or table to return the proper results. Avoid regex filters, because the regex must be evaluated against every piece of data the filter looks at, and that can be taxing over a large data set. A sketch of the cheap variety follows below.
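For illustration, a sketch with the HBase 2.x Java client that combines a row-key range (reusing the %userid%_%timestamp% design from the question) with one exact-match column filter; the table name, column family, and values are placeholders. What the question calls an EqualFilter corresponds here to a SingleColumnValueFilter with CompareOperator.EQUAL:

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CheapFiltersScan {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                // Row-key range first: scan only the %userid%_%timestamp% slice we need.
                Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user42_1600000000"))
                    .withStopRow(Bytes.toBytes("user42_1700000000"));
                // Exact-match column filter: column A must equal "blabla".
                SingleColumnValueFilter colA = new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes("A"),
                    CompareOperator.EQUAL, Bytes.toBytes("blabla"));
                colA.setFilterIfMissing(true); // drop rows that lack column A entirely
                scan.setFilter(colA);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }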
Also, Lars George, the author of the HBase book, has mentioned that people are moving more toward coprocessors than filters. You might also want to look at coprocessors.
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
- Yes, you can add filters for all columns, but it will surely affect the performance of your query if you have huge amounts of data stored. Try to avoid column filters, because every column filter you add increases the number of column-based comparisons.
2) How dramatic is the impact, if there is one?
- Filters help you reduce your result set, so you fetch only the data you actually need.
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
- In HBase, row-key filters (including prefix filters) are the most efficient, because they don't need to inspect every record. So design your rowkey to include the components you need to query on most frequently.
- Value filters are the most inefficient, because they have to compare the actual values of the columns.
- With HBase filters, the sequence matters: if you add multiple filters to a FilterList, the order in which they are added has an impact on performance.
I will explain with an example: if three different filters are added to a query, then once the first filter has been applied, the next filter has a smaller data set to work on, and the same again for the third one.
So try to add the most efficient filters first (i.e. rowkey-related filters) and the others after that, as in the sketch below.
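A minimal sketch of that ordering with the HBase 2.x Java client (the family, qualifiers and values are placeholders carried over from the question):

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OrderedFilterList {
        static Scan buildScan() {
            // Cheapest first: the prefix filter rejects most rows on the key alone,
            // so the more expensive value comparisons run on far fewer cells.
            Filter byKey = new PrefixFilter(Bytes.toBytes("user42_"));
            Filter byValueA = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("A"),
                CompareOperator.EQUAL, Bytes.toBytes("blabla"));
            Filter byValueB = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("B"),
                CompareOperator.EQUAL, Bytes.toBytes("blabla"));
            FilterList all = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                byKey, byValueA, byValueB); // evaluated in insertion order
            return new Scan().setFilter(all);
        }
    }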