Suppose I have a huge database (table a) about employees in a certain department, which includes the employee name in addition to many other fields. Now in a different database (or a different table, say table b) I have only two fields: the employee name and their ID. But this table (b) contains entries not only for one department but rather for the whole company. The raw format for both tables is text files, so I parse them with Logstash into Elasticsearch and then visualize the results with Kibana.
Now, after I created several visualizations from table (a) in Kibana where the x-axis shows the employee name, I realized it would be nice to have the employee IDs instead. Since I know I have this information in table (b), I am searching for some way to tell Kibana to translate the employee names in the graphs generated from table (a) into employee IDs based on table (b). My questions are as follows:
1) Is there a way to do this directly in Kibana? If yes, can we do it if each table is saved in a separate index, or do we have to save them both in the same index?
2) If this cannot be done directly in Kibana and has to be done when indexing the data, is there still a way to parse both text files separately with Logstash?
I know Elasticsearch is a non-relational database and is therefore not designed for SQL-like functionality (joins). However, there should be an equivalent or a workaround. This is just a simple use case, but of course the generic question is how to correlate data from different sources. Otherwise Elasticsearch would honestly not be that powerful.
Similar questions have been asked and answered.
Basically the answer is: no, you can't do joins in Kibana; you have to do them at indexing time. Space is cheap and Elasticsearch handles duplicate data nicely, so just create any fields you need to display at indexing time.
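For example, here is a minimal sketch of that kind of indexing-time enrichment, using the Python Elasticsearch client instead of Logstash (file layouts, field names and the index name are assumptions):

# Sketch: add the employee ID from table (b) to every table (a) record
# at indexing time, so Kibana can aggregate on employee_id directly.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Build a name -> ID lookup from table (b), which covers the whole company.
with open("table_b.csv") as f:
    name_to_id = {row["employee_name"]: row["employee_id"] for row in csv.DictReader(f)}

def docs():
    # Stream table (a) rows, adding the employee_id field before indexing.
    with open("table_a.csv") as f:
        for row in csv.DictReader(f):
            row["employee_id"] = name_to_id.get(row["employee_name"])
            yield {"_index": "employees_a", "_source": row}

helpers.bulk(es, docs())

If you want to stay entirely inside Logstash, its translate filter plugin can do the same name-to-ID lookup against a dictionary file.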
You might want to give Kibi a try.
Unfortunately, the only answer I know of is to either write your own plugin or, as we have had to do, downgrade to ES 2.4.1 and install Kibi
(https://siren.solutions/new-release-siren-join-2-4-1-compatible-with-es-2-4-1/)
and then install the Kibi join plugin
(http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/)
This will allow you to get the relational-DB-style joins you are looking for.
I'm not an absolute expert on Cassandra, but what I know (correct me if I'm wrong) is that creating a secondary index for every field in a data model is an anti-pattern.
I'm using Elassandra and my data model looks like this:
A users object that represents a user, with: userID, name, phone, e-mail, and all kinds of info about the user (say these users are selling things)
A sales object that represents a sale made by the user, with: saleID, userID, product name, price, etc. (there can be a lot more fields)
Given that I want to make complex searches on users (search by phone, search by e-mail, etc.) only on the name, e-mail and phone fields, is it a good idea to create the following 3 tables from this data model:
"User core" table with only userID, name, phone and e-mail (fields for search) [Table fully indexed and mapped in Elasticsearch]
"User info" table with userID + the other infos [Table not indexed or mapped in Elasticsearch]
"Sales" table with userID, saleID, product name, price, etc. [Table not indexed or mapped in Elasticsearch]
I see at least one advantage: any indexing (or reindexing when changes happen) and the associated costs will happen only if there is a change in the "User core" table, which should not change too frequently.
Also, if I need to get the other information (the other user details or the sales), I can just make 2 queries: one against "User core" to get the userID and one against the other table (with the userID) to get the other data.
But I'm not sure this is a good pattern, or maybe I should not worry about secondary indexing and just index every other table as well?
To summarize: what are the key reasons to choose a secondary index like Elasticsearch in Elassandra versus denormalizing tables and using partition and clustering keys?
Please feel free to ask if you need more examples on my use case.
You should not normalise the tables when you're using Cassandra. The most important aspect of data modelling for Cassandra is to design one table for each application query. To put it another way, you should always denormalise your tables.
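As a rough illustration of one-table-per-query (a sketch using the DataStax Python driver; keyspace, table and column names are made up, and the keyspace is assumed to exist):

# Sketch: each query path gets its own denormalised table, so every lookup
# is a single-partition read.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

# The query "find a user by phone number" gets a table keyed by phone.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_phone (
        phone text PRIMARY KEY,
        userid uuid,
        name text,
        email text
    )
""")

# The query "find a user by e-mail" gets another copy of the same data.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_email (
        email text PRIMARY KEY,
        userid uuid,
        name text,
        phone text
    )
""")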
After you've modelled a table for each query, use Elassandra to index the table that contains the most columns you need to query.
It's important to note that Elassandra is not a magic bullet. In a lot of cases, you do not need to index the tables if you have modelled them against your application queries correctly.
The use case for Elassandra is to take advantage of features such as free-form text search, faceting, boosting, etc., but it will not be as performant as a native table. The fact is that index lookups require more "steps" than a straightforward single-partition Cassandra read. Of course, YMMV depending on your use case and access patterns. Cheers!
I don't think Erick's answer is fully correct in the case of Elassandra.
It is correct that native Cassandra queries will outperform Elasticsearch, and that in pure Cassandra you should model your tables around your queries.
But if you prefer flexibility over performance (which is the main reason to choose Elassandra), you can use Cassandra as the primary storage, benefit from Cassandra's replication and performance, and index the tables for search in Elasticsearch.
This enables you to be flexible on the search side and still be sure not to lose data in case something goes wrong on the Elasticsearch side.
In fact, in production we use a combination of both: tables have their partition/clustering keys and are indexed in Elasticsearch (when necessary). In the backend you can decide whether you can query by Cassandra keys or whether Elasticsearch is required.
My problem is that I need to perform sorting on data coming from two different datasources: a MySQL database which contains information about some products, and a PostgreSQL database that contains some metrics linked to these products.
Because the data resides in two different datasources, I cannot come up with a single performant query out of the box that would do the ordering (and pagination) at the database level.
I need to make two different queries, then manually merge the data and perform the sorting and pagination in code.
I would like to avoid creating a custom pagination system and manual data merging as much as possible, and instead delegate this job to the underlying database.
This is where I thought a system such as Elasticsearch (or Solr, but ES seems to be easier to use) could help.
1) Does ES provide tools or mechanisms to merge 2 datasources into 1 document? Or does this job need to be done by a 3rd-party tool that periodically pulls the data from both datasources and creates/updates the documents?
2) Am I correct to assume that having 2 indices (or 2 different doc types) is pointless in my case, since ES cannot perform join queries?
3) Apart from creating one single document, what other solutions are there that ES can help with? Is it somehow possible, with datasource1 data in index1 and datasource2 data in index2, to perform search queries using both indices at the same time (since a join is a no-go)?
Does ES provide tools or mechanisms to merge 2 datasources into 1 document? Or does this job need to be done by a 3rd-party tool that periodically pulls the data from both datasources and creates/updates the documents?
There are two approaches to accomplish this:
An ETL process (Extract, Transform, Load) to load data from both sources into one single document. In the Elastic world, you can use Logstash to accomplish this.
Data virtualization is supposed to do this without the need to copy the data.
3) Apart from creating one single document, what other solutions are there that ES can help with? Is it somehow possible, with datasource1 data in index1 and datasource2 data in index2, to perform search queries using both indices at the same time (since a join is a no-go)?
It's very easy to perform a single query across multiple indices.
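A minimal sketch with the Python client (index and field names are assumptions):

# Sketch: one search request spanning two indices, with sorting and
# pagination done by Elasticsearch. "unmapped_type" keeps the sort from
# failing on an index that doesn't contain the sort field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="index1,index2",  # comma-separated list, or a wildcard like "index*"
    body={
        "query": {"match_all": {}},
        "sort": [{"metric_value": {"order": "desc", "unmapped_type": "float"}}],
        "from": 0,
        "size": 20,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_source"])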
We are building a DB infrastructure on top of Hadoop systems. We will be paying a vendor to do that, and I do not think we are getting the right answers from the first vendor. So I need help from some experts to validate whether I am right or am missing something.
1. We have about 1600 fields in the data. A unique record is identified by those 1600 fields.
2. We want to be able to search records in a particular timeframe (i.e., records for a given time frame).
3. There are some fields that change over time (monthly).
The vendor stated that the best way to go is HBase and that there are two choices: (1) optimize the search for machine learning, or (2) make ad-hoc queries.
Option (1) will require a concatenated key with all the fields of interest. The key length will determine how fast or slow the search will run.
I do not think this is correct.
1. We do not need to use HBase. We can use Hive.
2. We do not need to concatenate fields into the key. We can translate them to a number and use that number as the key.
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what your use case is. In simple terms, Hive alone is not good when it comes to interactive queries; however, it is one of the best when it comes to analytics.
HBase, on the other hand, is really good for interactive queries; however, doing analytics with it would not be as easy as with Hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 fields.
HBase:
HBase is a NoSQL, columnar database which stores information in a map (dictionary) like format, where each row needs to have one column which uniquely identifies the row. This is called the key.
You can also have a key that is a combination of multiple columns if you don't have a single column that uniquely identifies the row, and you can then search records using a partial key. However, this is going to affect performance (compared to having a single-column key).
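A rough sketch of a composite key and a partial-key scan from Python, using the happybase client (table name, key layout and column names are made up):

# Sketch: rows are keyed by "customer#date". Scanning on a row-key prefix
# retrieves every row that shares the leading part of the composite key.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("records")

# Write a row whose key concatenates the fields of interest.
table.put(b"customer42#2016-03-01", {b"cf:status": b"active"})

# Partial-key search: all rows for customer42, regardless of date.
for key, data in table.scan(row_prefix=b"customer42#"):
    print(key, data)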
Hive:
Hive has a SQL-like language (HQL) to query data on HDFS, which you can use for analytics. However, it doesn't require any primary key, so you can insert duplicate records if required.
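For instance, a small sketch with the PyHive client (host, table and column names are assumptions):

# Sketch: an analytics-style HQL query over a time frame. No key design is
# needed, and Hive will happily store duplicate records.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()
cursor.execute(
    "SELECT * FROM records WHERE record_date BETWEEN '2016-03-01' AND '2016-03-31'"
)
for row in cursor.fetchall():
    print(row)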
The vendor stated that the best way to go is HBase and that there are two choices: (1) optimize the search for machine learning, or (2) make ad-hoc queries. Option (1) will require a concatenated key with all the fields of interest. The key length will determine how fast or slow the search will run.
In a way your vendor is correct, as I explained earlier.
1. We do not need to use HBase. We can use Hive. 2. We do not need to concatenate fields into the key. We can translate them to a number and use that number as the key. 3. I do not think we need to choose one or the other.
Whether you can use HBase or Hive depends on your use case. However, if you are planning to use Hive, then you don't even need to generate a pseudo-key (the row numbers you are talking about).
There is one more option if you have a Hortonworks deployment: consider Hive for analytics and Hive LLAP for interactive queries.
I wanted some advice on how to deal with table operations (renaming a column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date, e.g. if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e. if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
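For reference, the per-partition job in such a wrapper presumably looks roughly like this (a sketch with the google-cloud-bigquery Python client; project, dataset and table names are placeholders):

# Sketch: rewrite one date partition with aliased columns, truncating the
# existing table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.fact_20160301",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.query(
    "SELECT address AS location, name AS firstname, city AS town "
    "FROM `my-project.my_dataset.fact_20160301`",
    job_config=job_config,
)
job.result()  # blocks until this partition's rewrite finishes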
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 spots for ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all jobs in a batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options.
Using View
Creating views is very simple to script out and execute; it is fast and free, compared with the cost of scanning the whole table in the select-into approach.
You can create a view using the Tables: insert API with the type property set appropriately.
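For instance, with the Python client (which wraps that API), a sketch could look like this (dataset and table names are placeholders):

# Sketch: a "renamed" view on top of an existing partition. Nothing is
# copied, so there is no scan cost.
from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.my_dataset.fact_20160301_renamed")
view.view_query = (
    "SELECT address AS location, name AS firstname, city AS town "
    "FROM `my-project.my_dataset.fact_20160301`"
)
client.create_table(view)  # created with type VIEW because view_query is set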
Using Jobs: insert EXTRACT and then LOAD
Here you can extract the table to GCS and then load it back into BigQuery with an adjusted schema.
The above approach will a) eliminate the cost of querying (scanning) the tables and b) can help with the job limits. But it might not, depending on the actual volume of the tables and other requirements you might have.
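A rough sketch of the extract-then-load route with the Python client (bucket, dataset and table names are placeholders):

# Sketch: export a partition to GCS, then load it back with the renamed
# schema, overwriting the old table.
from google.cloud import bigquery

client = bigquery.Client()
gcs_uri = "gs://my-bucket/fact_20160301-*.csv"

# 1) Extract the existing table to GCS (an extract job, not a query/scan).
client.extract_table("my-project.my_dataset.fact_20160301", gcs_uri).result()

# 2) Load it back, supplying the new column names and skipping the header row.
load_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("location", "STRING"),
        bigquery.SchemaField("firstname", "STRING"),
        bigquery.SchemaField("town", "STRING"),
    ],
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    gcs_uri, "my-project.my_dataset.fact_20160301", job_config=load_config
).result()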
The best way to manipulate a schema is through the Google BigQuery API.
Use the tables.get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Again using the API, perform an update on the schema, setting it to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update
I am developing an enterprise application with an Oracle backend. I am designing a core part of the DB architecture now and I'm having some questions about it.
The first and most important thing is that most of my tables need to preserve old data. For example:
Consider a table with the fields
Contract No, Contract Name, Contract Person, Contract Email
I have a record like
12, xxx, yyy, xxx#zzz.ccc
and someone modifies it to
12, xxx, zzz, xxx#zzz.ccc
At any point of time I need to display the new record while still having a copy of the old record.
So what I thought was to insert a duplicate of the old record, update the fields that were changed, and have a flag to keep track of active records, with something like "is active" set to 1.
The downside is that this creates redundancy in the table and seems like a bad design. But any other model seems unnecessarily complex, and this seems cleaner to me. Also, I don't see any performance issues with having a duplicate record. So please let me know if this is OK or whether I am missing something here.
Sometimes, where there is a one-to-many relationship, my assumption is to have a mapping table where I map the multiple entities in individual records by repeating the master ID and changing the child ID in each record. Is this the right way to do it, or is there a better way?
Is there a book on database best practices?
Thanks.
The database I'm dealing with is Oracle 11g on a two-node RAC cluster.
Also, I don't see any performance issues with having a duplicate record.
Assume you have a row that, over time, has 15 updates to it. If you don't store any temporal data (if you don't store different versions of the row), you end up storing one row. If you do store temporal data, you end up storing 15 rows.
You also need more indexes, because the id number is no longer sufficient to identify a single row.
If you have only relatively small tables, you probably won't see any performance difference. (There will be one, but it probably won't be noticeable to users.) But a table that has 10 million rows will perform differently than a table that has 150 million rows. (15 versions per row, times 10 million rows.)
Sometimes, where there is a one-to-many relationship, my assumption is to have a mapping table where I map the multiple entities in individual records by repeating the master ID and changing the child ID in each record. Is this the right way to do it, or is there a better way?
You probably need to know which child rows belong to which parent rows. So you need more than a single master id for the key. The master id alone doesn't tell you which version of that row in the parent table applies to a given child row.
Is there a book on database best practices?
There are books on temporal databases. The first one that I know of is Snodgrass's Developing Time-Oriented Database Applications in SQL. It's available in several formats, and it's free. It's also kind of old, but the information in it is important to understand if you're going to be building a temporal database. Also, think about reading Date's book Temporal Data and the Relational Model.
Wikipedia has an article that summarizes the ideas behind temporal databases.
Is normalization completely mandatory?
That's a meaningless question. You will have different issues with tables normalized to 2NF than you'll have with tables normalized to 5NF or 6NF.
I would keep the old/history records in a separate table. Create an update/delete trigger to populate your audit/history table for you, and keep only the most current data in your main table.
See here for an example; many other similar examples exist on SO.