What is Hive? Is it a database? [closed] - hadoop

I just started exploring Hive. It has all the structures similar to an RDBMS: tables, joins, partitions. What I understand is that Hive still uses HDFS for storage and is an SQL abstraction over HDFS. From this I am not sure whether Hive itself is a database solution like HBase or Cassandra, or simply a query system on top of HDFS. I don't think it is simply a query language, because it has tables, joins and partitions.

Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect called Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s dialect, but with significant differences. Hive offers no support for row-level inserts, updates, and deletes, and it doesn’t support transactions, so it can't be compared directly with an RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited to batch-processing workloads such as log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, and hypothesis testing.
Hive is not designed for online transaction processing and does not offer real-time queries.
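To make the "SQL abstraction over HDFS" point concrete, here is a minimal HiveQL sketch (the table and column names are invented for illustration): a partitioned table whose data ultimately lives as files in HDFS, queried with familiar SQL syntax but executed as batch jobs.

CREATE TABLE web_logs (
    ip     STRING,
    url    STRING,
    status INT
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;

-- Batch-oriented query: Hive compiles this into jobs over the files in HDFS,
-- so expect latency of seconds to minutes rather than milliseconds.
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2015-06-01'
GROUP BY url;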

Related

Copy data from a SQL DB into Hadoop [closed]

I am studying a use case where we are going to move data from a SQL database (600 TB, ~100 tables) into Hadoop in a transformed format. We don't have logs enabled in the SQL DB. We decided to copy the data as a datamart view and to refresh this view every week. The copied data will be erased every week and rewritten.
This SQL DB is used for reporting purposes derived from the data lake. This OLTP database is an old system we are replacing progressively. The copied dataset is deleted and copied again (refreshed) every week.
80% of the data copy is straight through, with no transformation.
20% requires redesign.
We identified 3 options:
AirFlow + Beam for the processing
ETL (Informatica) was excluded
Kafka (Connect, Streams, sink into Hadoop), optionally with CDC via Debezium
What do you think is the best approach regarding performance, overall time to deliver, and data architecture?
Thanks for the help!
My thoughts - for what they are worth:
I would definitely not be looking to copy 600 TB per week. Given that the majority of this data will not have changed from week to week (I assume), you should look to copy across only the data that has changed. As your data in Hadoop will be partitioned, you would mainly be inserting new data into new partitions; for the records that have changed you would just be dropping/reloading a few partitions (a sketch follows these points)
I would copy all the necessary data into a staging area in Hadoop as-is (without transformation) and then process it on the Hadoop platform to produce the data you actually need - you can then drop the staging area data if you want
Data processing tool - if you already have experience of a specific toolset within your company then use that; don't multiply the toolsets in use unless there is critical functionality required that is not available within existing tools. If this one process is all you are going to be using this toolset for then it probably doesn't matter which one you use - pick one that is quickest to learn/deploy. If this toolset is going to be expanded to other use cases then I would definitely use a dedicated ETL/ELT tool rather than use a coding solution (why have you discarded Informatica as a solution?)
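As a rough sketch of the drop/reload-a-partition idea referenced above (the schema, table and partition names are all hypothetical), a weekly refresh in HiveQL could overwrite only the affected partitions instead of rewriting the full 600 TB:

-- Assumes dw.orders is partitioned by load_week; only the named partition is rewritten,
-- untouched partitions stay exactly as they are.
INSERT OVERWRITE TABLE dw.orders PARTITION (load_week = '2020-W14')
SELECT order_id, customer_id, amount, order_ts
FROM staging.orders_delta
WHERE load_week = '2020-W14';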
The following is definitely an opinion...
If you are building a new analytical platform, I am surprised that you are using Hadoop. Hadoop is legacy technology that has been superseded by more modern and capable Cloud data platforms (Snowflake, etc.).
Also, Hadoop is a horrible platform to try and run analytics on (it's ok as just a data lake to hold data while you decide what you want to do with it). Trying to run queries on it that don't align with how that data is partitioned gives really bad performance (for non-trivial dataset sizes). For example, if your transactions are partitioned by date then running a query to sum transaction values in the last week will run quickly. However, running a query to sum transactions for a specific account (or group of accounts) will perform very badly

Export all the tables at once with data from Oracle SQL Developer [closed]

I want to export all 100 tables with data from one schema at once from Oracle SQL Developer, like we export one table and it gets saved where we want as an Excel file. Is there any way to do this, instead of exporting one table at a time with data?
There's Data Pump Export (and Import), which does that. However, the result is a DMP file, which is certainly not recognizable by Excel; think of it as a binary file readable only by Data Pump Import.
So, if you want Excel files (actually, a CSV format), you'll have to either export them one by one (what a tedious job!) or write your own PL/SQL procedure which would use the UTL_FILE package. Note that (generally speaking) the result resides in a directory located on the database server, not your local PC, so you'll have to talk to your DBA about it. It shouldn't be a problem (in my opinion); you should be granted read/write access to a directory designated for such purposes.
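If a full UTL_FILE procedure feels heavy, a lighter scripted variant (my own suggestion, not part of the answer above) is to let PL/SQL generate a SQL*Plus script that spools each table to its own CSV file; SQL*Plus 12.2+ and SQLcl support SET MARKUP CSV ON, and the spooled files land on the client machine rather than the database server. All names below are illustrative.

SET SERVEROUTPUT ON
BEGIN
  -- Emit one SPOOL/SELECT pair per table in the current schema.
  DBMS_OUTPUT.PUT_LINE('SET MARKUP CSV ON');
  DBMS_OUTPUT.PUT_LINE('SET PAGESIZE 0 FEEDBACK OFF');
  FOR t IN (SELECT table_name FROM user_tables ORDER BY table_name) LOOP
    DBMS_OUTPUT.PUT_LINE('SPOOL ' || LOWER(t.table_name) || '.csv');
    DBMS_OUTPUT.PUT_LINE('SELECT * FROM "' || t.table_name || '";');
    DBMS_OUTPUT.PUT_LINE('SPOOL OFF');
  END LOOP;
END;
/

Spool the output of that block to a file, then run the generated file in SQL*Plus or SQLcl to produce one CSV per table.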
Tools > Database Export.
Select your tables, select your output method (Excel), and hit Go.
Bigger question, what are you gonna do with these 100 Excel files?
Also, how big are these tables? Exporting to CSV might be better, but again we don't know why you want Excel files...
Finally, if you want to take this data and use it to put in another Oracle Database at some point, you should be using Data Pump.
You can try writing a scheduler for this task using PL/SQL.
Use the Oracle documentation for help.
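For what it's worth, a minimal sketch of that scheduling idea, assuming a hypothetical export procedure APP_OWNER.EXPORT_ALL_TABLES already exists:

BEGIN
  -- Run the (hypothetical) export procedure every Sunday at 02:00.
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'WEEKLY_TABLE_EXPORT',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'APP_OWNER.EXPORT_ALL_TABLES', -- hypothetical procedure
    repeat_interval => 'FREQ=WEEKLY; BYDAY=SUN; BYHOUR=2',
    enabled         => TRUE);
END;
/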

Which would be a quicker (and better) tool for querying data stored in the Parquet format - Spark SQL, Athena or ElasticSearch? [closed]

I am currently building an ETL pipeline which outputs tables of data (on the order of 100+ GB) to a downstream interactive dashboard, which allows filtering the data dynamically (based on pre-defined & indexed filters).
I have zeroed in on using PySpark / Spark for the initial ETL phase.
Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.
Towards the interactive querying part, I was wondering which tool might work best with my structured & transactional data (stored in Parquet format):
Spark SQL (in memory dynamic querying)
AWS Athena (Serverless SQL querying, based on Presto)
Elastic Search (search engine)
Redis (Key Value DB)
Feel free to suggest alternative tools, if you know of a better option.
Based on the information you've provided, I am going to make several assumptions:
You are on AWS (hence Elastic Search and Athena being options). Therefore, I will steer you to AWS documentation.
As you have pre-defined and indexed filters, you have well ordered, structured data.
Going through the options listed:
Spark SQL - If you are already considering Spark and you are already on AWS, then you can leverage AWS Elastic MapReduce (EMR).
AWS Athena (serverless SQL querying, based on Presto) - Athena is a powerful tool. It lets you query data stored on S3, which is quite cost effective. However, building workflows in Athena can require a bit of work, as you'll spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often works best as the final stage in a big data pipeline. However, with support for CTAS statements, you can now output data in multiple formats such as Parquet with multiple compression algorithms (see the CTAS sketch after this list).
Elastic Search (search engine) - Is not really a query tool, so it is likely not part of the core of this pipeline.
Redis (Key Value DB) - Redis is an in memory key-value data store. It is generally used to provide small bits of information to be rapidly consumed by applications in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands on experience with Redis, I recommend Try Redis.
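As a rough sketch of the CTAS point above (the database, bucket and column names are invented for illustration), Athena can rewrite query results straight into partitioned, compressed Parquet:

CREATE TABLE analytics.orders_summary
WITH (
  format              = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location   = 's3://my-bucket/curated/orders_summary/',
  partitioned_by      = ARRAY['order_date']
) AS
SELECT customer_id,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount,
       order_date  -- partition columns must come last in the SELECT list
FROM raw.orders
GROUP BY customer_id, order_date;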
I would also look into Amazon Redshift.
For further reading, read Big Data Analytics Options on AWS.
As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.
Athena is not limited to .csv. In fact, using binary compressed formats like Parquet is a best practice for use with Athena, because it substantially reduces query times and cost. I have used AWS Firehose, Lambda functions and Glue crawlers for converting text data to a compressed binary format for querying via Athena. When I have had issues with processing large data volumes, the issue was forgetting to raise the default Athena limits set for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics, and he did encounter scaling problems with Athena, but that was in its early days.
I also work with ElasticSearch with Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on ElasticSearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format. A terabyte of JSON text data reduces down to approximately 30 GB or less in Parquet format. Our developers are more productive when they use ElasticSearch/Kibana to analyze problems in their log files, because ElasticSearch and Kibana are so easy to use. The Curator Lambda function that controls logging retention times and is part of AWS Centralized Logging is also very convenient.
You can use Amazon QuickSight; it has SPICE to do the querying and can do visualisation at the same time.

Performance implications of using (DBMS_RLS) Oracle Row Level Security (RLS)? [closed]

If we use Oracle Row Level Security(RLS) to hide some records - Are there any Performance Implications - will it slow down my SQL Queries? The Oracle Package for this is: DBMS_RLS.
I plan to add IS_HISTORICAL=T/F to some tables, and then use RLS to hide the records which have IS_HISTORICAL=T.
The SQL queries we use in the application are quite complex, with inner/outer joins, subqueries, correlated subqueries etc.
Of the 200-odd tables, about 50 of them will have this RLS policy (to hide records with IS_HISTORICAL=T) applied to them. The remaining 150 tables are child tables of these 50 tables, so RLS is implicit on them.
Any License implications?
Thanks.
"Are there any Performance Implications - will it slow down my SQL
Queries? "
As with all questions relating to performance the answer is, "it depends". RLS works by wrapping the controlled query in an outer query which applies the policy function as a WHERE clause...
select /*+ rls query */ * from (
select /*+ your query */ ... from t23
where whatever = 42 )
where rls_policy.function_t23 = 'true'
So the performance implications rest entirely on what goes in the function.
The normal way of doing these things is to use context namespaces. These are predefined areas of session memory accessed through the SYS_CONTEXT() function. As such, the cost of retrieving a stored value from a context is negligible. And as we would normally populate the namespaces once per session - say by an after-logon trigger or a similar connection hook - the overall cost per query is trivial. There are different ways of refreshing the namespace which might have performance implications, but again these are trivial in the overall scheme of things (see this other answer).
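A minimal sketch of that pattern, with all object names invented for illustration: the policy function only reads a session context (cheap per call) and is attached to the table with DBMS_RLS.ADD_POLICY.

-- Policy function: returns the predicate string appended to queries on the table.
CREATE OR REPLACE FUNCTION hide_historical (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2
) RETURN VARCHAR2 AS
BEGIN
  -- SYS_CONTEXT reads session memory, so this costs next to nothing per query.
  IF SYS_CONTEXT('app_ctx', 'show_historical') = 'Y' THEN
    RETURN NULL;  -- no filtering for sessions allowed to see history
  END IF;
  RETURN 'is_historical = ''F''';
END hide_historical;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'APP_OWNER',
    object_name     => 'ORDERS',
    policy_name     => 'ORDERS_HIDE_HIST',
    function_schema => 'APP_OWNER',
    policy_function => 'HIDE_HISTORICAL',
    statement_types => 'SELECT');
END;
/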
So the performance impact depends on what your function actually does. Which brings us to a consideration of your actual policy:
"this RLS Policy (to hide records by IS_HISTORICAL=T)"
The good news is the execution of such a function is unlikely to be costly in itself. The bad news is the performance may still be Teh Suck! anyway, if the ratio of live records to historical records is unfavourable. You will probably end up retrieving all the records and then filtering out the historical ones. The optimizer might push the RLS predicate into the main query but I think it's unlikely because of the way RLS works: it avoids revealing the criteria of the policy to the general gaze (which makes debugging RLS operations a real PITN).
Your users will pay the price of your poor design decision. It is much better to have journalling or history tables to store old records and keep only live data in the real tables. Retaining historical records alongside live ones is rarely a solution which scales.
"Any License implications?"
DBMS_RLS requires an Enterprise Edition license.

MDX support for Hive (Hadoop)

Is there any support for Multidimensional Expressions (MDX) for Hadoop's Hive?
Connecting an OLAP solution with Hadoop's data is possible. In icCube it's possible to create your own data sources (check the documentation); you'll need a Java interface (like JDBC).
This solution brings the data to the OLAP server. Bringing the processing to Hadoop is another question, and to my knowledge nobody does it. Aggregating the facts in parallel is possible. Another step is to have the dimensions in the nodes. This is a complicated problem (the algorithms are not easy to transform into parallel versions).
You can use Mondrian (Pentaho Analysis Services), it connects via JDBC and uses specific dialects for databases. I've seen reference to a Hive dialect, but have not tried it myself - best to search the forums.
There is a bit of a learning curve: you need to create a schema that defines the cubes in XML, but fortunately there is a GUI tool (schema workbench) that helps.
There is the Simba MDX provider, which claims to convert MDX queries to HiveQL. I have not tried it myself, so I can't comment on its features and limitations.
