which Hadoop component can handle all the oracle queries.?

which Hadoop component can handle all the oracle queries.? - oracle

Which hadoop component can handle all the oracle functions & which has low latency..
Am thinking to use the components like Presto, Drill and Shark..
Can anyone tell which of the above technology can handle all the functions in oracle with low latency..
or atleast which has more compatibility & which can handle all the functions of oracle..
I have the flexibility to use more than one technology but am confused to use which one for which like functions compatible for which technology & which technology can give low latency..?

Presto is designed to implement ANSI SQL and to execute queries with low latency (under 100ms for connectors that support it). Queries against Hive can execute in ~1s, depending on the speed of the Hive metastore (zero time if cached due to repeated access) and HDFS latency.
Regarding Oracle functionality, nothing in open source comes close. Oracle is a huge product with a ton of functionality. However, no one uses all of the functionality. Most people use a small subset. You will need to evaluate the different alternatives and decide which has the functionality subset that best meets your needs.
Disclosure: I am one of the creators of Presto.

Related

HANA CDS Views vs Calculation Views vs Table Functions

In SAP HANA I am used to create Calculation Views.
Previously I learned that Calculation Views (which after compilation are column-views) are to be prefered over Database-SQL-Views.
Now with CDS-Views I am not sure if this is still the case. Especially with regards to performance.
Also what is now the difference between a table function (which replaced scripted calculation views) and CDS Views?

Ok, this is a question that I believe requires some background to be answered.
A long, long time ago...
When SAP HANA was first developed, it heavily reused concepts and technology from other, already existing SAP products (TREX, P*TIME, MaxDB, Business Warehouse Accelerator).
One of the fundamental elements of the high query performance was (and is) the column store data-storage, which came in large parts from the TREX/BWA products. These products, in turn, had been solutions to very specific problems (full-text search for catalogs and speed-up of analytical queries from the SAP Business Warehouse data warehouse product).
Especially the BWA use case reflects in the column views of SAP HANA. Due to the limited use case of supporting SAP BW queries, no general SQL/relational query support was required (e.g. no arbitrary join-chain optimizations, no SQL features beyond SQL:92 etc.) whereas other, rather exotic features (like "vertical join") that could be used by SAP BW, were built into a query tool/engine ("engine" clearly was a very popular term with the SAP developers).
Once HANA proved successful as a platform to run SAP BW on, the next step was to add flexibility and make more general platforms like SAP Netweaver (the software that SAP's business solution products run on/with) working on SAP HANA. Now, SQL features were added and those required additional capabilities from the query optimizer and execution "engines".
Query optimization had to be flexible and fast and should lead to query performance that would still beat the existing RDBMS vendors' offering (which had been around for 40+ years).
This, clearly, is a hard problem and throwing is operational aspects of DB development (scaling, solution deployment, data federation, etc.).
This led to an overlapping development of different tools addressing different aspects of DB development.
SQL support and the underlying SQL optimizer were made more powerful, so much so, that (some) SQL queries could be as fast or faster than those modeled in calculation views. And since both of these "query frontends" eventually had to talk to the same internal data structures (row/column store) it was desirable to have just a single query optimizer, that would support all the different use cases.
Somewhere around HANA 1 SPS11/12 most calculation views started to be "unrolled" internally to feed into the common optimizer (that was what the "Execute in SQL Engine" flag was about).
I'd say, since then, the performance argument for using calculation views only holds in very specific circumstances.
I mentioned the overlapping developments and CDS (core data services) is one of them. The idea here is a very different one from SQL. While SQL gives you "the way to talk to the database", CDS wants to give your application a single data definition, that is used by the UI, the program logic and the data storage/query execution.
SQL != CDS
This probably needs some context (again): a major usage pattern of how SQL databases are used by application developers is that the application is written in some form of OO-implementation and the talking to the DB is left to a mapping layer/library (e.g. O/R-mappers). This means, that the knowledge of what the application is about (aka business process knowledge), is spread out in the application.
There is some information about it in the UI (labels, formatting, visibility, ...), some of it is in the application-object model (object dependencies, hierarchies, value domains...) and then some of it is in the queries against the database.
Such scattered knowledge/definition makes it hard to make changes consistent, which in turn, slows the development process and in turn prolongs the time until the application can run and deliver some positive outcome.
"Time-to-value" is the thing under optimization here as this is important for companies that give the promise of "success through innovation".
Ok, so this CDS thing is now part of the development models proposed by SAP and nearly en-passant also addresses topics like schema evolution and deployment of the data model. It is, in fact, independent of the actual database platform as shown in the CDS for ABAP variety.
How does this lead back to query performance? It does not really.
CDS' advantage is that one can provide more information about the data model than what is possible in HANA SQL.
Associations and joins with cardinality declaration (albeit now retrofitted to plain SQL) can enable the optimizer to use additional optimizations. Yet, the same optimizer and the same query execution "engines" are used here.
So, from a (query execution) performance point of view, it does not make a big difference, as long as no query semantics are required for which CDS does not have syntax (e.g. some window functions).
The main point of CDS really is about application development process performance and whether that works well with how you do development really depends on how much of it you can use.
Now for the question "scripted calc view" vs. "table function" vs. "CDS view".
Looking at these different object types from the point of "what can I do with them functionally?" will result in the observation "basically, the same".
The difference lies in how these can be optimized (scripted calc views cannot be generally unrolled into the global query to be optimized), and what one can do with the object once created.
Table functions allow for very easy reuse across multiple views and queries. They also provide the option to provide parameters into the function (similar to parameterized views) and in addition allow for imperative coding.
Functionally speaking, table functions are a kind of swiss-army knife; one can do nearly anything with them and they still can be part of global query optimization.
CDS views, as mentioned above, are nothing "special" in terms of query runtime or optimization. The main reason why CDS views are "a thing" is that with HANA SAP started to develop development models (such as XS, XSA, CAM) that revolve around "virtual data models".
The idea for those is that the structure of tables very often is stable and changes only little over time.
In a way, this is the "write-schema" of applications that enter the data into tables.
The "read-schema" is most of the time different from that. Queries re-combine the normalized data into records that the application can map into objects. This allows applications to look at the data differently than the original application.
With "virtual data models" these queries are baked into tangible development artifacts (the views) that can be reused across the application. In fact, these can be treated as if this was the database with its tables, presented in a way that makes sense for the application.
Once again, if that is something that is beneficial for your application development depends on how your application development looks like.
Can you use HANA without CDS? Absolutely, and there are many areas where CDS lacks (i.e. the limited syntax and feature mapping to HANA features) but it does have its merits.
Should you abandon calculation views?
I would not necessarily change existing developments if they still serve their purpose, but calculation views certainly are an odd development object. Training folks in using those and SQL most likely is overly expensive compared to just sticking to SQL.
Personally, I prefer the code-based SQL development (better tooling available, allows for easier comparison with other DBMS, doesn't require WEB IDE/HANA Studio).
The only thing, SQL based development does not provide is the extended annotations/semantic information used by the SAP analytic frontend tools (SAC & BO) - these really are specific to CDS and Information Models (calculation views) but barely used by other analytic tools.
And that's my take on it.

I would add that
Calculation Views are semantically richer. A SQL View does not know about measures, dimensions, hierarchies. https://blogs.sap.com/2019/08/26/what-is-the-difference-calcview-versus-sql-view/
The difference from the execution plan point of view is getting less and less. In Hana 2.0 SP4 most graphical calc views are turned internally into a single SQL statement to be executed by the SQL engine. So in that sense, using a CalcView gives you the additional information about the model plus the query performance of the SQL engine.
Lars' explanation of CDS is perfect. Nothing to add there.

But Imagine the situation when you can't create a table function because of limited license (aka runtime version). Just stay with scripted views.
The main advantage of Hana artifacts over CDS at present is the ability to use input parameters in complex cases to optimize resources and query performance - when your logic is pushed down into DB instead of AS / app. But many native SQL features are still not available in graphical views (for example - exists, JOIN on BETWEEN), so I think that 10 years later HANA artifacts will become "very rare".
So learn CDS syntax :)

Always a glad experience reading an article or pov from Lars, on any media (StackOverflow, SAP blog, article, twitter).
I just want to point out that another thing that I miss from the SQL scripting (SP, TF, SF) is the join optimization and SQL propagation that Information View has.
This is for me the focus to flexible models (apart from dynamic join that is only relevant for certain scenarios), to deliver one view that will perform depending on which columns the user or app will request.
For the semantics use, I can simply expose a TF inside an information view to add some.
You can tell me that CDS have both options available (join optimization, SQL propagation, and annotation) but for advanced or complicated scenarios (window functions not present at CDS), and also for non-SAP developers, it will be more simple and the go-to approach for beginners

How can we do data analysis for DB replication project

We are facing one issue in our project i.e. Data verification issue.
The project is about Replication of data from Sybase to oracle DBs.
The table structures for Table A across Sybase, Oracle is same.
Same column and primary key combination across all the databases.
e.g. If Sybase has Table A with columns a, b and C
same table with same name and same columns will be available in different databses.
We are done with replication stuff part.But we faced some silent failure like data discrepancy just wondering if there will any tool already available for this.
Any information on his would be helpful. Thanks.

Sybase (now SAP) has a couple products that can be used for data comparisons and reconciliation:
rs_subcmp - an older, 32-bit tool that comes with the Sybase Replication Server product that can be used to compare data between
source and target; SQL reconciliation scripts can be generated from
the differences and then applied to the target to bring it in sync
with the source; if your tables are more than 1GB in size you can
still use rs_subcmp but you'll need to create multiple comparison
jobs (via where clauses) to work on different subsets of your tables
[I don't recall if rs_subcmp can be use for heterogeneous
replication setsup, eg, ASE-Oracle.]
Data Assurance (DA) - the newer, 64-bit product ... also from
Sybase ... which can also compare data and (re)sync the target(s)
from the source (either via SQL reconciliation scripts or directly);
DA is capable of handling comparisons between a handful of
different RDBMS products (eg, ASE-Oracle); I'm currently working on a
project where one of the requirements is to validate (and reconcile
where needed) 200+TB of data being migrated from Oracle to HANA and
I'm using DA for the validation/reconciliation portion of the project
As #TenG has hinted at with his answer, there's a good bit of effort involved to compare data and generate code to reconcile the differences. Rolling your own code is doable but will entail a lot of work. If you've got the money you'll likely find 3rd party tools can get most/all of the work done for you.
If you used a 3rd party product to replicate your data from Sybase to Oracle, you may want to see if the same vendor has a comparison/validation/reconciliation tool you could use.

I've worked on a few migration projects and a key part has always been data reconciliation.
I can only talk about the approaches we took, based on constraints around tools available and minimising downtime, and constraints of available space.
In all cases I took to writing scripts that worked on two levels - summary view and "deep dive". We couldn't find any tools readily available that did what we wanted in a timely enough manner. In fact even the migration tools we found had limitations (datapump, sqlloader, golden gate, etc) and hand coded scripts to handle the bits that we found to be lacking or too slow in the standard tools.
The summary view varied from project to project. It was part functional based (do the accounting figures for transactions match) for the users to verify, and part technical. For smaller tables we could just write simple reports and the diff was straight forward.
For larger tables we wrote technical reports that looked at bands of data (e.g group the PK into 1000s) collect all the column data and produce checksum, generating a report for each table like:
PK ID Range Start Checksum
----------------- -----------
100000 22773377829
200000 38938938282
.
.
Corresponding table pairs from each database were then were "diff"d against each other to highlight discrepancies. Any differences that were found could then be looked at in more detail.
The scripts were written in such a way to allow them to run in parallel looking at discrete bands. Te band ranges were tunable as well to get the best throughput. This obviously sped things up.
The scripts were shell scripts firing off sqlplus reports, and similar for the source database.
On one project there wasn't enough diskspace to do these reports, so I wrote a Java program that queried the two databases side by side, using block queues to fetch and compare rowsets. Being in memory meant this was super fast.
For the "deep dive" we looked at the details for key tables, or for tables that reports a checksum difference.
For the user reports, the users would specify what they wanted to see, and we wrote the reports accordingly.
On the last project, the only discrepancies found were caused by character set conversion issues (people names with accents weren't handled correctly).
On projects where the overall dataset was smaller we extracted the data to XML files and wrote a Java tool to processes pairs and report differences.

The SAP/Sybase rs_subcmp tool is pretty powerful and also pretty hard to use. For details see:
https://help.sap.com/viewer/075940003f1549159206fcc89d020515/16.0.3.3/en-US/feb58db1bd1c1014b134ef4efef25563.html?q=rs_subcmp
You have to pass it key field information, but once you do that, it can retry/restart the compare streams after transient differences. Pretty fancy.
rs_subcmp expects to work on Sybase data source. So to compare against Oracle, you'd probably have to setup one of those Sybase-to-Oracle gateway products ($$$$$).
Could you install the Oracle ODBC drivers and configure them to allow Sybase clients to access Oracle? I'm guessing not (but that's outside the range of my experience).
Note the "-h" option for rs_subcmp. The docs just say it runs a "fast comparison", but what it's actually doing is running queries using the hashbytes() function. Something like:
select keyfield1,keyfield2, hashbytes("Md5",datacol1,datacol2,datacol3)
from mytable
So this sort of query might be good for the "summary view" type comparison discussed above (if the Oracle STANDARD_HASH() function output matches up with the Sybase hashbytes() function (again, outside my experience))
Note, as of ASE 16, there was a bug with the hash() & hashbytes() functions running the Md5 hash option against large varbinary columns where they could use up all procedure cache, potentially crashing the server (CR 811073)

Database with low memory requirements and Ruby interface

I need a database with low memory requirements for a small virtual server with few memory. At the moment I'm stuck with SQLite and Kyoto Cabinet or Tokyo Cabinet. The database should have a Ruby interface.
Ideally I want to avoid key-value-stores, because I have “complex” queries (more complex than looking up a single key) and tuples as keys. On the other hand I don't want to have a fixed schema and avoid the planning and migration efforts of a SQL database. A database server is also not necessary because only a single application will use the database.
Do you have any recommendations and numbers for me?

There is schema-less Postgresql (Postgresql 9.2 + json). Not as hard/confusing to set up as I thought. You get lots of flexibility with queries while still getting the benefits of schema-less storage. PG 9.2 includes plv8js, a new language handler that allows you to create functions in JavaScript. Here is one example of how you can index and query JSON docs in PG 9.2: http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html
CouchDB (Use BigCouch. Based on CouchDB, but fewer bugs/problems.):
very low memory requirements.
schema-less.
HTTP-based interface. Ruby has plenty of HTTP clients. HTTP caching (like Varnish) can also speed reads.
creative/complex queries. You can create indexes and queries on any key in the document (record). You can get very creative with queries since the indexes are very programmable.
Downsides:
Learning curve of setting up your queries/indexes.
You have to schedule a type of cleanup operation called "compaction".
Data will take up more space compared to other databases.
More: http://www.paperplanes.de/2010/7/26/10_annoying_things_about_couchdb.html
If disk is cheap and memory expensive, it would make a good candidate for your needs.
"...another strength of CouchDB, which has proven to serve thousands of concurrent requests only needed about 10MB of RAM - how awesome is that?!?!" (From: http://www.larsgeorge.com/2009/03/hbase-vs-couchdb-in-berlin.html )

SQLite3 is a great fit for what you are trying to do. It's used by a lot of companies as their embedded app database because it's flexible, fast, well tested, and has a small footprint. It's easy to create and blow away tables so it plays well with testing or single-application-use data stores.
The SQL language it uses is rich enough to do normal things but I'd recommend using Sequel with it. It's a great ORM and easily lets you treat it as a full-blown ORM, or drop all the way down to talking raw SQL to the DBM.

You are looking for a solution that only has a database file and no running server, probably. In that case, Sqlite should be a good choice - If you don't need it, just close the connection and that's it. Sqlite has everything that you need from and RDMS (expect for enforcing FK's directly, but that can be done with triggers), with a very little memory footprint, so in that case you are probably worried more about the memory your ORM (if any) uses.
Personally, I use sqlite for that use case as well, as it is portable and easy to access and install (which shouldn't be the problem on a server anyway, but in a desktop application it is).

BerkeleyDB with SQLite API is what you need.
http://www.oracle.com/technetwork/database/berkeleydb/overview/sql-160887.html

oracle metrics monitoring and reporting in real time

I am stress testing a database table
I am looking for any software that can connect to my database and show me some metrics like no of rows in a table, time for inserts , inserts/time, table fragmentation[logical/physical] etc .
It would be great if the reporting tool can do the following:
1] Report in real time or atleast after some interval so that I do not have to wait for test to finish to get first look at the data
2] Ability to do stuff with the data later, like get 99.99 percentile, avg etc.
Is mostly freely available :)
Does anyone have any suggestion of something I can use with my Oracle table. Any pointers would be great.
I can actually write scripts to logg stuff like select count(*) etc .. but then I will have to spend a lot of time parsing and changing the data reporting rather than the tests.
I think some intelligent thing might already be out there ??
Thanks
Edit:
I am looking at a piece of design for
a new architecture
The tests are
"comparison" tests for different
designs and hence as far as I do it
on same hardware and same schema etc
they are comparable to some
granularity.
I want to monitor index
fragmentation, and response times
etc.
If you think there are other
things that can change please let me
know. I am trying to roll back the
table to particular state[basically
truncate] for each new iteration of
the test

First, Oracle has built-in functionality for telling you the number of rows in a table (either use count(*) or search 'gather statistics oracle' for another option).
But "stress testing a table" sounds to me like you're going down the wrong path. Most of the metrics you're mentioning ("time for inserts , inserts/time, table fragmentation[logical/physical] etc") are highly dependent on many factors:
what OS Oracle's running on
how the OS is tuned (i.e. other services running)
how the specific Oracle instance is configured
what underlying storage architecture Oracle's using (and how tablespaces are configured)
what other queries are being executed in the database at the exact same time as your test
But NONE of them would be related to the table design itself.
Now, if you're wondering if your normalized (or de-normalized) table schema is hurting your application, that's another matter. As is performance being degraded by improper/unneeded/missing indexes, triggers, or a host of other problems.
But if you really want an app that will give you real-time monitoring, check out Quest Software's Spotlight on Oracle. But it's definitely not free.

Just to add to the other comments, I believe what you really want is to stress test the queries you're running and not the table. The table is just a bunch of data blocks on a disk and the query is what will make the difference in performance as far as development is concerned. That will tell you if you need different indexes or need to redesign the query.
On the other hand, if you're looking at it as a DBA or system administrator, you're probably more interested in OS level statistics especially disk latency, memory paging, and CPU utilization.
All this is available in the enterprise manager which is my primary tuning tool for development and DBA. If you don't have that, read up on using sql_trace to profile your queries and your OS specific documentation on how to get those stats.

Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store large amount of small data objects (millions of rows per month). Once they're saved they wont change. I need to :
store them securely
use them to analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community - just a column-oriented, read-only storing mechanism for MySQL
On the other hand, people says that NoSQL approach could be better. Hadoop+Hive looks promissing, but the documentation looks poor and the version number is less than 1.0 .
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations ?
(Yes, I found some topics here, but it was year or two ago)
Edit:
Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?

Am having the same problem here and made researches; two types of storages for BI :
column oriented. Free and known : monetDB, LucidDb, Infobright. InfiniDB
Distributed : hTable, Cassandra (also column oriented theoretically)
Document oriented / MongoDb, CouchDB
The answer depends on what you really need :
If your millions of row are loaded at once (nighly batch or so), InfiniDB or other column oriented DB are the best; They have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time.. then column oriented DB are bad. You can either choose two have two separate DB (that's my choice : one noSQL for real feeding of the stats by the front, and real time stats. The other DB column-oriented for BI). Or turn towards something that mixes column oriented (for out requests) and distribution (for writes) / like Cassandra.
Document oriented DBs are not suited for BI, they are more useful for CRM/CMS issues where you need frequent access to a particular row
As for the exact choice inside a category, I'm still undecided. Cassandra in distributed, and Monet or InfiniDB for CODB, are leaders. Monet is reported to have problem loading very big tables because it runs indexes in memory.

You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.

If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well, however you may want also to look at plain-old MySQL and the Archive table. Each record is compressed and stored and, IIRC, it's designed for your type of workload, however I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (E.g. NoSQL), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any, so I don't know how well any of them work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio