How to dynamically expand segments in Greenplum DBMS

For now, the only way I know of to expand segments/hosts in Greenplum is the gpexpand utility. However, as far as I know, gpexpand stops the master server for quite a while during the early expansion phase, and it locks whichever table is currently being redistributed. I just want to know whether there is any way for Greenplum to keep working normally (no downtime, no locked tables) while segments/hosts are being expanded. Thanks!

No. Greenplum must stop during the expansion phase, but once the additional nodes/segments have been added, the redistribution of data can be done while users are active in the database.
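If it helps, here is a rough sketch of how you might keep an eye on the online redistribution phase from SQL. The catalog table is standard Greenplum, but the gpexpand status schema and its column names can vary between versions, so treat this as illustrative rather than definitive.

-- Cluster layout: which segments exist once the expansion phase has added them
-- (gp_segment_configuration is a standard Greenplum catalog table).
SELECT content, role, hostname, port, status
FROM gp_segment_configuration
ORDER BY content, role;

-- Redistribution progress, assuming gpexpand has created its status tables
-- in this database (verify the exact table/column names for your version).
SELECT fq_name, status
FROM gpexpand.status_detail
ORDER BY status;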
Alternatively, Pivotal HDB (based on Apache HAWQ) does have dynamic virtual segments that you can even control at the query level. The optimizer controls how many segments are used for a query based on the cost of the query but you can also provide more segments to really leverage the resources available in the cluster.

Related

How to know when data has been inserted in clickhouse

I understand that ClickHouse is eventually consistent, so once an insert call returns, it doesn't mean that the data will appear in a select query.
Does that apply to stand-alone ClickHouse (no distribution, no replication)?
I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
Using a distributed and replicated ClickHouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
No, but a single-node setup shouldn't be considered reliable either.
By default, yes: you insert into the node the client is connected to (probably via some load balancer), and the Distributed table asynchronously forwards each piece of data to the node where it belongs. The insert_distributed_sync=1 setting makes the client wait until that forwarding has completed.
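For that distributed-insert case, the synchronous behaviour is just a session setting; the database and table names below are invented for illustration.

-- Make inserts through the Distributed table wait until the data has been
-- forwarded to the destination shards, instead of returning immediately.
SET insert_distributed_sync = 1;

INSERT INTO db.events_distributed (event_date, user_id, payload)
VALUES ('2024-01-01', 42, 'example');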
On insert, write to the Replicated*MergeTree shard tables directly (not the Distributed table) with the insert_quorum=2 setting (if there are 3 replicas), and retry indefinitely with exactly the same batch if there are errors (you can use different replicas on retry, since there is deduplication based on the batch hash). Then, on reads, use the select_sequential_consistency=1 setting.
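As a rough illustration of that recipe (the database and table names are invented; insert_quorum and select_sequential_consistency are standard ClickHouse settings):

-- Write to the local Replicated*MergeTree shard table, waiting for 2 of 3 replicas.
SET insert_quorum = 2;

INSERT INTO db.events_local (event_date, user_id, payload)
VALUES ('2024-01-01', 42, 'example');

-- On an error, retry with exactly the same batch: identical blocks are
-- deduplicated, so a successful-but-unacknowledged insert is not duplicated.

-- Read only data that has reached the quorum.
SET select_sequential_consistency = 1;

SELECT count() FROM db.events_local WHERE event_date = '2024-01-01';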

Apache Nifi - Federated Search

My team has been thrown into the deep end and asked to build a federated search of customers over a variety of large datasets, each holding varying amounts of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database which is then queried to feed an Elasticsearch instance for the application's use.
So, roughly speaking, something like this:
For example's sake, the following data then exists in the result database from the first flow:

Then running https://github.com/dedupeio/dedupe over this database table, which will add cluster ids to aid the record linkage, e.g.:

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, even though the source databases will be constantly updated and I'd need to handle that, so I'm really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you would write a script for ExecuteScript, unless there is an equivalent Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
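For the "when to fetch results for Elasticsearch" part, the follow-on ExecuteSQL (or a QueryDatabaseTable-style processor) would simply run a query over the deduplicated table. The table and column names below are invented; the point is just selecting by cluster id plus a change timestamp so only new or updated clusters get pushed.

-- Illustrative query for the follow-on ExecuteSQL processor.
-- Pull the clustered records so the API can group duplicates by cluster_id.
SELECT cluster_id,
       source_system,
       customer_id,
       full_name,
       email,
       updated_at
FROM merged_customers
WHERE updated_at > ?   -- bound to the last successful run's timestamp, however you track it
ORDER BY cluster_id;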

Dynamically list the contents of a table in a database that continuously updates

It's a real-world problem, and I believe a solution exists, but I couldn't find one.
We have a database called Transactions that contains tables such as Positions, Securities, Bogies, Accounts, Commodities and so on, updated continuously, every second, whenever a new transaction happens. For the time being, we have replicated the master Transactions database to a new database named TRN, on which we do all the querying and updating.
We want a sort of monitoring system for the database (like the htop process viewer in Linux) that dynamically lists the rows being updated in its tables at any given time.
TL;DR Is there any way to get a continuous updating list of rows in any table in the database?
Currently we are working with Sybase and Oracle DBMSs on Linux (Ubuntu), but we would like generic answers that apply to most platforms and DBMSs (including MySQL), along with any tools, utilities or scripts that can do this, so that it will be easy to migrate to other platforms and/or DBMSs in the future.
To list updated rows, you conceptually need one of two things:
The updating statement's effect on the table.
A previous version of the table to compare with.
How you get them and in what form is completely up to you.
The 1st option allows you to list updates with statement granularity while the 2nd is more suitable for time-based granularity.
Some options from the top of my head:
Write to a temporary table
Add a field with transaction id/timestamp
Make clones of the table regularly
AFAICS, Oracle doesn't have built-in facilities to get the affected rows, only their count.
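As a minimal sketch of the "add a field with a transaction id/timestamp" option from the list above (Oracle-flavoured syntax; the table and column names are made up):

-- Add a change timestamp and keep it current with a trigger.
ALTER TABLE positions ADD (last_modified TIMESTAMP);

CREATE OR REPLACE TRIGGER trg_positions_touch
BEFORE INSERT OR UPDATE ON positions
FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSTIMESTAMP;
END;
/

-- The monitoring process then polls for rows changed since its last check.
SELECT *
FROM positions
WHERE last_modified > :last_poll_time;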
Not a lot of details in the question so not sure how much of this will be of use ...
'Sybase' is mentioned but nothing is said about which Sybase RDBMS product (ASE? SQLAnywhere? IQ? Advantage?)
by 'replicated master database transaction' I'm assuming this means the primary database is being replicated (as opposed to the database called 'master' in a Sybase ASE instance)
no mention is made of what products/tools are being used to 'replicate' the transactions to the 'new database' named 'TRN'
So, assuming part of your environment includes Sybase(SAP) ASE ...
MDA tables can be used to capture counters of DML operations (eg, insert/update/delete) over a given time period
MDA tables can capture some SQL text, though the volume/quality could be in doubt if a) MDA is not configured properly and/or b) the DML operations are wrapped up in prepared statements, stored procs and triggers
auditing could be enabled to capture some commands but again, volume/quality could be in doubt based on how the DML commands are executed
also keep in mind that there's a performance hit for using MDA tables and/or auditing, with the level of performance degradation based on individual config settings and the volume of DML activity
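For the MDA-table counters mentioned above, a query along these lines shows per-table DML activity; the column names are from memory, so verify them against your ASE version, and mon_role plus the relevant monitoring configuration options are required.

-- Per-table DML counters from the ASE MDA tables (illustrative).
SELECT DBName,
       ObjectName,
       RowsInserted,
       RowsUpdated,
       RowsDeleted
FROM master..monOpenObjectActivity
WHERE DBName = 'TRN'
ORDER BY RowsUpdated DESC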
Assuming you're using the Sybase(SAP) Replication Server product, those replicated transactions sent through repserver likely have all the info you need to know which tables/rows are being affected; so you have a couple options:
route a copy of the transactions to another database where you can capture the transactions in whatever format you need [you'll need to design the database and/or any customized repserver function strings]
consider using the Sybase(SAP) Real Time Data Streaming product (yeah, additional li$ence is required) which is specifically designed for scenarios like yours, ie, pull transactions off the repserver queues and format for use in downstream systems (eg, tibco/mqs, custom apps)
I'm not aware of any 'generic' products that work, out of the box, as per your (limited) requirements. You're likely looking at some different solutions and/or customized code to cover your particular situation.

Which Hadoop component can handle all the Oracle queries?

Which Hadoop component can handle all the Oracle functions, and which has low latency?
I'm thinking of using components like Presto, Drill and Shark.
Can anyone tell me which of the above technologies can handle all the functions in Oracle with low latency,
or at least which has the best compatibility and can handle most of Oracle's functions?
I have the flexibility to use more than one technology, but I'm unsure which to use for what: which functions are compatible with which technology, and which technology gives low latency?
Presto is designed to implement ANSI SQL and to execute queries with low latency (under 100ms for connectors that support it). Queries against Hive can execute in ~1s, depending on the speed of the Hive metastore (zero time if cached due to repeated access) and HDFS latency.
Regarding Oracle functionality, nothing in open source comes close. Oracle is a huge product with a ton of functionality. However, no one uses all of the functionality. Most people use a small subset. You will need to evaluate the different alternatives and decide which has the functionality subset that best meets your needs.
Disclosure: I am one of the creators of Presto.
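For a sense of what that looks like in practice, a Presto query over the Hive connector is plain ANSI SQL against a catalog.schema.table path; the names below are invented for illustration.

-- ANSI SQL over data stored in Hive/HDFS, via Presto's Hive connector.
SELECT country, count(*) AS visits
FROM hive.web.page_views
WHERE view_date = DATE '2024-01-01'
GROUP BY country
ORDER BY visits DESC
LIMIT 10;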

Is it possible to temporarily turn off an Oracle cluster?

So that I can use direct pathing for data loading? Can you turn off a cluster temporarily until data is loaded and then turn everything back on?
I'm guessing this question is related to your other question about using SQL*Loader direct path. I believe the restriction on using direct path in SQL*Loader is that the table must not be clustered. If the table you are inserting data into is not clustered, you can use direct path whether your Oracle instance is clustered or not.
So, if your table is not clustered, you should be able to use direct path loading without turning Oracle clustering off. If your table is clustered, you are out of luck: converting it to a non-clustered table and then clustering it again after the data is loaded would negate any performance gains from the direct path loading.
You are mixing two completely different concepts:
- a database cluster and
- a table cluster.
A database cluster provides scalability and HA, while a table cluster determines how and where the data is physically stored. Turning off RAC will not help with table clusters.
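To make the distinction concrete, a table cluster is a storage object you create with SQL, and it is this kind of cluster that rules out SQL*Loader direct path, not a RAC database cluster. The names below are illustrative.

-- An Oracle table cluster: co-locates rows of related tables by cluster key.
-- This is a storage concept, unrelated to RAC (a database cluster).
CREATE CLUSTER emp_dept_cluster (deptno NUMBER(3));
CREATE INDEX emp_dept_cluster_idx ON CLUSTER emp_dept_cluster;

CREATE TABLE dept (
  deptno NUMBER(3) PRIMARY KEY,
  dname  VARCHAR2(30)
) CLUSTER emp_dept_cluster (deptno);

CREATE TABLE emp (
  empno  NUMBER(5) PRIMARY KEY,
  ename  VARCHAR2(30),
  deptno NUMBER(3) REFERENCES dept
) CLUSTER emp_dept_cluster (deptno);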
