Dump Import vs Parallel Processing vs Hadoop [duplicate] - oracle

This question already has an answer here:
How copy data from one database to another on different server?
(1 answer)
Closed 8 years ago.
Actually I want to copy a 50 GB database from one server to another, and I just want to know which of the three options is best.
Thanks

Assuming you don't have to transform the data, I can't see any gain from using Hadoop (or parallel processing, for that matter) in this scenario.
See this question on copying Oracle data between servers.
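For a plain copy of that size, Oracle Data Pump export/import is the usual route. A rough sketch, assuming a schema-level transfer (the schema name, credentials and directory object below are placeholders):

expdp myuser/mypassword@SOURCEDB schemas=MYSCHEMA directory=DATA_PUMP_DIR dumpfile=myschema.dmp logfile=export.log
# copy myschema.dmp to the directory behind DATA_PUMP_DIR on the target server, e.g. with scp
impdp myuser/mypassword@TARGETDB schemas=MYSCHEMA directory=DATA_PUMP_DIR dumpfile=myschema.dmp logfile=import.log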

Related

How to lock a directory in unix for the shell script [duplicate]

This question already has answers here:
What is the best way to ensure only one instance of a Bash script is running? [duplicate]
(14 answers)
Closed 3 years ago.
I need to remove the contents of a directory D based on the conditions below:
1. Get the used space of directory D.
2. If the used space is above a threshold, remove the contents of the directory based on last modified time (using find -mtime).
I have already written a shell script for this (clearSpace.sh), but the problem is that the script can be called by multiple processes simultaneously.
I want steps 1 and 2 to be atomic so that I get consistent results.
Is there a way to first take a "lock" on directory D, execute clearSpace.sh, and then release the lock? Permission-based locking is not an option.
We suggest refraining from locking storage or any other OS resources; the side effects could be devastating, and they are unrelated to what you actually need.
Syncing, distributing and managing resources is the operating system's responsibility.
Resource locking is kernel-level programming, and it is better not to go there; there is a reason the OS scripting tools do not offer such locking features.
What you can do is implement script-level locking: use a lock.id file to signal that the script is running, refuse to start the script if lock.id exists, and remove the lock.id file when the script's work is completed.
We also suggest making your script's thresholds more flexible and tolerant.
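A minimal sketch of that lock-file idea, using flock(1) from util-linux; the lock file path, threshold and cleanup command are illustrative placeholders, not your actual script:

#!/bin/bash
# Wrapper that lets only one instance of the cleanup run at a time.
LOCKFILE=/var/tmp/clearSpace.lock

exec 200>"$LOCKFILE"        # open (or create) the lock file on file descriptor 200
if ! flock -n 200; then     # try to take an exclusive lock without waiting
    echo "clearSpace.sh is already running, exiting." >&2
    exit 1
fi

# The used-space check and cleanup run while the lock is held, for example:
# used_kb=$(du -sk /path/to/D | awk '{print $1}')
# [ "$used_kb" -gt "$THRESHOLD_KB" ] && find /path/to/D -type f -mtime +7 -delete

# The lock is released automatically when the script exits and fd 200 is closed.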

mclapply and spark_read_parquet

I am relatively new as an active user of the forum, but I first have to thank you all for your contributions, because I have been looking for answers here for years.
Today I have a question that nobody seems to have solved, or at least I cannot find an answer to.
I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I have used mclapply, but when I set more than 1 core, it fails.
Example (the same code works when using one core, but fails when using 2):
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 1)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 2)
Warning message:
In mclapply(seq(file_paths), function(i) { :
all scheduled cores encountered errors in user code
Any suggestions?
Thanks in advance.
Just read everything into one table via a single spark_read_parquet() call; that way Spark handles the parallelization for you. If you need separate tables you can split them afterwards, assuming there's a column that tells you which file the data came from. In general you shouldn't need mclapply() when using Spark with R.
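For example, a minimal sketch assuming the files share a schema and sit under one S3 prefix (the bucket, prefix and the source_file column are illustrative):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# One call reads every parquet file under the prefix; Spark parallelizes the read.
all_data <- spark_read_parquet(
  sc,
  name = "all_data",
  path = "s3a://my-bucket/my-prefix/"
)

# If you need per-file tables afterwards, filter on a distinguishing column:
# rdd_1 <- all_data %>% filter(source_file == "file_1.parquet")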

Checking when a folder was created in Bash? [duplicate]

This question already has answers here:
How to get file creation date/time in Bash/Debian?
(13 answers)
Closed 7 years ago.
I know you can't for files, but is there by any chance a way to check this for folders?
It is not possible to get the creation time of any file in Linux, since:
Each file has three distinct associated timestamps: the time of last
data access, the time of last data modification, and the time the file
status last changed.
Also, as you wrote in your question, "I know you can't for files, but is there by any chance a way to check this for folders?". In Linux, directories are files too, so your question already contains its own answer.
Source
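For reference, a quick way to inspect the three timestamps that do exist (the path is illustrative):

stat /path/to/directory            # prints the Access, Modify and Change times
stat -c '%x' /path/to/directory    # last access time only
stat -c '%y' /path/to/directory    # last data modification time only
stat -c '%z' /path/to/directory    # last status change time only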

Measuring Oracle DB connection bandwidth [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I'm looking for ideas on measuring the bandwidth of an existing Oracle DB connection (Perl DBI) without any changes on the database server side. Right now I'm running a
select * from table;
against a table with an approximately known amount of data and timing the response. I'm running it manually from a shell, and I'm considering implementing similar functionality in the application's admin/debug section for admins to look at. Specifically, I'm looking at running prepare first and then timing the execute with Time::HiRes.
Questions:
Is there a better SQL statement to use for the benchmark? Perhaps some query could generate a specific amount of non-meaningful data on the fly, much like dd if=/dev/zero bs=1k count=1k
Can someone think of another approach to measuring bandwidth that could be integrated into the web UI? A non-interactive shell command would work fine.
A little background: my application accesses the Oracle DB over an internal network. The network has bandwidth problems; on a bad day it's as bad as dial-up.
Implementation environment is RHEL / Oracle Instantclient / Perl DBI.
You could try running a little script on the Oracle server that sits in a loop, reading from a specified port (using netcat) and discarding the data to /dev/null. Then have a script on your client that sends a known volume of data every now and then (using netcat) and times how long it takes. That pretty much measures network performance and is independent of Oracle or disk.
Something like this:
On Oracle Server
while /bin/true
do
nc -l 20000 > /dev/null
done
On client
time dd if=/dev/zero bs=1024k count=10 | nc <oracle_ip> 20000
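On the first question, if you want a pure SQL benchmark that generates throwaway data on the fly instead of reading a real table, one option is Oracle's connect by level trick; the row count and padding size here are illustrative:

-- streams roughly 10 MB (10,000 rows of ~1 KB each) over the connection
-- without depending on any real table's contents
select rpad('x', 1000, 'x') as payload
from dual
connect by level <= 10000;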

how to restrict placement of client data to specific nodes in hadoop? [duplicate]

This question already has an answer here:
How to put files to specific node?
(1 answer)
Closed 9 years ago.
I am working with Hadoop, and I am currently trying to find out how to make a file given by the client be stored on specific nodes in the cluster.
The client wants the files, or their chunks, to be stored on particular nodes only, not spread across all of them.
I am searching for a way to specify which nodes will be used to store a file we put into HDFS.
Can anyone suggest some ideas?
With the help of HDFS-385 you can do that. This feature provides a way to write code that specifies how HDFS should place the replicas of a file's blocks. You can visit this if you need help writing your own block placement policy.
HTH
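As a rough sketch of how such a policy is usually wired in once written (the class name is a placeholder, and the configuration key can vary between Hadoop versions, so verify it against your release): the policy class extends Hadoop's BlockPlacementPolicy and overrides chooseTarget() to return only the DataNodes you want, and it is registered on the NameNode in hdfs-site.xml:

<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyBlockPlacementPolicy</value>
</property>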
