I am trying to optimize the systemd-based startup sequence of the applications that are run at startup. I have the following question about it:
Does target-based dependency actually work for sequencing/ordering of process starts?
Example: take the configuration below:
t1.target:
[Unit]
Wants=a.service b.service c.service
After=a.service b.service c.service

t2.target:
[Unit]
Wants=t1.target x.service y.service z.service
After=t1.target x.service y.service z.service
Here I have made t2.target depend on t1.target, so ideally I would expect the services wanted by t2.target to be started only after all the services in t1.target have been launched/spawned.
But from what I have seen, if the service files (of the t2 services) do not specifically declare a dependency on the individual services from t1.target, all services are attempted in parallel regardless of the target-level dependency.
Is it correct to expect target-based ordering to work as described above?
Even if the target dependency exists, is it mandatory to also add the service-level dependency explicitly (illustrated below)?
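For illustration, by a service-level dependency I mean something like this added to x.service itself (names taken from the example above):

[Unit]
After=a.service b.service c.service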
Any help here is much appreciated.
Thanks in advance :)
Setting 'common' properties for child tasks is not working
The SCDF version I'm using is 2.9.6.
I want to make a CTR A-B-C, where each task does the following:
A : sql select on some source DB
B : process DB data that A got
C : sql insert on some target DB
The simplest way to make this work seems to be to define a shared work directory path, "some_work_directory", and pass it as an application property to A, B and C. Under {some_work_directory} I just store each task's result as a file, like select.result, process.result, insert.result, and access them in sequence. If there is no precedent data, I can assume something went wrong and make the task exit with 1 (a rough sketch of that guard follows below).
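A rough sketch of that guard at the start of each task (the directory property and file name here are just the ones from my example, nothing SCDF-specific):

import java.io.File;

public class PrecedentCheck {
    public static void main(String[] args) {
        // e.g. task B checks that task A left its result behind
        String workDir = System.getProperty("some.work.directory"); // or however the shared path is injected
        File precedent = new File(workDir, "select.result");
        if (!precedent.exists()) {
            System.exit(1); // no precedent data: assume the upstream task failed
        }
        // ... process the data and write process.result ...
    }
}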
================
I tried with a composed task instance QWER, with two tasks from the same application "global", named A and B. This simple application prints the test.value application property to the console, which defaults to "test" when no other property is given.
If I set test.value in the global tab of the SCDF launch builder, it is interpreted as app.*.test.value in the composed task's log. However, the SCDF logs of the child tasks A and B show that they do not pick up this configuration from the parent; both of them fail to resolve the input given at launch time.
If I set test.value as a row in the launch builder and pass a value to A and B, as I would when the task is not a composed one, this fails as well. I know this is not the 'global' setting that I need, but it seems that CTR is not working correctly with the SCDF launch builder.
The only workaround I found is manually setting app.QWER.A.test.value=AAAAA and app.QWER.B.test.value=BBBBB in the launch freetext. This way the input is converted to app.QWER-A.app.global4.test.value=AAAAA and app.QWER-B.app.global4.test.value=BBBBB, and it prints fine.
I understand that this way I can set detailed configuration for each child task at launch time. However, if I just want to set some 'global' value that the tasks in one CTR instance share, there seems to be no feasible way.
Am I missing something? Thanks for any information in advance.
CTR will orchestrate the execution of a collection of tasks. There is no implicit data transfer between tasks. If you want the data from A to be the input to B, and the output of B to become the input of C, you can create one Task/Batch application that has readers and writers connected by a processor, OR you can create a stream application for B and use a JDBC source and sink for A and C.
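For reference, a minimal sketch of the single Task/Batch approach (assuming Spring Batch 4; the table names, column names and the Row POJO below are made up for illustration and are not from the question):

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
@EnableBatchProcessing
public class EtlStepConfig {

    // Hypothetical POJO matching the columns of SOURCE_TABLE.
    public static class Row {
        private String id;
        private String payload;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    // In a real application the two DataSources would be distinguished with @Qualifier.
    @Bean
    public Step etlStep(StepBuilderFactory steps, DataSource sourceDs, DataSource targetDs) {
        // "A": read from the source DB
        JdbcCursorItemReader<Row> reader = new JdbcCursorItemReaderBuilder<Row>()
                .name("sourceReader")
                .dataSource(sourceDs)
                .sql("SELECT id, payload FROM SOURCE_TABLE")
                .rowMapper(new BeanPropertyRowMapper<>(Row.class))
                .build();

        // "B": process the rows in memory instead of via files in a shared directory
        ItemProcessor<Row, Row> processor = row -> {
            row.setPayload(row.getPayload().trim());
            return row;
        };

        // "C": write to the target DB
        JdbcBatchItemWriter<Row> writer = new JdbcBatchItemWriterBuilder<Row>()
                .dataSource(targetDs)
                .sql("INSERT INTO TARGET_TABLE (id, payload) VALUES (:id, :payload)")
                .beanMapped()
                .build();

        return steps.get("etlStep")
                .<Row, Row>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}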
In particular, how do I add the spark-bigquery-connector so that I can query data from within dataproc's Jupyter web interface?
Key links:
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector
Goal:
To be able to run something like:
from pyspark.sql import functions as f

# assumes a SparkSession `spark` with the spark-bigquery-connector available
s = spark.read.bigquery("transactions")
s = (s
     .where(f.col("quantity") >= 0)
     .groupBy(f.col('date'))
     .agg({'sales_amt': 'sum'})
)
df = s.toPandas()
There are basically 3 ways to achieve what you want:
1 At Cluster creation:
You will have to create an initialization script (param --initialization-actions) to install your dependencies.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
2 At Cluster creation:
You can specify a customized image to be used when creating your cluster.
https://cloud.google.com/dataproc/docs/guides/dataproc-images
3 At job runtime:
You can pass the additional jar files when you run the job using the --jars parameter:
https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/jobs/submit/pyspark#--jars
I recommend (3) if you have a simple .jar dependency to run, like scoop.jar
I recommend (1) if you have lots of packages to install before running your jobs. It gives you much more control.
Option (2) definitely gives you total control, but you will have to maintain the image yourself (apply patches, upgrade etc.), so unless you really need it I don't recommend it.
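For example, picking option (3) for the spark-bigquery-connector could look roughly like this (the cluster name, script name and exact jar location are assumptions; check the connector's README for the current GCS path):

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar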
By default my resource group contains 3 resources which are added in proper order as required.
[root#2 ~]# pcs resource
Resource Group: RES-1
RES_a1 (ocf::abc:cde): Started
RES_a1-p1 (ocf::f:I2): Started
RES_a2 (ocf::hjs:f4): Started
As per requirements, new resources can be added. Now I want any new resource to be added before the last resource (RES_a2), so that during failover they start/stop in the order I need.
Working solutions found so far (but I feel this is not the correct way):
Solution 1:
Before adding the new resource, delete the last resource, then add the new resource and finally the last resource again. This works and the order is maintained.
Solution 2:
Manually editing the cib.xml file using cibadmin --query and cibadmin --replace. This also works fine, but it is more of a hack and not the proper way to do it.
I want this to be automated and hence require some stable commands.
Other things tried, but not working :
pcs constraint order start res1 then res2
You can reorder the resources inside a resource group using the pcs resource group add command (the examples below are based on your resource group). If you now want to add a resource RES_a1-p2, just add it; it will go to the end, after RES_a2. Then execute this command:
pcs resource group add RES-1 RES_a2 --after RES_a1-p2
or
pcs resource group add RES-1 RES_a1-p2 --after RES_a1-p1
or
pcs resource group add RES-1 RES_a1-p2 --before RES_a2
I need to hook a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
Thanks,
Arun
There are several types of hooks depending on at which stage you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute(): ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
11. For each job, at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval, ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
12. If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
13. Finish all the tasks
14. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
15. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
16. Return the result.
For each of the hooks I indicated the interfaces you have to implement. In the brackets there's the corresponding configuration property key you have to set in order to register the class at the beginning of the script.
E.g: setting the PreExecution hook (9th stage of the workflow)
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks :
set hive.exec.pre.hooks=com.example.MyPreHook;
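For illustration, a bare-bones com.example.MyPreHook could look like this (a minimal sketch that just logs when it fires; the compiled class also has to be on Hive's classpath, e.g. via the auxiliary jars path, before the set command above works):

package com.example;

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

public class MyPreHook implements ExecuteWithHookContext {
    @Override
    public void run(HookContext hookContext) throws Exception {
        // Invoked once per query, before the jobs are submitted (stage 9 above).
        // hookContext exposes the query plan, configuration and user information.
        System.err.println("MyPreHook: about to execute a query");
    }
}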
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think that the Cloudera distribution differs (too much).
a good start --> http://dharmeshkakadia.github.io/hive-hook/
there are examples...
Note: the Hive CLI shows the hook's messages on the console; if you execute from Hue, add a logger and you can see the results in the HiveServer2 role log.
We are running a Spring 3.0.x web application (.war) with a nightly @Scheduled job in a clustered WebLogic 10.3.4 environment. However, as the application is deployed to each node (using the deployment wizard in the AdminServer's web console), the job is started on each node every night, thus running multiple times concurrently.
How can we prevent this from happening?
I know that libraries like Quartz allow coordinating jobs inside clustered environment by means of a database lock table or I could even implement something like this myself. But since this seems to be a fairly common scenario I wonder if Spring does not already come with an option how to easily circumvent this problem without having to add new libraries to my project or putting in manual workarounds.
We are not able to upgrade to Spring 3.1 with configuration profiles, as mentioned here
Please let me know if there are any open questions. I also asked this question on the Spring Community forums. Thanks a lot for your help.
We only have one task that sends a daily summary email. To avoid extra dependencies, we simply check whether the hostname of each node corresponds to a configured system property.
private boolean isTriggerNode() {
    try {
        String triggerHostname = System.getProperty("trigger.hostname");
        String hostName = InetAddress.getLocalHost().getHostName();
        return hostName.equals(triggerHostname);
    } catch (UnknownHostException e) {
        return false; // local hostname cannot be resolved, so never trigger here
    }
}

public void execute() {
    if (isTriggerNode()) {
        // send email
    }
}
We are implementing our own synchronization logic using a shared lock table inside the application database. This allows all cluster nodes to check whether a job is already running before actually starting it themselves.
Be careful: with the solution of implementing your own synchronization logic using a shared lock table, you still have the concurrency issue of two cluster nodes reading/writing the table at the same time.
Best is to perform the following steps in one db transaction:
- read the value in the shared lock table
- if no other node is having the lock, take the lock
- update the table indicating you take the lock
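A minimal sketch of those three steps with Spring's JdbcTemplate (the SCHEDULER_LOCK table, its columns and the surrounding service are hypothetical; the row for each job is assumed to exist already):

import java.util.Date;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class JobLockService {

    private final JdbcTemplate jdbcTemplate;

    @Autowired
    public JobLockService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Returns true only for the node that obtained the lock. The SELECT ... FOR UPDATE
    // serializes concurrent nodes, so only one of them sees an expired or empty lock.
    @Transactional
    public boolean tryAcquire(String jobName, Date now, Date lockUntil) {
        Date lockedUntil = jdbcTemplate.queryForObject(
                "SELECT LOCKED_UNTIL FROM SCHEDULER_LOCK WHERE JOB_NAME = ? FOR UPDATE",
                new Object[] { jobName }, Date.class);
        if (lockedUntil != null && lockedUntil.after(now)) {
            return false; // another node already holds the lock
        }
        jdbcTemplate.update(
                "UPDATE SCHEDULER_LOCK SET LOCKED_UNTIL = ? WHERE JOB_NAME = ?",
                lockUntil, jobName);
        return true;
    }
}

The scheduled method then calls tryAcquire(...) first and simply returns when it gets false.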
I solved this problem by making one of the boxes the master.
Basically, set an environment variable on one of the boxes, e.g. master=true,
and read it in your Java code through System.getenv("master").
If it is present and true, run your code.
Basic snippet:
@Scheduled(cron = "...") // whatever schedule you need
void process() {
    boolean master = Boolean.parseBoolean(System.getenv("master"));
    if (master) {
        // your logic
    }
}
You can try using TimerManager (Job Scheduler in a clustered environment) from WebLogic as the TaskScheduler implementation (TimerManagerTaskScheduler). It should work in a clustered environment.
Andrea
I've recently implemented a simple annotation library, dlock, to execute a scheduled task only once over multiple nodes. You can simply do something like below.
@Scheduled(cron = "59 59 8 * * *" /* Every day at 8:59:59am */)
@TryLock(name = "emailLock", owner = NODE_NAME, lockFor = TEN_MINUTE)
public void sendEmails() {
List<Email> emails = emailDAO.getEmails();
emails.forEach(email -> sendEmail(email));
}
See my blog post about using it.
You don't need to synchronize your job start using a DB.
On a WebLogic application you can get the name of the instance where the application is running:
String serverName = System.getProperty("weblogic.Name");
Simply put a condition to execute the job:
if (serverName.equals(".....")) {
    // execute my job
}
If you want to bounce the job from one machine to the other, you can take the current day of the year: if it is odd you execute the job on one machine, if it is even on the other one (a rough sketch of this follows below).
This way a different machine takes the load every day.
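A rough sketch of that odd/even rotation (the server names are placeholders, not from the original answer):

import java.util.Calendar;

public class DayBasedTrigger {

    static boolean isMyTurn() {
        String serverName = System.getProperty("weblogic.Name");
        int dayOfYear = Calendar.getInstance().get(Calendar.DAY_OF_YEAR);
        // "node1" runs on odd days, "node2" on even days
        return (dayOfYear % 2 == 1) ? "node1".equals(serverName)
                                    : "node2".equals(serverName);
    }
}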
We can make the other machines in the cluster not run the batch job by using the following cron string. It will not run till 2099.
0 0 0 1 1 ? 2099