How to add a jar dependency to a Dataproc cluster in GCP? - maven

In particular, how do I add the spark-bigquery-connector so that I can query data from within dataproc's Jupyter web interface?
Key links:
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector
Goal:
To be able to run something like:
import pyspark.sql.functions as f

s = spark.read.bigquery("transactions")
s = (s
    .where(f.col("quantity") >= 0)
    .groupBy(f.col("date"))
    .agg({"sales_amt": "sum"})
)
df = s.toPandas()

There are basically three ways to achieve what you want:
1 At Cluster creation:
You will have to create an initialization script (the --initialization-actions parameter) to install your dependencies.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
2 At Cluster creation:
You can specify a customized image to be used when creating your cluster.
https://cloud.google.com/dataproc/docs/guides/dataproc-images
3 At job runtime:
You can pass the additional jar files when you run the job using the --jars parameter:
https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/jobs/submit/pyspark#--jars
I recommend (3) if you have a simple .jar dependency to run, like sqoop.jar
I recommend (1) if you have lots of packages to install before running your jobs. It gives you much more control.
Option (2) definitely gives you total control, but you will have to maintain the image yourself (apply patches, upgrades, etc.), so unless you really need it I don't recommend it.
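For the Jupyter part of the question, here is a minimal, unverified PySpark sketch of what the notebook code could look like once the connector is available. It assumes the jar is already on the cluster via option (1) or (3), or can be pulled with the spark.jars.packages property before the SparkSession is created (in a Dataproc notebook a session usually already exists, in which case the jar should be supplied at cluster or job level instead); the Maven coordinate, version, and table name are placeholders:
# Unverified sketch for the Dataproc Jupyter notebook.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = (SparkSession.builder
         .appName("bigquery-example")
         # Placeholder Maven coordinate; check the spark-bigquery-connector README
         # for the artifact and version matching your Spark/Scala version.
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:<version>")
         .getOrCreate())

# The connector is exposed to PySpark as the "bigquery" data source format;
# a fully qualified project.dataset.table name may be required instead of "transactions".
s = spark.read.format("bigquery").option("table", "transactions").load()

df = (s.where(f.col("quantity") >= 0)
       .groupBy(f.col("date"))
       .agg({"sales_amt": "sum"})
       .toPandas())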

Related

Is there a way to reorder cluster resource group after pcs resource add

By default my resource group contains 3 resources which are added in proper order as required.
[root#2 ~]# pcs resource
 Resource Group: RES-1
     RES_a1     (ocf::abc:cde):  Started
     RES_a1-p1  (ocf::f:I2):     Started
     RES_a2     (ocf::hjs:f4):   Started
As per requirements, new resources can be added. Now I want all those resources to be added before the last resource (RES_a2) so that during failover they start/stop in the order I need.
Working solutions found so far (but I feel this is not the correct way):
Solution 1:
Before adding the new resource, delete the last resource, add the new resource, and then add the last resource again. This works and the order is maintained.
Solution 2:
Manually edit the cib.xml file using cibadmin --query and cibadmin --replace. This also works fine, but it is more of a hack and not the proper way to do it.
I want this to be automated and hence need stable commands.
Other things tried, but not working:
pcs constraint order start res1 then res2
You can reorder the resources inside a resource group with the pcs resource group add command (examples based on your resource group). If you now want to add a resource RES_a1-p2, just add it (it will go to the end, after RES_a2) and then execute one of the following:
pcs resource group add RES-1 RES_a2 --after RES_a1-p2
or
pcs resource group add RES-1 RES_a1-p2 --after RES_a1-p1
or
pcs resource group add RES-1 RES_a1-p2 --before RES_a2

Create Snapshot of FS from Spark Job

I would like to create a snapshot of the underlying HDFS when running a Spark job. The particular step involves deleting the contents of some Parquet files. I want to create a snapshot, perform the delete operation, verify the results, and proceed with the next steps.
However, I am unable to find a good way to access the HDFS API from my Spark job. The directory I want to snapshot is already marked snapshottable in HDFS. The command-line method of creating the snapshot works, but I need to do this programmatically.
I am running Spark 1.5 on CDH 5.5.
Any hints or clues as to how I can perform this operation?
Thanks
Ramdev
I have not verified this, but at least I do not get compile errors, and in theory this solution should work.
This is Scala code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf())
val fs = FileSystem.get(sc.hadoopConfiguration)
// the directory must already be marked snapshottable (hdfs dfsadmin -allowSnapshot <dir>)
val snapshotDir = new Path("/path/to/snapshot/dir")
val snapshotPath = fs.createSnapshot(snapshotDir, "snapshot-name")
// ... delete the parquet contents and verify the results ...
if (/* condition satisfied */) {
  // deleteSnapshot takes the snapshottable directory, not the path returned by createSnapshot
  fs.deleteSnapshot(snapshotDir, "snapshot-name")
}
I assume this will work in theory.
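If the job happened to be written in PySpark rather than Scala, an equally unverified sketch could reach the same snapshot API through the JVM gateway that PySpark exposes; the directory path and snapshot name below are placeholders:
# Unverified sketch: HDFS snapshot API from PySpark via the JVM gateway.
# Assumes an existing SparkContext named sc and a directory already marked
# snapshottable (hdfs dfsadmin -allowSnapshot /path/to/dir).
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
snapshot_dir = sc._jvm.org.apache.hadoop.fs.Path("/path/to/dir")

snapshot_path = fs.createSnapshot(snapshot_dir, "before-delete")
# ... delete the parquet contents and verify the results ...
fs.deleteSnapshot(snapshot_dir, "before-delete")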

HDInsight Error

I am following the steps in the link shown below to use Hadoop 2.2 clusters with HDInsight. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-30/
In the "Run a Word Count Map Reduce Job"section I am having difficulty getting the message to take for step 4. In the PowerShell I type in the following commands:
Submit the job
Select-AzureSubscription $subscriptionName
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
I keep getting an error that states there is a ParameterArgumentValidationError. What command could I use to avoid getting these errors?
I am new to using Azure and could really use some help :)
Those are two separate cmdlets:
The first one is:
Select-AzureSubscription $subscriptionName
If you have only one subscription in your Azure account, you can skip this cmdlet.
The second cmdlet is:
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition

Hive execution hook

I need to hook a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
Thanks,
Arun
There are several types of hooks depending on at which stage you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute(): ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
    For each job, at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval, ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
    If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
11. Finish all the tasks
12. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
13. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
14. Return the result.
For each of the hooks I indicated the interfaces you have to implement. In the brackets is the corresponding configuration property key you have to set in order to register the class at the beginning of the script.
E.g.: setting the pre-execution hook (stage 9 of the workflow above):
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks:
set hive.exec.pre.hooks=com.example.MyPreHook;
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think the Cloudera distribution differs (too much).
A good start: http://dharmeshkakadia.github.io/hive-hook/ - there are examples there.
Note: the Hive CLI shows the hook's messages on the console; if you execute from Hue, add a logger and you can see the results in the HiveServer2 role log.

How will RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) call my custom mappers and reducers?

I am running the Mahout in Action example for chapter 6 using the command:
"hadoop jar target/mia-0.1-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData"
But the mappers and reducers in the chapter 06 example are not working?
You have to change the code to use the custom Mapper and Reducer classes you have in mind. Otherwise yes of course it runs the ones that are currently in the code. Add them, change the caller, recompile, and run it all on Hadoop. I am not sure what you refer to that is not working.
