Ignore/Overwrite .pigbootup configurations - hadoop

I have a .pigbootup file configured which sets DEFAULT_PARALLEL for all the Pig jobs. The script I am working on doesn't require that many reducers and I don't want to use that configuration. How can I overwrite or ignore the configurations given in this file?

I found a great way to ignore the .pigbootup file completely: point Pig at a non-existent bootup file. If you want to override only some property values while keeping the rest of the existing configuration, this isn't ideal, but in my case it worked just fine.
pig -Dpig.load.default.statements=/tmp/.non-existent-pigboot -f test.pig
Reference - https://hadoopified.wordpress.com/2013/02/06/pig-specify-a-default-script/
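If you only need to override DEFAULT_PARALLEL rather than skip the bootup file entirely, a SET statement at the top of the script should take precedence, since the .pigbootup statements are executed before the script itself. A minimal sketch, where the load path, aliases, and parallelism value are placeholders:
-- test.pig: override the parallelism inherited from .pigbootup for this script only
SET default_parallel 1;
logs = LOAD '/tmp/input' AS (line:chararray);
grouped = GROUP logs ALL;
counts = FOREACH grouped GENERATE COUNT(logs);
STORE counts INTO '/tmp/output';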

Related

Hadoop - Managing multiple input/output files

I'm facing problems managing multiple input files.
I have many files in the .../input/ folder and a MapReduce job which I want to run for each of my input files, so that every input file gets its own output (in .../output/).
I tried searching the Net, but many pages are very old and I can't get a working method. Are there any methods/classes I can use to make this work?
Thanks in advance.
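One straightforward pattern (not from the original thread, just a sketch): have the driver list the files in the input directory and submit one job per file, pointing each job's output at a directory named after its input. The mapper/reducer below are trivial placeholders and the paths are hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PerFileDriver {

    // Placeholder mapper: emits one count per input line
    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("lines"), ONE);
        }
    }

    // Placeholder reducer: sums the counts
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/user/me/input");    // hypothetical input folder
        Path outputBase = new Path("/user/me/output"); // hypothetical output folder

        for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isDirectory()) {
                continue; // submit jobs only for plain files
            }
            Path in = status.getPath();
            Job job = Job.getInstance(conf, "per-file-" + in.getName());
            job.setJarByClass(PerFileDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, in);
            // one output directory per input file, e.g. /user/me/output/<filename>
            FileOutputFormat.setOutputPath(job, new Path(outputBase, in.getName()));
            job.waitForCompletion(true); // runs sequentially; submit() could run them in parallel
        }
    }
}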

Where to put oozie.launcher.* configuration?

While trying to use Oozie properly, I ended up setting a few parameters, namely:
oozie.launcher.mapreduce.map.memory.mb
oozie.launcher.mapreduce.map.java.opts
oozie.launcher.yarn.app.mapreduce.am.resource.mb
oozie.launcher.mapred.job.queue.name
If I set them in the workflow configuration, they work as expected.
Is there a way/a place to set them globally, i.e. not per workflow? I was expecting custom-oozie-site.xml to be the right place, but apparently it is not (they have no effect when put there). Is the workflow itself the only place where they can be configured?
If it is relevant, I am using HDP 2.5.
In the Parameterization of Workflows section of the Oozie documentation, they state:
Workflow applications may define default values for the workflow job parameters. They must be defined in a config-default.xml file bundled with the workflow application archive... Workflow job properties have precedence over the default values.
Another option I've seen done is defining a parent workflow definition and propagating to child workflows. Granted, this only works in specific instances and isn't always a good idea.
In addition, the documentation notes in the Workflow Deployment section:
The config-default.xml file defines, if any, default values for the workflow job parameters. This file must be in the Hadoop Configuration XML format. EL expressions are not supported and user.name property cannot be specified in this file. Any other resources like job.xml files referenced from a workflow action node must be included under the corresponding path, relative paths always start from the root of the workflow application.
This is a problem my team is currently trying to fix across 12 different ETL loads.
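For reference, config-default.xml is just a Hadoop Configuration XML file bundled next to workflow.xml in the application directory. Whether the oozie.launcher.* values are honored from there can depend on the Oozie/HDP version, so treat the following as a sketch rather than a guaranteed fix (the memory values and queue name are example placeholders):
<configuration>
  <property>
    <name>oozie.launcher.mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>oozie.launcher.mapreduce.map.java.opts</name>
    <value>-Xmx1638m</value>
  </property>
  <property>
    <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>oozie.launcher.mapred.job.queue.name</name>
    <value>default</value>
  </property>
</configuration>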

How can I batch Kafka reads to Elasticsearch

I'm not too familiar with Kafka, but I would like to know the best way to read data in batches from Kafka so that I can use the Elasticsearch Bulk API to load the data faster and more reliably.
By the way, I am using Vert.x for my Kafka consumer.
Thank you.
I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it, as it was not doing everything I wanted and was also not easily scalable. The 2.0 version is now quite reliable, and we use it in production at our company, processing/indexing 300M+ events per day.
This is not self-promotion :) - just sharing how we do the same type of work. There might be other options by now as well, of course.
https://github.com/confluentinc/kafka-connect-elasticsearch
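If the Kafka Connect route fits your setup, the Elasticsearch sink connector is driven by a small properties file, roughly along these lines (the topic, connection URL, and batch size are placeholders, and exact option names can vary between connector versions):
name=es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=events
connection.url=http://localhost:9200
type.name=_doc
key.ignore=true
# batch.size caps how many records go into each bulk request
batch.size=2000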
Or you can try this source:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard jar:
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties and update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml, specify the directory you want to store logs in, and adjust the max sizes and number of log files as needed.
4. Build the app jar (make sure you have Maven installed):
cd $INDEXER_HOME
mvn clean package
The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib. All JAR dependencies are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script: make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh) and update the properties marked with "CHANGE FOR YOUR ENV" comments according to your environment.
6. Run the app (use JDK 1.8):
./run_indexer.sh
I used Spark Streaming, and it was quite a simple implementation using Scala.
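Whichever framework you pick, the core loop is the same: poll a batch of records from Kafka, index them with a single bulk request, and only commit the offsets once the bulk call succeeds. Below is a minimal sketch assuming the plain Kafka Java client and the Elasticsearch high-level REST client (7.x); the broker address, topic, and index name are placeholders, and the record values are assumed to already be JSON documents:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class KafkaToEsBulkLoader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "es-bulk-loader");
        props.put("enable.auto.commit", "false");           // commit only after a successful bulk
        props.put("max.poll.records", "1000");              // upper bound on the batch size
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                BulkRequest bulk = new BulkRequest();
                for (ConsumerRecord<String, String> record : records) {
                    // assumes the Kafka value is a JSON document ready for indexing
                    bulk.add(new IndexRequest("events-index").source(record.value(), XContentType.JSON));
                }
                BulkResponse response = es.bulk(bulk, RequestOptions.DEFAULT);
                if (!response.hasFailures()) {
                    consumer.commitSync();  // offsets advance only once Elasticsearch accepted the batch
                }
                // on failures you would typically log response.buildFailureMessage() and retry
            }
        }
    }
}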

Something like .hiverc for Hue

I would like to be able to configure Hue/Hive to have a few custom jar files added and a few UDFs created so that the user does not have to do this every time.
Ideally, I am hoping there is a feature similar to the Hive CLI's ".hiverc" file where I can simply put a few HQL statements to do all of this. Does anyone know if Hue has this feature? It looks like it is not using the file $HIVE_HOME/conf/.hiverc.
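For context, the kind of statements such a file would hold looks roughly like this (the jar location, function name, and UDF class are hypothetical):
ADD JAR hdfs:///libs/my-udfs.jar;  -- hypothetical jar location
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf';  -- hypothetical UDF class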
Alternatively, if I could handle both the custom jar file and the UDFs separately, that would be fine too. For example, I'm thinking I could put the jar in $HADOOP_HOME/lib on all of the tasktrackers, and maybe also on Hue's classpath somehow. Not sure, but I don't think this would be too difficult...
But that still leaves the UDFs. It seems like I might be able to modify the Hive source (org.apache.hadoop.hive.ql.exec.FunctionRegistry probably) and compile a custom version of Hive, but I'd really rather not go down that rabbit hole if at all possible.
It looks like this is tracked in this JIRA: https://issues.cloudera.org/browse/HUE-1066 ("[beeswax] Preload jars into the environment").

Including a log4j.properties file in my jar, but different properties file at execution time (optionally)

I want to include a log4j.properties file in my Maven build, but be able to use a different properties file at execution time (running via cron on Unix).
Any ideas?
You want to be able to change properties per environment. There are a number of approaches to address this issue:
1. Create a directory in each environment containing the environment-specific files (log4j.properties in your example) and add that directory to the classpath in each environment.
2. Use Maven's resource filtering together with profiles to populate log4j.properties with the correct values at build time.
3. Use a build server (Jenkins, for example), which essentially does #2 for you.
Each of these approaches has its own drawbacks. I am currently using a somewhat weird combination of 2 and 3 because of Jenkins limitations.
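A concrete variant of option 1, assuming log4j 1.x: keep a default log4j.properties inside the jar and let the cron job point at an external file through the log4j.configuration system property when one is needed (the paths below are placeholders):
# wrapper script invoked from cron: use an external config if present
java -Dlog4j.configuration=file:/etc/myapp/log4j.properties -jar /opt/myapp/myapp.jar
# without the -D flag, log4j falls back to the log4j.properties bundled in the jar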
