I would like to be able to configure hue/hive to have a few custom jar files added and a few UDFs created so that the user does not have to do this every time.
Ideally, there would be a feature similar to the Hive CLI's ".hiverc" file where I could simply put a few HQL statements to do all of this. Does anyone know if Hue has this feature? It looks like it is not using the file $HIVE_HOME/conf/.hiverc.
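For reference, the per-session setup I want preloaded boils down to just a couple of statements. Here is a minimal sketch, run over HiveServer2 JDBC purely for illustration (the connection URL, credentials, jar path, function name, and UDF class are all placeholders for my environment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSessionSetup {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL and credentials -- adjust for your environment.
        String url = "jdbc:hive2://hive-host:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hue", "");
             Statement stmt = conn.createStatement()) {
            // The two statements a .hiverc would normally carry for me:
            // ship the custom jar to the session, then register the UDF from it.
            stmt.execute("ADD JAR /path/to/custom-udfs.jar");                              // hypothetical path
            stmt.execute("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf'");  // hypothetical class
        }
    }
}
```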
Alternatively, if I could handle both the custom jar file and the UDFs separately, that would be fine too. For example, I'm thinking I could put the jar in $HADOOP_HOME/lib on all of the tasktrackers, and maybe also on Hue's classpath somehow. Not sure, but I don't think this would be too difficult...
But that still leaves the UDFs. It seems like I might be able to modify the Hive source (org.apache.hadoop.hive.ql.exec.FunctionRegistry probably) and compile a custom version of Hive, but I'd really rather not go down that rabbit hole if at all possible.
It looks like this is tracked by this jira: HUE-1066, "[beeswax] Preload jars into the environment" (https://issues.cloudera.org/browse/HUE-1066).
Good morning,
I would like to generate an Excel file from Oracle, so I have imported POI 3.16 and all the prerequisites based on the bottom table at this link:
http://poi.apache.org/overview.html#components
Exactly the following files:
commons-logging, commons-codec, commons-collections, log4j, poi.jar
The DBMS command I used:
dbms_java.loadjava('filename.jar -resolve');
Everything went fine, but all the classes under "org/apache/poi/hssf/usermodel/" remained invalid, which is the most important part. :)
Does anybody have an idea what the problem could be? Should I import any other classes? I would first like to try a solution that does not require checking log files on the hard disk or any other action on the server itself. I have no access to the server, so I have to go through the administrators, which is complicated in our company :(. Of course, if there is no solution within Oracle, I have no other option...
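To check the status I only have SQL available; roughly, this is the kind of lookup I can run to see which loaded classes stayed invalid (a sketch over JDBC; connection details are placeholders, and dbms_java.longname just expands Oracle's shortened class names):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ListInvalidJavaClasses {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the schema the jars were loaded into.
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCL";

        // List every Java class in this schema that failed to resolve.
        String sql = "SELECT dbms_java.longname(object_name) AS class_name "
                   + "FROM user_objects "
                   + "WHERE object_type = 'JAVA CLASS' AND status = 'INVALID' "
                   + "ORDER BY 1";

        try (Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("class_name"));
            }
        }
    }
}
```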
Thanks in advance,
Sz.
I'm not too familiar with Kafka, but I would like to know the best way to read data from Kafka in batches so I can use the Elasticsearch Bulk API to load the data faster and more reliably.
Btw, I am using Vert.x for my Kafka consumer.
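To make it concrete, this is roughly the shape I have in mind: a sketch using the plain Apache Kafka consumer (recent kafka-clients) and the Elasticsearch 7.x high-level REST client, with broker, topic, index, and host names as placeholders. With Vert.x I believe the same batching idea applies through its Kafka client's batch handler.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class KafkaToEsBulk {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");   // placeholder broker
        props.put("group.id", "es-indexer");
        props.put("enable.auto.commit", "false");            // commit only after a successful bulk
        props.put("max.poll.records", "1000");               // upper bound on the batch size
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("es-host", 9200, "http")))) {
            consumer.subscribe(Collections.singletonList("events"));   // placeholder topic

            while (true) {
                // Each poll returns up to max.poll.records messages -- that is one batch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }

                BulkRequest bulk = new BulkRequest();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the Kafka message value is already a JSON document.
                    bulk.add(new IndexRequest("my-index").source(record.value(), XContentType.JSON));
                }

                // Only advance the offsets once Elasticsearch has accepted the whole batch.
                if (!es.bulk(bulk, RequestOptions.DEFAULT).hasFailures()) {
                    consumer.commitSync();
                }
            }
        }
    }
}
```

The batch size would then be controlled with max.poll.records, and retrying a failed bulk is just a matter of not committing the offsets.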
Thank you,
I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it, as it was not doing everything I wanted and was also not easily scalable. The 2.0 version is now quite reliable, and we use it in production at our company, processing/indexing 300M+ events per day.
This is not a self-promotion :) - just sharing how we do the same type of work. There might be other options right now as well, of course.
https://github.com/confluentinc/kafka-connect-elasticsearch
Or you can try this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard Jar
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties and update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml
   Specify the directory you want to store logs in, and adjust the max sizes and number of log files as needed.
4. Build the app jar (make sure you have Maven installed):
   cd $INDEXER_HOME
   mvn clean package
   The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib and are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script:
   - make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh)
   - update the properties marked with "CHANGE FOR YOUR ENV" comments according to your environment
6. Run the app [use JDK 1.8]:
   ./run_indexer.sh
I used Spark Streaming, and it was quite a simple implementation in Scala.
I have a .pigbootup file configured which sets DEFAULT_PARALLEL for all of my Pig jobs. The script I am working on doesn't require that many reducers, and I don't want to use that configuration. How can I override/ignore the configuration given in this file?
I found this to be a great way to ignore the .pigbootup file completely. If you want to override individual property values while still using the rest of the existing configuration, then this isn't the right approach, but for my case it worked just fine.
pig -Dpig.load.default.statements=/tmp/.non-existent-pigboot -f test.pig
Reference - https://hadoopified.wordpress.com/2013/02/06/pig-specify-a-default-script/
I'm writing my first Avro job, which is meant to take an Avro file and output text. I tried to reverse-engineer it from this example:
https://gist.github.com/chriswhite199/6755242
I am getting the error below though.
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
I looked around and found that it is likely an issue with which jar files are being used. I'm running CDH4 with MR1 and am using the jar files below:
avro-tools-1.7.5.jar
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I can't post code for security reasons, but it shouldn't need anything not used in the example code. I don't have Maven set up yet either, so I can't follow those routes. Is there something else I can try to get around this issue?
Try using Avro 1.7.3; this is the AVRO-1170 bug. The "found interface ... but class was expected" error generally means the Avro MapReduce support on your classpath was compiled against a different major MapReduce API (MR1 vs. MR2) than the one you are running.
When you use pigServer.registerFunction, you're not supposed to explicitly call pigServer.registerJar, but rather have Pig automatically detect the jar using JarManager.findContainingJar.
However, we have a complex UDF whose class depends on classes from multiple other jars, so we created a jar-with-dependencies with the maven-assembly-plugin. But this causes the entire jar to end up in pigContext.skipJars (as it contains pig.jar itself) and not be sent to the Hadoop cluster :(
What's the correct approach here? Must we manually call registerJar for every jar we depend on?
Not sure what the certified way is, but here are some pointers:
when you use pigServer.registerFunction, Pig automatically detects the jar that contains the UDF and sends it to the JobTracker
Pig also automatically detects the jar that contains the PigMapReduce class (JarManager.createJar) and extracts from it only the classes under org/apache/pig, org/antlr/runtime, etc., sending those to the JobTracker as well
so, if your UDF sits in the same jar as PigMapReduce, you're screwed, because it won't get sent
Our conclusion: don't use jar-with-dependencies; register each dependency jar explicitly instead (see the sketch below).
HTH
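For completeness, a minimal sketch of what the explicit route looks like with the PigServer API (the jar paths, UDF class, and query are placeholders):

```java
import org.apache.pig.ExecType;
import org.apache.pig.FuncSpec;
import org.apache.pig.PigServer;

public class RegisterUdfExample {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

        // Ship each thin dependency jar explicitly instead of one fat jar.
        pigServer.registerJar("/path/to/dependency-a.jar");   // placeholder paths
        pigServer.registerJar("/path/to/dependency-b.jar");

        // The jar containing the UDF itself is picked up automatically
        // (via JarManager.findContainingJar) when the function is registered.
        pigServer.registerFunction("myUdf", new FuncSpec("com.example.pig.MyUdf"));

        pigServer.registerQuery("a = LOAD '/input' AS (line:chararray);");
        pigServer.registerQuery("b = FOREACH a GENERATE myUdf(line);");
        pigServer.store("b", "/output");
    }
}
```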