I want to create a replica of the existing fair/capacity scheduler in Hadoop and then make some changes to it. Does anyone have any ideas? Please help.
I downloaded the hadoop-2.6.0-src source code and found the FairScheduler.java source at the following path:
\hadoop-2.6.0-src\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-resourcemanager\src\main\java\org\apache\hadoop\yarn\server\resourcemanager\scheduler\fair
But I guess there are many related class files. Do I have to copy all of them, rename them, and create a new project? Please help, as I am not a Java person.
Manisha, you could try extending the existing FairScheduler class: create a MyFairScheduler and add the changes you mention by overriding the required methods. The new class can then be built and packaged into a jar file, which must be put on the classpath so that it can be used.
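As a rough illustration (the package name and the overridden method below are just placeholders, not a prescription), the subclass could look something like this:

// Minimal sketch (hypothetical package name): reuse the stock FairScheduler
// and override only the hooks whose behaviour you want to change.
package com.example.scheduler;

import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;

public class MyFairScheduler extends FairScheduler {

    @Override
    public void handle(SchedulerEvent event) {
        // Custom logic goes here, then fall back to the stock behaviour.
        super.handle(event);
    }
}

Then point the ResourceManager at the new class by setting yarn.resourcemanager.scheduler.class to com.example.scheduler.MyFairScheduler in yarn-site.xml, with the jar on the ResourceManager's classpath.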
I'm not too familiar with Kafka, but I would like to know the best way to read data in batches from Kafka so I can use the Elasticsearch Bulk API to load the data faster and reliably.
By the way, I am using Vert.x for my Kafka consumer.
Thank you,
I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it, as it was not doing everything I wanted and was also not easily scalable. The 2.0 version is now quite reliable, and we use it in production at our company, processing/indexing 300M+ events per day.
This is not self-promotion :) - just sharing how we do the same type of work. There may be other options available now as well, of course.
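Whichever framework you end up using, the core loop is the same: poll a batch of records from Kafka and index the whole batch with a single bulk request, committing offsets only once the bulk call succeeds. A rough sketch (broker address, topic and index names are made up, and it is not Vert.x-specific), using the plain Kafka consumer and the Elasticsearch high-level REST client:

// Rough sketch: batch-consume from Kafka and bulk-index into Elasticsearch.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class KafkaToEsBulk {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // hypothetical broker
        props.put("group.id", "es-indexer");
        props.put("enable.auto.commit", "false");                // commit manually after the bulk succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            consumer.subscribe(Collections.singletonList("events"));   // hypothetical topic

            while (true) {
                // Each poll returns a batch of records; tune max.poll.records for batch size.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }

                BulkRequest bulk = new BulkRequest();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the Kafka message value is already a JSON document.
                    bulk.add(new IndexRequest("events-index").source(record.value(), XContentType.JSON));
                }

                es.bulk(bulk, RequestOptions.DEFAULT);
                consumer.commitSync();   // offsets advance only after a successful bulk index
            }
        }
    }
}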
https://github.com/confluentinc/kafka-connect-elasticsearch
Or you can try this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard Jar
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties - update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml - specify the directory you want to store logs in, and adjust the max sizes and number of log files as needed.
4. Build the app jar (make sure you have Maven installed):
cd $INDEXER_HOME
mvn clean package
The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib. All jar dependencies are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script: make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh) and update the properties marked with "CHANGE FOR YOUR ENV" comments according to your environment.
6. Run the app [use JDK 1.8]:
./run_indexer.sh
I used Spark Streaming, and it was quite a simple implementation using Scala.
I'm following the steps from the Adobe instructions on How to Build AEM Projects using Maven, and I'm not seeing how to populate or configure the metadata for the content.
I can edit and configure the actual files, but when I push the zip file to the CQ instance, the installed component has a jcr:primaryType of nt:folder, while the item I'm trying to duplicate has a jcr:primaryType of cq:Component (as well as many other properties). So is there a way to populate that data without needing to manually interact with CQ?
I'm very new to AEM, so it's entirely possible I've overlooked something very simple.
Yes, it is possible to configure JCR node types without manually changing them in CQ.
Make sure you have a .content.xml file in the component folder and that it contains the correct jcr:primaryType (e.g. jcr:primaryType="cq:Component"); a minimal example is shown below.
This file contains the metadata used to map the JCR node onto the file system.
For beginners it may be useful to take a look at VLT, which is used to import/export JCR content to and from the file system. The component's files in your project should look similar to the result of a VLT export of that component from the JCR.
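As an illustration (the title and group values are made up), a minimal .content.xml for a component could look roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:cq="http://www.day.com/jcr/cq/1.0"
          xmlns:jcr="http://www.jcp.org/jcr/1.0"
          jcr:primaryType="cq:Component"
          jcr:title="My Component"
          componentGroup="My Group"/>

When the package is installed, this tells CQ to create the node as cq:Component instead of falling back to nt:folder.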
I would like to be able to configure Hue/Hive to have a few custom jar files added and a few UDFs created so that the user does not have to do this every time.
Ideally, I am hopeful that there might be a feature similar to the Hive CLI's ".hiverc" file where I can simply put a few HQL statements to do all of this. Does anyone know if Hue has this feature? It looks like it is not using the file $HIVE_HOME/conf/.hiverc.
Alternatively, if I could handle both the custom jar file and the UDFs separately, that would be fine too. For example, I'm thinking I could put the jar in $HADOOP_HOME/lib on all of the tasktrackers, and maybe also on Hue's classpath somehow. Not sure, but I don't think this would be too difficult...
But that still leaves the UDFs. It seems like I might be able to modify the Hive source (org.apache.hadoop.hive.ql.exec.FunctionRegistry probably) and compile a custom version of Hive, but I'd really rather not go down that rabbit hole if at all possible.
It looks like this is tracked by this JIRA: https://issues.cloudera.org/browse/HUE-1066 ("[beeswax] Preload jars into the environment").
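For reference, the statements that would need to be preloaded (and that users otherwise run per session today) are the usual ones; the jar path and class name below are hypothetical:

ADD JAR /path/to/custom-udfs.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUDF';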
When you use pigServer.registerFunction, you're not supposed to explicitly call pigServer.registerJar, but rather have Pig automatically detect the jar using JarManager.findContainingJar.
However, we have a complex UDF whose class depends on classes from multiple other jars, so we created a jar-with-dependencies with the maven-assembly-plugin. But this causes the entire jar to end up in pigContext.skipJars (as it contains pig.jar itself) and not be sent to the Hadoop server :(
What's the correct approach here? Must we manually call registerJar for every jar we depend on?
Not sure what the certified way is, but here are some pointers:
When you use pigServer.registerFunction, Pig automatically detects the jar that contains the UDFs and sends it to the JobTracker.
Pig also automatically detects the jar that contains the PigMapReduce class (JarManager.createJar), extracts from it only the classes under packages such as org/apache/pig and org/antlr/runtime, and sends those to the JobTracker as well.
So if your UDF sits in the same jar as PigMapReduce, you're screwed, because it won't get sent.
Our conclusion: don't use a jar-with-dependencies; keep the UDF in its own slim jar and register the dependency jars explicitly, as sketched below.
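A minimal sketch of that setup (all paths, class names and the query are hypothetical):

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.FuncSpec;
import org.apache.pig.PigServer;

public class RegisterExample {
    public static void main(String[] args) throws IOException {
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

        // Dependency jars of the UDF: registered explicitly so they get shipped.
        pigServer.registerJar("/path/to/dependency-a.jar");
        pigServer.registerJar("/path/to/dependency-b.jar");

        // The UDF itself lives in its own slim jar, which Pig locates
        // automatically via JarManager.findContainingJar.
        pigServer.registerFunction("myUdf", new FuncSpec("com.example.pig.MyUdf"));

        pigServer.registerQuery("A = LOAD 'input' AS (line:chararray);");
        pigServer.registerQuery("B = FOREACH A GENERATE myUdf(line);");
        pigServer.store("B", "output");
    }
}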
HTH
The file names seem to point to our WAS data sources. However, we're not sure what is creating them or why there are so many. The servers didn't seem to crash. Why is WAS 6.1.0.23 creating these, and why aren't they being cleaned up?
There are many files like these, with some going up to xxx.43.lck
DWSqlLog0.0.lck
DWSqlLog0.0
TritonSqlLog0.0.lck
TritonSqlLog0.0
JTSqlLog0.0
JTSqlLog0.1
JTSqlLog0.3
JTSqlLog0.2
JTSqlLog0.4.lck
JTSqlLog0.4
JTSqlLog0.3.lck
JTSqlLog0.2.lck
JTSqlLog0.1.lck
JTSqlLog0.0.lck
WAS uses JDK logging, and the JDK logger (java.util.logging.FileHandler) creates such files with extensions .0, .1, etc. along with a .lck file, so that the runtime holds a lock on the files it is writing to.
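A small sketch (the file pattern and sizes are made up) showing how java.util.logging produces exactly this kind of file layout:

import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;

public class LockFileDemo {
    public static void main(String[] args) throws IOException {
        Logger logger = Logger.getLogger("JTSqlLog");

        // Pattern "JTSqlLog%u.%g": %u resolves clashes between processes,
        // %g is the rotation (generation) number. With a 1 KB limit and
        // 5 generations this produces JTSqlLog0.0 ... JTSqlLog0.4, plus a
        // matching .lck lock file while the handler has the file open.
        FileHandler handler = new FileHandler("JTSqlLog%u.%g", 1024, 5);
        logger.addHandler(handler);

        for (int i = 0; i < 1000; i++) {
            logger.info("sample log entry " + i);
        }

        // The .lck file is deleted when the handler is closed cleanly;
        // if the JVM stops without closing it, the .lck file is left behind.
        handler.close();
    }
}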
Cheers
Manglu