How can I batch Kafka reads to Elasticsearch - elasticsearch

I'm not too familiar with Kafka but I would like to know what's the best way to
read data in batches from Kafka so I can use Elasticsearch Bulk Api to load the data faster and reliably.
Btw, am using Vertx for my Kafka consumer
Thank you,

I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it as it was not doing everything I wanted, and was also not easily scalable. Now the 2.0 version is quite reliable and we use it in production in our company processing/indexing 300M+ events per day.
This is not a self-promotion :) - just sharing how we do the same type of work. There might be other options right now as well, of course.

https://github.com/confluentinc/kafka-connect-elasticsearch
Or You can try this source
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard Jar
**1. Download the code into a $INDEXER_HOME dir.
**2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties file - update all relevant properties as explained in the comments
**3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml
specify directory you want to store logs in:
adjust values of max sizes and number of log files as needed
**4. build/create the app jar (make sure you have MAven installed):
cd $INDEXER_HOME
mvn clean package
The kafka-es-indexer-2.0.jar will be created in the $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib. All JAR dependencies are linked via kafka-es-indexer-2.0.jar manifest.
**5. edit your $INDEXER_HOME/run_indexer.sh script: -- make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh) -- update properties marked with "CHANGE FOR YOUR ENV" comments - according to your environment
**6. run the app [use JDK1.8] :
./run_indexer.sh

I used spark streaming and the it was quite a simple implementation using Scala.

Related

Custom scheduler for Hadoop

I want to create a replica of the existing fair / capacity scheduler in Hadoop and then make some changes. Any idea anybody? Please help.
I downloaded the hadoop-2.6.0-src source code, and see the FairScheduler.java source at the following path:
\hadoop-2.6.0-src\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-resourcemanager\src\main\java\org\apache\hadoop\yarn\server\resourcemanager\scheduler\fair
But, guess there are many related class files. Do I have to copy all of them, rename and create a new project? Please help as I am not a java guy.
Manisha, you could try to extend the already existent FairScheduler class, creating a MyFairScheduler and adding those changes you talk about in the required methods. Then, the new class can be build and packaged into a jar file that must be put into the classpath in order it can be used.

Using DSE Hadoop/Mahout, NoClassDef of org.w3c.dom.Document

Trying to run a simple hadoop job, but hadoop is throwing a NoClassDef on "org/w3c/dom/Document"
I'm trying to run the basic examples from the "Mahout In Action" book (https://github.com/tdunning/MiA).
I do this using nearly the same maven setup but tooled for cassandra use rather than a file data model.
But, when I try to run the *-job.jar, it spits a NoClassDef from the datastax/hadoop end.
I'm using 1.0.5-dse of the driver as that's the only one that supports the current DSE version of Cassandra(1.2.1) if that helps at all though the issue seems to be deeper.
Attached is a gist with more info included.
There is the maven file, this brief overview, and the console output.
https://gist.github.com/zmarcantel/8d56ae4378247bc39be4
Thanks
try dropping the jar file for class of org.w3c.dom.Document to $DSE/resource/hadoop/lib/ folder as a work around.

How to install applications to a WebSphere 7.0 cluster using wsadmin?

I want to deploy to all four processes on a Websphere cluster with two nodes. Is there a way of doing this with one Jython command or do I have to call 'AdminControl.invoke' on each one?
Easiest way to install an application using wsadmin is with AdminApp and not AdminControl.
I suggest you download wsadminlib.py (Got the link from here)
it has a lot of functions, one of them is installApplication which works also with cluster.
Edit:
Lately I found out about AdminApplication which is a script library included in WAS 7 (/opt/IBM/WebSphere/AppServer/scriptLibraries/application/V70)
The docuemntation is not great in the info center but its a .py file you can look inside to see what it does.
It is imported automatically to wsadmin and you can use it without any imports or other configuration.
Worth a check.
#aviram-segal is right, wsadminlib is really helpful for this.
I use the following syntax:
arg = ["-reloadEnabled", "-reloadInterval '0'", "-cell "+self.cellName, "-node "+self.nodeName, "-server '"+ self.serverName+"'", "-appname "+ name, '-MapWebModToVH',[['.*', '.*', self.virtualHost]]]
AdminApp.install(path, arg)
Where path is the location of your EAR/WAR file.
You can find documentation here

Desktop SPARQL client for Jena (TDB)?

I'm working on an app that uses Jena for storage (with the TDB backend). I'm looking for something like the equivalent of Squirrel, that lets me see what's being stored, run queries etc. This seems like an obvious thing to need, but my (perhaps badly phrased) google queries aren't turning up anything promising.
Any suggestions, please? I'm on XP. Even a command line tool would be helpful.
Take a look at my Store Manager tool which is part of the dotNetRDF Toolkit which I develop as part of the wider dotNetRDF project I maintain.
It provides a fairly basic GUI through which you can connect to various Triple Stores including TDB provided that you expose your dataset via Joseki/Fuseki. You need to have .Net 3.5 installed to run the apps in the toolkit.
If you don't already expose your TDB dataset via HTTP try using Fuseki as it is ridiculously easy to use and can be run just on your local machine when necessary to make your TDB store available via HTTP for use with my tool e.g.
java -jar fuseki-0.1.0-server.jar --update --loc data /dataset
Please see the Fuseki wiki for more information on running Fuseki and the various options. In the above example Fuseki is run with SPARQL Update enabled (the --update flag), using the TDB dataset located in the directory data (the --loc data argument) and with a base URI of /dataset for the data.
Once running you can use my tool to connect to a Fuseki server by going to File > New Generic Store Manager, selecting the "Fuseki" tab from the dialog that appears, entering the URI http://localhost:3030/dataset/data and then clicking "Connect to Fuseki".
Twinkle is a handy SPARQL client : http://www.ldodds.com/projects/twinkle/
As it happens I'm working on something similar myself, but it still needs a lot of work (check back in a month :) http://hyperdata.org/wiki/Scute
first download jena fusaki from
https://jena.apache.org/download/index.cgi
un-zip the file and copy the "jena-fuseki-1.0.1" to c drive
open cmd
type for accesing the folder
"cd C:\jena-fuseki-1.0.1"
then type
"java -jar fuseki-server.jar --update --loc data /dataset"
at last open a browser and type
"localhost:3030/"
remember you must first declear the enviorment verible(located in system poperties then advance tab)
and edit variable name call "Path" in the "System verible" to
"C:\jena-fuseki-1.0.1"
I also develop a SPARQL client, Open Source in Java Swing: EulerGUI.
In fact it does a lot more, see the manual:
http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html
For the SPARQL feature, better take the EulerGUI minimal build:
http://sourceforge.net/projects/eulergui/files/eulergui/1.11/

Need help to find out why Websphere Application Server has many .lck files

The file names seesm to point to our WAS data sources. However, we're not sure what is creating them and why there are so many. The servers didn't seem to crash. Why is WAS 6.1.0.23 creating these andy why aren't they being cleaned?
There are many files like these, with some going up to xxx.43.lck
DWSqlLog0.0.lck
DWSqlLog0.0
TritonSqlLog0.0.lck
TritonSqlLog0.0
JTSqlLog0.0
JTSqlLog0.1
JTSqlLog0.3
JTSqlLog0.2
JTSqlLog0.4.lck
JTSqlLog0.4
JTSqlLog0.3.lck
JTSqlLog0.2.lck
JTSqlLog0.1.lck
JTSqlLog0.0.lck
WAS uses JDK Logging and JDK logger creates such files with extension .0,.1 etc along with the .lck file so that the WAS runtime has a lock to these files that it writes to.
Cheers
Manglu

Resources