Writing data from JavaDStream<String> in Apache spark to elasticsearch - elasticsearch

I am working on programming to process data from Apache kafka to elasticsearch. For that purpose I am using Apache Spark. I have gone through many link but unable to find example to write data from JavaDStream in Apache spark to elasticsearch.
Below is sample code of spark which gets data from kafka and prints it.
import org.apache.log4j.Logger;
import org.apache.log4j.Level;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import com.google.common.collect.ImmutableMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.List;
public class SparkStream {
public static JavaSparkContext sc;
public static List<Map<String, ?>> alldocs;
public static void main(String args[])
{
if(args.length != 2)
{
System.out.println("SparkStream <broker1-host:port,broker2-host:port><topic1,topic2,...>");
System.exit(1);
}
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkConf sparkConf=new SparkConf().setAppName("Data Streaming");
sparkConf.setMaster("local[2]");
sparkConf.set("es.index.auto.create", "true");
sparkConf.set("es.nodes","localhost");
sparkConf.set("es.port","9200");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet=new HashSet<>(Arrays.asList(args[1].split(",")));
Map<String,String> kafkaParams=new HashMap<>();
String brokers=args[0];
kafkaParams.put("metadata.broker.list",brokers);
kafkaParams.put("auto.offset.reset", "largest");
kafkaParams.put("offsets.storage", "zookeeper");
JavaPairDStream<String, String> messages=KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
#Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
lines.print();
jssc.start();
jssc.awaitTermination();
}
}
`

One method to save to elastic search is using the saveToEs method inside a foreachRDD function. Any other method you wish to use would still require the foreachRDD call to your dstream.
For example:
lines.foreachRDD(lambda rdd: rdd.saveToEs("ESresource"))
See here for more

dstream.foreachRDD{rdd=>
val es = sqlContext.createDataFrame(rdd).toDF("use headings suitable for your dataset")
import org.elasticsearch.spark.sql._
es.saveToEs("wordcount/testing")
es.show()
}
In this code block "dstream" is the data stream which observe data from server like kafka. Inside brackets of "toDF()" you have to use headings. In "saveToES()" you have use elasticsearch index. Before this you have create SQLContext.
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
If you are using kafka to send data you have to add dependency mentioned below
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
Get the dependency
To see full example see
In this example first you have to create kafka producer "test" then start elasticsearch
After run the program. You can see full sbt and code using above url.

Related

Call BigQuery stored procedure(Routine) using spring boot

I'm trying to call a Google BigQuery stored procedure (Routine) using Spring boot. I tried all the methods of the routines to extract data. However, it didn't help.
Has anyone ever created and called a BigQuery stored procedure (Routine) through the Spring boot? If so, how?
public static Boolean executeInsertQuery(String query, TableId tableId, String jobName) {
log.info("Starting {} truncate query", jobName);
BigQuery bigquery = GCPConfig.getBigQuery(); // bqClient
// query configuration
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query)
.setUseLegacySql(false)
.setAllowLargeResults(true)
.setDestinationTable(tableId) .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE).build();
try {
// build the query job.
QueryJob queryJob = new QueryJob.Builder(queryConfig).bigQuery(bigquery).jobName(jobName).build();
QueryJob.Result result = queryJob.execute();
} catch (JobException e) {
log.error("{} unsuccessful. job id: {}, job name: {}. exception: {}", jobName, e.getJobId(),
e.getJobName(), e.toString());
return false;
}
}
package ops.google.com;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryError;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.EncryptionConfiguration;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;
import com.google.common.collect.ImmutableList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.auth.oauth2.ServiceAccountCredentials;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
public class SelectFromBigQueryFunction {
private static final Logger logger = LogManager.getLogger(SelectFromBigQueryFunction.class);
public boolean tableSelectFromJoin(String key_path) {
String projectID = "ProjectID";
String datasetName = "DataSetName";
String tableName1 = "sample_attribute_type";
String tableName2 = "sample_attribute_value";
boolean status = false;
try {
//Call BQ Function/Routines, functinon name->bq_function_name
//String query = String.format("SELECT DataSetName.bq_function_name(1, 1)");
//Call BQ Stored Procedure, procedure name-> bq_stored_procedure_name
String query = String.format("CALL DataSetName.bq_stored_procedure_name()");
File credentialsPath = new File(key_path);
FileInputStream serviceAccountStream = new FileInputStream(credentialsPath);
GoogleCredentials credentials = ServiceAccountCredentials.fromStream(serviceAccountStream);
// Initialize client that will be used to send requests. This client only needs to be created
BigQuery bigquery = BigQueryOptions.newBuilder()
.setProjectId(projectID)
.setCredentials(credentials)
.build().getService();
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();
TableResult results = bigquery.query(queryConfig);
results.iterateAll().forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));
logger.info("Query performed successfully with encryption key.");
status = true;
} catch (BigQueryException | InterruptedException e) {
logger.error("Query not performed \n" + e.toString());
}catch(Exception e){
logger.error("Some Exception \n" + e.toString());
}return status;
}
}

Pig store fails to run store command when it is called using java code (embedded mode)

I am learning Hadoop, I tried running my pig script using java but it seems like it skips the store command written in the script and does not produce the output data file at the particular location.
But when I try running the pig script using command line, it gives the output data file as desired.
First I thought that java might have some permission issues that it is not creating a file. But I tried creating a file in the exact location using java and it easily creates an empty file. So it doesn't seem to be a permission issue.
Can anybody tell me what is the issue here that pig script runs successfully when used through command line but fails in the embedded mode?
Java Code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.pig.PigServer;
import java.io.IOException;
public class storePig {
public static void main(String args[]) throws Exception {
try {
PigServer pigServer = new PigServer("local");
runQuery(pigServer);
}catch (Exception e){
e.printStackTrace();
}
}
public static void runQuery(PigServer pigServer) throws IOException {
pigServer.registerScript("/home/anusharma/Desktop/stackoverflow/sampleScript.pig");
}
}
Pig-Script:
Employee = LOAD '/home/anusharma/Desktop/Hadoop/Pig/record.txt' using PigStorage(',') as (id:int, firstName:chararray, lastName:chararray, age:int, contact:chararray, city:chararray);
Employe = ORDER Employee BY age desc;
limitedEmployee = LIMIT Employe 4;
STORE limitedEmployee into '/home/anusharma/Desktop/stackoverflow/output' using
PigStorage('|');

Access Data from REST API in HIVE

Is there a way to create a hive table where the location for that hive table will be a http JSON REST API? I don't want to import the data every time in HDFS.
I had encountered similar situation in a project couple of years ago. This is the sort of low-key way of ingesting data from Restful to HDFS and then you use Hive analytics to implement the business logic.I hope you are familiar with core Java, Map Reduce (if not you might look into Hortonworks Data Flow, HDF which is a product of Hortonworks).
Step 1: Your data ingestion workflow should not be tied to your Hive workflow that contains business logic. This should be executed independently in timely manner based on your requirement (volume & velocity of data flow) and monitored regularly. I am writing this code on a text editor. WARN: It's not compiled or tested!!
The code below is using a Mapper which would take in the url or tweak it to accept the list of urls from the FS. The payload or requested data is stored as text file in the specified job output directory (forget the structure of data this time).
Mapper Class:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class HadoopHttpClientMap extends Mapper<LongWritable, Text, Text, Text> {
private int file = 0;
private String jobOutDir;
private String taskId;
#Override
protected void setup(Context context) throws IOException,InterruptedException {
super.setup(context);
jobOutDir = context.getOutputValueClass().getName();
taskId = context.getJobID().toString();
}
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
Path httpDest = new Path(jobOutDir, taskId + "_http_" + (file++));
InputStream is = null;
OutputStream os = null;
URLConnection connection;
try {
connection = new URL(value.toString()).openConnection();
//implement connection timeout logics
//authenticate.. etc
is = connection.getInputStream();
os = FileSystem.getLocal(context.getConfiguration()).create(httpDest,true);
IOUtils.copyBytes(is, os, context.getConfiguration(), true);
} catch(Throwable t){
t.printStackTrace();
}finally {
IOUtils.closeStream(is);
IOUtils.closeStream(os);
}
context.write(value, null);
//context.write(new Text (httpDest.getName()), new Text (os.toString()));
}
}
Mapper Only Job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class HadoopHttpClientJob {
private static final String data_input_directory = “YOUR_INPUT_DIR”;
private static final String data_output_directory = “YOUR_OUTPUT_DIR”;
public HadoopHttpClientJob() {
}
public static void main(String... args) {
try {
Configuration conf = new Configuration();
Path test_data_in = new Path(data_input_directory, "urls.txt");
Path test_data_out = new Path(data_output_directory);
#SuppressWarnings("deprecation")
Job job = new Job(conf, "HadoopHttpClientMap" + System.currentTimeMillis());
job.setJarByClass(HadoopHttpClientJob.class);
FileSystem fs = FileSystem.get(conf);
fs.delete(test_data_out, true);
job.setMapperClass(HadoopHttpClientMap.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, test_data_in);
FileOutputFormat.setOutputPath(job, test_data_out);
job.waitForCompletion(true);
}catch (Throwable t){
t.printStackTrace();
}
}
}
Step 2: Create external table in Hive based on the HDFS directory. Remember to use Hive SerDe for the JSON data (in your case) then you can copy the data from external table into managed master tables. This is the step where you implement your incremental logics, compression..
Step 3: Point your hive queries (which you might have already created) to the master table to implement your business needs.
Note: If you are supposedly referring to realtime analysis or streaming api, you might have to change your application's architecture. Since you have asked architectural question, I am using my best educated guess to support you. Please go through this once. If you feel you can implement this in your application then you can ask the specific question, I will try my best to address them.

Hbase mapside join- One of the tables is not getting read? read from hbase and right result into hbase

I am trying to do mapside join of two tables located in Hbase. My aim is to keep record of the small table in hashmap and compare with the big table, and once matched, write record in a table in hbase again. I wrote the similar code for join operation using both Mapper and Reducer and it worked well and both tables are scanned in mapper class. But since reduce side join is not efficient at all, I want to join the tables in mapper side only. In the following code "commented if block" is just to see that it returns false always and first table (small one) is not getting read. Any hints helps are appreciated. I am using sandbox of HDP.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
//import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.util.Tool;
import com.sun.tools.javac.util.Log;
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
public class JoinDriver extends Configured implements Tool {
static int row_index = 0;
public static class JoinJobMapper extends TableMapper<ImmutableBytesWritable, Put> {
private static byte[] big_table_bytarr = Bytes.toBytes("big_table");
private static byte[] small_table_bytarr = Bytes.toBytes("small_table");
HashMap<String,String> myHashMap = new HashMap<String, String>();
byte[] c1_value;
byte[] c2_value;
String big_table;
String small_table;
String big_table_c1;
String big_table_c2;
String small_table_c1;
String small_table_c2;
Text mapperKeyS;
Text mapperValueS;
Text mapperKeyB;
Text mapperValueB;
public void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
TableSplit currentSplit = (TableSplit) context.getInputSplit();
byte[] tableName = currentSplit.getTableName();
try {
Put put = new Put(Bytes.toBytes(++row_index));
// put small table into hashmap - myhashMap
if (Arrays.equals(tableName, small_table_bytarr)) {
c1_value = columns.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c1"));
c2_value = columns.getValue(Bytes.toBytes("s_cf"), Bytes.toBytes("s_cf_c2"));
small_table_c1 = new String(c1_value);
small_table_c2 = new String(c2_value);
mapperKeyS = new Text(small_table_c1);
mapperValueS = new Text(small_table_c2);
myHashMap.put(small_table_c1,small_table_c2);
} else if (Arrays.equals(tableName, big_table_bytarr)) {
c1_value = columns.getValue(Bytes.toBytes("b_cf"), Bytes.toBytes("b_cf_c1"));
c2_value = columns.getValue(Bytes.toBytes("b_cf"), Bytes.toBytes("b_cf_c2"));
big_table_c1 = new String(c1_value);
big_table_c2 = new String(c2_value);
mapperKeyB = new Text(big_table_c1);
mapperValueB = new Text(big_table_c2);
// if (set.containsKey(big_table_c1)){
put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c1"), Bytes.toBytes(big_table_c1));
context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put );
put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c2"), Bytes.toBytes(big_table_c2));
context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put );
put.addColumn(Bytes.toBytes("join"), Bytes.toBytes("join_c3"),Bytes.toBytes((myHashMap.get(big_table_c1))));
context.write(new ImmutableBytesWritable(mapperKeyB.getBytes()), put );
// }
}
} catch (Exception e) {
// TODO : exception handling logic
e.printStackTrace();
}
}
}
public int run(String[] args) throws Exception {
List<Scan> scans = new ArrayList<Scan>();
Scan scan1 = new Scan();
scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("small_table"));
System.out.println(scan1.getAttribute("scan.attributes.table.name"));
scans.add(scan1);
Scan scan2 = new Scan();
scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("big_table"));
System.out.println(scan2.getAttribute("scan.attributes.table.name"));
scans.add(scan2);
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJar("MSJJ.jar");
job.setJarByClass(JoinDriver.class);
TableMapReduceUtil.initTableMapperJob(scans, JoinJobMapper.class, ImmutableBytesWritable.class, Put.class, job);
TableMapReduceUtil.initTableReducerJob("joined_table", null, job);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
JoinDriver runJob = new JoinDriver();
runJob.run(args);
}
}
By reading your problem statement I believe you have got some wrong idea about uses of Multiple HBase table input.
I suggest you load small table in a HashMap, in setup method of mapper class. Then use map only job on big table, in map method you can fetch corresponding values from the HashMap which you loaded earlier.
Let me know how this works out.

Load Data into Hbase outside Client Node

Thanks in advance.
We are loading data into Hbase using Java. It's pretty straight and works fine when we run the program on the client node (edge node). But we want to run this program remotely (outside the hadoop cluster) within our network to load the data.
Is there anything required to do this in terms of security on the hadoop cluster? When I run the program outside the cluster it's hanging..
Please advise. Greatly appreciate your help.
Thanks
Code here
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import com.dev.stp.cvsLoadEventConfig;
import com.google.protobuf.ServiceException;
public class LoadData {
static String ZKHost;
static String ZKPort;
private static Configuration config = null;
private static String tableName;
public LoadData (){
//Set Application Config
LoadDataConfig conn = new LoadDataConfig();
ZKHost = conn.getZKHost();
ZKPort = conn.getZKPort();
config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", ZKHost);
config.set("hbase.zookeeper.property.clientPort", ZKPort);
config.set("zookeeper.znode.parent", "/hbase-unsecure");
tableName = "E_DATA";
}
//Insert Record
try {
HTable table = new HTable(config, tableName);
Put put = new Put(Bytes.toBytes(eventId));
put.add(Bytes.toBytes("E_DETAILS"), Bytes.toBytes("E_NAME"),Bytes.toBytes("test data 1"));
put.add(Bytes.toBytes("E_DETAILS"), Bytes.toBytes("E_TIMESTAMP"),Bytes.toBytes("test data 2"));
table.put(put);
table.close();
} catch (IOException e) {
e.printStackTrace();
}
}

Resources