Perform aggregation in Elasticsearch index with Spark in Java - elasticsearch

I want to prepare a Java class that will read an index from Elasticsearch, perform aggregations using Spark and then write the results back to Elasticsearch. The target schema (in the form of StructType) is the same as the source one. My code is as follows
SparkConf conf = new SparkConf().setAppName("Aggregation").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaPairRDD<String, Map<String, Object>> pairRDD = JavaEsSpark.esRDD(sc, "kpi_aggregator/record");
RDD rdd = JavaPairRDD.toRDD(pairRDD);
Dataset df = sqlContext.createDataFrame(rdd, customSchema);
df.registerTempTable("data");
Dataset kpi1 = sqlContext.sql("SELECT host, SUM(bytes_uplink), SUM(bytes_downlink) FROM data GROUP BY host");
JavaEsSparkSQL.saveToEs(kpi1, "kpi_aggregator_total/record");
I am using the latest versions of spark-core_2.11 and elasticsearch-spark-20_2.11. The previous code results in the following exception:
java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.sql.Row
Any ideas what I am doing wrong?

You get this exception because sqlContext.createDataFrame(rdd, customSchema) with a StructType expects an RDD of Row objects, but you pass it the result of JavaPairRDD.toRDD(pairRDD), which is an RDD<Tuple2<String, Map<String, Object>>>. You have to map your JavaPairRDD<String, Map<String, Object>> to a JavaRDD of a bean class (CustomSchemaBean below) and build the DataFrame from that bean class:
SparkConf conf = new SparkConf().setAppName("Aggregation").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CustomSchemaBean> rdd = JavaEsSpark.esRDD(sc, "kpi_aggregator/record")
        .map(tuple2 -> {
            // transform Tuple2<String, Map<String, Object>> into a CustomSchemaBean
            return new CustomSchemaBean(????);
        });
Dataset<Row> df = sqlContext.createDataFrame(rdd, CustomSchemaBean.class);
df.registerTempTable("data");
Dataset kpi1 = sqlContext.sql("SELECT host, SUM(bytes_uplink), SUM(bytes_downlink) FROM data GROUP BY host");
JavaEsSparkSQL.saveToEs(kpi1, "kpi_aggregator_total/record");
Notice that I used JavaRDD rather than RDD; both approaches are legal.
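For completeness, here is a minimal sketch of what the bean and the mapping could look like. It assumes each Elasticsearch document carries host, bytes_uplink and bytes_downlink fields (the names used in the SQL above); the bean itself is hypothetical and should be adapted to your actual mapping:
public class CustomSchemaBean implements java.io.Serializable {
    private String host;
    private long bytes_uplink;
    private long bytes_downlink;

    public CustomSchemaBean(String host, long bytes_uplink, long bytes_downlink) {
        this.host = host;
        this.bytes_uplink = bytes_uplink;
        this.bytes_downlink = bytes_downlink;
    }

    // Getters/setters are required so Spark can derive the columns from the bean.
    public String getHost() { return host; }
    public void setHost(String host) { this.host = host; }
    public long getBytes_uplink() { return bytes_uplink; }
    public void setBytes_uplink(long bytes_uplink) { this.bytes_uplink = bytes_uplink; }
    public long getBytes_downlink() { return bytes_downlink; }
    public void setBytes_downlink(long bytes_downlink) { this.bytes_downlink = bytes_downlink; }
}
The mapping then reads those fields out of the document source (the second element of each tuple):
JavaRDD<CustomSchemaBean> rdd = JavaEsSpark.esRDD(sc, "kpi_aggregator/record")
        .map(tuple2 -> {
            Map<String, Object> source = tuple2._2();
            return new CustomSchemaBean(
                    (String) source.get("host"),
                    ((Number) source.get("bytes_uplink")).longValue(),
                    ((Number) source.get("bytes_downlink")).longValue());
        });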

Related

ElasticSearch new Java API: print created query

I am trying out the new Java Client for Elastic 8.1.1.
In older versions I was able to print out the generated JSON query by using searchRequest.source().
I cannot find out which method/service I can actually use to do so with the new client.
My code looks like this:
final Query range_query = new Query.Builder().range(r -> r.field("pixel_x")
.from(String.valueOf(lookupDto.getPixel_x_min())).to(String.valueOf(lookupDto.getPixel_x_max())))
.build();
final Query bool_query = new Query.Builder().bool(t -> t.must(range_query)).build();
SearchRequest sc = SearchRequest.of(s -> s.query(bool_query).index(INDEX).size(100));
The SearchRequest object offers a source() method, but its value is null.
You can use the code below to print a query with the new Elastic Java client:
Query termQuery = TermQuery.of(t -> t.field("field_name").value("search_value"))._toQuery();
StringWriter writer = new StringWriter();
JsonGenerator generator = JacksonJsonProvider.provider().createGenerator(writer);
termQuery.serialize(generator, new JacksonJsonpMapper());
generator.flush();
System.out.println(writer.toString());
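The same approach should also work for printing the whole SearchRequest built in the question, assuming your client version's SearchRequest implements JsonpSerializable (it does in recent 8.x releases); a minimal sketch:
// Serialize the complete request body instead of a single query clause.
StringWriter writer = new StringWriter();
JacksonJsonpMapper mapper = new JacksonJsonpMapper();
JsonGenerator generator = mapper.jsonProvider().createGenerator(writer);
sc.serialize(generator, mapper);
generator.flush();
System.out.println(writer.toString());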

Spring Data MongoDB BulkOperations upsert does not save the data to MongoDB

for (int i = 0; i < deviceDataList.size(); i++) {
    updateList = new ArrayList<>();
    deviceData = deviceDataList.get(i);
    Query query = new Query();
    query.addCriteria(new Criteria("seqId").is(deviceData.getDeviceCd()));
    query.addCriteria(new Criteria("time").is(deviceData.getGpsAtMillis()));
    Update update = new Update();
    update.set("seqId", deviceData.getSeqId());
    update.set("partId", deviceData.getPartId());
    update.set("partName", deviceData.getPartName());
    update.set("time", deviceData.getTime());
    Pair<Query, Update> updatePair = Pair.of(query, update);
    updateList.add(updatePair);
}
MongoTemplate mongoTemplate = null;
BulkOperations operations = mongoTemplate.bulkOps(BulkOperations.BulkMode.UNORDERED, collectionName);
operations.upsert(updateList);
BulkWriteResult writeResult = operations.execute();
result.setSuccess(writeResult.wasAcknowledged());
result.setResult(writeResult.getModifiedCount());
I wrote this code to insert the data when it does not exist and to update it when it does, but I find that the updateList data does not get saved to MongoDB, and I do not know why; this confuses me.
I am using MongoDB 3.4.12 with the spring-data-mongodb jar. I am not sure why this happens, or what I should do to solve this problem.
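Two things stand out in the snippet itself: updateList is re-created on every loop iteration, so at most the last Pair survives, and mongoTemplate is explicitly null, so bulkOps would fail before anything is written. A minimal sketch of the usual pattern, assuming mongoTemplate is a Spring-injected MongoTemplate, collectionName is the target collection, and the query is meant to match the same values the update writes:
// Build the whole batch first, then hand it to a single BulkOperations instance.
List<Pair<Query, Update>> updateList = new ArrayList<>();
for (DeviceData deviceData : deviceDataList) {
    Query query = new Query();
    query.addCriteria(Criteria.where("seqId").is(deviceData.getSeqId()));
    query.addCriteria(Criteria.where("time").is(deviceData.getTime()));

    Update update = new Update();
    update.set("seqId", deviceData.getSeqId());
    update.set("partId", deviceData.getPartId());
    update.set("partName", deviceData.getPartName());
    update.set("time", deviceData.getTime());

    updateList.add(Pair.of(query, update));
}

BulkOperations operations = mongoTemplate.bulkOps(BulkOperations.BulkMode.UNORDERED, collectionName);
operations.upsert(updateList);
BulkWriteResult writeResult = operations.execute();
// getModifiedCount() only counts updates to existing documents; documents
// inserted via upsert show up in getUpserts() instead.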

How to extract values from nested Maps using lambda expressions?

I need to extract the foreign-exchange conversion from a nested map using Java 8 lambda expressions.
I was able to solve it with the old-school for-each approach, but I wanted to see how it works with Java 8 lambda/stream expressions.
E.g. I want to filter the maps inside the map: for cmp1, fee1, Inr-Try the value present is 31, which is the desired output.
// camp1
Map<String,String> forexMap3_1 = new HashMap();
forexMap3_1.put("Eur-Try","11");
forexMap3_1.put("Usd-Try","21");
forexMap3_1.put("Inr-Try","31");
Map<String,String> forexMap3_2= new HashMap();
forexMap3_2.put("Eur-Try","12");
forexMap3_2.put("Usd-Try","22");
forexMap3_2.put("Inr-Try","32");
Map<String, Map> feeMap2 = new HashMap();
feeMap2.put("fee1", forexMap3_1);
feeMap2.put("fee2",forexMap3_2);
campaigns.put("cmp1", feeMap2);
// camp2
Map<String,String> forexMap3_3 = new HashMap();
forexMap3_3.put("Eur-Try","11");
forexMap3_3.put("Usd-Try","21");
forexMap3_3.put("Inr-Try","31");
Map<String,String> forexMap3_4= new HashMap();
forexMap3_4.put("Eur-Try","12");
forexMap3_4.put("Usd-Try","22");
forexMap3_4.put("Inr-Try","32");
Map<String, Map> feeMap3 = new HashMap();
feeMap3.put("fee3", forexMap3_3);
feeMap3.put("fee4",forexMap3_4);
campaigns.put("cmp2", feeMap3);
Try this:
campaigns.entrySet().stream()
        .filter(x -> x.getKey().equals(yourKey))
        .flatMap(x -> x.getValue().entrySet().stream())
        .collect(Collectors.toMap(x -> x.getKey(), x -> x.getValue()));
Or just iterate over the campaign's children:
HashMap<String, String> finalMap = new HashMap<>();
campaigns.forEach((s, stringMapMap) -> stringMapMap.forEach((s1, map) -> finalMap.putAll(map)));
System.out.println(finalMap.get("Inr-Try")); // output: 31
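If the goal is specifically the cmp1 / fee1 / Inr-Try value rather than a merged map, a stream that filters each level could look like the sketch below (this assumes campaigns is declared with full generics as Map<String, Map<String, Map<String, String>>>):
Optional<String> inrTry = campaigns.entrySet().stream()
        .filter(campaign -> campaign.getKey().equals("cmp1"))
        .flatMap(campaign -> campaign.getValue().entrySet().stream())
        .filter(fee -> fee.getKey().equals("fee1"))
        .map(fee -> fee.getValue().get("Inr-Try"))
        .filter(Objects::nonNull)
        .findFirst();
System.out.println(inrTry.orElse("not found")); // 31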

Spark RDD to update

I am loading a file from HDFS into a JavaRDD and want to update that RDD. For that I am converting it to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), but I am not able to, as I am getting a ClassCastException.
Basically I will make key-value pairs and update the keys. IndexedRDD supports updates. Is there any way to convert?
JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, String>()
{
    @Override
    public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
        String[] arr = arg0.split(" ", 2);
        System.out.println("length " + arr.length);
        List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
        // build the key-value pair from the two parts of the line
        results.add(new Tuple2<String, String>(arr[0], arr[1]));
        return results;
    }
});
IndexedRDD<String,String> test = (IndexedRDD<String,String>) mappedRDD.collectAsMap();
collectAsMap() returns a java.util.Map containing all the entries from your JavaPairRDD, but nothing related to Spark; that function collects the values onto a single node so you can work with them in plain Java. Therefore, you cannot cast the result to IndexedRDD or any other RDD type, as it is just a normal Map.
I haven't used IndexedRDD, but from the examples you can see that you need to create it by passing it a pair RDD:
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
So in your code it should be:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());

How to iterate through Elasticsearch source using Apache Spark?

I am trying to build a recommendation system by integrating Elasticsearch with Apache Spark. I am using Java. I am using the MovieLens dataset as example data, and I have indexed it into Elasticsearch as well. So far, I have been able to read the input from the Elasticsearch index as follows:
SparkConf conf = new SparkConf().setAppName("Example App").setMaster("local");
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer.class.getName());
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "movielens/recommendation");
Using the esRDD.collect() function, I can see that I am retrieving the data from Elasticsearch correctly. Now I need to feed the user id, item id and preference from the Elasticsearch result into Spark's recommendation engine. If I were using a CSV file, I would be able to do it as follows:
String path = "resources/user_data.data";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(
new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(" ");
return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]),
Double.parseDouble(sarray[2]));
}
}
);
What would be an equivalent mapping if I need to iterate through the Elasticsearch output stored in esRDD and create a similar map as above? If there is any example code that I could refer to, that would be of great help.
Apologies for not answering the Spark question directly, but in case you missed it, there is a description of doing recommendations on MovieLens data using elasticsearch here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html
You have not specified the format of the data in ElasticSearch. But let's assume it has fields userId, movieId and rating so an example document looks something like {"userId":1,"movieId":1,"rating":4}.
Then you should be able to do (ignoring null checks etc):
// map over the document source maps (the values of the pair RDD)
JavaRDD<Rating> ratings = esRDD.values().map(
    new Function<Map<String, Object>, Rating>() {
        public Rating call(Map<String, Object> m) {
            int userId = Integer.parseInt(m.get("userId").toString());
            int movieId = Integer.parseInt(m.get("movieId").toString());
            double rating = Double.parseDouble(m.get("rating").toString());
            return new Rating(userId, movieId, rating);
        }
    }
);
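As a follow-up, the resulting ratings RDD can be fed straight into MLlib's ALS trainer. A minimal sketch, with rank, number of iterations and lambda picked arbitrarily here:
// Train a collaborative-filtering model from the ratings (parameter values are illustrative).
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);
// Predict how user 1 would rate item 1.
double predicted = model.predict(1, 1);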
