Spark RDD to update - hadoop

I am loading a file from HDFS into a JavaRDD and want to update that RDD. For that I am converting it to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), but I am not able to, as I am getting a ClassCastException.
Basically I will make key-value pairs and update them by key. IndexedRDD supports updates. Is there any way to convert?
JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair( new PairFlatMapFunction<String, String, String>()
{
@Override
public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
String[] arr = arg0.split(" ",2);
System.out.println("length " + arr.length);
List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
results.add(new Tuple2<String, String>(arr[0], arr[1]));
return results;
}
});
IndexedRDD<String,String> test = (IndexedRDD<String,String>) mappedRDD.collectAsMap();

collectAsMap() returns a java.util.Map containing all the entries from your JavaPairRDD, but nothing related to Spark. That function collects the values onto one node so you can work with them in plain Java. Therefore, you cannot cast the result to IndexedRDD or any other RDD type, as it is just a normal Map.
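To see this concretely, here is a minimal sketch (reusing the mappedRDD from the question) of what collectAsMap() actually hands back, just a plain map on the driver:
// collectAsMap() pulls all pairs back to the driver as an ordinary java.util.Map;
// there is no RDD left at that point, which is why the IndexedRDD cast fails at runtime.
java.util.Map<String, String> plainMap = mappedRDD.collectAsMap();
System.out.println(plainMap.getClass().getName()); // some plain Map implementation, not an RDD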
I haven't used IndexedRDD, but from the examples you can see that you need to create it by passing a PairRDD to its constructor:
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
So in your code it should be:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());

Related

Fetching Apache Flink results using Springboot API

I have created a Spring Boot application that exposes an API endpoint so that I can fetch the Flink results when the user hits the endpoint.
static Map<String, Long> finalList = new HashMap<String, Long>();
The above is a class variable declared outside the main method. Within the main method, the Flink DataStream API is used to fetch data from Kafka via the KafkaSource class and perform the aggregations in Flink. The resulting stream is stored in a data_15min variable of type DataStream<Map<String, Long>>.
I used a map function to fetch the data from the stream and put it into the finalList object above.
data_15min.map(new MapFunction<Map<String, Long>, Object>() {
ObjectMapper objMapperNew = new ObjectMapper();
@Override
public Object map(Map<String, Long> value) throws Exception {
System.out.println("Value in map: " + value);
System.out.println("KeySet: " + value.keySet());
value.forEach((key, val) -> {
if(!dcListNew.contains(key)) {
dcListNew.add(key);
finalList.put(key, val);
} else {
finalList.replace(key, val);
}
});
System.out.println("DC List NEw inside: " + dcListNew);
System.out.println("Final List inside: " + objMapperNew.writeValueAsString(finalList));
return finalList;
}
});
System.out.println("Final List: " + finalList);
On printing the finalList within the above map function I am getting the data as
Final List inside: {"ABC": 2, "AXY": 10}
Outside the map when I print, I still see it as:
Final List: {}
Can someone please help me make this data available outside the map method so that I can send it in the Spring Boot API response?
I think the main reason this does not work is concurrency: I assume that you are calling execute() on your environment somewhere, and only after execute() finishes has your finalList been populated with records.
Note that working like this isn't safe at all and will quite probably result in errors after a short time. Overall, it seems the architecture of the solution is missing something like a database where you would store your results from Flink and then access that database from Spring.
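As an illustration of that last point, here is a minimal sketch of pushing each aggregated map into an external store through a Flink sink instead of mutating a static field. data_15min and env are assumed to exist as in the question, and ResultDao is a hypothetical repository, not a Flink or Spring API:
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Write each aggregation result to an external store instead of a static map.
// The Spring Boot controller then reads from the same store when the endpoint is hit.
data_15min.addSink(new SinkFunction<Map<String, Long>>() {
    @Override
    public void invoke(Map<String, Long> value, Context context) {
        ResultDao.upsertAll(value); // hypothetical repository call
    }
});

env.execute("15-minute aggregation"); // results only start flowing once the job is executing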

Convert ImmutableListMultimap to Map using Collectors.toMap

I would like to convert a ImmutableListMultimap<String, Character> to Map<String, List<Character>>.
I used to do it in the non-stream way as follows
void convertMultiMaptoList(ImmutableListMultimap<String, Character> reverseImmutableMultiMap) {
Map<String, List<Character>> result = new TreeMap<>();
for( Map.Entry<String, Character> entry: reverseImmutableMultiMap.entries()) {
String key = entry.getKey();
Character t = entry.getValue();
result.computeIfAbsent(key, x-> new ArrayList<>()).add(t);
}
//reverseImmutableMultiMap.entries().stream().collect(Collectors.toMap)
}
I was wondering how to write the above same logic using java8 stream way (Collectors.toMap).
Please share your thoughts
Well, there is already an asMap() that you can use to make this easier:
Builder<String, Character> builder = ImmutableListMultimap.builder();
builder.put("12", 'c');
builder.put("12", 'c');
ImmutableListMultimap<String, Character> map = builder.build();
Map<String, List<Character>> map2 = map.asMap()
.entrySet()
.stream()
.collect(Collectors.toMap(Entry::getKey, e -> new ArrayList<>(e.getValue())));
If, on the other hand, you are OK with the return type of asMap(), then it's a simple method call:
ImmutableMap<String, Collection<Character>> asMap = map.asMap();
Map<String, List<Character>> result = reverseImmutableMultiMap.entries().stream()
.collect(groupingBy(Entry::getKey, TreeMap::new, mapping(Entry::getValue, toList())));
The important detail is mapping: it adapts the downstream collector (toList) so that it collects List<Character> instead of List<Entry<String, Character>>, applying the mapping function Entry::getValue to each entry.
groupingBy will group all entries by the String key.
toList will collect all values with the same key into a list.
Also, passing TreeMap::new as an argument to groupingBy makes sure you get this specific type of Map instead of the default HashMap.
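A quick usage sketch of the groupingBy/mapping variant, reusing the builder example from above (the printed output is what I would expect, since "12" sorts before "3" in a TreeMap):
ImmutableListMultimap.Builder<String, Character> builder = ImmutableListMultimap.builder();
builder.put("12", 'c');
builder.put("12", 'd');
builder.put("3", 'x');
ImmutableListMultimap<String, Character> multimap = builder.build();

Map<String, List<Character>> result = multimap.entries().stream()
        .collect(groupingBy(Entry::getKey, TreeMap::new, mapping(Entry::getValue, toList())));

System.out.println(result); // {12=[c, d], 3=[x]}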

How to filter records according to `timestamp` in Spring Data Hadoop?

I have an HBase table with a sample record as follows:
03af639717ae10eb743253433147e133 column=u:a, timestamp=1434300763147, value=apple
10f3d7f8fe8f25d5bdf52343a2601227 column=u:a, timestamp=1434300763148, value=mapple
20164b1aff21bc14e94623423a9d645d column=u:a, timestamp=1534300763142, value=papple
44d1cb38271362d20911a723410b2c67 column=u:a, timestamp=1634300763141, value=scapple
I am lost trying to pull out the row values according to the timestamp. I am using Spring Data Hadoop.
I was only able to fetch all the records using the code below:
private static final byte[] CF_INFO = Bytes.toBytes("u");
private static final byte[] baseUrl = Bytes.toBytes("a");
List<Model> allNewsList
= hbaseTemplate.find(tableName, columnFamily, new RowMapper<Model>()
{
@Override
public Model mapRow(Result result, int rowNum)
throws Exception
{
String dateString = TextUtils.getTimeStampInLong(result.toString());
String rowKey = Bytes.toString(result.getRow());
return new Model(
rowKey,
Bytes.toString(result.getValue(CF_INFO, baseUrl))
);
}
});
How can I apply a filter so that I get only the records within the timestamp range [1434300763147, 1534300763142]?
Hopefully this would help someone someday.
final org.apache.hadoop.hbase.client.Scan scan = new Scan();
scan.setTimeRange(1434300763147L, 1534300763142L);
final List<Model> yourObjects = hbaseTemplate.find(tableName, scan, mapper);
Also, worth a mention: the maximum value of the time range is exclusive, so if you want records with that timestamp to be returned, make sure to increment the maximum value by 1.
The problem was solved using the Scan object from the HBase client.
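For completeness, a minimal sketch of what the mapper referenced above could look like, reusing the Model class and column-family bytes from the question (the +1 reflects the exclusive upper bound mentioned above):
// Hypothetical RowMapper reusing the question's Model; adjust it to your constructor.
RowMapper<Model> mapper = new RowMapper<Model>() {
    @Override
    public Model mapRow(Result result, int rowNum) throws Exception {
        return new Model(
                Bytes.toString(result.getRow()),
                Bytes.toString(result.getValue(Bytes.toBytes("u"), Bytes.toBytes("a"))));
    }
};

Scan scan = new Scan();
// setTimeRange may declare IOException depending on the HBase client version.
scan.setTimeRange(1434300763147L, 1534300763142L + 1); // upper bound is exclusive, hence +1
List<Model> filtered = hbaseTemplate.find(tableName, scan, mapper);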

How to iterate through Elasticsearch source using Apache Spark?

I am trying to build a recommendation system by integrating Elasticsearch with Apache Spark. I am using Java, with the MovieLens dataset as example data. I have indexed the data into Elasticsearch as well. So far, I have been able to read the input from the Elasticsearch index as follows:
SparkConf conf = new SparkConf().setAppName("Example App").setMaster("local");
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer.class.getName());
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "movielens/recommendation");
Using the esRDD.collect() function, I can see that I am retrieving the data from Elasticsearch correctly. Now I need to feed the user id, item id and preference from the Elasticsearch result into Spark's recommendation module. If I were using a CSV file, I would be able to do it as follows:
String path = "resources/user_data.data";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(
new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(" ");
return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]),
Double.parseDouble(sarray[2]));
}
}
);
What would be the equivalent mapping if I need to iterate through the Elasticsearch output stored in esRDD and create a similar mapping as above? If there is any example code I could refer to, that would be a great help.
Apologies for not answering the Spark question directly, but in case you missed it, there is a description of doing recommendations on MovieLens data using elasticsearch here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html
You have not specified the format of the data in Elasticsearch, but let's assume it has fields userId, movieId and rating, so an example document looks something like {"userId":1,"movieId":1,"rating":4}.
Then you should be able to map over the document source (the value part of each pair), ignoring null checks etc.:
JavaRDD<Rating> ratings = esRDD.values().map(
new Function<Map<String, Object>, Rating>() {
public Rating call(Map<String, Object> m) {
int userId = Integer.parseInt(String.valueOf(m.get("userId")));
int movieId = Integer.parseInt(String.valueOf(m.get("movieId")));
double rating = Double.parseDouble(String.valueOf(m.get("rating")));
return new Rating(userId, movieId, rating);
}
}
);
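From there, if the goal is Spark's built-in collaborative filtering, the ratings RDD can be fed to MLlib's ALS in the usual way. A minimal sketch with placeholder rank/iteration values:
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;

// Train a matrix-factorization model from the ratings RDD built above.
// rank and numIterations are placeholders; tune them for your data.
int rank = 10;
int numIterations = 10;
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

// Example: predicted preference of user 1 for item 2.
double predicted = model.predict(1, 2);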

Is there a simple way to receive a Map instead of a List when using Spring JdbcTemplate.query?

getSimpleJdbcTemplate().query(sql, getMapper()); returns a List, but I need a Map whose key stores the data of one of the fields of the object. For example, I have an object named "Currency" which has the fields id, code, name, etc. The query above returns a List object, but I want to get a Currency by id from a Map. Right now, I have written the following code:
@Override
public Map<Integer, Currency> listCurrencies() {
String sql = "select cur_id, cur_code, cur_name ... from currencies";
List<Currency> currencies = getSimpleJdbcTemplate().query(sql, getMapper());
Map<Integer, Currency> map = new HashMap<Integer, Currency>(currencies.size());
for (Currency currency : currencies) {
map.put(currency.getId(), currency);
}
return map;
}
Is there any way to do the same thing without creating the List object and looping over it?
You have ResultSetExtractor for extracting values from the ResultSet. So in your case you can write a custom ResultSetExtractor which will return the Map object.
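A minimal sketch of such an extractor, assuming a plain JdbcTemplate is available and that Currency has a constructor matching the selected columns (both assumptions, adapt to your classes):
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.ResultSetExtractor;

public Map<Integer, Currency> listCurrencies(JdbcTemplate jdbcTemplate) {
    String sql = "select cur_id, cur_code, cur_name from currencies";
    return jdbcTemplate.query(sql, new ResultSetExtractor<Map<Integer, Currency>>() {
        @Override
        public Map<Integer, Currency> extractData(ResultSet rs) throws SQLException {
            // Build the map directly while iterating the ResultSet; no intermediate List.
            Map<Integer, Currency> map = new HashMap<Integer, Currency>();
            while (rs.next()) {
                Currency currency = new Currency(
                        rs.getInt("cur_id"),
                        rs.getString("cur_code"),
                        rs.getString("cur_name")); // assumed constructor, adjust as needed
                map.put(currency.getId(), currency);
            }
            return map;
        }
    });
}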
