I have a job that reads data from Cassandra and stores the data as a List (the method fillOnceGeoFencesFromDB() attached below), and then I create a StreamExecutionEnvironment and consume data from the Kafka queue.
During transformation of the DataStream I try to reference the recently filled static ArrayList, but it's empty.
What is the best practice for passing a previously filled List into the next job?
Any idea will be appreciated.
private static ArrayList<GeoFences> allGeoFences = new ArrayList<>();
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(5000); // checkpoint every 5000 msecs
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Properties kafkaProps = new Properties();
kafkaProps.setProperty("zookeeper.connect", LOCAL_ZOOKEEPER_HOST);
kafkaProps.setProperty("bootstrap.servers", LOCAL_KAFKA_BROKER);
kafkaProps.setProperty("group.id", KAFKA_GROUP);
kafkaProps.setProperty("auto.offset.reset", "earliest");
fillOnceGeoFencesFromDB(); // populate data in ArrayList<GeoFences> allGeoFences
DataStream <Tuple6<UUID, String, String, String, String, Timestamp>> stream_parsed_with_timestamps = env
.addSource(new FlinkKafkaConsumer010<>(KAFKA_SUBSCRIBE_TOPIC, new SimpleStringSchema(), kafkaProps))
.rebalance().map(new MapFunction<String, Tuple4<UUID, String, String, Timestamp>>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple4<UUID, String, String, Timestamp> map(String value) throws Exception {
return mapToTuple4(value);
}})
    ...
Please keep in mind that whatever happens in the map function takes place on the task managers, while all the code in your main method is only used to define the job.
Pass your parameter explicitly to the MapFunction (that will also make the code easier to read):
private static class GeoFenceMapper implements MapFunction<String, Tuple4<UUID, String, String, Timestamp>> {

    private static final long serialVersionUID = 1L;

    private final ArrayList<GeoFences> allGeoFences;

    public GeoFenceMapper(ArrayList<GeoFences> allGeoFences) {
        this.allGeoFences = allGeoFences;
    }

    @Override
    public Tuple4<UUID, String, String, Timestamp> map(String value) throws Exception {
        // this.allGeoFences is serialized together with the mapper,
        // so it is available on the task managers
        return mapToTuple4(value);
    }
}
and then use this new mapper:
DataStream <Tuple6<UUID, String, String, String, String, Timestamp>> stream_parsed_with_timestamps = env
.addSource(new FlinkKafkaConsumer010<>(KAFKA_SUBSCRIBE_TOPIC, new SimpleStringSchema(), kafkaProps))
.rebalance().map(new GeoFenceMapper(fillOnceGeoFencesFromDB())); // fillOnceGeoFencesFromDB() now returns the list
Hope this helps!
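Why the constructor approach works while the static field does not: Flink serializes the MapFunction instance and ships it to the task managers, so data held in instance fields travels with the function, while static fields are initialized fresh (i.e. empty) in each task manager JVM. A minimal stdlib-only sketch of that serialization round trip (class and field names are hypothetical):

```java
import java.io.*;
import java.util.*;

public class ClosureShippingDemo {
    // Stand-in for GeoFenceMapper: a Serializable function object carrying its data
    static class ListCarrier implements Serializable {
        private static final long serialVersionUID = 1L;
        final ArrayList<String> fences;
        ListCarrier(ArrayList<String> fences) { this.fences = fences; }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> fences = new ArrayList<>(Arrays.asList("zoneA", "zoneB"));
        ListCarrier original = new ListCarrier(fences);

        // Serialize and deserialize, as Flink does when shipping the function
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }
        ListCarrier shipped;
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            shipped = (ListCarrier) ois.readObject();
        }
        // The constructor-passed list survives the round trip; a static field would not
        System.out.println(shipped.fences);
    }
}
```

This is also why the original code fails: fillOnceGeoFencesFromDB() filled the static list only in the client JVM that submitted the job.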
I am working on creating an Excel file from data, and for that I have created a job. I want to set a HashMap on the JobParameters so that I can use it in my Reader class, so I have created a CustomJobParameter class.
Below you can find the code that builds the job parameters:
Get job parameters:
public JobParameters createJobParam (MyRequest request) {
final JobParameters parameters = new JobParametersBuilder()
.addString("MyParam1", request.getReportGenerationJobId())
.addString("MyParam2", request.getSessionId())
.addLong("time", System.currentTimeMillis())
.addParameter(
"MyObject",
new MyUtils.CustomJobParameter(request.getHsSlideArticles())
)
.toJobParameters();
return parameters;
}
CustomJobParameter Class written in MyUtils class:
public static class CustomJobParameter<T extends Serializable> extends JobParameter {
private HashMap customParam;
public CustomJobParameter (HashMap slideArticles) {
super("");
this.customParam = slideArticles;
}
public HashMap getValue () {
return customParam;
}
}
But when I set my custom parameter this way, it stores a blank string, not the object I am passing.
How can I pass the HashMap to my reader?
According to the documentation for JobParameter, a JobParameter can only hold a String, Long, Date, or Double.
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParameter.html
Domain representation of a parameter to a batch job. Only the following types can be parameters: String, Long, Date, and Double.
The identifying flag is used to indicate if the parameter is to be
used as part of the identification of a job instance.
Therefore you can not extend JobParameter and expect it to work with HashMap.
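If the map only holds simple string values, one workaround within those type limits is to flatten it into a single String, pass that via addString(...), and parse it back in the reader. A stdlib-only sketch (the delimiter format is a hypothetical choice, not a Spring Batch convention):

```java
import java.util.*;
import java.util.stream.Collectors;

public class FlattenParamDemo {
    public static void main(String[] args) {
        // Driver side: flatten the map into one String parameter
        Map<String, String> articles = new LinkedHashMap<>();
        articles.put("slide1", "intro");
        articles.put("slide2", "summary");

        String flat = articles.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));

        // Reader side: parse the String parameter back into a map
        Map<String, String> parsed = Arrays.stream(flat.split(";"))
                .map(s -> s.split("=", 2))
                .collect(Collectors.toMap(a -> a[0], a -> a[1],
                        (a, b) -> b, LinkedHashMap::new));

        System.out.println(flat);
        System.out.println(parsed.equals(articles));
    }
}
```

This keeps everything inside the supported String type; just make sure the chosen delimiters cannot appear in the values.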
However there is another option, JobParameters:
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParameters.html
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParametersBuilder.html
You could create a Map<String, JobParameter> instead:
Example:
new JobParameters(Collections.singletonMap("yearMonth", new JobParameter("2021-07")))
and then use JobParametersBuilder's addJobParameters in your createJobParam to add all your Map<String, JobParameter> entries:
addJobParameters(JobParameters jobParameters) //Copy job parameters into the current state.
So your method will look like:
public JobParameters createJobParam (MyRequest request) {
final JobParameters parameters = new JobParametersBuilder()
.addString("MyParam1", request.getReportGenerationJobId())
.addString("MyParam2", request.getSessionId())
.addLong("time", System.currentTimeMillis())
        .addJobParameters(new JobParameters(myParamMap)) // myParamMap is your Map<String, JobParameter>
        .toJobParameters();
    return parameters;
}
My client retrieves JSON content as below:
{
"table": "tablename",
"update": 1495104575669,
"rows": [
{"column5": 11, "column6": "yyy"},
{"column3": 22, "column4": "zzz"}
]
}
In the rows array, the keys are not fixed. I want to retrieve each key and value and save them into a Map using Gson 2.8.x.
How can I configure Gson to deserialize this?
Here is my idea:
public class Dataset {
private String table;
private long update;
private List<Rows> lists; // <-- a little confused here:
// or: private List<HashMap<String, Object>> lists;
Setter/Getter
}
public class Rows {
private HashMap<String, Object> map;
....
}
Dataset k = gson.fromJson(jsonStr, Dataset.class);
log.info(k.getRows().size()); // <-- I get two null objects
Thanks.
Gson does not support such a thing out of the box. It would be nice if you could make the property name fixed. If not, there are a few options that may help you:
1. Just rename the Dataset.lists field to Dataset.rows, if the property name is in fact fixed as rows.
2. If the possible name set is known in advance, tell Gson to pick up alternative names using @SerializedName.
3. If the possible name set is really unknown and may change in the future, you might try to make it fully dynamic using a custom TypeAdapter (streaming mode; requires less memory but is harder to use) or a custom JsonDeserializer (object mode; requires more memory to store intermediate tree views, but is easy to use) registered with GsonBuilder.
For option #2, you can simply list the alternative names:
@SerializedName(value = "lists", alternate = "rows")
final List<Map<String, Object>> lists;
For option #3, bind a downstream List<Map<String, Object>> type adapter and try to detect the name dynamically. Note that I omit a deserialization strategy for the Rows class for simplicity; in fact, you might want to drop the Rows class in favor of a simple Map<String, Object>. (Another note: declare the field as Map rather than a concrete collection implementation. Hash maps are unordered, but telling Gson you deal with Map lets it pick an ordered map such as LinkedTreeMap (a Gson internal) or LinkedHashMap, which may matter for datasets.)
// Type tokens are immutable and can be declared constants
private static final TypeToken<String> stringTypeToken = new TypeToken<String>() {
};
private static final TypeToken<Long> longTypeToken = new TypeToken<Long>() {
};
private static final TypeToken<List<Map<String, Object>>> stringToObjectMapListTypeToken = new TypeToken<List<Map<String, Object>>>() {
};
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( typeToken.getRawType() != Dataset.class ) {
return null;
}
// If the actual type token represents the Dataset class, then pick the bunch of downstream type adapters
final TypeAdapter<String> stringTypeAdapter = gson.getDelegateAdapter(this, stringTypeToken);
final TypeAdapter<Long> primitiveLongTypeAdapter = gson.getDelegateAdapter(this, longTypeToken);
final TypeAdapter<List<Map<String, Object>>> stringToObjectMapListTypeAdapter = gson.getDelegateAdapter(this, stringToObjectMapListTypeToken);
// And compose the bunch into a single dataset type adapter
final TypeAdapter<Dataset> datasetTypeAdapter = new TypeAdapter<Dataset>() {
@Override
public void write(final JsonWriter out, final Dataset dataset) {
// Omitted for brevity
throw new UnsupportedOperationException();
}
@Override
public Dataset read(final JsonReader in)
throws IOException {
in.beginObject();
String table = null;
long update = 0;
List<Map<String, Object>> lists = null;
while ( in.hasNext() ) {
final String name = in.nextName();
switch ( name ) {
case "table":
table = stringTypeAdapter.read(in);
break;
case "update":
update = primitiveLongTypeAdapter.read(in);
break;
default:
lists = stringToObjectMapListTypeAdapter.read(in);
break;
}
}
in.endObject();
return new Dataset(table, update, lists);
}
}.nullSafe(); // Making the type adapter null-safe
@SuppressWarnings("unchecked")
final TypeAdapter<T> typeAdapter = (TypeAdapter<T>) datasetTypeAdapter;
return typeAdapter;
}
})
.create();
final Dataset dataset = gson.fromJson(jsonReader, Dataset.class);
System.out.println(dataset.lists);
The code above would then print:
[{column5=11.0, column6=yyy}, {column3=22.0, column4=zzz}]
I have an RDD:
JavaPairRDD<Long, ViewRecord> myRDD
which was created via the newAPIHadoopRDD method. I have an existing map function which I want to implement the Spark way:
LongWritable one = new LongWritable(1L);
protected void map(Long key, ViewRecord viewRecord, Context context)
throws IOException ,InterruptedException {
String url = viewRecord.getUrl();
long day = viewRecord.getDay();
tuple.getKey().set(url);
tuple.getValue().set(day);
context.write(tuple, one);
};
PS: tuple is derived from:
KeyValueWritable<Text, LongWritable>
and can be found here: TextLong.java
I don't know what tuple is, but if you just want to map each record to a tuple with key (url, day) and value 1L, you can do it like this:
result = myRDD
.values()
.mapToPair(viewRecord -> {
String url = viewRecord.getUrl();
long day = viewRecord.getDay();
return new Tuple2<>(new Tuple2<>(url, day), 1L);
    });

// Java 7 style:
JavaPairRDD<Pair, Long> result = myRDD
.values()
.mapToPair(new PairFunction<ViewRecord, Pair, Long>() {
@Override
public Tuple2<Pair, Long> call(ViewRecord record) throws Exception {
String url = record.getUrl();
Long day = record.getDay();
return new Tuple2<>(new Pair(url, day), 1L);
}
}
);
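The mapToPair above would typically be followed by a reduceByKey(Long::sum) to get per-(url, day) counts. The same aggregation can be sketched without Spark using plain Java streams (the ViewRecord fields are assumed from the question):

```java
import java.util.*;
import java.util.stream.Collectors;

public class PairCountDemo {
    // Minimal stand-in for ViewRecord (hypothetical fields)
    static class ViewRecord {
        final String url; final long day;
        ViewRecord(String url, long day) { this.url = url; this.day = day; }
    }

    public static void main(String[] args) {
        List<ViewRecord> records = Arrays.asList(
                new ViewRecord("/home", 1L),
                new ViewRecord("/home", 1L),
                new ViewRecord("/about", 2L));

        // Same (url, day) -> count result that mapToPair + reduceByKey(Long::sum) produces
        Map<List<Object>, Long> counts = records.stream()
                .collect(Collectors.groupingBy(
                        r -> Arrays.<Object>asList(r.url, r.day),
                        Collectors.counting()));

        System.out.println(counts.get(Arrays.<Object>asList("/home", 1L)));
        System.out.println(counts.get(Arrays.<Object>asList("/about", 2L)));
    }
}
```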
I created a driver which reads a config file, builds a list of objects (based on the config) and passes that list to MapReduce (MapReduce has a static attribute which holds a reference to that list of objects).
It works, but only locally. As soon as I run the job with a cluster configuration, I get all sorts of errors suggesting that the list hasn't been built. This makes me think I'm doing it wrong, and that on a cluster setup MapReduce runs independently of the driver.
My question is how to correctly initialise a Mapper.
(I'm using Hadoop 2.4.1)
This is related to the problem of side data distribution.
There are two approaches for side data distribution.
1) Distributed Caches
2) Configuration
As you have objects to be shared, we can use the Configuration class.
This discussion relies on the Configuration class to make an object available across the cluster, accessible to all Mappers and/or Reducers. The approach is quite simple: the set(String, String) setter of the Configuration class is harnessed for this task. The object to be shared is serialized into a Java string at the driver end and deserialized back into the object in the Mapper or Reducer.
In the example code below, I have used com.google.gson.Gson class for the easy serialization and deserialization. You can use Java Serialization as well.
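For the Java Serialization variant, the serialized bytes need to be made string-safe before going into the Configuration, e.g. with Base64. A minimal stdlib-only sketch, with a plain Map standing in for the Hadoop Configuration (class and field names are hypothetical):

```java
import java.io.*;
import java.util.*;

public class JavaSerializationDemo {
    // Stand-in for the bean being shared (hypothetical fields)
    static class TestBean implements Serializable {
        private static final long serialVersionUID = 1L;
        final String string1, string2;
        TestBean(String s1, String s2) { this.string1 = s1; this.string2 = s2; }
    }

    public static void main(String[] args) throws Exception {
        // Driver side: serialize the bean to a Base64 string and "set" it
        TestBean bean = new TestBean("Hello1", "Ser1");
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(bean);
        }
        Map<String, String> conf = new HashMap<>(); // stands in for Configuration
        conf.put("instance1", Base64.getEncoder().encodeToString(bos.toByteArray()));

        // Mapper side: "get" the string, decode and deserialize in setup()
        byte[] bytes = Base64.getDecoder().decode(conf.get("instance1"));
        TestBean restored;
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            restored = (TestBean) ois.readObject();
        }
        System.out.println(restored.string1 + " " + restored.string2);
    }
}
```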
Class that Represents the Object You need to Share
public class TestBean {
String string1;
String string2;
public TestBean(String test1, String test2) {
super();
this.string1 = test1;
this.string2 = test2;
}
public TestBean() {
this("", "");
}
public String getString1() {
return string1;
}
public void setString1(String test1) {
this.string1 = test1;
}
public String getString2() {
return string2;
}
public void setString2(String test2) {
this.string2 = test2;
}
}
The main class, where the configuration is set:
public class GSONTestDriver {
public static void main(String[] args) throws Exception {
System.out.println("In Main");
Configuration conf = new Configuration();
TestBean testB1 = new TestBean("Hello1","Gson1");
TestBean testB2 = new TestBean("Hello2","Gson2");
Gson gson = new Gson();
String testSerialization1 = gson.toJson(testB1);
String testSerialization2 = gson.toJson(testB2);
conf.set("instance1", testSerialization1);
conf.set("instance2", testSerialization2);
Job job = new Job(conf, " GSON Test");
job.setJarByClass(GSONTestDriver.class);
job.setMapperClass(GSONTestMapper.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The mapper class, where the object is retrieved:
public class GSONTestMapper extends
Mapper<LongWritable, Text, Text, NullWritable> {
Configuration conf;
String inst1;
String inst2;
public void setup(Context context) {
conf = context.getConfiguration();
inst1 = conf.get("instance1");
inst2 = conf.get("instance2");
Gson gson = new Gson();
TestBean tb1 = gson.fromJson(inst1, TestBean.class);
System.out.println(tb1.getString1());
System.out.println(tb1.getString2());
TestBean tb2 = gson.fromJson(inst2, TestBean.class);
System.out.println(tb2.getString1());
System.out.println(tb2.getString2());
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(value,NullWritable.get());
}
}
The bean is converted to a serialized JSON string using the toJson(Object src) method of the com.google.gson.Gson class. The serialized JSON string is then passed as a value through the Configuration instance and accessed by name in the Mapper, where it is deserialized using the fromJson(String json, Class classOfT) method of the same Gson class. Instead of my test bean, you could place your own objects.
Dear hadooper:
I'm new to Hadoop, and recently tried to implement an algorithm.
This algorithm needs to calculate a matrix which represents the different ratings of every pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in HDFS. Let's call this dataset A (size = 160G).
Now, I need to read the users' profiles to predict their ratings for a specific song. So I need to read the users' profiles first (which are 5G in size); let's call this dataset B. Then I can do the calculation using dataset A.
But now I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed system, and I can't read dataset B into a single machine's memory.)
Any suggestions?
You can use two map functions, each processing one dataset, if you want to implement different processing. You need to register each mapper with your job conf. For example:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input or output file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf(conf);
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
Hadoop allows you to use different input formats for different folders. So you can read from several data sources and then cast to a specific type in the map function, i.e. in one case you get (String, User) and in the other (String, SongSongRating), while your map signature is (String, Object).
The second step is selecting a recommendation algorithm and joining the data in some way, so the aggregator has at least enough information to calculate a recommendation.
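What the (omitted) FullOuterJoinReducer would do with the tagged values can be sketched in plain Java: group the emitted records by key, then split each group by its file tag to pair the two sides up (the sample data below is hypothetical):

```java
import java.util.*;

public class TaggedJoinDemo {
    public static void main(String[] args) {
        // Records as emitted by the two mappers: key = book title, value = tag + payload
        List<String[]> emitted = Arrays.asList(
                new String[]{"Hadoop Guide", "person_book#alice"},
                new String[]{"Hadoop Guide", "auth_book#tom"},
                new String[]{"Spark Notes", "person_book#bob"});

        // Shuffle stand-in: group values by key
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] kv : emitted) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        // Reducer stand-in: split each group by its file tag
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            List<String> persons = new ArrayList<>(), authors = new ArrayList<>();
            for (String v : e.getValue()) {
                if (v.startsWith("person_book#")) persons.add(v.substring("person_book#".length()));
                else if (v.startsWith("auth_book#")) authors.add(v.substring("auth_book#".length()));
            }
            System.out.println(e.getKey() + " -> persons=" + persons + " authors=" + authors);
        }
    }
}
```

The tag prefix is what lets a single reducer tell which dataset each value came from, which is the whole point of the two-mapper setup above.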