Custom FileInputFormat always assigns one file split to one slot

I have been writing protobuf records to our S3 buckets, and I want to read them with the Flink DataSet API. To do this, I implemented a custom FileInputFormat. The code is below.
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {
    public ProtobufInputFormat() {
    }
    private transient boolean reachedEnd = false;
    @Override
    public boolean reachedEnd() throws IOException {
        return reachedEnd;
    }
    @Override
    public StandardLog.Pageview nextRecord(StandardLog.Pageview reuse) throws IOException {
        StandardLog.Pageview pageview = StandardLog.Pageview.parseDelimitedFrom(stream);
        if (pageview == null) {
            reachedEnd = true;
        }
        return pageview;
    }
    @Override
    public boolean supportsMultiPaths() {
        return true;
    }
}
public class BatchReadJob {
    public static void main(String... args) throws Exception {
        String readPath1 = args[0];
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        ProtobufInputFormat inputFormat = new ProtobufInputFormat();
        inputFormat.setNestedFileEnumeration(true);
        inputFormat.setFilePaths(readPath1);
        DataSet<StandardLog.Pageview> dataSource = env.createInput(inputFormat);
        dataSource.map(new MapFunction<StandardLog.Pageview, String>() {
            @Override
            public String map(StandardLog.Pageview value) throws Exception {
                return value.getId();
            }
        }).writeAsText("s3://xxx", FileSystem.WriteMode.OVERWRITE);
        env.execute();
    }
}
The problem is that Flink always assigns exactly one file split to each parallel slot. In other words, the number of file splits it processes always equals the parallelism.
What is the correct way to implement a custom FileInputFormat?
Thanks.

I believe the behavior you're seeing is because ExecutionJobVertex calls the FileInputFormat.createInputSplits() method with a minNumSplits parameter equal to the parallelism of the vertex (the data source). So if you want different behavior, you'd have to override the createInputSplits method.
You didn't say what behavior you actually want, though. If, for example, you just want one split per file, you could override the testForUnsplittable() method in your subclass of FileInputFormat to always return true; the override should also set the (protected) unsplittable boolean to true.
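Separately from how splits are created, Flink's InputFormat contract allows one format instance to be opened again for a further split, so per-split state such as the reachedEnd flag in the question is worth resetting in open(). The following is a minimal plain-Java sketch of that idea; ReusableSplitReader and its methods are hypothetical stand-ins that only mimic the open/reachedEnd/nextRecord cycle and are not Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an input format instance that is reused across splits.
final class ReusableSplitReader {
    private Iterator<String> records;
    private boolean reachedEnd;

    // Called once per split on the same reader instance.
    void open(List<String> split) {
        records = split.iterator();
        reachedEnd = false; // without this reset, every split after the first would look exhausted
    }

    boolean reachedEnd() {
        return reachedEnd;
    }

    // Returns null and flags the end once the current split has no more records.
    String nextRecord() {
        if (!records.hasNext()) {
            reachedEnd = true;
            return null;
        }
        return records.next();
    }

    // Drives one reader over several splits, the way a single parallel task would.
    static List<String> readAll(List<List<String>> splits) {
        ReusableSplitReader reader = new ReusableSplitReader();
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            reader.open(split);
            while (!reader.reachedEnd()) {
                String record = reader.nextRecord();
                if (record != null) {
                    out.add(record);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(readAll(splits));
    }
}
```

In a real FileInputFormat subclass, the equivalent step would be an open(FileInputSplit) override that calls super.open(split) and then clears the flag.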

Related

How to parse a list of list of spring properties

I have this Spring boot application.properties
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=list1,list2,list3
I want the topics element of the @KafkaListener annotation to receive the values of the properties that the topics property names.
Using the expression
@KafkaListener(topics={"#{'${topics}'.split(',')}"})
I get list1, list2, list3 as separate strings.
How can I loop over this list to get valueA,valueB,valueC,valueD,valueE?
Edit: I must parse the topics property so that @KafkaListener registers to consume messages from the topics valueA, valueB, valueC, etc.
I read that it is possible to call a method this way:
@KafkaListener(topics="#parse(${topics})")
So, I wrote this method:
public String[] parse(String s) {
    ExpressionParser parser = new SpelExpressionParser();
    return Arrays.stream(s.split(",")).map(key -> (String) parser.parseExpression(key).getValue()).toArray(String[]::new);
}
But the parse method is not invoked.
So I tried to do this directly in the annotation:
@KafkaListener(topics="#{Arrays.stream('${topics}'.split(',')).map(key->${key}).toArray(String[]::new)}")
But this solution also gives me errors.
Edit 2:
With the following change, the method is invoked:
@KafkaListener(topics="parse()")
@Bean
public String[] parse(String s) {
...
}
The problem is how to get the "topics" property inside the method.
You can't invoke arbitrary methods like that; you need to reference a bean, as in @someBean.parse(...). Using #parse requires registering a static method as a function.
However, this works for me and is much simpler:
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=${list1},${list2},${list3}
and
#KafkaListener(id = "so64390079", topics = "#{'${topics}'.split(',')}")
EDIT
If you can't use placeholders in topics, this works...
@SpringBootApplication
public class So64390079Application {
    public static void main(String[] args) {
        SpringApplication.run(So64390079Application.class, args);
    }
    @KafkaListener(id = "so64390079", topics = "#{@parser.parse('${topics}')}")
    public void listen(String in) {
        System.out.println(in);
    }
}
@Component
class Parser implements EnvironmentAware {
    private Environment environmment;
    @Override
    public void setEnvironment(Environment environment) {
        this.environmment = environment;
    }
    public String[] parse(String[] topics) {
        StringBuilder sb = new StringBuilder();
        for (String topic : topics) {
            sb.append(this.environmment.getProperty(topic));
            sb.append(',');
        }
        return StringUtils.commaDelimitedListToStringArray(sb.toString().substring(0, sb.length() - 1));
    }
}
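The lookup that Parser.parse performs can be tried without Spring. This is a minimal sketch that assumes plain java.util.Properties in place of Spring's Environment; TopicResolver and resolve are hypothetical names, and unlike the answer's code it skips keys that have no value instead of appending "null".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class TopicResolver {
    // Treats each comma-separated entry of the topicsKey property as another
    // property key and flattens the values found under those keys into one list.
    static List<String> resolve(Properties props, String topicsKey) {
        List<String> topics = new ArrayList<>();
        for (String listKey : props.getProperty(topicsKey, "").split(",")) {
            String value = props.getProperty(listKey.trim());
            if (value == null) {
                continue; // unknown key: nothing to add
            }
            for (String topic : value.split(",")) {
                String trimmed = topic.trim();
                if (!trimmed.isEmpty()) {
                    topics.add(trimmed);
                }
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("list1", "valueA,valueB");
        props.setProperty("list2", "valueC");
        props.setProperty("list3", "valueD,valueE");
        props.setProperty("topics", "list1,list2,list3");
        System.out.println(resolve(props, "topics"));
    }
}
```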

When to go for custom Input format for Map reduce jobs

When should we use a custom InputFormat in MapReduce programming?
Say I have a file that I need to read line by line, with 15 pipe-delimited columns. Should I use a custom InputFormat?
I can use TextInputFormat as well as a custom InputFormat in this case.
A custom InputFormat is worth writing when you need to customize how input records are read. In your case, you don't need one.
Below is one example of a custom InputFormat, out of many.
Example: Reading Paragraphs as Input Records
If you are working with Hadoop MapReduce or AWS EMR, there may be a use case where each input record is a paragraph rather than a single line (think of scenarios like analyzing comments on news articles). If you need to process a complete paragraph at once as a single record, you will need to customize the default behavior of TextInputFormat, which reads one line per record, so that it reads a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires a custom record reader, which can be created by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation; it's self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;
    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }
    @Override
    public void close() throws IOException {
        lineRecord.close();
    }
    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }
    @Override
    public Text createValue() {
        return new Text("");
    }
    @Override
    public float getProgress() throws IOException {
        // Delegate to the line reader's progress (a fraction), not its byte position.
        return lineRecord.getProgress();
    }
    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, isNextLineAvailable;
        boolean retval;
        byte[] space = {' '};
        value.clear();
        isNextLineAvailable = false;
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                // A non-empty line extends the current paragraph; an empty line ends it.
                if (lineValue.toString().length() > 0) {
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
        } while (appended);
        return isNextLineAvailable;
    }
    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
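The grouping rule used by next() above (consecutive non-empty lines form one record, a blank line closes it) can be tried outside Hadoop. This is a minimal plain-Java sketch with a hypothetical ParagraphSplitter class; unlike the record reader, it trims each line, adds no trailing space, and skips empty paragraphs.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitter {
    // Joins consecutive non-blank lines with single spaces; a blank line
    // (or the end of the text) closes the current paragraph.
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                flush(current, result);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        flush(current, result);
        return result;
    }

    // Emits the paragraph collected so far, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<String> result) {
        if (current.length() > 0) {
            result.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line\n\nnext paragraph\n";
        for (String paragraph : paragraphs(text)) {
            System.out.println(paragraph);
        }
    }
}
```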
With the ParagraphRecordReader implementation in place, we need to extend TextInputFormat to create a custom InputFormat, overriding the getRecordReader method so that it returns a ParagraphRecordReader instead of the default reader.
ParagraphInputFormat will look like:
public class ParagraphInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}
Make sure the job configuration uses the custom input format for reading data into MapReduce jobs. This is as simple as setting the input format type to ParagraphInputFormat, as shown below:
conf.setInputFormat(ParagraphInputFormat.class);
With the above changes, we can read paragraphs as input records into MapReduce programs.
Let's assume that the input file contains paragraphs separated by blank lines.
A simple mapper would look like:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    System.out.println(key + " : " + value);
}
Yes, you can use TextInputFormat for your case.
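With TextInputFormat, each map() call receives one line as its value, so the 15 pipe-delimited columns can be split inside the mapper. Here is a minimal sketch of that parsing step in plain Java; PipeLineParser is a hypothetical helper, and the column count of 15 comes from the question.

```java
final class PipeLineParser {
    static final int EXPECTED_COLUMNS = 15;

    // Splits one input line on '|'. The limit of -1 keeps trailing empty
    // columns, which String.split would otherwise drop.
    static String[] parse(String line) {
        String[] columns = line.split("\\|", -1);
        if (columns.length != EXPECTED_COLUMNS) {
            throw new IllegalArgumentException(
                    "Expected " + EXPECTED_COLUMNS + " columns but found " + columns.length);
        }
        return columns;
    }

    public static void main(String[] args) {
        // 14 separators give 15 columns; the last column here is empty.
        String line = "c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11|c12|c13|c14|";
        String[] columns = parse(line);
        System.out.println(columns.length + " columns, last is empty: " + columns[14].isEmpty());
    }
}
```

In a mapper, the same call would be applied to value.toString(); whether a malformed line should fail the task or be skipped is a design choice left open here.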

Read Application Object from GemFire using Spring Data GemFire. Data stored using SpringXD's gemfire-json-server

I'm using the gemfire-json-server module in SpringXD to populate a GemFire grid with json representation of “Order” objects. I understand the gemfire-json-server module saves data in Pdx form in GemFire. I’d like to read the contents of the GemFire grid into an “Order” object in my application. I get a ClassCastException that reads:
java.lang.ClassCastException: com.gemstone.gemfire.pdx.internal.PdxInstanceImpl cannot be cast to org.apache.geode.demo.cc.model.Order
I’m using the Spring Data GemFire libraries to read contents of the cluster. The code snippet to read the contents of the Grid follows:
public interface OrderRepository extends GemfireRepository<Order, String>{
Order findByTransactionId(String transactionId);
}
How can I use Spring Data GemFire to convert data read from the GemFire cluster into an Order object?
Note: The data was initially stored in GemFire using SpringXD's gemfire-json-server-module
Still waiting to hear back from the GemFire PDX engineering team, specifically on Region.get(key), but, interestingly enough, if you annotate your application domain object with...
@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@type")
public class Order ... {
...
}
This works!
Under the hood, I knew the GemFire JSONFormatter class used Jackson's API to un/marshal (de/serialize) JSON data to and from PDX.
However, orderRepository.findOne(ID) and ordersRegion.get(key) still do not function as I would expect. See the updated test class below for more details.
Will report back again when I have more information.
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = GemFireConfiguration.class)
@SuppressWarnings("unused")
public class JsonToPdxToObjectDataAccessIntegrationTest {
    protected static final AtomicLong ID_SEQUENCE = new AtomicLong(0L);
    private Order amazon;
    private Order bestBuy;
    private Order target;
    private Order walmart;
    @Autowired
    private OrderRepository orderRepository;
    @Resource(name = "Orders")
    private com.gemstone.gemfire.cache.Region<Long, Object> orders;
    protected Order createOrder(String name) {
        return createOrder(ID_SEQUENCE.incrementAndGet(), name);
    }
    protected Order createOrder(Long id, String name) {
        return new Order(id, name);
    }
    protected <T> T fromPdx(Object pdxInstance, Class<T> toType) {
        try {
            if (pdxInstance == null) {
                return null;
            }
            else if (toType.isInstance(pdxInstance)) {
                return toType.cast(pdxInstance);
            }
            else if (pdxInstance instanceof PdxInstance) {
                return new ObjectMapper().readValue(JSONFormatter.toJSON(((PdxInstance) pdxInstance)), toType);
            }
            else {
                throw new IllegalArgumentException(String.format("Expected object of type PdxInstance; but was (%1$s)",
                    pdxInstance.getClass().getName()));
            }
        }
        catch (IOException e) {
            throw new RuntimeException(String.format("Failed to convert PDX to object of type (%1$s)", toType), e);
        }
    }
    protected void log(Object value) {
        System.out.printf("Object of Type (%1$s) has Value (%2$s)", ObjectUtils.nullSafeClassName(value), value);
    }
    protected Order put(Order order) {
        Object existingOrder = orders.putIfAbsent(order.getTransactionId(), toPdx(order));
        return (existingOrder != null ? fromPdx(existingOrder, Order.class) : order);
    }
    protected PdxInstance toPdx(Object obj) {
        try {
            return JSONFormatter.fromJSON(new ObjectMapper().writeValueAsString(obj));
        }
        catch (JsonProcessingException e) {
            throw new RuntimeException(String.format("Failed to convert object (%1$s) to JSON", obj), e);
        }
    }
    @Before
    public void setup() {
        amazon = put(createOrder("Amazon Order"));
        bestBuy = put(createOrder("BestBuy Order"));
        target = put(createOrder("Target Order"));
        walmart = put(createOrder("Wal-Mart Order"));
    }
    @Test
    public void regionGet() {
        assertThat((Order) orders.get(amazon.getTransactionId()), is(equalTo(amazon)));
    }
    @Test
    public void repositoryFindOneMethod() {
        log(orderRepository.findOne(target.getTransactionId()));
        assertThat(orderRepository.findOne(target.getTransactionId()), is(equalTo(target)));
    }
    @Test
    public void repositoryQueryMethod() {
        assertThat(orderRepository.findByTransactionId(amazon.getTransactionId()), is(equalTo(amazon)));
        assertThat(orderRepository.findByTransactionId(bestBuy.getTransactionId()), is(equalTo(bestBuy)));
        assertThat(orderRepository.findByTransactionId(target.getTransactionId()), is(equalTo(target)));
        assertThat(orderRepository.findByTransactionId(walmart.getTransactionId()), is(equalTo(walmart)));
    }
    @Region("Orders")
    @JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@type")
    public static class Order implements PdxSerializable {
        protected static final OrderPdxSerializer pdxSerializer = new OrderPdxSerializer();
        @Id
        private Long transactionId;
        private String name;
        public Order() {
        }
        public Order(Long transactionId) {
            this.transactionId = transactionId;
        }
        public Order(Long transactionId, String name) {
            this.transactionId = transactionId;
            this.name = name;
        }
        public String getName() {
            return name;
        }
        public void setName(final String name) {
            this.name = name;
        }
        public Long getTransactionId() {
            return transactionId;
        }
        public void setTransactionId(final Long transactionId) {
            this.transactionId = transactionId;
        }
        @Override
        public void fromData(PdxReader reader) {
            Order order = (Order) pdxSerializer.fromData(Order.class, reader);
            if (order != null) {
                this.transactionId = order.getTransactionId();
                this.name = order.getName();
            }
        }
        @Override
        public void toData(PdxWriter writer) {
            pdxSerializer.toData(this, writer);
        }
        @Override
        public boolean equals(Object obj) {
            if (obj == this) {
                return true;
            }
            if (!(obj instanceof Order)) {
                return false;
            }
            Order that = (Order) obj;
            return ObjectUtils.nullSafeEquals(this.getTransactionId(), that.getTransactionId());
        }
        @Override
        public int hashCode() {
            int hashValue = 17;
            hashValue = 37 * hashValue + ObjectUtils.nullSafeHashCode(getTransactionId());
            return hashValue;
        }
        @Override
        public String toString() {
            return String.format("{ @type = %1$s, id = %2$d, name = %3$s }",
                getClass().getName(), getTransactionId(), getName());
        }
    }
    public static class OrderPdxSerializer implements PdxSerializer {
        @Override
        public Object fromData(Class<?> type, PdxReader in) {
            if (Order.class.equals(type)) {
                return new Order(in.readLong("transactionId"), in.readString("name"));
            }
            return null;
        }
        @Override
        public boolean toData(Object obj, PdxWriter out) {
            if (obj instanceof Order) {
                Order order = (Order) obj;
                out.writeLong("transactionId", order.getTransactionId());
                out.writeString("name", order.getName());
                return true;
            }
            return false;
        }
    }
    public interface OrderRepository extends GemfireRepository<Order, Long> {
        Order findByTransactionId(Long transactionId);
    }
    @Configuration
    protected static class GemFireConfiguration {
        @Bean
        public Properties gemfireProperties() {
            Properties gemfireProperties = new Properties();
            gemfireProperties.setProperty("name", JsonToPdxToObjectDataAccessIntegrationTest.class.getSimpleName());
            gemfireProperties.setProperty("mcast-port", "0");
            gemfireProperties.setProperty("log-level", "warning");
            return gemfireProperties;
        }
        @Bean
        public CacheFactoryBean gemfireCache(Properties gemfireProperties) {
            CacheFactoryBean cacheFactoryBean = new CacheFactoryBean();
            cacheFactoryBean.setProperties(gemfireProperties);
            //cacheFactoryBean.setPdxSerializer(new MappingPdxSerializer());
            cacheFactoryBean.setPdxSerializer(new OrderPdxSerializer());
            cacheFactoryBean.setPdxReadSerialized(false);
            return cacheFactoryBean;
        }
        @Bean(name = "Orders")
        public PartitionedRegionFactoryBean ordersRegion(Cache gemfireCache) {
            PartitionedRegionFactoryBean regionFactoryBean = new PartitionedRegionFactoryBean();
            regionFactoryBean.setCache(gemfireCache);
            regionFactoryBean.setName("Orders");
            regionFactoryBean.setPersistent(false);
            return regionFactoryBean;
        }
        @Bean
        public GemfireRepositoryFactoryBean orderRepository() {
            GemfireRepositoryFactoryBean<OrderRepository, Order, Long> repositoryFactoryBean =
                new GemfireRepositoryFactoryBean<>();
            repositoryFactoryBean.setRepositoryInterface(OrderRepository.class);
            return repositoryFactoryBean;
        }
    }
}
So, as you are aware, GemFire (and by extension, Apache Geode) stores JSON in PDX format (as a PdxInstance). This lets GemFire interoperate with clients written in many different languages (native C++/C#, web-oriented languages such as JavaScript, Python, and Ruby via the Developer REST API, in addition to Java) and also allows OQL queries over the JSON data.
After a bit of experimentation, I am surprised that GemFire is not behaving as I would expect. I created a self-contained example test class (i.e. no Spring XD, of course) that simulates your use case: essentially, storing JSON data in GemFire as PDX and then attempting to read the data back out as the Order application domain object type using the Repository abstraction. Logical enough.
Given the use of the Repository abstraction and implementation from Spring Data GemFire, the infrastructure will attempt to access the application domain object based on the Repository generic type parameter (in this case "Order" from the "OrderRepository" definition).
However, the data is stored in PDX, so now what?
No matter, Spring Data GemFire provides the MappingPdxSerializer class to convert PDX instances back to application domain objects using the same "mapping meta-data" that the Repository infrastructure uses. Cool, so I plug that in...
@Bean
public CacheFactoryBean gemfireCache(Properties gemfireProperties) {
    CacheFactoryBean cacheFactoryBean = new CacheFactoryBean();
    cacheFactoryBean.setProperties(gemfireProperties);
    cacheFactoryBean.setPdxSerializer(new MappingPdxSerializer());
    cacheFactoryBean.setPdxReadSerialized(false);
    return cacheFactoryBean;
}
You will also notice that I set the PDX 'read-serialized' property to false (cacheFactoryBean.setPdxReadSerialized(false);) to ensure data access operations return the domain object and not the PDX instance.
However, this had no effect on the query method. In fact, it had no effect on the following operations either...
orderRepository.findOne(amazonOrder.getTransactionId());
ordersRegion.get(amazonOrder.getTransactionId());
Both calls returned a PdxInstance. Note that the implementation of OrderRepository.findOne(..) is based on SimpleGemfireRepository.findOne(key), which uses GemfireTemplate.get(key), which just performs Region.get(key), and so is effectively the same as ordersRegion.get(amazonOrder.getTransactionId()). This should not be the outcome, especially with Region.get() and read-serialized set to false.
With the OQL query (SELECT * FROM /Orders WHERE transactionId = $1) generated from findByTransactionId(String id), the Repository infrastructure has a bit less control over what the GemFire query engine will return relative to what the caller (OrderRepository) expects (based on the generic type parameter), so running OQL statements could potentially behave differently than direct Region access using get.
Next, I tried modifying the Order type to implement PdxSerializable, to handle the conversion during data access operations (direct Region access with get, OQL, or otherwise). This had no effect.
So, I tried to implement a custom PdxSerializer for Order objects. This had no effect either.
The only thing I can conclude at this point is that something is getting lost in translation between Order -> JSON -> PDX and then from PDX -> Order. Seemingly, GemFire needs additional type meta-data required by PDX (something like @JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@type")) in the JSON data that the JSONFormatter recognizes, though I am not certain it does.
Note that in my test class, I used Jackson's ObjectMapper to serialize the Order to JSON and then GemFire's JSONFormatter to serialize the JSON to PDX, which I suspect Spring XD is doing similarly under the hood. In fact, Spring XD uses Spring Data GemFire and is most likely using the JSON Region Auto Proxy support. That is exactly what SDG's JSONRegionAdvice object does.
Anyway, I have an inquiry out to the rest of the GemFire engineering team. There are also things that could be done in Spring Data GemFire to ensure the PDX data is converted, such as using the MappingPdxSerializer directly to convert the data automatically on behalf of the caller if the data is indeed of type PdxInstance. Similar to how JSON Region Auto Proxying works, you could write an AOP interceptor for the Orders Region to automagically convert PDX to an Order.
Though, I don't think any of this should be necessary, as GemFire should be doing the right thing in this case. Sorry I don't have a better answer right now. Let's see what I find out.
Cheers and stay tuned!
See subsequent post for test code.

Weird error in Hadoop reducer

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text, Text> {
    public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            Neighbourhood n = values.next();
            cachedValues.add(n);
            //correct output
            //output.collect(new Text(n.source), new Text(n.neighbours));
        }
        for (Neighbourhood node : cachedValues) {
            //wrong output
            output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}
The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key that has 19 values (of type Neighbourhood) assigned. When I output source and neighbours inside the while loop, I get the actual 19 different values. However, if I output them after the while loop, as shown in the code, I get 19 identical values. That is, one object gets output 19 times! This is very weird. Does anyone have an idea what is going on?
Here is the code of the Neighbourhood class:
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {
    Text source;
    Text neighbours;
    public Neighbourhood() {
        source = new Text();
        neighbours = new Text();
    }
    public Neighbourhood(String s, String n) {
        source = new Text(s);
        neighbours = new Text(n);
    }
    @Override
    public void readFields(DataInput arg0) throws IOException {
        source.readFields(arg0);
        neighbours.readFields(arg0);
    }
    @Override
    public void write(DataOutput arg0) throws IOException {
        source.write(arg0);
        neighbours.write(arg0);
    }
    @Override
    public int compareTo(Neighbourhood o) {
        return 0;
    }
}
You're being caught out by an efficiency mechanism employed by Hadoop: object reuse.
Your calls to values.next() return the same object reference each time. All Hadoop does behind the scenes is replace the contents of that same object with the underlying bytes (deserialized using the readFields() method).
To avoid this, you'll need to create deep copies of the object returned from values.next(). Hadoop actually has a utility method to do this for you, ReflectionUtils.copy. A simple fix would be as follows:
while (values.hasNext()) {
    Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
    // copy(conf, source, destination) deserializes the current value into the fresh instance
    ReflectionUtils.copy(conf, values.next(), n);
    cachedValues.add(n);
}
You'll need to keep a reference to the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
private JobConf conf;
@Override
public void configure(JobConf job) {
    conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.
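The effect can be reproduced without Hadoop. In this minimal sketch, Holder and reusingIterator are hypothetical stand-ins for a Writable value and the reducer's values iterator; caching the references yields the last value repeated, while caching copies preserves each value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ObjectReuseDemo {
    // Mutable holder, standing in for a Writable value.
    static final class Holder {
        String value;

        Holder copy() {
            Holder h = new Holder();
            h.value = value;
            return h;
        }
    }

    // Iterator that hands out the same Holder every time, overwriting its contents.
    static Iterator<Holder> reusingIterator(List<String> data) {
        final Holder shared = new Holder();
        final Iterator<String> source = data.iterator();
        return new Iterator<Holder>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public Holder next() {
                shared.value = source.next();
                return shared;
            }
        };
    }

    // Caches the references returned by the iterator: every entry is the same object.
    static List<String> cacheReferences(List<String> data) {
        List<Holder> cached = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) {
            cached.add(it.next());
        }
        return values(cached);
    }

    // Caches a copy of each element before the iterator overwrites it.
    static List<String> cacheCopies(List<String> data) {
        List<Holder> cached = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) {
            cached.add(it.next().copy());
        }
        return values(cached);
    }

    private static List<String> values(List<Holder> holders) {
        List<String> out = new ArrayList<>();
        for (Holder h : holders) {
            out.add(h.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println("references: " + cacheReferences(data));
        System.out.println("copies:     " + cacheCopies(data));
    }
}
```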

custom converter in XStream

I am using XStream to serialize my objects to XML. The XML that I get is shown below; node1, node2, and node3 are attributes of the POJO DetailDollars.
I have a requirement to calculate a percentage, for example 100 / 25, and add it as a new node to the existing ones. So, the final output should be:
<DetailDollars>
<node1>100 </node1>
<node2>25</node2>
<node3>10</node3>
</DetailDollars>
I wrote a custom converter and registered it with my XStream object.
public void marshal(..){
writer.startNode("node4");
writer.setValue(getNode1()/ getnode2() );
writer.endNode();
}
But, the xml stream I get has only the new node:
<DetailDollars>
<node4>4</node4>
</DetailDollars>
I am not sure which XStream API would give me the desired format. Could you please help me with this?
Here is the converter you need:
public class DetailDollarsConverter extends ReflectionConverter {
    public DetailDollarsConverter(Mapper mapper,
            ReflectionProvider reflectionProvider) {
        super(mapper, reflectionProvider);
    }
    @Override
    public void marshal(Object obj, HierarchicalStreamWriter writer,
            MarshallingContext context) {
        // Let the reflection converter write the existing nodes first, then append node4.
        super.marshal(obj, writer, context);
        DetailDollars dl = (DetailDollars) obj;
        writer.startNode("node4");
        writer.setValue(Double.toString(dl.getNode1() / dl.getNode2()));
        writer.endNode();
    }
    @Override
    public Object unmarshal(HierarchicalStreamReader reader,
            UnmarshallingContext context) {
        return super.unmarshal(reader, context);
    }
    @SuppressWarnings("unchecked")
    @Override
    public boolean canConvert(Class clazz) {
        return clazz.equals(DetailDollars.class);
    }
}
