Hadoop/MapReduce: Reading and writing classes generated from DDL

Can someone walk me through the basic workflow of reading and writing data with classes generated from DDL?
I have defined some struct-like records using DDL. For example:
class Customer {
    ustring FirstName;
    ustring LastName;
    ustring CardNo;
    long LastPurchase;
}
I've compiled this to get a Customer class and included it in my project. I can easily see how to use this as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to a file.
The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in Binary, CSV or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? What InputFormat would I use to read the resulting files back in later, if I wanted to perform analysis over them?

OK, so I think I have this figured out. I'm not sure if it is the most straightforward way, so please correct me if you know a simpler workflow.
Every class generated from DDL implements the Record interface, and consequently provides two methods:
serialize(RecordOutput out) for writing
deserialize(RecordInput in) for reading
RecordOutput and RecordInput are utility interfaces provided in the org.apache.hadoop.record package. There are a few implementations of each (e.g. BinaryRecordOutput, CsvRecordOutput and XmlRecordOutput, with matching *RecordInput counterparts).
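Used outside of MapReduce, the pair works directly against a stream. A minimal sketch, assuming the generated Customer class from above (populating the fields via the generated setters is elided):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.hadoop.record.CsvRecordInput;
import org.apache.hadoop.record.CsvRecordOutput;

public class CustomerSerializationDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        // Write one Customer record as a CSV entry.
        Customer customer = new Customer();
        // ... populate fields via the generated setters ...
        customer.serialize(new CsvRecordOutput(bytes));

        // Read it back from the same bytes.
        CsvRecordInput in = new CsvRecordInput(new ByteArrayInputStream(bytes.toByteArray()));
        Customer copy = new Customer();
        copy.deserialize(in);
    }
}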
As far as I know, you have to implement your own OutputFormat or InputFormat classes to use these. This is fairly easy to do.
For example, the OutputFormat I talked about in the original question (one that writes Integer keys and Customer values in CSV format) would be implemented like this:
private static class CustomerOutputFormat
        extends TextOutputFormat<IntWritable, Customer> {

    public RecordWriter<IntWritable, Customer> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new CustomerRecordWriter(fileOut);
    }

    protected static class CustomerRecordWriter
            implements RecordWriter<IntWritable, Customer> {

        protected DataOutputStream outStream;

        public CustomerRecordWriter(DataOutputStream out) {
            this.outStream = out;
        }

        public synchronized void write(IntWritable key, Customer value) throws IOException {
            // Write the key with a tag, then let the record serialize itself.
            CsvRecordOutput csvOutput = new CsvRecordOutput(outStream);
            csvOutput.writeInt(key.get(), "id");
            value.serialize(csvOutput);
        }

        public synchronized void close(Reporter reporter) throws IOException {
            outStream.close();
        }
    }
}
Creating the InputFormat is much the same. Because the CSV format is one record per line, we can use a LineRecordReader internally to do most of the work.
private static class CustomerInputFormat extends FileInputFormat<IntWritable, Customer> {

    public RecordReader<IntWritable, Customer> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new CustomerRecordReader(job, (FileSplit) genericSplit);
    }

    private class CustomerRecordReader implements RecordReader<IntWritable, Customer> {

        private LineRecordReader lrr;

        public CustomerRecordReader(Configuration job, FileSplit split) throws IOException {
            this.lrr = new LineRecordReader(job, split);
        }

        public IntWritable createKey() {
            return new IntWritable();
        }

        public Customer createValue() {
            return new Customer();
        }

        public synchronized boolean next(IntWritable key, Customer value) throws IOException {
            LongWritable offset = new LongWritable();
            Text line = new Text();
            if (!lrr.next(offset, line))
                return false;
            // Parse the line back out of CSV: the tagged key first, then the record.
            CsvRecordInput cri = new CsvRecordInput(
                    new ByteArrayInputStream(line.toString().getBytes()));
            key.set(cri.readInt("id"));
            value.deserialize(cri);
            return true;
        }

        public float getProgress() {
            return lrr.getProgress();
        }

        public synchronized long getPos() throws IOException {
            return lrr.getPos();
        }

        public synchronized void close() throws IOException {
            lrr.close();
        }
    }
}
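With both classes in place, wiring them into a job is the usual JobConf configuration. A minimal sketch against the old org.apache.hadoop.mapred API used above (the driver class and paths are placeholders):

// First job: write IntWritable/Customer pairs as CSV.
JobConf conf = new JobConf(CustomerJob.class);   // CustomerJob is a placeholder driver class
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Customer.class);
conf.setOutputFormat(CustomerOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/data/customers-csv"));

// A later analysis job: read the CSV files back in.
conf.setInputFormat(CustomerInputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/data/customers-csv"));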

Related

Cannot Write Data to ElasticSearch with AbstractReactiveElasticsearchConfiguration

I am trying to write data to my local Elasticsearch Docker container (7.4.2). For simplicity I used the AbstractReactiveElasticsearchConfiguration from Spring, also overriding the entityMapper function. Then I constructed my repository extending ReactiveElasticsearchRepository.
In the end I used my autowired repository to saveAll() my collection of elements containing the data. However, Elasticsearch doesn't write any data. I also have a REST controller which starts the whole process, basically returning nothing: DeferredResult<ResponseEntity<Void>>.
The REST method coming from my ApiDelegateImpl
@Override
public DeferredResult<ResponseEntity<Void>> openUsageExporterStartPost() {
    final DeferredResult<ResponseEntity<Void>> deferredResult = new DeferredResult<>();
    ForkJoinPool.commonPool().execute(() -> {
        try {
            openUsageExporterAdapter.startExport();
            deferredResult.setResult(ResponseEntity.accepted().build());
        } catch (Exception e) {
            deferredResult.setErrorResult(e);
        }
    });
    return deferredResult;
}
My Elasticsearch Configuration
@Configuration
public class ElasticSearchConfig extends AbstractReactiveElasticsearchConfiguration {

    @Value("${spring.data.elasticsearch.client.reactive.endpoints}")
    private String elasticSearchEndpoint;

    @Bean
    @Override
    public EntityMapper entityMapper() {
        final ElasticsearchEntityMapper entityMapper = new ElasticsearchEntityMapper(elasticsearchMappingContext(), new DefaultConversionService());
        entityMapper.setConversions(elasticsearchCustomConversions());
        return entityMapper;
    }

    @Override
    public ReactiveElasticsearchClient reactiveElasticsearchClient() {
        ClientConfiguration clientConfiguration = ClientConfiguration.builder()
                .connectedTo(elasticSearchEndpoint)
                .build();
        return ReactiveRestClients.create(clientConfiguration);
    }
}
My Repository
public interface OpenUsageRepository extends ReactiveElasticsearchRepository<OpenUsage, Long> {
}
My DTO
@Data
@Document(indexName = "open_usages", type = "open_usages")
@TypeAlias("OpenUsage")
public class OpenUsage {

    @Field(name = "id")
    @Id
    private Long id;
    ......
}
My Adapter Implementation
@Autowired
private final OpenUsageRepository openUsageRepository;

...transform entity into OpenUsage...

public void doSomething(final List<OpenUsage> openUsages){
    openUsageRepository.saveAll(openUsages)
}
And finally my IT test
@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
@Testcontainers
@TestPropertySource(locations = {"classpath:application-it.properties"})
@ContextConfiguration(initializers = OpenUsageExporterApplicationIT.Initializer.class)
class OpenUsageExporterApplicationIT {

    @LocalServerPort
    private int port;

    private final static String STARTCALL = "http://localhost:%s/open-usage-exporter/start/";

    @Container
    private static ElasticsearchContainer container = new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:6.8.4").withExposedPorts(9200);

    static class Initializer implements ApplicationContextInitializer<ConfigurableApplicationContext> {
        @Override
        public void initialize(final ConfigurableApplicationContext configurableApplicationContext) {
            final List<String> pairs = new ArrayList<>();
            pairs.add("spring.data.elasticsearch.client.reactive.endpoints=" + container.getContainerIpAddress() + ":" + container.getFirstMappedPort());
            pairs.add("spring.elasticsearch.rest.uris=http://" + container.getContainerIpAddress() + ":" + container.getFirstMappedPort());
            TestPropertyValues.of(pairs).applyTo(configurableApplicationContext);
        }
    }

    @Test
    void testExportToES() throws IOException, InterruptedException {
        final List<OpenUsageEntity> openUsageEntities = dbPreparator.insertTestData();
        assertTrue(openUsageEntities.size() > 0);
        final String result = executeRestCall(STARTCALL);
        // Awaitility here tells me nothing is in ElasticSearch :(
    }

    private String executeRestCall(final String urlTemplate) throws IOException {
        final String url = String.format(urlTemplate, port);
        final HttpUriRequest request = new HttpPost(url);
        final HttpResponse response = HttpClientBuilder.create().build().execute(request);
        // Get the result.
        return EntityUtils.toString(response.getEntity());
    }
}
public void doSomething(final List<OpenUsage> openUsages){
openUsageRepository.saveAll(openUsages)
}
This lacks a semicolon at the end, so it should not compile.
But I assume this is just a typo, and there is a semicolon in reality.
Anyway, saveAll() returns a Flux. This Flux is just a recipe for saving your data, and it is not 'executed' until subscribe() is called by someone (or something like blockLast()). You just throw that Flux away, so the saving never gets executed.
How to fix this? One option is to add .blockLast() call:
openUsageRepository.saveAll(openUsages).blockLast();
But this will save the data in a blocking way, effectively defeating the reactivity.
Another option, if the code you are calling saveAll() from supports reactivity, is to just return the Flux returned by saveAll(); but as your doSomething() has a void return type, this is doubtful.
It is not clear how your startExport() connects to doSomething() anyway, but it looks like your calling code does not use any notion of reactivity. So a real solution would be to either rewrite the calling code reactively (obtain a Publisher and subscribe() to it, then wait until the data arrives), or revert to the blocking API (ElasticsearchRepository instead of ReactiveElasticsearchRepository).
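For illustration, a reactive variant of the adapter method could hand the Flux back to the caller instead of discarding it. A sketch (the logger and the caller snippets are assumptions, not your code):

// Return the Flux so the caller decides how to subscribe.
public Flux<OpenUsage> doSomething(final List<OpenUsage> openUsages) {
    return openUsageRepository.saveAll(openUsages);
}

// A blocking caller waits for the last saved element:
doSomething(openUsages).blockLast();

// An asynchronous caller subscribes and reacts to completion:
doSomething(openUsages)
    .doOnError(e -> log.error("export failed", e))
    .doOnComplete(() -> log.info("export finished"))
    .subscribe();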

JSON-B serializes Map keys using toString and not with registered Adapter

I have a JAX-RS service that returns a Map<Artifact, String> and I have registered a
public class ArtifactAdapter implements JsonbAdapter<Artifact, String>
which I see being hit when deserializing the in-parameter, but not when serializing the return value; instead Artifact's toString() is used. If I change the return type to a plain Artifact, the adapter is called. I was under the impression that the Map would be serialized with the built-in machinery and the adapter would then be called for each Artifact key.
What would be the workaround? Register an Adapter for the whole Map?
I dumped the thread stack in my toString() and it confirms my suspicions:
at java.lang.Thread.dumpStack(Thread.java:1336)
Artifact.toString(Artifact.java:154)
at java.lang.String.valueOf(String.java:2994)
at org.eclipse.yasson.internal.serializer.MapSerializer.serializeInternal(MapSerializer.java:41)
at org.eclipse.yasson.internal.serializer.MapSerializer.serializeInternal(MapSerializer.java:30)
at org.eclipse.yasson.internal.serializer.AbstractContainerSerializer.serialize(AbstractContainerSerializer.java:63)
at org.eclipse.yasson.internal.Marshaller.serializeRoot(Marshaller.java:118)
at org.eclipse.yasson.internal.Marshaller.marshall(Marshaller.java:74)
at org.eclipse.yasson.internal.JsonBinding.toJson(JsonBinding.java:98)
Is the serializer hell-bent on using toString() at this point?
I tried
public class Person {
    private String name;

    public Person(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}
public class PersonAdapter implements JsonbAdapter<Person, String> {
    @Override
    public String adaptToJson(Person obj) throws Exception {
        return obj.getName();
    }

    @Override
    public Person adaptFromJson(String obj) throws Exception {
        return new Person(obj);
    }
}
public class Test {
    public static void main(String[] args) {
        Map<Person, Integer> data = new HashMap<>();
        data.put(new Person("John"), 23);
        JsonbConfig config = new JsonbConfig().withAdapters(new PersonAdapter());
        Jsonb jsonb = JsonbBuilder.create(config);
        System.out.println(jsonb.toJson(data, new HashMap<Person, Integer>() {
        }.getClass().getGenericSuperclass()));
    }
}
but still ended up with the toString() of Person
Thanks in advance,
Nik
This turns out to be https://github.com/eclipse-ee4j/yasson/issues/110 (relevant in my case since Yasson is the default JSON-B provider for WildFly).
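As the question suggests, one workaround while that issue is open is to register an adapter for the whole Map, so the keys are converted before the MapSerializer ever sees them. A minimal, untested sketch:

import java.util.LinkedHashMap;
import java.util.Map;
import javax.json.bind.adapter.JsonbAdapter;

// Sketch: adapt Map<Person, Integer> to Map<String, Integer> so JSON-B
// never has to serialize a Person key itself.
public class PersonMapAdapter
        implements JsonbAdapter<Map<Person, Integer>, Map<String, Integer>> {

    @Override
    public Map<String, Integer> adaptToJson(Map<Person, Integer> original) {
        Map<String, Integer> adapted = new LinkedHashMap<>();
        original.forEach((person, value) -> adapted.put(person.getName(), value));
        return adapted;
    }

    @Override
    public Map<Person, Integer> adaptFromJson(Map<String, Integer> adapted) {
        Map<Person, Integer> original = new LinkedHashMap<>();
        adapted.forEach((name, value) -> original.put(new Person(name), value));
        return original;
    }
}

Register it with new JsonbConfig().withAdapters(new PersonMapAdapter()) and pass the runtime type to toJson(), as in the test above.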

Multiple column names in Spring Data

Model
@Column(name="Desc", name="des", name="DS")
private String description;
How can I specify multiple names for a column, so that whichever one is found gets mapped to description?
How can I specify multiple names for a column?
You can't. A database does not allow columns to have multiple names, hence you can't map multiple column names to a class field.
so that whichever one is found gets mapped to description?
If you have multiple stored procedures that return "Desc", "des", "DS" in their respective result sets and you need to map these to the same Java class, you need to define different row mappers and describe the mapping there.
For example let's say you have SP1 and SP2 and you want both result sets from those to be mapped to ResultDto.
ResultDto looks like:
public class ResultDto {
    private String name; // always maps to DB column "name"
    private String desc; // maps to different DB columns - "ds", "desc"
    // omitted...
}
You can define a base row mapper that handles the mapping for the fields that overlap across all the stored procedures' result sets.
Code Example:
protected static abstract class BaseRowMapper implements RowMapper<ResultDto> {

    public abstract ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException;

    protected void mapBase(ResultSet rs, ResultDto resultDto) throws SQLException {
        resultDto.setName(rs.getString("name")); // map all overlapping fields/columns here
    }
}

private static class SP1RowMapper extends BaseRowMapper {
    @Override
    public ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException {
        ResultDto resultDto = new ResultDto();
        mapBase(rs, resultDto);
        resultDto.setDescription(rs.getString("ds"));
        return resultDto;
    }
}

private static class SP2RowMapper extends BaseRowMapper {
    @Override
    public ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException {
        ResultDto resultDto = new ResultDto();
        mapBase(rs, resultDto);
        resultDto.setDescription(rs.getString("desc"));
        return resultDto;
    }
}
I don't know how you call the Stored Procedures, but if you use Spring's SimpleJdbcCall, the code will look like:
new SimpleJdbcCall(datasource)
        .withProcedureName("SP NAME")
        .declareParameters(
                // Stored Proc params
        )
        .returningResultSet("result set id", rowMapperInstance);
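Executing the call and retrieving the mapped list might then look like this. A sketch: "result set id" must match the key passed to returningResultSet(), and "SP NAME" is a placeholder as above:

Map<String, Object> out = new SimpleJdbcCall(datasource)
        .withProcedureName("SP NAME")
        .returningResultSet("result set id", new SP1RowMapper())
        .execute(new MapSqlParameterSource());

// The result set registered above comes back under its id.
@SuppressWarnings("unchecked")
List<ResultDto> results = (List<ResultDto>) out.get("result set id");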

How to convert Oracle user defined Type into java object in spring jdbc stored procedure

I am working with Spring's JdbcTemplate, and all DB calls are done through stored procedures. In Oracle 11g I have created one user-defined type containing another type as a field, as below.
create or replace type WORKER AS Object (
    NAME VARCHAR2(30),
    age NUMBER
);
create or replace type WORKER_LIST IS TABLE OF WORKER;
create or replace type MANAGER AS Object (
    NAME VARCHAR2(30),
    workers WORKER_LIST
);
And at Java side I have created the classes as follows.
public class Worker implements SQLData {
    private String name;
    private int age;

    @Override
    public String getSQLTypeName() throws SQLException {
        return "WORKER";
    }

    @Override
    public void readSQL(SQLInput stream, String typeName) throws SQLException {
        setName(stream.readString());
        setAge(stream.readInt());
    }

    @Override
    public void writeSQL(SQLOutput stream) throws SQLException {
        stream.writeString(getName());
        stream.writeInt(getAge());
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}
public class Manager implements SQLData {
    private String name;
    private List<Worker> workers;

    @Override
    public String getSQLTypeName() throws SQLException {
        return "Manager";
    }

    @Override
    public void readSQL(SQLInput stream, String typeName) throws SQLException {
        setName(stream.readString());
        setWorkers((List<Worker>) stream.readObject());
    }

    @Override
    public void writeSQL(SQLOutput stream) throws SQLException {
        stream.writeString(getName());
        stream.writeObject((SQLData) getWorkers());
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public List<Worker> getWorkers() {
        return workers;
    }

    public void setWorkers(List<Worker> workers) {
        this.workers = workers;
    }
}
I have specified the mappings in the connection's type map, but I am not getting the expected results: the Worker type is returned as a Struct, and List<Worker> is returned as an Array.
Please let me know what I should do and what the standard approach is to get the objects mapped as described above. I'm new to JdbcTemplate. Please suggest.
Thanks
Ram
I think I've managed to get something working.
You mentioned something about the connection's type map. When using Spring it's difficult to get hold of the database connection in order to add the types to its type map, so I'm not sure exactly what you mean when you say you have specified the mappings in the type map.
Spring offers one way to add an entry to the connection's type map, in the form of the class SqlReturnSqlData. This can be used to call a stored procedure or function which returns a user-defined type. It adds an entry to the connection's type map to specify the database type of the object and the class to map this object to just before it retrieves a value from a CallableStatement. However, this only works if you only need to map a single type. You have two such types that need mapping: MANAGER and WORKER.
Fortunately, it's not difficult to come up with a replacement for SqlReturnSqlData that can add more than one entry to the connection's type map:
import org.springframework.jdbc.core.SqlReturnType;

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;

public class SqlReturnSqlDataWithAuxiliaryTypes implements SqlReturnType {

    private Class<?> targetClass;
    private Map<String, Class<?>> auxiliaryTypes;

    public SqlReturnSqlDataWithAuxiliaryTypes(Class<?> targetClass, Map<String, Class<?>> auxiliaryTypes) {
        this.targetClass = targetClass;
        this.auxiliaryTypes = auxiliaryTypes;
    }

    @Override
    public Object getTypeValue(CallableStatement cs, int paramIndex, int sqlType, String typeName) throws SQLException {
        // Register the main type plus any auxiliary types just before reading the value.
        Connection con = cs.getConnection();
        Map<String, Class<?>> typeMap = con.getTypeMap();
        typeMap.put(typeName, this.targetClass);
        typeMap.putAll(auxiliaryTypes);
        return cs.getObject(paramIndex);
    }
}
The above has been adapted from the source of SqlReturnSqlData. All I've really done is added an extra field auxiliaryTypes, the contents of which gets added into the connection's type map in the call to getTypeValue().
I also needed to adjust the readSQL method of your Manager class. The object you read back from the stream will be an implementation of java.sql.Array. You can't just cast this to a list. Sadly, getting this out is a little fiddly:
@Override
public void readSQL(SQLInput stream, String typeName) throws SQLException {
    setName(stream.readString());
    Array array = (Array) stream.readObject();
    Object[] objects = (Object[]) array.getArray();
    List<Worker> workers = Arrays.stream(objects).map(o -> (Worker) o).collect(toList());
    setWorkers(workers);
}
(If you're not using Java 8, replace the line with Arrays.stream(...) with a loop.)
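That loop would look something like this:

// Pre-Java 8 equivalent of the Arrays.stream(...) line above.
List<Worker> workers = new ArrayList<Worker>();
for (Object o : objects) {
    workers.add((Worker) o);
}
setWorkers(workers);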
To test this I wrote a short stored function to return a MANAGER object:
CREATE OR REPLACE FUNCTION f_get_manager
RETURN manager
AS
BEGIN
    RETURN manager('Big Boss Man', worker_list(worker('Bill', 40), worker('Fred', 36)));
END;
/
The code to call this stored function was then as follows:
Map<String, Class<?>> auxiliaryTypes = Collections.singletonMap("WORKER", Worker.class);
SimpleJdbcCall jdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withSchemaName("my_schema")
        .withFunctionName("f_get_manager")
        .declareParameters(
                new SqlOutParameter(
                        "return",
                        OracleTypes.STRUCT,
                        "MANAGER",
                        new SqlReturnSqlDataWithAuxiliaryTypes(Manager.class, auxiliaryTypes)));
Manager manager = jdbcCall.executeFunction(Manager.class);
// ... do something with manager.
This worked, in that it returned a Manager object with two Workers in it.
Finally, if you have stored procedures that save a Manager object to the database, be aware that your Manager class's writeSQL method will not work. Unless you've written your own List implementation, List<Worker> cannot be cast to SQLData. Instead, you'll need to create an Oracle array object and put the entries in that. That however is awkward, because you'll need the database connection to create the array, but that won't be available in the writeSQL method. See this question for one possible solution.

RowMapper returns the list, but execute() returns the list size as 1?

Please find my sample code below. The row mapper returns a list; when printed, it gives me the number of rows in the DB. But when I check (List) employeeDaomap.get("allEmployees") I get a list of size 1, with all the rows as a single item. Why, and what is wrong in the implementation?
Also, the Spring docs say not to use rs.next(), so how do we get the list of values from the DB?
public class MyTestDAO extends StoredProcedure {

    /** The log. */
    static Logger log = Logger.getLogger(MyTestDAO.class);

    private static final String SPROC_NAME = "TestSchema.PKG_Test.prc_get_employee_list";

    TestRowMapper mapper = new TestRowMapper();

    public MyTestDAO(DataSource dataSource) {
        super(dataSource, SPROC_NAME);
        declareParameter(new SqlOutParameter("allEmployees", OracleTypes.CURSOR, mapper));
        compile();
    }

    /**
     * Gets the employee list data from the DB.
     */
    public List<EmployeeDAO> getEmployeeList() throws Exception {
        Map<String, Object> employeeDaomap = execute();
        log.info("employeeDaomap after execute = " + employeeDaomap);
        log.info("employeeDaomap after execute size = " + employeeDaomap.size()); // expected 1
        List<EmployeeDAO> list = (List<EmployeeDAO>) employeeDaomap.get("allEmployees");
        log.info("size of the list = " + list.size()); // need to get the size of the list
        return list;
    }

    private Map<String, Object> execute() {
        return super.execute(new HashMap<String, Object>());
    }
}
public class TestRowMapper implements RowMapper<List<EmployeeDAO>> {

    static Logger log = Logger.getLogger(TestRowMapper.class);

    @Override
    public List<EmployeeDAO> mapRow(ResultSet rs, int rowNum) throws SQLException {
        rs.setFetchSize(3000);
        List<EmployeeDAO> responseItems = new ArrayList<EmployeeDAO>();
        EmployeeDAO responseItem = null;
        log.info("row num " + rowNum);
        while (rs.next()) {
            responseItem = new EmployeeDAO();
            responseItem.setID(rs.getString("id"));
            responseItem.setName(rs.getString("name"));
            responseItem.setDesc(rs.getString("desc"));
            responseItems.add(responseItem);
        }
        log.info("TestRowMapper items = " + responseItems);
        return responseItems;
    }
}
The problem is that mapRow() is invoked by Spring once per row, but the implementation above consumes the entire ResultSet inside a single call, so Spring sees only one "row" whose value is the whole list. The solution is to implement ResultSetExtractor instead of RowMapper and provide the implementation in extractData(), which is handed the complete ResultSet and may iterate it freely.
public class TestRowMapper implements ResultSetExtractor<List<EmployeeDAO>> {

    static Logger log = Logger.getLogger(TestRowMapper.class);

    @Override
    public List<EmployeeDAO> extractData(ResultSet rs) throws SQLException, DataAccessException {
        rs.setFetchSize(3000);
        List<EmployeeDAO> responseItems = new ArrayList<EmployeeDAO>();
        while (rs.next()) {
            EmployeeDAO responseItem = new EmployeeDAO();
            responseItem.setID(rs.getString("id"));
            responseItem.setName(rs.getString("name"));
            responseItem.setDesc(rs.getString("desc"));
            responseItems.add(responseItem);
        }
        log.info("TestRowMapper items = " + responseItems);
        return responseItems;
    }
}
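For the extractor to be used, the out-parameter declaration in the DAO also has to hand it to Spring; SqlOutParameter has a constructor overload that accepts a ResultSetExtractor, so the constructor of MyTestDAO becomes (a sketch against the DAO above):

public MyTestDAO(DataSource dataSource) {
    super(dataSource, SPROC_NAME);
    // Spring calls extractData() once with the whole cursor,
    // instead of mapRow() once per row.
    declareParameter(new SqlOutParameter("allEmployees", OracleTypes.CURSOR, new TestRowMapper()));
    compile();
}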
