How to read specific fields from an Avro-Parquet file in Java?

How can I read a subset of fields from an Avro-Parquet file in Java?
I thought I could define an Avro schema that is a subset of the stored records and then read them... but I get an exception.
Here is how I tried to solve it.
I have two Avro schemas:
ClassA
ClassB
The fields of ClassB are a subset of the fields of ClassA.
final Builder<ClassB> builder = AvroParquetReader.builder(files[0].getPath());
final ParquetReader<ClassB> reader = builder.build();
//AvroParquetReader<ClassA> readerA = new AvroParquetReader<ClassA>(files[0].getPath());
ClassB record = null;
final List<ClassB> list = new ArrayList<>();
while ((record = reader.read()) != null) {
    list.add(record);
}
But I get a ClassCastException on the line (record = reader.read()): Cannot convert ClassA to ClassB.
I suppose the reader is reading the schema from the file.
I tried to pass in the model (i.e. builder.withModel), but since ClassB extends org.apache.avro.specific.SpecificRecordBase it throws an exception.
I even tried to set the schema in the configuration and pass it through builder.withConf, but no cigar...

So...
A couple of things:
AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$) can be used to set a projection for the columns that are selected.
The reader.read() method will still return a ClassA object, but it will null out the fields that are not present in ClassB.
To use the reader directly you can do the following:
AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);
final Builder<ClassA> builder = AvroParquetReader.builder(files[0].getPath());
final ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build();
ClassA record = null;
final List<ClassA> list = new ArrayList<>();
while ((record = reader.read()) != null) {
    list.add(record);
}
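If generating a separate ClassB is inconvenient, the same projection idea also works with a hand-built schema and the generic Avro data model. This is a minimal sketch under that assumption; the field names are hypothetical, not from the original post:
Configuration hadoopConf = new Configuration();
// Build a projection schema containing only the columns we want to read.
Schema projection = SchemaBuilder.record("ClassB").fields()
        .optionalString("someField")      // hypothetical field kept by the projection
        .optionalLong("anotherField")     // hypothetical field kept by the projection
        .endRecord();
AvroReadSupport.setRequestedProjection(hadoopConf, projection);
try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(files[0].getPath())
        .withConf(hadoopConf)
        .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record.get("someField"));
    }
}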
Also, if you're planning to use an InputFormat to read the Avro-Parquet file, there is a convenience method; here is a Spark example:
final Job job = Job.getInstance(hadoopConf);
ParquetInputFormat.setInputPaths(job, pathGlob);
AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);
@SuppressWarnings("unchecked")
final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(), AvroParquetInputFormat.class,
        Void.class, ClassA.class);
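The resulting pair RDD has Void keys, so typically only the values are of interest; a hedged usage sketch:
// Keep only the record values; each ClassA carries just the projected fields.
JavaRDD<ClassA> records = rdd.values();
System.out.println("record count: " + records.count());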

Related

How to Read Records From Any Database Table and Export As TextFile Using Spring Batch

I am building a spring batch job that will be invoked through a webservice. The webservice will take a list of select and delete statement pairs. The records returned by the select statement will be saved as a CSV on the filesystem and then those same records will be deleted by executing the supplied delete statement.
I have seen a number of ColumnRowMapper examples but that requires me to create a POJO for each table entity. I am looking for a solution that will handle any column from any table. Any suggestions on approach?
****UPDATE****
Since writing this post, I've landed on the following solution.
@Bean
@StepScope
public JdbcCursorItemReader<Map<String, ?>> getRowsOfDataForExportFromTable() {
    JdbcCursorItemReader<Map<String, ? extends Object>> databaseReader = new JdbcCursorItemReader<>();
    databaseReader.setDataSource(jdbcTemplate.getDataSource());
    databaseReader.setSql("select * from SOME_TABLE where last_updated_date < DATE_SUB(NOW(), INTERVAL 10 DAY);");
    databaseReader.setRowMapper(new RowMapper<Map<String, ? extends Object>>() {
        @Override
        public Map<String, ? extends Object> mapRow(ResultSet resultSet, int i) throws SQLException {
            Map<String, String> resultMap = new LinkedHashMap<>();
            int numOfColumns = resultSet.getMetaData().getColumnCount();
            for (int j = 1; j < numOfColumns + 1; j++) {
                String columnName = resultSet.getMetaData().getColumnName(j);
                String value = resultSet.getString(j);
                resultMap.put(columnName, value);
            }
            return resultMap;
        }
    });
    return databaseReader;
}
The above ItemReader uses a row mapper that builds a LinkedHashMap for each row, with the column name as the key and the column value as the value.
Did you try to use a Map instead of a POJO? You can fill it dynamically in the reader and then create the CSV file from this Map.
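Building on that idea, the writer side can stay just as generic. Here is a minimal sketch (the bean name, output path, and header-less CSV format are assumptions, not part of the original post) using Spring Batch's FlatFileItemWriter; note it does no CSV quoting or escaping:
@Bean
@StepScope
public FlatFileItemWriter<Map<String, ?>> getCsvWriterForExport() {
    FlatFileItemWriter<Map<String, ?>> writer = new FlatFileItemWriter<>();
    writer.setResource(new FileSystemResource("/tmp/export.csv")); // hypothetical output path
    // Each Map row from the reader above becomes one comma-separated line;
    // the LinkedHashMap preserves the column order of the result set.
    writer.setLineAggregator(row -> row.values().stream()
            .map(value -> value == null ? "" : String.valueOf(value))
            .collect(Collectors.joining(",")));
    return writer;
}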

How does Chronicle-wire support schema evolution?

I am new to Chronicle-wire. In the documentation, it claims support for "setting of fields to the default, if not available" in the schema evolution section.
Do we have an example of how this works?
I have an example of adding an array field to a simple Marshallable object. When the journal being read contains the old version of the object, how can we set a default value (e.g. new String[0]) for the field instead of null?
There are a few ways to achieve that; one example is below:
public class TestMarshallable implements Marshallable {
    private long a;
    private int b;
    private String newField = "defaultValue";

    @Override
    public void readMarshallable(@NotNull WireIn wire) throws IORuntimeException {
        a = wire.read("a").int64();
        b = wire.read("b").int32();
        if (wire.bytes().readRemaining() > 0)
            newField = wire.read("newField").text();
    }
}
In this example, it is assumed that your new field is written last, hence you can simply check whether there is more to read, and do so. The default value is whatever you assign to the field.
A more complicated, but far more flexible, way:
public class TestMarshallable implements Marshallable {
    private long a = 0;
    private int b = 1;
    private String newField = "defaultValue";

    @Override
    public void readMarshallable(@NotNull WireIn wire) throws IORuntimeException {
        @NotNull StringBuilder name = new StringBuilder();
        while (!wire.isEmpty()) {
            @NotNull ValueIn in = wire.read(name);
            if (StringUtils.isEqual(name, "a"))
                a = in.int64();
            else if (StringUtils.isEqual(name, "b"))
                b = in.int32();
            else if (StringUtils.isEqual(name, "newField"))
                newField = in.text();
            else
                unexpectedField(name, in);
            wire.consumePadding();
        }
    }
}
In the last example, readMarshallable simply overwrites the fields it can find in the stream, leaving the others at their default values. (NB: this can also be used to save a certain amount of writing; if you often write default values, you can skip them altogether in writeMarshallable, as sketched below.)
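A hedged sketch of the writeMarshallable counterpart mentioned above (the default-value check is an assumption, not from the original answer):
@Override
public void writeMarshallable(@NotNull WireOut wire) {
    wire.write("a").int64(a);
    wire.write("b").int32(b);
    // Skip the field when it still holds its default value to save space on the wire.
    if (!"defaultValue".equals(newField))
        wire.write("newField").text(newField);
}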

Native SQL from Spring / Hibernate without entity mapping?

I need to write some temporary code in my existing Spring Boot 1.2.5 application that will do some complex SQL queries. By complex, I mean a single query joins about 4 different tables, and I have a number of these. We all decided to reuse the existing SQL to reduce the potential risk of getting the new queries wrong, which in this case is a good way to go.
My application uses JPA / Hibernate and maps some entities to tables. From my research it seems like I would have to do a lot of entity mapping.
I tried writing a class that would just get the Hibernate session object and execute a native query, but when it tried to configure the session factory it threw an exception complaining that it could not find the config file.
Could I perhaps do this from one of my existing entities, or at least find a way to get the Hibernate session that already exists?
UPDATE:
Here is the exception, which makes perfect sense since there is no config file to find. The app is configured via the properties file.
org.hibernate.HibernateException: /hibernate.cfg.xml not found
at org.hibernate.internal.util.ConfigHelper.getResourceAsStream(ConfigHelper.java:173)
For what it's worth, the code:
@NamedNativeQuery(name = "verifyEa", query = "select account_nm from per_person where account_nm = :accountName")
public class VerifyEaResult
{
    private SessionFactory sessionFact = null;
    String accountName;

    private void initSessionFactory()
    {
        Configuration config = new Configuration().configure();
        ServiceRegistry serviceRegistry = new ServiceRegistryBuilder().applySettings(config.getProperties()).getBootstrapServiceRegistry();
        sessionFact = config.buildSessionFactory(serviceRegistry);
    }

    public String getAccountName()
    {
        // Quick simple test query
        String sql = "SELECT * FROM PER_ACCOUNT WHERE ACCOUNT_NM = 'lynnedelete'";
        initSessionFactory();
        Session session = sessionFact.getCurrentSession();
        SQLQuery q = session.createSQLQuery(sql);
        List<Object> result = q.list();
        return accountName;
    }
}
You can use Data access with JDBC, for example:
public class Client {
    private final JdbcTemplate jdbcTemplate;

    // Quick simple test query
    final static String SQL = "SELECT * FROM PER_ACCOUNT WHERE ACCOUNT_NM = ?";

    @Autowired
    public Client(DataSource dataSource) {
        jdbcTemplate = new JdbcTemplate(dataSource);
    }

    public List<Map<String, Object>> getData(String name) {
        return jdbcTemplate.queryForList(SQL, name);
    }
}
The short way is:
jdbcTemplate.queryForList("SELECT 1", Collections.emptyMap());
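As a minimal sketch of the "reuse the session that already exists" idea (the class and method names are assumptions): Spring Boot already builds the Hibernate SessionFactory behind JPA, so injecting the EntityManager avoids the hibernate.cfg.xml lookup entirely:
@Repository
public class VerifyEaDao
{
    @PersistenceContext
    private EntityManager entityManager; // backed by the SessionFactory Spring Boot already created

    @SuppressWarnings("unchecked")
    public List<String> findMatchingAccounts(String accountName)
    {
        // Plain native SQL; no entity mapping is needed for a scalar result.
        return entityManager
                .createNativeQuery("select account_nm from per_person where account_nm = :accountName")
                .setParameter("accountName", accountName)
                .getResultList();
    }
}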

Spring Data MongoDB: Accessing and updating sub documents

First experiments with Spring Data and MongoDB were great. Now I've got the following structure (simplified):
public class Letter {
    @Id
    private String id;
    private List<Section> sections;
}

public class Section {
    private String id;
    private String content;
}
Loading and saving entire Letter objects/documents works like a charm. (I use ObjectId to generate unique IDs for the Section.id field.)
Letter letter1 = mongoTemplate.findById(id, Letter.class);
mongoTemplate.insert(letter2);
mongoTemplate.save(letter3);
As documents are big (200K) and sometimes only sub-parts are needed by the application: is there a way to query for a sub-document (section), modify it, and save it?
I'd like to implement a method like
Section s = findLetterSection(letterId, sectionId);
s.setText("blubb");
replaceLetterSection(letterId, sectionId, s);
And of course methods like:
addLetterSection(letterId, s); // add after last section
insertLetterSection(letterId, sectionId, s); // insert before given section
deleteLetterSection(letterId, sectionId); // delete given section
I see that the last three methods are somewhat "strange", i.e. loading the entire document, modifying the collection and saving it again may be the better approach from an object-oriented point of view; but the first use case ("navigating" to a sub-document/sub-object and working in the scope of this object) seems natural.
I think MongoDB can update sub-documents, but can SpringData be used for object mapping? Thanks for any pointers.
I figured out the following approach for slicing and loading only one subobject. Does it seem ok? I am aware of problems with concurrent modifications.
Query query1 = Query.query(Criteria.where("_id").is(instance));
query1.fields().include("sections._id");
LetterInstance letter1 = mongoTemplate.findOne(query1, LetterInstance.class);
LetterSection emptySection = letter1.findSectionById(sectionId);
int index = letter1.getSections().indexOf(emptySection);
Query query2 = Query.query(Criteria.where("_id").is(instance));
query2.fields().include("sections").slice("sections", index, 1);
LetterInstance letter2 = mongoTemplate.findOne(query2, LetterInstance.class);
LetterSection section = letter2.getSections().get(0);
This is an alternative solution loading all sections, but omitting the other (large) fields.
Query query = Query.query(Criteria.where("_id").is(instance));
query.fields().include("sections");
LetterInstance letter = mongoTemplate.findOne(query, LetterInstance.class);
LetterSection section = letter.findSectionById(sectionId);
This is the code I use for storing only a single collection element:
MongoConverter converter = mongoTemplate.getConverter();
DBObject newSectionRec = (DBObject)converter.convertToMongoType(newSection);
Query query = Query.query(Criteria.where("_id").is(instance).and("sections._id").is(new ObjectId(newSection.getSectionId())));
Update update = new Update().set("sections.$", newSectionRec);
mongoTemplate.updateFirst(query, update, LetterInstance.class);
It is nice to see how Spring Data can be used with "partial results" from MongoDB.
Any comments highly appreciated!
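For the remaining list operations from the question (addLetterSection / deleteLetterSection), a hedged sketch along the same lines using $push and $pull; the method bodies are assumptions, not from the original answers:
public void addLetterSection(String letterId, LetterSection newSection) {
    // Convert the section the same way as above and append it to the array.
    DBObject sectionDoc = (DBObject) mongoTemplate.getConverter().convertToMongoType(newSection);
    Update update = new Update().push("sections", sectionDoc);
    mongoTemplate.updateFirst(Query.query(Criteria.where("_id").is(letterId)), update, LetterInstance.class);
}

public void deleteLetterSection(String letterId, String sectionId) {
    // Remove the array element whose _id matches the given section id.
    Update update = new Update().pull("sections", Query.query(Criteria.where("_id").is(new ObjectId(sectionId))));
    mongoTemplate.updateFirst(Query.query(Criteria.where("_id").is(letterId)), update, LetterInstance.class);
}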
I think Matthias Wuttke's answer is great, for anyone looking for a generic version of his answer see code below:
@Service
public class MongoUtils {

    @Autowired
    private MongoTemplate mongo;

    public <D, N extends Domain> N findNestedDocument(Class<D> docClass, String collectionName, UUID outerId, UUID innerId,
            Function<D, List<N>> collectionGetter) {
        // get index of subdocument in array
        Query query = new Query(Criteria.where("_id").is(outerId).and(collectionName + "._id").is(innerId));
        query.fields().include(collectionName + "._id");
        D obj = mongo.findOne(query, docClass);
        if (obj == null) {
            return null;
        }
        List<UUID> itemIds = collectionGetter.apply(obj).stream().map(N::getId).collect(Collectors.toList());
        int index = itemIds.indexOf(innerId);
        if (index == -1) {
            return null;
        }
        // retrieve subdocument at index using slice operator
        Query query2 = new Query(Criteria.where("_id").is(outerId).and(collectionName + "._id").is(innerId));
        query2.fields().include(collectionName).slice(collectionName, index, 1);
        D obj2 = mongo.findOne(query2, docClass);
        if (obj2 == null) {
            return null;
        }
        return collectionGetter.apply(obj2).get(0);
    }

    public void removeNestedDocument(UUID outerId, UUID innerId, String collectionName, Class<?> outerClass) {
        Update update = new Update();
        update.pull(collectionName, new Query(Criteria.where("_id").is(innerId)));
        mongo.updateFirst(new Query(Criteria.where("_id").is(outerId)), update, outerClass);
    }
}
This could for example be called using
mongoUtils.findNestedDocument(Shop.class, "items", shopId, itemId, Shop::getItems);
mongoUtils.removeNestedDocument(shopId, itemId, "items", Shop.class);
The Domain interface looks like this:
public interface Domain {
    UUID getId();
}
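A hedged sketch of a nested document class implementing Domain (the Item name and fields are assumptions based on the usage example above):
public class Item implements Domain {
    private UUID id;
    private int quantity; // primitive constructor parameter, see the notice below

    protected Item() {
        // default (empty) constructor so the class can be instantiated with null arguments
    }

    public Item(UUID id, int quantity) {
        this.id = id;
        this.quantity = quantity;
    }

    @Override
    public UUID getId() {
        return id;
    }
}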
Notice: if the nested document's constructor contains parameters with primitive data types, it is important for the nested document to have a default (empty) constructor, which may be protected, so that the class can be instantiated with null arguments.
Solution
That's my solution to this problem.
The object to be updated:
@Getter
@Setter
@Document(collection = "projectchild")
public class ProjectChild {

    @Id
    private String _id;

    private String name;
    private String code;

    @Field("desc")
    private String description;

    private String startDate;
    private String endDate;

    @Field("cost")
    private long estimatedCost;

    private List<String> countryList;
    private List<Task> tasks;

    @Version
    private Long version;
}
Coding the Solution
public Mono<ProjectChild> UpdateCritTemplChild(
        String id, String idch, String ownername) {

    Query query = new Query();
    query.addCriteria(Criteria.where("_id")
            .is(id)); // find the parent
    query.addCriteria(Criteria.where("tasks._id")
            .is(idch)); // find the child which will be changed

    Update update = new Update();
    update.set("tasks.$.ownername", ownername); // change the field inside the child that must be updated

    return template
            // findAndModify:
            // Find/modify/get the "new object" from a single operation.
            .findAndModify(
                    query, update,
                    new FindAndModifyOptions().returnNew(true), ProjectChild.class
            );
}
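A hedged usage sketch (the service variable and ids are placeholders, not from the original answer); nothing happens until the Mono is subscribed:
// Triggers the findAndModify and prints the updated parent document's name.
service.UpdateCritTemplChild("parent-id", "task-id", "new-owner")
        .subscribe(updated -> System.out.println("Updated owner on: " + updated.getName()));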

How to use Spring ColumnMapRowMapper?

Can anyone help me with an example of ColumnMapRowMapper? How to use it?
I've written an answer in my blog, http://selvam2day.blogspot.com/2013/06/singlecolumnrowmapper.html, but here it is for your convenience below:
SingleColumnRowMapper & ColumnMapRowMapper examples in Spring
Spring JDBC includes two default implementations of RowMapper - SingleColumnRowMapper and ColumnMapRowMapper. Below are sample usages of those row mappers.
There are lots of situations where you just want to select one column or only a selected set of columns in your application, and writing custom row mapper implementations for these scenarios doesn't seem right. In these scenarios, we can make use of the Spring-provided row mapper implementations.
SingleColumnRowMapper
This class implements the RowMapper interface. As the name suggests, this class can be used to retrieve a single column from the database as a java.util.List. The list contains the column values, one per row.
In the code snippet below, the type of the result value for each row is specified by the constructor argument. It can also be specified by invoking the setRequiredType(Class<T> requiredType) method.
public List<String> getFirstName(int userID)
{
    String sql = "select firstname from users where user_id = ?";
    SingleColumnRowMapper<String> rowMapper = new SingleColumnRowMapper<>(String.class);
    List<String> firstNameList = getJdbcTemplate().query(sql, rowMapper, userID);
    for (String firstName : firstNameList)
        System.out.println(firstName);
    return firstNameList;
}
More information on the class and its methods can be found in the spring javadoc link below.
http://static.springsource.org/spring/docs/3.0.x/javadoc-api/org/springframework/jdbc/core/SingleColumnRowMapper.html
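For what it's worth, JdbcTemplate also has a queryForList shortcut that uses a SingleColumnRowMapper internally, so a hedged equivalent of the example above is:
// One String per row, mapped by a SingleColumnRowMapper under the hood.
List<String> firstNames = getJdbcTemplate().queryForList(
        "select firstname from users where user_id = ?", String.class, userID);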
ColumnMapRowMapper
ColumnMapRowMapper class can be used to retrieve more than one column from a database table. This class also implements the RowMapper interface. This class creates a java.util.Map for each row, representing all columns as key-value pairs: one entry for each column, with the column name as key.
public List<Map<String, Object>> getUserData(int userID)
{
    String sql = "select firstname, lastname, dept from users where userID = ? ";
    ColumnMapRowMapper rowMapper = new ColumnMapRowMapper();
    List<Map<String, Object>> userDataList = getJdbcTemplate().query(sql, rowMapper, userID);
    for (Map<String, Object> map : userDataList) {
        System.out.println("FirstName = " + map.get("firstname"));
        System.out.println("LastName = " + map.get("lastname"));
        System.out.println("Department = " + map.get("dept"));
    }
    return userDataList;
}
More information on the class and its methods can be found in the spring javadoc link below.
http://static.springsource.org/spring/docs/3.0.x/javadoc-api/org/springframework/jdbc/core/ColumnMapRowMapper.html
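Similarly, when all columns are wanted as maps, JdbcTemplate's queryForList uses a ColumnMapRowMapper internally, so a hedged one-line equivalent of the example above is:
// One Map<String, Object> per row, keyed by column name.
List<Map<String, Object>> userDataList = getJdbcTemplate().queryForList(
        "select firstname, lastname, dept from users where userID = ?", userID);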
