I have GIS data which looks like this -
'111, 2011-02-01 20:30:30, 116.50443, 40.00951'
'111, 2011-02-01 20:30:31, 116.50443, 40.00951'
'112, 2011-02-01 20:30:30, 116.58197, 40.06665'
'112, 2011-02-01 20:30:31, 116.58197, 40.06665'
First column is driver_id, second is timestamp, third is longitude & fourth is latitude.
I am ingesting this type of data using Flume & my sink is HBase (type - AsyncHBaseSink).
By default the sink uses the first column (like 111) as the row key. I want to create a composite row key instead (a combination of the first two columns, like 111_2011-02-01 20:30:30).
I tried putting the required changes in 'AsyncHbaseLogEventSerializer.java', but they were not reflected.
Please suggest how I can do this.
A composite key should work in the AsyncHbase serializer.
Below is a sample code snippet.
Declare at class level:
private final List<PutRequest> puts = new ArrayList<>();
/**
 * Method joinRowKeyContent (joins the parts with an EMPTY string separator).
 *
 * Joiner is the Google Guava class.
 *
 * @param objArray Object...
 * @return String
 */
public static String joinRowKeyContent(Object... objArray) {
    return Joiner.on("").appendTo(new StringBuilder(), objArray).toString();
}
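For example, joinRowKeyContent(driverId, "_", timestamp) yields 111_2011-02-01 20:30:30; since the parts are joined with an empty separator, you pass the underscore as one of the parts.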
/**
 * Method preParePutRequest.
 *
 * @param rowKeyBytes
 * @param timestamp
 */
private void preParePutRequest(final byte[] rowKeyBytes, final long timestamp) {
    // Process
    LOG.debug("Processing ..." + Bytes.toString(rowKeyBytes));
    // "yourcolumn" and yourColumnAsByteArray are placeholders for your qualifier and value
    final PutRequest putreq = new PutRequest(table, rowKeyBytes, colFam, Bytes.toBytes("yourcolumn"), yourColumnAsByteArray, timestamp);
    puts.add(putreq);
}
Your getActions() method then looks like this:
@Override
public List<PutRequest> getActions() {
    // Build the composite row key
    final String rowKey = joinRowKeyContent(driver_id, timestamp, longitude, latitude);
    final byte[] rowKeyBytes = Bytes.toBytes(rowKey);
    puts.clear();
    // Prepare the put requests for this row
    preParePutRequest(rowKeyBytes, timestamp);
    return puts;
}
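For completeness, here is a minimal end-to-end sketch of a serializer implementing Flume's AsyncHbaseEventSerializer for the GIS rows above. The class name and the column qualifiers (longitude, latitude) are my own choices, not from the original code:
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
import org.apache.hadoop.hbase.util.Bytes;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

public class CompositeKeyHbaseEventSerializer implements AsyncHbaseEventSerializer {
    private byte[] table;
    private byte[] colFam;
    private Event currentEvent;
    private final List<PutRequest> puts = new ArrayList<>();

    @Override
    public void initialize(byte[] table, byte[] cf) {
        this.table = table;
        this.colFam = cf;
    }

    @Override
    public void setEvent(Event event) {
        this.currentEvent = event;
    }

    @Override
    public List<PutRequest> getActions() {
        puts.clear();
        // Event body: "111, 2011-02-01 20:30:30, 116.50443, 40.00951"
        final String[] cols = new String(currentEvent.getBody(), StandardCharsets.UTF_8).split(",\\s*");
        // Composite row key: driver_id + "_" + timestamp
        final byte[] rowKey = Bytes.toBytes(cols[0] + "_" + cols[1]);
        puts.add(new PutRequest(table, rowKey, colFam, Bytes.toBytes("longitude"), Bytes.toBytes(cols[2])));
        puts.add(new PutRequest(table, rowKey, colFam, Bytes.toBytes("latitude"), Bytes.toBytes(cols[3])));
        return puts;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() {
        return new ArrayList<>();
    }

    @Override
    public void cleanUp() {
    }

    @Override
    public void configure(Context context) {
    }

    @Override
    public void configure(ComponentConfiguration conf) {
    }
}
If your changes "were not reflected", also check that the rebuilt JAR is actually on the Flume agent's classpath and that the sink configuration points at your class, e.g. agent.sinks.hbaseSink.serializer = com.example.CompositeKeyHbaseEventSerializer (the sink and package names here are hypothetical).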
I have a class RetailPlace, whose data lives in the retail_place table, and RetailPlaceAddress, whose data is in the retail_place_address table. They are joined via retail_place.id = retail_place_address.retail_place_id. I need to create a method which returns a RetailPlace object with its RetailPlaceAddress inside it.
I tried to fetch the RetailPlaceAddress first and then place it in the RetailPlace object fetched next, but it didn't work:
public static Uni<RetailPlace> get(PgPool client, long id) {
    return client.preparedQuery("select * from retail_place_address where retail_place_id = $1")
            .execute().onItem().transform(RowSet::iterator)
            .onItem().transform(iterator -> iterator.hasNext() ? RetailPlaceAddress.from(iterator.next()) : null)
            .onItem().transform(retailPlaceAddress -> new RetailPlace(
                    client.preparedQuery("select * from retail_place where id = $1").execute()
                            .onItem().transform(pgRowSet -> new RetailPlace(
                                    pgRowSet.iterator().next().getLong("id"),
                                    pgRowSet.iterator().next().getString("title"),
                                    retailPlaceAddress))));
}
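One way to make this work is to resolve the address first and then chain the second query with transformToUni, so the address is in scope when the RetailPlace is constructed. This is only a sketch: it assumes the RetailPlace constructor and RetailPlaceAddress.from factory used above, and note that $1 must be bound with a Tuple, which the code above never does:
import io.smallrye.mutiny.Uni;
import io.vertx.mutiny.pgclient.PgPool;
import io.vertx.mutiny.sqlclient.Row;
import io.vertx.mutiny.sqlclient.Tuple;

public static Uni<RetailPlace> get(PgPool client, long id) {
    return client.preparedQuery("select * from retail_place_address where retail_place_id = $1")
            .execute(Tuple.of(id)) // bind $1
            .onItem().transform(rows -> rows.iterator().hasNext()
                    ? RetailPlaceAddress.from(rows.iterator().next())
                    : null)
            // transformToUni chains the second query instead of nesting a Uni inside a constructor
            .onItem().transformToUni(address -> client
                    .preparedQuery("select * from retail_place where id = $1")
                    .execute(Tuple.of(id))
                    .onItem().transform(rows -> {
                        Row row = rows.iterator().next();
                        return new RetailPlace(row.getLong("id"), row.getString("title"), address);
                    }));
}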
I'm writing a custom SerDe and will only be using it to deserialize.
The underlying data is a Thrift binary; each row is an event log. Each event has a schema which I have access to, but we wrap the event in another schema, let's call it Message, before storing.
The reason I'm writing a SerDe instead of using the ThriftDeserializer is that, as mentioned, the underlying event is wrapped in a Message. So we first need to deserialize using the schema of Message and then deserialize the data for that event.
The SerDe works when I do a SELECT * and I can deserialize the data as expected, but whenever I select a specific column from the table instead of SELECT *, the rows are all NULL. The object inspector returned is a ThriftStructObjectInspector and the Object returned by deserialize is a TBase.
What could cause Hive to return NULL when we select a column, but return the column data when I do a SELECT * ?
Here's the SerDe class (changed some classnames):
public class MyThriftSerde extends AbstractSerDe {
private static final Log LOG = LogFactory.getLog(MyThriftSerde.class);
/* Abstracting away the deserialization of the underlying event which is wrapped in a message */
private static final MessageDeserializer myMessageDeserializer =
MessageDeserializer.getInstance();
/* Underlying event class which is wrapped in a Message */
private String schemaClassName;
private Class<?> schemaClass;
/* Used to read the input row */
public static List<String> inputFieldNames;
public static List<ObjectInspector> inputFieldOIs;
public static List<Integer> notSkipIDs;
public static ObjectInspector inputRowObjectInspector;
/* Output Object Inspector */
public static ObjectInspector thriftStructObjectInspector;
@Override
public void initialize(Configuration conf, Properties tbl) throws SerDeException {
try {
logHeading("INITIALIZE MyThriftSerde");
schemaClassName = tbl.getProperty(SERIALIZATION_CLASS);
schemaClass = conf.getClassByName(schemaClassName);
LOG.info(String.format("Building DDL for event: %s", schemaClass.getName()));
inputFieldNames = new ArrayList<>();
inputFieldOIs = new ArrayList<>();
notSkipIDs = new ArrayList<>();
/* Initialize the Input fields */
// The underlying data is stored in RCFile format, and only has 1 column, event_binary
// So we create a ColumnarStructBase for each row we deserialize.
// This ColumnarStruct only has 1 column: event_binary
inputFieldNames.add("event_binary");
notSkipIDs.add(0);
inputFieldOIs.add(LazyPrimitiveObjectInspectorFactory.LAZY_BINARY_OBJECT_INSPECTOR);
inputRowObjectInspector =
ObjectInspectorFactory.getColumnarStructObjectInspector(inputFieldNames, inputFieldOIs);
/* Output Object Inspector*/
// This is what the SerDe will return, it is a ThriftStructObjectInspector
thriftStructObjectInspector =
ObjectInspectorFactory.getReflectionObjectInspector(
schemaClass, ObjectInspectorFactory.ObjectInspectorOptions.THRIFT);
// Only for debugging
logHeading("THRIFT OBJECT INSPECTOR");
LOG.info("Output OI Class Name: " + thriftStructObjectInspector.getClass().getName());
LOG.info(
"OI Details: "
+ ObjectInspectorUtils.getObjectInspectorName(thriftStructObjectInspector));
} catch (Exception e) {
LOG.info("Exception while initializing SerDe", e);
}
}
@Override
public Object deserialize(Writable rowWritable) throws SerDeException {
logHeading("START DESERIALIZATION");
ColumnarStructBase inputLazyStruct =
new ColumnarStruct(inputRowObjectInspector, notSkipIDs, null);
LazyBinary eventBinary;
Message rowAsMessage;
TBase deserializedRow = null;
try {
inputLazyStruct.init((BytesRefArrayWritable) rowWritable);
eventBinary = (LazyBinary) inputLazyStruct.getField(0);
rowAsMessage =
myMessageDeserializer.fromBytes(eventBinary.getWritableObject().copyBytes(), null);
deserializedRow = rowAsMessage.getEvent();
LOG.info("deserializedRow.getClass(): " + deserializedRow.getClass());
LOG.info("deserializedRow.toString(): " + deserializedRow.toString());
} catch (Exception e) {
e.printStackTrace();
}
logHeading("END DESERIALIZATION");
return deserializedRow;
}
private void logHeading(String s) {
LOG.info(String.format("------------------- %s -------------------", s));
}
@Override
public ObjectInspector getObjectInspector() {
return thriftStructObjectInspector;
}
}
Context on the code:
In the underlying data, each row contains only 1 column (called event_binary), stored as a binary. The binary is a Message which contains 2 fields, "schema" + "event_data". i.e. each row is a Message which contains the underlying event's schema + data. We use the schema from Message to deserialize the data.
The SerDe first deserializes the row as a Message, extracts the event data and then deserializes the event.
I create an EXTERNAL table which points to the Thrift data using
ADD JAR hdfs://my-jar.jar;
CREATE EXTERNAL TABLE dev_db.thrift_event_data_deserialized
ROW FORMAT SERDE 'com.test.only.MyThriftSerde'
WITH SERDEPROPERTIES (
"serialization.class"="com.test.only.TestEvent"
) STORED AS RCFILE
LOCATION 'location/of/thrift/data';
MSCK REPAIR TABLE thrift_event_data_deserialized;
Then SELECT * FROM dev_db.thrift_event_data_deserialized LIMIT 10; works as expected
But, SELECT column1_name, column2_name FROM dev_db.thrift_event_data_deserialized LIMIT 10; does not work.
Any idea what I'm missing here? Would love any help on this!
I am coding a Graph exploration program and have hit a bit of a stumbling block.
My graph is made up of Vertex and NetworkLink objects, and can be obtained by querying a GeographyModel object.
The idea is that a List<NetworkLink> is retrieved from the GeographyModel and then supplied to a MetaMap to get the required additional information.
What I want to do is try to adhere to the Open/Closed Principle by adding information to each NetworkLink through MetaMap objects, but I have somewhat got my knickers in a twist as to how to do this!
Below is the code for the MetaMap.
public class MetaMap<T> {
private final String name;
private final Map<NetworkLink, List<T>> metaData;
private final Map<T, Set<NetworkLink>> reverseLookup;
private final List<T> fallback;
private final List<T> information;
public MetaMap(String name, T fallback){
this.name = name;
this.metaData = new HashMap<>();
this.reverseLookup = new HashMap<>();
this.fallback = new ArrayList<>();
this.fallback.add(fallback);
this.information = new ArrayList<>();
}
/**
 * Returns an identifier giving the information contained in this map
 *
 * @return
 */
public String getName() {
return name;
}
/**
 * Marks from origin to destination with information of type T
 *
 * @param line
 * @param information
 */
public void markLineFragment(RunningLine line, T information) {
    // Tag every link on the line with this information
    line.getLinks().forEach(link ->
            metaData.computeIfAbsent(link, k -> new ArrayList<>()).add(information));
    // Maintain the reverse index from information to links
    reverseLookup.computeIfAbsent(information, k -> new HashSet<>())
            .addAll(line.getLinks());
}
/**
 * Returns the information on the given NetworkLink
 *
 * @param link
 * @return
 */
public List<T> getInformation(NetworkLink link) {
return metaData.getOrDefault(link, fallback);
}
/**
 * Returns the information associated with the given line fragment
 * @param line
 * @return
 */
public List<T> getInformation(RunningLine line) {
    Set<T> resultSet = new HashSet<>();
    line.getLinks().forEach(link -> resultSet.addAll(getInformation(link)));
    return new ArrayList<>(resultSet);
}
/**
 * Returns all of the links which match the given information
 * @param information
 * @return
 */
public List<NetworkLink> getMatchingLinks(T information) {
    // getOrDefault avoids a NullPointerException for information no link carries
    return new ArrayList<>(reverseLookup.getOrDefault(information, Collections.emptySet()));
}
public void addInformation(T info) {
information.add(info);
}
public void removeInformation(T info) {
    information.remove(info);
}
}
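For context, typical usage looks something like this (a sketch; the line, link, and values are made up):
MetaMap<String> electrification = new MetaMap<>("electrification", "unknown");
electrification.markLineFragment(line, "25kV AC");           // tag every link on the line
List<String> onLink = electrification.getInformation(link);  // ["25kV AC"], or ["unknown"] as fallback
List<NetworkLink> electrified = electrification.getMatchingLinks("25kV AC");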
Now... the problem I have is that as I expand the program, each new part will require a new MetaMap derived from the GeographyModel.
I want to follow the OCP and SRP as I add capabilities to the program, but am a touch stuck on the implementation and on combining the two concepts. A couple of thoughts do occur...
I could have each new model that requires a MetaMap register itself with the GeographyModel, but I fear I would be violating the SRP. Alternatively, each new program feature could own and maintain its MetaMap, but that would require querying the GeographyModel in the first place.
Any ideas how I could approach this?
Why would you want to implement OCP? What problems are you trying to solve?
If you implemented OCP only because everyone else thought it was good, I strongly recommend you think twice.
Each principle in SOLID / GRASP, as well as each design pattern, is a guideline and a solution for a very specific kind of problem. Basically, they are tools. You should identify your problems first and state them as clearly as possible. Then you will be able to pick the right tools to deal with them.
Blindly applying SOLID / GRASP or design patterns is much like using a hammer for cooking food. If you were lucky you might succeed, but we both know the probability is very low.
https://www.u-cursos.cl/usuario/777719ab2ddbbdb16d99df29431d3036/mi_blog/r/head_first_design_patterns.pdf
Please navigate to page 125/681 (on the top bar) and read the entire page!
I have an HBase table with sample records as follows:
03af639717ae10eb743253433147e133 column=u:a, timestamp=1434300763147, value=apple
10f3d7f8fe8f25d5bdf52343a2601227 column=u:a, timestamp=1434300763148, value=mapple
20164b1aff21bc14e94623423a9d645d column=u:a, timestamp=1534300763142, value=papple
44d1cb38271362d20911a723410b2c67 column=u:a, timestamp=1634300763141, value=scapple
I am lost: I am trying to pull out row values according to their timestamps. I am using Spring Data Hadoop.
I was only able to fetch all the records, using the code below:
private static final byte[] CF_INFO = Bytes.toBytes("u");
private static final byte[] baseUrl = Bytes.toBytes("a");
List<Model> allNewsList
    = hbaseTemplate.find(tableName, columnFamily, new RowMapper<Model>()
{
    @Override
    public Model mapRow(Result result, int rowNum)
        throws Exception
    {
        String rowKey = Bytes.toString(result.getRow());
        return new Model(
            rowKey,
            Bytes.toString(result.getValue(CF_INFO, baseUrl))
        );
    }
});
How can I apply a filter such that I only get records whose timestamps fall within [1434300763147, 1534300763142]?
Hopefully this would help someone someday.
final org.apache.hadoop.hbase.client.Scan scan = new Scan();
scan.setTimeRange(1434300763147L, 1534300763142L); // note the L suffix: these literals don't fit in an int
final List<Model> yourObjects = hbaseTemplate.find(tableName, scan, mapper);
Also, worth a mention: the max value of the time range is exclusive, so if you want records with that exact timestamp to be returned, make sure to increment the max value of the time range by 1.
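Putting it together with the mapper from the question (a sketch; Model, CF_INFO and baseUrl are as defined above, with the upper bound bumped by 1 so 1534300763142 itself is included):
final Scan scan = new Scan();
scan.setTimeRange(1434300763147L, 1534300763142L + 1); // setTimeRange's max is exclusive
List<Model> models = hbaseTemplate.find(tableName, scan, (result, rowNum) ->
        new Model(Bytes.toString(result.getRow()),
                  Bytes.toString(result.getValue(CF_INFO, baseUrl))));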
The problem was solved using the Scan object from the HBase client.
It looks like _compile_select() is deprecated and get_compiled_select() was not added in 2.1.0. Are there any other functions like those two? I am also curious: is there any particular reason for not adding get_compiled_select() to Active Record while removing _compile_select()?
I've added get_compiled_select() to DB_active_rec.php and it seems to work without problems, but I wouldn't remove _compile_select(), since it's used in many other methods.
The pull request for adding this method is here, with some other useful methods like:
get_compiled_select()
get_compiled_insert()
get_compiled_update()
get_compiled_delete()
https://github.com/EllisLab/CodeIgniter/pull/307
If you want just the method, here it is:
/**
 * Get SELECT query string
 *
 * Compiles a SELECT query string and returns the sql.
 *
 * @access public
 * @param string the table name to select from (optional)
 * @param boolean TRUE: resets AR values; FALSE: leave AR values alone
 * @return string
 */
public function get_compiled_select($table = '', $reset = TRUE)
{
if ($table != '')
{
$this->_track_aliases($table);
$this->from($table);
}
$select = $this->_compile_select();
if ($reset === TRUE)
{
$this->_reset_select();
}
return $select;
}
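For illustration, usage would look something like this (the table and columns are hypothetical):
$this->db->select('id, title')->from('news')->where('id', 42);
$sql = $this->db->get_compiled_select(); // returns the SQL string without executing it
// $reset defaults to TRUE, so the builder state is cleared afterwards
echo $sql; // e.g. SELECT `id`, `title` FROM (`news`) WHERE `id` = 42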