How to infer a Parquet schema from a Hive table schema without inserting any records? - hadoop

Given a Hive table with the following schema:
hive> show create table nba_player;
OK
CREATE TABLE `nba_player`(
`id` bigint,
`player_id` bigint,
`player_name` string,
`admission_time` timestamp,
`nationality` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://endpoint:8020/user/hive/warehouse/nba_player'
TBLPROPERTIES (
'transient_lastDdlTime'='1541140811')
Time taken: 0.022 seconds, Fetched: 16 row(s)
How to infer its parquet schema without inserting any records?
The parquet schema is like:
message meta {
  optional int64 id;
  optional int64 player_id;
  optional binary player_name;
  optional timestamp admission_time;
  optional binary nationality;
}

The code is shown below:
/**
* Generate MessageType by table properties using HiveSchemaConverter
*
* @param tableProperties {@link Properties}
* @return MessageType
*/
public static MessageType getMessageTypeFromTable(final Properties tableProperties) {
final String columnNameProperty = tableProperties.getProperty(IOConstants.COLUMNS);
final String columnTypeProperty = tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
List<String> columnNames;
List<TypeInfo> columnTypes;
if (columnNameProperty.length() == 0) {
columnNames = new ArrayList<String>();
} else {
columnNames = Arrays.asList(columnNameProperty.split(","));
}
if (columnTypeProperty.length() == 0) {
columnTypes = new ArrayList<TypeInfo>();
} else {
columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
}
MessageType messageType = HiveSchemaConverter.convert(columnNames, columnTypes);
logger.info("messageType is inferred to be: {}", messageType.toString());
return messageType;
}
public class ParquetHelperTest {
@Test
public void testGenerateParquetSchemaFromTableProperties() {
Properties tableProperties = new Properties();
tableProperties.setProperty(IOConstants.COLUMNS, "id,player_id,player_name,admission_time,nationality");
tableProperties.setProperty(IOConstants.COLUMNS_TYPES, "bigint,bigint,string,timestamp,string");
MessageType messageType = ParquetHelper.getMessageTypeFromTable(tableProperties);
String expectedMessageType = "message hive_schema {\n"
+ " optional int64 id;\n"
+ " optional int64 player_id;\n"
+ " optional binary player_name (UTF8);\n"
+ " optional int96 admission_time;\n"
+ " optional binary nationality (UTF8);\n"
+ "}";
String calculatedMessageType = messageType.toString();
calculatedMessageType = calculatedMessageType.replaceAll("\\s", "");
expectedMessageType = expectedMessageType.replaceAll("\\s", "");
Assert.assertTrue(calculatedMessageType.equalsIgnoreCase(expectedMessageType));
}
}
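In practice the two property strings do not have to be hard-coded; for an existing table they can be read from the metastore. A rough sketch, assuming a reachable metastore and HiveMetaStoreClient (the helper name buildTablePropertiesFromMetastore is mine):
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.ql.io.IOConstants;
public static Properties buildTablePropertiesFromMetastore(HiveConf conf, String db, String table) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
        // Column metadata of the table as stored in the metastore
        List<FieldSchema> cols = client.getTable(db, table).getSd().getCols();
        // Same comma-separated format that IOConstants.COLUMNS / COLUMNS_TYPES expect
        String names = cols.stream().map(FieldSchema::getName).collect(Collectors.joining(","));
        String types = cols.stream().map(FieldSchema::getType).collect(Collectors.joining(","));
        Properties props = new Properties();
        props.setProperty(IOConstants.COLUMNS, names);
        props.setProperty(IOConstants.COLUMNS_TYPES, types);
        return props;
    } finally {
        client.close();
    }
}
The resulting Properties can then be passed to getMessageTypeFromTable above to obtain the MessageType without writing a single record.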

Related

Custom Result handling by calling store procedure in Spring Data JPA

I have a requirement to call stored procedures which take input parameters. The stored procedure returns a custom result set, which I need to read and process further before returning it to the UI. How can we achieve this?
E.g.:
@Query(value = "CALL SP_EMPLOYEE_REPORT(:year)", nativeQuery = true)
List<EmpolypeeCustomReportBean> getEmployeeReport(@Param("year") Integer year);
Given the following stored procedure:
CREATE PROCEDURE NAME_OF_THE_PROCEDURE(IN param VARCHAR(255), OUT retval INT)
You can call it from an interface query:
@Procedure(value = "NAME_OF_THE_PROCEDURE")
int getFromStoredProcedure(String param);
Also with the @Query annotation:
@Query(value = "CALL NAME_OF_THE_PROCEDURE(:input_value);", nativeQuery = true)
Integer findSomeThing(@Param("input_value") Integer name);
Or you can use a named stored procedure query:
@Entity
@NamedStoredProcedureQuery(name = "MyObj.getSomethingFromProc",
procedureName = "NAME_OF_THE_PROCEDURE", parameters = {
@StoredProcedureParameter(mode = ParameterMode.IN, name = "param", type = String.class),
@StoredProcedureParameter(mode = ParameterMode.OUT, name = "retval", type = Integer.class)})
public class MyObj{
// class definition
}
Then call it:
@Procedure(name = "MyObj.getSomethingFromProc")
Integer getSomethingFromStoredProc(@Param("param") String model);
You can also use the resultClasses and resultSetMappings properties of @NamedStoredProcedureQuery for complex return types.
A more complex example, provided by EclipseLink:
@NamedStoredProcedureQuery(
    name="ReadUsingMultipleResultSetMappings",
    procedureName="Read_Multiple_Result_Sets",
    resultSetMappings={"EmployeeResultSetMapping", "AddressResultSetMapping", "ProjectResultSetMapping", "EmployeeConstructorResultSetMapping"}
)
@SqlResultSetMappings({
    @SqlResultSetMapping(
        name = "EmployeeResultSetMapping",
        entities = {
            @EntityResult(entityClass = Employee.class)
        }
    ),
    @SqlResultSetMapping(
        name = "EmployeeConstructorResultSetMapping",
        classes = {
            @ConstructorResult(
                targetClass = EmployeeDetails.class,
                columns = {
                    @ColumnResult(name="EMP_ID", type=Integer.class),
                    @ColumnResult(name="F_NAME", type=String.class),
                    @ColumnResult(name="L_NAME", type=String.class),
                    @ColumnResult(name="R_COUNT", type=Integer.class)
                }
            )
        }
    )
})
public class Employee {
....
}
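If, as in the original question, the result has to be read and post-processed before it is returned to the UI, the named query can also be executed directly through the EntityManager instead of a repository method. A minimal sketch, assuming JPA 2.1 and an injected EntityManager (the service class and method names here are illustrative):
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import javax.persistence.StoredProcedureQuery;
public class ReportService {
    @PersistenceContext
    private EntityManager em;
    public Integer callAndProcess(String param) {
        // Looks up the @NamedStoredProcedureQuery declared on MyObj above
        StoredProcedureQuery query = em.createNamedStoredProcedureQuery("MyObj.getSomethingFromProc");
        query.setParameter("param", param);
        query.execute();
        // For procedures that return result sets, query.getResultList() gives the rows
        Integer retval = (Integer) query.getOutputParameterValue("retval");
        // ...process the value/rows here before handing them to the UI...
        return retval;
    }
}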

oracle jdbc driver reports no primary key columns on a table that has a primary key

This was reported by HibernateTools Reverse Engineering, but it seems to be true:
the Oracle JDBC driver reports no primary key columns on a table that has a primary key.
@Test
public void checkTable() throws SQLException, IOException {
System.out.println("in check table");
assertNotNull(conn);
Statement s = conn.createStatement();
ResultSet rset = s.executeQuery("select user from dual");
rset.next();
String username = rset.getString(1);
rset.close();
try {
s.execute("drop table " + username + ".x");
} catch (Exception e) {
// ignore; the table might not exist
}
s.execute("create table " + username + ".x (y number)");
s.execute("alter table x add constraint x_pk primary key (y)");
DatabaseMetaData meta = conn.getMetaData();
final String[] tableTypes = new String[] { "TABLE", "VIEW" };
ResultSet rs = meta.getTables(null, username, "X",tableTypes);
rs.next();
String table = rs.getString("table_name");
System.out.println("table is " + table);
rs.close();
rs = s.executeQuery("select * from user_constraints where table_name = 'X'");
rs.next();
String type = rs.getString("constraint_type");
assertEquals("P",type); // primary key
rs.close();
rs = meta.getPrimaryKeys(null, username, "X");
rs.next();
logger.info("getting pk");
System.out.print("wtf");
int colCount = 0;
while (rs.next()) {
final String pkName = rs.getString("pk_name");
logger.info("pkName: {}", pkName);
int keySeq = rs.getShort("key_seq"); // TODO should probably be column seq
String columnName = rs.getString("column_name");
logger.warn("seq: {}, columnName: {}, keySeq, columnName");
colCount++;
}
System.out.println("colCount: " + colCount);
assertEquals(1,colCount);
}
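Note that the snippet above already calls rs.next() once before entering the while loop, so the first (and here only) primary-key row returned by getPrimaryKeys() is consumed before counting. If the goal is just to list and count the PK columns, a loop along these lines avoids that (using the standard DatabaseMetaData column labels):
rs = meta.getPrimaryKeys(null, username, "X");
int colCount = 0;
while (rs.next()) {
    // every row of getPrimaryKeys() describes one column of the primary key
    logger.info("pkName: {}, keySeq: {}, columnName: {}",
            rs.getString("PK_NAME"), rs.getShort("KEY_SEQ"), rs.getString("COLUMN_NAME"));
    colCount++;
}
rs.close();
assertEquals(1, colCount);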

Hextoraw() not working with IN clause while using NamedParameterJdbcTemplate

I am trying to update certain rows in my Oracle DB using an id column which is of type RAW(255).
Sample ids: 0BF3957A016E4EBCB68809E6C2EA8B80, 1199B9F29F0A46F486C052669854C2F8...
#Autowired
private NamedParameterJdbcTemplate jdbcTempalte;
private static final String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (:ids)";
public void saveSubscriptionsStatus(List<String> ids, String status) {
MapSqlParameterSource paramSource = new MapSqlParameterSource();
List<String> idsHexToRaw = new ArrayList<>();
String temp = new String();
for (String id : ids) {
temp = "hextoraw('" + id + "')";
idsHexToRaw.add(temp);
}
paramSource.addValue("ids", idsHexToRaw);
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
The block of code above executes without any error, but the updates are not reflected in the DB. If I skip hextoraw() and just pass the list of ids, it works fine and updates the data in the table; see the code below.
public void saveSubscriptionsStatus(List<String> ids, String status) {
MapSqlParameterSource paramSource = new MapSqlParameterSource();
paramSource.addValue("ids", ids);
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
This code works fine and updates the table, but since I am not using hextoraw() it does a full table scan for the update, which I don't want since I have created indexes. Using hextoraw() should let the update use the index, but then it does not update the values, which is kind of weird (when hextoraw(...) is concatenated into the bound value it is sent as a literal string, not executed as SQL, so it never matches the RAW ids).
I got a solution myself by trying all the different combinations:
#Autowired
private NamedParameterJdbcTemplate jdbcTempalte;
public void saveSubscriptionsStatus(List<String> ids, String status) {
String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (";
MapSqlParameterSource paramSource = new MapSqlParameterSource();
String subQuery = "";
for (int i = 0; i < ids.size(); i++) {
String temp = "id" + i;
paramSource.addValue(temp, ids.get(i));
subQuery = subQuery + "hextoraw(:" + temp + "), ";
}
subQuery = subQuery.substring(0, subQuery.length() - 2);
UPDATE_SUB_STATUS = UPDATE_SUB_STATUS + subQuery + ")";
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
What this does is build a query with one hextoraw placeholder per id (id0, id1, id2, ...), add those values to the MapSqlParameterSource instance; this worked fine and also used the index when updating the table.
After running the new method the query looks like:
update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (hextoraw(:id0), hextoraw(:id1), hextoraw(:id2)...)
and the MapSqlParameterSource instance looks like:
{("id0", "randomUUID"), ("id1", "randomUUID"), ("id2", "randomUUID").....}
Instead of doing string manipulation, convert the list to a List of byte arrays:
List<byte[]> productGuidByteList = stringList.stream().map(item -> GuidHelper.asBytes(item)).collect(Collectors.toList());
parameters.addValue("productGuidSearch", productGuidByteList);
public static byte[] asBytes(UUID uuid) {
ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
bb.putLong(uuid.getMostSignificantBits());
bb.putLong(uuid.getLeastSignificantBits());
return bb.array();
}
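Note that asBytes above expects a java.util.UUID, i.e. the dashed form; the sample ids in the question are plain 32-character hex strings, so a direct hex decode may be the simpler conversion. A minimal sketch using only the JDK (the method name asRawBytes is mine):
// Decodes a hex string such as "0BF3957A016E4EBCB68809E6C2EA8B80" into the
// byte[] that gets bound against the RAW(255) column.
public static byte[] asRawBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
        out[i] = (byte) ((Character.digit(hex.charAt(2 * i), 16) << 4)
                + Character.digit(hex.charAt(2 * i + 1), 16));
    }
    return out;
}
The resulting byte arrays can be added to the MapSqlParameterSource just like the UUID-based list above.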

Cassandra read Column

String cqlStatement = "SELECT * FROM local";
for (Row row : session.execute(cqlStatement)) {
System.out.println(row.toString());
}
How do I get each column value from the selected row?
If you are using the Cassandra DataStax driver, the following should work for you.
String cqlStatement = "SELECT * FROM local";
for (Row row : session.execute(cqlStatement))
{
row.getString("columnName"); // for string data type
// row.getBool("columnName"); for boolean data type
// row.getUUID("columnName"); for UUID type
// row.getVarint("columnName"); for int type
// row.getLong("columnName"); for long type
// row.getDate("columnName"); for date type
// row.getBytes("columnName"); for bytes/anonymous type
}
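If you do not know the column names up front, you can iterate over the row's column definitions instead. A rough sketch, assuming a DataStax Java driver version where Row.getObject is available (2.2+/3.x):
String cqlStatement = "SELECT * FROM local";
for (Row row : session.execute(cqlStatement)) {
    for (ColumnDefinitions.Definition def : row.getColumnDefinitions()) {
        // getObject converts the value to the driver's default Java type for that column
        Object value = row.getObject(def.getName());
        System.out.println(def.getName() + " = " + value);
    }
}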

Splitting a tuple into multiple tuples in Pig

I'd like to generate multiple tuples from a single tuple. What I mean is:
I have a file with the following data in it.
>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2
So I load it with the following command:
grunt >> A = load '$data' using PigStorage('|');
grunt >> dump A;
(ID,ColumnName1:Value1,ColumnName2:Value2)
Now I want to split this tuple into two tuples.
(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)
Can I use a UDF along with foreach and generate? Something like the following?
grunt >> foreach A generate SOMEUDF(A)
EDIT:
input tuple: (id1,column1,column2)
output: two tuples, (id1,column1) and (id1,column2). So should I return a List, or should I return a Bag?
public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
{
public List<Tuple> exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
// not sure whether I can create tuples on my own; looks like I should use TupleFactory.
// return list of tuples.
}catch(Exception e){
throw WrappedIOException.wrap("Caught exception processing input row ", e);
}
}
}
Is this approach correct?
You could write a UDF or use a PIG script with built-in functions.
For example:
-- data should be chararray; PigStorage('|') returns bytearray, which will not work for this example
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
-- split by | and create a row so we can dereference it later
splt = foreach inpt generate FLATTEN(STRSPLIT($0, '\\|')) ;
-- the first column is the id; the rest is converted into a bag and flattened to make rows
id_vals = foreach splt generate $0 as id, FLATTEN(TOBAG(*)) as value;
-- there will be records with (id, id), but id should not have ':'
id_vals = foreach id_vals generate id, INDEXOF(value, ':') as p, STRSPLIT(value, ':', 2) as vals;
final = foreach (filter id_vals by p != -1) generate id, FLATTEN(vals) as (col, val);
dump final;
Test INPUT:
1|c1:11:33|c2:12
234|c1:21|c2:22
33|c1:31|c2:32
345|c1:41|c2:42
OUTPUT
(1,c1,11:33)
(1,c2,12)
(234,c1,21)
(234,c2,22)
(33,c1,31)
(33,c2,32)
(345,c1,41)
(345,c2,42)
I hope it helps.
Cheers.
Here is the UDF version. I prefer to return a BAG:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
/**
* Converts input chararray "ID|ColumnName1:Value1|ColumnName2:Value2|.." into a bag
* {(ID, ColumnName1, Value1), (ID, ColumnName2, Value2), ...}
*
* Default rows separator is '|' and key value separator is ':'.
* In this implementation white spaces around separator characters are not removed.
* ID can be made of any character (including sequence of white spaces).
* @author
*
*/
public class TupleToBagColumnValuePairs extends EvalFunc<DataBag> {
private static final TupleFactory tupleFactory = TupleFactory.getInstance();
private static final BagFactory bagFactory = BagFactory.getInstance();
//Row separator character. Default is '|'.
private String rowsSeparator;
//Column value separator character. Default is ':'.
private String columnValueSeparator;
public TupleToBagColumnValuePairs() {
this.rowsSeparator = "\\|";
this.columnValueSeparator = ":";
}
public TupleToBagColumnValuePairs(String rowsSeparator, String keyValueSeparator) {
this.rowsSeparator = rowsSeparator;
this.columnValueSeparator = keyValueSeparator;
}
/**
* Creates a tuple with 3 fields (id:chararray, column:chararray, value:chararray)
* @param outputBag Output tuples (id, column, value) are added to this bag
* @param id
* @param column
* @param value
* @throws ExecException
*/
protected void addTuple(DataBag outputBag, String id, String column, String value) throws ExecException {
Tuple outputTuple = tupleFactory.newTuple();
outputTuple.append(id);
outputTuple.append(column);
outputTuple.append( value);
outputBag.add(outputTuple);
}
/**
* Takes column{separator}value entries from splitInputLine, splits each into column and value, and adds them to the outputBag as (id, column, value)
* @param outputBag Output tuples (id, column, value) should be added to this bag
* @param id
* @param splitInputLine entries of the form column{separator}value, starting from index 1
* @throws ExecException
*/
protected void parseColumnValues(DataBag outputBag, String id,
String[] splitInputLine) throws ExecException {
for (int i = 1; i < splitInputLine.length; i++) {
if (splitInputLine[i] != null) {
int columnValueSplitIndex = splitInputLine[i].indexOf(this.columnValueSeparator);
if (columnValueSplitIndex != -1) {
String column = splitInputLine[i].substring(0, columnValueSplitIndex);
String value = null;
if (columnValueSplitIndex + 1 < splitInputLine[i].length()) {
value = splitInputLine[i].substring(columnValueSplitIndex + 1);
}
this.addTuple(outputBag, id, column, value);
} else {
String column = splitInputLine[i];
this.addTuple(outputBag, id, column, null);
}
}
}
}
/**
* input - contains only one field of type chararray, which will be split by '|'
* All inputs that are: null or of length 0 are ignored.
*/
@Override
public DataBag exec(Tuple input) throws IOException {
if (input == null || input.size() != 1 || input.isNull(0)) {
return null;
}
String inputLine = (String)input.get(0);
String[] splitInputLine = inputLine.split(this.rowsSeparator, -1);
if (splitInputLine.length >= 1 && splitInputLine[0].length() > 0) { // id must be non-empty
String id = splitInputLine[0];
DataBag outputBag = bagFactory.newDefaultBag();
if (splitInputLine.length == 1) { // there is just an id in the line
this.addTuple(outputBag, id, null, null);
} else {
this.parseColumnValues(outputBag, id, splitInputLine);
}
return outputBag;
}
return null;
}
@Override
public Schema outputSchema(Schema input) {
try {
if (input.size() != 1) {
throw new RuntimeException("Expected input to have only one field");
}
Schema.FieldSchema inputFieldSchema = input.getField(0);
if (inputFieldSchema.type != DataType.CHARARRAY) {
throw new RuntimeException("Expected a CHARARRAY as input");
}
Schema tupleSchema = new Schema();
tupleSchema.add(new Schema.FieldSchema("id", DataType.CHARARRAY));
tupleSchema.add(new Schema.FieldSchema("column", DataType.CHARARRAY));
tupleSchema.add(new Schema.FieldSchema("value", DataType.CHARARRAY));
return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.BAG));
} catch (FrontendException exx) {
throw new RuntimeException(exx);
}
}
}
Here is how it is used in PIG:
register 'path to the jar';
define IdColumnValue myPackage.TupleToBagColumnValuePairs();
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
result = foreach inpt generate FLATTEN(IdColumnValue($0)) as (id1, c2, v2);
dump result;
For good inspiration on writing UDFs with bags, see the DataFu source code by LinkedIn.
You could use TransposeTupleToBag (a UDF from the DataFu library) on the output of STRSPLIT to get the bag, and then FLATTEN the bag to create a separate row per original column.
