Creating Hive table on Parquet file which has JSON data - hadoop

The objective I'm trying to achieve:
Obtain the data from large source JSON files (employee-sample.json).
A simple Spark application reads them as text files and stores them in Parquet (simple-loader.java). I don't know in advance what is in the JSON files, so I cannot impose a schema; I want schema on read, not schema on write. A Parquet file with one column named "value" containing the JSON string is created.
Create a Hive external table on the Parquet files; when I do "select * from table", I see a single column coming out with the JSON data.
What I really need is to create a Hive table that can read the JSON data in the "value" column, apply a schema, and emit columns, so that I can create a variety of tables on my raw data depending on the need.
I have created Hive tables directly on JSON files and extracted the columns, but extracting a column from Parquet and applying a JSON schema to it is what is tripping me up.
employee-sample.json
{"name":"Dave", "age" : 30 , "DOB":"1987-01-01"}
{"name":"Steve", "age" : 31 , "DOB":"1986-01-01"}
{"name":"Kumar", "age" : 32 , "DOB":"1985-01-01"}
Simple Spark code to convert JSON to parquet
simple-loader.java
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .appName(JsonToParquet.class.getName())
            .master("local[*]").getOrCreate();
    Dataset<String> eventsDataSet = sparkSession.read().textFile("D:\\dev\\employee-sample.json");
    eventsDataSet.createOrReplaceTempView("rawView");
    sparkSession.sqlContext().sql("select string(value) as value from rawView")
            .write()
            .parquet("D:\\dev\\" + UUID.randomUUID().toString());
    sparkSession.close();
}
hive table on parquet files
CREATE EXTERNAL TABLE EVENTS_RAW (
VALUE STRING)
STORED AS PARQUET
LOCATION 'hdfs://XXXXXX:8020/employee/data_raw';
I tried setting a JSON SerDe, but it only works if the data is stored in JSON format: ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
EXPECTED FORMAT
CREATE EXTERNAL TABLE EVENTS_DATA (
NAME STRING,
AGE STRING,
DOB STRING)
??????????????????????????????
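One hedged way to get there, offered as a sketch rather than the accepted answer: both Hive and Spark SQL ship a built-in get_json_object UDF, so a view over EVENTS_RAW can project the JSON fields out of the value column at query time. The view name, column list and JSON paths below are assumptions based on the sample data, and the SparkSession is assumed to have been built with enableHiveSupport():
// Hedged sketch: a schema-on-read view over the raw JSON column (names assumed).
sparkSession.sql(
    "CREATE VIEW IF NOT EXISTS EVENTS_DATA AS " +
    "SELECT get_json_object(value, '$.name') AS name, " +
    "       get_json_object(value, '$.age')  AS age, " +
    "       get_json_object(value, '$.DOB')  AS dob " +
    "FROM EVENTS_RAW");
The same CREATE VIEW statement can also be run directly from the Hive CLI or beeline, since get_json_object is a Hive built-in as well.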

Example of creating a Hive external table:
public static final String CREATE_EXTERNAL = "CREATE EXTERNAL TABLE %s" +
        " (%s) " +
        " PARTITIONED BY(%s) " +
        " STORED AS %s" +
        " LOCATION '%s'";

/**
 * Creates an external table and recovers its partitions.
 */
public void createExternalTable(SparkSession sparkSession, StructType schema, String tableName, SparkFormat format, List<StructField> partitions, String tablePath) {
    String createQuery = createTableString(schema, tableName, format, partitions, tablePath);
    logger.info("Going to create External table with the following query:\n " + createQuery);
    sparkSession.sql(createQuery);
    logger.debug("Finished creating External table with the following query:\n " + createQuery);
    recoverPartitions(sparkSession, tableName);
}

public String createTableString(StructType schema, String tableName, SparkFormat format, List<StructField> partitions, String tablePath) {
    Set<String> partitionNames = partitions.stream().map(struct -> struct.name()).collect(Collectors.toSet());
    String columns = Arrays.stream(schema.fields())
            //Filter out the partition columns
            .filter(field -> !partitionNames.contains(field.name()))
            .map(HiveTableHelper::fieldToStringBuilder)
            .collect(Collectors.joining(", "));
    String partitionsString = partitions.stream().map(HiveTableHelper::fieldToStringBuilder).collect(Collectors.joining(", "));
    return String.format(CREATE_EXTERNAL, tableName, columns, partitionsString, format.name(), tablePath);
}
/**
 * Recovers the partitions of the given table and refreshes it in the catalog.
 *
 * @param sparkSession the active SparkSession
 * @param table        the table whose partitions should be recovered
 */
public void recoverPartitions(SparkSession sparkSession, String table) {
    String query = "ALTER TABLE " + table + " RECOVER PARTITIONS";
    logger.debug("Start: " + query);
    sparkSession.sql(query);
    sparkSession.catalog().refreshTable(table);
    logger.debug("Finish: " + query);
}

public static StringBuilder fieldToStringBuilder(StructField field) {
    StringBuilder sb = new StringBuilder();
    sb.append(field.name()).append(" ").append(field.dataType().simpleString());
    return sb;
}
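A hypothetical usage of the helper above (the schema, partition column, table name and location are assumptions, and SparkFormat is the enum used in the snippet, assumed to expose a PARQUET constant):
// Hypothetical call: register partitioned raw Parquet files as an external table.
StructType schema = new StructType()
        .add("value", DataTypes.StringType)
        .add("ingest_date", DataTypes.StringType);
List<StructField> partitions = Collections.singletonList(schema.apply("ingest_date"));
hiveTableHelper.createExternalTable(sparkSession, schema, "EVENTS_RAW",
        SparkFormat.PARQUET, partitions, "hdfs://XXXXXX:8020/employee/data_raw");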

Related

Would iteration over Cassandra rows with LIMIT and OFFSET have any unexpected side effects?

My project has a huge Cassandra table. My Cassandra cluster runs with 5 nodes in Kubernetes. As the backend I use Spring Boot 2.x.
I am trying to update values in my entire table. Because of the size of the table I do not use a "SELECT * FROM TABLE" query.
Instead I am thinking about using "LIMIT" with "OFFSET":
String sql = "SELECT * FROM " + tableName + " LIMIT " + limit + " OFFSET " + offset;
With a recursive call:
private boolean migrateBookingTable(Database database, Statement statement, String tableName, int limit, int offset) throws SQLException, LiquibaseException {
String sql = "SELECT * FROM " + tableName + " LIMIT " + limit + " OFFSET " + offset;
try (ResultSet resultSet = statement.executeQuery(sql)) {
//if resultSet is empty, we are done
if (!resultSet.isBeforeFirst()) {
return false;
}
while (resultSet.next()) {
//some logic
}
database.execute...
}
return migrateBookingTable(database, statement, tableName, limit, offset+limit);
}
I tested it in a small test environment and it worked. But because of Cassandra's peculiarities and the fact that production runs 5 nodes, I'm not sure about side effects.
Is this an "ok" way to go?
OFFSET is not part of the CQL language, so I'm not sure how you tested this:
cqlsh:ks1> select * from user LIMIT 1 OFFSET 1;
SyntaxException: line 1:27 mismatched input 'OFFSET' expecting EOF (...* from user LIMIT 1 [OFFSET]...)
Because of the size of the table I do not use a "SELECT * FROM TABLE" query.
Apart from the awful findAll() allowed by Spring Data Cassandra, every request is paged in Cassandra. Why not go with the default paging behaviour of the Cassandra drivers?
SimpleStatement statement = QueryBuilder.selectFrom(USER_TABLENAME).all().build()
        .setPageSize(10)                    // 10 items per page
        .setTimeout(Duration.ofSeconds(1))  // 1s timeout
        .setConsistencyLevel(ConsistencyLevel.ONE);
ResultSet page1 = session.execute(statement);
LOGGER.info("+ Page 1 has {} items", page1.getAvailableWithoutFetching());
Iterator<Row> page1Iter = page1.iterator();
while (0 < page1.getAvailableWithoutFetching()) {
    LOGGER.info("Page1: " + page1Iter.next().getString(USER_EMAIL));
}
// Getting ready for page 2: statements are immutable in driver 4.x,
// so keep the copy returned by setPagingState
ByteBuffer pagingStateAsBytes = page1.getExecutionInfo().getPagingState();
statement = statement.setPagingState(pagingStateAsBytes);
ResultSet page2 = session.execute(statement);
Spring Data also allows paging with Slice:
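A minimal sketch of that Slice-based paging, assuming a Spring Data repository such as public interface UserRepository extends CassandraRepository<User, UUID> (the repository, entity and page size are assumptions, not from the original answer):
Slice<User> slice = userRepository.findAll(CassandraPageRequest.first(100));
while (true) {
    slice.forEach(user -> {
        // apply the update/migration logic to each entity here
    });
    if (!slice.hasNext()) {
        break;
    }
    slice = userRepository.findAll(slice.nextPageable());
}
CassandraPageRequest lives in org.springframework.data.cassandra.core.query and Slice in org.springframework.data.domain; each fetch pulls one driver page rather than the whole table.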

How to Dynamically Retrieve JDBC ResultSet Data

The requirement is:
The application will run dynamic SQL and show the results in table format in a JSP. The SQL passed to the application will change, which means the number, names, and datatypes of the selected columns will change, and so the result set will also change. The SQL is stored in a config.properties file; every time we need to run a different SQL, we just change it in config.properties. After the SQL is executed, I retrieve the column names and datatypes from the ResultSet's metadata object:
ResultSetMetaData rsmd = rs.getMetaData(); // rs is the ResultSet
Map<String, String> hmap = new LinkedHashMap<>();
for (int i = 1; i <= rsmd.getColumnCount(); i++) {
    hmap.put(rsmd.getColumnName(i), rsmd.getColumnTypeName(i));
}
hmap.entrySet().forEach(entry -> System.out.println(entry.getKey() + " : " + entry.getValue()));
Output:
TRADER : VARCHAR2
TRAN_NUM : NUMBER
STARTTIME : DATE
ERROR_DETAILS : CLOB
In JDBC, we have specific methods, e.g. rs.getString(columnName), rs.getInt(columnIndex), rs.getTimestamp(), rs.getClob(), to get data of different data types. But in this scenario everything is dynamic, as the column names and datatypes will change every time.
The ResultSet contains around 2000 rows.
How do I write the logic to check each column's datatype and apply the correct rs.getXXX() method to retrieve the ResultSet's data dynamically?
I was able to do it with:
JsonArray jsonArray = new JsonArray();   // com.google.gson.JsonArray / JsonObject
while (rs.next()) {
    JsonObject jsonRow = new JsonObject();
    for (String colName : resultSetColumnNames) {
        jsonRow.addProperty(colName, rs.getObject(colName) == null ? "NULL" : rs.getObject(colName).toString());
    }
    jsonArray.add(jsonRow);
}
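For completeness, a hedged sketch of the typed-getter approach the question originally asks about: look up each column's JDBC type via ResultSetMetaData.getColumnType() and switch on the java.sql.Types constants (java.sql.Clob and java.sql.Types are used; variable names are assumptions):
ResultSetMetaData meta = rs.getMetaData();
while (rs.next()) {
    for (int i = 1; i <= meta.getColumnCount(); i++) {
        Object value;
        switch (meta.getColumnType(i)) {   // a java.sql.Types constant
            case Types.CLOB:
                Clob clob = rs.getClob(i);
                value = (clob == null) ? null : clob.getSubString(1, (int) clob.length());
                break;
            case Types.DATE:
            case Types.TIMESTAMP:
                value = rs.getTimestamp(i);
                break;
            case Types.NUMERIC:
            case Types.DECIMAL:
                value = rs.getBigDecimal(i);
                break;
            default:
                value = rs.getString(i);
        }
        // use meta.getColumnName(i) and value, e.g. add them to the JSON row
    }
}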

Oracle XMLType - Loading from XML flat-file as chunks

Using Java 8 and Oracle 11g. This is about loading XML data from a flat file into an Oracle XMLType column. I can make it work with this code:
private String readAllBytesJava7(String filePath) {
    String content = "";
    try {
        content = new String(Files.readAllBytes(Paths.get(filePath)));
    }
    catch (IOException e) {
        log.error(e);
    }
    return content;
}
pstmt = oracleConnection.prepareStatement("update MYTABLE set XML_SOURCE = ? where TRANSACTION_NO = ?");
xmlFileAsString = this.readAllBytesJava7(fileTempLocation);
xmlType = XMLType.createXML(oracleConnection, xmlFileAsString);
pstmt.setObject(1,xmlType);
pstmt.setInt(2, ataSpecHeader.id);
pstmt.executeUpdate();
But as you might surmise, that only works for small XML files... Anything too large will cause a memory exception.
What I'd like to do is load the XML file in "chunks" as described here:
https://docs.oracle.com/cd/A97335_02/apps.102/a83724/oralob2.htm
and
https://community.oracle.com/thread/4721
Those posts show how to load a BLOB/CLOB column from a flat-file by "chunks". I can make it work if the column is blob/clob, but I couldn't adapt it for an XMLType column. Most of what I found online in regards to loading an XMLType column deals with using the oracle-directory object or using sql-loader, but I won't be able to use those as my solution. Is there any kind of post/example that someone knows of for how to load an XML file into an XMLType column as "chunks"?
Additional information:
I'm trying to take what I see in the posts for BLOB/CLOB and adapt it for XMLType. Here are the issues I'm facing:
sqlXml = oracleConnection.createSQLXML();
pstmt = oracleConnection.prepareStatement(
        "update MYTABLE set XML_SOURCE = XMLType.createXML('<e/>') where 1=1 and TRANSACTION_NO = ?");
pstmt.setInt(1, ataSpecHeader.id);
pstmt.executeUpdate();
With BLOB/CLOB, you start out by setting the BLOB/CLOB field to "empty" (so it isn't null)... I'm not sure how to do this with XMLType... the closest I can get is just to set it to some kind of XML, as shown above.
The next step is to select the blob/clob field and get the output stream on it. Something like what is shown here:
cmd = "SELECT XML_SOURCE FROM MYTABLE WHERE TRANSACTION_NO = ${ataSpecHeader.id} FOR UPDATE ";
stmt = oracleConnection.createStatement();
rset = stmt.executeQuery(cmd);
rset.next();
xmlType = ((OracleResultSet)rset).getOPAQUE(1);
//clob = ((OracleResultSet)rset).getCLOB(1);
//blob = ((OracleResultSet)rset).getBLOB(1);
clob = xmlType.getClobVal();
//sqlXml = rset.getSQLXML(1);
//outstream = sqlXml.setBinaryStream();
//outstream = blob.getBinaryOutputStream();
outstream = clob.getAsciiOutputStream();
//At this point, read the XML file in "chunks" and write it to the outstream object by doing: outstream.write
The lines that are commented out show the different things I've tried. To restate: I can make it work fine if the field in the table is a BLOB or CLOB, but I'm not sure what to do if it's an XMLType. I'd like to get an output-stream handle on the XMLType field so I can write to it, as I would if it were a BLOB or CLOB. Notice that for BLOB/CLOB the code selects the field with "FOR UPDATE" and then gets an output stream on it to write to. For XMLType, I tried getting the field into an XMLType Java class and an SQLXML Java class, but it doesn't work that way. I also tried getting the field first as XMLType/SQLXML and then casting to BLOB/CLOB to get an output stream, but that doesn't work either. The truth is, I'm not sure what I'm supposed to do in order to write to the XMLType field as a stream, in chunks.
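A hedged sketch of one alternative using the standard JDBC SQLXML API instead of the oracle.xdb.XMLType class: create the SQLXML on the connection, copy the file into its character stream in fixed-size chunks, then bind it to the update. Whether the Oracle driver actually streams the content rather than buffering it is an assumption to verify for your driver version:
java.sql.SQLXML sqlXml = oracleConnection.createSQLXML();
try (java.io.Reader in = java.nio.file.Files.newBufferedReader(
            java.nio.file.Paths.get(fileTempLocation), java.nio.charset.StandardCharsets.UTF_8);
     java.io.Writer out = sqlXml.setCharacterStream()) {
    char[] buffer = new char[8192];
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);   // copy the XML file chunk by chunk
    }
}
pstmt = oracleConnection.prepareStatement("update MYTABLE set XML_SOURCE = ? where TRANSACTION_NO = ?");
pstmt.setSQLXML(1, sqlXml);
pstmt.setInt(2, ataSpecHeader.id);
pstmt.executeUpdate();
sqlXml.free();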

Unable to fetch data from Hbase based on query parameters

How do I get data from HBase? I have a table with empId, name, startDate, endDate and other columns. Now I want to get data from an HBase table based upon empId, startDate and endDate. In normal SQL I can use:
select * from tableName where empId=val and date>=startDate and date<=endDate
How can I do this in HBase as it stores data as key value pairs? The key is empId.
Getting filtered rows in the HBase shell is tricky. Since the shell is JRuby-based, you can use Ruby commands here as well:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.filter.FilterList
import java.text.SimpleDateFormat
import java.lang.Long
def dateToBytes(val)
Long.toString(
SimpleDateFormat.new("yyyy/MM/dd").parse(val).getTime()).to_java_bytes
end
# table properties
colfam='c'.to_java_bytes;
col_name='name';
col_start='startDate';
col_end='endDate';
# query params
q_name='name2';
q_start='2012/08/14';
q_end='2012/08/24';
# filters
f_name=SingleColumnValueFilter.new(
colfam, col_name.to_java_bytes,
CompareFilter::CompareOp::EQUAL,
BinaryComparator.new(q_name.to_java_bytes));
f_start=SingleColumnValueFilter.new(
colfam, col_start.to_java_bytes,
CompareFilter::CompareOp::GREATER_OR_EQUAL,
BinaryComparator.new(dateToBytes(q_start)));
f_end=SingleColumnValueFilter.new(
colfam, col_end.to_java_bytes,
CompareFilter::CompareOp::LESS_OR_EQUAL,
BinaryComparator.new(dateToBytes(q_end)));
filterlist= FilterList.new([f_name, f_start, f_end]);
# get the result
scan 'mytable', {"FILTER"=>filterlist}
Similarly, in Java, construct a FilterList:
// Query params
String nameParam = "name2";
String startDateParam = "2012/08/14";
String endDateParam = "2012/08/24";
Filter nameFilter =
new SingleColumnValueFilter(colFam, nameQual, CompareOp.EQUAL,
Bytes.toBytes(nameParam));
//getBytesFromDate(): parses startDateParam and creates a byte array out of it
Filter startDateFilter =
new SingleColumnValueFilter(colFam, startDateQual,
CompareOp.GREATER_OR_EQUAL, getBytesFromDate(startDateParam));
Filter endDateFilter =
new SingleColumnValueFilter(colFam, endDateQual,
CompareOp.LESS_OR_EQUAL, getBytesFromDate(endDateParam));
FilterList filters = new FilterList();
filters.addFilter(nameFilter);
filters.addFilter(startDateFilter);
filters.addFilter(endDateFilter);
HTable htable = new HTable(conf, tableName);
Scan scan = new Scan();
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);
//process your result...
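Since empId is the row key, the scan can additionally be bounded with scan.setStartRow()/scan.setStopRow() before setting the filters. A minimal sketch of consuming the scanner above (reusing the colFam and qualifier byte arrays assumed earlier):
for (Result result : rs) {
    byte[] name = result.getValue(colFam, nameQual);
    System.out.println(Bytes.toString(result.getRow()) + " -> " + Bytes.toString(name));
}
rs.close();
htable.close();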

Reading/Writing DataTables to and from an OleDb Database LINQ

My current project is to take information from an OleDb database and .CSV files and place it all into a larger OleDb database.
I have currently read all the information I need from both the .CSV files and the OleDb database into DataTables. Where it gets hairy is writing all of that information back to another OleDb database.
Right now my current method is to do something like this:
OleDbTransaction myTransaction = null;
try
{
    OleDbConnection conn = new OleDbConnection("PROVIDER=Microsoft.Jet.OLEDB.4.0;" +
        "Data Source=" + Database);
    conn.Open();
    OleDbCommand command = conn.CreateCommand();
    string strSQL;
    // begin the transaction before assigning it to the command
    myTransaction = conn.BeginTransaction();
    command.Transaction = myTransaction;
    strSQL = "Insert into TABLE " +
        "(FirstName, LastName) values ('" +
        FirstName + "', '" + LastName + "')";
    command.CommandType = CommandType.Text;
    command.CommandText = strSQL;
    command.ExecuteNonQuery();
    myTransaction.Commit();
    conn.Close();
}
catch (Exception)
{
    // If invalid data is entered, roll back the transaction
    if (myTransaction != null)
        myTransaction.Rollback();
}
Of course, this is very basic, and I'm using a SQL command to commit my transactions over a connection. My problem is that I could do this, but I have about 200 fields that need to be inserted across several tables. I'm willing to do the legwork if that's the only way to go, but I feel like there is an easier method. Is there anything in LINQ that could help me out with this?
If the column names in the DataTable match exactly to the column names in the destination table, then you might be able to use an OleDbCommandBuilder (warning: I haven't tested this yet). One area where you may run into problems is if the data types of the source data table do not match those of the destination table (e.g. if the source column data types are all strings).
EDIT
I revised my original code in a number of ways. First, I switched to using the Merge method on a DataTable. This allowed me to skip using the LoadDataRow in a loop.
using ( var conn = new OleDbConnection( destinationConnString ) )
{
//query off the destination table. Could also use Select Col1, Col2..
//if you were not going to insert into all columns.
const string selectSql = "Select * From [DestinationTable]";
using ( var adapter = new OleDbDataAdapter( selectSql, conn ) )
{
using ( var builder = new OleDbCommandBuilder( adapter ) )
{
conn.Open();
var destinationTable = new DataTable();
adapter.Fill( destinationTable );
//if the column names do not match exactly, then they
//will be skipped
destinationTable.Merge( sourceDataTable, true, MissingSchemaAction.Ignore );
//ensure that all rows are marked as Added.
destinationTable.AcceptChanges();
foreach ( DataRow row in destinationTable.Rows )
row.SetAdded();
builder.QuotePrefix = "[";
builder.QuoteSuffix= "]";
//forces the builder to rebuild its insert command
builder.GetInsertCommand();
adapter.Update( destinationTable );
}
}
}
ADDITION: An alternative solution would be to use a framework like FileHelpers to read the CSV files and post them into your database. It has an OleDbStorage DataLink for posting into OleDb sources. See the SqlServerStorage InsertRecord example to see how (in the end, substitute OleDbStorage for SqlServerStorage).
It sounds like you have many .mdb and .csv files that you need to merge into a single .mdb. This answer runs with that assumption, and with the assumption that you have SQL Server available to you. If you don't, then consider downloading SQL Express.
Use SQL Server to act as the broker between your multiple datasources and your target datastore. Script each datasource as an insert into a SQL Server holding table. When all data is loaded into the holding table, perform a final push into your target Access datastore.
Consider these steps:
In SQL Server, create a holding table for the imported CSV data.
CREATE TABLE CsvImport
(CustomerID smallint,
LastName varchar(40),
BirthDate smalldatetime)
Create a stored proc whose job will be to read a given CSV filepath, and insert into a SQL Server table.
CREATE PROC ReadFromCSV
@CsvFilePath varchar(1000)
AS
BULK
INSERT CsvImport
FROM @CsvFilePath --'c:\some.csv'
WITH
(
FIELDTERMINATOR = ',', --your own specific terminators should go here
ROWTERMINATOR = '\n'
)
GO
Create a script to call this stored proc for each .csv file you have on disk. Perhaps some Excel trickery or filesystem dir piped commands can help you create these statements.
exec ReadFromCSV 'c:\1.csv'
For each .mdb datasource, create a temp linked server.
DECLARE @MdbFilePath varchar(1000);
SELECT @MdbFilePath = 'C:\MyMdb1.mdb';
EXEC master.dbo.sp_addlinkedserver @server = N'MY_ACCESS_DB_', @srvproduct=N'Access', @provider=N'Microsoft.Jet.OLEDB.4.0', @datasrc=@MdbFilePath
-- grab the relevant data
INSERT CsvImport (CustomerID, LastName, BirthDate)
SELECT [CustomerID]
      ,[LastName]
      ,[BirthDate]
FROM [MY_ACCESS_DB_]...[Customers]
--your data's now in the holding table
--remove the linked server
EXEC master.dbo.sp_dropserver @server=N'MY_ACCESS_DB_', @droplogins='droplogins'
When you're done importing data into that holding table, create a Linked Server in your SQL Server instance. This is the target datastore. SELECT the data from SQL Server into Access.
EXEC master.dbo.sp_addlinkedserver @server = N'MY_ACCESS_TARGET', @srvproduct=N'Access', @provider=N'Microsoft.Jet.OLEDB.4.0', @datasrc='C:\Target.mdb'
INSERT INTO [MY_ACCESS_TARGET]...[Customer]
([CustomerID]
,[LastName]
,[BirthDate])
SELECT CustomerID,
       LastName,
       BirthDate
FROM CsvImport
