I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to connect to Drill via JDBC:
Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);
connectionOptions.put("dbtable", args[1]);
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");
DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();
Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:
SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0
SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)
The first one succeeds, but in the second one Spark encloses the field names in double quotes, which Drill doesn't support, so the query fails.
Has anyone managed to get this integration working?
Thank you!
You can add a JDBC dialect for this and register it before using the JDBC connector. Spark's default dialect wraps identifiers in double quotes, so overriding quoteIdentifier to return the column name unchanged keeps the generated query Drill-friendly:
case object DrillDialect extends JdbcDialect {
  def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")
  override def quoteIdentifier(colName: java.lang.String): java.lang.String = {
    return colName
  }
  def instance = this
}
JdbcDialects.registerDialect(DrillDialect)
This is how the accepted answer code looks in Java:
import org.apache.spark.sql.jdbc.JdbcDialect;
public class DrillDialect extends JdbcDialect {
  @Override
  public String quoteIdentifier(String colName) {
    return colName;
  }
  public boolean canHandle(String url) {
    return url.startsWith("jdbc:drill:");
  }
}
Before creating the SparkSession, register the dialect:
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;
public static void main(String[] args) {
  JdbcDialects.registerDialect(new DrillDialect());

  SparkSession spark = SparkSession
      .builder()
      .appName("Drill Dialect")
      .getOrCreate();

  // More Spark code here..
  spark.stop();
}
Tried and tested with Spark 2.3.2 and Drill 1.16.0. Hope it helps you too!
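With the dialect registered, the JDBC read from the question should then generate an unquoted SELECT. A minimal sketch of the read side in Java, assuming spark is the SparkSession created above (the Drill connection URL is a placeholder, not taken from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumes DrillDialect was registered before this point.
Dataset<Row> logs = spark.read()
    .format("jdbc")
    .option("url", "jdbc:drill:zk=localhost:2181")   // placeholder Drill JDBC URL
    .option("dbtable", "dfs.output.`my_view`")
    .option("driver", "org.apache.drill.jdbc.Driver")
    .load();

logs.show();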
OS: Debian 11
SDK: OpenJDK 17, 64-bit
Spring-boot-starter-data-jdbc:2.6.4
Spring-jdbc:5.3.16
HikariCP:4.0.3
Postgres 14
Hi. I am trying to generate some PDF reports with the Jasper dependency inside my Spring Boot project. I was recently working with JPA, but I need to move to JDBC. I made a small report with JPA + Jasper, but when I try to do the same thing with JDBC + Jasper I have a problem with the type of data structure that jdbcTemplate.query returns.
When I print the result of the DB query with JPA and JDBC respectively, I see structures like this:
JPA ---> [{}, {}, ... {}]
JDBC ---> [[], [], ... []]
Log example
JPA
[Transaccion{id=1, transaccion_tipo_id=2, transaccion_estado_id=1, usuario_id=18, monto=0.0, referencia=19-0100-9, codigo_verificacion=0}, ...]
JDBC
[Transaccion[id=1, transaccion_tipo_id=2, transaccion_estado_id=1, usuario_id=18, monto=0.0, referencia=19-0100-9, codigo_verificacion=0], ...]
Here is my data access service, using jdbcTemplate.query:
@Override
public List<Transaccion> selectTransacciones() {
  var sql = """
      SELECT id,
             transaccion_tipo_id,
             transaccion_estado_id,
             usuario_id,
             monto,
             referencia,
             codigo_verificacion
      FROM procesadora.transaccion
      LIMIT 200
      """;
  return jdbcTemplate.query(sql, new TransaccionRowMapper());
}
This is the RowMapper I am using:
import org.springframework.jdbc.core.RowMapper;

import java.sql.ResultSet;
import java.sql.SQLException;

public class TransaccionRowMapper implements RowMapper<Transaccion> {

  @Override
  public Transaccion mapRow(ResultSet resultSet, int rowNum) throws SQLException {
    return new Transaccion(
        resultSet.getLong("id"),
        resultSet.getInt("transaccion_tipo_id"),
        resultSet.getInt("transaccion_estado_id"),
        resultSet.getInt("usuario_id"),
        resultSet.getDouble("monto"),
        resultSet.getString("referencia"),
        resultSet.getString("codigo_verificacion")
    );
  }
}
Here is the Transaccion object; I use a record to keep it cleaner.
public record Transaccion(Long id,
                          Integer transaccion_tipo_id,
                          Integer transaccion_estado_id,
                          Integer usuario_id,
                          Double monto,
                          String referencia,
                          String codigo_verificacion) {
}
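For context on the bracketed format in the JDBC log: a Java record's generated toString prints its components as ClassName[component=value, ...], while a typical JPA entity class has a hand-written toString (often using braces). A minimal standalone sketch, with a hypothetical record that is not part of the repo:

public class RecordToStringDemo {
  // Hypothetical two-field record, only to show the generated toString format.
  record Punto(int x, int y) { }

  public static void main(String[] args) {
    System.out.println(new Punto(1, 2)); // prints: Punto[x=1, y=2]
  }
}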
Here I create the PDF report:
@Service
public class TransaccionReporte {

  private final TransaccionService transaccionService;

  public TransaccionReporte(TransaccionService transaccionService) {
    this.transaccionService = transaccionService;
  }

  public String exportReport() throws FileNotFoundException, JRException {
    System.out.println("*** reporte begins ***");

    // bring data from db
    List<Transaccion> transaccions = transaccionService.getTransacciones();
    System.out.println(transaccions); // <--- here I expect [{}, {}] not [[], []]
    System.out.println(transaccions.getClass());
    System.out.println("*** List transacciones done ***");

    // load file - blueprint for the pdf
    File file = ResourceUtils.getFile("classpath:transaccion.jrxml");
    System.out.println("*** Load file done ***");

    // compile it
    JasperReport jasperReport = JasperCompileManager.compileReport(file.getAbsolutePath());
    JRBeanCollectionDataSource dataSource = new JRBeanCollectionDataSource(transaccionService.getTransacciones());
...
I don't think this is a difficult problem, but I need more information about RowMapper, JDBC, and everything related to that. If anybody can give me a clue I would really appreciate it.
This is the repo I'm working on: https://github.com/biagiola/spring-jdbc-jasper-report
I want to use the query 'ANALYZE TABLE {tableName}', but I think MyBatis supports only CRUD.
How can I use 'ANALYZE TABLE' in MyBatis?
Just declare it as a normal select and specify Map as the return type.
@Select("analyze table ${tableName}")
Map<String, Object> analyzeTable(String tableName);
@Test
public void testAnalyzeTable() {
  try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
    Mapper mapper = sqlSession.getMapper(Mapper.class);
    Map<String, Object> result = mapper.analyzeTable("users");
    assertEquals("test.users", result.get("Table"));
    assertEquals("analyze", result.get("Op"));
    assertEquals("status", result.get("Msg_type"));
    assertEquals("OK", result.get("Msg_text"));
  }
}
Tested using...
MariaDB 10.4.10
MariaDB Connector/J 2.5.4
I used Elasticsearch Connector as a Sink to insert data into Elasticsearch (see : https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/elasticsearch.html).
But I did not find any connector to read data from Elasticsearch as a source.
Is there any connector or example for using Elasticsearch documents as a source in a Flink pipeline?
Regards,
Ali
I don't know of an explicit ES source for Flink. I did see one user talking about using elasticsearch-hadoop as a HadoopInputFormat with Flink, but I don't know if that worked for them (see their code).
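For reference, a rough sketch of that elasticsearch-hadoop route, assuming the flink-hadoop-compatibility and elasticsearch-hadoop dependencies are on the classpath (the node address, index name, and query below are placeholders, not from the question):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.mr.EsInputFormat;

public class EsHadoopSourceSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    JobConf jobConf = new JobConf();
    jobConf.set("es.nodes", "localhost:9200");   // placeholder ES node
    jobConf.set("es.resource", "my_index/_doc"); // placeholder index/type
    jobConf.set("es.query", "?q=*");             // optional query

    // Wrap elasticsearch-hadoop's mapred InputFormat so Flink can read from it.
    HadoopInputFormat<Text, MapWritable> esInput =
        new HadoopInputFormat<>(new EsInputFormat<Text, MapWritable>(),
            Text.class, MapWritable.class, jobConf);

    // Each record is (document id, document fields as a MapWritable).
    DataSet<Tuple2<Text, MapWritable>> docs = env.createInput(esInput);
    docs.print();
  }
}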
I finally defined a simple read-from-Elasticsearch function:
public static class ElasticsearchFunction
    extends ProcessFunction<MetricMeasurement, MetricPrediction> {

  // These fields were omitted in the original snippet; a transport client and its
  // settings are assumed (configure cluster.name etc. for your cluster).
  private Settings settings = Settings.EMPTY;
  private TransportClient client;

  public ElasticsearchFunction() throws UnknownHostException {
    client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new TransportAddress(InetAddress.getByName("YOUR_IP"), PORT_NUMBER));
  }

  @Override
  public void processElement(MetricMeasurement in, Context context, Collector<MetricPrediction> out) throws Exception {
    MetricPrediction metricPrediction = new MetricPrediction();
    metricPrediction.setMetricId(in.getMetricId());
    metricPrediction.setGroupId(in.getGroupId());
    metricPrediction.setBucket(in.getBucket());

    // Get the metric measurement from Elasticsearch
    SearchResponse response = client.prepareSearch("YOUR_INDEX_NAME")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(QueryBuilders.termQuery("YOUR_TERM", in.getMetricId()))      // Query
        .setPostFilter(QueryBuilders.rangeQuery("value").from(0L).to(50L))     // Filter
        .setFrom(0).setSize(1).setExplain(true)
        .get();

    SearchHit[] results = response.getHits().getHits();
    for (SearchHit hit : results) {
      String sourceAsString = hit.getSourceAsString();
      if (sourceAsString != null) {
        ObjectMapper mapper = new ObjectMapper();
        MetricMeasurement obj = mapper.readValue(sourceAsString, MetricMeasurement.class);
        metricPrediction.setPredictionValue(obj.getValue());
      }
    }
    out.collect(metricPrediction);
  }
}
Hadoop Compatibility + Elasticsearch Hadoop
https://github.com/cclient/flink-connector-elasticsearch-source
I am trying to query Hive using java.sql.PreparedStatement and I get an empty result set; the same query gives a proper result set when executed using java.sql.Statement. I am using the hive-jdbc 1.2.2 jar and the Hive server is part of a Hortonworks HDP stack.
Yes, it does:
public class HivePreparedStatement extends HiveStatement implements java.sql.PreparedStatement
As can be seen, internally Hive does implement the JDBC interface PreparedStatement and thus, the driver supports this JDBC feature.
For reference see: https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hive/jdbc/HivePreparedStatement.html
Hope it helps.
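A minimal usage sketch with that driver (the connection URL, credentials, table, and column names are placeholders, not taken from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HivePreparedStatementDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder HiveServer2 URL; adjust host, port, and database for your cluster.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement ps = conn.prepareStatement("SELECT col1 FROM my_table WHERE id = ?")) {
      ps.setInt(1, 1);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}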
Only formally. There is no server-side prepare; the driver splices the parameter values into the SQL string on the client, as you can see in updateSql:
https://github.com/apache/hive/blob/ab4c53de82d4aaa33706510441167f2df55df15e/jdbc/src/java/org/apache/hive/jdbc/HivePreparedStatement.java#L116
private String updateSql(String sql, HashMap<Integer, String> parameters) throws SQLException {
  List<String> parts = this.splitSqlStatement(sql);
  StringBuilder newSql = new StringBuilder((String) parts.get(0));
  for (int i = 1; i < parts.size(); ++i) {
    if (!parameters.containsKey(i)) {
      throw new SQLException("Parameter #" + i + " is unset");
    }
    newSql.append((String) parameters.get(i));
    newSql.append((String) parts.get(i));
  }
  return newSql.toString();
}
I started with HBase a few days back and I am going through all the material available online.
I have installed and configured HBase, and the shell commands are working fine.
I got an example of a Java client that gets data from an HBase table and it executed successfully, but I could not understand how it works: nowhere in the code do we mention the port or host of the HBase server, so how is it able to fetch the data from the table?
This is my code:
public class RetriveData {

  public static void main(String[] args) throws IOException {

    // Instantiating Configuration class
    Configuration config = HBaseConfiguration.create();

    // Instantiating HTable class
    @SuppressWarnings({ "deprecation", "resource" })
    HTable table = new HTable(config, "emp");

    // Instantiating Get class
    Get g = new Get(Bytes.toBytes("1"));

    // Reading the data
    Result result = table.get(g);

    // Reading values from Result class object
    byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
    byte[] value1 = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"));

    // Printing the values
    String name = Bytes.toString(value);
    String city = Bytes.toString(value1);
    System.out.println("name: " + name + " city: " + city);
  }
}
The output looks like:
name: raju city: hyderabad
I agree with Binary Nerd's answer, and I am adding some more interesting information for better understanding.
Your question:
I could not understand how it works: nowhere in the code do we mention the port or host of the HBase server, so how is it able to fetch the data from the table?
Since you are executing this program inside the cluster,
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create()
picks up all the cluster properties automatically, because the HBase Java client program is running on a cluster node where the HBase configuration files are on the classpath.
Now try it like below (run the same program in a different way, from a remote machine, e.g. Eclipse on Windows, to see the difference between what you did earlier and what you do now).
public static Configuration configuration; // this is a class variable

static { // fill clusternode1,clusternode2,clusternode3 from your cluster
  configuration = HBaseConfiguration.create();
  configuration.set("hbase.zookeeper.property.clientPort", "2181");
  configuration.set("hbase.zookeeper.quorum",
      "clusternode1,clusternode2,clusternode3");
  configuration.set("hbase.master", "clusternode1:60000"); // default HMaster port
}
Hope this helps you to understand.
If you look at the source code for HBaseConfiguration on GitHub you can see what happens when you call create():
public static Configuration create() {
  Configuration conf = new Configuration();
  // In case HBaseConfiguration is loaded from a different classloader than
  // Configuration, conf needs to be set with appropriate class loader to resolve
  // HBase resources.
  conf.setClassLoader(HBaseConfiguration.class.getClassLoader());
  return addHbaseResources(conf);
}
Followed by:
public static Configuration addHbaseResources(Configuration conf) {
  conf.addResource("hbase-default.xml");
  conf.addResource("hbase-site.xml");
  checkDefaultsVersion(conf);
  HeapMemorySizeUtil.checkForClusterFreeMemoryLimit(conf);
  return conf;
}
So it's loading the configuration from your HBase configuration files, hbase-default.xml and hbase-site.xml, which it finds on the classpath; that is where the host and port of your cluster come from.
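If you want to confirm which values were picked up from those files, a quick check (assuming hbase-site.xml is on the client's classpath) is to print the relevant keys:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowHBaseConfig {
  public static void main(String[] args) {
    // Loads hbase-default.xml and hbase-site.xml from the classpath, as shown above.
    Configuration config = HBaseConfiguration.create();

    // These are the settings the client actually uses to find the cluster.
    System.out.println("quorum     = " + config.get("hbase.zookeeper.quorum"));
    System.out.println("clientPort = " + config.get("hbase.zookeeper.property.clientPort"));
  }
}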