Spark DataFrame not executing group-by statements within a JDBC data source

I've registered a MySQL data source as follows:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "videos"))
jdbcDF.registerTempTable("videos")
and then executed the following Spark SQL query:
select
uploader, count(*) as items
from
videos
where
publisher_id = 154
group by
uploader
order by
items desc
This call actually executes the following query on the MySQL server:
SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154
and then loads the data to the Spark cluster and performs the group-by as a Spark operation.
This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?

Well, it depends. Spark can push down only predicates over JDBC, so it is not possible to execute an arbitrary query on the database side dynamically. Still, you can pass any valid query as the table argument, so you can do something like this:
val tableQuery =
  """(SELECT uploader, count(*) as items FROM videos GROUP BY uploader) tmp"""

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> tableQuery
))
If that's not enough, you can try to create a custom data source.
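For the query in the question, the same trick can push the WHERE, GROUP BY and ORDER BY down to MySQL as well. A minimal, untested sketch, reusing the url and driver values defined above (the temp table name uploader_counts is just an illustrative choice):

// Wrap the full aggregation as a derived table so MySQL executes it
// and Spark only receives the already-grouped rows.
val pushedDownQuery =
  """(SELECT uploader, COUNT(*) AS items
    |   FROM videos
    |   WHERE publisher_id = 154
    |   GROUP BY uploader
    |   ORDER BY items DESC) tmp""".stripMargin

val aggregatedDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> pushedDownQuery))

aggregatedDF.registerTempTable("uploader_counts")
sqlContext.sql("SELECT * FROM uploader_counts").show()

Be aware that MySQL is free to ignore an ORDER BY inside a derived table, so if the ordering matters it is safer to re-apply it on the Spark side.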

Related

How to fetch sql query results in airflow using JDBC operator

I have configured a JDBC connection in Airflow connections. The task in my DAG, shown below, contains a select statement. Triggering the DAG succeeds, but the query results are not printed in the log. How can I fetch the results of the query using the JDBC operator?
dag = DAG(dag_id='test_azure_sqldw_v1',
          default_args=default_args, schedule_interval=None,
          dagrun_timeout=timedelta(seconds=120),)
sql = "select count(*) from tablename"
azure_sqldw = JdbcOperator(task_id='azure_sqldw', sql=sql, jdbc_conn_id="cdf_sqldw",
                           autocommit=True, dag=dag)
The operator does not print to the log; it just runs the query. If you want to fetch the results and do something with them, you need to use the hook:
from pprint import pprint

from airflow.operators.python import PythonOperator
from airflow.providers.jdbc.hooks.jdbc import JdbcHook

def func(jdbc_conn_id, sql, **kwargs):
    """Print a DataFrame fetched over JDBC."""
    pprint(kwargs)
    hook = JdbcHook(jdbc_conn_id=jdbc_conn_id)
    df = hook.get_pandas_df(sql=sql, autocommit=True)
    print(df.to_string())

run_this = PythonOperator(
    task_id='task',
    python_callable=func,
    op_kwargs={'jdbc_conn_id': 'cdf_sqldw', 'sql': 'select count(*) from tablename'},
    dag=dag,
)
You can also create a custom operator that performs the action you need.

etcd 3 Range Query

How can I use etcd 3's new range query to get a subset of records, given these keys and values:
a-key/path/foo_1: value-1
a-key/path/foo_2: value-2
a-key/path/foo_3: value-3
a-key/path/foo_4: value-4
a-key/path/foo_5: value-5
I'd like to be able to query that data like this:
Get everything from a-key/path/foo_3 to a-key/path/foo_4 (specific end result), returning:
a-key/path/foo_3: value-3
a-key/path/foo_4: value-4
Or, everything from a-key/path/foo_3 onwards (no end) for example:
a-key/path/foo_3: value-3
a-key/path/foo_4: value-4
a-key/path/foo_5: value-5
I'm using the dotnet-etcd client for .NET (Core), if that helps or affects any answers.
The a-key/path/foo_ keys have no end, meaning they could go on forever... in Azure Table Storage I can query the same data like this:
// Query for table entities based on an optional `versionTo`.
new TableQuery<DynamicTableEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition(nameof(ITableEntity.PartitionKey), QueryComparisons.Equal, aggregateId),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition(nameof(ITableEntity.RowKey), QueryComparisons.GreaterThanOrEqual, CreateRowKey(versionFrom)),
            TableOperators.And,
            TableQuery.GenerateFilterCondition(nameof(ITableEntity.RowKey), QueryComparisons.LessThanOrEqual, CreateRowKey(versionTo ?? int.MaxValue)))
    )
);
// Generate the row key
string CreateRowKey(int version) => $"keyPrefix_{$"{version}".PadLeft(10, '0')}";

Joining RDDs from two different databases

I am trying to develop a Spark application that would get data from two different Oracle databases and work on them, for example by joining RDDs pulled from the two databases to create a new RDD.
Can I create different database connections inside one Spark application?
You can try something like the following, which takes the DataFrame approach, though I haven't tested it.
Database 1:
val employees = sqlContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:hr/hr@//localhost:1521/database1",
      "dbtable" -> "hr.employees"))
employees.printSchema()
Database 2:
val departments = sqlContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:hr/hr@//localhost:1521/database2",
      "dbtable" -> "hr.departments"))
departments.printSchema()
Now join them (broadcast is a hint that the data set is small, so Spark can perform a broadcast hash join):
val empDepartments = employees.join(broadcast(departments),
  employees("DEPARTMENT_ID") === departments("DEPARTMENT_ID"), "inner")
empDepartments.printSchema()
empDepartments.explain(true)
empDepartments.show()
An RDD (or now a DataFrame) is an abstraction layer where all data appears in a similar format irrespective of the underlying data source.
So once you load your data into a DataFrame, you should be able to use it just as is.
sqlContext.read.format("com.databricks.spark.avro").load("somepath").registerTempTable("avro_data")
sqlContext.read.format("parquet").load("someotjerpath").registerTempTable("parquet_data")
sqlContext.read.format("com.databricks.spark.redshift").option("url", jdbcConnectionString).option("query", query).load.registerTempTable("redshift_data")`
and then be able to do:
sqlContext.sql("select * from avro_data a left join parquet_data p on a.key = b.key left join redshift_data r on r.key=a.key")

ActiveRecord Subquery Inner Join

I am trying to convert a "raw" PostGIS SQL query into a Rails ActiveRecord query. My goal is to convert two sequential ActiveRecord queries (each taking ~1ms) into a single ActiveRecord query taking ~1ms. Using the SQL below with ActiveRecord::Base.connection.execute I was able to validate the reduction in time.
Thus, my direct request is to help me to convert this query into an ActiveRecord query (and the best way to execute it).
SELECT COUNT(*)
FROM "users"
INNER JOIN (
SELECT "centroid"
FROM "zip_caches"
WHERE "zip_caches"."postalcode" = '<postalcode>'
) AS "sub" ON ST_Intersects("users"."vendor_coverage", "sub"."centroid")
WHERE "users"."active" = 1;
NOTE that the value <postalcode> is the only variable data in this query. Obviously, there are two models here User and ZipCache. User has no direct relation to ZipCache.
The current two step ActiveRecord query looks like this.
zip = ZipCache.select(:centroid).where(postalcode: '<postalcode>').limit(1).first
User.where{st_intersects(vendor_coverage, zip.centroid)}.count
Disclaimer: I've never used PostGIS.
First, in your final request it seems like you've missed the WHERE "users"."active" = 1; part.
Here is what I'd do:
First add an active scope on User (for reusability):
scope :active, -> { User.where(active: 1) }
Then, for the actual query, you can build the subquery without executing it and use it in a joins on the User model, such as:
subquery = ZipCache.select(:centroid).where(postalcode: '<postalcode>')
User.active
.joins("INNER JOIN (#{subquery.to_sql}) sub ON ST_Intersects(users.vendor_coverage, sub.centroid)")
.count
This allows minimal raw SQL while keeping only one query.
In any case, check the actual SQL request in your console/log by setting the logger level to debug.
The amazing tool scuttle.io is perfect for converting these sorts of queries:
User.select(Arel.star.count).where(User.arel_table[:active].eq(1)).joins(
User.arel_table.join(ZipCache.arel_table).on(
Arel::Nodes::NamedFunction.new(
'ST_Intersects', [
User.arel_table[:vendor_coverage], Sub.arel_table[:centroid]
]
)
).join_sources
)

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (with the schema updated):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables'
    USING org.apache.pig.piggybank.storage.HiveColumnarLoader(
        'id int, name string, age int, create_dt string, timestamp string, accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is that, though the value filtered for in F is present in the Hive table, the output always contains 0 records. It is, however, able to load all the records into A.
Basically the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I'm not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables'
    USING org.apache.pig.piggybank.storage.HiveColumnarLoader(
        'id int, name string, age int, create_dt string, timestamp string, accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation seems to say that, for processing the actual values, you need the intermediate relation B.
