Joining RDDs from two different databases - Oracle

I am trying to develop a Spark application that would get data from two different Oracle databases and work on them, maybe doing things like joining RDDs pulled from the two databases to create a new RDD.
Can I create different database connections inside one Spark application?

You can try something like this, which is the DataFrame approach, though I haven't tested the code below.
Database 1:
val employees = sqlContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:hr/hr@//localhost:1521/database1",
      "dbtable" -> "hr.employees"))
employees.printSchema()
Database 2:
val departments = sqlContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:hr/hr@//localhost:1521/database2",
      "dbtable" -> "hr.departments"))
departments.printSchema()
Now join them (broadcast is a hint that departments is a small data set, so Spark can perform a broadcast hash join):
import org.apache.spark.sql.functions.broadcast  // needed for the broadcast hint

val empDepartments = employees.join(broadcast(departments),
  employees("DEPARTMENT_ID") === departments("DEPARTMENT_ID"), "inner")
empDepartments.printSchema()
empDepartments.explain(true)
empDepartments.show()

An RDD (or now a DataFrame) is an abstraction layer where all data appears in a similar format irrespective of the underlying data source.
So once you load your data into a DataFrame, you should be able to use it just as is.
sqlContext.read.format("com.databricks.spark.avro").load("somepath").registerTempTable("avro_data")
sqlContext.read.format("parquet").load("someotjerpath").registerTempTable("parquet_data")
sqlContext.read.format("com.databricks.spark.redshift").option("url", jdbcConnectionString).option("query", query).load.registerTempTable("redshift_data")`
and then be able to do:
sqlContext.sql("select * from avro_data a left join parquet_data p on a.key = b.key left join redshift_data r on r.key=a.key")

Related

How to do the mappings for joins with OR conditions using Hibernate

I am trying to do the mappings and to write a non-native HQL query by joining the 4 tables. The logic is written in a stored procedure, but we want to migrate it to Hibernate/JPA. I am unable to do the proper mappings and write a query that recreates the same logic:
FROM
    [dbo].[vwInstitutionUser] AS IU
    INNER JOIN [dbo].[vwCommonCourse] AS C
        ON C.CustomerID = CAST(IU.InstitutionID AS VARCHAR(10))
    INNER JOIN [dbo].[vwCommonCourses] AS CM
        ON C.CourseID = CM.CourseID
    INNER JOIN dbo.vwProductMaster AS PM
        ON CM.LearningActivityID = PM.ResourceID
        OR CM.AssessmentID = PM.AssessmentID
WHERE
    IU.UserID = @UserID
    AND PM.IsCert = CASE WHEN @SubReportID = 'MACS' THEN PM.IsCert ELSE @IsCert END
ORDER BY PM.IsCert, PM.ResourceName
Please let me know if anybody has the same use case. Thanks!
I'm not sure if I understand you correctly, but I would recommend the following:
Each inner join can be mapped as a relation within JPA (@OneToMany, @OneToOne, @ManyToOne, @ManyToMany).
Ordering can be implemented with the Comparable interface at the entity level.
For computed values I would suggest using an SQL computed value, or a service layer within your business logic that provides those values (@Transient field).
You can also write a NativeQuery for your model.
Hope that I answered at least some of your questions :)

How to Merge Maps in Pig

I am new to Pig, so bear with me. I have two data sources that have the same schema: a map of attributes. I know that records from the two sources will share a single identifiable overlapping attribute. For example:
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data transformations. I feel you will need two kinds of UDFs to achieve your task. The first UDF needs to accept a map and create a unique string representation of it, for example a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the attribute so that the datasets can be joined on it.

Spark DataFrame not executing group-by statements within a JDBC data source

I've registered a MySQL data source as follows:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> "videos"))
jdbcDF.registerTempTable("videos")
and then executed the following Spark SQL query:
select uploader, count(*) as items
from videos
where publisher_id = 154
group by uploader
order by items desc
This call actually executes the following query on the MySQL server:
SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154
and then loads the data to the Spark cluster and performs the group-by as a Spark operation.
This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?
Well, it depends. Over JDBC, Spark can push down only predicates, so it is not possible to execute an arbitrary query on the database side dynamically. Still, it is possible to use any valid subquery as the table argument, so you can do something like this:
val tableQuery =
"""(SELECT uploader, count(*) as items FROM videos GROUP BY uploader) tmp"""
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> tableQuery
))
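For the exact query in the question, the filter can be folded into the same subquery so that only the pre-aggregated rows cross the network, and the final ordering can stay in Spark. An untested sketch along the same lines (url and driver as defined above):
// Push the filter and the aggregation down to MySQL in one subquery.
val filteredQuery =
  """(SELECT uploader, COUNT(*) AS items
    |   FROM videos
    |  WHERE publisher_id = 154
    |  GROUP BY uploader) tmp""".stripMargin
val aggregatedDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> filteredQuery))
// Ordering the already-aggregated (small) result is cheap to do in Spark.
aggregatedDF.orderBy(aggregatedDF("items").desc).show()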
If that's not enough you can try to create a custom data source.

Will compiled queries be effective when parameters change for the same query every time?

I am new to Entity Framework. I am using a LINQ query that will fetch many records (up to millions) from the database. There are many filter parameters in the where condition, and the parameters may change on each request. So I wanted to know whether compiled queries will be effective in this case, or whether it will be a new query on each request. Here is my query:
List<FarmerDetailsReport> fdr =
(from fp in mstfp join pd in personalDetails on fp.personDetails.Id equals pd.Id
join ic in identityCertificate on fp.identityCertificate.Id equals ic.Id
join pid in pacsInsuranceData on fp.pacsInsuranceData.Id equals pid.Id into temp
from pid in temp.DefaultIfEmpty()
join bd in bankDetails on fp.bankDetails.Id equals bd.Id
join cd in contactDetails on fp.contactDetails.Id equals cd.Id
join id in incomeDetails on fp.incomeDetails.Id equals id.Id into tmp
from id in tmp.DefaultIfEmpty()
join ua in userAttributes on fp.UserId equals ua.EmailID
where ((ua.CompanyName == companyName ) && (cd.District == model.DistrictForProfileMIS ) && (cd.Block == model.BlockForProfileMIS) && (bd.bankName == model.BankForProfileMIS ) && Status == "Active")
select new FarmerDetailsReport { .......... }).ToList();
Short answer:
Yes...... well, maybe.
Long answer:
This is hard to answer as you have no control over the actual SQL that gets generated.
We had perf problems with some queries like this, as the optimizer would optimize for a certain set of filter cases (like short-circuits of clauses); then, when a new query was made with a massive change in parameters, it would take AGES.
What we did in the end:
Don't use a big LINQ query; create a stored proc or view where you have more control over the generated SQL.
Use things like OPTION(RECOMPILE) ... look this up, it was very useful.
Have a few overloads of the query for different parameters so that the DB can optimize them separately.
Obviously this is just what we did and it might not be perfect for you. I STRONGLY suggest getting the generated SQL for each different parametrized version and going over it with your DBA (if you have one) or your team, and Google if you don't.

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig script.
I have registered the jars below and am then executing the following code (I updated the schema):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is, though the value filtered for in F is available in the Hive table, the result always writes 0 records to the output. But it is able to load all the records into A.
Basically, the filter is not working. My Hive table is not partitioned. I believe that the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that for processing the actual values you need intermediate relation B.
