How to convert Java ResultSet to R DataFrame? - renjin

I want to convert ResultSet table data from Java into an R data frame: get the ResultSet in Java, convert it to an R-compatible format, and use it as a data frame in R.
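A minimal sketch of one possible approach with Renjin's JSR-223 script engine follows. It assumes the Renjin jars are on the classpath (the engine registers under the name "Renjin"), uses a hypothetical in-memory H2 table as the data source, and relies on Renjin converting String[] and double[] arguments into R character and numeric vectors: read each ResultSet column into a Java array, put the arrays into the engine, and assemble the data.frame on the R side.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ResultSetToDataFrame {
    public static void main(String[] args) throws Exception {
        // Hypothetical in-memory H2 table standing in for the real data source.
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE person(name VARCHAR(50), age INT)");
            st.execute("INSERT INTO person VALUES ('Alice', 30), ('Bob', 25)");
        }

        // Pull each column of the ResultSet into a plain Java array.
        List<String> names = new ArrayList<>();
        List<Double> ages = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, age FROM person")) {
            while (rs.next()) {
                names.add(rs.getString("name"));
                ages.add((double) rs.getInt("age"));
            }
        }

        // Renjin registers a JSR-223 engine named "Renjin" when its jars are on the classpath.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("Renjin");
        engine.put("name", names.toArray(new String[0]));
        engine.put("age", ages.stream().mapToDouble(Double::doubleValue).toArray());

        // Build the data.frame on the R side from the injected vectors.
        engine.eval("df <- data.frame(name = name, age = age, stringsAsFactors = FALSE)");
        engine.eval("print(df)");
    }
}

With real data the same pattern extends column by column: numeric columns become double[], and anything else can be passed as String[] and coerced inside R.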

Related

java.sql.SQLException: ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion

I have a dynamic query built and executed with TypedQuery<NewsContentBaseInfo>, and one of the fields is a CLOB object, news.stores. Here is the error I get, and I can't find information on how to solve it:
java.sql.SQLException: ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion
Here is the query:
SELECT DISTINCT new com.kaufland.newsletter.usecase.newscontent.search.dto.response.NewsContentBaseInfo(news.id, news.uuid, news.dayAndTimeOfPublish, news.title, news.subtitle, news.categoryCountry, news.newsPeriod, to_char(news.stores))
FROM com.kaufland.newsletter.domain.content.AbstractNewsContent news
LEFT OUTER JOIN news.newsLinks newsLinks
WHERE news.country = :country AND news.status = :status
AND news.dayAndTimeOfPublish >= :dayAndTimeOfPublishStart
AND news.dayAndTimeOfPublish <= :dayAndTimeOfPublishEnd
AND (news.stores LIKE '%'||:storeNumber0||'%')
AND news.categoryCountry.id in :includeCategoryIds
AND (LOWER(news.title) LIKE LOWER('%'||:searchText||'%')
OR LOWER(news.subtitle) LIKE LOWER('%'||:searchText||'%')
OR LOWER(news.text1) LIKE LOWER('%'||:searchTextEscaped||'%')
OR LOWER(news.text2) LIKE LOWER('%'||:searchTextEscaped||'%')
OR LOWER(news.text3) LIKE LOWER('%'||:searchTextEscaped||'%')
OR LOWER(newsLinks.displayText) LIKE LOWER('%'||:searchText||'%'))
ORDER BY news.dayAndTimeOfPublish DESC
The to_char function returns a varchar that is limited to 4000 characters. If the CLOB is larger than that, you can get this error (depending on the Oracle version).
If you really need a String value, you can try the dbms_lob package (https://docs.oracle.com/database/121/ARPLS/d_lob.htm#ARPLS600), which can handle more characters.
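If the goal is simply to get the full CLOB content into Java, another option is to select the CLOB column as-is and stream it on the client side, so the 4000-character limit of to_char never comes into play. This is a sketch of the plain-JDBC route, not the dbms_lob approach above; the connection URL, table and column names are hypothetical.

import java.io.BufferedReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClobReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle connection and query; adjust to your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT id, stores FROM news_content WHERE id = ?")) {
            ps.setLong(1, 42L);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Stream the CLOB instead of converting it with to_char in SQL,
                    // so the VARCHAR2 length limit does not apply.
                    StringBuilder sb = new StringBuilder();
                    try (Reader reader = rs.getCharacterStream("stores");
                         BufferedReader br = new BufferedReader(reader)) {
                        char[] buf = new char[8192];
                        int n;
                        while ((n = br.read(buf)) != -1) {
                            sb.append(buf, 0, n);
                        }
                    }
                    String stores = sb.toString();
                    System.out.println(rs.getLong("id") + ": " + stores.length() + " chars");
                }
            }
        }
    }
}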

Presto for loop

I am new to Presto and I would like to know if there is any way to have a for loop. I have a query that aggregates some data date by date, and when I run it, it throws an error: exceeded max memory size of 30GB.
I am open to other suggestions if looping is not an option.
The query I am using:
select dt as DATE_KPI, brand, count(distinct concat(cast(post_visid_high as varchar),
cast(post_visid_low as varchar))) as kpi_value
from hive.adobe.tbl
where dt >= date '2017-05-15' and dt <= date '2017-06-13'
group by 1,2
Assuming you are using Hive, you can write the source data to a table bucketed on brand, and then process groups of buckets with WHERE "$bucket" % 32 = <N>.
Otherwise, you can fragment the query into n queries and then process 1/n of the brands in each query. You can use WHERE abs(from_big_endian_64(xxhash64(to_utf8(brand)))) % 32 = <N> to bucketize the brands.
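A sketch of the second suggestion, assuming the Presto JDBC driver is on the classpath and using a hypothetical coordinator URL and user: loop over the bucket numbers on the client and run one aggregate query per bucket, so each query only touches roughly 1/32 of the brands.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BucketedKpiQuery {
    public static void main(String[] args) throws Exception {
        int buckets = 32; // number of fragments; raise it if one fragment still exceeds memory

        // Hypothetical coordinator URL and user; adjust to your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:presto://presto-coordinator:8080/hive/adobe", "etl_user", null);
             Statement st = conn.createStatement()) {

            for (int n = 0; n < buckets; n++) {
                // Each pass aggregates only the brands that hash into bucket n.
                String sql =
                    "SELECT dt AS date_kpi, brand, " +
                    "  count(DISTINCT concat(cast(post_visid_high AS varchar), " +
                    "                        cast(post_visid_low AS varchar))) AS kpi_value " +
                    "FROM hive.adobe.tbl " +
                    "WHERE dt >= date '2017-05-15' AND dt <= date '2017-06-13' " +
                    "  AND abs(from_big_endian_64(xxhash64(to_utf8(brand)))) % " + buckets + " = " + n + " " +
                    "GROUP BY 1, 2";
                try (ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next()) {
                        System.out.printf("%s,%s,%d%n",
                            rs.getDate("date_kpi"), rs.getString("brand"), rs.getLong("kpi_value"));
                    }
                }
            }
        }
    }
}

The count(DISTINCT ...) per (dt, brand) group stays correct because every row of a given brand lands in exactly one bucket.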

hive pyspark date comparison

I am trying to translate a HiveQL query into PySpark. I am filtering on dates and getting different results, and I would like to know how to make the behaviour in PySpark match that of Hive. The Hive query is:
SELECT COUNT(zip_cd) FROM table WHERE dt >= '2012-01-01';
In pySpark I am entering into the interpreter:
import pyspark.sql.functions as psf
import datetime as dt
from pyspark.sql import HiveContext
hc = HiveContext(sc)
table_df = hc.table('table')
DateFrom = dt.datetime.strptime('2012-01-01', '%Y-%m-%d')
table_df.filter(psf.trim(table_df.dt) >= DateFrom).count()
I am getting similar, but not the same, results in the two counts. Does anyone know what is going on here?
Your code first creates a datetime object from the date 2012-01-01. During filtering, the object is replaced with its string representation (2012-01-01 00:00:00) and the dates are compared lexicographically, which filters out 2012-01-01:
>>> '2012-01-01' >= '2012-01-01 00:00:00'
False
>>> '2012-01-02' >= '2012-01-01 00:00:00'
True
To achieve the same result as the SQL, just drop the strptime call and compare the dates as strings, e.g. psf.trim(table_df.dt) >= '2012-01-01'.

Can I index a column in parquet file to make it join faster using Spark

I have two DataFrames, each saved in a parquet file. I need to join these two DFs on the unique, incremental "id" column.
Can I create an index on the id column so they join faster? Here is the code:
// First DF which contains a few thousand items
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")
// Second DF which contains 10 million items
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet) // DataFrame of (id, vector)
dfExamples.join(dfDocVectors, dfExamples("id") === dfDocVectors("id")).select(dfDocVectors("id"),
dfDocVectors("vector"), dfExamples("cat"))
I need to perform this kind of join many times. To speed up the join, can I create an index on
the "id" column in the parquet file, like I could for a database table?
Spark joins use an object called a partitioner. If a DataFrame has no partitioner, executing a join will involve these steps:
1. Create a new hash partitioner for the bigger side.
2. Shuffle both dataframes against this partitioner.
3. Now that the same keys are on the same nodes, the local join operations can finish the execution.
You can optimize your join by addressing steps 1 and 2. I'd suggest that you repartition your bigger dataset by the join key (id):
// First DF which contains a few thousand items
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")
// Second DF which contains 10 million items
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet)
.repartition($"id")
// DataFrame of (id, vector)
Now, joining any smaller dataframe with dfDocVectors is going to be much faster -- the expensive shuffle step for the big dataframe has already been done.

Parsing date format to join in hive

I have a date field which is of type String and in the format:
03/11/2001
And I want to join it with another column, which is in a different String format:
1855-05-25 12:00:00.0
How can I join both columns efficiently in Hive, ignoring the time part of the second column?
My query looks like this:
LEFT JOIN tabel1 t1 ON table2.Date=t1.Date
Since the two date values are in different formats, you need to apply date functions to both and convert them to a comparable date type in your join condition. It would be something like this:
LEFT JOIN tabel1 t1 ON unix_timestamp(table2.Date, 'yyyy-MM-dd HH:mm:ss.S') = unix_timestamp(t1.Date, 'MM/dd/yyyy')
You can refer to this and this for Hive's built-in date functions.
Convert the dates into the same format:
to_date(table2.date) = to_date(t1.date)
