Read a partitioned table or a view in BigQuery with Apache Spark

I'm using the Dataproc BigQuery connector to read a partitioned table. It contains over 300 GB of data and is partitioned by date, but all I need is today's partition. I tried reading it through a BigQuery view that already filters to that partition, but that doesn't work. Is there a way to read a single partition of a BigQuery table with Apache Spark?
Update (now with code snippet):
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
@transient
val conf = sc.hadoopConfiguration

// Fully qualified ID of the view to read ("project:dataset.view")
val fullyQualifiedInputTableId = "XXXX"
val projectId = conf.get("fs.gs.project.id")
val bucket = conf.get("fs.gs.system.bucket")

conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
val outputTableId = projectId + ":sparkBigQuery.classifiedQueries"
val outputGcsPath = "gs://" + bucket + "/hadoop/tmp/bigquery/wordcountoutput"
BigQueryOutputConfiguration.configure(
  conf,
  outputTableId,
  null,
  outputGcsPath,
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_, _]].getName)
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")
// nb (a trained NaiveBayesModel) and tf (a HashingTF) are assumed to be defined earlier.
def convertToTuple(record: JsonObject): (String, String, Double) = {
  val user = record.get("user").getAsString
  val query = record.get("query").getAsString.toLowerCase
  val classifiedQuery = nb.predict(tf.transform(query.split(" ")))
  (user, query, classifiedQuery)
}
// Load data from BigQuery.
val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])
tableData.map(entry => convertToTuple(entry._2)).first()
val classifiedRDD = tableData.map(entry => convertToTuple(entry._2))
classifiedRDD.take(10).foreach(l => println(l))

Use the partition decorator ("$") documented here; the Hadoop connector does appear to support "$" in the table name string.
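For example, based on the snippet above, a minimal sketch that reads only one date partition (the project, dataset, table, and date below are placeholders):
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
val conf = sc.hadoopConfiguration
// "$YYYYMMDD" after the table name selects a single date partition,
// so only that day's data is exported and read by the connector.
val fullyQualifiedInputTableId = "myproject:mydataset.mytable$20170619"
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
The rest of the read (sc.newAPIHadoopRDD with GsonBigQueryInputFormat) stays the same as in the question.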

Related

Why does HSQL not support the COUNT function in a Spring Batch project?

I am new to Spring Batch and HSQL. I created a simple application and
want to select all the rows from a table, so I wrote a custom query
using EntityManager. But when I use COUNT(*) and pass the star symbol
inside the count function, it shows me an error like this:
<expression>, ALL, DISTINCT or identifier expected, got '*'
How do I use the * symbol inside the COUNT function in HSQL?
Here are my query details:
JobCompleteNotificationListner.kt
package com.nilmani.privatecricketleague.data
import com.nilmani.privatecricketleague.model.Team
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.batch.core.BatchStatus
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.listener.JobExecutionListenerSupport
import java.util.*
import javax.persistence.EntityManager
import javax.persistence.Transient
import kotlin.collections.HashMap
class JobCompleteNotificationListner(private val em: EntityManager) : JobExecutionListenerSupport() {

    private val log: Logger = LoggerFactory.getLogger(JobCompleteNotificationListner::class.java)

    @Transient
    override fun afterJob(jobExecution: JobExecution) {
        if (jobExecution.status == BatchStatus.COMPLETED) {
            log.info("Job finished, time to verify the result")
            val teamData: Map<String, Team> = HashMap()
            em.createQuery("SELECT m.team1 , COUNT(*) FROM match m GROUP BY m.team1", Objects::class.java)
        }
    }
}
I get the error at this line:
em.createQuery("SELECT m.team1 , COUNT(*) FROM match m GROUP BY m.team1",Objects::class.java)

Trying to read a CSV file in JMeter using a Beanshell Sampler

I am trying to read the cells in a CSV file. This is the code I have. Please let me know where I am going wrong. I don't see any error in the log viewer, but it isn't printing anything.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
public class ApacheCommonsCSV {
    public void readCSV() throws IOException {
        String CSV_File_Path = "C:\\source\\Test.csv";
        // read the file
        Reader reader = Files.newBufferedReader(Paths.get(CSV_File_Path));
        // parse the file into csv values
        CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
        for (CSVRecord csvRecord : csvParser) {
            // Accessing Values by Column Index
            String name = csvRecord.get(0);
            String product = csvRecord.get(1);
            // print the value to console
            log.info("Record No - " + csvRecord.getRecordNumber());
            log.info("---------------");
            log.info("Name : " + name);
            log.info("Product : " + product);
            log.info("---------------");
        }
    }
}
You're declaring the readCSV() function but not calling it anywhere; that's why your code doesn't even get executed.
You need to add a readCSV() function call. If you're lucky, it will start working; if not, you will see an error in the jmeter.log file.
For example, like this:
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
public class ApacheCommonsCSV {
    public void readCSV() throws IOException {
        String CSV_File_Path = "C:\\source\\Test.csv";
        // read the file
        Reader reader = Files.newBufferedReader(Paths.get(CSV_File_Path));
        // parse the file into csv values
        CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
        for (CSVRecord csvRecord : csvParser) {
            // Accessing Values by Column Index
            String name = csvRecord.get(0);
            String product = csvRecord.get(1);
            // print the value to console
            log.info("Record No - " + csvRecord.getRecordNumber());
            log.info("---------------");
            log.info("Name : " + name);
            log.info("Product : " + product);
            log.info("---------------");
        }
    }

    readCSV(); // here is the entry point
}
Just make sure to have commons-csv.jar in the JMeter classpath.
Last 2 cents:
Any reason for not using CSV Data Set Config?
Since JMeter 3.1 you should be using JSR223 Test Elements and Groovy language for scripting so consider migrating to Groovy right away. Check out Apache Groovy - Why and How You Should Use It for reasons, tips and tricks.

Convert JSON keys into columns in Spark

I have written code that reads the data and picks the second element from the tuple. The second element happens to be JSON.
Code to get the JSON:
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.conf.Configuration;
import com.amazon.traffic.emailautomation.cafe.purchasefilter.util.CodecAwareManifestFileSystem;
import com.amazon.traffic.emailautomation.cafe.purchasefilter.util.CodecAwareManifestInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import amazon.emr.utils.manifest.input.ManifestItemFileSystem;
import amazon.emr.utils.manifest.input.ManifestInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import scala.Tuple2;

val configuration = new Configuration(sc.hadoopConfiguration);
ManifestItemFileSystem.setImplementation(configuration);
ManifestInputFormat.setInputFormatImpl(configuration, classOf[TextInputFormat]);
val linesRdd1 = sc.newAPIHadoopFile("location", classOf[ManifestInputFormat[LongWritable, Text]],
  classOf[LongWritable], classOf[Text], configuration).map(tuple2 => tuple2._2.toString());
Below is an example:
{"data": {"marketplaceId":7,"customerId":123,"eventTime":1471206800000,"asin":"4567","type":"OWN","region":"NA"},"uploadedDate":1471338703958}
Now I want to create a data frame that has the JSON keys (marketplaceId, customerId, etc.) as columns, with the rows holding their values. I am not sure how to proceed with this. Can someone give me a pointer to help me achieve this?
You can create a Scala object for marshalling/unmarshalling JSON using this link:
https://coderwall.com/p/o--apg/easy-json-un-marshalling-in-scala-with-jackson
Then use that object to read the JSON data into case classes in Scala:
import org.apache.spark.{SparkConf, SparkContext}

object stackover {

  case class Data(
    marketplaceId: Double,
    customerId: Double,
    eventTime: Double,
    asin: String,
    `type`: String,
    region: String
  )

  case class R00tJsonObject(
    data: Data,
    uploadedDate: Double
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
    conf.setAppName("example")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.textFile("test1.json")
    // JsonUtil is the Jackson wrapper object from the link above.
    val parsed = data.map(row => JsonUtil.readValue[R00tJsonObject](row))
    parsed.map(rec => (rec.data, rec.uploadedDate, rec.data.customerId,
      rec.data.marketplaceId)).collect.foreach(println)
  }
}
Output:
(Data(7.0,123.0,1.4712068E12,4567,OWN,NA),1.471338703958E12,123.0,7.0)
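If the end goal is a DataFrame with the JSON keys as columns, a shorter alternative is to let Spark SQL infer the schema directly from the JSON strings. This is only a sketch: it assumes a SQLContext is available as sqlContext and reuses linesRdd1 from the question.
import org.apache.spark.sql.functions.col
// Infer the schema (the nested "data" struct plus uploadedDate) from the JSON lines.
val df = sqlContext.read.json(linesRdd1)
// Flatten the "data" struct so each key becomes a top-level column.
val flat = df.select(
  col("data.marketplaceId"), col("data.customerId"), col("data.eventTime"),
  col("data.asin"), col("data.type"), col("data.region"), col("uploadedDate"))
flat.show()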

Spring Data Solr - Multiple FilterQueries separated by OR

I'm trying to implement a filter search using Spring Data Solr. I have the following filter types, and each has a set of filters.
A
  aa in (1, 2, 3)
  ab between (2016-08-02 TO 2016-08-10)
B
  ba in (2, 3, 4)
  bb between (550 TO 1000)
The Solr query which I want to achieve using Spring data solr is:
q=*:*&fq=(type:A AND aa:(1,2,3) AND ab:[2016-08-02 TO 2016-08-10]) OR (type:B AND ba:(2,3,4) AND bb:[550 TO 1000])
I'm not sure how to group a number of clauses of a type of filter and then have an OR operator.
Thanks in advance.
The trick is to flag the second Criteria via setPartIsOr(true) with an OR-ish nature. This method returns void, so it cannot be chained.
First aCriteria and bCriteria are defined as described. Then bCriteria is flagged as OR-ish. Then both are added to a SimpleFilterQuery, which in turn can be added to the actual Query; that part is left out of the sample.
The DefaultQueryParser in the end is used only to generate a String that can be used in the assertion to check that the query is generated as desired.
import org.junit.jupiter.api.Test;
import org.springframework.data.solr.core.DefaultQueryParser;
import org.springframework.data.solr.core.query.Criteria;
import org.springframework.data.solr.core.query.FilterQuery;
import org.springframework.data.solr.core.query.SimpleFilterQuery;
import static org.junit.jupiter.api.Assertions.assertEquals;
public class CriteriaTest {

    @Test
    public void generateQuery() {
        Criteria aCriteria =
            new Criteria("type").is("A")
                .connect().and("aa").in(1, 2, 3)
                .connect().and("ab").between("2016-08-02", "2016-08-10");
        Criteria bCriteria =
            new Criteria("type").is("B")
                .connect().and("ba").in(2, 3, 4)
                .connect().and("bb").between("550", "1000");
        bCriteria.setPartIsOr(true); // that is the magic

        FilterQuery filterQuery = new SimpleFilterQuery();
        filterQuery.addCriteria(aCriteria);
        filterQuery.addCriteria(bCriteria);

        // verify the generated query string
        DefaultQueryParser dqp = new DefaultQueryParser(null);
        String actualQuery = dqp.getQueryString(filterQuery, null);
        String expectedQuery =
            "(type:A AND aa:(1 2 3) AND ab:[2016\\-08\\-02 TO 2016\\-08\\-10]) OR "
            + "((type:B AND ba:(2 3 4) AND bb:[550 TO 1000]))";
        System.out.println(actualQuery);
        assertEquals(expectedQuery, actualQuery);
    }
}

How to check data type inside pig UDF

I am new to Pig scripting.
I want to write a filter UDF that works irrespective of the data type of the columns.
input_data = load '/emp.csv' using PigStorage(',') as (empid:int, name:chararray);
output = FILTER input_data by FilterUDF(empid); -- data type is int
input_data1 = load '/dept.csv' using PigStorage(',') as (deptid:chararray, deptname:chararray);
output1 = FILTER input_data1 by FilterUDF(deptid); -- data type is chararray
Now, inside the Pig UDF, how do I identify the data type of the input parameter (i.e. the data type of input.get(0))?
import org.apache.pig.FilterFunc;
import java.io.IOException;
import org.apache.pig.data.Tuple;

public class FilterUDF extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        // How to check the data type inside the UDF?
    }
}
You might want to use the getType() method to find the data type of individual elements of the tuple. See this link
Something like
if (input.getType(0) == DataType.INTEGER) { // org.apache.pig.data.DataType
    // Do something here
}
Hope this helps.
