I have written code that reads the data and picks the second element from the tuple. The second element happens to be a JSON string.
Code to get the JSON:
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.conf.Configuration;
import com.amazon.traffic.emailautomation.cafe.purchasefilter.util.CodecAwareManifestFileSystem;
import com.amazon.traffic.emailautomation.cafe.purchasefilter.util.CodecAwareManifestInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import amazon.emr.utils.manifest.input.ManifestItemFileSystem;
import amazon.emr.utils.manifest.input.ManifestInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import scala.Tuple2;
val configuration = new Configuration(sc.hadoopConfiguration);
ManifestItemFileSystem.setImplementation(configuration);
ManifestInputFormat.setInputFormatImpl(configuration, classOf[TextInputFormat]);
val linesRdd1 = sc.newAPIHadoopFile("location", classOf[ManifestInputFormat[LongWritable,Text]], classOf[LongWritable], classOf[Text], configuration).map(tuple2 => tuple2._2.toString());
Below is an example:
{"data": {"marketplaceId":7,"customerId":123,"eventTime":1471206800000,"asin":"4567","type":"OWN","region":"NA"},"uploadedDate":1471338703958}
Now I want to create a DataFrame that has the JSON keys (marketplaceId, customerId, etc.) as columns, with the rows holding their values. I am not sure how to proceed with this. Can someone give me a pointer that would help me achieve this?
You can create a Scala object for marshalling/unmarshalling JSON using this link:
https://coderwall.com/p/o--apg/easy-json-un-marshalling-in-scala-with-jackson
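A minimal sketch of such a helper, assuming Jackson with its Scala module (the readValue name and signature are chosen to match the call in the code below; the object in the linked post is similar but not identical):
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

object JsonUtil {
  // Jackson ObjectMapper configured for Scala case classes
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

  // Deserialize a JSON string into the requested type
  def readValue[T](json: String)(implicit m: Manifest[T]): T = mapper.readValue[T](json)
}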
Then use that object to read the JSON data into a case class in Scala:
import org.apache.spark.{SparkConf, SparkContext}
object stackover {
case class Data(
marketplaceId: Double,
customerId: Double,
eventTime: Double,
asin: String,
`type`: String,
region: String
)
case class R00tJsonObject(
data: Data,
uploadedDate: Double
)
def main(args: Array[String]): Unit = {
val conf = new SparkConf(true)
conf.setAppName("example");
conf.setMaster("local[*]")
val sc = new SparkContext(conf)
val data = sc.textFile("test1.json")
val parsed = data.map(row => JsonUtil.readValue[R00tJsonObject](row))
parsed.map(rec => (rec.data, rec.uploadedDate, rec.data.customerId,
rec.data.marketplaceId)).collect.foreach(println)
}
}
Output:
(Data(7.0,123.0,1.4712068E12,4567,OWN,NA),1.471338703958E12,123.0,7.0)
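If the goal is an actual DataFrame with marketplaceId, customerId, etc. as columns, a minimal follow-on sketch (assuming a Spark 1.x-style SQLContext built on the same SparkContext; this part is not shown in the answer above) would be:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The schema is inferred from the Data case class, so the columns become
// marketplaceId, customerId, eventTime, asin, type and region.
val df = parsed.map(_.data).toDF()
df.show()

// Alternatively, Spark can infer a schema straight from the raw JSON strings:
// val df2 = sqlContext.read.json(data)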
I am new to Spring Batch and HSQL, and I have simply created an application. I want to select all the rows from the table, so I wrote a custom query using EntityManager, but when I use COUNT(*) and pass the star symbol inside the count function it shows me an error like this:
<expression>, ALL, DISTINCT or identifier expected, got '*'
How do I use the * symbol inside the COUNT function in HSQL?
Here are my query details:
JobCompleteNotificationListner.kt
package com.nilmani.privatecricketleague.data
import com.nilmani.privatecricketleague.model.Team
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.batch.core.BatchStatus
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.listener.JobExecutionListenerSupport
import java.util.*
import javax.persistence.EntityManager
import javax.persistence.Transient
import kotlin.collections.HashMap
class JobCompleteNotificationListner : JobExecutionListenerSupport {
val log:Logger = LoggerFactory.getLogger(JobCompleteNotificationListner::class.java)
val em:EntityManager
constructor(em:EntityManager){this.em= em}
@Transient
override fun afterJob(jobExcution:JobExecution){
if (jobExcution.status == BatchStatus.COMPLETED){
log.info("Job finished Time to verify the result")
val teamData:Map<String,Team> = HashMap()
em.createQuery("SELECT m.team1 , COUNT(*) FROM match m GROUP BY m.team1",Objects::class.java)
}
}
}
I get the error at the line below:
em.createQuery("SELECT m.team1 , COUNT(*) FROM match m GROUP BY m.team1",Objects::class.java)
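For what it's worth, JPQL itself does not define COUNT(*), which is why the parser complains about the star. Counting the identification variable is the portable form, e.g. SELECT m.team1, COUNT(m) FROM Match m GROUP BY m.team1 (assuming the entity mapped to the match table is called Match). Alternatively, em.createNativeQuery lets you keep the COUNT(*) syntax as plain SQL.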
I am new to Kotlin and JPA. I have an inner join query that gets data from two tables (Postgres). The query works fine. However, since I now have two types (the two tables), using either one only returns the fields from one of the tables. In order to return all fields, I changed the return type to List. However, when I do that, the object that is returned has no field names, only the raw data. How can I change my code so my JSON response contains both the names of the fields and the data?
Sorry if my question isn't clear; I'm very new to Kotlin.
UPDATED CODE
my repository code
package com.sg.xxx.XXXTTT.report.repository
import com.sg.xxx.XXXTTT.report.model.Report
import com.sg.xxx.XXXTTT.report.model.ReportWithBatches
import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.data.jpa.repository.Query
import org.springframework.stereotype.Repository
import java.time.LocalDate
@Repository
interface IReportRepository : JpaRepository<Report, Long> {
fun findAllByCreationDate(date: LocalDate): List<Report>
fun findByReportName(name: String): Report?
fun findByAdlsFullPath(name: String): Report?
@Query("SELECT new com.sg.xxx.xxxttt.report.model.ReportWithBatches(r.adlsFullPath, r.sentToXXX, r.contentLength, r.creationDate, r.remoteFileNameOnFTA, b.dataPath, b.version, b.source, r.numberOfRecords) FROM Report r INNER JOIN BatchInfo b ON r.reportUuid = b.reportUuid WHERE r.creationDate = ?1")
fun findAllByCreationDateJoinBatches(date: LocalDate): List<ReportWithBatches>
}
my controller code
@GetMapping(value = ["/linkBatches/{today}"])
fun findAllByCreationDateJoinBatches(@PathVariable("today") @DateTimeFormat(pattern = "yyyyMMdd") date: LocalDate): List<ReportWithBatches> {
return eligibleService.findAllByCreationDateJoinBatches(date)
}
my DTO
package com.sg.xxx.xxxttt.report.model
import java.time.LocalDate
open class ReportWithBatches(
var adlsFullPath: String?,
var sentToXXX: Boolean?,
var contentLength: Long?,
var creationDate: LocalDate,
var remoteFileNameOnFTA: String?,
var dataPath: String?,
var version: Int?,
var source: String?,
var numberOfRecords: Long?
)
my function in the service
fun findAllByCreationDateJoinBatches(date: LocalDate): List<ReportWithBatches> {
return reportRepository.findAllByCreationDateJoinBatches(date)
}
}
As was correctly stated in the comments, the return type of your query is List<Array<Any?>>, not List<Any>.
Create a data class that would serve as your DTO and map results to it:
data class ReportWithBatchInfo(val azureFileName : String, /* more field here */)
fun findAllByCreationDateJoinBatches(date: LocalDate): List<ReportWithBatchInfo> {
return reportRepository.findAllByCreationDateJoinBatches(date).map {
ReportWithBatchInfo(it[0] as String, /* more mappings here */)
}
}
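For what it's worth, with the updated repository above the JPQL constructor expression (SELECT new com.sg.xxx.xxxttt.report.model.ReportWithBatches(...)) already tells JPA to instantiate the DTO for each row, so Spring Data returns List<ReportWithBatches> directly and Jackson serializes the field names; the manual it[0] as String mapping is only needed when a query returns raw columns, i.e. List<Array<Any?>>.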
I am trying to read the cells in an xls file. This is the code I have. Please let me know where I am going wrong. I don't see any error in the log viewer, but it isn't printing anything.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
public class ApacheCommonsCSV {
public void readCSV() throws IOException {
String CSV_File_Path = "C:\\source\\Test.csv";
// read the file
Reader reader = Files.newBufferedReader(Paths.get(CSV_File_Path));
// parse the file into csv values
CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
for (CSVRecord csvRecord : csvParser) {
// Accessing Values by Column Index
String name = csvRecord.get(0);
String product = csvRecord.get(1);
// print the value to console
log.info("Record No - " + csvRecord.getRecordNumber());
log.info("---------------");
log.info("Name : " + name);
log.info("Product : " + product);
log.info("---------------");
}
}
}
You're declaring the readCSV() function but not calling it anywhere; that's why your code doesn't even get executed.
You need to actually call readCSV(), and if you're lucky enough it will start working; if not, you will see an error in the jmeter.log file.
For example, like this:
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
public class ApacheCommonsCSV {
public void readCSV() throws IOException {
String CSV_File_Path = "C:\\source\\Test.csv";
// read the file
Reader reader = Files.newBufferedReader(Paths.get(CSV_File_Path));
// parse the file into csv values
CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
for (CSVRecord csvRecord : csvParser) {
// Accessing Values by Column Index
String name = csvRecord.get(0);
String product = csvRecord.get(1);
// print the value to console
log.info("Record No - " + csvRecord.getRecordNumber());
log.info("---------------");
log.info("Name : " + name);
log.info("Product : " + product);
log.info("---------------");
}
}
}
new ApacheCommonsCSV().readCSV(); // here is the entry point: the call has to sit outside the class body
Just make sure to have commons-csv.jar in JMeter Classpath
Last 2 cents:
Any reason for not using CSV Data Set Config?
Since JMeter 3.1 you should be using JSR223 Test Elements and Groovy language for scripting so consider migrating to Groovy right away. Check out Apache Groovy - Why and How You Should Use It for reasons, tips and tricks.
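On the first point: CSV Data Set Config reads the file one line per iteration and exposes each column under the variable names you configure, so if the goal is just to feed test data from a CSV file, no scripting is needed at all.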
I'm using the Dataproc BigQuery connector to read a partitioned table. It contains over 300 GB of data and is partitioned by date, but all I need to read with the Spark connector is today's data. I tried reading it with an already-partitioned view from BigQuery, but that doesn't work. Is there a way to read a single partition of a BigQuery table with Apache Spark?
Update (now with code snippet):
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
@transient
val conf = sc.hadoopConfiguration
//path to the view
val fullyQualifiedInputTableId = "XXXX"
val projectId = conf.get("fs.gs.project.id")
val bucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf,
fullyQualifiedInputTableId)
val outputTableId = projectId + ":sparkBigQuery.classifiedQueries"
val outputGcsPath = ("gs://" +bucket+"/hadoop/tmp/bigquery/wordcountoutput")
BigQueryOutputConfiguration.configure(conf,outputTableId,null,outputGcsPath,BigQueryFileFormat.NEWLINE_DELIMITED_JSON,classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",classOf[IndirectBigQueryOutputFormat[_,_]].getName)
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,"WRITE_TRUNCATE")
def convertToTuple(record: JsonObject) : (String, String,Double) = {
val user = record.get("user").getAsString
val query = record.get("query").getAsString.toLowerCase
val classifiedQuery= nb.predict(tf.transform(query.split(" ")))
return (user, query,classifiedQuery)
}
// Load data from BigQuery.
val tableData = sc.newAPIHadoopRDD(
conf,
classOf[GsonBigQueryInputFormat],
classOf[LongWritable],
classOf[JsonObject])
tableData.map(entry=>convertToReadbale(entry._2)).first()
val classifiedRDD = tableData.map(entry => convertToTuple(entry._2))
classifiedRDD.take(10).foreach(l => println(l))
Use the partition decorator ("$") documented here; it looks like the Hadoop connector does support the "$" in the table name string.
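For illustration, a minimal sketch of how that fits the code above, using a placeholder table id (assumes a day-partitioned table; the $YYYYMMDD suffix selects a single partition, so only that day's data is read):
// The "$YYYYMMDD" decorator after the table name restricts the input to one partition.
val fullyQualifiedInputTableId = "my-project:my_dataset.my_table$20160814"
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)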
I want to do form validation in Play (Scala). I have done this several times, but this time it shows an error. The error says:
Overloaded method value [apply] cannot be applied to
(play.api.data.Mapping[models.PIdetail])
Model:
package models
import java.util.Date
import play.api.libs.json._
import play.api.libs.functional.syntax._
import anorm._
import anorm.SqlParser._
import play.api.db.DB
import play.api.Play.current
import models._
case class Purchase_Invoice(supplier_id: String, paid_to_num: String, staff_id: String, paid_to_name: String, staff_name: String, paid_to_addr: String, PI_date: Date, PI_due_date: Date, payment: String, purchase_invoice_items: List[PIdetail], other: String, additional_note: String, terms_and_cond: String)
case class PIdetail(RI_id: Int, PO_id: String, product_id: String, description: String, qty: Int, total: String)
case class RIheader_PI(id_counter: Long, date_RI: Date, staff_id: String, status: Int)
Controller:
package controllers
import play.api._
import play.api.Logger
import play.api.mvc._
import play.api.data._
import play.api.data.Forms._
import play.api.data.format.Formats._
import play.api.mvc.Flash
import play.api.libs.json.Json
import play.api.libs.json._
import models._
object PurchaseInvoices extends Controller {
val submitPIForm = Form(
mapping(
"supplier_id" -> text,
"paid_to_num" -> text,
"staff_id" -> text,
"paid_to_name" -> text,
"staff_name" -> text,
"paid_to_addr" -> text,
"PI_date" -> date,
"PI_due_date" -> date,
"payment" -> text,
"purchase_invoice_items" -> list(
mapping(
"RI_id" -> number,
"PO_id" -> text,
"product_id" -> text,
"description" -> text,
"qty" -> number,
"total" -> text
)(PIdetail.apply)(PIdetail.unapply)
),
"other" -> text,
"additional_note" -> text,
"terms_and_cond" -> text
)(Purchase_Invoice.apply)(Purchase_Invoice.unapply)
)
...................... Some codes
...................... Some codes
}
Really need your help, guys.. thanks in advance.. ^^
Found out the error by myself.. ^^
It's because I have def list = TODO in my Controller, which shadows the list mapping combinator imported from play.api.data.Forms._.
So make sure you don't define a function or variable with the same name as a Forms combinator..
Sorry to bother you guys... thx.. ^^
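If renaming the controller's list action is not an option, fully qualifying the combinator avoids the clash as well; a sketch of just the affected mapping entry (everything else unchanged):
"purchase_invoice_items" -> play.api.data.Forms.list(
  mapping(
    "RI_id" -> number,
    "PO_id" -> text,
    "product_id" -> text,
    "description" -> text,
    "qty" -> number,
    "total" -> text
  )(PIdetail.apply)(PIdetail.unapply)
),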