I've created a process that reads a CSV file, transforms it into a DataFrame, and loads the data into a database. The process is triggered through NiFi. The source is a daily file whose contents change every day; it is generally uploaded between 7:45 and 8:00 am, and NiFi is triggered at 8:10 am. The problem is that if something is wrong with the original CSV file when NiFi runs, nothing is loaded and I lose that day's data. I need suggestions on how to modify the code so that the load is guaranteed, for example with a loop that retries and falls back to a backup file. The code of the process is below:
import com.tchile.bigdata.hdfs.Hdfs
import com.tchile.bigdata.utilities.Utilities
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import java.sql.DriverManager
import java.util.concurrent.TimeUnit
class Process {
def process(shell_variable: String): Unit = {
val startTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis())
println("[INFO] Inicio de proceso Logistica Carga Input Simple Data")
// Production (Spark configuration)
val sparkConf = new SparkConf()
.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
.set("spark.sql.parquet.output.committer.class", "org.apache.parquet.hadoop.ParquetOutputCommitter")
.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
val spark = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
val hdfs = new Hdfs
val utils = new Utilities
println("[INFO] Obteniendo variables desde shell/NIFI")
// Values passed in from NiFi
val daysAgo = shell_variable.split(":")(0).toInt // Number of days to reprocess: start
val daysAhead = shell_variable.split(":")(1).toInt // Number of days to reprocess: limit
val repartition = shell_variable.split(":")(2).toInt
// Auxiliary variables for reprocessing in HDFS
var pathToProcess = ""
var pathToDelete = ""
var deleteStatus = false
println("[INFO] Obteniendo variables desde Parametros.conf")
// Values read from Parametros.conf > exadata
val driver_jdbc = sc.getConf.get("spark.exadata.driver_jdbc")
val url_jdbc = sc.getConf.get("spark.exadata.url_jdbc")
val user_jdbc = sc.getConf.get("spark.exadata.user_jdbc")
val pass_jdbc = sc.getConf.get("spark.exadata.pass_jdbc")
val table_name = sc.getConf.get("spark.exadata.table_name")
val table_owner = sc.getConf.get("spark.exadata.table_owner")
val table = sc.getConf.get("spark.exadata.table")
// Values read from Parametros.conf > path
//val pathCsv = sc.getConf.get("spark.path.Csv")
// Compute the start date (days back)
val startDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAgo)
// Compute the limit date
val endDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAhead)
// Split the start date into year, month and day
val startDateYear = startDate.substring(0, 4)
val startDateMonth = startDate.substring(5, 7)
val startDateDay = startDate.substring(8, 10)
// Split the limit date into year, month and day
val endDateYear = endDate.substring(0, 4)
val endDateMonth = endDate.substring(5, 7)
val endDateDay = endDate.substring(8, 10)
// Information for the log
println("[INFO] Reproceso de: " + daysAgo + " días")
println("[INFO] Fecha inicio: " + startDate)
println("[INFO] Fecha límite: " + endDate)
try {
// ================= START OF PROCESS LOGIC =================
// Build a DataFrame from the daily input file at the given path
val df_csv = spark.read.format("csv").option("header","true").option("sep",";").option("mode","dropmalformed").load("/applications/recup_remozo_equipos/equipos_por_recuperar/output/agendamientos_sin_pet_2")
val df_final = df_csv.select($"RutSinDV".as("RUT_SIN_DV"),
to_date(col("Dia_Agendado"), "yyyy-MM-dd").as("DIA_AGENDADO"),
// ================== END OF PROCESS LOGIC ==================
// Cleanup in EXADATA
println("[INFO] Se inicia la limpieza por reproceso en EXADATA")
val query_particiones = "(SELECT * FROM (WITH DATA AS (select table_name,partition_name,to_date(trim('''' " +
"from regexp_substr(extractvalue(dbms_xmlgen.getxmltype('select high_value from all_tab_partitions " +
"where table_name='''|| table_name|| ''' and table_owner = '''|| table_owner|| ''' and partition_name = '''" +
"|| partition_name|| ''''),'//text()'),'''.*?''')),'syyyy-mm-dd hh24:mi:ss') high_value_in_date_format " +
"FROM all_tab_partitions WHERE table_name = '" + table_name + "' AND table_owner = '" + table_owner + "')" +
"SELECT partition_name FROM DATA WHERE high_value_in_date_format > DATE '" + startDateYear + "-" + startDateMonth + "-" + startDateDay + "' " +
"AND high_value_in_date_format <= DATE '" + endDateYear + "-" + endDateMonth + "-" + endDateDay + "') A)"
val db = DriverManager.getConnection(url_jdbc, user_jdbc, pass_jdbc)
val st = db.createStatement()
try {
val consultaParticiones = spark.read.format("jdbc")
.option("url", url_jdbc)
.option("driver", driver_jdbc)
.option("dbTable", query_particiones)
.option("user", user_jdbc)
.option("password", pass_jdbc)
.load()
// collect() brings the partition names to the driver, where the JDBC statement lives
for (partition <- consultaParticiones.collect()) {
st.executeUpdate("call " + table_owner + ".DO_THE_TRUNCATE_PARTITION('" + table + "','" + partition.getString(0) + "')")
}
} catch {
case e: Exception =>
println("[ERROR TRUNCATE] " + e)
}
println("[INFO] Se inicia la inserción en EXADATA")
df_final.filter($"DIA_AGENDADO" >= "2022-08-01")
.write
.mode(SaveMode.Append) // assumed save mode: append after the partitions were truncated
.jdbc(url_jdbc, table, utils.jdbcProperties(driver_jdbc, user_jdbc, pass_jdbc))
println("[INFO] Inserción en EXADATA completada con éxito")
println("[INFO] Proceso Logistica Carga Input SimpleData")
val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
println("[INFO] TIEMPO TOTAL EJECUCIÓN: " + utils.secondsToMinutes(endTotal))
} catch {
case e: Exception =>
println("[EXCEPTION] " + e)
val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
println("[INFO] TIEMPO TOTAL EJECUCIÓN (CON ERROR): " + utils.secondsToMinutes(endTotal))
throw e
}
}
}
I appreciate suggestions on how I should modify or add to the code, to ensure the daily data upload.
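One possible shape for the retry-and-backup loop described above is sketched here in Java, which can be called from the Scala job. It is only an illustration under several assumptions: the class name InputResolver, the method names, and the idea that a "usable" file is one that exists and is not empty are all hypothetical, and it checks the local filesystem through java.nio rather than HDFS, so the existence check would need to be swapped for the equivalent HDFS call in the real process.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper: decides which input file the daily load should read.
class InputResolver {

    // Assumption: a file is usable when it exists as a regular file and is not empty.
    static boolean isUsable(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            return false;
        }
    }

    // Polls for the primary file up to maxAttempts times, waiting waitMillis
    // between attempts. If the primary never becomes usable, falls back to the
    // backup file. Returns an empty Optional when neither file is usable.
    static Optional<Path> resolveInput(Path primary, Path backup, int maxAttempts, long waitMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isUsable(primary)) {
                return Optional.of(primary);
            }
            if (attempt < maxAttempts) {
                Thread.sleep(waitMillis);
            }
        }
        return isUsable(backup) ? Optional.of(backup) : Optional.empty();
    }
}
```

The process would call resolveInput before spark.read and pass the returned path to load; an empty result is the case where the job should fail loudly so that NiFi can alert or reschedule instead of silently loading nothing.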


Failed to get @Query Result

Hello, I'm trying to read tables related through ManyToOne. I get the expected result when I execute the query in Navicat,
but when I try to display the data in the front end with Angular, it fails: I only get the main table.
This is the query:
//like this
@Query(value = "SELECT\n" +
"\tnotification.idnotif,\n" +
"\tnotification.message,\n" +
"\tnotification.\"state\",\n" +
"\tnotification.title,\n" +
"\tnotification.\"customData\",\n" +
"\tnotification.\"date\",\n" +
"\tnotification.receiver,\n" +
"\tnotification.sender,\n" +
"\tnotification.\"type\",\n" +
"\thospital.\"name\",\n" +
"\thospital.\"siretNumber\",\n" +
"\tusers.firstname,\n" +
"\tusers.\"isActive\" \n" +
"FROM\n" +
"\tnotification\n" +
"\tINNER JOIN hospital ON notification.receiver = :reciver\n" +
"\tINNER JOIN users ON notification.sender = :sender",nativeQuery = true)
List<Notification> findNotificationCustomQuery(@Param("reciver") Long reciver, @Param("sender") Long sender);
Please, what can I do to resolve this problem?
You are doing an inner join in the native query, so each row carries columns from several tables. Change the return type from Notification to Object[], as below.
@Query(value = "SELECT\n" +
"\tnotification.idnotif,\n" +
"\tnotification.message,\n" +
"\tnotification.\"state\",\n" +
"\tnotification.title,\n" +
"\tnotification.\"customData\",\n" +
"\tnotification.\"date\",\n" +
"\tnotification.receiver,\n" +
"\tnotification.sender,\n" +
"\tnotification.\"type\",\n" +
"\thospital.\"name\",\n" +
"\thospital.\"siretNumber\",\n" +
"\tusers.firstname,\n" +
"\tusers.\"isActive\" \n" +
"FROM\n" +
"\tnotification\n" +
"\tINNER JOIN hospital ON notification.receiver = :reciver\n" +
"\tINNER JOIN users ON notification.sender = :sender", nativeQuery = true)
List<Object[]> findNotificationCustomQuery(@Param("reciver") Long reciver, @Param("sender") Long sender);
Then loop over the result as below and read the attributes by position.
for (Object[] obj : result) {
    // obj[0] is notification.idnotif, the first column of the SELECT list
    String id = String.valueOf(obj[0]);
    // Get the other columns the same way
}
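To make the positional access less error-prone, the rows can be converted into a small typed object before they are sent to the front end. The following is a minimal sketch, assuming the column order of the SELECT list above (0 = idnotif, 1 = message, 9 = hospital name, 11 = users firstname). The NotificationRows and NotificationView names are hypothetical and not part of the original entities, and the numeric id is read through Number because the concrete type returned by a native query depends on the database driver.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mapper from native-query rows to a flat view object.
class NotificationRows {

    // Hypothetical DTO holding a few of the joined columns.
    static final class NotificationView {
        final long idnotif;
        final String message;
        final String hospitalName;
        final String senderFirstname;

        NotificationView(long idnotif, String message, String hospitalName, String senderFirstname) {
            this.idnotif = idnotif;
            this.message = message;
            this.hospitalName = hospitalName;
            this.senderFirstname = senderFirstname;
        }
    }

    // Converts each Object[] row into a NotificationView by column position.
    static List<NotificationView> toViews(List<Object[]> rows) {
        List<NotificationView> views = new ArrayList<>();
        for (Object[] row : rows) {
            views.add(new NotificationView(
                    ((Number) row[0]).longValue(), // notification.idnotif
                    (String) row[1],               // notification.message
                    (String) row[9],               // hospital."name"
                    (String) row[11]));            // users.firstname
        }
        return views;
    }
}
```

The controller would then return the list produced by toViews(repository.findNotificationCustomQuery(reciver, sender)), so the Angular side receives named fields rather than bare arrays.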

MongoDB Native Query vs C# LINQ Performance

I am comparing the following two options, and the Mongo C# driver LINQ query seems to take more time. I'm using Stopwatch to measure the timings.
Case 1: Native Mongo QueryDocument (takes 0.0011 ms to return data)
string querytext = @"{schemas:{$elemMatch:{name: " + n + ",code : " + c + "} }},{schemas:{$elemMatch:{code :" + c1 + "}}}";
string printQueryname = "Query: " + querytext;
BsonDocument query1 = MongoDB.Bson.Serialization.BsonSerializer.Deserialize<BsonDocument>(querytext);
QueryDocument queryDoc1 = new QueryDocument(query1);
var queryResponse = collection.FindAs<BsonDocument>(queryDoc1);
Case 2: Mongo C# Driver (takes more than 3.2 ms to return data)
Schema _result = new Schema();
_result = (from doc in _coll.AsQueryable<Schema>()
where doc.schemas.Any(s => s.code.Equals(c) && s.name.Equals(n) ) &&
doc.schemas.Any(s => s.code.Equals(c1))
select doc).FirstOrDefault();
Any thoughts ? Anything wrong here ?

LINQ: combine results from two tables into one select new statement?

How do I change the following query so that I don't have two sets of fields in the select new? I want the information to go into one set of columns, plus a type field that says whether the row is a trainee event or a CPD event.
List<EmployeeCPDReportRecord> employeeCPDRecords = new List<EmployeeCPDReportRecord>();
string employeeName;
var q = from cpd in pamsEntities.EmployeeCPDs
        from traineeEvent in pamsEntities.TrainingEventTrainees
        join Employee e in pamsEntities.Employees on cpd.EmployeeID equals e.emp_no
        join TrainingEventPart tEventPart in pamsEntities.TrainingEventParts on traineeEvent.TrainingEventPartId equals tEventPart.RecordId
        where (cpd.EmployeeID == id) && (startDate >= cpd.StartDate && endDate <= cpd.EndDate) &&
              (traineeEvent.EmployeeId == id)
              && (traineeEvent.TraineeStatus == 1 || traineeEvent.TraineeStatus == 2)
              && (tEventPart.CPDHours > 0 || tEventPart.CPDPoints > 0)
              && (cpd.CPDHours > 0 || cpd.CPDPoints > 0)
              || traineeEvent.StartDate >= startDate
              || traineeEvent.EndDate <= endDate
        orderby cpd.StartDate
        select new
        {
            surname = e.surname,
            forname1 = e.forename1,
            forname2 = e.forename2,
            EmployeeID = cpd.EmployeeID,
            StartDate = cpd.StartDate,
            EndDate = cpd.EndDate,
            CPDHours = cpd.CPDHours,
            CPDPoints = cpd.CPDPoints,
            Description = cpd.Description,
            TrainingStartDate = tEventPart.StartDate,
            TrainingEndDate = tEventPart.EndDate,
            TrainingCPDHours = tEventPart.CPDHours,
            TrainingCPDPoints = tEventPart.CPDPoints,
            TrainingEventDescription = tEventPart.Description
        };
if (q != null)
{
    Array.ForEach(q.ToArray(), i =>
    {
        if (ContextBase.encryptionEnabled)
            employeeName = ContextBase.Decrypt(i.surname) + ", " + ContextBase.Decrypt(i.forname1) + " " + ContextBase.Decrypt(i.forname2);
        else
            employeeName = i.surname + ", " + i.forname1 + " " + i.forname2;
        if (i.TrainingStartDate != new DateTime(1900, 1, 1))
            employeeCPDRecords.Add(new EmployeeCPDReportRecord(employeeName, Convert.ToDateTime(i.StartDate), Convert.ToDateTime(i.EndDate), Convert.ToDecimal(i.CPDHours), Convert.ToDecimal(i.CPDPoints), i.Description,i.t,i.EndDate,Convert.ToDecimal(i.CPDHours),Convert.ToDecimal(i.CPDPoints),i.Description,"L&D"));
        else
            employeeCPDRecords.Add(new EmployeeCPDReportRecord(employeeName, Convert.ToDateTime(i.StartDate), Convert.ToDateTime(i.EndDate), Convert.ToDecimal(i.CPDHours), Convert.ToDecimal(i.CPDPoints), i.Description, i.StartDate, i.EndDate, Convert.ToDecimal(i.CPDHours), Convert.ToDecimal(i.CPDPoints), i.Description, "Employee CPD"));
    });
}
Use this code
List<EmployeeCPDReportRecord> employeeCPDRecords = new List<EmployeeCPDReportRecord>();
var q = (from cpd in pamsEntities.EmployeeCPDs
         from traineeEvent in pamsEntities.TrainingEventTrainees
         join Employee e in pamsEntities.Employees on cpd.EmployeeID equals e.emp_no
         join TrainingEventPart tEventPart in pamsEntities.TrainingEventParts on traineeEvent.TrainingEventPartId equals tEventPart.RecordId
         where (cpd.EmployeeID == id) && (startDate >= cpd.StartDate && endDate <= cpd.EndDate) &&
               (traineeEvent.EmployeeId == id)
               && (traineeEvent.TraineeStatus == 1 || traineeEvent.TraineeStatus == 2)
               && (tEventPart.CPDHours > 0 || tEventPart.CPDPoints > 0)
               && (cpd.CPDHours > 0 || cpd.CPDPoints > 0)
               || traineeEvent.StartDate >= startDate
               || traineeEvent.EndDate <= endDate
         orderby cpd.StartDate
         select new EmployeeCPDReportRecord
         {
             // Single name column: decrypt only when encryption is enabled
             YourEmployeColumnName = (ContextBase.encryptionEnabled == true ? ContextBase.Decrypt(e.surname) + ", " + ContextBase.Decrypt(e.forename1) + " " + ContextBase.Decrypt(e.forename2) : e.surname + ", " + e.forename1 + " " + e.forename2),
             // Single type column that tells the two kinds of event apart
             YourEmployeeCPDColumnName = (tEventPart.StartDate != new DateTime(1900, 1, 1) ? "L&D" : "Employee CPD"),
             surname = e.surname,
             forname1 = e.forename1,
             forname2 = e.forename2,
             EmployeeID = cpd.EmployeeID,
             StartDate = cpd.StartDate,
             EndDate = cpd.EndDate,
             CPDHours = cpd.CPDHours,
             CPDPoints = cpd.CPDPoints,
             Description = cpd.Description,
             TrainingStartDate = tEventPart.StartDate,
             TrainingEndDate = tEventPart.EndDate,
             TrainingCPDHours = tEventPart.CPDHours,
             TrainingCPDPoints = tEventPart.CPDPoints,
             TrainingEventDescription = tEventPart.Description
         });

Running a mapreduce job on cloudera demo cdh3u4 (airline data example)

I'm working through Jeffrey Breen's R-Hadoop tutorial (October 2012).
At the moment I'm trying to populate HDFS and then run the commands Jeffrey published in his tutorial in RStudio. Unfortunately I've run into some trouble:
UPDATE: I have now moved the data folder to:
/home/cloudera/data/hadoop/wordcount (and same for airline-Data)
Now when I run populate.hdfs.sh I get the following output:
[cloudera@localhost ~]$ /home/cloudera/TutorialBreen/bin/populate.hdfs.sh
mkdir: cannot create directory /user/cloudera: File exists
mkdir: cannot create directory /user/cloudera/wordcount: File exists
mkdir: cannot create directory /user/cloudera/wordcount/data: File exists
mkdir: cannot create directory /user/cloudera/airline: File exists
mkdir: cannot create directory /user/cloudera/airline/data: File exists
put: Target /user/cloudera/airline/data/20040325.csv already exists
And then I tried the commands in RStudio as shown in the tutorial but I get errors at the end. Can someone show me what I did wrong?
> if (LOCAL)
+ {
+ rmr.options.set(backend = 'local')
+ hdfs.data.root = 'data/local/airline'
+ hdfs.data = file.path(hdfs.data.root, '20040325-jfk-lax.csv')
+ hdfs.out.root = 'out/airline'
+ hdfs.out = file.path(hdfs.out.root, 'out')
+ if (!file.exists(hdfs.out))
+ dir.create(hdfs.out.root, recursive=T)
+ } else {
+ rmr.options.set(backend = 'hadoop')
+ hdfs.data.root = 'airline'
+ hdfs.data = file.path(hdfs.data.root, 'data')
+ hdfs.out.root = hdfs.data.root
+ hdfs.out = file.path(hdfs.out.root, 'out')
+ }
> asa.csvtextinputformat = make.input.format( format = function(con, nrecs) {
+ line = readLines(con, nrecs)
+ values = unlist( strsplit(line, "\\,") )
+ if (!is.null(values)) {
+ names(values) = c('Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime',
+ 'ArrTime','CRSArrTime','UniqueCarrier','FlightNum','TailNum',
+ 'ActualElapsedTime','CRSElapsedTime','AirTime','ArrDelay',
+ 'DepDelay','Origin','Dest','Distance','TaxiIn','TaxiOut',
+ 'Cancelled','CancellationCode','Diverted','CarrierDelay',
+ 'WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay')
+ return( keyval(NULL, values) )
+ }
+ }, mode='text' )
> mapper.year.market.enroute_time = function(key, val) {
+ if ( !identical(as.character(val['Year']), 'Year')
+ & identical(as.numeric(val['Cancelled']), 0)
+ & identical(as.numeric(val['Diverted']), 0) ) {
+ if (val['Origin'] < val['Dest'])
+ market = paste(val['Origin'], val['Dest'], sep='-')
+ else
+ market = paste(val['Dest'], val['Origin'], sep='-')
+ output.key = c(val['Year'], market)
+ output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
+ return( keyval(output.key, output.val) )
+ }
+ }
> reducer.year.market.enroute_time = function(key, val.list) {
+ if ( require(plyr) )
+ val.df = ldply(val.list, as.numeric)
+ else { # this is as close as my deficient *apply skills can come w/o plyr
+ val.list = lapply(val.list, as.numeric)
+ val.df = data.frame( do.call(rbind, val.list) )
+ }
+ colnames(val.df) = c('crs', 'actual','air')
+ output.key = key
+ output.val = c( nrow(val.df), mean(val.df$crs, na.rm=T),
+ mean(val.df$actual, na.rm=T),
+ mean(val.df$air, na.rm=T) )
+ return( keyval(output.key, output.val) )
+ }
> mr.year.market.enroute_time = function (input, output) {
+ mapreduce(input = input,
+ output = output,
+ input.format = asa.csvtextinputformat,
+ output.format='csv', # note to self: 'csv' for data, 'text' for bug
+ map = mapper.year.market.enroute_time,
+ reduce = reducer.year.market.enroute_time,
+ backend.parameters = list(
+ hadoop = list(D = "mapred.reduce.tasks=2")
+ ),
+ verbose=T)
+ }
> out = mr.year.market.enroute_time(hdfs.data, hdfs.out)
Error in file(f, if (format$mode == "text") "r" else "rb") :
cannot open the connection
In addition: Warning message:
In file(f, if (format$mode == "text") "r" else "rb") :
cannot open file 'data/local/airline/20040325-jfk-lax.csv': No such file or directory
> if (LOCAL)
+ {
+ results.df = as.data.frame( from.dfs(out, structured=T) )
+ colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'in.air')
+ print(head(results.df))
+ }
Error in to.dfs.path(input) : object 'out' not found
Thank you so much!
First of all, it looks like the command:
/usr/bin/hadoop fs -mkdir /user/cloudera/wordcount/data
Is being split into multiple lines. Make sure you're entering it as-is.
Also, it is saying that the local directory data/hadoop/wordcount does not exist. Verify that you're running this command from the correct directory and that your local data is where you expect it to be.

Paste data validation only using Google Apps Script

This one should be an easy yes or no. Is it possible to paste data validation only with Google Apps Script?
What I want to do is have the code copy the data validation from the row above the active cell, then paste the data validation into the row of the active cell.
I tried copyTo:
function updateFormat() {
  var rowNumber = SpreadsheetApp.getActiveSpreadsheet().getActiveSelection().getRow();
  var rowAbove = rowNumber - 1;
  var targetRange = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet().getRange(rowNumber, 1, 1, 36);
  var templateRange = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet().getRange(rowAbove, 1, 1, 36);
  templateRange.copyTo(targetRange);
}
but--obviously--that copied the data validation and the contents of the row above, which is not the goal.
Any ideas?
Thanks in advance!
Adding the optArgument {formatOnly:true} works (See Ref). So
templateRange.copyTo(targetRange, {formatOnly:true});
I've refactored your code slightly and tested using:
function updateFormat() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var rowNumber = sheet.getActiveSelection().getRow();
  var rowAbove = rowNumber - 1;
  var maxCols = sheet.getMaxColumns();
  var rangeToCopy = sheet.getRange(rowAbove, 1, 1, maxCols);
  rangeToCopy.copyTo(sheet.getRange(rowNumber, 1, 1, maxCols), {formatOnly:true});
}
Just for the record, I think that there is a new validation class to solve these issues.
But as a workaround, you could read all other values, formulas and formatting of your targetRange before you copy the template over it, then restore them using their specific formulas, e.g. setFormula, setValue and so on. Basically leaving only the data validation from the template.
There is a variant of copyTo that pastes only the data validations:
copyTo(destination, SpreadsheetApp.CopyPasteType.PASTE_DATA_VALIDATION, false)
In your specific case, instead of
templateRange.copyTo(targetRange);
use
templateRange.copyTo(targetRange, SpreadsheetApp.CopyPasteType.PASTE_DATA_VALIDATION, false);
Ref. range#copytodestination-copypastetype-transposed
/**
 * Copies the format and validations of the master row into the new row, that is,
 * the header if index = 1, or whichever row we prefer. If index is not specified,
 * a row inserted at the end takes the previous row as master, and a row inserted
 * at the beginning takes the next one.
 * @example function onEdit() { updateRules(); }
 */
function sheetRowRules(index) {
  var spread = SpreadsheetApp.getActive();
  var sheet = SpreadsheetApp.getActiveSheet();
  var row = sheet.getActiveCell().getRowIndex();
  var cols = sheet.getMaxColumns();
  var i;
  if (index)
    i = index;
  else if (row > 2)
    i = row - 1; // previous row
  else
    i = row + 1; // row after the header
  var rg = sheet.getRange(i, 1, 1, cols);
  var rango = sheet.getRange(row, 1, 1, cols);
  // Copy the format of the master row to the current row, plus the cell formulas, to keep the data validations
  rg.copyTo(rango, {formatOnly:true});
  var formulas = rg.getFormulasR1C1().toString().split(",");
  //rango.setFormulasR1C1(formulas); // clears the values, so set the formulas cell by cell instead:
  for (var n = 0; n < formulas.length; n++) {
    if (formulas[n].length) {
      var col = n + 1;
      var rango = sheet.getRange(row, col, 1, 1);
      rango.setFormulaR1C1(formulas[n]);
    }
  }
  spread.toast("Formatos y fórmulas del replicados de "+i+" al registro "+row);
}
