SparkR: "Cannot resolve column name..." when adding a new column to Spark data frame - sparkr

I am trying to add some computed columns to a SparkR data frame, as follows:
Orders <- withColumn(Orders, "Ready.minus.In.mins",
(unix_timestamp(Orders$ReadyTime) - unix_timestamp(Orders$InTime)) / 60)
Orders <- withColumn(Orders, "Out.minus.In.mins",
(unix_timestamp(Orders$OutTime) - unix_timestamp(Orders$InTime)) / 60)
The first command executes ok, and head(Orders) reveals the new column. The second command throws the error:
15/12/29 05:10:02 ERROR RBackendHandler: col on 359 failed
Error in select(x, x$"*", alias(col, colName)) :
error in evaluating the argument 'col' in selecting a method for function
'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Cannot resolve column name
"Ready.minus.In.mins" among (ASAP, AddressLine, BasketCount, CustomerEmail, CustomerID, CustomerName, CustomerPhone, DPOSCustomerID, DPOSOrderID, ImportedFromOldDb, InTime, IsOnlineOrder, LineItemTotal, NetTenderedAmount, OrderDate, OrderID, OutTime, Postcode, ReadyTime, SnapshotID, StoreID, Suburb, TakenBy, TenderType, TenderedAmount, TransactionStatus, TransactionType, hasLineItems, Ready.minus.In.mins);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame$$anonfun$col$1.apply(DataFrame.scala:650)
at org.apa
Do I need to do something to the data frame after adding the new column before it will accept another one?

From the link, just use backticks when accessing the column, e.g. instead of
df['Fields.fields1']
use:
df['`Fields.fields1`']

Found it here: spark-issues mailing list archives
SparkR isn't entirely happy with "." in a column name.
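If renaming the computed columns is an option, a minimal workaround sketch is to avoid "." in the new names altogether (the underscore names below are only an illustration, not part of the original schema):
Orders <- withColumn(Orders, "Ready_minus_In_mins",
  (unix_timestamp(Orders$ReadyTime) - unix_timestamp(Orders$InTime)) / 60)
Orders <- withColumn(Orders, "Out_minus_In_mins",
  (unix_timestamp(Orders$OutTime) - unix_timestamp(Orders$InTime)) / 60)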

Related

Getting an error on the JoinEnrichment processor: org.apache.calcite.sql.validate.SqlValidatorException

I am getting an error on JoinEnrichment and don't know why. My query in JoinEnrichment:
SELECT original.*, enrichment.*
FROM original
FULL JOIN enrichment
ON original.name = enrichment.name
Error
JoinEnrichment[id=a17b1f06-f80c-3641-945f-c2ff331f8028] Failed to join 'original' FlowFile FlowFile[filename=cmdbci-mtaas] and 'enrichment' FlowFile FlowFile[filename=cmdbci-mtaas]; routing to failure: java.sql.SQLException: Error while preparing statement [SELECT original.*, enrichment.*
FROM original
FULL JOIN enrichment
ON original.name = enrichment.name]
Caused by: org.apache.calcite.runtime.CalciteContextException: From line 4, column 4 to line 4, column 34: Cannot apply '=' to arguments of type '<JAVATYPE(CLASS JAVA.LANG.OBJECT)> = <JAVATYPE(CLASS JAVA.LANG.STRING)>'. Supported form(s): '<COMPARABLE_TYPE> = <COMPARABLE_TYPE>'
Caused by: org.apache.calcite.sql.validate.SqlValidatorException: Cannot apply '=' to arguments of type '<JAVATYPE(CLASS JAVA.LANG.OBJECT)> = <JAVATYPE(CLASS JAVA.LANG.STRING)>'. Supported form(s): '<COMPARABLE_TYPE> = <COMPARABLE_TYPE>'
Sample input from original
Location,Environment,ip_address,category,name,dv_sys_updated_on
"",,,Hardware,Ndiggan,2022-12-17 22:37:28
"",,,Hardware,class,2022-12-31 22:37:38
"",,,,Vlan2,2022-12-27 02:17:13
Sample input from enrichment
Location,Environment,ip_address,category,name,dv_sys_updated_on
"",,,Hardware,vpna,2022-12-17 22:36:02
"",,,Hardware,dlcccno,2022-12-17 22:37:04
"",,,Hardware,Ndiggan,2022-12-17 22:37:28

ElasticSearch hive SerializationError handler

Using Elasticsearch version 6.8.0
hive> select * from provider1;
OK
{"id","k11",}
{"id","k12",}
{"id","k13",}
{"id","k14",}
{"id":"K1","name":"Ravi","salary":500}
{"id":"K2","name":"Ravi","salary":500}
{"id":"K3","name":"Ravi","salary":500}
{"id":"K4","name":"Ravi","salary":500}
{"id":"K5","name":"Ravi","salary":500}
{"id":"K6","name":"Ravi","salary":"sdfgg"}
{"id":"K7","name":"Ravi","salary":"sdf"}
{"id":"k8"}
{"id":"K9","name":"r1","salary":522}
{"id":"k10","name":"r2","salary":53}
Time taken: 0.179 seconds, Fetched: 14 row(s)
ADD JAR /home/smrafi/elasticsearch-hadoop-6.8.0/dist/elasticsearch-hadoop-6.8.0.jar;
CREATE external TABLE hive_es_with_handler( data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'test_eshadoop/healthCareProvider',
'es.nodes' = 'vpc-pid-pre-prod-es-cluster-b7thvqfj3tp45arxl34gge3yyi.us-east-2.es.amazonaws.com',
'es.input.json' = 'yes',
'es.index.auto.create' = 'true',
'es.write.operation'='upsert',
'es.nodes.wan.only' = 'true',
'es.port' = '443',
'es.net.ssl'='true',
'es.batch.size.entries'='1',
'es.mapping.id' ='id',
'es.batch.write.retry.count'='-1',
'es.batch.write.retry.wait'='60s',
'es.write.rest.error.handlers' = 'es, ignoreBadRecords',
'es.write.data.error.handlers' = 'customLog',
'es.write.data.error.handler.customLog' = 'com.verisys.elshandler.CustomLogOnError',
'es.write.rest.error.handler.es.client.resource'='error_es_index/error',
'es.write.rest.error.handler.es.return.default'='HANDLED',
'es.write.rest.error.handler.log.logger.name' = 'BulkErrors',
'es.write.data.error.handler.log.logger.name' = 'SerializationErrors',
'es.write.rest.error.handler.ignoreBadRecords' = 'com.verisys.elshandler.IgnoreBadRecordHandler',
'es.write.rest.error.handler.es.return.error'='HANDLED');
insert into hive_es_with_handler select * from provider1;
Below is the exception trace; it failed, complaining that the error handler index is not present.
Caused by: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: org.codehaus.jackson.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value at [Source: [B#1e3f0aea; line: 1, column: 7]
at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:95)
at org.elasticsearch.hadoop.serialization.ParsingUtils.doFind(ParsingUtils.java:168)
at org.elasticsearch.hadoop.serialization.ParsingUtils.values(ParsingUtils.java:151)
at org.elasticsearch.hadoop.serialization.field.JsonFieldExtractors.process(JsonFieldExtractors.java:213)
at org.elasticsearch.hadoop.serialization.bulk.JsonTemplatedBulk.preProcess(JsonTemplatedBulk.java:64)
at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:54)
at org.elasticsearch.hadoop.hive.EsSerDe.serialize(EsSerDe.java:171)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:725)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 9 more
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value at [Source: [B#1e3f0aea; line: 1, column: 7]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:500)
at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:93) ... 22 more
I tried to use the custom SerializationErrorHandler, but it is of no use and the handler is not coming into context. It completely stops the job instead of continuing with the good records, even after returning the default (HANDLED) constant.
It seems you have invalid JSON.
As mentioned in the docs, this is not handled by Hive:
Serialization Error Handlers are not yet available for Hive. Elasticsearch for Apache Hadoop uses Hive’s SerDe constructs to convert data into bulk entries before being sent to the output format. SerDe objects do not have a cleanup method that is called when the object ends its lifecycle. Because of this, we do not support serialization error handlers in Hive as they cannot be closed at the end of the job execution
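For reference, the first few rows returned from provider1 are not valid JSON, which matches the parser error about a ',' where a ':' was expected. A minimal illustration of the difference (the corrected form is only an assumption about what was intended):
{"id","k11",}    <- invalid: comma instead of a colon after the field name
{"id":"k11"}     <- valid: field name and value separated by a colon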

On running an aggregate command, error message displayed: "Dataset byQuarter has no data"

This is the code:
COMPUTE Quarter=TRUNC((MONTH_-1)/3)+1.
EXECUTE.
DELETE VARIABLES MONTH_ DATE_.
DATASET NAME KV.
DATASET DECLARE byQuarter.
AGGREGATE /OUTFILE='byQuarter' /BREAK=YEAR_ Quarter /KV.1 TO KV.126=MEAN(KV.1 TO KV.126).
DATASET ACTIVATE byQuarter.
There is no select case before this code. Can someone help with this error?
Try this:
AGGREGATE /OUTFILE='byQuarter' /BREAK=YEAR_ Quarter
/KV.1_mean=MEAN(KV.1)
/KV.2_mean=MEAN(KV.2)
/KV.3_mean=MEAN(KV.3)
/KV.4_mean=MEAN(KV.4)
/KV.5_mean=MEAN(KV.5)
...

1003 error (unable to find an operator for alias) in GROUP function in Pig

I have written a .pig file whose content is:
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '/pigdata/salaryTravelReport.csv' using csvloader();
x = foreach xyz generate $0 as name:chararray, $1 as title:chararray, replace($2, ',','') as salary:bytearray, replace($3, ',', '') as travel:bytearray, $4 as orgtype:chararray, $5 as org:chararray, $6 as year:bytearray;
refined = foreach x generate name, title, (float)salary, (float)travel, orgtype, org, (int)year;
year2010 = filter refined by year == 2010;
byjobtitile = GROUP year2010 by title;
The purpose is to remove ',' in the dollar values in two columns and then group the data by job title. When I run this using the run command there is no error. Even dumping year2010 works fine. But dumping byjobtitile gives an error:
error in dumping
The output of the log file is:
Pig Stack Trace
---------------
ERROR 1003: Unable to find an operator for alias byjobtitle

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1003: Unable to find an operator for alias byjobtitle
    at org.apache.pig.PigServer$Graph.buildPlan(PigServer.java:1544)
    at org.apache.pig.PigServer.storeEx(PigServer.java:1029)
    at org.apache.pig.PigServer.store(PigServer.java:997)
    at org.apache.pig.PigServer.openIterator(PigServer.java:910)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
    at org.apache.pig.Main.run(Main.java:565)
    at org.apache.pig.Main.main(Main.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I am new to big data and don't have much knowledge, but it looks like there is a problem with the data types. Can anyone help me out?
The issue is due to the CSVLoader you are using. It has ',' as the default delimiter. Since your data also has "," in some of its fields, like salary and travel, the positional index gets shifted. So if your data is something like this
name title salary travel orgtype org year
A B 10,000 23,1357 ORG_TYPE ORG 2016
then using CSVLoader will make "A B 10" the first field, "000 23" the second field, and "1357 ORG_TYPE ORG 2016" the third field, splitting on ",".
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '<path to your file>' using csvloader();
a = foreach xyz generate $0;
2016-06-07 12:28:12,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(A B 10)
You can make your delimiter different so that it is not present in any field value.
Try using CSVExcelStorage. You can use its constructor to explicitly define the delimiter:
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
It will work fine as long as the same character is not present both as the delimiter and in any field value.
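For illustration, a hedged sketch of how the alias defined above could be wired into the original script; it assumes the input file has actually been exported with '|' as the separator (otherwise keep ',' and rely on CSVExcelStorage's handling of quoted fields):
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
-- load the pipe-delimited file; commas inside salary/travel no longer shift the columns
xyz = load '/pigdata/salaryTravelReport.csv' using CSVExcelStorage();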

How to solve distinct-values error

How do I solve this error in XQuery? I want the data to be distinct, without duplication, in the XML result. I tried to add distinct-values in front of doc() in the for statement, but this error appeared.
Engine name: Saxon-PE XQuery 9.5.1.3
Severity: fatal
Description: XPTY0019: Required item type of first operand of '/' is node(); supplied value has item type xs:anyAtomicType
Start location: 23:0
URL: http://www.w3.org/TR/xpath20/#ERRXPTY0019
This is the code:
for $sv1 in distinct-values(doc('tc.xml')//term/year)
let $sv2 := doc('tc.xml')//term[year= $sv1]
let $sv3 := doc('tc.xml')//student[idStudent= $sv1/idStudent](:HERE IS THE ERROR LINE:)
let $sv4 := doc('tc.xml')//program[idStudent= $sv3/idStudent]
return
<Statistics>
{$sv1 }
<Count_Student>{count($sv2)}</Count_Student>
<a50_60>{count(doc('tc.xml')/mydb//program[doc('tc.xml')/mydb//term/year =$sv1][avg>= 50 and avg < 60])}</a50_60>
</Statistics>
Thank you in advance.
distinct-values() will atomize your input; this means that $sv1/idStudent won't work, because $sv1 is not an element. Instead of using $sv1 on the line that gives the error, I think you should be using $sv2.
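A minimal sketch of the failing line with $sv2 substituted, assuming the rest of the query stays as posted:
let $sv3 := doc('tc.xml')//student[idStudent = $sv2/idStudent]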
