SparkR error while writing dataframe to CSV and Parquet (RStudio)

I'm getting an error while writing a Spark dataframe to CSV and Parquet. I have already tried installing winutils, but it still does not solve the error.
My code:
# Assumes library(SparkR) is loaded and a Spark session is already running.
INVALID_IMEI <- c("012345678901230", "000000000000000")

setwd("D:/Revas/Jatim Old")
fileList <- list.files()

cdrSchema <- structType(structField("date", "string"),
                        structField("time", "string"),
                        structField("a_number", "string"),
                        structField("b_number", "string"),
                        structField("duration", "integer"),
                        structField("lac_cid", "string"),
                        structField("imei", "string"))

file <- fileList[1]
filePath <- paste0("D:/Revas/Jatim Old/", file)
dataset <- read.df(filePath, header = "false", source = "csv",
                   delimiter = "|", schema = cdrSchema)

# Drop rows whose IMEI is invalid, NaN, or NULL
dataset <- filter(dataset, ifelse(dataset$imei %in% INVALID_IMEI, FALSE, TRUE))
dataset <- filter(dataset, ifelse(isnan(dataset$imei), FALSE, TRUE))
dataset <- filter(dataset, ifelse(isNull(dataset$imei), FALSE, TRUE))
To export the dataframe, I tried the following:
write.df(dataset, "D:/spark/dataset", mode = "overwrite")
write.parquet(dataset, "D:/spark/dataset", mode = "overwrite")
and I get the following error:
Error: Error in save : org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.comma

I have already found the likely cause. The issue seems to lie in the winutils version: previously I was using the build for Hadoop 2.6, and switching to the 2.8 build seems to solve the issue.
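For reference, a minimal sketch of pointing a Windows SparkR session at a matching winutils build; the local Hadoop path below is an assumption, and the only requirement is that bin/winutils.exe from the 2.8 build sits under HADOOP_HOME:
# Assumed local path; it must contain bin/winutils.exe from a Hadoop 2.8 build.
Sys.setenv(HADOOP_HOME = "C:/hadoop-2.8")
Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/hadoop-2.8/bin", sep = ";"))
library(SparkR)
sparkR.session(master = "local[*]")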

Related

Loading JSON files from S3 into a SparkR dataframe

I have JSON files saved in an S3 bucket. I am trying to load them as a dataframe in SparkR and I am getting error logs. The following is my code. Where am I going wrong?
devtools::install_github("apache/spark@v2.2.0", subdir = "R/pkg", force = TRUE)
library(SparkR)
sc <- sparkR.session(master = "local")
Sys.setenv("AWS_ACCESS_KEY_ID" = "xxxx",
           "AWS_SECRET_ACCESS_KEY" = "yyyy",
           "AWS_DEFAULT_REGION" = "us-west-2")
movie_reviews <- SparkR::read.df(path = "s3a://bucketname/reviews_Movies_and_TV_5.json",
                                 sep = "", source = "json")
I have tried all combinations of s3a, s3n, and s3, and none of them seems to work.
I get the following error log in my SparkR console:
17/12/09 06:56:06 WARN FileStreamSink: Error while looking for metadata directory.
17/12/09 06:56:06 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.reflect.InvocationTargetException
For me, this works:
read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
What @Ankit said should work, but if you are trying to get something that looks more like a dataframe, you need to use a select statement, i.e.:
rdd<- read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
Then do a printSchema(rdd) to see the structure of the data.
If you see root followed by your data with no indentation, you can probably go ahead and select using the names of the "columns" you want. If you see branching down your schema tree, you may have to put a headers.blah or a payload.blah in your select statement, like this:
sdf<- SparkR::select(rdd, "headers.something", "headers.somethingElse", "payload.somethingInPayload", "payload.somethingElse")
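If the s3a scheme still fails, the piece that is often missing is the Hadoop AWS connector plus the fs.s3a credentials on the Spark side. A rough sketch, assuming a local session; the hadoop-aws version and bucket name are placeholders and should match your Spark/Hadoop build:
library(SparkR)
# Assumed connector version; pick the one matching the Hadoop version bundled with Spark.
sparkR.session(master = "local[*]",
               sparkPackages = "org.apache.hadoop:hadoop-aws:2.7.3",
               sparkConfig = list(spark.hadoop.fs.s3a.access.key = "xxxx",
                                  spark.hadoop.fs.s3a.secret.key = "yyyy"))
movie_reviews <- read.df("s3a://bucketname/reviews_Movies_and_TV_5.json", source = "json")
printSchema(movie_reviews)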

SparkR 2.0 read.df throws "Path does not exist" error

My SparkR 1.6 code does not work in Spark 2.0. I made the necessary changes, such as creating a session with sparkR.session() instead of sparkR.init() and not passing the sqlContext parameter, etc.
In the code below I am loading data from a couple of folders into a dataframe.
read.df in Spark 1.6, which works:
sales <- read.df(sqlContext, path = "gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.com/myData/2015/20*", source = "com.databricks.spark.csv", delimiter = "\t")
read.df in Spark 2.0, which does not work:
sales <- read.df("gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.com/myData/2015/20*", source = "com.databricks.spark.csv", delimiter = "\t")
The above line throws the following error:
16/09/25 19:28:52 ERROR org.apache.spark.api.r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Path does not exist: gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.com/myData/2015/20*;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
Calls: read.df -> dispatchFunc -> f -> callJStatic -> invokeJava
Execution halted
16/09/25 19:28:53 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@148bd6fd{HTTP/1.1}{0.0.0.0:4040}
Spark 2.0's read.df fails on files that have a comma (",") in the file name.
The data files I generated have commas in their names, something like these:
201448-0,004
201448-0,005
201448-0,006
After painful hours of debugging the issue, it finally started reading the data once I removed the commas from the file names.
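As a side note, one way to keep commas out of the path argument entirely is to read each directory with its own glob and then bind the results. A rough sketch, assuming the two globs can be loaded separately and keeping the question's delimiter option:
sales2014 <- read.df("gs://dev.appspot.com/myData/2014/20*", source = "csv", delimiter = "\t")
sales2015 <- read.df("gs://dev.appspot.com/myData/2015/20*", source = "csv", delimiter = "\t")
# Stack the two SparkDataFrames into one
sales <- rbind(sales2014, sales2015)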

SparkR dapply not working

I'm trying to call lapply within a function applied to a Spark data frame. According to the documentation, this has been possible since Spark 2.0.
wrapper <- function(df) {
  out <- df
  out$len <- unlist(lapply(df$value, function(y) length(y)))
  return(out)
}
# dd is a Spark DataFrame with one column (value) of type raw
dapplyCollect(dd, wrapper)
It returns the following error:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 37, localhost): org.apache.spark.SparkException: R computation failed with
Error in (function (..., deparse.level = 1, make.row.names = TRUE) :
incompatible types (from raw to logical) in subassignment type fix
The following works fine:
wrapper(collect(dd))
But we want the computation to run on the nodes (not on the driver).
What could be the problem? There is a related question, but it does not help.
Thanks.
You need to add the schema, as it can only be defaulted if the columns of the output are of the same mode as the input.
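A minimal sketch of what that answer suggests, using dapply() with an explicit output schema; "binary" is assumed here as the Spark counterpart of the raw column described in the question:
# Output columns: the original raw values plus their integer lengths
outSchema <- structType(structField("value", "binary"),
                        structField("len", "integer"))
res <- dapply(dd, wrapper, outSchema)
head(collect(res))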

Error while storing in HBase using Pig

HDFS input data (hadoop fs -cat):
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but it gives the following error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my LOAD works perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP 2.3
Apache Pig 0.15.0
HBase 1.1.1
All jars are also in place, as I installed them through Ambari.
Solved the data upload:
I was missing a RANK on the relation, so the HBase rowkey becomes the rank.
DATA_FIL_1 = RANK DATA_FIL_2;
NOTE: this will generate an arbitrary rowkey.
If you want to define your own rowkey, then use something like the following: you have to assign the STORE to another relation (the STORE function alone won't work), and it will take the first field of each tuple as the rowkey (which you have defined).
storage_data = STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');

Error when running the auto.arima function

I am downloading a stock's daily close data using the quantmod package:
library(quantmod)
library(dygraphs)
library(forecast)
date <- as.Date("2014-11-01")
getSymbols("SBIN.BO",from = date )
close <- SBIN.BO[, 4]
dygraph(close)
dat <- data.frame(date = index(SBIN.BO),SBIN.BO)
acf1 <- acf(close)
When I tried to execute the auto.arima function from the forecast package:
fit <- auto.arima(close, seasonal=FALSE, xreg=fourier(close, K=4))
I encountered the following error:
Error in ...fourier(x, K, 1:length(x)) :
K must be not be greater than period/2
So I want to know why this error occurs. Did I make a mistake in the code, which is based on tutorials available on Rob Hyndman's website and blog?
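The message comes from fourier(): K may be at most half the seasonal period, and a plain xts of daily closes typically has frequency 1, so any K >= 1 is rejected. A rough sketch of one workaround, assuming a 5-day trading week as the seasonal period (an assumption, not part of the original code):
# Re-wrap the closes as a ts with period 5 so that K <= period/2 can hold (K = 2 <= 2.5)
close_ts <- ts(as.numeric(close), frequency = 5)
fit <- auto.arima(close_ts, seasonal = FALSE, xreg = fourier(close_ts, K = 2))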
