error spark-shell, falling back to uploading libraries under SPARK_HOME - hadoop

I'm trying to connect a spark-shell amazon hadoop, but I esart all the time giving the following error and do not know how to fix it or configure what is missing.
spark.yarn.jars, spark.yarn.archive
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/12 07:47:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/08/12 07:47:28 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Thx!!!
Error1
I'm trying to run a SQL query, something totally simple as:
val sqlDF = spark.sql("SELECT col1 FROM tabl1 limit 10")
sqlDF.show()
WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Error2
Then I try to run a script scala, something simple collected in:
https://blogs.aws.amazon.com/bigdata/post/Tx2D93GZRHU3TES/Using-Spark-SQL-for-ETL
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
import java.util.HashMap
var ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "tableDynamoDB")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
var genreRatingsCount = sqlContext.sql("SELECT col1 FROM table1 LIMIT 1")
var ddbInsertFormattedRDD = genreRatingsCount.map(a => {
var ddbMap = new HashMap[String, AttributeValue]()
var col1 = new AttributeValue()
col1.setS(a.get(0).toString)
ddbMap.put("col1", col1)
var item = new DynamoDBItemWritable()
item.setItem(ddbMap)
(new Text(""), item)
}
)
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1502)
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1500)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)

Looks like spark UI not started, tried to start spark shell and also check sparkUI localhost:4040 running correctly.

Related

pyspark An error occurred while calling o23.jdbc. : java.lang.ClassNotFoundException: com.mariadb.jdbc.Driver

I'm going to read tables from mariadb database using pyspark.And an error occur while running the below code
'''
jdbcHostname = "localhost"
jdbcDatabase = "pucsl"
jdbcPort = 3307
jdbcUrl = "jdbc:mariadb://{0}:{1}/{2}?user={3}&password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, "root", "ravi")
df = spark.read.jdbc(url=jdbcUrl, table="m00_02_lic_lic_reln",properties={"driver": 'com.mariadb.jdbc.Driver'})
Currently Spark does not correctly recognize mariadb specific jdbc connect strings and so the jdbc:mysql syntax must be used. The followings shows a simple pyspark script to query the results from ColumnStore UM server columnstore_1 into a spark dataframe:
from pyspark import SparkContext
from pyspark.sql import DataFrameReader, SQLContext
url = 'jdbc:mysql://columnstore_1:3306/test'
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
sc = SparkContext("local", "ColumnStore Simple Query Demo")
sqlContext = SQLContext(sc)
df = DataFrameReader(sqlContext).jdbc(url='%s' % url, table='results', properties=properties)
df.show()
p.s~ I believe you have successfully added MariaDB jar in place(Something like /spark3.1.2/lib/maridabjar...)

Accessing HDFS configured as High availability from Client program

I am trying to understand the context of the working and not working program which connects HDFS via nameservice(which connects active name node - High availability Namenode) outside HDFS cluster.
Not working program:
When i read both config files (core-site.xml and hdfs-site.xml) and accessing HDFS file , it throws an error
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("/apps/hive"));
//println("Checked : "+ check)
}
}
Error : We see that Unknownhost Exception
hadoopConf :
hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at HadoopAccess$.main(HadoopAccess.scala:28)
at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
Working Program : I specifically set the High availability into hadoopConf object and passing to Filesystem object , the program works
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
//hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
//hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
println(hadoopConf)
/* val namenode = hadoopConf.get("fs.defaultFS")
println("namenode: "+ namenode) */
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("hdfs://mycluster/apps/hive"));
//println("Checked : "+ check)
}
}
Any reason why we need to set values for this configs like dfs.nameservices,fs.client.failover.proxy.provider.mycluster,dfs.namenode.rpc-address.mycluster.nn1 in hadoopconf object as this values already present in hdfs-site.xml file and core-site.xml. These configs are High availability Namenode settings.
The above program which I am running via Edge mode or local IntelliJ.
Hadoop version : 2.7.3.2
Hortonworks : 2.6.1
My observation in Spark Scala REPL :
When I do val hadoopConf = new Configuration(false) and val fs = FileSystem.get(hadoopConf) .This gives me Local FileSystem .So when I perform below
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
,now the file System changed to DFSFileSysyem ..My assumption is that some client library which is in Spark that is not available in somewhere in during build or edge node common place .
some client library which is in Spark that is not available in somewhere in during build or edge node common place
This common place would be $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. But if you are just running a regular Scala app with java jar or with IntelliJ, that has nothing to do with Spark.
... this values already present in hdfs-site.xml file and core-site.xml
Then, they should be read, accordingly, however overriding in the code shouldn't hurt either.
The values are necessary because they dicate where the actual namenodes are running; otherwise, it thinks mycluster is a real DNS name of only one server, when it isn't

Disable info logs during testing a buffalo application

I am unable to disable INFO logs during testing.
Is there a way to do so?
Thanks.
Set the buffalo.Options.Logger.Out to ioutil.Discard, you might have to create an instance of a logger to do it:
import (
"github.com/gobuffalo/logger"
"ioutil"
// etc.
)
var noopLogger logger.Logrus
noopLogger.Out = ioutil.Discard
noopLogger.SetOutput(ioutil.Discard) // can't remember which one you need to do
buffalo.Options.Logger = noopLogger

Stanford CoreNLP use case using Pyspark script runs fine on local node but on yarn cluster mode it runs very slow

I tried debugging all the possible solutions but unable to run this and scale this on cluster as i need to process 100 million records, This script runs very well on local node as expected but fails to run on Cloudera Amazon cluster. Here is the sample data that works on local node. According to me the problem is the 2 files that I am using in the udf's are not getting distributed on the executors/containers or nodes and the jobs just keeps running and processing is very slow. I am unable to fix this code to execute this on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import os
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql import SQLContext
def stanford(str):
os.environ['JAVA_HOME']='/usr/java/jdk1.8.0_131/'
stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
stanford_ner_path = SparkFiles.get("stanford-ner.jar")
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
output = st.tag(str.split())
organizations = []
organization = ""
for t in output:
#The word
word = t[0]
#What is the current tag
tag = t[1]
#print(word, tag)
#If the current tag is the same as the previous tag Append the current word to the previous word
if (tag == "ORGANIZATION"):
organization += " " + word
organizations.append(organization)
final = "-".join(organizations)
return final
stanford_lassification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_lassification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")

Can I create sequence file using spark dataframes?

I have a requirement in which I need to create a sequence file.Right now we have written custom api on top of hadoop api,but since we are moving in spark we have to achieve the same using spark.Can this be achieved using spark dataframes?
AFAIK there is no native api available directly in DataFrame except the below approach
Please try/think some thing like(which is RDD of DataFrame style, inspired by SequenceFileRDDFunctions.scala & method saveAsSequenceFile) in below example :
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
val conf = new SparkConf()
.setAppName("HDFS writable test")
val sc = new SparkContext(conf)
val empty = sc.emptyRDD[Any].repartition(10)
val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
val seq = new SequenceFileRDDFunctions(data)
// seq.saveAsSequenceFile("/tmp/s1", None)
seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
sc.stop()
}
Further information pls see ..
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file

Resources