Spark-submit not running code as in IntelliJ - Windows

The code below runs fine in IntelliJ and displays output. When I try to run it with spark-submit using the command:
spark-submit --class com.sohail.popular_movies_pkg C:\spark\bin\popular_movies_pkg.jar
It just terminates with a warning and nothing is displayed on the console. Am I doing anything wrong, or do I have to include something?
C:\spark\bin>spark-submit --class com.sohail.popular_movies_pkg popular_movies_pkg.jar
19/06/20 01:42:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
package com.sohail
/** Find the movies with the most ratings. */
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
object popular_movies_pkg {

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\winutils\\")

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "popular_movies_pkg")

    // Read in each rating line
    val lines = sc.textFile("C:\\spark\\bin\\u.data")

    // data format: user id, movie id, rating, timestamp
    val movie_rating_map = lines.map(x => (x.split("\t")(1).toInt, 1))
    val movie_rating_count = movie_rating_map.reduceByKey((x, y) => x + y)
    val flip = movie_rating_count.map(x => (x._2, x._1))

    flip.sortByKey(false).collect().foreach(println)
  }
}

Related

Cannot import "org.jetbrains.exposed.sql.Database" in KTOR

I was recently working with a MySQL database and wanted to use it as my data source in Ktor. To do that, I decided to use the org.jetbrains.exposed.sql.Database and javax.sql.DataSource imports. I'm working in IntelliJ.
My test code looks like this:
import org.jetbrains.exposed.sql.Database
import javax.sql.DataSource
fun main(args: Array<String>): Unit = io.ktor.server.netty.EngineMain.main(args)
val databaseUrl = "jdbc:mysql://localhost:3307/databaseName"
val username = "root"
val password = " "
// Create a DataSource object
val dataSource: DataSource = Database.connect(
    url = databaseUrl,
    driver = "com.mysql.jdbc.Driver",
    user = username,
    password = password
)
Somehow, I can't import org.jetbrains.exposed.sql.Database, even though I added the dependency in my build.gradle.kts file:
dependencies {
    implementation("com.mysql.jdbc:mysql-connector-java:8.0.22")
    implementation("org.jetbrains.exposed:exposed:0.18.7")
    implementation("io.ktor:ktor-server-core:$ktor_version")
    implementation("io.ktor:ktor-server-netty:$ktor_version")
    implementation("io.ktor:ktor-server-content-negotiation:$ktor_version")
    implementation("io.ktor:ktor-serialization-kotlinx-json:$ktor_version")
    implementation("ch.qos.logback:logback-classic:$logback_version")
    testImplementation("io.ktor:ktor-server-test-host:$ktor_version")
    testImplementation("org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version")
    implementation(kotlin("stdlib-jdk8"))
}
I tried syncing the Gradle file, rebuilding the project, and cleaning the project. Am I missing something? Thanks!
OK, so I solved it by using a different dependency in my build.gradle file:
implementation("org.jetbrains.exposed:exposed-core:0.41.1")
Instead of:
implementation("org.jetbrains.exposed:exposed:0.18.7")

Can't access assets.open() in onBindViewHolder in Kotlin. Trying to load an image from my assets folder into certain rows

val padder = holder?.view?.padImage
val inputStream = assets.open("greenface.jpg")
val drawableNew = Drawable.createFromStream(inputStream, null)
padder.setImageDrawable(drawableNew)
It throws an error saying "unresolved reference: open". This worked perfectly fine in my main activity class.
Thanks Taseer. In the main activity I passed the context of the activity to the adapter:
recyclerView_main.adapter = MainAdapter(Model, this)
In my adapter class I added the context argument:
class MainAdapter(val boulderProblems: List<BoulderProblems>, var context: Context) :
    RecyclerView.Adapter<CustomViewHolder>() {
And I adjusted my assets code to this:
val padder = holder?.view?.padImage
val inputStream = this.context.assets.open("greenface.jpg")
val drawableNew = Drawable.createFromStream(inputStream, null)
padder.setImageDrawable(drawableNew)

Issues in connecting to Apache Derby From Play Application

I am stuck trying to connect to an embedded Apache Derby database from my Play (2.3.9) application. I have the following configuration in application.conf:
DATABASE_URL_DB = "derby:MyDB"
db.default.driver = org.apache.derby.jdbc.EmbeddedDriver
db.default.url="jdbc:"${DATABASE_URL_DB}
(I have the MyDB DB directory inside the Derby installation directory - which is the default).
The following is the controller code (a fragment of the file) I am trying to execute:
package controllers
import play.api.db._
import play.api.mvc._
import play.api.Play.current
object Application extends Controller {

  def test = Action {
    var outString = "Number is "
    val conn = DB.getConnection()
    try {
      val stmt = conn.createStatement
      val rs = stmt.executeQuery("SELECT 9 as testkey ")
      while (rs.next()) {
        outString += rs.getString("testkey")
      }
    } finally {
      conn.close()
    }
    Ok(outString)
  }
}
The dependencies in place (alongside others):
libraryDependencies ++= Seq(jdbc, cache, ws)
libraryDependencies += "org.apache.derby" % "derby" % "10.12.1.1"
In Routes (alongside others):
GET /globalTest controllers.Application.test
I get the error: NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
Could someone point out the problem? Thanks in advance.
The issue was resolved by minimizing the number of dependencies in build.sbt. It certainly seems like one of the dependencies in my project was interfering with the Derby driver.
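For illustration, a trimmed-down dependency list along those lines might look like the sketch below (a sketch only, assuming Play 2.3.x; the exact set of dependencies to keep depends on what the rest of the project needs):
// build.sbt -- keep only what the application actually uses
libraryDependencies ++= Seq(
  jdbc,   // Play's JDBC plugin, provides the play.api.db.DB used in the controller
  cache,
  ws,
  "org.apache.derby" % "derby" % "10.12.1.1"   // embedded Derby driver
)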

Spark Scala how to execute

I have written the following code, which throws a "Class not found" exception. I'm not sure what I need to do to load data from a CSV file into Spark SQL.
import org.apache.spark.SparkContext
/**
 * Loading sales csv using DataFrame API
 */
object CsvDataInput {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Csv loading example")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1), "header" -> "true"))
    df.printSchema()
    df.registerTempTable("data")

    val aggDF = sqlContext.sql("select * from data")
    println(aggDF.collectAsList())
  }
}
Try replacing this line
import org.apache.spark.SparkContext
with this (Scala's wildcard import uses _, not *):
import org.apache.spark._
You are importing just part of the library but using classes from outside that part. Also, your import is actually misspelled: it should read org.apache.spark.sql.SQLContext, and you used some other package, not related to the code presented.
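For reference, here is a minimal sketch of how the top of CsvDataInput might look with that broader import in place (a sketch only, assuming Spark 1.x with the spark-csv package available on the classpath; the rest of the body is unchanged from the question):
import org.apache.spark._                // SparkContext, SparkConf, etc.
import org.apache.spark.sql.SQLContext   // DataFrame entry point in Spark 1.x

object CsvDataInput {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Csv loading example")
    val sqlContext = new SQLContext(sc)
    // load the CSV through the com.databricks.spark.csv data source, as in the question
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1), "header" -> "true"))
    df.printSchema()
  }
}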

How can I get Hadoop with Cascading to show me debug log output?

I'm having trouble getting Hadoop and Cascading 1.2.6 to show me the output that's supposed to come from using the Debug filter. The Cascading guide says this is how you can view the current tuples. I'm using this to try to see any debug output:
Debug debug = new Debug(Debug.Output.STDOUT, true);
debug.setPrintTupleEvery(1);
debug.setPrintFieldsEvery(1);
assembly = new Each( assembly, DebugLevel.VERBOSE, debug );
I'm pretty new to Hadoop and Cascading, but it's possible I'm not looking in the right place or that there's some simple log4j setting that I'm missing (I haven't made any changes to the defaults you get with Cloudera hadoop-0.20.2-cdh3u3).
This is the WordCount sample class that I'm using (copied from the Cascading user guide) with Debug statements added in:
package org.cascading.example;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Debug;
import cascading.operation.DebugLevel;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;
public class WordCount {
    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        // define source and sink Taps.
        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
        Tap source = new Hfs( sourceScheme, inputPath );
        Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

        // the 'head' of the pipe assembly
        Pipe assembly = new Pipe( "wordcount" );

        // For each input Tuple
        // using a regular expression
        // parse out each word into a new Tuple with the field name "word"
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        Function function = new RegexGenerator( new Fields( "word" ), regex );
        assembly = new Each( assembly, new Fields( "line" ), function );

        Debug debug = new Debug( Debug.Output.STDOUT, true );
        debug.setPrintTupleEvery( 1 );
        debug.setPrintFieldsEvery( 1 );
        assembly = new Each( assembly, DebugLevel.VERBOSE, debug );

        // group the Tuple stream by the "word" value
        assembly = new GroupBy( assembly, new Fields( "word" ) );

        // For every Tuple group
        // count the number of occurrences of "word" and store result in
        // a field named "count"
        Aggregator count = new Count( new Fields( "count" ) );
        assembly = new Every( assembly, count );

        // initialize app properties, tell Hadoop which jar file to use
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass( properties, WordCount.class );

        // plan a new Flow from the assembly using the source and sink Taps
        FlowConnector flowConnector = new FlowConnector();
        FlowConnector.setDebugLevel( properties, DebugLevel.VERBOSE );
        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

        // execute the flow, block until complete
        flow.complete();

        // Ask Cascading to create a GraphViz DOT file
        // brew install graphviz # install viewer to look at dot file
        flow.writeDOT( "build/flow.dot" );
    }
}
It works fine; I just can't find any debug statements anywhere showing me the words. I've looked both through the HDFS filesystem with hadoop dfs -ls and through the JobTracker web UI. The log output for a mapper in the JobTracker doesn't have any output for STDOUT:
Task Logs: 'attempt_201203131143_0022_m_000000_0'
stdout logs
stderr logs
2012-03-13 14:32:24.642 java[74752:1903] Unable to load realm info from SCDynamicStore
syslog logs
2012-03-13 14:32:24,786 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2012-03-13 14:32:25,278 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-03-13 14:32:25,617 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-03-13 14:32:25,903 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : null
2012-03-13 14:32:25,945 INFO cascading.tap.hadoop.MultiInputSplit: current split input path: hdfs://localhost/usr/tnaleid/shakespeare/input/comedies/cymbeline
2012-03-13 14:32:25,980 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2012-03-13 14:32:25,988 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2012-03-13 14:32:26,002 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2012-03-13 14:32:26,246 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2012-03-13 14:32:26,247 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2012-03-13 14:32:27,623 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2012-03-13 14:32:28,274 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2012-03-13 14:32:28,310 INFO org.apache.hadoop.mapred.Task: Task:attempt_201203131143_0022_m_000000_0 is done. And is in the process of commiting
2012-03-13 14:32:28,337 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201203131143_0022_m_000000_0' done.
2012-03-13 14:32:28,361 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
At the end, I'm also writing out the DOT file, which does not have the Debug statement in it that I'd expect (though maybe those are stripped out).
Is there some log file somewhere that I'm missing, or is it a config setting that I need to set?
I got an answer from the mailing list on this.
Changing it to this works:
assembly = new Each( assembly, new Fields( "line" ), function );
// simpler debug statement
assembly = new Each( assembly, new Debug("hello", true) );
assembly = new GroupBy( assembly, new Fields( "word" ) );
That outputs this in the jobdetails UI under stderr:
Task Logs: 'attempt_201203131143_0028_m_000000_0'
stdout logs
stderr logs
2012-03-13 16:21:41.304 java[78617:1903] Unable to load realm info from SCDynamicStore
hello: ['word']
hello: ['CYMBELINE']
<SNIP>
I had tried this directly from the docs, and that doesn't work for me (even though I've also set the FlowConnector debugLevel to VERBOSE):
assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug() );
It seems that it's something related to the DebugLevel.VERBOSE from the documentation, as when I try this, I still get no output:
assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug("hello", true) );
Changing it to remove the DebugLevel also gives me output:
assembly = new Each( assembly, new Debug() );
I can also get it to switch to stdout by doing this:
assembly = new Each( assembly, new Debug(Debug.Output.STDOUT) );
I'm betting there's still something I've got misconfigured with the VERBOSE log level stuff, or 1.2.6 doesn't match the documentation anymore, but at least now I can see the output in the logs.
Did you try setting
flow.setDebugLevel( DebugLevel.VERBOSE );
