Spark Scala how to execute - hadoop

I have written the following code, which throws a "Class not found" exception. I'm not sure what I need to do to load data from a CSV file into Spark SQL.
import org.apache.spark.SparkContext

/**
 * Loading sales csv using DataFrame API
 */
object CsvDataInput {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Csv loading example")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1), "header" -> "true"))
    df.printSchema()
    df.registerTempTable("data")
    val aggDF = sqlContext.sql("select * from data")
    println(aggDF.collectAsList())
  }
}

Try replacing this line
import org.apache.spark.SparkContext
with this
import org.apache.spark._
You are importing just part of the library, but using classes from outside that part (in Scala the wildcard import is written with _, not the Java-style *). Also, double-check the spelling of the class you reference: it should read org.apache.spark.sql.SQLContext, not a similarly named class from an unrelated package.

Related

Cannot import "org.jetbrains.exposed.sql.Database" in KTOR

I was recently working on a MySQL database and wanted to use the database as my data source in KTOR. To use the database, I decided to use the org.jetbrains.exposed.sql.Database
and javax.sql.DataSource imports. I'm working in IntelliJ.
My test code looks like this:
import org.jetbrains.exposed.sql.Database
import javax.sql.DataSource

fun main(args: Array<String>): Unit = io.ktor.server.netty.EngineMain.main(args)

val databaseUrl = "jdbc:mysql://localhost:3307/databaseName"
val username = "root"
val password = " "

// Create a DataSource object
val dataSource: DataSource = Database.connect(
    url = databaseUrl,
    driver = "com.mysql.jdbc.Driver",
    user = username,
    password = password
)
Somehow, I can't import org.jetbrains.exposed.sql.Database, even though I added the dependency in my build.gradle.kts file:
dependencies {
    implementation("com.mysql.jdbc:mysql-connector-java:8.0.22")
    implementation("org.jetbrains.exposed:exposed:0.18.7")
    implementation("io.ktor:ktor-server-core:$ktor_version")
    implementation("io.ktor:ktor-server-netty:$ktor_version")
    implementation("io.ktor:ktor-server-content-negotiation:$ktor_version")
    implementation("io.ktor:ktor-serialization-kotlinx-json:$ktor_version")
    implementation("ch.qos.logback:logback-classic:$logback_version")
    testImplementation("io.ktor:ktor-server-test-host:$ktor_version")
    testImplementation("org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version")
    implementation(kotlin("stdlib-jdk8"))
}
I tried syncing the gradle file, rebuilding the project and cleaning the project. Am I missing something? Thanks!
OK, so I solved it by using a different dependency in my build.gradle.kts file:
implementation("org.jetbrains.exposed:exposed-core:0.41.1")
Instead of:
implementation("org.jetbrains.exposed:exposed:0.18.7")

Spark-submit not running code as in Intellij

The code below runs fine in IntelliJ and displays output. When I try to run it using spark-submit with the command:
spark-submit --class com.sohail.popular_movies_pkg C:\spark\bin\popular_movies_pkg.jar
it just terminates with a warning and nothing is displayed on the console. Am I doing anything wrong, or do I have to include something?
C:\spark\bin>spark-submit --class com.sohail.popular_movies_pkg popular_movies_pkg.jar
19/06/20 01:42:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
package com.sohail

/** Find the movies with the most ratings. */

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._

object popular_movies_pkg {

  def main(args: Array[String]): Unit = {

    System.setProperty("hadoop.home.dir", "C:\\winutils\\")

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "popular_movies_pkg")

    // Read in each rating line
    val lines = sc.textFile("C:\\spark\\bin\\u.data")

    // data format: user id, movie id, rating, timestamp
    val movie_rating_map = lines.map(x => (x.split("\t")(1).toInt, 1))
    val movie_rating_count = movie_rating_map.reduceByKey((x, y) => x + y)
    val flip = movie_rating_count.map(x => (x._2, x._1))

    flip.sortByKey(false).collect().foreach(println)
  }
}

Okta api python sdk throwing json error

I'm trying to create a user with the python sdk. When I run my script, I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/sitepackages/oktasdk-python/okta/framework/ApiClient.py", line 53, in post
if self.__check_response(resp, attempts):
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/sitepackages/oktasdk-python/okta/framework/ApiClient.py", line 88, in __check_response
raise OktaError(json.loads(resp.text))
okta.framework.OktaError.OktaError: The request body was not well-formed: Could not read JSON
Here is a snippet of my code:
from okta.UsersClient import UsersClient
from collections import namedtuple


def main():
    create_okta_user()


def create_okta_user():
    usersClient = UsersClient("https://example.okta.com", "0d0d0dexamplekey")
    User = namedtuple("User", ["login", "email", "firstName", "lastName"], verbose=False, rename=False)
    user = User(login="test@example.com",
                email="test@example.com",
                firstName="user",
                lastName="tester")
    usersClient.create_user(user, activate=False)
    # usersClient.activate_user(user)


main()
It looks like you're trying to use a namedtuple, which is serialized to a JSON list, not an object.
Try using the User model like this:
from okta import UsersClient
from okta.models.user import User


def main():
    create_okta_user()


def create_okta_user():
    usersClient = UsersClient("https://example.okta.com", "0d0d0dexamplekey")
    user = User(login="test@example.com",
                email="test@example.com",
                firstName="user",
                lastName="tester")
    user = usersClient.create_user(user, activate=False)
    # usersClient.activate_user(user)


main()
http://developer.okta.com/docs/sdk/core/python_api_sdk/quickstart.html#create-a-user

Exception : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=hbase, access=EXECUTE

I am trying to perform a bulk load into HBase. The input to the MapReduce job is an HDFS file (produced from Hive).
I am using the code below in the Tool (Job) class to initiate the bulk loading process:
HFileOutputFormat.configureIncrementalLoad(job, new HTable(config, TABLE_NAME));
In the Mapper, I am using the following as the Mapper output:
context.write(new ImmutableBytesWritable(Bytes.toBytes(hbaseTable)), put);
Once the mapper has completed, the actual bulk loading is performed using:
LoadIncrementalHFiles loadFfiles = new LoadIncrementalHFiles(configuration);
HTable hTable = new HTable(configuration, tableName);
loadFfiles.doBulkLoad(new Path(pathToHFile), hTable);
The job runs fine, but once the LoadIncrementalHFiles step starts, it hangs forever; I have to stop the job after many attempts. After a long wait of maybe 30 minutes, I finally got the above error. After extensive searching I found that HBase tries to access the files (HFiles) placed in the output folder, and that folder does not have permission to be written or executed, hence the above error. An alternative solution is to add file access permissions in the Java code before the bulk load is performed, as below:
FileSystem fileSystem = FileSystem.get(config);
fileSystem.setPermission(new Path(outputPath),FsPermission.valueOf("drwxrwxrwx"));
Is this the correct approach as we move from development to production? Also, once I added the above code, I got a similar error for the folder created inside the output folder, this time the column family folder, which is created dynamically at runtime.
As a temporary workaround, I did the following and was able to move ahead:
fileSystem.setPermission(new Path(outputPath+"/col_fam_folder"),FsPermission.valueOf("drwxrwxrwx"));
Both of these steps seem to be workarounds, and I need a correct solution before moving to production. Thanks in advance.
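For reference, the same permission workaround can be applied recursively to everything the job writes under the output folder, including the dynamically created column-family folders. The sketch below is only an illustration of that idea; the class and method names are placeholders, not code from the question or from HBase.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public final class HFilePermissions {
    // Recursively opens up the job output directory so the HBase user can traverse and read it.
    public static void openPermissions(FileSystem fs, Path dir) throws IOException {
        fs.setPermission(dir, FsPermission.valueOf("drwxrwxrwx"));
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                // column-family folders created at runtime end up here
                openPermissions(fs, status.getPath());
            } else {
                fs.setPermission(status.getPath(), FsPermission.valueOf("-rwxrwxrwx"));
            }
        }
    }
}

It is still the same workaround, though; the secure bulk load approach described further below is the cleaner path to production.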
Try this:
System.setProperty("HADOOP_USER_NAME", "hadoop");
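A hedged sketch of where that property would go: it needs to be set before the first Hadoop Configuration, FileSystem or HBase connection is created, since the login user is resolved on first use and then cached. The driver class name and the "hbase" value below are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        // Must run before any Hadoop/HBase client object is instantiated
        System.setProperty("HADOOP_USER_NAME", "hbase");

        Configuration config = HBaseConfiguration.create();
        FileSystem fileSystem = FileSystem.get(config);
        // ... configure the MapReduce job, write the HFiles, then call
        // LoadIncrementalHFiles#doBulkLoad as shown in the question.
    }
}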
Secure bulk load seems to be an appropriate answer. This thread explains a sample implementation; the snippet is copied below.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.SecureBulkLoadClient;
import org.apache.hadoop.hbase.security.UserProvider;
import org.apache.hadoop.hbase.security.token.FsDelegationToken;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.security.UserGroupInformation;

String keyTab = "pathtokeytabfile";
String tableName = "tb_name";
String pathToHFile = "/tmp/tmpfiles/";

Configuration configuration = new Configuration();
configuration.set("hbase.zookeeper.quorum", "ZK_QUORUM");
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.master", "MASTER:60000");
configuration.set("hadoop.security.authentication", "Kerberos");
configuration.set("hbase.security.authentication", "kerberos");

// Obtaining kerberos authentication
UserGroupInformation.setConfiguration(configuration);
// first argument is the kerberos principal, second is the path to the keytab file
UserGroupInformation.loginUserFromKeytab("kerberos principal here", keyTab);

HBaseAdmin.checkHBaseAvailable(configuration);
System.out.println("HBase is running!");

HBaseConfiguration.addHbaseResources(configuration);
Connection conn = ConnectionFactory.createConnection(configuration);
Table table = conn.getTable(TableName.valueOf(tableName));
HRegionInfo tbInfo = new HRegionInfo(table.getName());

// path to the HFiles that need to be loaded
Path hfofDir = new Path(pathToHFile);

// acquiring user token for authentication
UserProvider up = UserProvider.instantiate(configuration);
FsDelegationToken fsDelegationToken = new FsDelegationToken(up, "name of the key tab user");
fsDelegationToken.acquireDelegationToken(hfofDir.getFileSystem(configuration));

// preparing for the bulk load
SecureBulkLoadClient secureBulkLoadClient = new SecureBulkLoadClient(table);
String bulkToken = secureBulkLoadClient.prepareBulkLoad(table.getName());
System.out.println(bulkToken);

// creating the family list (list of family names and path to the hfile corresponding to the family name)
final List<Pair<byte[], String>> famPaths = new ArrayList<>();
Pair p = new Pair();
// name of the family
p.setFirst("nameofthefamily".getBytes());
// path to the HFile (HFiles are organized in folders with the name of the family)
p.setSecond("/tmp/tmpfiles/INTRO/nameofthefilehere");
famPaths.add(p);

// bulk loading, using the secure bulk load client
secureBulkLoadClient.bulkLoadHFiles(famPaths, fsDelegationToken.getUserToken(), bulkToken, tbInfo.getStartKey());
System.out.println("Bulk Load Completed..");

Interpreting output from mahout clusterdumper

I ran a clustering test on crawled pages (more than 25K docs; personal data set).
I've done a clusterdump:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt
The output after running the cluster dumper shows 25 elements "VL-xxxxx {}":
VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}
How do I interpret this output?
In short: I am looking for the document ids which belong to a particular cluster.
What is the meaning of:
VL-x?
n=y c=[z:z', ...]
r=[z'':z''', ...]
Does 0:0.017 mean that "0" is the document id which belongs to this cluster?
I have already read on the Mahout wiki pages what CL, n, c and r mean, but can someone please explain them to me better, or point to a resource where they are explained in a bit more detail?
Sorry if I am asking some stupid questions, but I am a newbie with Apache Mahout and am using it as part of my course assignment for clustering.
By default, k-means clustering uses WeightedVector, which does not include the data point name. So you will want to make the sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the mapping tasks, so if your mapping capacity is 12, you want to chop your data into 12 pieces when making the seqfiles.
NamedVector:
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality), arrField[0]);
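A hedged sketch of how such a sequence file might be written so that the document id survives into clusteredPoints. The path, cardinality and document id below are made-up placeholders; Text keys and VectorWritable values are the usual input format for Mahout k-means.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class NamedVectorWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("vectors/part-00000");               // placeholder output path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
            int cardinality = 1000;                              // placeholder dimensionality
            Vector v = new SequentialAccessSparseVector(cardinality);
            v.set(42, 0.017);                                    // termIndex -> weight entries
            NamedVector named = new NamedVector(v, "doc-0001");  // the document id becomes the vector name
            writer.append(new Text(named.getName()), new VectorWritable(named));
        } finally {
            writer.close();
        }
    }
}

With input prepared this way, the cast to NamedVector in the ClusterOutput code below succeeds and the document name can be printed next to its cluster id.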
Basically you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster point membership.
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.NamedVector;

public class ClusterOutput {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            BufferedWriter bw;
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            File pointsFolder = new File(args[0]);
            File files[] = pointsFolder.listFiles();
            bw = new BufferedWriter(new FileWriter(new File(args[1])));

            HashMap<String, Integer> clusterIds;
            clusterIds = new HashMap<String, Integer>(5000);

            for (File file : files) {
                if (file.getName().indexOf("part-m") < 0)
                    continue;

                SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                IntWritable key = new IntWritable();
                WeightedVectorWritable value = new WeightedVectorWritable();

                while (reader.next(key, value)) {
                    NamedVector vector = (NamedVector) value.getVector();
                    String vectorName = vector.getName();
                    bw.write(vectorName + "\t" + key.toString() + "\n");
                    if (clusterIds.containsKey(key.toString())) {
                        clusterIds.put(key.toString(), clusterIds.get(key.toString()) + 1);
                    } else {
                        clusterIds.put(key.toString(), 1);
                    }
                }
                bw.flush();
                reader.close();
            }
            bw.flush();
            bw.close();

            bw = new BufferedWriter(new FileWriter(new File(args[2])));
            Set<String> keys = clusterIds.keySet();
            for (String key : keys) {
                bw.write(key + " " + clusterIds.get(key) + "\n");
            }
            bw.flush();
            bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
To complete the answer:
VL-x: is the identifier of the cluster
n=y: is the number of elements in the cluster
c=[z, ...]: is the centroid of the cluster, with the z's being the weights of the different dimensions
r=[z, ...]: is the radius of the cluster.
For example, in the VL-24130 line above, n=1312 means that cluster contains 1312 points, and 0:0.017 in c says that dimension 0 of the centroid has weight 0.017 (a dimension index, not a document id).
More info here:
https://mahout.apache.org/users/clustering/cluster-dumper.html
I think you need to read the source code -- download from http://mahout.apache.org. VL-24130 is just a cluster identifier for a converged cluster.
You can use mahout clusterdump
https://cwiki.apache.org/MAHOUT/cluster-dumper.html
