Spark RDD Operations - hadoop

Let's assume I have a table of two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.
Input Table:
COLUMN A    COLUMN B
56          Walmart
72          Flipkart
96          Amazon
100         AliExpress
Output Table:
COLUMN A    COLUMN B
100         AliExpress
This is what I have tried so far.
Source code:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
From the above code I can fetch only one column's data. Is there any way to get both columns? Any suggestions? Thanks in advance.

You can use either the top or the takeOrdered function to achieve it.
rdd.top(1) // gives you the top element in your RDD
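Since the question asks for a JavaRDD-only solution, here is a minimal sketch of the takeOrdered route in Java. It reuses the file path from the question; the MaxOfColumnA class name and the header-skipping filter are assumptions for illustration.
import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class MaxOfColumnA {

    // The comparator must be Serializable because Spark ships it to the executors.
    static class ByColumnADesc implements Comparator<Tuple2<Integer, String>>, Serializable {
        @Override
        public int compare(Tuple2<Integer, String> left, Tuple2<Integer, String> right) {
            return right._1().compareTo(left._1());   // descending on column A
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");

        // Keep both columns by mapping each line to a (columnA, columnB) pair;
        // the startsWith check skips the header row shown in the question.
        JavaRDD<Tuple2<Integer, String>> rows = diskfile
                .filter(line -> !line.startsWith("COLUMN"))
                .map(line -> {
                    String[] cols = line.split(",");
                    return new Tuple2<>(Integer.parseInt(cols[0].trim()), cols[1].trim());
                });

        // takeOrdered(1, comparator) returns the single row with the largest column A.
        List<Tuple2<Integer, String>> top = rows.takeOrdered(1, new ByColumnADesc());
        System.out.println(top.get(0)._1() + "," + top.get(0)._2());   // 100,AliExpress

        sc.close();
    }
}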

Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating a DataFrame using Spark 2:
val df = sqlContext.read.option("header", "true")
.option("inferSchema", "true")
.csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using DataFrame functions
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
.map(row => (row(0).toString.toInt, row(1)))
.sortByKey(false)
.take(1).foreach(println)
OUTPUT:
(100,AliExpress)
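For the original JavaRDD requirement, here is a hypothetical Java translation of that RDD variant, reusing the diskfile RDD from the question (it needs org.apache.spark.api.java.JavaPairRDD and scala.Tuple2; the header filter is again an assumption):
// Parse each line into a (columnA, columnB) pair RDD.
JavaPairRDD<Integer, String> pairs = diskfile
        .filter(line -> !line.startsWith("COLUMN"))
        .mapToPair(line -> {
            String[] cols = line.split(",");
            return new Tuple2<>(Integer.parseInt(cols[0].trim()), cols[1].trim());
        });

// sortByKey(false) sorts descending on column A; take(1) pulls the max row to the driver.
pairs.sortByKey(false).take(1).forEach(System.out::println);   // (100,AliExpress)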

Related

Issues returning a PySpark DataFrame using a for loop

I am applying a for loop in PySpark. How can I get the actual values in the DataFrame? I am doing DataFrame joins and filtering too.
I haven't added the dataset here; I just need the approach or pseudocode to figure out what I am doing wrong.
Help is really appreciated, I have been stuck on this for a long time.
values1 = values.collect()
temp1 = []
for index, row in enumerate(sorted(values1, key=lambda x: x.w_vote, reverse=False)):
    tmp = data_int.filter(data_int.w_vote >= row.w_vote)
    # Left join service types to results
    it1 = dt.join(master_info, dt.value == master_info.value, 'left').drop(dt.value)
    print(tmp)
    it1 = it1.withcolumn('iteration', F.lit('index')).otherwise(it1.iteration1)
    it1 = it1.collect()[index]
    # concatenate the results to the final hh list
    temp1.append(it1)
    print('iterations left:', total_values - (index + 1), "Threshold:", row.w_vote)
The problem I am facing is that the output comes out as below:
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 240 Threshold: 0.1
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 239 Threshold: 0.2
Why are my actual values not getting displayed in the output as a list?
print applied to a DataFrame executes the DataFrame's __repr__ method, which is what you are seeing. If you want to print the contents of the DataFrame, use either show to display the first 20 rows, or collect to bring the full DataFrame back to the driver.

Why Spark DataFrame Repartition not working correctly

Spark 1.6.2 HDP 2.5.2
I am using Spark SQL to fetch data from a Hive table and then repartitioning on a particular column "serial" with 100 partitions, but Spark does not repartition the data into 100 partitions (visible as the number of tasks in the Spark UI); instead it has 126 tasks.
val data = sqlContext.sql("""select * from default.tbl_orc_zlib""")
val filteredData = data.filter( data("day").isNotNull ) // NULL check
//Repartition on serial column with 100 partitions
val repartData = filteredData.repartition(100,filteredData("serial"))
val repartSortData = repartData.sortWithinPartitions("serial","linenr")
val mappedData = repartSortData.map(s => s.mkString("\t"))
val res = mappedData.pipe("xyz.dll")
res.saveAsTextFile("hdfs:///../../../")
But if I use coalesce first and then repartition, the number of tasks becomes 150 (the expected 50 from coalesce plus 100 from repartition):
filteredData.coalesce(50) // works fine
Can someone please explain why this is happening?

HBase Get values where rowkey in

How do I get all the values in HBase given Rowkey values?
val tableName = "myTable"
val hConf = HBaseConfiguration.create()
val hTable = new HTable(hConf, tableName)
val theget= new Get(Bytes.toBytes("1001-A")) // rowkey values (1001-A, 1002-A, 2010-A, ...)
val result = hTable.get(theget)
val values = result.listCells()
The code above only works for one rowkey.
You can use batch operations. Please refer to the link below for the Javadoc: Batch Operations on HTable
Another approach is to scan with a start row key and end row key (the first and last row keys from a sorted, ascending set of keys). This makes more sense when there are many row keys.
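A minimal sketch of that range-scan variant with the row keys from the question; it assumes hTable is the HTable handle created as in the question, and note that it returns every row in the key range, not only the listed keys.
// imports assumed: org.apache.hadoop.hbase.client.Scan, ResultScanner, Result; org.apache.hadoop.hbase.util.Bytes
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("1001-A"));
scan.setStopRow(Bytes.toBytes("2010-A\0"));   // stop row is exclusive; the trailing \0 keeps "2010-A"

ResultScanner scanner = hTable.getScanner(scan);
try {
    for (Result result : scanner) {
        System.out.println(Bytes.toString(result.getRow()));
    }
} finally {
    scanner.close();
}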
There is an htable.get method that takes a list of Gets:
List<Get> gets = ....
Result[] results = htable.get(gets);
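For completeness, a minimal sketch of that multi-Get approach with the row keys and table name from the question; java.util.ArrayList/List, org.apache.hadoop.conf.Configuration, and the HBase client imports already used above are assumed.
Configuration hConf = HBaseConfiguration.create();
HTable hTable = new HTable(hConf, "myTable");

// Build one Get per row key and fetch them all in a single round trip.
List<Get> gets = new ArrayList<Get>();
for (String rowKey : new String[] {"1001-A", "1002-A", "2010-A"}) {
    gets.add(new Get(Bytes.toBytes(rowKey)));
}

Result[] results = hTable.get(gets);   // one Result per Get, in the same order
for (Result result : results) {
    if (!result.isEmpty()) {
        System.out.println(Bytes.toString(result.getRow()) + " -> " + result.listCells());
    }
}
hTable.close();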

How to split rows of a Spark RDD by Delimiter

I am trying to split data in Spark into the form of an RDD of Array[String]. Currently I have loaded the file into an RDD of String.
val csvFile = textFile("/input/spam.csv")
I would like to split on a "," delimiter.
This:
val csvFile = textFile("/input/spam.csv").map(line => line.split(","))
returns an RDD[Array[String]].
If you need the first column as its own RDD, then use map to return only the first element of each Array:
val firstCol = csvFile.map(_(0))
You should use the spark-csv library, which can parse your file taking headers into account and lets you specify the delimiter. It also does a pretty good job of inferring the schema. I'll let you read the documentation to discover the many options at your disposal.
It may look like this:
sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","your delimitor")
.load(pathToFile)
Be aware that this returns a DataFrame, which you may have to convert to an RDD using the .rdd method.
Of course, you will have to load the package onto the driver for it to work, e.g. by passing it to spark-shell or spark-submit via the --packages option.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
import spark.implicits._ // needed for the .as[...] encoders used below
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// convert dataframe to rdd[row]
val rddRow = df.rdd
// print 2 rows
rddRow.take(2)
// convert df to rdd[string] for specific column
val oneColumn = df.select("colName").as[(String)].rdd
oneColumn.take(2)
// convert df to rdd[string] for multiple columns
val multiColumn = df.select("col1Name","col2Name").as[(String, String)].rdd
multiColumn.take(2)

HBase - get column names for row by column name prefix

I have an HBase table with the following description.
For a row key, my columns are of the form a_1, a_2, a_3, b_1, c_1, C_2 and so on, a compound key format.
Suppose one of my rows is of the form
row key - row1
column family - c1
columns - a_1, a_2, a_3, b_1, b_2, c_1, C_2, d_9, d_99
Can I, by any operation, retrieve a, b, c, d as the columns corresponding to row1? I am not bothered about the suffixes for a, b, c...
I can get all column names for a given row, add them to a set by splitting the column names on their first part, and emit the set. I am wondering whether there is a better way of doing it with filters or some other HBase mechanism; please comment...
You can use ColumnPrefixFilter for that. See the following code:
Configuration hadoopConf = new Configuration();
hadoopConf.set("hbase.zookeeper.quorum", "localhost");
hadoopConf.set("hbase.zookeeper.property.clientPort", "2181");
HTable hTable = new HTable(hadoopConf, "KunderaExamples");

Scan scan = new Scan();
scan.setFilter(new ColumnPrefixFilter("A".getBytes()));
ResultScanner scanner = hTable.getScanner(scan);
Iterator<Result> resultsIter = scanner.iterator();
while (resultsIter.hasNext())
{
    Result result = resultsIter.next();
    List<KeyValue> values = result.list();
    for (KeyValue value : values)
    {
        System.out.println(new String(value.getRow()));        // row key
        System.out.println(new String(value.getQualifier()));  // column qualifier
        System.out.println(new String(value.getValue()));      // cell value
    }
}
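If the prefixes are not known up front, the set-building approach described in the question can also be done client-side over a single row. A minimal sketch, assuming the qualifiers always use _ as the separator and reusing hTable from the code above (plus java.util.Set, java.util.TreeSet, and org.apache.hadoop.hbase.client.Get):
// Collect the distinct qualifier prefixes (a, b, c, d, ...) for one row.
Get get = new Get("row1".getBytes());
get.addFamily("c1".getBytes());
Result result = hTable.get(get);

Set<String> prefixes = new TreeSet<String>();
for (KeyValue kv : result.list()) {
    String qualifier = new String(kv.getQualifier());
    prefixes.add(qualifier.split("_", 2)[0]);   // "a_1" -> "a"
}
System.out.println(prefixes);   // e.g. [C, a, b, c, d]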

Resources