.cartesian() in PySpark

.cartesian() in PySpark - hadoop

I created rdd = sc.parallelize(range(200)). Then I set rdd2 = rdd.cartesian(rdd). I found that as expected rdd2.count() was 40,000. However, when I set rdd3 = rdd2.cartesian(rdd), rdd3.count() was less than 20,000. Why is this the case?

This is a bug tracked by SPARK-16589.

Related

Very slow connection to Snowflake from Databricks

I am trying to connect to Snowflake using R in databricks, my connection works and I can make queries and retrieve data successfully, however my problem is that it can take more than 25 minutes to simply connect, but once connected all my queries are quick thereafter.
I am using the sparklyr function 'spark_read_source', which looks like this:
query<- spark_read_source(
sc = sc,
name = "query_tbl",
memory = FALSE,
overwrite = TRUE,
source = "snowflake",
options = append(sf_options, client_Q)
)
where 'sf_options' are a list of connection parameters which look similar to this;
sf_options <- list(
sfUrl = "https://<my_account>.snowflakecomputing.com",
sfUser = "<my_user>",
sfPassword = "<my_pass>",
sfDatabase = "<my_database>",
sfSchema = "<my_schema>",
sfWarehouse = "<my_warehouse>",
sfRole = "<my_role>"
)
and my query is a string appended to the 'options' arguement e.g.
client_Q <- 'SELECT * FROM <my_database>.<my_schema>.<my_table>'
I can't understand why it is taking so long, if I run the same query from RStudio using a local spark instance and 'dbGetQuery', it is instant.
Is spark_read_source the problem? Is it an issue between Snowflake and Databricks? Or something else? Any help would be great. Thanks.

runGdal start date issue

Upon running the following line of code, the download is starting from 2019001 instead of 2015001. How can this be fixed?
runGdal(product= "MOD13Q1",begin = "2015001",end = "2020366",tileH = h,tileV = v,SDSstring = B,
job = "Quality", outProj="GEO", datum="WGS84")

Changing the output path to a new/different folder under MODISoptions solves this issue.

asyncronously pulling large data from pyodbc using fetchmany

I am trying to pull a large dataset from pyodbc. My code below works ok, but it is serial, hence slow. I want to make it able to initiate multiple IO calls asynchronously. I see many examples using asyncio - but cannot find anything i can use with fetchmany. I appreciate any suggestions! I attempted to pool using asyncio but couldn't make it work.
conn = pyodbc.connect('DSN=Denodo Interfaces')
cursor = conn.cursor()
strng = strng.replace('myWellName', well_name)
cursor.execute(strng)
cols = [column[0] for column in cursor.description]
mylist=[]
while True:
rows = cursor.fetchmany(10000)
if not rows:
break
df = pd.DataFrame([tuple(t) for t in rows], columns = cols)
mylist.append(df)
df = pd.concat(mylist, axis=0).reset_index(drop=True)

How to use Hadoop's MapFileOutputFormat in Flink?

I've got stuck while I'm writing a program using Apache Flink. The problem is that I'm trying to generate Hadoop's MapFile as a result of computation but Scala compiler complains about type mismatch.
To illustrate the problem, let me show you the below code snippet which tries to generate two kinds of output: one is Hadoop's SequenceFile and the other is MapFile.
val dataSet: DataSet[(IntWritable, BytesWritable)] =
env.readSequenceFile(classOf[Text], classOf[BytesWritable], inputSequenceFile.toString)
.map(mapper(_))
.partitionCustom(partitioner, 0)
.sortPartition(0, Order.ASCENDING)
val seqOF = new HadoopOutputFormat(
new SequenceFileOutputFormat[IntWritable, BytesWritable](), Job.getInstance(hadoopConf)
)
val mapfileOF = new HadoopOutputFormat(
new MapFileOutputFormat(), Job.getInstance(hadoopConf)
)
val dataSink1 = dataSet.output(seqOF) // it typechecks!
val dataSink2 = dataSet.output(mapfileOF) // syntax error
As commented above, dataSet.output(mapfileOF) causes Scala compiler to complain as follows:
FYI, compared to SequenceFile, MapFile calls for a stronger condition that a key must be WritableComparable.
Before writing the application using Flink, I implemented it using Spark as below and it worked okay (no compilation error and it runs okay without any error).
val rdd = sc
.sequenceFile(inputSequenceFile.toString, classOf[Text], classOf[BytesWritable])
.map(mapper(_))
.repartitionAndSortWithinPartitions(partitioner)
rdd.saveAsNewAPIHadoopFile(
outputPath.toString,
classOf[IntWritable],
classOf[BytesWritable],
classOf[MapFileOutputFormat]
)

Did you check: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/hadoop_compatibility.html#using-hadoop-outputformats
It contains the following example:
// Obtain your result to emit.
val hadoopResult: DataSet[(Text, IntWritable)] = [...]
val hadoopOF = new HadoopOutputFormat[Text,IntWritable](
new TextOutputFormat[Text, IntWritable],
new JobConf)
hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))
hadoopResult.output(hadoopOF)

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me to write export to CSV feature in Spark SQL.

You can use below statement to write the contents of dataframe in CSV format
df.write.csv("/data/home/csv")
If you need to write the whole dataframe into a single CSV file, then use
df.coalesce(1).write.csv("/data/home/sample.csv")
For spark 1.x, you can use spark-csv to write the results into CSV files
Below scala snippet would help
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")

Since Spark 2.X spark-csv is integrated as native datasource. Therefore, the necessary statement simplifies to (windows)
df.write
.option("header", "true")
.csv("file:///C:/out.csv")
or UNIX
df.write
.option("header", "true")
.csv("/var/out.csv")
Notice: as the comments say, it is creating the directory by that name with the partitions in it, not a standard CSV file. This, however, is most likely what you want since otherwise your either crashing your driver (out of RAM) or you could be working with a non distributed environment.

The answer above with spark-csv is correct but there is an issue - the library creates several files based on the data frame partitioning. And this is not what we usually need. So, you can combine all partitions to one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and rename the output of the lib (name "part-00000") to a desire filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/

The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that)
df.map(r=>r.mkString(",")) would do the same
if you want CSV escaping you can use apache commons lang for that. e.g. here's the code we're using
def DfToTextFile(path: String,
df: DataFrame,
delimiter: String = ",",
csvEscape: Boolean = true,
partitions: Int = 1,
compress: Boolean = true,
header: Option[String] = None,
maxColumnLength: Option[Int] = None) = {
def trimColumnLength(c: String) = {
val col = maxColumnLength match {
case None => c
case Some(len: Int) => c.take(len)
}
if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
}
def rowToString(r: Row) = {
val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") //remove control characters
st.split("~-~").map(trimColumnLength).mkString(delimiter)
}
def addHeader(r: RDD[String]) = {
val rdd = for (h <- header;
if partitions == 1; //headers only supported for single partitions
tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
rdd.getOrElse(r)
}
val rdd = df.map(rowToString).repartition(partitions)
val headerRdd = addHeader(rdd)
if (compress)
headerRdd.saveAsTextFile(path, classOf[GzipCodec])
else
headerRdd.saveAsTextFile(path)
}

With the help of spark-csv we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header","true").save("output.csv")`

The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.

enter code here IN DATAFRAME:
val p=spark.read.format("csv").options(Map("header"->"true","delimiter"->"^")).load("filename.csv")

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

.cartesian() in PySpark - hadoop

I created rdd = sc.parallelize(range(200)). Then I set rdd2 = rdd.cartesian(rdd). I found that as expected rdd2.count() was 40,000. However, when I set rdd3 = rdd2.cartesian(rdd), rdd3.count() was less than 20,000. Why is this the case?

This is a bug tracked by SPARK-16589.

Related

Very slow connection to Snowflake from Databricks

runGdal start date issue

asyncronously pulling large data from pyodbc using fetchmany

How to use Hadoop's MapFileOutputFormat in Flink?

How to export data from Spark SQL to CSV

Categories

Resources