How to lazily build a cache from spark streaming data - caching

I am running a streaming job and want to build a lookup map incrementally( track unique items, filter duplicated incomings for example), initially I was thinking to keep one DataFrame in cache, and union it with new DataFrame created in each batch, something like this
items.foreachRDD((rdd: RDD[String]) => {
...
val uf = rdd.toDF
cached_df = cached_df.unionAll(uf)
cached_df.cache
cached_df.count // materialize the
...
})
My concern is that the cached_df seems remember all the lineages to previous RDDs appended from every batch iteration, in my case, if I don't care to recompute this cached RDD if it crashes, is that an overhead to maintain the growing DAG?
As an alternative, at the beginning of each batch, I load the lookup from parquet file, instead of keeping it in memory, then at the end of each batch I append the new RDD to the same parquet file:
noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
This works as expected, but is there straight forward way that keep the lookup in memory?
Thanks
Wanchun

Appending to Parquet is definitely the right approach. However you could optimize the lookup. If you are okay with the in-memory cache to be slightly delayed (that is, does not have the latest second data), then you could periodically (say, every 5 minutes) load the current "lookup" parquet table in memory (assuming it fits). And all lookup queries will lookup the latest 5 min snapshot.
You could also pipeline the loading to memory and serving of queries in different thread.

Related

exception: org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update

I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
val path = "..."
val numFiles = 16
spark.read
.format("delta")
.load(path)
.repartition(numFiles)
.write
.option("dataChange", "false")
.format("delta")
.mode("overwrite")
.save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amout of retries
It doesn't help.
Any idea how to solve the problem?
When you're streaming the data in, small files are being created (additive) and these files are being referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small files overhead by collating the data into larger files (currently 16). These large files are created alongside the small, but the change occurs when the delta log is written to. That is, transactions 0-100 make 100 small files, compaction occurs, and your new transaction tells you to now refer to the 16 large files instead. The problem is, you've already had transactions 101-110 occur from the streaming job while the compaction was occurring. After all, you're compacting ALL of your data and you essentially have a merge conflict.
The solution is is to go to the next step in the best practices and only compact select partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.

Would spark dataframe read from external source on every action?

On a spark shell I use the below code to read from a csv file
val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") //spark here is the spark session
df.show()
Assuming this displays 10 rows. If I add a new row in the csv by editing it, would calling df.show() again show the new row? If so, does it mean that the dataframe reads from an external source (in this case a csv file) on every action?
Note that I am not caching the dataframe nor I am recreating the dataframe using the spark session
After each action spark forgets about the loaded data and any intermediate variables value you used in between.
So, if you invoke 4 actions one after another, it computes everything from beginning each time.
Reason is simple, spark works by building DAG, which lets it visualize path of operation from reading of data to action, and than it executes it.
That is the reason cache and broadcast variables are there. Onus is on developer to know and cache, if they know they are going to reuse that data or dataframe N number of times.
TL;DR DataFrame is not different than RDD. You can expect that the same rules apply.
With simple plan like this the answer is yes. It will read data for every show although, if action doesn't require all data (like here0 it won't read complete file.
In general case (complex execution plans) data can accessed from the shuffle files.

Spark streaming get pre computed features

I am trying to use spark streaming to deal with some order stream, I have some previous computed features for maybe a buyer_id for order in the stream.
I need to get these features while the Spark Streaming is running.
Now, I stored the buyer_id features in a hive table and load it into and RDD and
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
another way to deal with this is maybe save the features in to a hbase table. and fire a get on every buyer_id.
which one is better ? or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each Batch Inteverval time.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program in a Broadcast variable or,even better, in a final variable. Either this, or create an RDD before the DStream and keep it as reference (which looks like what you are doing now), although remember to cache it (always if you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.

HBase Data Access performance improvement using HBase API

I am trying to scan some rows using prefix filter from the HBase table. I am on HBase 0.96.
I want to increase the throughput of each RPC call so as to reduce the number of request hitting the region.
I tried getCaching(int) and setCacheBlocks(true) on the scan object. I also tried adding resultScanner.next(int). Using all these combination I am still not able to reduce the number of RPC calls. I am still hitting HBase region for each key instead of bringing the multiple keys per RPC call.
The HBase region server/ Datanode has enough CPU and Memory allocated. Also my data is evenly distributed across different region servers. Also the data that I am bring back per key is not a lot.
I observed that when I add more data to the table the time taken for the request increases. It also increases when the number of request increases.
Thank you for your help.
R
Prefix filter is usually a performance killer because they perform full table scan, always use a start and stop row in your scans rather than prefix filter.
Scan scan = new Scan(Bytes.toBytes("prefix"),Bytes.toBytes("prefix~"));
when iterate over the Result from the ResultScanner, every iteration is an RPC call, you can call resultScanner.next(n) to get a batch of results in one go.

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Resources