HBase Data Access performance improvement using HBase API - hadoop

I am trying to scan some rows from an HBase table using a prefix filter. I am on HBase 0.96.
I want to increase the throughput of each RPC call so as to reduce the number of requests hitting the region.
I tried setCaching(int) and setCacheBlocks(true) on the Scan object. I also tried resultScanner.next(int). With all of these combinations I am still not able to reduce the number of RPC calls; I am still hitting the HBase region for each key instead of bringing back multiple keys per RPC call.
The HBase region server/DataNode has enough CPU and memory allocated, my data is evenly distributed across the region servers, and the data I bring back per key is not large.
I observed that when I add more data to the table, the time taken per request increases. It also increases when the number of requests increases.
Thank you for your help.
R

A prefix filter is usually a performance killer because it scans the whole table. Always use a start and stop row in your scans rather than a prefix filter alone:
Scan scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"));
When you iterate over the Results from the ResultScanner, every iteration is an RPC call. You can call resultScanner.next(n) to get a batch of results in one go.
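To illustrate both points together, here is a rough sketch of a scan that combines a start/stop row with scanner caching and a batched next(n) call. It is written in Scala against the standard HBase client API (the calls are the same in Java); the table name and the caching value of 500 are placeholders, not recommendations.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Scan}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")              // placeholder table name

val scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"))
scan.setCaching(500)        // rows fetched per RPC instead of the default
scan.setCacheBlocks(false)  // avoid filling the block cache for a one-off scan

val scanner = table.getScanner(scan)
try {
  var batch = scanner.next(500)                       // one call, many rows
  while (batch.nonEmpty) {
    batch.foreach(result => println(Bytes.toString(result.getRow)))
    batch = scanner.next(500)
  }
} finally {
  scanner.close()
  table.close()
}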

Related

Looking for an Equivalent of GenerateTableFetch

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts a table name as input; it does not accept custom queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on NiFi processors (by increasing the number of Concurrent Tasks), and you can also increase the throughput; sometimes it works.
Also, if you are working on a cluster, you can apply load balancing on the queue before the processor, so it will distribute the workload among the nodes of your cluster (set the load balance strategy to round robin).
Check this YouTube channel for NiFi anti-patterns (there is a video on concurrency): Nifi Notes
Please clarify your question if I didn't answer it.
Figured out an alternative way. I developed an Oracle PL/SQL function which takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows in the table, which is available as a statistic in the catalog tables. If the table has 1M rows and I want 100k rows in each batch, it will produce 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the NiFi GenerateTableFetch processor. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
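The author's function is written in PL/SQL; purely as an illustration of the same partitioning arithmetic, here is a rough sketch in Scala. The table name, row count and batch size are placeholders.
// Illustration only: build OFFSET/FETCH page queries for a table, given its
// (approximate) row count taken from catalog statistics.
def pageQueries(tableName: String, rowCount: Long, batchSize: Long): Seq[String] = {
  val pages = math.ceil(rowCount.toDouble / batchSize).toLong
  (0L until pages).map { p =>
    s"SELECT * FROM $tableName OFFSET ${p * batchSize} ROWS FETCH NEXT $batchSize ROWS ONLY"
  }
}

// pageQueries("T1", 1000000L, 100000L) yields 10 queries, one per 100k-row batch.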

Spark Streaming: get pre-computed features

I am trying to use Spark Streaming to process an order stream, and I have some previously computed features for each buyer_id that may appear in an order in the stream.
I need to get these features while the Spark Streaming job is running.
Right now, I store the buyer_id features in a Hive table, load them into an RDD, and use
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
Another way to deal with this might be to save the features into an HBase table and fire a Get for every buyer_id.
Which one is better? Or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each batch interval.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program into a Broadcast variable or, even better, into a final variable. Either that, or create an RDD before the DStream and keep it as a reference (which looks like what you are doing now), but remember to cache it (always, if you have enough space).
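As a rough sketch of that option (loading once before the streaming context starts and reusing it in every batch), assuming loadBuyerFeatures() returns a pair RDD keyed by buyer_id as in the question, and ssc is your StreamingContext:
// Sketch only: load and cache the feature RDD once, before ssc.start().
val buyerFeatures = loadBuyerFeatures()          // RDD[(buyerId, features)], as in the question
buyerFeatures.cache()
buyerFeatures.count()                            // force materialization up front

// Reuse the same cached RDD in every batch.
val enriched = orderstream.transform(rdd => rdd.leftOuterJoin(buyerFeatures))

// Alternatively, if the feature set is small, collect and broadcast it instead:
// val featuresBc = ssc.sparkContext.broadcast(buyerFeatures.collectAsMap())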
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once per partition in a foreachPartition and save the result in a local variable.
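And a rough sketch of that last case, reading once per partition rather than once per record. Here fetchFeatures and process are hypothetical helpers (a bulk lookup against your store and your downstream handling), and the stream is assumed to be keyed by buyer_id:
// Sketch only: one external lookup per partition instead of one per order.
orderstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val orders = partition.toSeq
    // hypothetical bulk lookup returning a Map, executed once for the whole partition
    val features = fetchFeatures(orders.map(_._1))
    orders.foreach { case (buyerId, order) =>
      process(order, features.get(buyerId))      // hypothetical downstream handling
    }
  }
}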

How to make a cache from a finished Spark job still accessible to other jobs?

My project implements an interactive query for users to explore data. We have a list of columns the user can choose from; the user adds columns to a list and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is: a raw log is processed by Spark and stored into Cassandra. The data is a time series with more than 20 columns and 4 metrics. Currently, because I put more than 20 dimensions into the clustering keys, writing to Cassandra is quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, provide an API to the client, and run queries against the Spark cache.
But I don't know how to keep that cached data persistent. I have tried spark-job-server, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB RAM. We estimate the data to query at about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data Spark needs to do two things, first read from Alluxio (which takes more than 1 minute) and then spill to disk for the shuffle (which costs another 2 or 3 minutes). That is over our target of under 1 minute. We tested one job on 8 CPU cores.
Storing the data in MemSQL, but it is kind of costly: one day of data costs 2 GB of RAM, and I am not sure the speed will hold up when we scale.
Querying Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not, and what I can change to achieve the goal (MySQL-like queries with a lot of GROUP BY, SUM and ORDER BY) returned to the client via an API.
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
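For illustration, a minimal sketch of that pattern on a Spark 1.x SQLContext, as provided by a long-running JobServer context. The keyspace, table name and query are placeholders, and the read assumes the spark-cassandra-connector data source:
// Sketch only: load once, register and cache, then query from any job that
// runs on the same long-lived context.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_events"))   // placeholders
  .load()

df.registerTempTable("events")
sqlContext.cacheTable("events")      // stays cached until the context shuts down

// A later job submitted to the same context:
val report = sqlContext.sql(
  "SELECT dim1, SUM(metric1) AS total FROM events GROUP BY dim1 ORDER BY total DESC")
report.show()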

How to lazily build a cache from Spark Streaming data

I am running a streaming job and want to build a lookup map incrementally (to track unique items and filter duplicate incoming records, for example). Initially I was thinking of keeping one DataFrame in cache and unioning it with the new DataFrame created in each batch, something like this:
items.foreachRDD((rdd: RDD[String]) => {
  ...
  val uf = rdd.toDF
  cached_df = cached_df.unionAll(uf)
  cached_df.cache
  cached_df.count // materialize the cache
  ...
})
My concern is that cached_df seems to remember all the lineage back to the RDDs appended in every batch iteration. In my case, if I don't care about recomputing this cached RDD when it crashes, is it an overhead to maintain the growing DAG?
As an alternative, at the beginning of each batch I load the lookup from a Parquet file instead of keeping it in memory, and at the end of each batch I append the new RDD to the same Parquet file:
noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
This works as expected, but is there a straightforward way to keep the lookup in memory?
Thanks
Wanchun
Appending to Parquet is definitely the right approach. However, you could optimize the lookup. If you are okay with the in-memory cache being slightly stale (that is, not containing the very latest data), then you could periodically (say, every 5 minutes) load the current "lookup" Parquet table into memory (assuming it fits). All lookup queries would then see the latest 5-minute snapshot.
You could also pipeline the loading into memory and the serving of queries in different threads.
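A minimal sketch of that periodic refresh, assuming a Spark 1.x sqlContext and the "lookup" Parquet path from above; the 5-minute interval and the single-threaded scheduler are arbitrary choices:
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Sketch only: keep the latest snapshot of the lookup table behind an
// AtomicReference and swap it on a background thread every few minutes.
val lookupRef = new AtomicReference(sqlContext.read.parquet("lookup").cache())
lookupRef.get().count()                    // materialize the first snapshot

val refresher = Executors.newSingleThreadScheduledExecutor()
refresher.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val refreshed = sqlContext.read.parquet("lookup").cache()
    refreshed.count()                      // materialize before swapping
    val old = lookupRef.getAndSet(refreshed)
    old.unpersist()                        // drop the stale snapshot
  }
}, 5, 5, TimeUnit.MINUTES)

// Each streaming batch then joins/filters against lookupRef.get().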

Reduce job pending in HFileOutputFormat

I am using
HBase 0.92.1-cdh4.1.2, and
Hadoop 2.0.0-cdh4.1.2.
I have a MapReduce program that loads data from HDFS into HBase using HFileOutputFormat in cluster mode.
In that MapReduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800,000-record
data set of about 7.3 GB, and it runs fine, but it does not run for a 900,000-record data set of about 8.3 GB.
In the case of the 8.3 GB data set my MapReduce program has 133 maps and one reducer; all maps complete successfully, but the reducer status stays Pending for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine up to 7.3 GB of data.
What could i be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives of that particular slave ("tracker_slave01.mydomain.com"), so the reducer apparently needs to copy all the data to a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading Chapter in the HBase book discusses this, and presents two options for doing this. This can also be done via the HBase shell (see create's SPLITS argument I think). The challenge though is defining your splits such that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
// Pre-split the new table so the bulk load is spread across many regions.
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// 100 regions, with split points spread evenly between 0 and Integer.MAX_VALUE
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);
An alternative solution would be to not use configureIncrementalLoad at all, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
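As a rough illustration of step 2, the same completebulkload step can also be invoked programmatically through LoadIncrementalHFiles (shown here in Scala; the calls are the same in Java). The HFile directory and table name are placeholders:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")                  // must already exist, pre-split
try {
  // Moves the HFiles produced by the MapReduce job into the table's regions.
  new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/user/me/hfiles"), table)
} finally {
  table.close()
}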
Your job is running with a single reducer, which means 7 GB of data is being processed in a single task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, number of reducers = number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in reducers. :)
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html
