LeaseExpiredException with custom UDF in Hive - hadoop

I have a Hive UDF that is supposed to extract the device from a UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import ua_parser.Client;
import ua_parser.Parser;

public class DeviceTypeExtractTest extends UDF {

    private final Text result = new Text();

    // The parser is expensive to build, so create it once per JVM.
    private static final Parser uaParser;
    static {
        try {
            uaParser = new Parser();
        } catch (IOException e) {
            throw new RuntimeException("Could not instantiate User-Agent parser.", e);
        }
    }

    public Text evaluate(Text uaField) {
        if (uaField == null) {
            return null;
        }
        try {
            String uaString = uaField.toString();
            Client client = uaParser.parse(uaString);
            result.set(client.device.family);
            return result;
        } catch (Exception e) {
            return null;
        }
    }
}
And it works just fine when run on a small dataset:

create table categories (cat string);

insert overwrite table categories
select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this UDF and use some built-in Hive UDFs just for testing, this behavior does not occur.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried scaling the cluster up to better hardware, but the problem persists with every hardware configuration.
Any ideas?

Okay, scratch that. It seems that after upgrading my instance, things started to move; I was just not waiting long enough for the mapping to happen. And the LeaseExpiredException was actually thrown because of little ol' me killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.
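One thing I'm considering (a sketch only, not tested at scale): user-agent strings in web logs repeat heavily, so memoizing the parsed device family should skip most of the regex work. The 10,000-entry cache bound is a guess:

// Hypothetical tweak: memoize parse results, since UA strings repeat heavily in logs.
// A bounded LRU map keeps memory in check; the 10,000-entry limit is an assumption.
private static final int CACHE_SIZE = 10000;
private static final Map<String, String> deviceCache =
        new LinkedHashMap<String, String>(CACHE_SIZE, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > CACHE_SIZE;
            }
        };

public Text evaluate(Text uaField) {
    if (uaField == null) {
        return null;
    }
    String uaString = uaField.toString();
    String device = deviceCache.get(uaString);
    if (device == null) {
        try {
            device = uaParser.parse(uaString).device.family;
        } catch (Exception e) {
            return null;
        }
        deviceCache.put(uaString, device);
    }
    result.set(device);
    return result;
}

(Each Hive task here runs the UDF single-threaded in its own JVM, so the unsynchronized map should be safe; if the UDF were ever shared across threads, the cache would need synchronization.)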

Related

Benchmarking spring data vs JDBI in select from postgres Database

I wanted to compare the performance of Spring Data vs JDBI.
I used the following versions:
Spring Boot 2.2.4.RELEASE
vs
JDBI 3.13.0
The test is fairly simple: select * from the admin table and convert the result to a list of Admin objects.
Here are the relevant details.
With Spring Boot:
public interface AdminService extends JpaRepository<Admin, Integer> {
}
and for JDBI
public List<Admin> getAdmins() {
    String sql = "Select admin_id as adminId, username from admins";
    Handle handle = null;
    try {
        handle = Sql2oConnection.getInstance().getJdbi().open();
        return handle.createQuery(sql).mapToBean(Admin.class).list();
    } catch (Exception ex) {
        log.error("Could not select admins from admins: {}", ex.getMessage(), ex);
        return null;
    } finally {
        if (handle != null) { // guard against an NPE when open() itself fails
            handle.close();
        }
    }
}
The test class is executed using JUnit 5:

@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
    System.out.println("Running load test");
    Instant start = Instant.now();
    for (int i = 0; i < 1000; i++) {
        // for Spring Data this is adminService.findAll()
        List<Admin> admins = adminService.getAdmins();
        for (Admin admin : admins) {
            if (admin.getAdminId() == 654) {
                System.out.println("just to simulate work with the data");
            }
        }
    }
    Instant end = Instant.now();
    Duration duration = Duration.between(start, end);
    System.out.println("Total duration: " + duration.getSeconds());
}
I was quite shocked to get the following results:
Spring Data: 2 seconds
JDBI: 59 seconds
Any idea why I got these results? I was expecting JDBI to be faster.
The issue was that Spring manages the connection life cycle for us, and for good reason. From the JDBI docs:
There is a performance penalty every time a connection is allocated
and released. In the example above, the two insertFullContact
operations take separate Connection objects from your database
connection pool.
I changed the JDBI test code to the following:
@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
    System.out.println("Running load test");
    String sql = "Select admin_id as adminId, username from admins";
    Handle handle = Sql2oConnection.getInstance().getJdbi().open();
    Instant start = Instant.now();
    for (int i = 0; i < 1000; i++) {
        List<Admin> admins = handle.createQuery(sql).mapToBean(Admin.class).list();
        if (!admins.isEmpty()) {
            for (Admin admin : admins) {
                System.out.println(admin.getUsername());
            }
        }
    }
    handle.close();
    Instant end = Instant.now();
    Duration duration = Duration.between(start, end);
    System.out.println("Total duration: " + duration.getSeconds());
}
This way the connection is opened once and the query runs 1000 times.
The final result was 1 second, twice as fast as Spring Data.
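A tidier way to scope that single connection is try-with-resources, since a JDBI Handle is Closeable; a sketch:

try (Handle handle = Sql2oConnection.getInstance().getJdbi().open()) {
    for (int i = 0; i < 1000; i++) {
        List<Admin> admins = handle.createQuery(sql).mapToBean(Admin.class).list();
        // ... work with admins ...
    }
} // handle.close() happens automatically, even if a query throws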
On the one hand you seem to be making some basic benchmarking mistakes:
You are not warming up the JVM.
You are not using the results in any way.
Therefore what you are seeing might just be the effect of different VM optimisations.
Look into JMH in order to improve your benchmarks.
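As a rough illustration of what that would look like (a sketch only; AdminQueryBenchmark is a made-up name, the state setup is left open, and it assumes jmh-core on the classpath):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)       // let the JIT settle before measuring
@Measurement(iterations = 10)
@Fork(1)
public class AdminQueryBenchmark {

    private AdminService adminService; // wire up in setup()

    @Setup
    public void setup() {
        // TODO bootstrap the Spring context or JDBI handle here
    }

    @Benchmark
    public void findAllAdmins(Blackhole blackhole) {
        // Handing the result to the Blackhole prevents the JVM from
        // eliminating the whole call as dead code.
        blackhole.consume(adminService.findAll());
    }
}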
Benchmarks involving an external resource are extra hard, because you have many more parameters to control.
One big question, for example, is whether the connection to the database is realistically slow: in most production systems the database will be on a different machine, at least virtually, and quite possibly on different hardware. Is that true in your test as well?
Assuming your results are real, the next step is to investigate where the extra time gets spent.
I would expect most of the time to be spent executing the SQL statements and obtaining the results via the network.
Therefore you should inspect what SQL statements actually get executed.
That might point you to one possible answer: JPA does lots of lazy loading and may not even have loaded most of the data you actually need.

Kafka state store not available in distributed environment

I have a business application with the following versions:
spring boot (2.2.0.RELEASE), spring-kafka (2.3.1.RELEASE)
spring-cloud-stream-binder-kafka (2.2.1.RELEASE)
spring-cloud-stream-binder-kafka-core (3.0.3.RELEASE)
spring-cloud-stream-binder-kafka-streams (3.0.3.RELEASE)
We have around 20 batches. Each batch uses 6-7 topics to handle the business. Each service has its own state store to maintain the status of the batch, i.e. whether it is running or idle.
We use the code below to query the store:
@Autowired
private InteractiveQueryService interactiveQueryService;

public ReadOnlyKeyValueStore<String, String> fetchKeyValueStoreBy(String storeName) {
    while (true) {
        try {
            log.info("Waiting for state store");
            return new ReadOnlyKeyValueStoreWrapper<>(interactiveQueryService.getQueryableStore(storeName,
                    QueryableStoreTypes.<String, String> keyValueStore()));
        } catch (final IllegalStateException e) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
}
When deploying the application on one instance (a Linux machine) everything works fine. When deploying the application on two instances we made the following observations:
The state store is available on one instance and the other doesn't have it.
When the request is processed by the instance that has the state store, everything is fine.
If the request lands on the instance that does not have the state store, the application waits in the while loop indefinitely (above code snippet).
While the instance without the store is waiting indefinitely, if we kill the other instance the above code returns the store and processing continues perfectly.
No clue what we are missing.
When you have multiple Kafka Streams processors running with interactive queries, the code that you showed above will not work the way you expect. It only returns results if the keys that you are querying are on the same server. In order to fix this, you need to add the property spring.cloud.stream.kafka.streams.binder.configuration.application.server: <server>:<port> on each instance. Make sure to change the server and port to the correct values for each server. Then you have to write code similar to the following:
org.apache.kafka.streams.state.HostInfo hostInfo =
        interactiveQueryService.getHostInfo("store-name", key, keySerializer);

if (interactiveQueryService.getCurrentHostInfo().equals(hostInfo)) {
    // query from the store that is locally available
} else {
    // query from the remote host
}
Please see the reference docs for more information.
There is sample code that demonstrates this.
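For the remote branch, one common pattern (an assumption here, not something the binder does for you) is to expose the local store over a small REST endpoint on each instance and call it from the others. A sketch, where the /store/{storeName}/{key} endpoint is hypothetical:

String value;
if (interactiveQueryService.getCurrentHostInfo().equals(hostInfo)) {
    // local: query the store directly
    ReadOnlyKeyValueStore<String, String> store =
            interactiveQueryService.getQueryableStore("store-name",
                    QueryableStoreTypes.<String, String> keyValueStore());
    value = store.get(key);
} else {
    // remote: ask the instance that actually hosts the key
    String url = String.format("http://%s:%d/store/%s/%s",
            hostInfo.host(), hostInfo.port(), "store-name", key);
    value = new RestTemplate().getForObject(url, String.class);
}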

Spring data Page/Pageable returns duplicates on large data sets?

When operating on large data sets, Spring Data presents two abstractions: Stream and Page. We've been using Stream for a while with no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
@Entity
public class MyData {
}

public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}

@Component
public class MyDataService {
    private MyDataRepository repository;

    // Bridge between a Reactive service and a transactional / non-reactive database call
    @Transactional
    public void getAllMyData(final FluxSink<MyData> sink) {
        final Pageable firstPage = PageRequest.of(0, 500);
        Page<MyData> page = repository.findAll(firstPage);
        while (page != null && page.hasContent()) {
            page.getContent().forEach(sink::next);
            if (page.hasNext()) {
                page = repository.findAll(page.nextPageable());
            } else {
                page = null;
            }
        }
        sink.complete();
    }
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had a far smaller row count than the source.
Run as a Spring Boot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This led me to a fix that others had already run into, where you should provide a Sort argument to the page request.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process some rows multiple times (going from losing about half the rows to only seeing ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have any explanation for what is going on? I don't seem to have any duplicates come through until the test has been running for at least 10-15min, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client, or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further heh.
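For what it's worth, one alternative I've been looking at is keyset (seek) pagination, which pages on the primary key instead of an offset, so it doesn't depend on the database returning a stable row order across queries. A sketch of the idea (the derived query method and getId() accessor are assumptions, not my actual code):

public interface MyDataRepository extends JpaRepository<MyData, UUID> {
    // Keyset pagination: fetch the next chunk strictly after the last seen id.
    List<MyData> findTop500ByIdGreaterThanOrderByIdAsc(UUID lastSeenId);
}

// Usage: seed with the all-zero UUID, then keep seeking forward.
List<MyData> chunk = repository.findTop500ByIdGreaterThanOrderByIdAsc(new UUID(0L, 0L));
while (!chunk.isEmpty()) {
    chunk.forEach(sink::next);
    UUID lastSeenId = chunk.get(chunk.size() - 1).getId();
    chunk = repository.findTop500ByIdGreaterThanOrderByIdAsc(lastSeenId);
}
sink.complete();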

Integrate key-value database with Spark

I'm having trouble understanding how Spark interacts with storage.
I would like to make a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at this moment, the best I can do is fetch the whole dataset from the database into memory in each of the cluster nodes (into a map for example) and build an RDD from that object.
What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop InputFormats and RecordReaders, but I'm not completely grasping what I should implement.
I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.
Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair is a bean class representing one key-value pair from your key-value store.
Classes

/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {
    public KeyValueIterator() {
        // TODO initialize your custom reader using the java client library
    }

    @Override
    public boolean hasNext() {
        // TODO
    }

    @Override
    public KeyValuePair next() {
        // TODO
    }
}

class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {
    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the empty 'keyValuePair' object
        return new KeyValueIterator();
    }
}
Create KeyValue RDD

/* list with a dummy KeyValuePair instance */
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);

/* Read one key-value pair at a time lazily */
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());

Note:
The above solution creates an RDD with two partitions by default (one of them will be empty). Increase the partitions before applying any transformation on keyValuePairRDD to distribute the processing across executors.
Different ways to increase partitions:
keyValuePairRDD.repartition(partitionCounts)
// OR (for pair RDDs only)
keyValuePairRDD.partitionBy(...)
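Putting it together, a usage sketch (partitionCounts is a placeholder, and getKey()/getValue() on KeyValuePair are assumed accessors):

int partitionCounts = 16; // assumption: tune to your executor count
JavaRDD<KeyValuePair> distributed = keyValuePairRDD.repartition(partitionCounts);

// Convert to a pair RDD if you need key-based operations such as partitionBy:
JavaPairRDD<String, String> pairs = distributed.mapToPair(
        kv -> new Tuple2<>(kv.getKey(), kv.getValue()));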

Twitter crawler: why does the memory grow?

I have been trying to crawl Twitter via the Streaming API and by filtering the retrieved tweets by keywords/hashtags/users.
Here is my example using HBC (although the same problem happens with Twitter4J):
// After connection:
final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
StatusesFilterEndpoint filterQuery = new StatusesFilterEndpoint();
filterQuery.followings(myListOfUserIDs);
filterQuery.trackTerms(myListOfKeywordsAndHashtags);

// Pool sized to match the number of analyzer threads.
final ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS);
Runnable tweetAnalyzer = defineRunnable(queue);
for (int i = 0; i < NUM_THREADS; i++)
    executor.execute(tweetAnalyzer);
where the analyzer tweetAnalyzer is returned by:
private Runnable defineRunnable(final BlockingQueue<String> queue) {
    return new Runnable() {
        @Override
        public void run() {
            while (true) {
                try {
                    System.out.println(queue.take());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    };
}
However, the process keeps growing in memory.
Two questions:
How do I design this crawler properly, so that it does not grow in memory and does not saturate the RAM?
How do I select the best queue length (here set to 10000) so that it does not saturate? With this length the queue is always full of tweets (it never goes empty) and I am able to crawl 700 tweets/min, which is a lot.
Thank you in advance.
It's a bit hard to determine from the snippets that you provide. Do you register StatusesFilterEndpoint correctly?
I would recommend that you write a separate thread to monitor the size of the queue.
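Something along these lines would do (a sketch; the one-second sampling interval is arbitrary):

// Periodically log the queue depth so you can see whether the
// consumers ever catch up with the producer.
ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
monitor.scheduleAtFixedRate(
        () -> System.out.println("queue size: " + queue.size()),
        0, 1, TimeUnit.SECONDS);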
Obviously you are not able to process all the Twitter messages you download. So you can only:
reduce the number of tweets you download by filtering more aggressively
sample the input by throwing away every n-th message (see the sketch after this list)
use a faster machine, although for the tweetAnalyzer you show in the question this might not help
deploy on a cluster
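A sampling consumer could look roughly like this (a sketch; it keeps 1 in n messages, with n as a tuning knob):

private Runnable defineSamplingRunnable(final BlockingQueue<String> queue, final int n) {
    final AtomicLong seen = new AtomicLong();
    return new Runnable() {
        @Override
        public void run() {
            while (true) {
                try {
                    String tweet = queue.take();
                    // Keep only every n-th message; drop the rest.
                    if (seen.incrementAndGet() % n == 0) {
                        System.out.println(tweet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    };
}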
