Twitter crawler: why does the memory grow?

I have been trying to crawl Twitter via the Streaming API, filtering the retrieved tweets by keywords/hashtags/users.
Here is my example using HBC (although the same problem happens with Twitter4J):
// After connection:
final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);

StatusesFilterEndpoint filterQuery = new StatusesFilterEndpoint();
filterQuery.followings(myListOfUserIDs);
filterQuery.trackTerms(myListOfKeywordsAndHashtags);

final ExecutorService executor = Executors.newFixedThreadPool(4);

Runnable tweetAnalyzer = defineRunnable(queue);
for (int i = 0; i < NUM_THREADS; i++)
    executor.execute(tweetAnalyzer);
where the analyzer tweetAnalyzer is returned by:
private Runnable defineRunnable(final BlockingQueue<String> queue) {
    return new Runnable() {
        @Override
        public void run() {
            while (true) {
                try {
                    System.out.println(queue.take());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    };
}
However, the process continues to grow in memory.
Two questions:
How to design this crawler properly, so that it does not grow in memory and does not saturate the RAM?
How to select the best queue length (here set to 10000) so that the queue does not saturate? I have seen that with this length the queue stays full of tweets (it never goes empty), and I am able to crawl 700 tweets/min, which is huge.
Thank you in advance.

It's a bit hard to determine from the snippets that you provide. Do you register StatusesFilterEndpoint correctly?
I would recommend that you write a separate thread to monitor the size of the queue.
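A minimal sketch of such a monitor (my illustration, not from the original thread; it assumes the final queue variable from the question is in scope):

// Illustrative sketch: log the queue size every few seconds to see whether the consumers keep up.
ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
monitor.scheduleAtFixedRate(new Runnable() {
    @Override
    public void run() {
        System.out.println("Queue size: " + queue.size());
    }
}, 0, 5, TimeUnit.SECONDS);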
Obviously you are not able to process all the Twitter messages you download. So you can only:
reduce the number of tweets you download by filtering more aggressively
sample the input by throwing away every n-th message (see the sketch after this list)
use a faster machine, although for the tweetAnalyzer you show in the question this might not help
deploy on a cluster
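A minimal sketch of the sampling idea (a hypothetical variant of the defineRunnable method from the question, not code from the original answer):

// Hypothetical sampling analyzer: process only every n-th tweet and drop the rest.
private Runnable defineSamplingRunnable(final BlockingQueue<String> queue, final int n) {
    return new Runnable() {
        @Override
        public void run() {
            long seen = 0;
            while (true) {
                try {
                    String tweet = queue.take();
                    if (seen++ % n == 0) { // keep 1 out of every n tweets
                        System.out.println(tweet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    };
}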

Related

JCache Hazelcast embedded does not scale

Hello, Stackoverflow Community.
I have a Spring Boot application that uses JCache with the Hazelcast implementation as its caching framework.
Each Hazelcast node has 5 caches with the size of 50000 elements each. There are 4 Hazelcast Instances that form a cluster.
The problem that I face is the following:
I have a very heavy call that reads data from all four caches. On the initial start, when all caches are yet empty, this call takes up to 600 seconds.
When there is one Hazelcast instance running and all 5 caches are filled with data, then this call happens relatively fast, it takes on average only 4 seconds.
When I start 2 Hazelcast instances and they form a cluster, the response time gets worse, and the same call already takes 25 seconds on average.
And the more Hazelcast instances I add to the cluster, the longer the response time gets. Of course, I was expecting somewhat worse response times when data is partitioned among Hazelcast nodes in a cluster. But I did not expect that just by adding one more Hazelcast instance, the response time would get 6-7 times slower...
Please note that for simplicity and testing purposes, I just start four Spring Boot instances, each with an embedded Hazelcast node, on one machine. Therefore, such poor performance cannot be justified by network delays. I assume that this API call is so slow even with Hazelcast because much data needs to be serialized/deserialized when sent among the Hazelcast cluster nodes. Please correct me if I am wrong.
The cache data is partitioned evenly among all nodes. I was thinking about adding a near cache in order to reduce latency; however, according to the Hazelcast documentation, the near cache is not available for JCache members. In my case, because of some project requirements, I am not able to switch to JCache clients to make use of the near cache. Is there any advice on how to reduce latency in such a scenario?
Thank you in advance.
DUMMY CODE SAMPLES TO DEMONSTRATE THE PROBLEM:
Hazelcast Config: stays default, nothing is changed
Caches:
private void createCaches() {
    CacheConfiguration<?, ?> cacheConfig = new CacheConfig<>()
            .setEvictionConfig(
                    new EvictionConfig()
                            .setEvictionPolicy(EvictionPolicy.LRU)
                            .setSize(150000)
                            .setMaxSizePolicy(MaxSizePolicy.ENTRY_COUNT)
            )
            .setBackupCount(5)
            .setInMemoryFormat(InMemoryFormat.OBJECT)
            .setManagementEnabled(true)
            .setStatisticsEnabled(true);

    cacheManager.createCache("books", cacheConfig);
    cacheManager.createCache("bottles", cacheConfig);
    cacheManager.createCache("chairs", cacheConfig);
    cacheManager.createCache("tables", cacheConfig);
    cacheManager.createCache("windows", cacheConfig);
}
Dummy Controller:
@GetMapping("/dummy_call")
public String getExampleObjects() { // simulates a situation where one call needs to fetch data from multiple cached sources
    Instant start = Instant.now();
    int i = 0;
    while (i != 50000) {
        exampleService.getBook(i);
        exampleService.getBottle(i);
        exampleService.getChair(i);
        exampleService.getTable(i);
        exampleService.getWindow(i);
        i++;
    }
    Instant end = Instant.now();
    return String.format("The heavy call took: %d seconds", Duration.between(start, end).getSeconds());
}
Dummy service:
@Service
public class ExampleService {

    @CacheResult(cacheName = "books")
    public Book getBook(int i) {
        try {
            Thread.sleep(1); // just to simulate a slow service here!
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Book(Integer.toString(i), Integer.toString(i));
    }

    @CacheResult(cacheName = "bottles")
    public Bottle getBottle(int i) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Bottle(Integer.toString(i), Integer.toString(i));
    }

    @CacheResult(cacheName = "chairs")
    public Chair getChair(int i) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Chair(Integer.toString(i), Integer.toString(i));
    }

    @CacheResult(cacheName = "tables")
    public Table getTable(int i) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Table(Integer.toString(i), Integer.toString(i));
    }

    @CacheResult(cacheName = "windows")
    public Window getWindow(int i) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Window(Integer.toString(i), Integer.toString(i));
    }
}
If you do the math:
4 s / 250 000 lookups is 0.016 ms per local lookup. This seems rather high, but let's take that.
When you add a single node, the data gets partitioned and half of the requests will be served from the other node. If you add 2 more nodes (4 total), then 25% of the requests will be served locally and 75% will be served over the network. This should explain why the response time grows when you add more nodes.
Even a simple ping on localhost takes twice that time or more. On a real network, the read latency we see in benchmarks is 0.3-0.4 ms per read call. That makes:
0.25 * 250k * 0.016 ms + 0.75 * 250k * 0.3 ms = ~57 s
You simply won't be able to make that many calls serially over the network (even a local one). You need to either:
parallelize the calls and use javax.cache.Cache#getAll to reduce the number of calls (see the sketch after this list), or
try enabling reading of local backups via com.hazelcast.config.MapConfig#setReadBackupData, so there are fewer requests over the network.
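A rough sketch of the bulk-read pattern with javax.cache.Cache#getAll (illustrative only: with @CacheResult the entries are keyed by generated cache keys rather than the raw int, so in practice you would populate and read the cache programmatically for this to apply):

// Illustrative sketch, assuming a cache keyed directly by Integer.
Cache<Integer, Book> books = cacheManager.getCache("books");

Set<Integer> keys = new HashSet<>();
for (int i = 0; i < 50000; i++) {
    keys.add(i);
}

// One bulk request per cache instead of 50 000 individual round trips.
Map<Integer, Book> result = books.getAll(keys);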
The read backup data feature is only available for IMap, so you would need to use Spring caching with the hazelcast-spring module and its com.hazelcast.spring.cache.HazelcastCacheManager:
@Bean
HazelcastCacheManager cacheManager(HazelcastInstance hazelcastInstance) {
    return new HazelcastCacheManager(hazelcastInstance);
}
See documentation for more details.
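For completeness, a rough sketch of what a service method would look like with Spring's cache abstraction instead of the JCache annotations (my illustration; it assumes @EnableCaching is active and the bean above is the cache manager):

@Service
public class ExampleService {

    // Spring's @Cacheable replaces the JCache @CacheResult; the backing store is then an IMap,
    // so MapConfig options such as setReadBackupData apply to it.
    @Cacheable("books")
    public Book getBook(int i) {
        return new Book(Integer.toString(i), Integer.toString(i));
    }
}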

How to prevent data loss from Redis when the server is stopped forcefully, which results in RedisCommandInterruptedException

@Autowired
private StringRedisTemplate stringRedisTemplate;

public List<Object> getDataFromRedis(String redisKey) {
    try {
        long numberOfEntriesToRead = 60000;
        return stringRedisTemplate.executePipelined(
                (RedisConnection connection) -> {
                    StringRedisConnection stringRedisConn = (StringRedisConnection) connection;
                    for (int index = 0; index < numberOfEntriesToRead; index++) {
                        stringRedisConn.lPop(redisKey);
                    }
                    return null;
                });
    } catch (RedisCommandInterruptedException e) {
        LOGGER.error("Interrupted EXCEPTION :::", e);
        // fall back to an empty list when the pipeline is interrupted
        return Collections.emptyList();
    }
}
I have a method which reads Redis content for a given key. The problem is that when my application server is stopped while this method is trying to fetch data from Redis, I get a RedisCommandInterruptedException, which results in the loss of some data from Redis. How can I overcome this problem? Any suggestions are appreciated.
Pipelines are not atomic operations, therefore there is no guarantee that all or none of the commands are executed when an exception happens.
You can use Lua scripts or the MULTI command to run the operations in a single transaction.
You can read more about using MULTI in Spring Data Redis in this SO thread and this site.
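A rough sketch of the MULTI/EXEC approach with Spring Data Redis (illustrative only; the key and the lPop loop come from the question, the method name and everything else are my assumptions):

// Illustrative sketch: queue the pops inside MULTI and apply them together with EXEC.
@SuppressWarnings({"unchecked", "rawtypes"})
public List<Object> popInTransaction(final String redisKey, final int count) {
    return stringRedisTemplate.execute(new SessionCallback<List<Object>>() {
        @Override
        public List<Object> execute(RedisOperations operations) {
            operations.multi();
            for (int i = 0; i < count; i++) {
                operations.opsForList().leftPop(redisKey);
            }
            // EXEC applies all queued commands together; if it never runs, none of them are applied.
            return operations.exec();
        }
    });
}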

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
    StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());

    final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
    final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
            .mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
            .filter((k, v) -> v.isPresent())
            .mapValues(Optional::get);

    waypointsDeserialized.groupByKey().aggregate(
            () -> 1,
            (aggKey, newWaypoint, aggValue) -> {
                aggValue = aggValue + 1;
                return aggValue;
            },
            Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store").withKeySerde(Serdes.ByteArray()).withValueSerde(Serdes.Integer())
    );

    final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
    streams.cleanUp();
    streams.start();

    ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());

    KeyValueIterator<byte[], Integer> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<byte[], Integer> next = range.next();
        System.out.println(next.value);
    }

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
protected static Properties createStreamsProperties() {
    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
    //streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
    return streamsConfiguration;
}
So my problem is, every time I run this I get this same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong ?
Looks like you have a race condition. The Kafka Streams javadoc for KafkaStreams::start() says:
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where it hasn't initialized fully yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go (this is not a solution for production). Depending on your input rate into the topic, that'll probably give a bit of time for the store to start filling up with events so that your KeyValueIterator actually has something to process/print.
Probably not applicable to OP but might help others:
In trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
I failed to call Storebuilder before consuming the store.
Typically this happens for two reasons:
The local KafkaStreams instance is not yet ready (i.e., not yet in runtime state RUNNING, see Run-time Status Information), and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready (e.g. in runtime state RUNNING), but the particular state store was just migrated to another instance behind the scenes. This may notably happen during the startup phase of a distributed application or when you are adding/removing application instances.
https://docs.confluent.io/platform/current/streams/faq.html#handling-invalidstatestoreexception-the-state-store-may-have-migrated-to-another-instance
The simplest approach is to guard against InvalidStateStoreException when calling KafkaStreams#store():
// Example: Wait until the store of type T is queryable. When it is, return a reference to the store.
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
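Applied to the code in the question, the call site would then look roughly like this (a sketch; main would also need to declare throws InterruptedException):

// Sketch: replace the direct streams.store(...) call with the guarded helper above.
streams.cleanUp();
streams.start();

ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        waitUntilStoreIsQueryable("test-store", QueryableStoreTypes.keyValueStore(), streams);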

LeaseExpiredException with custom UDF in Hive

I have a Hive UDF which is supposed to extract the device from a UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
public class DeviceTypeExtractTest extends UDF {

    private Text result = new Text();
    private static final Parser uaParser;

    static {
        try {
            uaParser = new Parser();
        } catch (IOException e) {
            throw new RuntimeException("Could not instantiate User-Agent parser.");
        }
    }

    public Text evaluate(Text uaField) {
        if (uaField == null) {
            return null;
        }
        try {
            String uaString = uaField.toString();
            Client client = uaParser.parse(uaString);
            result.set(client.device.family);
            return result;
        } catch (Exception e) {
            return null;
        }
    }
}
And it works just fine when run on a small dataset.
create table categories(
cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this UDF and use some internal Hive UDFs just for testing, this behavior does not take place.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried increasing the performance of the cluster by deploying better hardware, but I get the same problem with any hardware scenario.
Any ideas?
Okay, scratch that. It seems that after upgrading my instance, things started to move around but I was just not waiting long enough for the mapping to happen. And the LeaseExpiredException was actually thrown because of little ol' me when I was killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.
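One direction for that (my suggestion, not from the original thread): user-agent strings repeat heavily in web logs, so memoizing the parsed device per distinct UA string avoids re-running the parser for every row. A rough sketch using a bounded LinkedHashMap as an LRU cache inside the UDF:

// Rough sketch: cache the parsed device family per distinct user-agent string.
private static final int MAX_CACHE_SIZE = 10000;
private static final Map<String, String> deviceCache =
        new LinkedHashMap<String, String>(MAX_CACHE_SIZE, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_CACHE_SIZE;
            }
        };

public Text evaluate(Text uaField) {
    if (uaField == null) {
        return null;
    }
    try {
        String uaString = uaField.toString();
        String device = deviceCache.get(uaString);
        if (device == null) {
            device = uaParser.parse(uaString).device.family;
            deviceCache.put(uaString, device);
        }
        result.set(device);
        return result;
    } catch (Exception e) {
        return null;
    }
}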

(Twitter) Storm's Window On Aggregation

I'm playing around with Storm, and I'm wondering where Storm specifies (if possible) the (tumbling/sliding) window size for an aggregation. E.g., if we want to find the trending topics for the previous hour on Twitter, how do we specify that a bolt should return results every hour? Is this done programmatically inside each bolt? Or is there some way to specify a "window"?
Disclaimer: I wrote the Trending Topics with Storm article referenced by gakhov in his answer above.
I'd say the best practice is to use the so-called tick tuples in Storm 0.8+. With these you can configure your own spouts/bolts to be notified at certain time intervals (say, every ten seconds or every minute).
Here's a simple example that configures the component in question to receive tick tuples every ten seconds:
// in your spout/bolt
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    int tickFrequencyInSeconds = 10;
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, tickFrequencyInSeconds);
    return conf;
}
You can then use a conditional switch in your spout/bolt's execute() method to distinguish "normal" incoming tuples from the special tick tuples. For instance:
// in your spout/bolt
@Override
public void execute(Tuple tuple) {
    if (isTickTuple(tuple)) {
        // now you can trigger e.g. a periodic activity
    } else {
        // do something with the normal tuple
    }
}

private static boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
            && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
}
Again, I wrote a pretty detailed blog post about doing this in Storm a few days ago as gakhov pointed out (shameless plug!).
Add a new spout with a parallelism degree of 1, and have it emit an empty signal and then Utils.sleep until the next tick (all done in nextTuple()). Then link all relevant bolts to that spout using all-grouping, so all of their instances will receive that same signal.
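A minimal sketch of that clock-spout idea (the class names, component ids, and the hourly interval are mine, not from the original answer):

// Hypothetical clock spout: emits one signal tuple per interval, all from nextTuple().
public class ClockSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        collector.emit(new Values("tick"));
        Utils.sleep(60 * 60 * 1000); // sleep until the next hourly signal
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("signal"));
    }
}

// Wiring: parallelism 1 for the clock, all-grouping so every bolt instance gets the signal.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("clock", new ClockSpout(), 1);
builder.setBolt("trending-topics", new TrendingTopicsBolt(), 4)
       .allGrouping("clock")
       .shuffleGrouping("tweet-spout"); // plus whatever grouping connects the bolt to its normal input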
