I am new to Storm and wondering whether I can use it to merge/join two tables from two different databases. (Of course, the two tables have some sort of foreign-key relationship; they just happen to live in different databases/systems.) Any ideas on how I would build the topology? For example, two separate spouts that read periodically from the two databases, and a bolt that does the join work?
Is this even a proper use case for Storm?
Any ideas are appreciated!
This may be a good use of Storm, but it really depends on your dataset. If you just have two tables in separate DBMSs that you want to join and store in some third place (DBMS or otherwise), Storm will only really make sense if this is a streaming join, i.e. the two tables are written to frequently and you want to join the rows that were recently written.
Also, it almost goes without saying that you should only employ the complexity Storm will bring if this is for something relatively large and high volume.
If it's small, you will probably be better served with a traditional ETL tool, even if that's just some code you whip up to access the two databases and combine the data.
If the data set is large and you need to do joins across more than a short timeframe, I'd consider doing this another way, such as using a map-reduce job that pulls data from the two DBs and spreads the join out over a cluster.
like having two separated spouts reading periodically from two dbs and having a bolt to do the join work
Yes, this is very much possible. A Storm topology can have multiple spouts, and a bolt can consume any number of input streams, do some processing, and possibly emit new streams. Typically it's better to have your spouts read from a queue such as Kafka or RabbitMQ (you can find spout integrations for most queuing systems). In that case you feed the queue with the data from the databases and let the spouts consume from it.
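As a rough illustration of the join step only, here is a minimal sketch in plain Python. It does not use the Storm API; the class name `JoinBolt`, the stream labels "left"/"right", and the sample keys are all made up for this example. It assumes a one-to-one join and that every tuple with a given key reaches the same bolt instance (in Storm you would arrange that with a fields grouping on the join key).

```python
class JoinBolt:
    """Hypothetical join step: buffers tuples from two streams and emits a
    joined row once both sides of a key have arrived (one-to-one join)."""

    def __init__(self):
        self.left = {}   # key -> value seen on the "left" stream, still unmatched
        self.right = {}  # key -> value seen on the "right" stream, still unmatched

    def process(self, stream, key, value):
        """Handle one incoming tuple; return a joined row or None."""
        if stream == "left":
            own, other = self.left, self.right
        else:
            own, other = self.right, self.left
        if key in other:
            match = other.pop(key)
            # Always emit as (key, left_value, right_value).
            return (key, value, match) if stream == "left" else (key, match, value)
        own[key] = value  # wait for the other side
        return None


bolt = JoinBolt()
first = bolt.process("left", 42, "order row")        # nothing to join with yet
second = bolt.process("right", 42, "customer row")   # both sides present now
```

A real streaming join would also have to expire buffered entries whose partner never shows up, otherwise the buffers grow without bound; that is the "short timeframe" caveat mentioned in the other answer.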
UPDATE:
Here is a very nice article about how Storm parallelism works
Related
I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following approaches is preferred?
1) Create multiple threads within a bolt.
2) Create multiple bolts and a merging bolt to aggregate the data.
I would like to create multiple threads within a bolt, because merging data in another bolt is not a simple process. But I found some concerns about that approach on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't find a clear reason why I should not create multiple threads. Any pointers will help.
On a side note, does that mean I should also avoid Java 8's parallel streams, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; that is like spawning multiple instances of the same bolt. Also increase the number of executors (threads) so that the tasks are spread evenly across them.
Make sure #executors <= #tasks. Storm will do the rest for you.
I have a lot of data saved into Cassandra on a daily basis, and I want to compare one datapoint with the last 5 versions of the data for different regions.
Let's say there is a price datapoint for a product and there are 2000 products in a context/region (say US). I want to show a heat-map dashboard showing when the price changes happened for the different regions.
I am new to Hadoop, Hive and Pig. Which path would help me achieve my goal? Some details would be appreciated.
Thanks.
This sounds like a good use case for either traditional MapReduce or Spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heatmap seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about Spark Streaming; a traditional batch job run a few times a day is fine.
Here's some info from datastax on reading from cassandra in a spark job: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSCcontext.html
For either Spark or MapReduce, you are going to want to leverage the framework's ability to partition the task. If you are manually connecting to Cassandra and reading/writing the data like you would from a traditional RDBMS, you are probably doing something wrong. If you write your job correctly, the framework will be responsible for spinning up multiple readers (one for each node that contains the source data that you are interested in), distributing the calculation tasks, and routing the results to the appropriate machine to store them.
Some more examples are here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
and
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/byoh/byohIntro.html
Either way, MapReduce is probably a little simpler, and Spark is probably a little more future proof.
We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., the same schema, but only carries a share of the data (and not the full data!). The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.
A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, MapReduce only makes sense if the individual workers share the same data but perform different types of work on it (which corresponds to multiple instruction, single data).
In our case, however, the workers should perform the same task, but perform that task on various data (corresponds to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but different segments of that data. The parallel Map and Reduce tasks are the same tasks over different portions of the data. MapReduce is most definitely "single instruction, multiple data" from your definition.
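To make the "same task, different portions of the data" point concrete, here is a minimal sketch in plain Python. It is not Hadoop code; the function names `map_task` and `reduce_task`, the `amount` field, and the sample partitions are assumptions made for illustration. Every partition is processed by the identical function, and only the data differs.

```python
def map_task(rows):
    """The same instruction for every partition: sum the 'amount' field."""
    return sum(row["amount"] for row in rows)


def reduce_task(partial_sums):
    """Combine the per-partition results into the final answer."""
    return sum(partial_sums)


# Three partitions standing in for three shards/blocks of the same table.
partitions = [
    [{"amount": 1}, {"amount": 2}],
    [{"amount": 3}],
    [{"amount": 4}, {"amount": 5}],
]

# In a real cluster each map_task call would run on a different node.
partials = [map_task(p) for p in partitions]
total = reduce_task(partials)
```

This mirrors what the question describes doing by hand: run the same query on each database, then combine the per-database results.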
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you. Things like rollups and counts and aggregates. You won't be able to retrieve or search with MapReduce effectively. Also, updating data in Hadoop is not a typical way you want to do things-- you treat things as more "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
Take a look at Hive. It's an abstraction layer on top of MapReduce that translates a light version of SQL into MapReduce jobs under the covers. It should let you convert some of your logic a bit more easily.
I have two different files which each contain different data. I would like to do some processing with these files then merge the data together based on matching keys. What is the best way to implement this in Hadoop? I was thinking of somehow creating two mappers that would each process one file then a reducer to combine the data? I'm not sure if this is even possible. Does anyone have any suggestion as to how I can combine data from two files in Hadoop?
There are many ways to write a map/reduce job (Hive, Pig, Cascading, Java, etc.), but essentially a join is a multi-input job in which the mappers emit records in the form (key_to_join_by, rest_of_data) and the reducer does the actual join (unless one of the files is small enough to hold in memory, in which case you can do the join in the mapper).
You can see an example of how to do this in Pig here
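The reduce-side join described in this answer can be sketched in plain Python, without Hadoop. All names here (`map_users`, `map_orders`, `shuffle`, `reduce_join`, `run_join`) and the two comma-separated file layouts are assumptions for illustration. Each mapper emits (key_to_join_by, rest_of_data) and tags the record with its source so the reducer can tell the two inputs apart.

```python
from collections import defaultdict
from itertools import product


def map_users(line):
    """Mapper for the hypothetical users file: 'user_id,name'."""
    user_id, name = line.split(",")
    return user_id, ("users", name)


def map_orders(line):
    """Mapper for the hypothetical orders file: 'order_id,user_id,item'."""
    _order_id, user_id, item = line.split(",")
    return user_id, ("orders", item)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, tagged_value in pairs:
        groups[key].append(tagged_value)
    return groups


def reduce_join(key, tagged_values):
    """Reducer: pair every users record with every orders record for a key."""
    names = [v for tag, v in tagged_values if tag == "users"]
    items = [v for tag, v in tagged_values if tag == "orders"]
    return [(key, name, item) for name, item in product(names, items)]


def run_join(user_lines, order_lines):
    """Drive the map, shuffle, and reduce phases in one process."""
    pairs = [map_users(l) for l in user_lines] + [map_orders(l) for l in order_lines]
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values))
    return joined
```

This behaves like an inner join: keys that appear in only one of the two inputs produce no output rows.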
Can you give examples of your file? It is not clear what you are asking. Are you talking about doing joins in Hadoop? If so you will need to have two mapper classes. Or you can use Hive which makes performing joins easier. Please look at this for examples of both the possible solutions: Joins in Hadoop
Suppose I had millions of records that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic, then take that matching subset and insert it into a separate database. Would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that the base data comes from multiple sources and is not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
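The word-count example can be sketched in a few lines of plain Python. The function names `count_words` and `merge_counts` and the sample chunks are made up for illustration; in a real job each chunk would be counted on a different node.

```python
from collections import Counter


def count_words(chunk):
    """Local step: count the words in one part of the document."""
    return Counter(chunk.lower().split())


def merge_counts(partial_counts):
    """Aggregation step: add up the per-part totals."""
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)
    return totals


# Three parts of one document, each of which could be counted independently.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
totals = merge_counts(count_words(chunk) for chunk in chunks)
```

The key property is that `count_words` needs nothing but its own chunk, so the parts can be processed in any order and on any machine before the totals are merged.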
Hadoop is designed for processing large data files (such as log files). The default block size is 64 MB in Hadoop 1.x (128 MB in later releases), so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database (such as MongoDB), which is designed to handle data where a lot of the columns are null.
Hadoop/MR are designed for batch processing, not for real-time processing, so an alternative such as Twitter Storm or HStreaming has to be considered.
Also, look at Hama for real time processing of data. Note that real time processing in Hama is still crude and a lot of improvement/work has to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not great (and millions of records does not sound like much), I would suggest trying to get the most out of an RDBMS, even if your schema will not be properly normalized.
I think even a table with the structure K1, K2, K3, Blob would be more useful.
NoSQL key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of where they would be useful is the ability of MongoDB/CouchDB to index schemaless data, so that you can retrieve records by some attribute value.
Regarding Hadoop MapReduce: I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.