Monolithic ETL to distributed/scalable solution and OLAP cube to Elasticsearch/Solr - hadoop

I am relatively new to big data processing and am looking for some specific guidance from the SO community.
We are currently set up with a monolithic/sequential ETL; needless to say, it does not scale as our data grows. What are our options? Distributing and parallelizing are the obvious answers, but I need specifics. I have played with Hadoop and it may be appropriate here, but I am wondering what some of the other options are. Maybe something that is easier for a database developer to transition to?
Somewhat related to the question above: we also have an OLAP cube for aggregated data. Are Elasticsearch or Solr good candidates for replacing an OLAP cube? Has anyone successfully done this? What are the gotchas?

We are currently working on the same kind of use case; our approach may be useful.
Step 1: Sqoop the data from the databases into HDFS.
Step 2: Implement the ETL logic in Pig scripts.
Step 3: Build an index on the aggregated table data in Solr.
Step 4: Search Solr through a web interface.
In our use case we develop Pig jobs that perform the transformation logic and store the results to final folders incrementally. Later, a MapReduce indexer tool indexes the data into Solr. We are using Cloudera Search. Let me know if you need anything else.
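The pipeline above is described in Pig and Cloudera Search terms; purely as an illustration of the step 2/step 3 handoff (aggregate, then write incrementally to a final HDFS folder for the downstream indexer to pick up), here is a rough Spark (Scala) sketch of the same idea. The paths, column names, Parquet format and the order_date partitioning are assumptions for the example, not the original setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object IncrementalAggregate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-aggregate").getOrCreate()
    import spark.implicits._

    // Step 1 equivalent: data has already landed on HDFS (e.g. via Sqoop).
    val orders = spark.read.parquet("hdfs:///landing/orders") // assumed path

    // Step 2 equivalent: transformation / aggregation logic.
    val daily = orders
      .groupBy($"customer_id", $"order_date")
      .agg(sum($"amount").as("total_amount"))

    // Step 3 input: append the new slice to the "final" folder, partitioned so an
    // indexer job (e.g. Cloudera Search's MapReduce indexer) can pick up only the
    // newest partition on each run.
    daily.write
      .mode("append")
      .partitionBy("order_date")
      .parquet("hdfs:///final/daily_totals") // assumed path

    spark.stop()
  }
}
```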

Related

Modern Business Intelligence solution

What is the modern way of building a Business Intelligence solution? I have looked at Power BI, but I'm wondering what the best data source for it would be. Should traditional data warehouse solutions still be used as the data source? I also hear a lot of talk about data lakes, but I don't know much about them. Or should I just use a regular relational database as the source? Does anyone have any opinions and tips on this?
I think the starting point of your thinking is wrong. You don't choose a front-end BI/dashboard tool and then ask which source would be best to connect to it.
You start from the data and information you want to analyze, report on and visualize. Think about the structure and variety of the data and the complexity of the analysis, correlations, integrations and business logic.
Then decide how you are going to:
1. Store the data
2. Process/transform the data to correlate, integrate or enrich it
3. Report on or visualize the data
It is only at step 3 of these high-level tasks that you should start thinking about which analysis/visualization tool best fits the data, its integrations with the data storage platform you have, and the nature of the data itself.
That will most likely bring you more success than approaching it the way you posed the question.
I hope it helps.
Start with your data.
1. Do you have a data warehouse now? If not:
2. Where is your data: databases, Excel, email? Data in databases, like MySQL, is structured; data in email or other documents is unstructured. Where your data lives affects how you will analyze it (which is what BI is all about, in the end). (As a side note, data lakes are best for analyzing structured, semi-structured and unstructured data together, for example if you queried data in documentation, a SQL DB and older MS Access data dumps.)
3. If you have data in different databases and systems, then I would recommend you start with a data warehouse. There are many options; one of the easier ones today is a cloud-based solution (AWS, Microsoft, etc.).
4. Once your data is in a location (or locations) where it can be queried and analyzed as a total data set, you can look at the BI tools that fit your needs.
   4.a. What type of analysis do you need? Queries? Trends? Complex data calculations and transformations?
5. Based on 4.a., look at the tools on the market. Power BI is just one of a whole variety of data analysis tools and systems; there are many resources on the web (Google "ETL tools").
6. After all of this you can narrow down your choices and select the solution that works best for you.

How do I improve my ETL performance?

Background Info:
I have a traditional ETL (on SQL Server) which takes around 6 hours to complete. I am looking to optimise it. Below are the steps I have already taken:
Removed unnecessary CURSORs from the logic. For the remaining ones that I am not able to remove, I used READ_ONLY, FAST_FORWARD and INSENSITIVE.
Removed some sorting of data that was happening.
Tuned the long-running SQL queries by using compiler hints or join hints.
Removed unnecessary columns that were being fetched from the source.
Partitioned the tables as well. I use partition switching, which did improve performance somewhat.
Is there any other method I am missing that could help make the ETL faster? At this point, we don't have the option of adding more powerful hardware or migrating to Hadoop.
Any help would be appreciated.
A few questions:
Are your sources SQL Server databases?
Have you reviewed your destination database?
Is this a dimensional data warehouse or a normalised data store?
Without much knowledge of your source and destination, some other things I might recommend:
1) Remove unwanted lookup transformations, if you have any.
2) If you can afford to, look at creating indexes on some of your source tables. It is not always feasible, but believe me, it helps.
3) Remove unwanted UNIONs.
If possible, please share further info on your ETL/database architecture and I am sure many brains over here will be able to shed more wisdom.
Cheers
Nithin

Pros and cons of integrating hadoop with OBIEE

I am studying the integration of Hadoop with OBIEE. However, I am unable to find any good article highlighting the pros and cons of integrating Hadoop with OBIEE. If anyone has this information, kindly share the link/details.
Pros: You can get your data from Hadoop
Cons: Pointless, unless your data is in Hadoop
As a question this really doesn't make much sense. You integrate OBIEE with wherever your data is, in order to analyse it.
+1 to Robin. The point of a source-agnostic tool is to analyze data wherever it lies.
Pushing data to a new storage layer "just because" doesn't add value. You have to have a reason, such as performance or explicit physical modelling (multidimensional cubes, for example).

Pure Spark vs Spark SQL for querying data on HDFS

I have (tabular) data on an HDFS cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future, with other data. So, the question:
What are the factors to take into account when choosing between (pure) Spark and Spark SQL for implementing such a task?
Here are the selection factors I could think of:
Familiarity with language:
In my case, I am more of a data analyst than a DB guy, so this would lead me to use Spark: I am more comfortable thinking about how to (efficiently) implement data selection in Java/Scala than in SQL. This, however, depends mostly on the query.
Serialization:
I think one can run a Spark SQL query without shipping a home-made JAR plus dependencies to the Spark workers (?). But then the returned data are raw and have to be converted locally.
Efficiency:
I have no idea what the differences are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provide some insight?
Regarding point 3: depending on your input format, the way the data is scanned can differ between pure Spark and Spark SQL. For example, if your input format has many columns but you only need a few of them, Spark SQL can skip retrieving the unused ones, whereas this is a bit trickier to achieve in pure Spark.
On top of that, Spark SQL has a query optimizer. When you use the DataFrame API or a SQL statement, the resulting query goes through the optimizer so that it is executed more efficiently.
Spark SQL does not exclude Spark; combining the two probably gives the best results.
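To make the column-pruning and optimizer points concrete, here is a minimal Scala sketch contrasting the two styles on the same tabular HDFS data (Spark 2.x; the paths and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object PureSparkVsSparkSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pure-spark-vs-spark-sql")
      .getOrCreate()
    import spark.implicits._

    // "Pure" Spark: read lines as text and split columns by hand.
    // Every line is fully read and split before the unwanted columns are dropped.
    val clicksPerUserRdd = spark.sparkContext
      .textFile("hdfs:///data/events/*.csv")   // assumed path
      .map(_.split(","))
      .filter(cols => cols(2) == "click")      // column 2 assumed to hold the event type
      .map(cols => (cols(0), 1))               // column 0 assumed to hold the user id
      .reduceByKey(_ + _)

    // Spark SQL / DataFrame: Catalyst can push the filter down and, with a
    // columnar source such as Parquet, read only the referenced columns.
    val clicksPerUserDf = spark.read.parquet("hdfs:///data/events_parquet") // assumed path
      .select($"user_id", $"event_type")
      .filter($"event_type" === "click")
      .groupBy($"user_id")
      .count()

    clicksPerUserRdd.take(5).foreach(println)
    clicksPerUserDf.show(5)

    spark.stop()
  }
}
```

With a columnar source, the DataFrame version lets the optimizer skip the columns you never reference, which is exactly the scan-skipping behaviour described above; the RDD version has to parse each full line itself.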

MapReduce project with data mining

I am planning to do a MapReduce project involving Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns and possibly graph algorithms, Hive and Pig Latin. I would really appreciate it if someone could give me some ideas about it. I have a few of my own in mind.
In the end I have to work on some large data set, extract some information and derive some conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure if Weka is the only thing I can work with right now. Are there other ways in which I can work on large data and derive conclusions from it?
Also, how can I involve graphs in this?
Basically I want to make a research project, but I am not sure what exactly I should be working on or what it should look like. Any thoughts? Suggested links/ideas? Knowledge sharing?
I suggest you check out Apache Mahout; it is a scalable machine learning and data mining framework that integrates nicely with Hadoop.
Hive gives you an SQL-like language for querying big data; essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
Another suggestion is to implement your data processing algorithm in R, a statistical environment (similar to MATLAB). Instead of the standard R distribution, I would recommend Revolution R, an R environment with much more powerful tools for big data and clustering.
Edit: if you are a student, Revolution R has a free academic edition.
Edit: a third suggestion is to look at GridGain, another map/reduce implementation in Java that is relatively easy to run on a cluster.
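If it helps to see what the Mahout suggestion looks like in practice, below is a hedged sketch of its classic (pre-0.10, "Taste") user-based recommender API called from Scala. The input file ratings.csv (userID,itemID,preference) is an assumed example, and this variant runs in-memory rather than as a MapReduce job:

```scala
import java.io.File
import scala.collection.JavaConverters._
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity

object MahoutRecommenderSketch {
  def main(args: Array[String]): Unit = {
    // Ratings file with lines of the form: userID,itemID,preference (assumed input)
    val model        = new FileDataModel(new File("ratings.csv"))
    val similarity   = new PearsonCorrelationSimilarity(model)
    val neighborhood = new NearestNUserNeighborhood(10, similarity, model)
    val recommender  = new GenericUserBasedRecommender(model, neighborhood, similarity)

    // Top 3 recommendations for user 1
    for (item <- recommender.recommend(1L, 3).asScala)
      println(s"item ${item.getItemID}, score ${item.getValue}")
  }
}
```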
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!
