I have a project to work on and I'm asking for help.
My project is in the field of business intelligence and data warehouse creation.
I have extracted the data I need (ETL); what should I do next?
I am working with MS SQL Server 2014.
How do I create my dimensions and my fact table?
I'm looking for advice.
Please accept my salutations.
This is a big question! Unfortunately, Stack Overflow's Q&A format isn't the best place to answer it, but here are a few pointers:
Everything starts with the requirements. Before you write any code, figure out exactly what your data warehouse will be used for (it can also be helpful to work out what your data warehouse will not be used for).
Analyse the raw data. Make sure you know what is and is not available. Be aware of the source systems' shortcomings. Example: if your reports need to split your customers by country, is this data available? If so, is it consistently populated (some records have US, others USA, others still America)? Make a plan for dealing with these issues (see data cleansing below).
Prototype your data model. Excel and Power BI are great places to test the design. Once you start using a database it becomes much harder to change. Get it right at the very beginning and your life will be much easier.
Pick an ETL tool. Make sure you understand it, and it plays to the strengths of you and your team. I like SSIS.
Import the raw data into staging tables. This can help to simplify the analysis phase.
Cleanse the data. In a data warehouse, you have 100% control over every row, column and cell. Make use of this fact. Ensure only quality, useful, well-conformed data makes it into your published tables (see the sketch below).
Like all projects, planning and administration is the key. Writing code and building tables comes last.
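When you do reach the staging and cleansing steps, here is a minimal T-SQL sketch of what they can look like. The schema, table and column names are hypothetical, and the country clean-up mirrors the US/USA/America example above:

-- Land raw source rows in a staging table first (all names are hypothetical).
CREATE TABLE stg.Customer (
    CustomerID   INT,
    CustomerName NVARCHAR(200),
    Country      NVARCHAR(100)
);

-- Cleanse while moving from staging to the published dimension:
-- conform the inconsistent country values to a single standard.
INSERT INTO dim.Customer (CustomerID, CustomerName, CountryCode)
SELECT
    CustomerID,
    LTRIM(RTRIM(CustomerName)),
    CASE UPPER(LTRIM(RTRIM(Country)))
        WHEN 'US'      THEN 'USA'
        WHEN 'AMERICA' THEN 'USA'
        ELSE UPPER(LTRIM(RTRIM(Country)))
    END
FROM stg.Customer
WHERE CustomerID IS NOT NULL;   -- reject rows that fail a basic quality check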
Here are some resources which should help you:
Kimball Group. Ralph Kimball literally wrote the book on data warehousing (see next tip). His company's website contains a few hints and tips.
If you cannot attend a training course, buy a good book. I'd recommend this one. It's a big subject. Blogs and the internet can only teach you so much.
Download and try out Adventure Works DW. This is a sample data warehouse and ETL package, built by Microsoft. It demonstrates some of the techniques you can use in SSIS.
So I will be embarking on designing a dashboard that will display KPIs and other relevant information for my team. Since I am in the early stages of this project and am not very familiar with the technical process behind designing a dashboard, I need some questions vetted before I go and shop for solutions, to avoid reinventing the wheel.
Here are some of my questions:
We want a dashboard that provides information from our data sources in real time (or as close to real time as possible). What functionality allows a dashboard to update itself from concurrent data sources? From a conceptual standpoint, I can understand creating a dashboard out of Microsoft Excel and having the dashboard depend on the values you may have set within your pivot table.
How do you make a dashboard request information from multiple data sources on its own? In the Excel example, a user may have to go into the pivot tables to update values, but how would a dashboard request this by itself, and what is the exact method from a programming standpoint? Does the code execute itself every time you refresh the webpage?
How do you create data sources organically? I know that for some solutions, such as SharePoint BI Center, there are pre-supported data sources like an Excel sheet or SharePoint, and it's as easy as uploading your document and letting the design handle the rest. However, there are going to be some data sources that I know will need to be fetched. Do I need to understand something else, like an event recorder, in order to navigate this issue?
Introduction
A dashboard (or report) is usually the result of a long chain of steps. Very much simplified, it could look like this:
src1
  |------\
src2     |                  /---- Dashboards
  |------+---[DWH]---[BR]--+
src n    |     |            \---- Reports etc.
  |------/   [Big Data]
Keep in mind, this is only a very, very simple structure of a data backend / frontend.
DWH means Data Warehouse, where data might be stored temporarily (you referred to this as fetching). This could be a database, could be a Big Data engine, could be a combination of both...
Afterwards, there are the Business Rules (BR). Those might be specific rules for how different departments calculate and relate data, but also simple things like algebra.
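To make the Business Rules layer a bit more concrete, here is a hedged SQL sketch: a view that pins down one shared definition of "net sales" so every department and every dashboard reads the same number. The table and column names are purely hypothetical:

-- Hypothetical shared business rule: net sales = amount minus discount,
-- counted on the invoice date, excluding cancelled orders.
CREATE VIEW br_net_sales AS
SELECT
    invoice_date,
    department_id,
    SUM(amount - discount) AS net_sales
FROM dwh_sales
WHERE status <> 'CANCELLED'
GROUP BY invoice_date, department_id;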
Questions
So, the main question should not be about the technology:
What software should we choose?
How can we create a dashboard?
but should instead focus on your business processes (see it as a top-down view):
What does our core process look like? Where would we like to measure data?
How does department A calculate sales differently from department B? Should they all use the same rule?
Where does everyone store the data? Can we access it? Do we need structured data?
And, easy to forget but sometimes one of the biggest parts: is the identifier of a business object (say, a sales ID) built and formatted in the same way everywhere?
Conclusion
When those questions are at least in the back of your head and you keep working in this direction, data will more or less automatically emerge at certain points of that process.
Then it won't matter whether you use Excel, a small-to-medium tool like Tableau, TIBCO Spotfire, QlikView or Power BI, or whether you want to go full scale with a big Hadoop backend, databases, and JasperReports, Apache Drill, Pentaho or SSIS on top of it... it will come out eventually.
TL;DR
Focus on the processes first. Make sure you understand them. Draft in Excel. Then proceed to get the data and the tools you need to support your use cases. A "top-down" approach will work out much better than trying to solve your requirements with tools alone.
What is considered the best way to save and use views, SPs and UDFs for SSRS reports that will be used by many users, with some subscribed reports being sent out?
Do I:
Write the results to a table overnight via scheduled jobs, so reports do a direct read of the pre-saved query results?
Use an SP with indexed temp tables, based on each view's SQL, to have it all in one place for SSRS?
If the answer is 'it depends on what I want', I would be grateful if you could point me to any resources that give an idea of the ideal setup for getting query data to SSRS with minimal performance issues.
Thanking you kindly
Background/Explanation
SQL Server is not foreign to me, but I don't consider myself experienced enough (1 year) to have developed 'etiquette' when it comes to crafting the parts of SQL Server. I feel I'm developing a lot of bad habits formed from basic SQL knowledge, online searches and the odd MS SQL Server course. The amount of searching has been endless, and I'm not saying there isn't an answer out there for each part of SQL Server (UDFs, SPs and views).
The company I work for has many servers and many databases for many outsourced front-end systems. The issue is performance, and the more I search the more I realize the setup of our databases may be completely negligent and amateurish. When I joined, the setup used a lot of views; each 'end' view had a dependency tree of 4+ views, including the use of functions, with views ranging from aggregate calculations for statistics to rearranging via pivots and unpivots. The reason given to me was so that we can pick out which view something has gone wrong in. Unsurprisingly, the server has now had enough of this and peaks at 100% every time a report or view is run, affecting front-end performance for the users.
My previous paragraph stresses my frustration and my position with the company (code monkey) at finding an answer myself, which has resulted in me pushing the keys back into the keyboard with my opposable thumbs and appealing to the experts here.
This question is really too broad for Stack Overflow. I'll try to give you a quick overview of what I think you're asking, but really you're asking for way too much for a single answer here. This site is mainly focused on solving a specific problem rather than the general process of development. I expect someone will probably come along and close your question.
Nightly table loads
Depending on the complexity of the task, this is exactly what SSIS (SQL Server Integration Services) is for. You can build automated processes that perform data transformations and data loads; it is used to build maintainable data integration solutions. Learning to use SSIS (especially properly) is a whole task in itself, though. In fact, the third exam for the SQL Server 2012 MCSA is exclusively about SSIS. If your table loads are not that complicated, running them as plain SQL tasks could be just as effective.
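If the loads really are simple, a nightly job can be a plain T-SQL step scheduled in SQL Server Agent rather than a full SSIS package. A hedged sketch, assuming a hypothetical reporting table and source view:

-- Rebuild a pre-aggregated reporting table overnight so SSRS reads static rows.
-- Object names are hypothetical; the transaction keeps readers from seeing a half-built table.
BEGIN TRANSACTION;

TRUNCATE TABLE rpt.DailySales;

INSERT INTO rpt.DailySales (SaleDate, Region, TotalAmount)
SELECT SaleDate, Region, SUM(Amount)
FROM dbo.vw_Sales          -- the existing (slow) view chain
GROUP BY SaleDate, Region;

COMMIT TRANSACTION;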
Database structure and use of views/SPs/functions/etc
This is an incredibly deep subject and it is totally dependent on what you're trying to do, how your data is structured, what kind of hardware you've got running, etc. Certainly using views, functions, and stored procedures can be good. They enable code re-use and allow you to encapsulate the logic for SSRS reports away from the actual report writers.
However, the SQL needs to be well written or it will suffer from performance problems; of course, that is true no matter where you put the code. Even if the SQL is just a dataset in an SSRS report, it will run slowly and hammer the server if it isn't written well. If the database isn't configured correctly, it can have terrible performance. Indexes and other techniques for speeding up databases will always be important.
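As a hedged example of the encapsulation idea, an SSRS dataset can simply call a parameterised stored procedure, so report writers never touch the underlying joins. All names here are hypothetical:

-- One procedure per report dataset keeps the logic in the database,
-- where it can be tuned and indexed without editing the report itself.
CREATE PROCEDURE rpt.usp_SalesByRegion
    @StartDate DATE,
    @EndDate   DATE
AS
BEGIN
    SET NOCOUNT ON;

    SELECT Region, SUM(Amount) AS TotalAmount
    FROM dbo.Sales
    WHERE SaleDate >= @StartDate
      AND SaleDate <  DATEADD(DAY, 1, @EndDate)   -- sargable date range
    GROUP BY Region;
END;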
Above all everything needs to be documented so that someone else (or your later self) can make sense of it in ten months when something breaks.
Training
I would highly recommend trying to convince your employer to send you on some courses to learn SQL Server if they expect you to be developing complex database solutions. Taking the courses to get your MCSA in SQL Server 2012 would certainly be very useful. Getting the certification opened my eyes to many possibilities for achieving things that I didn't know about before or just hadn't thought of.
The first exam will cover writing SQL queries and the different things that can help performance and the many cool features that you can leverage when retrieving or writing data. The second exam will cover database server administration, troubleshooting, and some performance tuning. The third exam is all about SSIS and how to warehouse your data to enable better analysis and reporting.
Even if you just read the Microsoft Learning books for these exams and never take the tests, you will gain a lot of knowledge. There are other good books too, such as T-SQL Fundamentals by Itzik Ben-Gan, but ultimately it sounds like you need much deeper knowledge of the SQL Server platform before you can really make good design decisions about how to implement your solutions.
Conclusion
In the end, programming is programming. Trying to make a maintainable solution that works is your first goal. Tuning the performance of the system comes after that. The specifics of the languages and platforms don't take away from any of that. But in order to get the best performance out of a system you need knowledge about that system. An answer on here isn't going to be able to give you everything you need to know.
I have read a lot of blogs/articles on how different types of industries are using big data analytics, but most of these articles fail to mention:
What kind of data these companies used, and what the size of the data was
What kind of tools and technologies they used to process the data
What problem they were facing, and how the insight they got from the data helped them resolve the issue
How they selected the tool/technology to suit their needs
What kind of patterns they identified in the data, and what kind of patterns they were looking for
I wonder if someone can provide answers to all these questions, or a link which at least answers some of them.
It would be great if someone could share how the finance industry is making use of big data analytics.
Your question is very broad, but I will try to answer from my own experience.
1 - What kind of data do these companies use?
One of the strengths of Hadoop is that you can use a very wide range of sources for your data. It can be .csv / .txt files, JSON, MySQL, photos, videos...
It can contain data about marketing, social networks, server logs...
What was the size of the data?
There are no rules about that. It can range from 50-60 GB up to 1 PB. It depends on the data and the company.
2 - What kind of tools and technologies do they use to process the data?
There are no rules about that; it depends on the needs. To organize and process data they use Hadoop with Hive and Pig. To query data they want short response times, so they use a NoSQL / in-memory database with a smaller dataset (refined by Hadoop); see the sketch at the end of this answer. In some cases, companies use an ETL tool like Talend in order to go faster.
3 - What problem were they facing, and how did the insight they got from the data help them resolve the issue?
The main issue for companies is the growth of their data. At some point the data become too big and are impossible to process with traditional tools like MySQL or others. So they start to use Hadoop, for example.
4 - How do they select the tool/technology to suit their needs?
I think it's an internal question. Companies choose their tools based on licence prices, their own skills, their final needs...
5 - What kind of patterns do they identify in the data, and what kind of patterns are they looking for?
I don't really understand this question.
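To make point 2 a little more concrete, here is a hedged sketch of the kind of SQL-on-Hadoop (Hive-style) aggregation that such a batch layer typically runs. The table and column names are made up:

-- Hive-style batch aggregation: crunch raw logs into a small summary table
-- that a NoSQL / in-memory store can then serve with short response times.
-- raw_web_logs and daily_page_stats are hypothetical tables.
INSERT OVERWRITE TABLE daily_page_stats
SELECT
    log_date,
    page_name,
    COUNT(*)                AS hits,
    COUNT(DISTINCT user_id) AS visitors
FROM raw_web_logs
GROUP BY log_date, page_name;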
Hope it will help you.
I think getting what you want is a difficult job; you'll have to gather data little by little from different resources. Just make sure to visit these links:
A bunch of free reports (I am studying the list right now):
http://www.oreilly.com/data/free/
and the famous McKinsey Report:
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
At work, we recently started a project using CouchDB (a document-oriented database). I've been having a hard time un-learning all of my relational db knowledge.
I was wondering how some of you overcame this obstacle. How did you stop thinking relationally and start thinking documentally (I apologise for making up that word)?
Any suggestions? Helpful hints?
Edit: If it makes any difference, we're using Ruby & CouchPotato to connect to the database.
Edit 2: SO was hassling me to accept an answer. I chose the one that helped me learn the most, I think. However, there's no real "correct" answer, I suppose.
I think, after perusing a couple of pages on this subject, that it all depends on the type of data you are dealing with.
RDBMSes represent a top-down approach, where you, the database designer, assert the structure of all data that will exist in the database. You define that a Person has a First, Last and Middle Name and a Home Address, etc. You can enforce this using an RDBMS. If you don't have a column for a Person's HomePlanet, tough luck, wanna-be Person that has a different HomePlanet than Earth; you'll have to add a column at a later date or the data can't be stored in the RDBMS. Most programmers make assumptions like this in their apps anyway, so this isn't a dumb thing to assume and enforce. Defining things can be good. But if you need to log additional attributes in the future, you'll have to add them in. The relational model assumes that your data attributes won't change much.
"Cloud"-type databases using something like MapReduce, in your case CouchDB, do not make the above assumption, and instead look at data from the bottom up. Data is input as documents, which can have any number of varying attributes. This model assumes that your data, by its very definition, is diverse in the types of attributes it could have. It says, "I just know that I have this document in database Person that has a HomePlanet attribute of "Eternium" and a FirstName of "Lord Nibbler" but no LastName." This model fits webpages: all webpages are documents, but the actual contents/tags/keys of the documents vary so widely that you can't fit them into the rigid structure that the DBMS pontificates from on high. This is why Google thinks the MapReduce model roxors soxors: Google's data set is so diverse that it needs to build in for ambiguity from the get-go and, due to the massive data sets, needs to be able to utilize parallel processing (which MapReduce makes trivial). The document-database model assumes that your data's attributes may/will change a lot or be very diverse, with "gaps" and lots of sparsely populated columns that one might find if the data were stored in a relational database. While you could use an RDBMS to store data like this, it would get ugly really fast.
To answer your question, then: you can't think "relationally" at all when looking at a database that uses the MapReduce paradigm, because it doesn't actually have an enforced relation. It's a conceptual hump you'll just have to get over.
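As a hedged SQL illustration of the top-down rigidity described above (the Person table and its columns are purely hypothetical): every attribute has to be declared up front, and a new attribute means a schema change that touches every existing row, whether it needs it or not.

-- The designer asserts the structure of a Person up front.
CREATE TABLE Person (
    PersonID    INT PRIMARY KEY,
    FirstName   VARCHAR(50),
    MiddleName  VARCHAR(50),
    LastName    VARCHAR(50),
    HomeAddress VARCHAR(200)
);

-- Later, a Person turns up with a HomePlanet: the only way to store it
-- is to change the schema itself.
ALTER TABLE Person ADD HomePlanet VARCHAR(50) DEFAULT 'Earth';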
A good article I ran into that compares and contrasts the two databases pretty well is MapReduce: A Major Step Back, which argues that MapReduce paradigm databases are a technological step backwards, and are inferior to RDBMSes. I have to disagree with the thesis of the author and would submit that the database designer would simply have to select the right one for his/her situation.
It's all about the data. If you have data which makes most sense relationally, a document store may not be useful. A typical document-based system is a search server: you have a huge data set and want to find a specific item/document, and the document is static or versioned.
In an archive-type situation, the documents might literally be documents that don't change and have very flexible structures. It doesn't make sense to store their metadata in a relational database, since the documents are all very different and very few of them would share those tags. Document-based systems don't store null values.
Non-relational/document-like data makes sense when denormalized: it doesn't change much, or you don't care as much about consistency.
If your use case fits a relational model well then it's probably not worth squeezing it into a document model.
Here's a good article about non relational databases.
Another way of thinking about it: a document is a row. Everything about a document is in that row, and it is specific to that document. Rows are easy to split on, so scaling is easier.
In CouchDB, like Lotus Notes, you really shouldn't think about a Document as being analogous to a row.
Instead, a Document is a relation (table).
Each document has a number of rows--the field values:
ValueID(PK)    Document ID(FK)    Field Name         Field Value
=================================================================
92834756293    MyDocument         First Name         Richard
92834756294    MyDocument         States Lived In    TX
92834756295    MyDocument         States Lived In    KY
Each View is a cross-tab query that selects across a massive UNION ALL of every Document.
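To illustrate what such a view conceptually does, here is a hedged relational sketch that pivots field/value rows like the ones above back into one row per document. The table and column names are hypothetical:

-- Conceptually, a view behaves like a cross-tab over the field/value rows
-- of every document (imagine DocumentFieldValues as a giant UNION ALL).
SELECT
    DocumentID,
    MAX(CASE WHEN FieldName = 'First Name' THEN FieldValue END) AS FirstName,
    MAX(CASE WHEN FieldName = 'Last Name'  THEN FieldValue END) AS LastName
FROM DocumentFieldValues
GROUP BY DocumentID;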
So, it's still relational, but not in the most intuitive sense, and not in the sense that matters most: good data management practices.
Document-oriented databases do not reject the concept of relations, they just sometimes let applications dereference the links (CouchDB) or even have direct support for relations between documents (MongoDB). What's more important is that DODBs are schema-less. In table-based storages this property can be achieved with significant overhead (see answer by richardtallent), but here it's done more efficiently. What we really should learn when switching from a RDBMS to a DODB is to forget about tables and to start thinking about data. That's what sheepsimulator calls the "bottom-up" approach. It's an ever-evolving schema, not a predefined Procrustean bed. Of course this does not mean that schemata should be completely abandoned in any form. Your application must interpret the data, somehow constrain its form -- this can be done by organizing documents into collections, by making models with validation methods -- but this is now the application's job.
Maybe you should read this:
http://books.couchdb.org/relax/getting-started
I have only just heard about it myself; it's interesting, but I have no idea how to implement it in a real-world application ;)
One thing you can try is getting a copy of Firefox and Firebug and playing with the map and reduce functions in JavaScript. They're actually quite cool and fun, and appear to be the basis of how things get done in CouchDB.
Here's Joel's little article on the subject: http://www.joelonsoftware.com/items/2006/08/01.html
In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, IP, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (for various reasons), and this placed a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface where the user gets their answer ASAP) instead of in batch mode?
If not, what is the appropriate technology for doing something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project, "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it into appropriate HBase qualifiers. For more detail, visit the site below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
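As a hedged illustration of the pre-aggregation idea (not of the linked implementation), the batch layer might run something like the following, and then store each resulting row in HBase under a row key and qualifiers derived from the dimension values. Table and column names are hypothetical:

-- Pre-aggregate once in batch; serve point lookups from HBase afterwards.
SELECT
    log_date,
    pagename,
    browser,
    SUM(hits)  AS hits,
    SUM(bytes) AS bytes
FROM raw_events
GROUP BY log_date, pagename, browser;
-- Each output row could then be written to HBase with a row key such as
-- log_date#pagename and one qualifier per metric.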
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like what you described.