Hi guys, I need a nudge in the right direction, please.
I have a final-year project that requires me to use a web crawler to populate a NoSQL database (MongoDB, at the request of the lecturer). We then have to filter/merge/sort (for lack of a better understanding?) the NoSQL data into a relational database that we can use to generate reports with a dashboard.
That is a lot of words with very little guidance. I have spent hours looking into this subject and I'm nowhere near an answer.
So I'm calling on the collective minds of a bunch of IT pros to help me and 5 other guys finish our degree.
We have a reasonable understanding of how crawlers work and should be able to populate MongoDB with that information (we are crawling realtor websites for information about houses and apartments).
I have a very good relational database background, as we spent a lot of time in that area.
The problem I am running into is that I do not get how we can sort the NoSQL data into the Oracle database that we set up, and I'm running out of time.
Any suggestions?
Thanks guys!
Can I find any resources, such as a PDF or user guide, for learning Vertica DB?
As I am a beginner in Vertica, I am also looking for the factors that affect performance while loading data.
All of the documentation is posted publicly on my.vertica.com. Data-load performance depends on many factors; you should probably start with Bulk-Loading Data and then review the many COPY parameters. For a general beginner introduction to Vertica, see Getting Started.
I have a project to deal with, and I'm asking for help.
My project is in the field of business intelligence and creating data warehouses.
I have extracted the data that I need (ETL); what should I do next?
I am working with MS SQL Server 2014.
How do I create my dimensions and my fact table?
Looking for advice.
Please do accept my salutations.
This is a big question! Unfortunately, Stack Overflow's Q&A format isn't the best place to answer this. But here are a few pointers:
Everything starts with the requirements. Before you write any code, figure out exactly what your data warehouse will be used for (it can also be helpful to work out what your data warehouse will not be used for).
Analyse the raw data. Make sure you know what is and is not available. Be aware of the source systems' shortcomings. Example: If your reports need to split your customers by country, is this data available? If so, is it consistently populated (or do some records have US, others USA, others still America)? Make a plan for dealing with these issues (see data cleansing below).
Prototype your data model. Excel and Power BI are great places to test the design. Once you start using a database it becomes much harder to change. Get it right at the very beginning and your life will be much easier.
Pick an ETL tool. Make sure you understand it, and it plays to the strengths of you and your team. I like SSIS.
Import the raw data into staging tables. This can help to simplify the analysis phase.
Cleanse the data. In a data warehouse, you have 100% control over every row, column and cell. Make use of this fact. Ensure only quality, useful, well-conformed data makes it into your published tables.
As with all projects, planning and administration are key. Writing code and building tables comes last.
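To make the country example from the analysis and cleansing steps above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the alias mapping, the row shape and the function names are all hypothetical, and in practice you would probably do this inside your ETL tool rather than in a script.

```python
# Hypothetical alias mapping; build the real one from what the raw data contains.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "america": "US",
    "united states": "US",
}


def normalise_country(raw):
    """Return a conformed country code, or None if the value is not recognised."""
    if raw is None:
        return None
    key = raw.strip().lower().replace(".", "")
    return COUNTRY_ALIASES.get(key)


def cleanse(staging_rows):
    """Split staging rows into conformed rows and rejects that need review."""
    clean, rejects = [], []
    for row in staging_rows:
        country = normalise_country(row.get("country"))
        if country is None:
            rejects.append(row)  # keep the original row for manual inspection
        else:
            clean.append({**row, "country": country})
    return clean, rejects


# Made-up staging data showing the inconsistent spellings.
staging = [
    {"customer": "A", "country": "US"},
    {"customer": "B", "country": "usa "},
    {"customer": "C", "country": "America"},
    {"customer": "D", "country": "Narnia"},
]
clean, rejects = cleanse(staging)
```

Only the conformed rows would go on to the published tables; the rejects are kept aside so that someone can decide whether the mapping needs extending.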
Here are some resources which should help you:
Kimball Group. Ralph Kimball literally wrote the book on data warehousing (see next tip). His company's website contains a few hints and tips.
If you cannot attend a training course, buy a good book. I'd recommend this one. It's a big subject. Blogs and the internet can only teach you so much.
Download and try out Adventure Works DW. This is a sample data warehouse and ETL package, built by Microsoft. It demonstrates some of the techniques you can use in SSIS.
What is considered the best way to save and use views, SPs and UDFs for SSRS reports that will be used by many users, with some reports being sent out via subscription?
Do I:
Write the results to a table overnight via scheduled jobs, so that the report does a direct read of the pre-saved query results?
Use an SP with indexed temp tables based on each view's SQL, to have it all in one place for SSRS?
If the answer is that 'it depends on what I want', I would be grateful if you could point me to any resources that give me an idea of the ideal setup for getting query data to SSRS with minimal performance issues.
Thanking you kindly
Background/Explanation
SQL Server is not foreign to me, but with 1 year of experience I don't consider myself experienced enough to have developed good 'etiquette' when it comes to crafting the parts of SQL Server. I feel I'm developing a lot of bad habits formed from using basic SQL knowledge, online searches and the odd MS SQL Server course. The amount of searching I have done has been endless, and I'm not saying there isn't an answer out there for each part of SQL Server (UDFs, SPs and views).
The company I work for has many servers and many databases, for the many outsourced front-end systems being used. The issue is performance, and the more I search, the more I realize that the setup of our databases may be completely negligent and amateur. When I joined, the setup used a lot of views. Each 'end' view had a dependency tree of more than 4 views, including use of functions, with each view doing anything from aggregate calculations for statistics to rearranging via pivots and unpivots. The reason given to me was so that we can pick out which view has gone wrong. To no surprise, the server has now suddenly had enough of this and peaks at 100% every time a report or view is run, affecting the front-end systems' performance for the users.
My PP stresses my frustration and my position with the company (code monkey) in finding an answer myself, which has resulted in me pushing the keys back into the keyboard with opposable thumbs and appealing to the experts here.
This question is really too broad for Stack Overflow. I'll try to give you a quick overview of what I think you're asking, but really you're asking for way too much for a single answer here. This site is mainly focused on solving a specific problem and not the general process of development. I expect someone will probably come along and close your question.
Nightly table loads
Depending on the complexity of the task, this is exactly what SSIS (SQL Server Integration Services) is for. You can build automated processes that do data transformations and data loads. It is used to build maintainable data integration solutions. Learning to use SSIS (especially properly) is a whole task in itself, though. In fact, the 3rd exam for the SQL Server 2012 MCSA is exclusively about SSIS. If your table loads are not that complicated, though, running them as SQL Tasks could be just as effective.
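As a rough illustration of the "write it to a table overnight" idea, here is a minimal sketch of the pattern. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; on SQL Server the same statements would more likely live in a SQL Agent job or an SSIS package. The table and column names (sales, report_sales_by_region) are made up for the example.

```python
import sqlite3


def refresh_report_table(conn):
    """Rebuild the pre-aggregated table that the report reads directly."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM report_sales_by_region")
        conn.execute(
            """
            INSERT INTO report_sales_by_region (region, total_amount, order_count)
            SELECT region, SUM(amount), COUNT(*)
            FROM sales
            GROUP BY region
            """
        )


# Hypothetical source and report tables with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount REAL);
    CREATE TABLE report_sales_by_region (
        region TEXT PRIMARY KEY, total_amount REAL, order_count INTEGER
    );
    INSERT INTO sales VALUES ('North', 10.0), ('North', 5.0), ('South', 7.5);
    """
)

refresh_report_table(conn)  # this is the step a scheduled job would run nightly
rows = conn.execute(
    "SELECT region, total_amount, order_count "
    "FROM report_sales_by_region ORDER BY region"
).fetchall()
```

The report then only ever runs a cheap SELECT against report_sales_by_region, so the expensive aggregation happens once per night instead of once per report execution. The trade-off is that the data is only as fresh as the last load.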
Database structure and use of views/SPs/functions/etc
This is an incredibly deep subject and it is totally dependent on what you're trying to do, how your data is structured, what kind of hardware you've got running, etc. Certainly using views, functions, and stored procedures can be good. They enable code re-use and allow you to encapsulate the logic for SSRS reports away from the actual report writers.
However, the SQL needs to be well written or it will suffer from performance problems. But, of course, that is just how it is no matter where you put the code. Even if the SQL is just a dataset in an SSRS report it will run slowly and hammer the server if it isn't written well. If the database isn't configured correctly it can have terrible performance. Indexes and other techniques for speeding up databases will always be important.
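To show what an index changes, here is a small sketch that again uses Python's sqlite3 module as a stand-in (SQL Server has its own execution plan tools, which look different). The orders table and the index name are hypothetical. It asks the engine for its query plan before and after creating an index on the filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


sql = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))  # no usable index: the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # the plan now mentions the new index

print(plan_before)
print(plan_after)
```

The exact plan wording depends on the SQLite version, so treat the printed text as indicative only. The point is that the same SQL can go from reading every row to a targeted lookup without the report itself changing.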
Above all everything needs to be documented so that someone else (or your later self) can make sense of it in ten months when something breaks.
Training
I would highly recommend trying to convince your employer to send you on some courses to learn SQL Server if they expect you to be developing complex database solutions. Certainly taking the courses to get your MCSA in SQL Server 2012 would be very useful. Getting the certification certainly opened my eyes to many possibilities for achieving things that I didn't know about before or just hadn't thought of.
The first exam will cover writing SQL queries and the different things that can help performance and the many cool features that you can leverage when retrieving or writing data. The second exam will cover database server administration, troubleshooting, and some performance tuning. The third exam is all about SSIS and how to warehouse your data to enable better analysis and reporting.
Even if you just read the Microsoft Learning books for these exams and never take the tests, you will gain a lot of knowledge. There are other good books too, such as T-SQL Fundamentals by Itzik Ben-Gan, but ultimately it sounds like you need much deeper knowledge of the SQL Server platform before you can really make good design decisions about how to implement your solutions.
Conclusion
In the end, programming is programming. Trying to make a maintainable solution that works is your first goal. Tuning the performance of the system comes after that. The specifics of the languages and platforms don't take away from any of that. But in order to get the best performance out of a system you need knowledge about that system. An answer on here isn't going to be able to give you everything you need to know.
I have read a lot of blogs and articles on how different types of industries are using big data analytics. But most of these articles fail to mention:
What kind of data these companies used, and what the size of the data was.
What kind of tools and technologies they used to process the data.
What problem they were facing, and how the insight they got from the data helped them to resolve the issue.
How they selected the tool or technology to suit their needs.
What kind of patterns they identified in the data, and what kind of patterns they were looking for in the data.
I wonder if someone can provide answers to all these questions, or a link that answers at least some of them.
It would be great if someone could share how the finance industry is making use of big data analytics.
Your question is very broad, but I will try to answer from my own experience.
1 - What kind of data did these companies use?
One of the strengths of Hadoop is that you can use a very wide range of sources for your data. It can be .csv / .txt files, JSON, MySQL, photos, videos ...
It can contain data about marketing, social networks, server logs ...
What was the size of the data?
There are no rules about that. It can range from 50-60 GB to 1 PB, depending on the data and the company.
2 - What kind of tools and technologies did they use to process the data?
No rules about that either. It depends on the needs. To organize and process data, they use Hadoop with Hive and Pig. To query data, they want short response times, so they use a NoSQL / in-memory database with a smaller dataset (refined by Hadoop). In some cases, companies use an ETL tool like Talend in order to go faster.
3 - What was the problem they were facing, and how did the insight they got from the data help them to resolve the issue?
The main issue for companies is the growth of their data. At some point, the data becomes too big to process with traditional tools like MySQL, so they start to use Hadoop, for example.
4 - How did they select the tool or technology to suit their needs?
I think it's an internal matter. Companies choose their tools based on the price of the licence, their own skills, their final needs ...
5 - What kind of patterns did they identify in the data, and what kind of patterns were they looking for?
I don't really understand this question.
Hope this helps.
I think getting what you want is a difficult job: you have to gather the data little by little from different sources. Just make sure to visit these links:
A bunch of free reports. I am studying the list right now.
http://www.oreilly.com/data/free/
And the famous McKinsey report:
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
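To illustrate the "many-to-one relationships within a single row" idea, here is a toy model of a wide row in plain Python. It is not the HBase client API, just dictionaries shaped like the storage model; the row key, the column families (info, order) and the put/get helpers are all made up for the example.

```python
from collections import defaultdict

# Toy model of a column-oriented table: row key -> {"family:qualifier": value}.
# Rows are sparse: each row only stores the columns it actually has.
table = defaultdict(dict)


def put(row_key, family, qualifier, value):
    """Store one cell under a row key, like adding one column to a wide row."""
    table[row_key][f"{family}:{qualifier}"] = value


def get(row_key, family=None):
    """Fetch one row, optionally restricted to a single column family."""
    row = table.get(row_key, {})
    if family is None:
        return dict(row)
    prefix = family + ":"
    return {k: v for k, v in row.items() if k.startswith(prefix)}


# In an RDBMS this would be a users table plus an orders table and a join.
# Here the "many" side lives as extra columns on the single user row.
put("user#42", "info", "name", "Ada")
put("user#42", "order", "2024-01-05", "book")
put("user#42", "order", "2024-02-11", "lamp")

orders = get("user#42", "order")
```

One lookup by row key returns a single row with many datapoints, which is the access pattern described above: few rows, potentially lots of columns.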
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and HBase and, as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't got further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like an RDBMS. So often, in fact, that I've had to rewrite code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
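Here is a tiny single-process sketch of the "don't emit the values you don't want" idea, in plain Python rather than real Hadoop code. The record fields (city, price, status) and the helper names are invented for the example. In a real job the mapper would run next to the data on each node, which is the whole point.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every record and stream out its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def reduce_phase(pairs, reducer):
    """Group values by key, then collapse each group with the reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


def active_price_by_city(record):
    # The filter lives in the mapper: unwanted rows simply emit nothing,
    # so they never travel to the reducer.
    if record["status"] == "active":
        yield record["city"], record["price"]


records = [
    {"city": "Oslo", "price": 100, "status": "active"},
    {"city": "Oslo", "price": 250, "status": "sold"},
    {"city": "Oslo", "price": 180, "status": "active"},
    {"city": "Bergen", "price": 90, "status": "active"},
]

# Roughly: SELECT city, MAX(price) ... WHERE status = 'active' GROUP BY city
result = reduce_phase(map_phase(records, active_price_by_city), max)
```

Compared with pulling every row back to the client and skipping the ones you don't need in a loop, the unwanted rows are dropped where they are read.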
Regards,
David