Can I find any resources, like a PDF or user guide, for learning Vertica DB?
I am a beginner in Vertica, and I am also interested in the performance factors that affect data loading.
All of the documentation is posted publicly on my.vertica.com. Data-load performance depends on many factors; you should probably start with Bulk-Loading Data and then review the many COPY parameters. For a general beginner introduction to Vertica, see Getting Started.
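For a quick taste of what bulk loading looks like, here is a minimal COPY sketch (the table name, file path, and option choices are hypothetical; the full set of options is in the COPY documentation):

```sql
-- Sketch: bulk-load a delimited file into a hypothetical sales table.
-- DIRECT writes straight to disk storage, which usually helps large loads.
COPY sales FROM '/data/sales.csv'
DELIMITER ','
NULL ''
SKIP 1      -- skip the header row
DIRECT;
```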
Actually, I have a project to deal with, and I'm asking for help.
My project is in the field of business intelligence and data warehousing.
I have extracted the data I need (ETL); what should I do next?
I am working with MS SQL Server 2014.
How do I create my dimensions and my fact table?
I'm looking for advice.
Please accept my salutations.
This is a big question! Unfortunately, Stack Overflow's Q&A format isn't the best place to answer it, but here are a few pointers:
Everything starts with the requirements. Before you write any code, figure out exactly what your data warehouse will be used for (it can also be helpful to work out what your data warehouse will not be used for).
Analyse the raw data. Make sure you know what is and is not available, and be aware of the source systems' shortcomings. Example: if your reports need to split your customers by country, is this data available? If so, is it consistently populated (some records have US, others USA, others still America)? Make a plan for dealing with these issues (see data cleansing below).
Prototype your data model. Excel and Power BI are great places to test the design. Once you start using a database it becomes much harder to change. Get it right at the very beginning and your life will be much easier.
Pick an ETL tool. Make sure you understand it and that it plays to the strengths of you and your team. I like SSIS.
Import the raw data into staging tables. This can help to simplify the analysis phase.
Cleanse the data. In a data warehouse, you have 100% control over every row, column and cell. Make use of this fact. Ensure only quality, useful, well-conformed data makes it into your published tables.
Like all projects, planning and administration is the key. Writing code and building tables comes last.
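To make the cleansing point concrete: the country example above could be conformed in the staging area with a statement like this (a sketch; the StagingCustomer table and value list are hypothetical):

```sql
-- Sketch: conform inconsistent country values in a hypothetical
-- staging table before publishing the customer dimension.
UPDATE StagingCustomer
SET Country = 'US'
WHERE Country IN ('USA', 'America', 'United States');
```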
Here are some resources which should help you:
Kimball Group. Ralph Kimball literally wrote the book on data warehousing (see next tip). His company's website contains a few hints and tips.
If you cannot attend a training course, buy a good book. I'd recommend this one. It's a big subject. Blogs and the internet can only teach you so much.
Download and try out Adventure Works DW. This is a sample data warehouse and ETL package, built by Microsoft. It demonstrates some of the techniques you can use in SSIS.
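Finally, to make the dimension/fact question concrete, here is a minimal star-schema sketch in T-SQL. The tables and columns are hypothetical; your own model should come out of the requirements and prototyping steps above:

```sql
-- Dimensions hold descriptive attributes keyed by a surrogate key.
CREATE TABLE DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID   NVARCHAR(20)  NOT NULL,  -- business key from the source system
    CustomerName NVARCHAR(100) NOT NULL,
    Country      NVARCHAR(50)  NOT NULL   -- cleansed: always 'US', never 'USA'/'America'
);

CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,             -- e.g. 20140131
    [Date]   DATE NOT NULL,
    [Year]   INT  NOT NULL,
    [Month]  INT  NOT NULL
);

-- The fact table holds numeric measures plus foreign keys to the dimensions.
CREATE TABLE FactSales (
    DateKey     INT NOT NULL REFERENCES DimDate (DateKey),
    CustomerKey INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    Quantity    INT           NOT NULL,
    SalesAmount DECIMAL(12,2) NOT NULL
);
```

Queries then join the fact table to the dimensions, filter on dimension attributes, and aggregate the measures.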
The idea is to redesign the data structure and/or change the DB.
I have just started to review this project and plan to begin the optimization with it.
Currently I have CouchDB with about 80 GB of document data, around 30M records.
For most documents, properties like id, group_id, location, and type can be considered generic, but unfortunately they are currently stored under different property names across the set. A lot of deeply nested properties can also be found.
The structure isn't strictly defined, which is why a NoSQL DB was selected long before the full picture was clear.
Data is calculated and loaded into the DB by a separate job on a powerful cluster, and this isn't done very often, so I conclude that general write/update performance isn't very important. A size decrease would also be great, but it isn't the top priority. There are only about 1-10 active customers at a time.
Read performance with various filtering, grouping, etc. is what matters most.
No heavy summary calculations need to happen at read time; those are already done during population.
This is a data-analysis tool for displaying comparison and other reports to quality engineers and data analysts, so they can browse the results, group them, or filter them from the web UI.
Right now, tasks like searching a subset of document properties for text aren't possible due to performance.
I have done some initial investigation (e.g. http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf), and Cassandra looks like a good choice among the NoSQL options.
It would also be quite interesting to try porting this data into the new PostgreSQL.
Any ideas would be highly appreciated :-)
Hello, please check the following article:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL's json (and jsonb!) capabilities let you start schema-less while still having transactions, indexes, grouping, and aggregate functions with very good performance, right from the start. And when you're ready (and if needed), you can move to a schema, with an internal data migration.
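As a sketch of what that looks like in practice (the docs table, properties, and the 'Berlin' value are hypothetical; jsonb and GIN indexes need PostgreSQL 9.4+):

```sql
-- Start schema-less: one jsonb column per document.
CREATE TABLE docs (
    id  serial PRIMARY KEY,
    doc jsonb NOT NULL
);

-- A GIN index speeds up containment (@>) queries across all properties.
CREATE INDEX docs_doc_idx ON docs USING gin (doc);

-- Filter and group on document properties in plain SQL.
SELECT doc->>'group_id' AS group_id, count(*) AS cnt
FROM docs
WHERE doc @> '{"location": "Berlin"}'
GROUP BY doc->>'group_id';
```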
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck
I have read a lot of blogs/articles on how different industries are using big data analytics, but most of these articles fail to mention:
What kind of data these companies used, and what the size of the data was.
What kind of tools/technologies they used to process the data.
What problem they were facing, and how the insight they got from the data helped them resolve it.
How they selected the tool/technology to suit their needs.
What kind of patterns they identified in the data, and what kind of patterns they were looking for in the data.
I wonder if someone can answer all these questions, or share a link that at least answers some of them.
It would be great if someone could share how the finance industry is making use of big data analytics.
Your question is very broad, but I will try to answer from my own experience.
1 - What kind of data do these companies use?
One of the strengths of Hadoop is that you can draw on a very wide range of sources for your data: .csv/.txt files, JSON, MySQL, photos, videos...
It can contain data about marketing, social networks, server logs...
What is the size of the data?
There are no rules about that. It can range from 50-60 GB up to 1 PB, depending on the data and the company.
2 - What kind of tools/technologies do they use to process the data?
No rules about that either; it depends on the needs. To organize and process data, they use Hadoop with Hive and Pig. To query data, they want short response times, so they use a NoSQL or in-memory database with a smaller dataset (refined by Hadoop). In some cases, companies use an ETL tool like Talend in order to go faster.
3 - What problem were they facing, and how did the insight they got from the data help resolve it?
The main issue for companies is the growth of their data. At some point, the data become too big to process with traditional tools like MySQL, so they start to use Hadoop, for example.
4 - How do they select the tool/technology to suit their needs?
I think it's an internal matter. Companies choose their tools based on the price of the licence, their own skills, their final needs...
5 - What kind of patterns did they identify in the data, and what kind of patterns were they looking for?
I don't really understand this question.
Hope this helps.
I think getting what you want is a difficult job; you'll have to gather data little by little from different sources. Just make sure to visit these links:
A bunch of free reports (I am studying the list right now):
http://www.oreilly.com/data/free/
and the famous McKinsey Report:
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
I want to test some performance-tuning techniques on a realistic database with many tables and a lot of data. I would like to do this in Oracle 11g Release 1 and would like to know how best to go about it, or whether there is a website where I could get realistic datasets/databases.
Many thanks for your audience.
Cheers,
Tunde
Good timing. There was a blog entry today on generating the data for the TPC-H benchmark.
Maybe you can choose something from this.
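Once the data is loaded, the benchmark's own queries give you a ready-made tuning workload. As a sketch, here is a simplified form of TPC-H Q1 (the pricing summary report) against the LINEITEM table that dbgen produces; the real benchmark query has a few more aggregates:

```sql
-- Simplified TPC-H Q1: a heavy scan-and-aggregate workload,
-- useful for exercising Oracle's optimizer and parallel query settings.
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)                         AS sum_qty,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       AVG(l_quantity)                         AS avg_qty,
       COUNT(*)                                AS count_order
FROM   lineitem
WHERE  l_shipdate <= DATE '1998-12-01' - 90    -- Oracle date arithmetic: minus 90 days
GROUP  BY l_returnflag, l_linestatus
ORDER  BY l_returnflag, l_linestatus;
```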
I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and Hbase and as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't got further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed data store) and which implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables, which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their DB API, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
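To make that concrete, here's a sketch of the Hive side of that advice (the pageviews table and its columns are hypothetical): instead of iterating over rows in client code, you declare the result you want, and Hive compiles it into map/reduce jobs that run on the cluster, next to the data.

```sql
-- Hypothetical HiveQL: the GROUP BY below becomes map/reduce jobs,
-- so the computation ships to the data rather than the reverse.
CREATE TABLE pageviews (user_id STRING, url STRING, ts BIGINT);

SELECT user_id, COUNT(*) AS views
FROM pageviews
GROUP BY user_id;
```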
Regards,
David