I am looking for any technical paper explaining how access control is enforced on unstructured data ingested into HDFS.
Can the granularity level be smaller than POSIX-ish file permissions?
Similarly, how would products like RecordService (from Cloudera), which provide an abstraction layer for security on storage components, work on unstructured data?
For instance, if I have a very large email archive file (more than a terabyte), would I be able to specify a more fine-grained ACL than one on the entire file itself? I am thinking about email headers, etc.
The supported granularity goes down to the row and column level; details below.
Presently, for RecordService to work, your data must be organized as Hive Metastore tables. In the future, RecordService may infer structure/schema from the files themselves, but that is not the case today.
We have billions of records in a relational format (e.g. transaction id, user name, user id and some other fields). My requirement is to build a system where a user can request a data export from this store (the user will provide some filters such as user id, date and so on). A typical exported file will contain anywhere from a thousand to hundreds of thousands to millions of records, depending on the selected filters (the output file will be CSV or a similar format).
Other than the raw data, I am also looking for some dynamic aggregation on a few of the fields during the export.
The typical time between the user submitting a request and the exported data file being available should be within 2-3 minutes (4-5 minutes at most).
I am seeking suggestions on backend NoSQL stores for this use case. I've used Hadoop MapReduce so far, but in my opinion a typical Hadoop batch MapReduce job over HDFS data might not meet the expected SLA.
Another option is Spark, which I've never used, but it should be much faster than a typical Hadoop MapReduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it clearly isn't the right option given the size of the data we are exporting and the dynamic aggregation.
Any suggestions on using Spark here, or on any other better NoSQL option?
In summary, the SLA, dynamic aggregation and raw data export (millions of records) are the key requirement considerations here.
If the system only needs to export data after doing some ETL (aggregations, filtering and transformations), then the answer is very straightforward: Apache Spark is the best fit. You would have to fine-tune the system and decide whether you want to use memory only, memory + disk, serialization, etc. However, most of the time one needs to think about other aspects too; I am considering them as well.
This is a broad topic and it involves many aspects, such as the aggregations involved, search-related queries (if any), and development time. As per the description, it seems to be an interactive/near-real-time system. Another aspect is whether any analysis is involved, and another important point is the type of system (OLTP/OLAP, reporting only, etc.).
I see there are two questions involved:
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We are using it for the same purpose; along with the filtering, we also have XML transformations to perform, which are also done in Spark. It is very fast compared to Hadoop MapReduce. Spark can run standalone, and it can also run on top of Hadoop.
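As a rough illustration of such an export pipeline, here is a minimal sketch using the Spark Java API. The input path, output paths and the column names (userId, txnDate, amount) are hypothetical placeholders, not part of the original question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class ExportJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("filtered-export")
                .getOrCreate();

        // Hypothetical source: Parquet files holding the transaction records.
        Dataset<Row> txns = spark.read().parquet("hdfs:///data/transactions");

        // Apply the user-supplied filters (user id + date range) ...
        Dataset<Row> filtered = txns
                .filter(col("userId").equalTo("u123"))
                .filter(col("txnDate").between("2016-01-01", "2016-01-31"));

        // ... plus a dynamic aggregation alongside the raw export.
        Dataset<Row> perDay = filtered
                .groupBy(col("txnDate"))
                .agg(sum("amount").alias("dailyTotal"));

        // Write both results as CSV for the requesting user.
        filtered.write().option("header", "true").csv("hdfs:///exports/u123/raw");
        perDay.write().option("header", "true").csv("hdfs:///exports/u123/aggregated");

        spark.stop();
    }
}
```

Whether intermediate results live in memory only or memory + disk (the tuning mentioned above) can then be controlled with Dataset.persist() and an appropriate StorageLevel.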
- Storage -
There are many NoSQL solutions available. The selection depends on many factors, such as data volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, with HDFS as the storage system. It has many benefits, as you get the entire Hadoop ecosystem. If you have analysts/data scientists who need to get insights from the data or play with the data, this would be the better choice, as you would get tools such as Hive/Impala. Also, resource management would be easy. But for some applications it can be too much.
Cassandra - Cassandra is a storage engine that solves the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example when performing complex aggregations. By the way, we are using it. For visualization (viewing data for analysis), options include Apache Zeppelin and Tableau (there are lots of options).
Elasticsearch - Elasticsearch is also a suitable option if your storage is in the range of a few TB up to 10 TB. It comes with Kibana (a UI), which provides limited analytics capabilities, including aggregations. Development time is minimal; it is very quick to implement.
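For comparison, a filtered query with an aggregation in Elasticsearch could look roughly like the sketch below, using the high-level REST client; the index name transactions and the fields userId/amount are invented for illustration.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class EsAggregationSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Filter on a user id and sum the "amount" field in one request.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("userId", "u123"))
                    .aggregation(AggregationBuilders.sum("total").field("amount"));

            SearchRequest request = new SearchRequest("transactions").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response);
        }
    }
}
```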
So, depending on your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider other technologies for storage and data visualization.
I am wondering, if such large datasets are used in Hadoop MapReduce, what data structures does Hadoop use? If possible, could somebody please give me a detailed view of the underlying data structures in Hadoop?
HDFS is the default underlying storage platform of Hadoop.
It's like any other file system in the sense that it does not care what structure the files have. It only ensures that files are saved in a redundant fashion and are available for quick retrieval.
So it is totally up to you, the user, to store files with whatever structure you like inside them.
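To make that concrete, here is a minimal sketch using the HDFS Java FileSystem API; the NameNode address and the file path are placeholders.

```java
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             OutputStream out = fs.create(new Path("/data/notes/hello.txt"))) {
            // HDFS just stores the bytes; it imposes no structure on them.
            out.write("any bytes, in any format".getBytes("UTF-8"));
        }
    }
}
```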
A MapReduce program simply gets the file data fed to it as input: not necessarily the entire file, but parts of it (splits), depending on the InputFormat, etc. The map program can then make use of the data in whatever way it wants.
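As an illustration of how a map program sees that data, here is a minimal mapper sketch assuming the default TextInputFormat, where each call receives one line of the file; the CSV layout assumed in it is made up.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the framework hands the mapper one line at a time:
// the key is the byte offset of the line, the value is the line itself.
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical CSV layout: treat the first field as a grouping key.
        String[] fields = line.toString().split(",");
        if (fields.length > 0) {
            context.write(new Text(fields[0]), ONE);
        }
    }
}
```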
Hive, on the other hand, deals with TABLES (columns/rows), and you can query them in an SQL-like fashion using HiveQL.
Thanks to all of you
I got the answer to my question. The underlying HDFS uses blocks as its storage units, together with network streaming concepts; a detailed description is given in the book mentioned below.
All the details are available in the third chapter of Hadoop: The Definitive Guide.
I am trying to convert an application that has a relational database as its backend. Can I store the data relationally in HDFS as well?
Just for the sake of storing, you can store anything in HDFS, but that won't make much sense. First of all, you should not think of Hadoop as a replacement for your RDBMS (which is what you are trying to do here). Both are meant for totally different purposes. Hadoop is not a good fit for transactional, relational or real-time kinds of needs; it was meant to serve your offline batch processing needs. So it's better to analyze your use case properly and then freeze your decision.
As a suggestion, I would like to point you to Hive. It provides warehousing capabilities on top of your existing Hadoop cluster. It also provides an SQL-like interface to your warehouse, which will make your life much easier if you are coming from an SQL background. But again, Hive is also a batch processing system and is not a good fit if you need something real-time.
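As a rough sketch of what that SQL-like interface looks like from Java, via Hive's JDBC driver for HiveServer2 (the connection URL, user, table and column names here are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port and database are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```

Keep in mind the query still runs as a batch job underneath, so it will not feel like an OLTP database.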
You can have a look at HBase though, as suggested by abhinav. It's a DB that can run on top of your Hadoop cluster and provides you random, real-time read/write access to your data. But you should keep one thing in mind: it's a NoSQL DB, and it doesn't follow SQL terminology and conventions, so you might find it a bit alien initially. You might have to think about issues like how to store your data in a new storage style (columnar), unlike the row-style storage of your RDBMS. Otherwise it's not a problem to set up and use.
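To give a feel for that columnar, random-access style, here is a minimal HBase client sketch; the table, column family, column and row key are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user42", column family "info", column "email".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("someone@example.com"));
            table.put(put);

            // Random read of that single row; no full-table scan involved.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```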
HTH
Any file can be stored in HDFS. But if you want a table-like database you should go for HBase. If you store your data directly in HDFS you will not be able to preserve its relational structure.
Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model, or build an aggregate before working on MR jobs? Basically, I'm looking at ways of storing related data together.
Most things I have found by searching are about storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data + schema in the same file, yet it is compact and efficient for fast serialization. It gives you versioning (schema evolution) facilities, which are useful when bringing in updated data with a different schema. Hive supports it for both reading and writing, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
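To give a flavour of what that looks like in practice, here is a minimal sketch that defines a schema and writes one record with Avro's generic API; the record type and field names are invented for illustration.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema travels with the data; these fields are just an example.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Customer\","
                + "\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        record.put("name", "Alice");

        // The resulting container file embeds the schema and is splittable,
        // so MapReduce and Hive can read it directly.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("customers.avro"));
            writer.append(record);
        }
    }
}
```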
Storing these files as CSV is fine, since you will be able to process them using the text input/output formats and could also read them through Hive using a specific delimiter. You could change the delimiter from comma to pipe ("|") if you prefer; that's what I do most of the time. Also, you generally need large files in Hadoop, but if the data is large enough that you can partition it, and each partition is in the range of a few hundred gigabytes, then it would be good to split these files into separate directories based on your partition column.
Also, it is a better idea to have most of the columns in a single table than to have many small normalized tables, but that varies depending on your data size. Also make sure that whenever you copy, move or create data you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on; you would need to rewrite the complete file even for a small change.
Hive partitioning and bucketing concepts can be used effectively to put similar data together (not on specific nodes, but in files and folders) based on a particular column. There are some nice tutorials available on partitioning and bucketing.
In Hadoop, I can easily create MapReduce apps which access and process data in huge text and CSV files. My question is: can HBase do the same and access such huge files, or does HBase have other uses?
HBase runs queries much like a relational database does, so I have a hard time understanding the advantage of HBase, unless it can access huge text and CSV files just as Hadoop does.
First of all, HBase is just a store, and a store never accesses anything; rather, you access the store to fetch or put data. Like any other datastore, HBase has only one job to do: store your data and make it available to you whenever you need it. You can write MapReduce jobs or sequential Java programs, etc., to put data into HBase or fetch data from it. It's totally up to you which path you prefer.
Coming to the second part of your question, HBase does not work like traditional relational databases at all. Everything, from storing the data to accessing it, is totally different. The advantage of using HBase is that you can store a really huge amount of data in it and have random read/write access. The data can be of any type, viz. text, CSV, TSV, binary, etc. But before going ahead, you must think carefully about whether HBase is a suitable choice for you, as one size doesn't fit all.
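As a rough contrast with scanning whole files in a MapReduce job, the sketch below reads only a row-key range from an HBase table; the table name, key format and range are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRangeScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Only rows whose keys fall in this range are read; the rest of
            // the table is never touched.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("user42#2016-01-01"));
            scan.setStopRow(Bytes.toBytes("user42#2016-02-01"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```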
HTH