How to serve data from HDFS fast/in realtime in my services? - hadoop

For now, in my company, every team that needs to serve data from HDFS to users creates own tool for that task.
We want to create a generic tool for fast/in real-time serving that data through HTTP from HDFS to my services. By generic I mean the tool should serve data only for my selected services added to the configuration and that should be the only action that users have to perform to use this generic tool. This new tool should be informed about new data that appeared in HDFS and then invoke some kind of job that moves data from HDFS to our fast storage.
Applications can update their data every day or every hour but every service can do it at different times (service A can be updated every day at 7 AM and service B can be updated every hour). I think we do not want to use any schemas and want to access our data using the only key and partition date. Querying is not necessary.
We do not know yet how much capacity or read/writes per second our tool needs to withstand.
We’ve worked some solution for our problem but we are interested if there are similar solutions in open source already or maybe any of you had a similar use case?
This is our proposal of architecture:
architecture

If you need to access HDFS over HTTP then WebHDFS might fit your use case. You could add a caching layer to speed up requests for hot files, but I think as long as you are using HDFS you'll never get sub-second response for any file that isn't already cached. You must decide if that is acceptable for you.
I'm unsure of how well WebHDFS works with huge files.
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

Related

When using asp.net MVC core + EF core + ability to encrypt files. What will be the differences between Blob, FileStream & File System, to manage files

I am working on an asp.net core mvc web application, the web application is a document management workflow. where inside each of the workflow steps users can upload documents, as follow:-
users can upload documents with the following restriction; a file can not exceed 5 MB + all the documents inside a workflow can not exceed 50 MB, unless admin approves it. they can upload as many documents as they want.
we will have lot of views which will show the step and all its documents attached to it, and users can chose to download the documents.
we can have unlimited number of workflows. as the more users register with our application the more workflow will be created.
certain files can be marked as confidential, so they should be encrypted when storing them either inside the database or inside the file system.
we are planning to use EF core as the data access layer for our web application + SQL server 2016 or 2017.
now my question is how we should manage our files, where i found these 3 approaches.
Blob.
FileStream
File system.
now the first approach, will allow us to encrypt the files inside the database + will work with EF. but it will have a huge drawback on performance, since opening a file or querying the files from database means they will be loaded inside the hosting server memory. so since we are seeking for an extensible approach, so i think this approach will not work for us since it is less scalable.
Second approach. will have better performance compared to first approach (Blob), but FileStream are not supported with EF + does not allow encryption. so we have to exclude this also.
third approach. of storing the files inside a folder which have the workflow ID + store the link to the file/folder inside the DB. will allow us to encrypt the files + will work with EF. and have a better performance compared to Blob (not sure if this is valid for FileStream). the only drawback, is that we can not achieve Atomic-ity between the files and their related records inside the database. but with adding some code we can handle this by our-self. for example deleting a database record will delete all its documents inside the folder, and we can add some background jobs to make sure all the documents have database records, other wise to delete the documents..
so based on the above i found that the third approach is the best fit for our need? so can anyone advice on this please? are my assumption correct? and is there a fourth appraoch or a hybrid appraoch that can be a better fit for us?
Although modern RDBMS have been optimised for data storage with the perks of integrity and atomicity, databases should be considered the least most alternative (StackOverflow posts like this and this shall corroborate the above) and therefore the third option mentioned or an improvement thereof shall be the vote.
For instance, a potential improvement would be to store the files renamed to a hash of the content and database the hash which shall eliminate all OS restrictions on subdirectories/files, filenames, and paths. Moreover, with a well structured directory layout duplicates could be filtered out.
The User-defined Database Functions shall aid in achieving atomicity which will efface the need of background jobs. An excellent guide on UDFs particularly for the use of accessing filesytem and invoking an executable can be found here.

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through few technologies where I can automate or set the flow of data between systems and found that there are couple of them i.e. Apache NiFi and StreamSets ( to my knowledge ). What I couldn't understand is the difference between them and use-cases where they can be used? I am new to this and if anyone can explain me a bit would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well and that is 'Dataflow Management'. It's design is based on a concept called Flow Based Programming which you may want to read about and reference for your project 'https://en.wikipedia.org/wiki/Flow-based_programming'
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you do just that as the data is flowing you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, helpful in-line documentation and more. Almost all other systems by comparison have a model that is design and deploy oriented meaning you make a series of changes and then deploy them. That model is fine and can be intuitive but for the dataflow management job it means you don't get the interactive change by change feedback that is so vital to quickly build new flows or to safely and efficiently correct or improve handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine grained and powerful traceability details for where your data comes from, what is done to it, where its sent and when it is done in the flow. This is essential to effective dataflow management for a number of reasons but for someone in the early exploration phases and working a project the most important thing this gives you is awesome debugging flexibility. You can setup your flows and let things run and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected you can fix the flow and replay the object then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes) or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like size of data involved. The NiFi API was built to honor that fact and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written so again you get really helpful information in establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the Streamsets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms but the reality is when data is in the flow then it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high speed tiny things or you have large things and whether they came from a live audio stream off the Internet or they come from a file sitting on your harddrive it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally to some special NiFi format nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that because what this means is that even the most trivial of cases would have problematic performance implications and luckily NiFi does not have that problem. Further had we gone that route then it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to checkout:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions the users#nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While both have a lot of similarities such as a web-based ui, both are used for ingesting data there are a few key differences. They also both consist of a processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC) however, takes a record based approach. What this means is that as data enters your pipeline it (whether its JSON, CSV, etc) it is parsed into a common format so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, and also a clustered mode, but it runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NIFI(HDP) is more mature and StreamSets is more lightweight.
Both are easy to use, both have strong capability. And StreamSets could easily
They have companies behind, Hortonworks and Cloudera.
Obviously there are more contributors working on NIFI than StreamSets, of course, NIFI have more enterprise deployments in production.
Two of the key differentiators between the two IMHO are.
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process which ensures software quality. StreamSets, is Apache LICENSED, meaning anyone can reuse the code, etc. But the project is not managed as an Apache project. In fact, in order to even contribute to Streamsets, you are REQUIRED to sign a contract. https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer. https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have." which imposes a bit of restriction if you want to deploy your dataflows further toward the Edge where the Devices that are generating the data live. Apache MiniFi, a sub-project of NiFi can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Designing an Analytics System with Hadoop

I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occurs in my site.
So I'm planning to have code (both front and back end) to trigger some events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that would write the data continuously to HDFS. Then, I can run a map reduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this, from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use this across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
If you mention EMR, it's takes input from a folder in s3 storage, so you can use your preffered language library to push data to s3 to analyse it later with EMR jobs. For example, in python one can use boto.
There are even drivers allowing you to mount s3 storage as a device, but some time ago all of them were too buggy to use them in production systems. May be thing have changed with time.
EMR FAQ:
Q: How do I get my data into Amazon S3? You can use Amazon S3 APIs to
upload data to Amazon S3. Alternatively, you can use many open source
or commercial clients to easily upload data to Amazon S3.
Note that emr (as well as s3) implies additional costs, and that it's usage is justified for really big data. Also note that it is always benefical to have relatively large files both in terms of Hadoop performance and storage costs.

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services or just using database stored procedures. Obviously, performance and stability is a major issue that I am dealing with. Even when the endpoints are performing at their best, for large sets of data, I easily see calls that take 10s of seconds.
So, I am trying to improve the performance of my application by prefetching the data and storing locally - so that at least the read operations are fast.
While my application is the major consumer and producer of data, some of the data can change from outside my application too that I have no control over. If I using caching, I would never know when to invalidate the cache when such data changes from outside my application.
So I think my only option is to have a job scheduler running that consistently updates the database. I could prioritize the users based on how often they login and use the application.
I am talking about 50 thousand users, and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the schedular becoming a single point of failure?
I am just looking for something that doesn't require high maintenance, and speeds at least some of the lesser complicated subsystems - if not most. Any suggestions?
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need realtime access to the most up to date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008

Images in load balanced environment

I have a load balanced enviorment with over 10 web servers running IIS. All websites are accessing a single file storage that hosts all the pictures. We currently have 200GB of pictures - we store them in directories of 1000 images per directory. Right now all the images are in a single storage device (RAID 10) connected to a single server that serves as the file server. All web servers are connected to the file server on the same LAN.
I am looking to improve the architecture so that we would have no single point of failure.
I am considering two alternatives:
Replicate the file storage to all of the webservers so that they all access the data locally
replicate the file storage to another storage so if something happens to the current storage we would be able to switch to it.
Obviously the main operations done on the file storage are read, but there are also a lot of write operations. What do you think is the preferred method? Any other idea?
I am currently ruling out use of CDN as it will require an architecture change on the application which we cannot make right now.
Certain things i would normally consider before going for arch change is
what are the issues of current arch
what am i doing wrong with the current arch.(if this had been working for a while, minor tweaks will normally solve a lot of issues)
will it allow me to grow easily (here there will always be a upper limit). Based on the past growth of data, you can effectively plan it.
reliability
easy to maintain / monitor / troubleshoot
cost
200GB is not a lot of data, and you can go in for some home grown solution or use something like a NAS, which will allow you to expand later on. And have a hot swappable replica of it.
Replicating to storage of all the webservers is a very expensive setup, and as you said there are a lot of write operations, it will have a large overhead in replicating to all the servers(which will only increase with the number of servers and growing data). And there is also the issue of stale data being served by one of the other nodes. Apart from that troubleshooting replication issues will be a mess with 10 and growing nodes.
Unless the lookup / read / write of files is very time critical, replicating to all the webservers is not a good idea. Users(of web) will hardly notice the difference of 100ms - 200ms in loadtime.
There are some enterprise solutions for this sort of thing. But I don't doubt that they are expensive. NAS doesn’t scale well. And you have a single point of failure which is not good.
There are some ways that you can write code to help with this. You could cache the images on the web servers the first time they are requested, this will reduce the load on the image server.
You could get a master slave set up, so that you have one main image server but other servers which copy from this. You could load balance these, and put some logic in your code so that if a slave doesn’t have a copy of an image, you check on the master. You could also assign these in priority order so that if the master is not available the first slave then becomes the master.
Since you have so little data in your storage, it makes sense to buy several big HDs or use the free space on your web servers to keep copies. It will take down the strain on your backend storage system and when it fails, you can still deliver content for your users. Even better, if you need to scale (more downloads), you can simply add a new server and the stress on your backend won't change, much.
If I had to do this, I'd use rsync or unison to copy the image files in the exact same space on the web servers where they are on the storage device (this way, you can swap out the copy with a network file system mount any time).
Run rsync every now and then (for example after any upload or once in the night; you'll know better which sizes fits you best).
A more versatile solution would be to use a P2P protocol like Bittorreent. This way, you could publish all the changes on the storage backend to the web servers and they'd optimize the updates automatcially.

Resources