HDFS access over Thrift - does HadoopThriftServer still exist?

I have been trying to access HDFS without YARN etc. Since Wikipedia advertises Thrift support, I was under the impression that HDFS/Hadoop comes with a (Java) Thrift server and that you only have to define the client in the language of your choice (Haskell in my case).
I have not yet been successful, and much of the information out there seems to be outdated. For example, I have not been able to procure an official Thrift file for HDFS; the official link here is broken.
Looking deeper on the server side, I found old posts regarding hadoop-thriftfs-0.2 that mention a separate Thrift server startup, sometimes named HadoopThriftServer (here or there).
I have not been able to determine the current state of such a server project (apart from a shady, probably obsolete download). I could of course implement the Thrift server myself using the Java bindings, but since I am under the impression that this should already be done, I would rather not...
(Basis: spec from an old GitHub project)
Is the Thrift support obsolete (maybe discarded in favor of the REST API)? Am I overlooking something obvious? What would be the proper way to use the Hadoop file system without the rest of the infrastructure (in a language other than Java)?
Notes: Most of the Hadoop/Thrift documentation and examples are for HBase etc., not plain HDFS. At the moment I am not interested in HBase; before adding another level to the stack, I would rather go a completely different way.
I am especially interested in the HDFS functions needed for ensuring data locality your own way.
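For reference, the REST API mentioned in the question (WebHDFS) addresses files through URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OP>`, and an `OPEN` request is answered with a redirect to a datanode that can serve the data. The following is only a minimal sketch of building such URLs; the host name, the port and the helper name `webhdfs_url` are assumptions made for illustration, and Python is used purely because it is compact. Whether the REST API exposes enough information for custom data locality would still have to be checked against the documentation of the Hadoop version in use.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&..."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address. The HTTP port depends on the Hadoop version
# and configuration, so 9870 is only a placeholder here.
listing_url = webhdfs_url("namenode.example.org", 9870, "/user/alice", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.org", 9870, "/user/alice/data.csv", "OPEN", offset="0")
print(listing_url)
print(open_url)
```

Any HTTP client (including a Haskell one) can issue these requests, so no Thrift server or generated client code is involved.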

Related

Long Running ETL Process - Background Jobs, Spark, Hadoop

I have a scenario in an application where I:
Have to load data from multiple sources (more than 10)
Mostly deal with HTTP/JSON web services and some FTP sources
Have to process that data and put it into a central database (PostgreSQL)
The current implementation is done in Ruby using background jobs, but I see the following issues with it:
Very high memory usage
Jobs sometimes get stuck without any error report
Horizontal scaling is tricky to set up
In this scenario, can Spark or Hadoop help, or is there a better option?
Please elaborate with some good reasoning.
Update:
As per the comment, I need to elaborate further. Here are the reasons why I thought of Spark or Hadoop:
If we scale up the concurrency of running jobs, that also puts a heavy load on the DB server. I had read, though, that Spark and Hadoop are built to handle such heavy load, even on the DB side.
We can't run more background processes than the number of physical CPU cores (as recommended by the Ruby and Sidekiq community).
Concurrency in Ruby depends on the GIL, so it is not real concurrency. Each job can fetch from only a single data source, and if it gets stuck in an I/O call, that source stays locked by it.
All of the above points are considered to be handled by the built-in architecture of Hadoop and Spark, so I was thinking along the lines of looking into these tools.
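As a side note on the GIL point: in CPython, which also has a global interpreter lock, blocking I/O releases the lock, so I/O-bound downloads can overlap in threads. MRI Ruby is generally described as behaving similarly for blocking I/O, but that is worth verifying for the specific libraries in use. The sketch below is only an illustration in Python, not the Ruby setup from the question; `fetch_source` is a hypothetical stand-in that sleeps instead of doing a real HTTP or FTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_source(name: str, delay: float = 0.2) -> str:
    """Stand-in for an HTTP/FTP download; the sleep simulates blocking I/O."""
    time.sleep(delay)
    return f"{name}: done"


def fetch_all(sources: list[str], max_workers: int = 4) -> list[str]:
    """Fetch all sources with at most max_workers requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_source, sources))


sources = [f"source-{i}" for i in range(8)]
start = time.perf_counter()
results = fetch_all(sources, max_workers=4)
elapsed = time.perf_counter() - start
print(results[0], f"({elapsed:.2f}s for {len(sources)} sources)")
```

Capping `max_workers` is also a simple way to bound the load that concurrent jobs put on the database.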
In my opinion, I would give Pentaho Data Integration (PDI) (or Talend) a try.
They are visual tools designed to solve problems like yours, and they have a free version downloadable from SourceForge (just unzip and run spoon.bat).
They can acquire data from FTP and HTTP (among others), decode JSON, and write to databases like Postgres. PDI has a free plug-in able to run Ruby code out of the box, so you can save on start-up development.
PDI also has ready-made Spark and Hadoop interfaces, so you can bring in your Hadoop/Spark servers transparently at a later stage if you need a more metal solution.
PDI was built for heavy data loads and gives you control over concurrency and remote servers.

Thrift, Avro, Protocolbuffers - Are they all dead?

Working on a pet project (Cassandra, Spark, Hadoop, Kafka), I need a data serialization framework. Checking out the three common frameworks, namely Thrift, Avro and Protocol Buffers, I noticed that most of them seem to be half-dead, having two minor releases a year at most.
This leaves me with two assumptions:
They are as complete as such a framework should be and just rest in maintenance mode as long as no new features are needed.
There is no longer a reason for such frameworks to exist, although it is not obvious to me why. If so, what alternatives are out there?
If anyone could give me a hint regarding my assumptions, any input is welcome.
Protocol Buffers is a very mature framework, having been first introduced nearly 15 years ago at Google. It's certainly not dead: Nearly every service inside Google uses it. But after so much usage, there probably isn't much that needs to change at this point. In fact, they did a major release (3.0) this year, but the release was as much about removing features as adding them.
Protobuf's associated RPC system, gRPC, is relatively new and has had much more activity recently. (However, it is based on Google's internal RPC system which has seen some 12 years of development.)
I don't know as much about Thrift or Avro but they have been around a while too.
The advantage of Thrift compared to Protobuf is that Thrift offers a complete RPC and serialization framework. Plus, Thrift supports 20+ target languages, and that number is still growing. We are about to include .NET Core, and there will be Rust support in the not-so-far future.
The fact that there have not been that many Thrift releases in recent months is surely something that needs to be addressed, and we are fully aware of it. On the other hand, the overall stability of the codebase is quite good, so one may fork the project on GitHub and cut a branch of one's own from current master as well, of course with the usual quality measures.
The main difference between Avro and Thrift is that Thrift is statically typed, while Avro uses a more dynamic approach. In most cases a static approach fits the needs quite well, and in that case Thrift lets you benefit from the better performance of generated code. If that is not the case, Avro might be more suitable.
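To make the static versus dynamic distinction a bit more concrete, here is a toy sketch in plain Python. It uses neither the Thrift nor the Avro library: the `User` dataclass is hand-written and merely stands in for generated code, and `conforms` is a hypothetical helper that interprets an Avro-style JSON record schema at runtime for two primitive types only.

```python
import json
from dataclasses import dataclass


# Static style: the record layout is fixed in code ahead of time, the way
# generated classes would be (this dataclass is hand-written, not generated).
@dataclass
class User:
    name: str
    age: int


# Dynamic style: the layout arrives as data (Avro schemas are JSON documents)
# and is interpreted at runtime.
USER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "int"}]}
""")

PYTHON_TYPES = {"string": str, "int": int}  # toy subset of primitive types


def conforms(record: dict, schema: dict) -> bool:
    """Toy runtime check that a dict matches a record schema."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields)


static_user = User(name="Ada", age=36)
dynamic_user = {"name": "Ada", "age": 36}
print(static_user, conforms(dynamic_user, USER_SCHEMA))
print(conforms({"name": "Ada", "age": "36"}, USER_SCHEMA))
```

In the static case a layout change means regenerating and recompiling code; in the dynamic case a new schema can be loaded while the program is running, at the cost of doing the checks at runtime.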
Also, it is worth mentioning that besides Thrift, Protobuf and Avro there are some more solutions on the market, such as Cap'n Proto or BOLT.
Concerning Thrift: as far as I am aware, it is alive and kicking. We use it for serialization and internal APIs where I work, and it works fine for that.
Missing things like connection multiplexing and more user-friendly clients have been added through projects such as Twitter's Finagle.
Though I would characterize our use of it as only semi-intensive (i.e., we don't look at performance first: it should be easy to use and bug-free before anything else), we have not run into any issues so far.
So, regarding Thrift, I'd say it falls into your first category.[*]
Protocol Buffers is an alternative to Thrift's serialization part, but it does not provide the RPC toolbox Thrift offers.
I'm not aware of any other project that blends RPC and serialization into such a simple-to-use and complete single package.
[*] Anyway, once you start using it and see all the benefits, it's hard to put it into your second category :)
They are all very much in use at plenty of places, so I'd say your first assumption. I don't know what your expectation of a release schedule is, but they seem normal to me for a library of that size and maturity. Heck, Avro 1.8.0 came out at the start of 2016, and most things still use Avro 1.7.7 (e.g. Spark, Hadoop). https://avro.apache.org/releases.html

Difference between Apache NiFi and StreamSets

I am planning a class project and was going through a few technologies with which I can automate or set up the flow of data between systems. I found a couple of them, namely Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where they can be used. I am new to this, and it would be highly appreciated if anyone could explain it a bit. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing, you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, read helpful in-line documentation, and more. Almost all other systems, by comparison, have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.
2) Data Provenance
A unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when it is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows, let things run, and then use provenance to actually prove that the flow did exactly what you wanted. If something didn't happen as you expected, you can fix the flow and replay the object, then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes) or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like size of data involved. The NiFi API was built to honor that fact and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written so again you get really helpful information in establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally and these are really critical features for a dataflow management system.
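The 'copy by adding a pointer' and 'transform by writing a new version' ideas can be sketched with a toy model. This is not NiFi's implementation or API: the class and function names are hypothetical, and keying blobs by a SHA-256 hash is simply a convenient stand-in for whatever claim mechanism the real content repository uses. Python is used only for brevity.

```python
import hashlib
from dataclasses import dataclass, field


class ContentStore:
    """Toy immutable content store: content is written once, never modified."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        claim = hashlib.sha256(data).hexdigest()  # hypothetical claim id
        self._blobs.setdefault(claim, data)
        return claim

    def get(self, claim: str) -> bytes:
        return self._blobs[claim]

    def blob_count(self) -> int:
        return len(self._blobs)


@dataclass(frozen=True)
class FlowItem:
    """Attributes plus a pointer (claim) into the content store."""
    claim: str
    attributes: dict = field(default_factory=dict)


def clone(item: FlowItem) -> FlowItem:
    # "Copying" creates a new pointer; no content bytes are duplicated.
    return FlowItem(item.claim, dict(item.attributes))


def transform(store: ContentStore, item: FlowItem, fn) -> FlowItem:
    # Read the original and write a new version; the original stays untouched.
    return FlowItem(store.put(fn(store.get(item.claim))), dict(item.attributes))


store = ContentStore()
original = FlowItem(store.put(b"hello"), {"filename": "a.txt"})
copy = clone(original)
upper = transform(store, original, bytes.upper)
print(store.blob_count(), store.get(upper.claim), store.get(original.claim))
```

After one clone and one transform the store holds two blobs rather than three, and the original content is still readable, which is what makes replaying an object possible in this simplified picture.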
It was mentioned previously by the folks from the Streamsets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms, but the reality is that when data is in the flow, it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow, it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally to some special NiFi format nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that because what this means is that even the most trivial of cases would have problematic performance implications and luckily NiFi does not have that problem. Further had we gone that route then it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to check out:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
Both have a lot of similarities: both have a web-based UI, both are used for ingesting data, and both consist of processors linked together to perform transformations, serialization, etc. There are, however, a few key differences.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. This means that as data enters your pipeline (whether it's JSON, CSV, etc.), it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
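The record-based idea can be illustrated with a small sketch. It does not use the StreamSets API; the function names and the sample data are hypothetical. The parsers at the edge turn JSON and CSV into plain dict records, and the downstream stage only ever sees records.

```python
import csv
import io
import json


def parse_json_lines(text: str) -> list[dict]:
    """Origin for newline-delimited JSON: one record per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_csv(text: str) -> list[dict]:
    """Origin for CSV with a header row: one record per data row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def uppercase_field(records: list[dict], name: str) -> list[dict]:
    """A format-agnostic stage: it only sees records, never raw JSON or CSV."""
    return [{**r, name: str(r[name]).upper()} for r in records]


json_input = '{"city": "oslo", "temp": 4}\n{"city": "rome", "temp": 18}\n'
csv_input = "city,temp\nlima,22\n"

records = parse_json_lines(json_input) + parse_csv(csv_input)
result = uppercase_field(records, "city")
print(result)
```

Note that the CSV values stay strings in this sketch; a real common format would also have to settle on field types, which is exactly the kind of work a record-based tool takes off the individual processors.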
SDC also runs standalone and has a clustered mode, but it runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDP) is more mature, and StreamSets is more lightweight.
Both are easy to use, both have strong capability. And StreamSets could easily
Both have companies behind them: Hortonworks and Cloudera.
Obviously there are more contributors working on NiFi than on StreamSets, and NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process which ensures software quality. StreamSets, is Apache LICENSED, meaning anyone can reuse the code, etc. But the project is not managed as an Apache project. In fact, in order to even contribute to Streamsets, you are REQUIRED to sign a contract. https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer. https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have", which imposes a bit of a restriction if you want to deploy your dataflows further toward the edge, where the devices that are generating the data live. Apache MiNiFi, a sub-project of NiFi, can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

What Cassandra client to use for Hadoop integration?

I am trying to build a data services layer using cassandra as the backend store. I am new to Cassandra and not sure what client to use for cassandra - thrift or cql 3? We have a lot of mapreduce jobs using Amazon elastic mapreduce (EMR) that will be reading/ writing the data from cassandra at high volume. The total data volume will be > 100 TB with billions of rows in Cassandra. The mapreduce jobs may be read or write heavy with high qps (>1000 qps). The requirements are as follows:
Simplicity of client code. It seems thrift has in-built integration with Hadoop for bulk data loading using sstableloader (http://www.datastax.com/dev/blog/bulk-loading).
Ability to define new columns at run time. We may need to add more columns depending on application requirements. It seems cql3 does not allow definition of columns dynamically at runtime.
Performance of bulk read/ write. Not sure which client is better. However, I found this post that claims thrift client has better performance for high data volume: http://jira.pentaho.com/browse/PDI-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
I could not find any authoritative source of information that answers this question. Appreciate if you could help with this since I am sure this is a common problem for most folks and would benefit the overall community.
Many thanks in advance.
-Prateek
Hadoop and Cassandra are both written in Java, so definitely pick a Java-based driver. As far as simplicity of code goes, I'd go for Astyanax; their wiki page is really good and the documentation is solid all round. And yes, Astyanax does allow you to define columns at runtime as you please, but be aware that Thrift-based APIs are being superseded by CQL APIs.
If, however, you want to go down the pure CQL3 route, DataStax's driver is what I'd advise you to use. It allows for asynchronous connections and is continuously updated (view the logs). The code is also very clean, although the documentation isn't quite there yet, but there are tests in the source that you can look at.
But to be honest, there are so many questions about the APIs that you should read through them and form an opinion for yourself:
Cassandra Client Java API's
About Java Cassandra Client, which one is better? How about CQL?
Advantages of using cql over thrift
Also, for performance, here are some benchmarks (they are, however, outdated!) showing that CQL is catching up with Thrift (and somewhat overtaking it when it comes to prepared statements):
compare string vs. binary prepared statement parameters
CQL benchmarking

Run a site on Scheme

I can't find this on Google (so maybe it doesn't exist), but I'd basically like to install something on a web server so that I can run a site on Scheme. PHP is starting to annoy me and I want to get rid of it. What I want is:
Run Scheme sources towards UTF-8 output (duh)
Support for SXML, SXLT et cetera; I plan to compose the damned thing in SXML and convert it to the normal representation at the end.
Ability to read other files from the server, write them, set permissions et cetera
Also some things to, for instance, determine the file size of files, the height of images, MIME types and all that mumbo-jumbo
(optionally) connect to a database, but for what I want to do storing the entire database in S-expressions itself is feasible enough
I don't need any fancy libraries and other things that come with it like CMS'es and what-not, except the support for SXML but I'm sure I can just find a lib for that anyway that I can load.
Spark-Scheme has a full web server. If you don't need that, it also has a FastCGI interface so that you can serve Scheme scripts from web servers like Apache, Lighttpd etc. Spark-Scheme also seems to meet your requirements for database support, UTF-8, file handling and SXML. See the Spark-Scheme Programming Guide (pdf) for more information.
mod_lisp and FastCGI are the only two Apache modules I'm aware of that might work at this time. mod_lisp provides Scheme support because its architecture is similar to FastCGI, where CGI-like parameters are sent over a socket to a second process which remains running as the Scheme backend to the web server. Basically, you use one or the other to send CGI-like parameters across a socket to a running Scheme backend.
You can find some information about these solutions here. There was another FastCGI like effort called SCGI which demoed a simple SCGI receiver in Scheme called gambit. That code is probably not maintained anymore, but the scheme receiver might be useful.
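To give an idea of what 'CGI-like parameters sent over a socket' looks like on the wire, here is a rough sketch of the SCGI request framing: the headers are NUL-delimited name/value pairs wrapped in a netstring (length, colon, data, comma), with CONTENT_LENGTH first and an SCGI header set to 1, followed by the request body. The sketch is in Python purely for illustration and the function names are hypothetical; a Scheme backend would implement the decoding side after reading the bytes from its socket.

```python
def encode_scgi_request(headers: dict[str, str], body: bytes = b"") -> bytes:
    """Encode an SCGI request: netstring of NUL-delimited headers, then body."""
    # CONTENT_LENGTH must come first, and SCGI=1 must be present.
    items = [("CONTENT_LENGTH", str(len(body))), ("SCGI", "1")]
    items += [(k, v) for k, v in headers.items() if k not in ("CONTENT_LENGTH", "SCGI")]
    block = b"".join(k.encode() + b"\x00" + v.encode() + b"\x00" for k, v in items)
    return str(len(block)).encode() + b":" + block + b"," + body


def decode_scgi_request(data: bytes) -> tuple[dict[str, str], bytes]:
    """Inverse of encode_scgi_request; what a backend would do on receipt."""
    length, rest = data.split(b":", 1)
    size = int(length)
    block, tail = rest[:size], rest[size:]
    if not tail.startswith(b","):
        raise ValueError("malformed netstring")
    parts = block.split(b"\x00")[:-1]  # drop the empty piece after the final NUL
    headers = {parts[i].decode(): parts[i + 1].decode() for i in range(0, len(parts), 2)}
    body = tail[1:1 + int(headers["CONTENT_LENGTH"])]
    return headers, body


request = encode_scgi_request(
    {"REQUEST_METHOD": "POST", "REQUEST_URI": "/hello"}, b"name=scheme"
)
headers, body = decode_scgi_request(request)
print(headers["REQUEST_URI"], body)
```

The framing is simple enough that a receiver fits in a few dozen lines in most languages, which is presumably why small Scheme receivers for it exist.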
Back in the Apache 2.0 days, there were more projects playing with scheme and clisp bindings. I don't believe that mod_scheme ever released anything, but if they did, odds are it is not compatible with the modern releases of Apache.
Did you come across Fermion (http://vijaymathew.wordpress.com/2009/08/19/fermion-the-scheme-web-server/)?
If you're looking for a lispy language to develop web applications in, I'd recommend looking into Clojure. Clojure is a lisp variant that's fairly close to scheme; here is a list of some of the differences.
Clojure runs on the Java virtual machine and integrates well with Java libraries, and there's a great webapp framework available called Compojure.
Check out Chicken Scheme's Eggs Unlimited. I think what you want is a combination of the sxml- packages coupled with the fastcgi package.
PLT Scheme has a web application server here: http://docs.plt-scheme.org/web-server/index.html
