I've been using Kafka for a while (with the wurstmeister images, though I have also set up the broker and ZooKeeper with the Confluent images and that works), and I am now trying to set up Kafka Connect so I can load messages from a Kafka topic directly into S3. However, I've been running into several issues: QEMU errors, java.lang.ExceptionInInitializerError at org.eclipse.jetty.http.MimeTypes, and so on, which I've read are related to the lack of ARM support (https://github.com/confluentinc/common-docker/issues/117 and https://github.com/docker/buildx/issues/542). I have tried running the Docker Compose setup with platform: linux/amd64, but it still doesn't work.
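For reference, this is roughly the relevant part of my compose file (the image name and tag here are just placeholders for whichever Connect image you try):

```yaml
# Excerpt of what I tried: forcing amd64 emulation for the Connect service.
# Image name and tag are illustrative.
services:
  kafka-connect:
    image: confluentinc/cp-kafka-connect:6.2.0
    platform: linux/amd64
    ports:
      - "8083:8083"
```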
I was wondering if anyone has a workaround to make Kafka Connect run, or knows of any alternatives.
Thanks!
You don't need Docker to run Kafka Connect:
1. Install Java on your Mac
2. Download Kafka
3. Run ZooKeeper and Kafka (this might have issues on an M1 Mac)
4. Run bin/connect-distributed.sh config/connect-distributed.properties (see the sketch below)
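A rough sketch of those steps on the command line, assuming a Kafka tarball downloaded from https://kafka.apache.org/downloads (the version in the file names is illustrative):

```sh
# Illustrative; adjust the version to whatever you downloaded
tar -xzf kafka_2.13-3.2.0.tgz
cd kafka_2.13-3.2.0

# In separate terminals: start ZooKeeper, then the broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# Then start Connect in distributed mode
bin/connect-distributed.sh config/connect-distributed.properties
```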
If you really need Docker, you can rebuild the images from other sources, such as mine, which builds from the adoptopenjdk:11-jre base image and therefore supports ARM.
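For illustration, a minimal Dockerfile along those lines might look like this (a sketch, not my exact image; the Kafka version and install path are placeholders):

```Dockerfile
# Sketch: Kafka Connect on an ARM-capable JRE base; versions are illustrative
FROM adoptopenjdk:11-jre

ARG KAFKA_VERSION=3.2.0
ARG SCALA_VERSION=2.13

RUN apt-get update && apt-get install -y curl \
 && curl -fsSL "https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz" \
    | tar -xz -C /opt \
 && ln -s "/opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}" /opt/kafka

WORKDIR /opt/kafka
CMD ["bin/connect-distributed.sh", "config/connect-distributed.properties"]
```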
When I create a pod, the corresponding image is pulled to the node where the pod is created.
Can those images be shared among the cluster nodes, instead of being stored locally on each node?
Thanks a lot
Best Regards
It's possible if you have shared storage across all the Kubernetes nodes. However, it's not a good idea 🙅, since the place where images get stored is typically also the place where the container runtime stores its files when it's actually running containers. For example, if you are using Docker, everything gets stored under /var/lib/docker; in the case of containerd, it's /var/lib/containerd.
So in summary, it's possible with shared or clustered file systems like NFS, Ceph, GlusterFS, AWS EFS, etc., but it's not a good idea in my opinion 🚫.
Update (#BMitch):
Make sure that the container storage driver you are using supports the filesystem that you are using.
✌️
I have been trying to spin up a Kubernetes/Fabric8 installation on AWS using Stackpoint as described in this video: https://www.youtube.com/watch?v=lNRpGJTSMKA
My problem is that three of the apps won't start because no volumes are available, and I cannot see how to resolve those PV requests. For example, Gogs is reporting the following error:
Unable to mount volumes for pod "gogs-2568819805-bcw8e_default(03d618b9-7477-11e6-8c6b-0a945216fb91)": timeout expired waiting for volumes to attach/mount for pod "gogs-2568819805-bcw8e"/"default". list of unattached/unmounted volumes=[gogs-data]
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "gogs-2568819805-bcw8e"/"default". list of unattached/unmounted volumes=[gogs-data]
I am pretty sure this is very simple, but I cannot see how to connect the dots here from the various Kubernetes and Fabric8 docs. I can create a new EBS volume in AWS easily enough, but I cannot see how to attach it to these services in the running stack. Any help would be greatly appreciated!
Sorry about that. What version of gofabric8 are you using? We're currently adding persistent volume support for the core platform apps, although the integration with StackPoint isn't quite there yet. Hopefully soon though.
For now, you should be able to disable the PV claims by passing --pv=false during the deploy, i.e. gofabric8 deploy --pv=false. We'll look at making this the default until the integration is there and we can leverage AWS persistent volumes.
We just shipped functionality that allows you to create and manage AWS volumes for Kubernetes. You get a volume, PV, and claim - just name the claim to be what is required by Fabric8. Eventually, you'll be able to use dynamic volume creation.
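As a sketch of what those objects look like (the volume ID and size are placeholders; the claim name just has to match what the Fabric8 app expects, e.g. gogs-data from the error above):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gogs-data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0   # placeholder EBS volume ID
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gogs-data        # must match the claim name the app requires
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```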
I understand you can download Spark source code (1.5.1), or prebuilt binaries for various versions of Hadoop. As of Oct 2015, the Spark webpage http://spark.apache.org/downloads.html has prebuilt binaries against Hadoop 2.6+, 2.4+, 2.3, and 1.X.
I'm not sure what version to download.
I want to run a Spark cluster in standalone mode using AWS machines.
<EDIT>
I will be running a 24/7 streaming process. My data will be coming from a Kafka stream. I thought about using spark-ec2, but since I already have persistent ec2 machines, I thought I might as well use them.
My understanding is that since my persistent workers need to perform checkpoint(), they need access to some kind of shared file system with the master node. S3 seems like a logical choice.
</EDIT>
This means I need to access S3, but not HDFS. I do not have Hadoop installed.
I got a pre-built Spark for Hadoop 2.6. I can run it in local mode, such as the wordcount example. However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need Hadoop?
<EDIT>
It's not a showstopper, but I want to make sure I understand the reason for this warning message. I was under the assumption that Spark doesn't need Hadoop, so why is it even showing up?
</EDIT>
I'm not sure what version to download.
That choice will also be guided by what existing code you are using, the features you require, and your bug tolerance.
I want to run a Spark cluster in standalone mode using AWS instances.
Have you considered simply running Apache Spark on Amazon EMR? See also How can I run Spark on a cluster? from Spark's FAQ, and their reference to their EC2 scripts.
This means I need to access S3, but not HDFS
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine and never have to access S3. While many examples are written using S3 access through the out-of-the-box S3 "fs" drivers for the Hadoop library, note that there are now three different access methods (s3, s3n, and s3a). Configure as appropriate.
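For illustration only, here is how credentials might be passed for the s3a connector on a Hadoop 2.6+ build (for the older s3n driver the properties are fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey instead; you may also need the hadoop-aws jars on the classpath):

```sh
# Illustrative: pass S3 credentials and read from a bucket via s3a.
# Application, bucket, and key names are placeholders.
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  your_app.py s3a://your-bucket/input/
```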
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, and there are several to choose from.
Is your data even in files? Depending on your application and where your data is, you may only need to use DataFrames over SQL, Cassandra, or other sources!
However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need Hadoop?
Not a problem. It is telling you that it is falling back to a non-optimum implementation. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
UPDATE
I've edited the original post
My data will be coming from a Kafka stream. ... My understanding is that .. my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available from Spark 1.3 on, and per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
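A minimal sketch of that pattern in PySpark (broker, topic, and bucket names are placeholders; the job would also need the matching spark-streaming-kafka package on the classpath, and the S3 URI scheme depends on which connector you configured):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

CHECKPOINT_DIR = "s3n://your-bucket/checkpoints"  # placeholder bucket

def create_context():
    sc = SparkContext(appName="kafka-streaming-sketch")
    ssc = StreamingContext(sc, 10)          # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)          # fault-tolerant checkpoint dir (S3/HDFS)

    # Direct (receiver-less) Kafka stream; keep only the message values
    stream = KafkaUtils.createDirectStream(
        ssc, ["your-topic"], {"metadata.broker.list": "broker1:9092"})
    stream.map(lambda kv: kv[1]).count().pprint()
    return ssc

# Recover from an existing checkpoint if present, otherwise build fresh
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```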
So why [do I see the Hadoop warning message]?
The Spark download only comes with so many Hadoop client libraries. With a fully configured Hadoop installation, there are also platform-specific native binaries for certain packages. These get used if available. To use them, augment Spark's classpath; otherwise, the loader will fall back to less performant versions.
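For example, something along these lines in conf/spark-env.sh, assuming a local Hadoop installation under /opt/hadoop (paths are illustrative):

```sh
# Illustrative: expose a local Hadoop install's jars and native libs to Spark
export HADOOP_HOME=/opt/hadoop
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}
```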
Depending on your configuration, you may be able to take advantage of a fully configured Hadoop or HDFS installation. You mention taking advantage of your existing, persistent EC2 instances rather than using something new. There's a tradeoff between S3 and HDFS: S3 is a new resource (more cost) but survives when your instances are offline (you can take the compute down and keep persisted storage); however, S3 may suffer from higher latency than HDFS (you already have the machines, so why not run a filesystem over them?), as well as not behaving like a filesystem in all cases. This tradeoff is described by Microsoft for choosing Azure storage vs. HDFS, for example, when using HDInsight.
We're also running Spark on EC2 against S3 (via the s3n file system). We had some issues with the pre-built versions for Hadoop 2.x; regrettably, I don't remember what the issue was. But in the end we're running with the pre-built Spark for Hadoop 1.x and it works great.
I'm trying to use a different zlib library with Hadoop. When I use one particular library, the Hadoop services do not start; they start properly if I use a different library. The only difference between the libraries is that the non-working one uses a device file to send data to a kernel driver for compression. Where could the issue be, and where should I look for the error logs in Hadoop?