I need to fetch data from a MySQL server through SSH tunneling.
I am using Apache Beam 2.19.0 Java JdbcIO on Google Dataflow to connect to the database.
But since the database is inside a private network, I need to reach it through an intermediate SSH server (a bastion) via SSH tunneling.
Is this achievable using Apache Beam JdbcIO?
This functionality isn't built into Apache Beam, but there are several options. JdbcIO uses the standard Java JDBC interface to connect to your database, so it wouldn't be too difficult to wrap the MySQL JDBC driver with your own driver that sets up an SSH tunnel before connecting. I did a quick Google search and found a project that wraps an arbitrary JDBC driver with an SSH tunnel using SSHJ: jdbc-sshj (a copy is published to Maven as com.cekrlic:jdbc-sshj:0.1.0). The project looks somewhat unmaintained, but it will do what you want. Add it to your runtime dependencies, then update your configuration to something like this (this example is not secure):
pipeline.apply(JdbcIO.<KV<Integer, String>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.cekrlic.jdbc.ssh.tunnel.SshJDriver",
            "jdbc:sshj://sshbastion?remote=database:3306&username=sshuser&password=sshpassword&verify_hosts=off;;;jdbc:mysql://localhost:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withQuery("select id, name from Person")
    .withCoder(KvCoder.of(BigEndianIntegerCoder.of(), StringUtf8Coder.of()))
    .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
        public KV<Integer, String> mapRow(ResultSet resultSet) throws Exception {
            return KV.of(resultSet.getInt(1), resultSet.getString(2));
        }
    })
);
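If you would rather not pull in that dependency, the wrapper idea above can also be hand-rolled. Below is a minimal, untested sketch of the tunnel-then-connect step, assuming JSch as the SSH library; the class name, hosts and credentials are placeholders, and a production version would implement java.sql.Driver so JdbcIO can load it like any other driver:

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.sql.Connection;
import java.sql.DriverManager;

// Hypothetical helper: open an SSH tunnel through the bastion, then hand a JDBC URL
// pointing at the local end of the tunnel to the regular MySQL driver.
public class TunneledMySqlConnector {
    public static Connection connect() throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("sshuser", "sshbastion", 22);
        session.setPassword("sshpassword");
        session.setConfig("StrictHostKeyChecking", "no"); // not secure; for illustration only
        session.connect();
        // Bind an ephemeral local port and forward it through the bastion to database:3306.
        int localPort = session.setPortForwardingL(0, "database", 3306);
        return DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:" + localPort + "/mydb", "username", "password");
    }
}

Packaging this inside a driver wrapper matters on Dataflow because each worker opens its own connections, so the tunnel has to be established on the worker rather than once on the machine that launches the pipeline.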
If you are using Dataflow you can set up a GCE VM to act as your gateway. On that VM, use SSH port forwarding to expose the database on the VM's external interface (for example, ssh -L \*:3306:database:3306 sshbastion run from the VM), make the port available within the VPC, and then run your Dataflow job on that VPC. If your database is already running in GCP, you can use this approach to run your Dataflow job on the same VPC as the database and drop the SSH step.
The Goal
I need to query a Redshift cluster from my Go application. The cluster is not publicly available, so I want to use SSH to access it via a bastion host.
Status Quo
I have an AWS Redshift cluster in a private VPC, with inbound rules that do not allow any traffic from the internet except TCP 22;
There's a bastion (which can connect to the cluster), so forwarding a port and using rsql works perfectly fine from the terminal;
I use ODBC, and the official ODBC driver from AWS;
In the Go application, I use the following database/sql ODBC implementation: https://github.com/alexbrainman/odbc;
I can't use Devart's Redshift ODBC driver;
I can't use JDBC;
MacOS/Unix.
The Problem
The problem is pretty much trivial. When the cluster is publicly accessible from the internet, alexbrainman's library does its job. However, when the cluster is behind the wall, that's when the problems kick in.
The library's code goes down into C (system calls), so I can't really debug it. While with mysql, for example, it's possible to register your own custom dialer, that doesn't seem to be the case with ODBC.
Even when the tunnel is active, pointing an ODBC DSN at the local host for some reason doesn't work. The SQLRETURN is always -1 (api/zapi_unix.go).
The Question
Has anyone had this experience? How did you resolve the problem of accessing the cluster from the internet via a Go app?
Thank you!
I am trying to connect to my MySQL server with Logstash on our Elastic Cloud cluster; the problem is that we use an SSH tunnel in front of the SQL server. Is there a way, using the Logstash pipeline creation interface on Elastic Cloud, to connect to a MySQL server through an SSH tunnel?
The interface is as follows; there are not that many parameters.
No, I'm afraid that's not part of Logstash's JDBC input plugin (which handles the connection to MySQL). Can you set up the SSH tunnel between your Logstash server and MySQL manually?
I'm trying to connect my local standalone MySQL instance with Cloud Data Fusion to create and test a data pipeline. I have deployed the driver successfully.
I have also configured the pipeline properties with the correct values for the JDBC string, user name and password, but connectivity isn't getting established.
Connection String: jdbc:mysql://localhost:3306/test_database
I have also tried to test the connectivity via the data wrangling option, but that is not succeeding either.
Do I need to bring both environments onto the same network by setting up a VPC and tunneling?
In your example, I see that you specified localhost in your connection string. localhost only resolves to services running locally on your own machine, so Cloud Data Fusion (running in GCP) will not be able to reach the MySQL instance (running on your machine). Hence the connectivity issue you're seeing.
I highly recommend looking at this answer on SO, which will help you set up a quick proof of concept.
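For example, once MySQL is reachable from Data Fusion's network (through a tunnel, a VM, or VPC peering as described below), the connection string would point at that reachable address instead of localhost; the host here is just a placeholder:

jdbc:mysql://<reachable-host-or-internal-ip>:3306/test_database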
I think your question is really about how to connect an on-premises environment to GCP's networking layer, which groups Google Cloud instances and resources through the VPC model.
Given that GCP offers several connection methods for hybrid cloud setups, I would encourage you to learn the fundamentals of Cloud VPN, which establishes a secure connection between a peer VPN gateway and a Cloud VPN gateway by creating a VPN tunnel between the two.
There is even a dedicated chapter in the GCP documentation about Data Fusion VPC network peering that might be helpful in your use case.
I am following this guide on Hadoop/FIWARE-Cosmos and I have a question about the Hive part.
I can access the old cluster’s (cosmos.lab.fiware.org) headnode through SSH, but I cannot do it for the new cluster. I tried both storage.cosmos.lab.fiware.org and computing.cosmos.lab.fiware.org and failed to connect.
My intention in trying to connect via SSH was to test Hive queries on our data through the Hive CLI. After failing to do so, I checked and was able to connect to the 10000 port of computing.cosmos.lab.fiware.org with telnet. I guess Hive is served through that port. Is this the only way we can use Hive in the new cluster?
The new pair of clusters does not have SSH access enabled. This is because users tended to install a lot of stuff (much of it not even related to Big Data) in the "old" cluster, which had SSH access enabled as you mention. So the new pair of clusters is intended to be used only through the exposed APIs: WebHDFS for data I/O and Tidoop for MapReduce.
That being said, a Hive server is running as well, and it should be exposing a remote service on port 10000, as you mention. I say "it should be" because it is running an experimental authenticator module based on OAuth2, as WebHDFS and Tidoop do. Theoretically, connecting to that port from a Hive client is as easy as using your Cosmos username and a valid token (the same one you are using for WebHDFS and/or Tidoop).
And what about a remote Hive client? Well, that is something your application should implement. Anyway, I have uploaded some implementation examples to the Cosmos repo. For instance:
https://github.com/telefonicaid/fiware-cosmos/tree/develop/resources/java/hiveserver2-client
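For a rough idea of what such a client could look like, here is a minimal sketch using the standard HiveServer2 JDBC driver (hive-jdbc on the classpath). Treat the authentication details as an assumption on my part: whether the experimental OAuth2 module accepts the Cosmos username and token as JDBC credentials is exactly what the examples in the repo above should confirm.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CosmosHiveClient {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver class.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: Cosmos username as the user and the OAuth2 token as the password.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://computing.cosmos.lab.fiware.org:10000/default",
                     "your-cosmos-username", "your-oauth2-token");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}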
Similar questions have been asked
RabbitMQ on Amazon EC2 Instance & Locally?
and
cant connect from my desktop to rabbitmq on ec2
But they get different error messages.
I have a RabbitMQ server running on my Linux EC2 instance, which is set up correctly. I have created custom users and given them permissions to read/write to queues. Using a local client I am able to correctly receive messages. I have set up the security groups on EC2 so that the ports (5672/25672) are open, and I can telnet to those ports. I have also set up rabbitmq.config like this.
[
  {rabbit, [
    %% listen on all interfaces, not just loopback
    {tcp_listeners, [{"0.0.0.0", 5672}]},
    %% no users are restricted to loopback-only connections
    {loopback_users, []},
    {log_levels, [{connection, info}]}
  ]}
].
At the moment I have a client on the server publishing to the queue.
I have another client running on a server outside of EC2 which needs to consume data from the same queue (I can't run both on EC2, as the consumer does a lot of plotting/graphical manipulation).
However, when I try to connect from the external client using some test code,
try {
    ConnectionFactory factory = new ConnectionFactory();
    // credentials and host masked; the URI format is amqp://user:password@host:port/vhost
    factory.setUri("amqp://****:****@****:5672/");
    connection = factory.newConnection();
} catch (Exception e) { // setUri/newConnection also throw URISyntaxException, TimeoutException, etc.
    e.printStackTrace();
}
I get the following error.
com.rabbitmq.client.AuthenticationFailureException: ACCESS_REFUSED -
Login was refused using authentication mechanism PLAIN. For details
see the broker logfile.
However, there is nothing in the broker logfile, as if I had never tried to connect.
I've tried connecting using the individual setter methods on the factory, and I've tried using different ports (along with opening them up).
I was wondering whether I need to use SSL to connect to EC2, but from reading around the web it seems like it should just work; I'm not exactly sure. I cannot find any examples of people successfully achieving what I'm trying to do and documenting it.
Thanks in advance
The answer was simply that I needed to specify the host as the same IP I use to SSH into the machine. I was trying to use the Elastic IP/public DNS of the EC2 instance, which I thought should point to the same machine.
Although I did try many things, including setting up an SSL connection, none of that was necessary.
All that is needed is:
Create a RabbitMQ user using rabbitmqctl and give it the appropriate permissions
Open the needed ports on EC2 via the Security Groups menu (the default is 5672)
Use the client library to connect with the correct host name/username/password/port, where the host name is the same one as the machine you normally SSH into (a minimal sketch follows below).
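As a reference for that last step, here is a minimal sketch with the RabbitMQ Java client; the host, credentials and queue name are placeholders:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ConsumeFromEc2 {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("ec2-xx-xx-xx-xx.compute-1.amazonaws.com"); // the same host you SSH into
        factory.setPort(5672);
        factory.setUsername("myuser");
        factory.setPassword("mypassword");
        factory.setVirtualHost("/");
        // try-with-resources closes the channel and connection when done
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // passive declare just checks the queue exists without changing its properties
            channel.queueDeclarePassive("my-queue");
            System.out.println("Connected to " + factory.getHost() + ", queue is reachable");
        }
    }
}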