How to import data into Cassandra on EC2 using DSBulk Loader

I'm attempting to import data into Cassandra on EC2 using dsbulk loader. I have three nodes configured and communicating as follows:
UN 172.31.37.60 247.91 KiB 256 35.9% 7fdfe44d-ce42-45c5-bb6b-c3e8377b0eba 2a
UN 172.31.12.203 195.17 KiB 256 34.1% 232f7d98-9cc2-44e5-b18f-f52107a6fe2c 2c
UN 172.31.23.23 291.99 KiB 256 30.0% b5389bf8-c0e5-42be-a296-a35b0a3e68fb 2b
I'm trying to run the following command to import a csv file into my database:
dsbulk load -url cassReviews/reviewsCass.csv -k bnbreviews -t reviews_by_place -h '172.31.23.23' -header true
I keep receiving the following error:
Error connecting to Node(endPoint=/172.31.23.23:9042, hostId=null, hashCode=b9b80b7)
Could not reach any contact point, make sure you've provided valid addresses
I'm running the import from outside of the cluster, but from within the same EC2 instance. On each node, I set the listen_address and rpc_address to its private IP. Port 9042 is open. All three nodes are within the same region, and I'm using Ec2Snitch. Each node is running on an Ubuntu 18.04 server.
I've made sure each of my nodes is up before running the command, and that the path to my .csv file is correct. It seems like when I run the dsbulk command, the node I specify with the -h flag goes down immediately. Could there be something wrong with my configuration that I'm missing? DSBulk worked well locally, but is there a better method for importing data from CSV files on an EC2 instance? Thank you!
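A quick sanity check, assuming cqlsh is installed on the machine running the load, is to confirm the contact point answers on the native transport port before running dsbulk:
cqlsh 172.31.23.23 9042    # should open a CQL shell if port 9042 is reachable from this machine
nodetool status            # run on the node itself to confirm it still reports UN (Up/Normal)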
EDIT: I've been able to load data in chunks using dsbulk loader, but the process is occasionally interrupted by this error:
[s0|/xxx.xx.xx.xxx:9042] Error while opening new channel
The way I'm currently interpreting it is that the node at the specified IP has run out of storage space and crashed, causing any subsequent dsbulk operations to fail. The workaround so far has been to clear excess logging files from /var/log/cassandra and restart the node, but I think a better approach would be to increase the SSD size on each instance.

As mentioned in my edit, the problem was solved by increasing the volume size on each of my node instances. The reason DSBulk was failing and causing the nodes to crash was that the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node, on which I was running the DSBulk command, on a t2.medium instance with a 30 GB SSD, which solved the issue.
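If you hit the same symptom, a quick way to confirm disk pressure and reclaim some space (a rough sketch; paths assume a default package install of Cassandra) is:
df -h /var/lib/cassandra        # check free space on the data volume
du -sh /var/log/cassandra       # see how much the logs have grown
nodetool clearsnapshot          # drop old snapshots, which can take up significant space (newer versions need --all)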

Related

Cloudant : Error with running weatherreport to check cluster health

We have a three-node cluster set up and are facing an issue running the weatherreport command.
Looking at the error, it is clear that the machine running the weatherreport utility is unable to connect to the other two machines. I have checked all the machines and they are accessible using their FQDNs. But from the message it looks like it is using the short name when connecting to the peer machines. So how can I check where it is taking the peer machine names from? Then I can try changing them to the full machine names, which might work for me. If there is any other solution, let me know.
The error is:
['cloudant_diag17506#machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant#machine2031'
['cloudant_diag17506#machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant#machine2032'
['cloudant_diag17506#machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant#machine2033'
['cloudant#machine2032.domain.com'] [crit] Rebuilding this node will leave the following shard with NO live copies: default/t_alpha e0000000-ffffffff, default/t_alpha a0000000-bfffffff, default/t_alpha 60000000-7fffffff, default/t_alpha 20000000-3fffffff, default/metrics_app e0000000-ffffffff, default/metrics_app a0000000-bfffffff, default/metrics_app 60000000-7fffffff, default/metrics_app 20000000-3fffffff
I found the solution to this problem.
The problem was that when the DB was created the first time, the short name was used, so the database was likely referring to the short name when connecting to the other peer hosts.
Now that the Cloudant Local installation is in a problematic state, the way to make it consistent is to remove all the files under /srv/cloudant/ on all database nodes. This will remove all default Cloudant databases. Then run the configure.sh script again on each node as before, now that "hostname -f" correctly outputs the fully qualified host name, and finally create your databases again.
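Roughly, that reset looks like this on each database node (a sketch; the location of configure.sh depends on how Cloudant Local was installed, and the data wipe is destructive):
hostname -f                  # must print the fully qualified host name before rebuilding
sudo rm -rf /srv/cloudant/*  # removes all default Cloudant databases
sudo ./configure.sh          # re-run the installer's configure script as before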

Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured.
I receive the following error when running "spark-shell" from the shell (locally on the master), as well as when uploading a job through the web-GUI and the gcloud command line utility from my local machine:
15/11/08 21:27:16 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (38281+2679 MB) is above the max threshold (20480 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
I tried modifying the value in /etc/hadoop/conf/yarn-site.xml but it didn't change anything. I don't think it pulls the configuration from that file.
I've tried multiple cluster combinations, at multiple sites (mainly Europe), and I only got this to work with the low-memory version (4 cores, 15 GB memory).
I.e., this is only a problem on nodes configured for more memory than the YARN default allows.
Sorry about these issues you're running into! It looks like this is part of a known issue where certain memory settings end up computed based on the master machine's size rather than the worker machines' size, and we're hoping to fix this in an upcoming release soon.
There are two current workarounds:
Use a master machine type with memory either equal to or smaller than the worker machine types.
Explicitly set spark.executor.memory and spark.executor.cores either using the --conf flag if running from an SSH connection like:
spark-shell --conf spark.executor.memory=4g --conf spark.executor.cores=2
or if running gcloud beta dataproc, use --properties:
gcloud beta dataproc jobs submit spark --properties spark.executor.memory=4g,spark.executor.cores=2
You can adjust the number of cores/memory per executor as necessary; it's fine to err on the side of smaller executors and letting YARN pack lots of executors onto each worker, though you can save some per-executor overhead by setting spark.executor.memory to the full size available in each YARN container and spark.executor.cores to all the cores in each worker.
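For example, on workers that end up with roughly 12 GB YARN containers and 4 cores each (illustrative numbers, not Dataproc defaults), one setting along the lines of the advice above, leaving headroom for the memory overhead mentioned in the error, would be:
spark-shell --conf spark.executor.memory=10g --conf spark.executor.cores=4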
EDIT: As of January 27th, new Dataproc clusters will now be configured correctly for any combination of master/worker machine types, as mentioned in the release notes.

Submitting jobs to Spark EC2 cluster remotely

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
Results in the following indefinite warnings/errors:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried passing limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ... waiting before polling master for driver state ... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance how to deploy jobs from remote client. Thanks.
UPDATE:
So, for the second issue (cluster mode): the jar file must be visible to every cluster node, so it has to be in a location accessible to all of them. This solves the IOException but leads to the same issue as in client mode.
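One way to satisfy that, sketched below, is to copy the jar to the same path on every node before submitting (worker hostnames are placeholders):
for host in worker-1 worker-2 worker-3; do   # placeholder hostnames for the cluster nodes
  ssh $host "mkdir -p /home/oleg/spark/spark12/ec2test/target/scala-2.10"
  scp ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar $host:/home/oleg/spark/spark12/ec2test/target/scala-2.10/
done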
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections to those ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or running the client from within the cluster, is probably less painful.
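If you do go the port-forwarding route, one way to keep the range manageable is to pin the normally random driver-side ports to fixed values (a sketch; spark.driver.port and spark.blockManager.port are standard Spark properties, the values here are arbitrary):
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --conf spark.driver.port=7078 \
  --conf spark.blockManager.port=7079 \
  --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar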
I advise against submitting Spark jobs remotely using the port-opening strategy, because it can create security problems and is, in my experience, more trouble than it's worth, especially due to having to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/ (see the sketch below)
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver
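For example, with Livy (option 1 above) the job is submitted over plain HTTP, so only Livy's own port needs to be reachable from the client. A rough sketch (host, port, and jar location are placeholders; the jar must sit where the cluster can read it):
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "/path/visible/to/cluster/ec2test_2.10-0.0.1.jar", "className": "SparkPi"}' \
  http://livy-host:8998/batches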

"Unable to verify integrity of data" while running MR job

I'm running a relatively big MR job using Amazon Elastic Map Reduce.
I ran the job plenty of times on small data sets with no problem.
But when trying to run it on a large dataset I'm getting the following exception:
Error: com.amazonaws.AmazonClientException: Unable to verify integrity of data download. Client calculated content length didn't match content length received from Amazon S3. The data may be corrupt.
I googled it and the only recommendation I got was to set the following:
System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation","true");
That didn't help at all.
I'm using replication factor 3, 11 m1.large datanodes and 1 m1.medium master node.
Any workaround or known fix for this issue?
Apparently, this is a known bug. Or so I've been told by an Amazon employee here.
It occurs when running on large datasets where an S3 object is bigger than 2GB.
I managed to work around it by moving to Hadoop 2.4.0 and AMI 3.1.0.

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But once the job flow is finished, by default it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
In case you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
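As a rough illustration of the Configure Hadoop bootstrap action mentioned above, using the elastic-mapreduce CLI of that era (the flag syntax should be checked against the Developer's Guide; -m targets mapred-site.xml and the key/value shown is only an example):
elastic-mapreduce --create --alive --num-instances 3 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=2"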
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly are you trying to start it? Exactly which AMIs are you using?
