Setup and configuration of JanusGraph for a Spark cluster and Cassandra - hadoop

I am running JanusGraph (0.1.0) with Spark (1.6.1) on a single machine.
I did my configuration as described here.
When I access the graph in the Gremlin Console with the SparkGraphComputer, it is always empty. I cannot find any errors in the log files; it is just an empty graph.
Is anyone using JanusGraph with Spark who can share their configuration and properties?
Using a standard JanusGraph, I get the expected output:
gremlin> graph=JanusGraphFactory.open('conf/test.properties')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().count()
14:26:10 WARN org.janusgraph.graphdb.transaction.StandardJanusGraphTx - Query requires iterating over all vertices [()]. For better performance, use indexes
==>1000001
gremlin>
Using a HadoopGraph with Spark as the GraphComputer, the graph is empty:
gremlin> graph=GraphFactory.open('conf/test.properties')
==>hadoopgraph[cassandrainputformat->gryooutputformat]
gremlin> g=graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>0==============================================> (14 + 1) / 15]
My conf/test.properties:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# Titan Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.keyspace=janusgraph
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=janusgraph
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
#
# SparkGraphComputer Configuration
#
spark.master=spark://127.0.0.1:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=100g
gremlin.spark.persistContext=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
HDFS seems to be configured correctly as described here:
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_178390072_1, ugi=cassandra (auth:SIMPLE)]]]

Try fixing these properties:
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.keyspace=janusgraph
Replace with:
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
storage.cassandra.keyspace=janusgraph
The default keyspace name is janusgraph, so despite the mistakes in the property names, I don't think you would have observed this problem unless you had loaded your data into a keyspace with a different name.
The latter property is described in the Configuration Reference. Also, keep an eye on this open issue to improve the docs for Hadoop-Graph usage.
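If it helps to verify the fix outside the Gremlin Console, here is a minimal Java sketch (assuming the corrected conf/test.properties and the TinkerPop Spark and JanusGraph Hadoop jars are on the classpath; the class name is illustrative):
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class HadoopGraphCount {
    public static void main(String[] args) throws Exception {
        // Open the HadoopGraph using the corrected properties file
        Graph graph = GraphFactory.open("conf/test.properties");
        // Route the traversal through SparkGraphComputer (OLAP)
        GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
        // Should now report the same count as the OLTP traversal instead of 0
        System.out.println(g.V().count().next());
        graph.close();
    }
}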

Related

Failed to communicate with peer when load data using round robin load balancer in Apache NiFi cluster

I am facing a "failed to communicate with peer" issue when loading data using the Round Robin load balancer on a queue in an Apache NiFi cluster.
I set up 3 nodes in the cluster; below is the nifi.properties configuration from one of the nodes.
As shown in the screenshot attached below, I have a GetFile processor that reads text files containing multiple lines in CSV format. When there are several files, the processor puts each one onto the queue once it has been read. On that queue I use the Round Robin load balancer, and the error occurs when it starts loading data into the queue.
####################
# State Management #
####################
nifi.state.management.configuration.file=./conf/state-management.xml
# The ID of the local state provider
nifi.state.management.provider.local=local-provider
# The ID of the cluster-wide state provider. This will be ignored if NiFi is not clustered but must be populated if running in a cluster.
nifi.state.management.provider.cluster=zk-provider
# Specifies whether or not this instance of NiFi should run an embedded ZooKeeper server
nifi.state.management.embedded.zookeeper.start=true
# Properties file that provides the ZooKeeper properties to use if <nifi.state.management.embedded.zookeeper.start> is set to true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
# H2 Settings
nifi.database.directory=./database_repository
nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
# FlowFile Repository
nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=./flowfile_repository
nifi.flowfile.repository.checkpoint.interval=20 secs
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.encryption.key.provider.implementation=
nifi.flowfile.repository.encryption.key.provider.location=
nifi.flowfile.repository.encryption.key.provider.password=
nifi.flowfile.repository.encryption.key.id=
nifi.flowfile.repository.encryption.key=
nifi.flowfile.repository.retain.orphaned.flowfiles=true
nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
# Content Repository
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=7 days
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.provider.password=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=
# Provenance Repository Properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.provider.password=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=
# Persistent Provenance Repository Properties
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.max.storage.size=10 GB
nifi.provenance.repository.rollover.time=10 mins
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
# Comma-separated list of fields. Fields that are not indexed will not be searchable. Valid fields are:
# EventType, FlowFileUUID, Filename, TransitURI, ProcessorID, AlternateIdentifierURI, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
# FlowFile Attributes that should be indexed and made searchable. Some examples to consider are filename, uuid, mime.type
nifi.provenance.repository.indexed.attributes=
# Large values for the shard size will result in more Java heap usage when searching the Provenance Repository
# but should provide better performance
nifi.provenance.repository.index.shard.size=500 MB
# Indicates the maximum length that a FlowFile attribute can be when retrieving a Provenance Event from
# the repository. If the length of any attribute exceeds this value, it will be truncated when the event is retrieved.
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
# Volatile Provenance Respository Properties
nifi.provenance.repository.buffer.size=100000
# Component and Node Status History Repository
nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
# Volatile Status History Repository Properties
nifi.components.status.repository.buffer.size=1440
nifi.components.status.snapshot.frequency=1 min
# QuestDB Status History Repository Properties
nifi.status.repository.questdb.persist.node.days=14
nifi.status.repository.questdb.persist.component.days=3
nifi.status.repository.questdb.persist.location=./status_repository
# Site to Site properties
nifi.remote.input.host=node1
nifi.remote.input.secure=false
nifi.remote.input.socket.port=10001
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs
# web properties #
#############################################
# For security, NiFi will present the UI on 127.0.0.1 and only be accessible through this loopback interface.
# Be aware that changing these properties may affect how your instance can be accessed without any restriction.
# We recommend configuring HTTPS instead. The administrators guide provides instructions on how to do this.
nifi.web.http.host=node1
nifi.web.http.port=8081
nifi.web.http.network.interface.default=
#############################################
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
nifi.web.proxy.host=
nifi.web.max.content.size=
nifi.web.max.requests.per.second=30000
nifi.web.request.timeout=60 secs
nifi.web.request.ip.whitelist=
nifi.web.should.send.server.version=true
# Include or Exclude TLS Cipher Suites for HTTPS
nifi.web.https.ciphersuites.include=
nifi.web.https.ciphersuites.exclude=
# security properties #
nifi.sensitive.props.key=sf4eCVtTmnwRfMd5LarMMkKyTONuLvgE
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=NIFI_PBKDF2_AES_GCM_256
nifi.sensitive.props.provider=BC
nifi.sensitive.props.additional.keys=
nifi.security.autoreload.enabled=false
nifi.security.autoreload.interval=10 secs
nifi.security.keystore=./conf/keystore.p12
nifi.security.keystoreType=PKCS12
nifi.security.keystorePasswd=df9e762b67b2c74eb1ea147be8d7ecf0
nifi.security.keyPasswd=df9e762b67b2c74eb1ea147be8d7ecf0
nifi.security.truststore=./conf/truststore.p12
nifi.security.truststoreType=PKCS12
nifi.security.truststorePasswd=6d115b1494e9dd5112c2d2dc0608bc85
nifi.security.user.authorizer=single-user-authorizer
nifi.security.allow.anonymous.authentication=false
nifi.security.user.login.identity.provider=single-user-provider
nifi.security.ocsp.responder.url=
nifi.security.ocsp.responder.certificate=
# OpenId Connect SSO Properties #
nifi.security.user.oidc.discovery.url=
nifi.security.user.oidc.connect.timeout=5 secs
nifi.security.user.oidc.read.timeout=5 secs
nifi.security.user.oidc.client.id=
nifi.security.user.oidc.client.secret=
nifi.security.user.oidc.preferred.jwsalgorithm=
nifi.security.user.oidc.additional.scopes=
nifi.security.user.oidc.claim.identifying.user=
nifi.security.user.oidc.fallback.claims.identifying.user=
# Apache Knox SSO Properties #
nifi.security.user.knox.url=
nifi.security.user.knox.publicKey=
nifi.security.user.knox.cookieName=hadoop-jwt
nifi.security.user.knox.audiences=
# SAML Properties #
nifi.security.user.saml.idp.metadata.url=
nifi.security.user.saml.sp.entity.id=
nifi.security.user.saml.identity.attribute.name=
nifi.security.user.saml.group.attribute.name=
nifi.security.user.saml.metadata.signing.enabled=false
nifi.security.user.saml.request.signing.enabled=false
nifi.security.user.saml.want.assertions.signed=true
nifi.security.user.saml.signature.algorithm=http://www.w3.org/2001/04/xmldsig-more#rsa-sha256
nifi.security.user.saml.signature.digest.algorithm=http://www.w3.org/2001/04/xmlenc#sha256
nifi.security.user.saml.message.logging.enabled=false
nifi.security.user.saml.authentication.expiration=12 hours
nifi.security.user.saml.single.logout.enabled=false
nifi.security.user.saml.http.client.truststore.strategy=JDK
nifi.security.user.saml.http.client.connect.timeout=30 secs
nifi.security.user.saml.http.client.read.timeout=30 secs
# Identity Mapping Properties #
# These properties allow normalizing user identities such that identities coming from different identity providers
# (certificates, LDAP, Kerberos) can be treated the same internally in NiFi. The following example demonstrates normalizing
# DNs from certificates and principals from Kerberos into a common identity string:
#
# nifi.security.identity.mapping.pattern.dn=^CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$
# nifi.security.identity.mapping.value.dn=$1#$2
# nifi.security.identity.mapping.transform.dn=NONE
# nifi.security.identity.mapping.pattern.kerb=^(.*?)/instance#(.*?)$
# nifi.security.identity.mapping.value.kerb=$1#$2
# nifi.security.identity.mapping.transform.kerb=UPPER
# Group Mapping Properties #
# These properties allow normalizing group names coming from external sources like LDAP. The following example
# lowercases any group name.
#
# nifi.security.group.mapping.pattern.anygroup=^(.*)$
# nifi.security.group.mapping.value.anygroup=$1
# nifi.security.group.mapping.transform.anygroup=LOWER
# cluster common properties (all nodes must have same values) #
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.heartbeat.missable.max=8
nifi.cluster.protocol.is.secure=false
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=node1
nifi.cluster.node.protocol.port=9991
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=60 sec
nifi.cluster.node.read.timeout=60 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
# cluster load balancing properties #
nifi.cluster.load.balance.host=node1
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=2
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=node1:2181,node2:2182,node3:2183
nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.client.secure=false
nifi.zookeeper.security.keystore=
nifi.zookeeper.security.keystoreType=
nifi.zookeeper.security.keystorePasswd=
nifi.zookeeper.security.truststore=
nifi.zookeeper.security.truststoreType=
nifi.zookeeper.security.truststorePasswd=
# Zookeeper properties for the authentication scheme used when creating acls on znodes used for cluster management
# Values supported for nifi.zookeeper.auth.type are "default", which will apply world/anyone rights on znodes
# and "sasl" which will give rights to the sasl/kerberos identity used to authenticate the nifi node
# The identity is determined using the value in nifi.kerberos.service.principal and the removeHostFromPrincipal
# and removeRealmFromPrincipal values (which should align with the kerberos.removeHostFromPrincipal and kerberos.removeRealmFromPrincipal
# values configured on the zookeeper server).
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=
# kerberos #
nifi.kerberos.krb5.file=
# kerberos service principal #
nifi.kerberos.service.principal=
nifi.kerberos.service.keytab.location=
# kerberos spnego principal #
nifi.kerberos.spnego.principal=
nifi.kerberos.spnego.keytab.location=
nifi.kerberos.spnego.authentication.expiration=12 hours
# external properties files for variable registry
# supports a comma delimited list of file locations
nifi.variable.registry.properties=
# analytics properties #
nifi.analytics.predict.enabled=false
nifi.analytics.predict.interval=3 mins
nifi.analytics.query.interval=5 mins
nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
nifi.analytics.connection.model.score.name=rSquared
nifi.analytics.connection.model.score.threshold=.90
# runtime monitoring properties
nifi.monitor.long.running.task.schedule=
nifi.monitor.long.running.task.threshold=
Hopefully someone can give some advice.
Thanks in advance.

Issue with adding a new member to an etcd cluster

I have a 3-node etcd cluster running on Docker.
Node1:
etcd-advertise-client-urls: "http://sensu-backend1:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend1"
Node2:
etcd-advertise-client-urls: "http://sensu-backend2:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend2"
Node3:
etcd-advertise-client-urls: "http://sensu-backend3:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend3"
I am running each node as a docker service without persisting the etcd data directory.
When I start all the nodes together etcd forms the cluster.
If I delete one node and try to add it back with etcd-initial-cluster-state: "existing", I get the following error:
{"component":"etcd","level":"fatal","msg":"tocommit(6264) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?","pkg":"raft","time":"2020-12-09T11:32:55Z"}
After stopping etcd, I deleted the node from the cluster using etcdctl member remove. When I restart the container with an empty etcd data directory, I get a cluster ID mismatch error:
{"component":"backend","error":"error starting etcd: error validating peerURLs {ClusterID:4bccd6f485bb66f5 Members:[\u0026{ID:2ea5b7e4c09185e2 RaftAttributes:{PeerURLs:[http://sensu-backend1:2380]} Attributes:{Name:sensu-backend1 ClientURLs:[http://sensu-backend1:2379]}} \u0026{ID:9e83e7f64749072d RaftAttributes:{PeerURLs:[http://sensu-backend2:2380]} Attributes:{Name:sensu-backend2 ClientURLs:[http://sensu-backend2:2379]}}] RemovedMemberIDs:[]}: member count is unequal"}
Please help me fix this issue.
If you delete a node that was part of a cluster, you should also manually remove it from the etcd cluster, i.e. by running 'etcdctl member remove'.
The member count mismatch error occurs because 'etcd-initial-cluster' still lists all 3 nodes; you need to remove the deleted node's entry from this field in all containers as well.
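For reference, the remove-and-re-add sequence with etcdctl (v3 API; the member ID and node name below are placeholders for your setup) typically looks like this:
etcdctl member list
etcdctl member remove <member-id>
etcdctl member add sensu-backend3 --peer-urls="http://sensu-backend3:2380"
After that, start the replacement container with etcd-initial-cluster-state: "existing" and an etcd-initial-cluster value that matches the current member list.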

Gremlin-Python Connecting to existing JanusGraph

I have created a graph using the Gremlin Console:
gremlin> ConfiguredGraphFactory.graphNames
==>MYGRAPH
gremlin> ConfiguredGraphFactory.getConfiguration('MYGRAPH')
==>storage.backend=cql
==>graph.graphname=MYGRAPH
==>storage.hostname=127.0.0.1
==>Template_Configuration=false
gremlin> g.V().properties()
==>vp[name->SFO]
==>vp[country->USA]
==>vp[name->ALD]
==>vp[country->IND]
==>vp[name->BLR]
==>vp[country->IND]
gremlin>
I want to connect to MYGRAPH using gremlin-python.
Can someone please tell me how to access the graph named "MYGRAPH" using gremlin-python?
Thanks in advance.
First of all you will need to install some jar files for JanusGraph to handle gremlin-python scripts:
./bin/gremlin-server.sh -i org.apache.tinkerpop gremlin-python 3.2.9
Please note that the version of gremlin-python you install must match the Tinkerpop version JanusGraph is compatible with. You can find compatibility information on the JanusGraph releases page. For example JanusGraph 0.2.2 is compatible with Tinkerpop 3.2.9.
Next you need to start a JanusGraph server using ConfiguredGraphFactory. You just have to use the file conf/gremlin-server/gremlin-server-configuration.yaml from the distribution.
bin/gremlin-server.sh conf/gremlin-server/gremlin-server-configuration.yaml
This file differs from the traditional conf/gremlin-server/gremlin-server.yaml in these few lines:
graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties
}
Then we need to load the graph MYGRAPH in the server's initialization script. Create an init script scripts/init.groovy; here you can load as many different graphs as you want.
def globals = [:]
myGraph = ConfiguredGraphFactory.open("MYGRAPH")
globals << [myGraphTraversal : myGraph.traversal()]
Make sure this script is executed when Gremlin Server starts by referencing it in conf/gremlin-server/gremlin-server-configuration.yaml:
scriptEngines: {
gremlin-groovy: {
imports: [java.lang.Math],
staticImports: [java.lang.Math.PI],
scripts: [scripts/init.groovy]}}
Finally in your Python project, install the gremlin-python package that matches the Tinkerpop version of your version of JanusGraph. In case of JanusGraph 0.2.2, this is version 3.2.9.
pip install gremlin-python==3.2.9
Start a Python shell and start coding:
>>> from gremlin_python import statics
>>> from gremlin_python.structure.graph import Graph
>>> from gremlin_python.process.graph_traversal import __
>>> from gremlin_python.process.strategies import *
>>> from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
>>> graph = Graph()
>>> myGraphTraversal = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','myGraphTraversal'))
>>> myGraphTraversal.V().count().next()

How can I find the result of debugging a topology in Storm?

I'm new to Storm and trying to use debugging.
I set topology.debug: true in storm.yaml,
but after submitting the topology I couldn't find where the debug output goes.
I noticed in the Storm UI that topology.debug is false!
Why didn't it pick up my changes?
Each node/machine in your cluster has its own storm.yaml file. Thus, your changes to your local storm.yaml do not have any effect. However, you can override this value via a topology configuration that is provided when you submit the topology:
// requires org.apache.storm.Config and org.apache.storm.StormSubmitter (backtype.storm.* on older Storm versions)
Config cfg = new Config();
cfg.setDebug(true); // equivalent to setting topology.debug=true for this topology only
StormSubmitter.submitTopology("myTopology", cfg, builder.createTopology());
You will find the log files on the nodes in your cluster under your_storm_dir/logs/.

MLlib ALS cannot delete checkpoint RDDs Wrong FS: hdfs://[url] expected: file:///

I am using Spark MLlib's ALS class to train a MatrixFactorizationModel. I have set up HDFS for checkpointing intermediate RDDs (as suggested by the ALS class). The RDDs are being saved, but I get an exception when it tries to delete them again: java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
Here is the stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:568)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:566)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2.apply$mcV$sp(ALS.scala:566)
at org.apache.spark.ml.recommendation.ALS$$anonfun$train$1.apply$mcVI$sp(ALS.scala:602)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:596)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:219)
at adapters.ALSAdapter.run(ALSAdapter.java:59) ...
The culprit seems to be org.apache.spark.ml.recommendation.ALS.scala, line 568:
FileSystem.get(sc.hadoopConfiguration).delete(new Path(file), true)
where FileSystem.get(sc.hadoopConfiguration) appears to return a RawLocalFileSystem instead of a distributed filesystem object.
I have not touched sc.hadoopConfiguration. The only interaction I've had is to call myJavaStreamingContext.checkpoint(hdfs://[url + directory]);.
Is there anything further I need to do on the client side to set up sc.hadoopConfiguration, or would the problem be on the HDFS server side?
I was using Spark 1.3.1, but I tried 1.4.1 and the problem persists.
Your sc.hadoopConfiguration is using the local filesystem ("fs.default.name" = file:///), and despite the checkpoint location being specified explicitly, the ALS code uses sc.hadoopConfiguration to determine the filesystem (e.g. for deleting previous checkpoints where applicable).
You can try:
sc.hadoopConfiguration.set("fs.default.name", "hdfs://xx:port")
sc.setCheckpointDir(yourCheckpointDir)
and just call checkpoint with no explicit filesystem path, or see if setCheckpointInterval will work for you (i.e. let ALS take care of checkpointing).
You can also add hdfs-site.xml and core-site.xml to the Spark classpath if you don't want to call sc.hadoopConfiguration.set in code.
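If you are building the context in Java, a minimal sketch of that setup (the HDFS URL and checkpoint path below are placeholders for your cluster) could look like this:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AlsCheckpointSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("als-checkpointing");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // Point the Hadoop configuration at HDFS so ALS deletes checkpoints on the same filesystem it wrote them to
        jsc.hadoopConfiguration().set("fs.default.name", "hdfs://namenode:8020");
        // The checkpoint directory is then resolved against that filesystem
        jsc.setCheckpointDir("/spark-checkpoints/als");
        // ... build the ratings RDD and run ALS.train(...) as usual
        jsc.stop();
    }
}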

Resources