I am totally new to EtherCAT. I have a Microchip LAN9255-based EtherCAT slave device and the TwinCAT3 software tool as the EtherCAT master. Some configuration data needs to be sent from the EtherCAT master (TwinCAT3) to the LAN9255 EtherCAT slave. This configuration data needs to be sent only when it changes (acyclically). I also have the Slave Stack Code (SSC) Tool to generate the slave stack code. With this tool I import an xlsx-based application descriptor file.
Application descriptor file
In this xlsx-based application descriptor file I want to change the transmission type of the GPIO_OUTPUT object to Event driven (transmitted when the TwinCAT3 EtherCAT master recognizes a change of state). What changes are required in this application descriptor file and in the EtherCAT slave firmware?
I am new to Spark Streaming and have little knowledge about checkpointing. Is the streaming data stored in the checkpoint? Is it stored in HDFS or in memory? How much space does it take?
According to Spark: The Definitive Guide:
The most important operational concern for a streaming application is
failure recovery. Faults are inevitable: you’re going to lose a
machine in the cluster, a schema will change by accident without a
proper migration, or you may even intentionally restart the cluster or
application. In any of these cases, Structured Streaming allows you to
recover an application by just restarting it. To do this, you must
configure the application to use checkpointing and write-ahead logs,
both of which are handled automatically by the engine. Specifically,
you must configure a query to write to a checkpoint location on a
reliable file system (e.g., HDFS, S3, or any compatible filesystem).
Structured Streaming will then periodically save all relevant progress
information (for instance, the range of offsets processed in a given
trigger) as well as the current intermediate state values to the
checkpoint location. In a failure scenario, you simply need to restart
your application, making sure to point to the same checkpoint
location, and it will automatically recover its state and start
processing data where it left off. You do not have to manually manage
this state on behalf of the application—Structured Streaming does it
for you.
I conclude that it is the job progress information and intermediate state that are stored in the checkpoint, not the data itself; the checkpoint location has to be a path on an HDFS-compatible file system; and the required space depends on the intermediate output that is generated.
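As a concrete illustration, here is a minimal sketch (in Java) of turning on checkpointing for a Structured Streaming query; the Kafka source, topic name, and HDFS paths are assumptions for the example, not anything specific to your job:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("checkpoint-example")
                .getOrCreate();

        // Hypothetical source: a Kafka topic named "events".
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load();

        // The checkpoint location must be on a reliable, HDFS-compatible file
        // system. Only progress information (offsets) and intermediate state
        // are written there, not the streamed records themselves.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///data/events")                       // output location (assumed)
                .option("checkpointLocation", "hdfs:///checkpoints/events")  // progress + state
                .start();

        query.awaitTermination();
    }
}

If the application is restarted with the same checkpointLocation, the query resumes from the offsets recorded there.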
I have a custom NiFi processor that uses external data for some user-controlled configuration. I want to know how to signal the processor to reload the data when it is changed.
I was thinking that a FlowFile could be sent to signal the processor, but I am concerned that in a clustered environment only one processor would get the notification and all the others would still be running on the old configuration.
The most common ways to watch a file for changes are the JDK WatchService or Apache Commons IO Monitor...
https://www.baeldung.com/java-watchservice-vs-apache-commons-io-monitor-library
https://www.baeldung.com/java-nio2-watchservice
Your processor could use one of these and reload the data when the file changes; just make sure to synchronize access to the relevant fields in your processor between the code that reloads them and the code that uses them during execution.
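For example, a minimal sketch of the WatchService approach (not a complete NiFi processor; the config path and reload logic are placeholders) could look like this:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.atomic.AtomicReference;

// In a real processor you would start this from an @OnScheduled method,
// stop it in @OnStopped, and read currentConfig() inside onTrigger().
public class ConfigReloader implements Runnable {

    private final Path configFile;  // e.g. /etc/my-processor/config.json (assumed)
    private final AtomicReference<String> config = new AtomicReference<>("");

    public ConfigReloader(Path configFile) throws IOException {
        this.configFile = configFile;
        config.set(new String(Files.readAllBytes(configFile)));  // initial load
    }

    // Called by the processing code; always sees the most recently loaded value.
    public String currentConfig() {
        return config.get();
    }

    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // WatchService watches directories, so register the parent directory.
            configFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (!Thread.currentThread().isInterrupted()) {
                WatchKey key = watcher.take();  // blocks until a change event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (configFile.getFileName().equals(event.context())) {
                        config.set(new String(Files.readAllBytes(configFile)));  // reload
                    }
                }
                key.reset();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException("Failed to watch " + configFile, e);
        }
    }
}

Holding the loaded value in an AtomicReference (or a volatile field) covers the synchronization concern: onTrigger() always reads a fully loaded configuration. And provided the file is present on every node, each node's watcher picks up the change independently, which avoids the single-notification problem of a signaling FlowFile.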
Why can't the HDFS client send data directly to the DataNode?
What's the advantage of the HDFS client cache?
An application request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local file.
Application writes are transparently redirected to this temporary local file.
When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode to create a file.
The NameNode then proceeds as described in the section on Create. The client flushes the block of data from the local temporary file to the specified DataNodes.
When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
The client then tells the NameNode that the file is closed.
At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out-of-date and no longer an accurate description of current HDFS behavior.
Instead, the client immediately issues a create RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then, the client starts writing data to the file. As the client writes data, it is writing on a socket connection to a DataNode. If the written data becomes large enough to cross a block size boundary, then the client will interact with the NameNode again for an addBlock RPC to allocate a new block in NameNode metadata and get a new set of candidate DataNode locations. There is no point at which the client is writing to a local temporary file.
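From the client's point of view all of this is hidden behind the FileSystem API. A minimal sketch of a write (the cluster address and path are assumptions) looks like this; the data is streamed to the DataNodes as it is written rather than being staged in a local temporary file:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed cluster address

        try (FileSystem fs = FileSystem.get(conf);
             // create() issues the create RPC to the NameNode immediately
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            // Each write goes onto a socket pipeline to the DataNodes; a new block
            // is allocated via the addBlock RPC whenever a block boundary is crossed.
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }  // close() completes the file on the NameNode
    }
}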
Note however that alternative file systems, such as S3AFileSystem which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation for Integration with Amazon Web Services if you're interested in more details on this.)
I have filed Apache JIRA HDFS-11995 to track correcting the documentation.
We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data from Hadoop (raw JSON files). Now I have to verify that the data sent from the other system and the data read by Kafka are the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive saves internally, i.e. a single folder for a single table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format, which is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks but also to offload "cold data" to a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and, if you write, you always need to write to the partition leader (if you use a KafkaProducer, it will automatically find the leader for each partition you write to).
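For example, here is a minimal AdminClient sketch (a newer alternative to the API shown on that wiki page; the broker address and topic name are assumptions) that prints the leader of each partition:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicMetadataExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("my-topic"))
                    .all().get()
                    .get("my-topic");

            // Print the current leader broker for every partition of the topic.
            description.partitions().forEach(p ->
                    System.out.printf("partition %d -> leader %s%n", p.partition(), p.leader()));
        }
    }
}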
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do it manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the following:
Assuming all servers are running (check the Confluent website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: This approach assumes that you are using their platform (the Confluent Platform), which I haven't used.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
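If you only need to read the messages back from the topic and compare them against the source JSON yourself (rather than sinking them to HDFS first), a plain consumer is enough. A minimal sketch, with an assumed broker address and topic name:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ValidationConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("group.id", "validation-check");
        props.put("auto.offset.reset", "earliest");         // read the topic from the beginning
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("your-topic"));  // assumed topic name
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Compare record.value() (the raw JSON) against the source data here.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}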
This happens to most beginners. Let's first understand that a component you see in Big Data processing may not be related to Hadoop at all.
YARN, MapReduce, and HDFS are the three main core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage.
In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.
I am setting up Flume but am not sure what topology to go ahead with for our use case.
We basically have two web servers which can generate logs at a rate of 2000 entries per second, each entry around 137 bytes in size.
Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs. We are running a local Flume agent on each web server; these local agents listen on a TCP port and put the data directly into HDFS.
So localhost:tcpport is the Flume source and HDFS is the Flume sink.
I am not sure about the above approach and am confused between three approaches:
Approach 1: Web server, rsyslog, and Flume agent on each machine, with a Flume collector running on the NameNode in the Hadoop cluster to collect the data and dump it into HDFS.
Approach 2: Web server and rsyslog on the same machine, with a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the NameNode in the Hadoop cluster to collect the data and dump it into HDFS.
Approach 3: Web server, rsyslog, and Flume agent on the same machine, with all agents writing directly to HDFS.
Also, we are using Hive and writing directly into partitioned directories, so we want an approach that allows us to write to hourly partitions.
Basically I just want to know if people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.
I hope that's not too vague. Any help would be appreciated.
The typical suggestion for your problem is a fan-in or converging-flow agent deployment model (Google "flume fan in" for more details). In this model you would ideally have an agent on each web server. Each of those agents forwards the events to a few aggregator or collector agents. The aggregator agents then forward the events to a final destination agent that writes to HDFS.
This tiered architecture allows you to simplify scaling, failover etc.