I'm new to Dataflow and am trying to set up a streaming pipeline that reads CSV files from Google Cloud Storage and loads them into BigQuery. The pipeline is created successfully and the CSV files are read and parsed, but the pipeline never gets fully initialized, so no data is loaded into BigQuery.
I use Java 8 and Apache Beam 2.5.0.
When I inspect the Dataflow execution graph, I see that the step Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/ParDo(StreamingPCollectionViewWriter) receives input but never produces any output. Because of that, the following step Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/CreateDataflowView is never executed.
My code is "inspired" by examples such as this one:
https://github.com/asaharland/beam-pipeline-examples/blob/master/src/main/java/com/harland/example/streaming/StreamingFilePipeline.java
public class MyStreamPipeline {
private static final Logger LOG = LoggerFactory.getLogger(MyStreamPipeline.class);
private static final int WINDOW_SIZE_SECONDS = 120;
public interface MyOptions extends PipelineOptions, GcpOptions {
@Description("BigQuery Table Spec project_id:dataset_id.table_id")
ValueProvider<String> getBigQueryTableSpec();
void setBigQueryTableSpec(ValueProvider<String> value);
@Description("Google Cloud Storage Bucket Name")
ValueProvider<String> getBucketUrl();
void setBucketUrl(ValueProvider<String> value);
}
public static void main(String[] args) throws IOException {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options);
List<TableFieldSchema> tableFields = new ArrayList<>();
tableFields.add(new TableFieldSchema().setName("FIELD_NAME").setType("INTEGER"));
// ... more fields here ...
TableSchema schema = new TableSchema().setFields(tableFields);
pipeline
.apply("Read CSV as string from Google Cloud Storage",
TextIO
.read()
.from(options.getBucketUrl() + "/**")
.watchForNewFiles(
// Check for new files every 1 minute(s)
Duration.standardMinutes(1),
// Never stop checking for new files
Watch.Growth.never())
)
.apply(String.format("Window Into %d Second Windows", WINDOW_SIZE_SECONDS),
Window.into(FixedWindows.of(Duration.standardSeconds(WINDOW_SIZE_SECONDS))))
.apply("Convert CSV string to Record",
ParDo.of(new CsvToRecordFn()))
.apply("Record to TableRow",
ParDo.of(new DoFn<Record, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) {
Record record = c.element();
TableRow tr = record.getTableRow();
c.output(tr);
}
}))
.apply("Write to bigquery",
BigQueryIO
.writeTableRows()
.to(options.getBigQueryTableSpec())
.withSchema(schema)
.withTimePartitioning(new TimePartitioning().setField("PARTITION_FIELD_NAME").setType("DAY"))
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
pipeline.run();
}
}
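For reference, here is roughly how the write step could pin the write method explicitly. I have not confirmed that either variant fixes the stalled view, so treat this as a sketch of the two configurations I'm considering rather than a known fix (method and parameter names are from the BigQueryIO.Write API; time partitioning omitted for brevity):
// Variant A: streaming inserts, which bypasses the BatchLoads path entirely.
.apply("Write to bigquery",
BigQueryIO
.writeTableRows()
.to(options.getBigQueryTableSpec())
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
// Variant B: keep file loads, but give the unbounded input an explicit
// triggering frequency and shard count for the periodic load jobs.
.apply("Write to bigquery",
BigQueryIO
.writeTableRows()
.to(options.getBigQueryTableSpec())
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));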
I execute the pipeline using a Maven command like this one:
mvn -f pom_2-5-0.xml clean compile exec:java \
-Dexec.mainClass=com.organization.processor.gcs.myproject.MyStreamPipeline \
-Dexec.args=" \
--project=$PROJECT_ID \
--stagingLocation=gs://$PROJECT_ID-processor/$VERSION/staging \
--tempLocation=gs://$PROJECT_ID-processor/$VERSION/temp/ \
--gcpTempLocation=gs://$PROJECT_ID-processor/$VERSION/gcptemp/ \
--runner=DataflowRunner \
--zone=$DF_ZONE \
--region=$DF_REGION \
--numWorkers=$DF_NUM_WORKERS \
--maxNumWorkers=$DF_MAX_NUM_WORKERS \
--diskSizeGb=$DF_DISK_SIZE_GB \
--workerMachineType=$DF_WORKER_MACHINE_TYPE \
--bucketUrl=$GCS_BUCKET_URL \
--bigQueryTableSpec=$PROJECT_ID:$BQ_TABLE_SPEC \
--streaming"
I really don't understand why the pipeline isn't initialized properly and why no data is loaded into BigQuery.
Any help appreciated!
Related
I'm creating a POC to store files in Azure following the steps in https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-dotnet. In the snippet below, creating the directory fails with the message No such host is known. (securedfstest02.blob.core.windows.net:443). I'd appreciate any suggestion to work around this issue.
using Azure;
using Azure.Storage;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace DataLakeHelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello World!");
try
{
CreateFileClientAsync_DirectoryAsync().Wait();
}
catch(Exception e)
{
Console.WriteLine(e);
}
}
static async Task CreateFileClientAsync_DirectoryAsync()
{
// Make StorageSharedKeyCredential to pass to the serviceClient
string storageAccountName = "secureblobtest02";
string storageAccountKey = "mykeyredacted";
string dfsUri = "https://" + storageAccountName + ".dfs.core.windows.net";
StorageSharedKeyCredential sharedKeyCredential = new StorageSharedKeyCredential(storageAccountName, storageAccountKey);
// Create DataLakeServiceClient using StorageSharedKeyCredentials
DataLakeServiceClient serviceClient = new DataLakeServiceClient(new Uri(dfsUri), sharedKeyCredential);
// Create a DataLake Filesystem
DataLakeFileSystemClient filesystem = serviceClient.GetFileSystemClient("my-filesystem");
if(!await filesystem.ExistsAsync())
await filesystem.CreateAsync();
//Create a DataLake Directory
DataLakeDirectoryClient directory = filesystem.CreateDirectory("my-dir");
if (!await directory.ExistsAsync())
await directory.CreateAsync();
// Create a DataLake File using a DataLake Directory
DataLakeFileClient file = directory.GetFileClient("my-file");
if(!await file.ExistsAsync())
await file.CreateAsync();
// Verify we created one file
var response = filesystem.GetPathsAsync();
IAsyncEnumerator<PathItem> enumerator = response.GetAsyncEnumerator();
Console.WriteLine(enumerator?.Current?.Name);
// Cleanup
await filesystem.DeleteAsync();
}
}
}
--Update
In your question you mention Azure Data Lake, but the host in the error is securedfstest02.blob.core.windows.net.
Azure Data Lake Storage uses the .dfs.core.windows.net endpoint (https://<account_name>.dfs.core.windows.net), whereas Azure Blob Storage uses .blob.core.windows.net (https://<account_name>.blob.core.windows.net). When using Blob-service operations against ADLS, you have to change the endpoint accordingly; please note the URI templates in the official MS docs.
I used the same code and was able to create the directory; I only swapped in my own ADLS credentials. I have not configured any additional permissions, and my ADLS account allows access from all networks. You might want to check whether yours is configured for a specific network by default, or whether the firewall allows your client IP.
using Azure;
using Azure.Storage;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace DataLakeHelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Starting....");
try
{
Console.WriteLine("Executing...");
CreateFileClientAsync_DirectoryAsync().Wait();
Console.WriteLine("Done");
}
catch (Exception e)
{
Console.WriteLine(e);
}
}
static async Task CreateFileClientAsync_DirectoryAsync()
{
// Make StorageSharedKeyCredential to pass to the serviceClient
string storageAccountName = "kteststarageeadls";
string storageAccountKey = "6fAe+P8LRe8LH0Ahxxxxxxxxx5ma17Slr7SjLy4oVYSgj05m+zWZuy5X8p4/Bbxxx8efzCj/X+On/Fwmxxxo7g==";
string dfsUri = "https://" + "kteststarageeadls" + ".dfs.core.windows.net";
StorageSharedKeyCredential sharedKeyCredential = new StorageSharedKeyCredential(storageAccountName, storageAccountKey);
// Create DataLakeServiceClient using StorageSharedKeyCredentials
DataLakeServiceClient serviceClient = new DataLakeServiceClient(new Uri(dfsUri), sharedKeyCredential);
// Create a DataLake Filesystem
DataLakeFileSystemClient filesystem = serviceClient.GetFileSystemClient("my-filesystem");
if (!await filesystem.ExistsAsync())
await filesystem.CreateAsync();
//Create a DataLake Directory
DataLakeDirectoryClient directory = filesystem.CreateDirectory("my-dir");
if (!await directory.ExistsAsync())
await directory.CreateAsync();
// Create a DataLake File using a DataLake Directory
DataLakeFileClient file = directory.GetFileClient("my-file");
if (!await file.ExistsAsync())
await file.CreateAsync();
// Verify we created one file
var response = filesystem.GetPathsAsync();
IAsyncEnumerator<PathItem> enumerator = response.GetAsyncEnumerator();
Console.WriteLine(enumerator?.Current?.Name);
// Cleanup
//await filesystem.DeleteAsync();
}
}
}
I've edited the storage account key; it's shown for reference only.
I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
FlowFile flowFile = session.get();
if (flowFile == null) {
return;
}
final ComponentLog logger = getLogger();
AtomicBoolean error = new AtomicBoolean();
AtomicReference<String> result = new AtomicReference<>(null);
// This uses a lambda function in place of a callback for InputStreamCallback#process()
session.read(flowFile, in -> {
long start = System.nanoTime();
// Read the flowfile content into a String
// TODO: May need to buffer this if the content is large
try {
final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
result.set(new MyMathOperationService().performSomeOperation(contents));
long stop = System.nanoTime();
if (getLogger().isDebugEnabled()) {
final long durationNanos = stop - start;
DecimalFormat df = new DecimalFormat("#.###");
getLogger().debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
}
} catch (Exception e) {
error.set(true);
getLogger().error(e.getMessage() + " Routing to failure.", e);
}
});
if (error.get()) {
session.transfer(flowFile, REL_FAILURE);
} else {
// Again, a lambda takes the place of StreamCallback#process()
FlowFile updatedFlowFile = session.write(flowFile, (in, out) -> {
final String resultString = result.get();
final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
// TODO: This can use a while loop for performance
out.write(resultBytes, 0, resultBytes.length);
out.flush();
});
session.transfer(updatedFlowFile, REL_SUCCESS);
}
}
Daggett is right that the ExecuteScript processor is a good place to start, because it shortens the development lifecycle (no building NARs, deploying, and restarting NiFi to use it). Once you have the correct behavior, you can easily copy/paste it into the generated skeleton and deploy it once.
I'm a beginner with Java and use the console to compile and run my programs. I'm trying to read data from an MS Access .accdb file with the UCanAccess driver. I have added the 5 UCanAccess files to C:\Program Files\Java\jdk1.8.0_60\jre\lib\ext, but I'm still getting Exception java.lang.ClassNotFoundException:net.ucanaccess.jdbc.ucanaccessDriver.
Here is my code.
import java.sql.*;
public class jdbcTest
{
public static void main(String[] args)
{
try
{
Class.forName("net.ucanaccess.jdbc.UcanaccessDriver");
String url = "jdbc:ucanaccess://C:javawork/PersonInfoDB/PersonInfo.accdb";
Connection conctn = DriverManager.getConnection(url);
Statement statmnt = conctn.createStatement();
String sql = "SELECT * FROM person";
ResultSet rsltSet = statmnt.executeQuery(sql);
while(rsltSet.next())
{
String name = rsltSet.getString("name-");
String address = rsltSet.getString("address");
String phoneNum = rsltSet.getString("phoneNumber");
System.out.println(name + " " + address + " " + phoneNum);
}
conctn.close();
}
catch(Exception sqlExcptn)
{
System.out.println(sqlExcptn);
}
}
}
Please add the JDBC driver jar to your lib folder.
Download URL: download jar
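If you want to keep compiling and running from the console instead of relying on jre\lib\ext, you can also put UCanAccess and its dependencies on the classpath explicitly. Roughly like this on Windows (jar names and versions are only illustrative; use the ones shipped in the lib folder of your UCanAccess download):
javac jdbcTest.java
java -cp ".;ucanaccess-4.0.4.jar;jackcess-2.1.11.jar;hsqldb-2.3.1.jar;commons-lang-2.6.jar;commons-logging-1.1.3.jar" jdbcTest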
I tried the method mentioned by Gord in his post Manipulating an Access database from Java without ODBC and used Eclipse instead of compiling and running from the command line. To learn the Eclipse basics, I watched this video tutorial: https://www.youtube.com/watch?v=mMu-JlBrYXo.
Finally I was able to read my MS Access database file from my Java code.
I'm trying to run the YCSB benchmark against Elasticsearch.
The problem I'm having is that after the load phase, the data seems to get removed during cleanup.
I'm struggling to understand what is supposed to happen.
If I comment out the cleanup, it still fails because it cannot find the index during the "run" phase.
Can someone please explain what is supposed to happen in YCSB?
My understanding is that it should:
1. load phase: load, say, 1,000,000 records
2. run phase: query the records loaded during the load phase
Thanks,
Okay, by running the Couchbase binding in YCSB I have discovered that the data shouldn't be removed.
Looking at cleanup() for ElasticSearchClient, I see no reason why the files would be deleted (?)
#Override
public void cleanup() throws DBException {
if (!node.isClosed()) {
client.close();
node.stop();
node.close();
}
}
The init() is as follows; is there any reason this would not persist on the filesystem?
public void init() throws DBException {
// initialize the Elasticsearch node and client
Properties props = getProperties();
this.indexKey = props.getProperty("es.index.key", DEFAULT_INDEX_KEY);
String clusterName = props.getProperty("cluster.name", DEFAULT_CLUSTER_NAME);
Boolean newdb = Boolean.parseBoolean(props.getProperty("elasticsearch.newdb", "false"));
Builder settings = settingsBuilder()
.put("node.local", "true")
.put("path.data", System.getProperty("java.io.tmpdir") + "/esdata")
.put("discovery.zen.ping.multicast.enabled", "false")
.put("index.mapping._id.indexed", "true")
.put("index.gateway.type", "none")
.put("gateway.type", "none")
.put("index.number_of_shards", "1")
.put("index.number_of_replicas", "0");
//if properties file contains elasticsearch user defined properties
//add it to the settings file (will overwrite the defaults).
settings.put(props);
System.out.println("ElasticSearch starting node = " + settings.get("cluster.name"));
System.out.println("ElasticSearch node data path = " + settings.get("path.data"));
node = nodeBuilder().clusterName(clusterName).settings(settings).node();
node.start();
client = node.client();
if (newdb) {
client.admin().indices().prepareDelete(indexKey).execute().actionGet();
client.admin().indices().prepareCreate(indexKey).execute().actionGet();
} else {
boolean exists = client.admin().indices().exists(Requests.indicesExistsRequest(indexKey)).actionGet().isExists();
if (!exists) {
client.admin().indices().prepareCreate(indexKey).execute().actionGet();
}
}
}
Thanks,
Okay, what I am finding is as follows (any help from Elasticsearch users much appreciated, because I'm obviously doing something wrong):
even when the load phase shuts down leaving the data behind, the "run" phase still cannot find the data on startup.
ElasticSearch node data path = C:\Users\Pl_2\AppData\Local\Temp\/esdata
org.elasticsearch.action.NoShardAvailableActionException: [es.ycsb][0] No shard available for [[es.ycsb][usertable][user4283669858964623926]: routing [null]]
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:140)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:125)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:72)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:83)
The GitHub README has been updated.
It looks like you need to specify the data location using:
-p path.home=<path to folder to persist data>
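Assuming a standard YCSB invocation, both the load and the run phase would then point at the same persistent location, for example (the workload file and path are placeholders; adjust to your setup):
./bin/ycsb load elasticsearch -s -P workloads/workloada -p path.home=/data/esdata
./bin/ycsb run elasticsearch -s -P workloads/workloada -p path.home=/data/esdata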
I am new to Flume NG. I have to write a program which can transfer a text file to another program (an agent). I know the client must know about the agent, i.e. its host IP, port number, etc., and that a source, a channel and a sink should be defined. I just want to transfer a log file to the server. My client code is as follows.
public class MyRpcClientFacade {
public class MyClient{
private RpcClient client;
private String hostname;
private int port;
public void init(String hostname, int port) {
this.hostname = hostname;
this.port = port;
this.client = RpcClientFactory.getDefaultInstance(hostname, port);
}
public void sendDataToFlume(String data) {
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
try {
client.append(event);
} catch (EventDeliveryException e) {
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
}
}
public void cleanUp() {
client.close();
}
}
The above code can only send String data to the specified process, but I have to send files. Also, please tell me whether the source, channel and sink have to be written on the server, and if so, how to configure and write these three. Please help me, and give a small sample of a source, sink and channel.
Actually you just have to run a Flume agent on each node, and then provide a config file describing its behavior.
For instance, if your node reads a file (reading each new line and sending it as an event to the channel) and sends the file contents through an RPC (Avro) socket, your configuration will look like this:
# sources/sinks/channels list
<Agent>.sources = <Name Source1>
<Agent>.sinks = <Name Sink1>
<Agent>.channels = <Name Channel1>
# Channel attribution to a source
<Agent>.sources.<Name Source1>.channels = <Name Channel1>
# Channel attribution to sink
<Agent>.sinks.<Name Sink1>.channel = <Name Channel1>
# Configuration (sources,channels and sinks)
# Source properties : <Name Source1>
<Agent>.sources.<Name Source1>.type = exec
<Agent>.sources.<Name Source1>.command = tail -F test
<Agent>.sources.<Name Source1>.channels = <Name Channel1>
# Channel properties : <Name Channel1>
<Agent>.channels.<Name Channel1>.type = memory
<Agent>.channels.<Name Channel1>.capacity = 1000
<Agent>.channels.<Name Channel1>.transactionCapacity = 1000
# Sink properties : <Name Sink1>
<Agent>.sinks.<Name Sink1>.type = avro
<Agent>.sinks.<Name Sink1>.channel = <Name Channel1>
<Agent>.sinks.<Name Sink1>.hostname = <HOST NAME or IP>
<Agent>.sinks.<Name Sink1>.port = <PORT NUMBER>
Then you will have to set up a second agent, which reads from an Avro source on the same port and processes the events the way you want to store them.
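For reference, a minimal sketch of what that receiving agent's configuration could look like (the agent name, component names and sink directory are placeholders; the Avro source must listen on the same port your first agent's Avro sink points to):
# Receiving agent: avro source -> memory channel -> file_roll sink
collector.sources = AvroIn
collector.channels = MemCh
collector.sinks = FileOut
# Avro source listening for events sent by the first agent
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = <PORT NUMBER>
collector.sources.AvroIn.channels = MemCh
# Buffer events in memory
collector.channels.MemCh.type = memory
collector.channels.MemCh.capacity = 1000
collector.channels.MemCh.transactionCapacity = 1000
# Roll the received events into files on the collector's local disk
collector.sinks.FileOut.type = file_roll
collector.sinks.FileOut.sink.directory = /var/log/flume-collected
collector.sinks.FileOut.channel = MemCh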
I hope it helps ;)