Does trino supports output of a query to be written into multiple files? - trino

I'm querying onto trino how, can I generate more than 1 output files of the result.
Does trino decides this on it's own? If not how we can tell trino to create more than 1 file.

Related

Transfer CSV files from azure blob storage to azure SQL database using azure data factory

I need to transfer around 20 CSV files inside a folder named ActivityPointer in an azure blob storage container to Azure SQL database in a single data factory pipeline, but ActivityPointer contains 20 CSV files and another folder named snapshots inside it. So when I try to create a pipeline and give * to select all the CSV files inside ActivityPointer it includes the snapshots folder too, which should not be included. Is there any possibilities to complete this task. Also I can't create another folder to transform the snapshots folder into it. What can I do now? Anyone can please help me out.
Assuming you want to copy all CSV files within ACtivityPointer folder,
You can use wildcard expression as below :
you can provide path till Active folder and than *.csv
Copy data is also considering the inner folder while using wildcards (even if we use .csv in wildcard file path). So, we have to validate whether it is a file or folder. Please look at the following demonstration.
First use Get Metadata on the required folder with field list as Child items. The debug output will be:
Now use this to iterate through child items using For each activity.
#activity('Get Metadata1').output.childItems
Inside for each, use if condition activity to check whether the current item is a file or not. Use the following condition.
#equals(item().type,'File')
When this is true, you can use copy data to complete copying the file to target table (Ignore the false case). I have create file_name parameter in my source dataset passing its value as #item().name().
This will help you to achieve your requirement. The following is the debug output. I have 4 files and 1 folder. The folder will be ignored, and the rest will be copied into the target table.

how to save a text file to hive using table of context as schema

I have many project reports in text format (word and pdf). These files contains data that I want to extract; Such as references, keywords, names mentioned .......
I want to process these files with Apache spark and save the result to hive,
use the power of dataframe (use the table of context as schema) is that possible?
May you share with me any ideas about how to process these files?
As far as I understand, you will need to parse the files using Tika and manually create custom schema s as described here.
Let me know if this helps. Cheers.

How to define/process NiFi FlowFile content without a schema?

I have been experimenting with Nifi (via Hortonworks HDF) and can't grasp something that seems basic. I start with a file..I want to filter records based on a value (e.g., field number 2 contains "xyz")...then write only the matching records to HDFS.
I have a non-filtered flow working but can't find any docs or examples showing how to apply a schema (or in any way understand the "content") of the file.
What am I missing?
(I have seen references to FlowFiles being "schema-less" - not sure how that is useful).
For example, a file like this:
[app1][field1][field2]
[app2][field1][field2]
[app2][field1][field2][field3]
[app3][field1]
I want to filter to select records containing "[app2]" and create an HDFS file that looks like this:
field1,field2
field1,field2,field3

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a stand alone EC2 VM, I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution that would be as easy as pipeing the results of a SQL query into a program that dumps that data into a cloud files system (S3, HDFS?), then I will want to convert that data into a format that can be read with DynamoDB.
So I need these things:
1) A transport device, I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system, S3 looks like the obvious choice since I want a turn key solution here.
3) This last part is a nice to have because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert to happen whenever you copy a new file to S3 is pretty simple, just have S3 trigger a Lambda function to process the data and insert into DynamoDB. I'm not clear on how you are going to get the data into S3 though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
You need to know that you can't append to a file on S3, you would need to write the entire file each time you push new data to S3. If you are wanting to stream the data somehow then using Kenesis instead of S3 might be a better option.

writing multiple files (different content) using spring batch

I have a requirement to write multiple files using Spring Batch. The first file will be written based on the data from the database table. The second file will contain just the number of records written to the first file. How can I create the second file? I am not sure whether org.springframework.batch.item.file.MultiResourceItemWriter is an option for me as I think it will write multiple files based on the data it will write chunks of data in the multiple files. Correct me if I am wrong here.
Please do suggest some options with sample code if possible.
You have couple of options:
You can use CompositeItemWriter which calls collection of item writers in defined order so you can define one item writer which will write records based on data from DB and second will count records and write that to another file.
You can write data to a file in first step, finish whole file and save it somewhere, you can save counter of records if that is all you need to StepContext (common batch patterns and scroll to 11.8 Passing Data to Future Steps) and read in new Taskletcounter and save to new file.
If you want to go with option 1 which I think is right choice you can check this example of batch job configuration with CompositeItemWriter

Resources