Is there a way to check if a file was already processed by Auto Loader? - azure-databricks

I upload files to the Databricks file system every two days. Is there a place or a log where I can see whether a given file has been processed or not?
Thanks

Please refer to this link: https://docs.databricks.com/ingestion/auto-loader/production.html#querying-files-discovered-by-auto-loader
For Python code, refer to this link: Get the list of loaded files from Databricks Autoloader
Sample code:
SELECT * FROM cloud_files_state('path/to/checkpoint');
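If you'd rather query the checkpoint from compiled job code instead of a SQL cell, here is a minimal sketch in Java. It assumes the code runs on a Databricks cluster where a SparkSession and the cloud_files_state function are available, and 'path/to/checkpoint' stands in for your stream's checkpoint location:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AutoLoaderFileState {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // cloud_files_state lists the files that Auto Loader has discovered for
        // the stream writing to this checkpoint; per the linked docs, processed
        // files can be recognized by their commit_time column (check the schema
        // on your runtime version).
        Dataset<Row> state = spark.sql(
            "SELECT * FROM cloud_files_state('path/to/checkpoint')");
        state.show();
    }
}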

Related

How do you restore files that were deleted from the file system (through GetFile) and were saved as FlowFiles in the Content Repository?

So I'm a complete rookie with NiFi and when I was trying it out for the first time, I just ran a single "GetFile" processor and set it to a fairly important directory, and now all of the files are gone. I poked around in the Content Repository, and it would appear that there are a whole lot of files there that are in some unknown format. I am assuming those are the files from my HD, but are now in "FlowFile" format. I also noticed that I can look at the provenance records and download them one by one, but there are several thousands...so that is not an option.
So if I'm looking to restore all of those files, I imagine I would need to read everything in the content repository as FlowFiles and then do a PutFile. Any suggestions on how to go about this? Thanks so much!
If you still have the FlowFiles in a queue, add a PutFile processor pointing to another directory (not your important one) and move the queue over to it (click the queue that has the FlowFiles in it and drag the little blue square at the end of the relationship over to the new PutFile). Run the PutFile and let the queue drain out. The files might not come out like-for-like, but the data will be there (assuming you didn't drop any FlowFiles).
Don't develop flows on important directories that you don't have backups for. Copy a data subset to a testing directory.

How to append the current date to a property file value every day in Unix?

I've got a property file which is read several times per day by an external application in order to process some files. One of the properties tells the app where to store the processed files. The application runs on Linux.
success_path=/u02/oapp/success
The problem is that every day several files are written to that path, and after several months I would have thousands of files in this plain folder.
Question: How can I append the current date to this property file value so that it would look like:
success_path=/u02/oapp/success/dd-MMM-yyyy
This would be updated every day at 12:00 AM, so for example today it would be:
success_path=/u02/oapp/success/28-JAN-2017
The file is /u02/oapp/configuration/oapp.properties
Thanks in advance
Instead of appending the current date to the property, add additional logic to the code that stores the processed files so that:
it takes the base directory from the property file (success_path in your case)
it creates a year/month/day directory to store the files (see the sketch after the examples below)
Something like:
/u02/oapp/success/year/month/day (as in `/u02/oapp/success/2017/01/01`)
or
/u02/oapp/success/yearmonth/day (as in `/u02/oapp/success/201701/01`)
or
/u02/oapp/success/yearmonthday (as in `/u02/oapp/success/20170101`)
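If you can modify the storing code, a minimal sketch of that logic (in Java; the class and method names are made up for illustration, and the base path would normally be read from oapp.properties) might look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SuccessPathResolver {

    // Builds <success_path>/yyyy/MM/dd and creates the directories if they don't exist yet.
    static Path resolveForToday(String successPath) throws IOException {
        String datedPart = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy/MM/dd"));
        Path target = Paths.get(successPath, datedPart);
        Files.createDirectories(target); // no-op if the directory already exists
        return target;
    }

    public static void main(String[] args) throws IOException {
        // In the real app, success_path would come from /u02/oapp/configuration/oapp.properties.
        System.out.println("Store processed files under: " + resolveForToday("/u02/oapp/success"));
    }
}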
If you don't have the capability to change the app's behavior, you might need to write a cron job that periodically moves the files, outside of the app.
Alternatively, you can rewrite the property in place from a daily cron job. The following one-liner reads the file line by line, replaces the success_path line with one that has today's date appended (e.g. 28-JAN-2017), and writes the result back with sponge (from moreutils):
jq -Rr 'select(startswith("success_path="))="success_path=/u02/oapp/success/"+(now|strflocaltime("%d-%b-%Y")|ascii_upcase)' /u02/oapp/configuration/oapp.properties | sponge /u02/oapp/configuration/oapp.properties

Camel FTP - FTP consumer for known filenames

Is there a way to request files from an FTP endpoint if the name is known? In our case we want to retrieve files depending on date and time from a folder structure that is huge; listing recursively through the folders takes too long. I know the names and locations of the files to call for in advance (they are calculable from date and time), so scanning is just a waste of time. I'd rather poll for the exact file I want until I have successfully received it.
What is the best approach for this?
Cheers,
Kai
By definition, the Camel file and FTP components only poll directories.
You can use a combination of maxMessagesPerPoll and fileName to achieve your purpose, like:
from("ftp://.../xyz?maxMessagesPerPoll=x&fileName=y");
fileName can be an expression. Take a look at the Camel File2 and FTP2 documentation pages.
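For the date-driven case from the question, a minimal sketch of such a route (Camel Java DSL; the host, credentials, delay and the report-yyyyMMdd.csv naming pattern are placeholders, not details from the original post) could look like this:

import org.apache.camel.builder.RouteBuilder;

public class KnownFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        // fileName accepts an expression, so the consumer asks for exactly one
        // computed file name per poll instead of listing the whole directory tree.
        from("ftp://user@ftp.example.com/data"
                + "?password=secret"
                + "&fileName=report-${date:now:yyyyMMdd}.csv" // resolved on each poll
                + "&maxMessagesPerPoll=1"
                + "&delay=60000")                             // keep polling until the file appears
            .to("file:/tmp/incoming");
    }
}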
I know that to get specific files whose names are known beforehand, you can use a filtering approach.
This is shown with an example in the official documentation, but I'm not sure it saves the time you'd otherwise spend scanning the working directory.
Search for filter on the FTP component page.

Storing information about Hadoop job in file

I am new to Hadoop and wanted to know how to write to a common output file to store metadata about a recently executed job.
Currently, if I am processing files a, b, c and d, I have a custom counter which records the number of files processed, but I also want to know the names of all the files that were processed.
Any comments on the best ways to do it?
Can Distributed Cache help?
Context.setStatus will help. Use it like so (inside your mapper or reducer, where context is the task context):
context.setStatus("Processed " + file);
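To also capture the file names, one option is to read them from the input split in the mapper and report them through the status and counters. A minimal sketch (plain MapReduce Java API; the class, enum-free counter group and output types are made up for illustration, and it assumes a FileInputFormat-based job so the cast to FileSplit is valid):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private String currentFile;

    @Override
    protected void setup(Context context) {
        // The input split tells this map task which file it is reading.
        currentFile = ((FileSplit) context.getInputSplit()).getPath().getName();
        context.setStatus("Processing " + currentFile);
        // A dynamic counter per file name surfaces every processed file in the
        // job counters (one increment per split, so large files count more than once).
        context.getCounter("ProcessedFiles", currentFile).increment(1);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(currentFile), new LongWritable(1));
    }
}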

Dumping the content of the $MFT file

For a commercial project I'm doing, I need to be able to read the actual data stored in the $MFT file.
I found a GPL library that could help, but since it's GPL I can't integrate it into my code.
Could someone please point me to a project that I could use, or point me at the relevant Windows API (something that doesn't require 1000 lines of code to implement)?
BTW, why doesn't Windows simply allow me to read the $MFT file directly anyway (through CreateFile and ReadFile)? If I want to ruin my drive, it's my business, not Microsoft's.
Thanks.
You just have to open a handle to the volume using CreateFile() on \\.\X: where X is the drive letter (check the MSDN documentation on CreateFile(); it mentions this in the Remarks section).
Read the first sector into an NTFS Boot Record structure (you can find it online; search for Richard "Flatcap" Russon. Edit: I found it, http://www.flatcap.org/ntfs/ntfs/files/boot.html). One of the fields in the boot sector structure gives the start location of the MFT in clusters (the LCN of VCN 0 of the $MFT); you have to do a SetFilePointer() to that location and read in multiples of sectors. The first 1024 bytes from that location are the file record of the $MFT. Again, you can parse this structure to find the data attribute, which is always non-resident, and its size is the actual size of the MFT file at that time.
The basic structures for $Boot, File Record and basic attributes (Standard Information, File Name and Data) along with the parsing code should run you less than 1000 lines of code.
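As a rough illustration of the first two steps (opening the volume and pulling the MFT's start location out of the boot sector), here is a minimal sketch in Java. It assumes the process runs with administrator rights, that reads stay sector-aligned, and that 0x0B/0x0D/0x30 are the standard NTFS boot sector field offsets:

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MftLocator {
    public static void main(String[] args) throws Exception {
        // \\.\C: opens the raw volume (requires administrator rights).
        try (RandomAccessFile volume = new RandomAccessFile("\\\\.\\C:", "r")) {
            byte[] boot = new byte[512];
            volume.readFully(boot); // $Boot: the NTFS boot sector

            ByteBuffer b = ByteBuffer.wrap(boot).order(ByteOrder.LITTLE_ENDIAN);
            int bytesPerSector    = b.getShort(0x0B) & 0xFFFF; // BPB: bytes per sector
            int sectorsPerCluster = boot[0x0D] & 0xFF;         // BPB: sectors per cluster
            long mftLcn           = b.getLong(0x30);           // LCN of VCN 0 of $MFT

            long mftOffset = mftLcn * sectorsPerCluster * (long) bytesPerSector;
            System.out.printf("$MFT starts at byte offset %d%n", mftOffset);

            // Next steps: seek to mftOffset, read the 1024-byte FILE record of the
            // $MFT itself, and parse its non-resident DATA attribute run list.
        }
    }
}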
This is not going to be a trivial proposition. You'll likely have to roll your own code solution to accomplish this. You can get some info about the details of the $MFT by checking out http://www.ntfs.com/ntfs-mft.htm
Another option is to spend some time looking through the source code of the open-source project NTFS-3G. You can download the source from http://www.tuxera.com/community/ntfs-3g-download/
Another good project is NTFSProgs: http://en.wikipedia.org/wiki/Ntfsprogs
Good luck.