I have a workflow of several .tar files containing multiple csv files. After unpacking them through CompressContent and UnpackContent, I need to collect the attributes from the original .tar files. After the UnpackContent processor, I am currently sending the original files through the AttributesToCSV processor, where I overwrite the content of each file with the attributes in a csv format.
The main problem I am facing is that the file is a .tar, and even if I process it through CompressContent and UnpackContent again, it still remains in a .tar format.
Does anyone know a different way to extract attributes and have them in a csv file?
Thanks
Current Workflow:
CompressContent
UnpackContent
AttributesToCSV (flowfile-content)
I have files with different extensions, some are text files, others are zipped files or images. How can I programmatically add a checksum to the files?
For example, my idea was to add a checksum somewhere in the metadata of the files. I tried doing it with PowerShell, but the properties of the files are read-only. I don't want to create a separate file that contains the checksum of the files. I want the checksum itself to be included somewhere in the file itself or in its metadata.
On Windows, with an NTFS filesystem, you can use Alternate Data Streams.
They act exactly like files, but are hidden and attached to the main file. They are lost when the file is copied to a non-NTFS partition.
Otherwise, you can't just add a checksum to a file (even a short CRC32) without consequences: how would you be SURE that the last N bytes are your checksum, and not the file's data? You'd need to add a header (so even more bytes), etc., and that can break loading of the file. Just think about a simple plain text file with N bytes of binary data added at the end!
I am using goroutines to concurrently download data from S3. For context, I currently have a group of samples. Each sample contains data in the form of a map, with a key representing the name of a file and the value pointing to the path in S3. Each sample has about 10 files that need to be downloaded from S3. I download all of these files in parallel and write to a shared zipfile object (got the mutexes and stuff figured out). I've figured out the concurrency aspect of this problem, but the issue I face is organizing the zipfile object. I was wondering if it is possible to create a subdirectory within a zipfile object. Otherwise, I'm left with a massive zip object of all the data I need that is not really organized in any tangible way. Ideally, I'd be able to create a folder in the zipfile object for each sample and save all the file data to that, but I don't know if that's possible.
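The zip format has no directory tree like a file system does; it just contains a flat list of entries.
Entry names may contain forward slashes (for example sample1/data.csv), so the folders are "virtual": archive tools show them as folders, but they are not recorded the way they are in "real" file systems. An archive may also contain empty entries whose names end in a slash to mark a directory, but these are optional.
So no, you can't create a directory in a zip file the way you would on disk, but prefixing each entry name with the sample's folder name gives you the same organization.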
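A minimal sketch of that naming approach, in Python rather than Go since the idea belongs to the zip format and not to any one library. The sample names, file names and contents are made up for illustration; in the question the bytes would come from the S3 downloads. A lock guards the shared archive, mirroring the mutex mentioned in the question.

```python
import io
import threading
import zipfile

# Hypothetical data: sample name -> {file name: file bytes}.
samples = {
    "sample1": {"reads.csv": b"a,b\n1,2\n", "meta.txt": b"first"},
    "sample2": {"reads.csv": b"a,b\n3,4\n"},
}

buffer = io.BytesIO()
lock = threading.Lock()  # only one thread may write to the archive at a time

with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    def add_file(sample, name, data):
        # The "folder" is only a prefix in the entry name; zip uses "/".
        with lock:
            archive.writestr(f"{sample}/{name}", data)

    threads = [
        threading.Thread(target=add_file, args=(sample, name, data))
        for sample, files in samples.items()
        for name, data in files.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Reopen the archive and list what was written.
with zipfile.ZipFile(buffer) as archive:
    entry_names = sorted(archive.namelist())
print(entry_names)
```

Any zip viewer should then show one folder per sample, even though no directory was ever created explicitly.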
Background--we are trying to read different file types (csv or parquet) into pyspark, and I have the task of writing a program that will determine file type.
It appears that Parquet files are always directories: a Parquet file appears in HDFS as a directory.
We have some CSV files that are also directories, where the file name is the directory name and the directory contains several part files. What processes do this?
Why are some files 'files' and some files 'directories'?
It will depend on what process produced those files. For example, when MapReduce produces output, it always produces a directory and then creates one output file per reducer within that directory. This is done so that each reducer can create its output independently.
Judging from Spark's CSV package, it expects to output to a single file. So perhaps the single-file CSVs are being generated by Spark and the directories by MapReduce.
To be as generic as possible, it may be a good idea to do the following: check if the file in question is a directory. If not, check the extension. If yes, look at the extension of the files inside of the directory. This should work for each of your situations.
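A rough sketch of that check, assuming the paths are on a local or mounted filesystem that Python's os module can see (for HDFS you would need the equivalent calls from a Hadoop client). The function name detect_format is hypothetical. The skipping of marker files such as _SUCCESS and the fallback to the directory's own extension when the part files have none are my additions, not part of the steps above.

```python
import os

def _extension(name):
    """Return the last extension of a name, lowercased and without the dot."""
    return os.path.splitext(name)[1].lstrip(".").lower()

def detect_format(path):
    """Guess the data format ('csv', 'parquet', ...) or return None."""
    if os.path.isdir(path):
        for name in sorted(os.listdir(path)):
            # Skip marker and hidden files such as _SUCCESS or .crc files.
            if name.startswith(("_", ".")):
                continue
            if _extension(name):
                return _extension(name)
    # Plain file, or a directory whose part files carry no extension:
    # use the extension of the path itself (an assumption for that case).
    return _extension(os.path.normpath(path)) or None
```

For example, detect_format("data/events.parquet") would look inside the directory if it is one, and otherwise just read the extension from the name.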
Note that some input formats (e.g. MapReduce input formats) will only accept directories as inputs, and some (e.g. Spark's textFile) will only accept files/globs of files. You need to be aware of what is expected from the libraries you are interacting with.
All the data on your hard drive consists of files and folders. The basic difference between the two is that files store data, while folders store files and other folders.
Hadoop execution engines generally create a directory and write multiple part files as output, based on the number of reducers or executors used.
When you name an output file abc.csv, it doesn't mean that it's a single file with the data. It's just the output location, which MapReduce (generally) interprets as a new directory to create, within which it writes the output files (part files).
In the case of Spark, when you are writing a file (maybe using .saveAsTextFile), it may create only a single file.
Scenario: Vendor will provide raw feed in tar.gz format which contains multiple files in tab delimited format
File Detail:
a) One Hit level data
b) Multiple Lookup files
c) One Header file for (a)
The feed (tar.gz) will be ingested and landed into BDP operational raw.
Query: We would like to load this data from the operational raw area into Pig for a data quality checking process. How can this be achieved? Should the files be extracted in Hadoop for us to use, or are alternatives available? Please advise. Thanks!
Note: Any sample script will be more helpful
Ref : http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions
Extract from Docs :
Handling Compression
Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.
A = load 'myinput.gz';
store A into 'myoutput.gz';
I have a lot of zip files that need to be processed by a C++ library, so I am writing my Hadoop streaming program in C++. The program will read a zip file, unzip it, and process the extracted data.
My problem is that:
My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper, but at least one of the files is partial. Zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use a file list as input and read it from my program, because I want to have the advantage of data locality.
I can accept the contents of multiple zip files per map if Hadoop doesn't split the zip files. I mean exactly 1, 2, or 3 files, not something like 2.3 files. Actually, that would be even better, because my program needs to load a data file of about 800MB to process the unzipped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Rather than depending on the min split size, I would suggest an easier way: gzip your files.
There is a way to compress files using gzip
http://www.gzip.org/
If you are on Linux, you can compress the extracted data with
gzip -r /path/to/data
Once that is done, pass this data as the input to your Hadoop streaming job.