I use Spring XD and I have created the following stream:
stream create --name test --definition "time | hdfs --rollover=1B --directory=/xd/test --fileName=test --overwrite=true" --deploy
The stream generates many files. Each file name contains the base name plus an additional number, e.g. test-0.txt, test-1.txt, test-2.txt, etc.
Because I use Spring XD and Hadoop for educational purposes, I want to save free space on my hard drive, so I would like to overwrite the data. Is it possible to remove the number from the file name?
The rollover size of 1B is too small, and it piles up the number of files being created. Set an optimal rollover size, based on the data you process, to control the number of files created.
For more options to control these properties, you can refer here.
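For example, something like the following (1M here is just an illustration; pick a rollover size that suits your data volume, assuming the sink accepts size suffixes the same way it accepted 1B):

stream create --name test --definition "time | hdfs --rollover=1M --directory=/xd/test --fileName=test --overwrite=true" --deploy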
Whenever I use an ADF copy activity with Blob as source/sink, ADF creates an empty file named after the directory of the sink blob.
For instance, if I want to copy from input/file.csv to process/file.csv, the copy happens, but I also get a blob called "process" with size 0 bytes created each time.
Any idea why?
(Screenshots of the copy activity's Source and Sink settings.)
Firstly, I would suggest you optimize your pipeline's copy activity settings.
Since you are copying a single file from one container/folder to another, you can set the source file directly (or via a parameter). The wildcard path expression *.csv is usually used to pick up all files of the same type in a folder.
You can test again and check whether the empty file still appears.
HTH.
This happens if you have an ADLS Gen2 storage account but have not enabled the hierarchical namespace, and you select ADLS Gen2 when defining your Linked Service and Dataset. A quick fix for this is to use Azure Blob Storage when defining the LS and DS.
I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on date, i.e. the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Read each small file and append its contents to the target path as a single partition
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath).
  foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))
This example assumes text file format, but you can just as well read any Spark-supported format, and you can use different formats for the source and the target.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value"), as in the sketch below.
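A minimal sketch of that, assuming a DataFrame df with a date column named event_date (both names are placeholders) and Parquet as an example output format:

df.repartition(1)
  .write
  .partitionBy("event_date")   // one sub-directory (and file) per date value
  .mode("append")
  .parquet(target)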
If you're working with HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular-expression pattern (in s3-dist-cp). So if the date is in the file name, you can group based on that pattern.
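A hedged sketch of such an s3-dist-cp invocation (the paths and date pattern are placeholders; --groupBy takes a regex with a capture group, and files whose names share the same captured value are concatenated):

s3-dist-cp --src hdfs:///data/small-files --dest hdfs:///data/merged --groupBy '.*(2021-06-01).*'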
You can develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame out to the big file in append mode.
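A minimal sketch of that idea, assuming the small files for a given date sit under a date-named directory (all paths are placeholders):

spark.read.text("/data/small-files/2021-06-01/*")   // read only that date's files
  .coalesce(1)                                       // collapse to a single output file
  .write.mode("append")
  .text("/data/merged/2021-06-01")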
What are the differences between a serial file and a Multi file in Ab Initio?
A serial file will contain the data directly.
However, a multifile contains the partitions in which the data is stored. Only when each of the partitions is opened separately is the data in it displayed.
We use 'cat' for serial files and 'm_cat' for multifiles.
However, if 'cat' is used to open a multifile, it will just show the names of the partition files.
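For example (the paths here are hypothetical):

cat   /data/serial/customers.dat    # serial file: prints the records directly
m_cat /mfs/data/customers.dat       # multifile: prints the records from all data partitions
cat   /mfs/data/customers.dat       # only shows the partition files the multifile points to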
Typically, serial and multifile directories differ in that a multifile directory contains a file called .mdir.
The .mdir file lists all the partitions that the multifile system refers to.
Performance-wise, multifiles increase the execution speed of a graph, since the data is processed in parallel, whereas in a serial file the same data is processed in sequence.
Thanks
Arijit
A serial file is just like a normal file-system file, usually kept under serial paths.
A "multifile", on the other hand, is a single logical file divided into multiple files kept under different partitions - DATA PARTITIONS (the number of data partitions is defined by the MFS_DEPTH parameter).
If you want to operate on a multifile, you always do it from its control partition, using the 'm_' commands.
You can 'cd' to where the file resides in the control partition and do a 'cat FILENAME.mfctl', which will show you where the file lies in each of the data partitions.
A serial file is a normal file.
A multifile is a file that is divided into different partitions, and each partition file may be located anywhere the Ab Initio Co>Operating System is installed.
The icon for a multifile has 3 platters.
I defined several streams, using the new partitionPath option so that files end up in per-day directories in Hadoop:
stream create --name XXXX --definition "http --port=8300|hdfs-dataset --format=avro --idleTimeout=100000 --partitionPath=dateFormat('yyyy/MM/dd/')" --deploy
stream create --name YYYY --definition "http --port=8301|hdfs --idleTimeout=100000 --partitionPath=dateFormat('yyyy/MM/dd/')" --deploy
All of the streams were created and deployed, except for XXXX up there:
17:42:49,102 INFO Deployer server.StreamDeploymentListener - Deploying stream Stream{name='XXXX'}
17:42:50,948 INFO Deployer server.StreamDeploymentListener - Deployment status for stream 'XXXX': DeploymentStatus{state=failed,error(s)=java.lang.IllegalArgumentException: Cannot instantiate 'IntegrationConfigurationInitializer': org.springframework.integration.jmx.config.JmxIntegrationConfigurationInitializer}
17:42:50,951 INFO Deployer server.StreamDeploymentListener - Stream Stream{name='XXXX'} deployment attempt complete
Note that its data does get processed and deposited in Avro format. And FWIW, where the other streams put files under /xd/<NAME>/<rest of path>, the hdfs-dataset --format=avro combo results in files going to /xd/<NAME>/string.
I re-defined it w/o the partitionPath option, and the stream deployed.
Do we have a bug here, or am I doing something wrong?
The hdfs-dataset sink is intended for writing serialized POJOs to HDFS. We use the Kite SDK kite-data functionality for this, so take a look at that project for some additional info.
The partitioning expressions for hdfs and hdfs-dataset are different. The hdfs-dataset sink follows the Kite SDK syntax, and you need to specify a field of the POJO where your partition value is stored. For a timestamp (long) field the expression would look like this: dateFormat('timestamp', 'YM', 'yyyyMM'), where timestamp is the name of the field, 'YM' is the prefix that gets added to the partition directory (like YM201411), and 'yyyyMM' is the format you want for the partition value. If you want a year/month/day directory structure for the partition, you could use year('timestamp')/month('timestamp')/day('timestamp'). There is some more coverage in the Kite SDK Partitioned Datasets docs.
For your example it doesn't make much sense to add partitioning since you are persisting a simple String value. If you do add a processor to transform the data to a POJO then partitioning makes more sense and we have some examples in the XD docs.
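For illustration, once a processor has converted the payload to a POJO with a long 'timestamp' field, the stream definition could look something like this (the 'my-pojo-transformer' processor is hypothetical):

stream create --name XXXX --definition "http --port=8300 | my-pojo-transformer | hdfs-dataset --format=avro --idleTimeout=100000 --partitionPath=year('timestamp')/month('timestamp')/day('timestamp')" --deploy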
How to work on a specific part of a CSV file uploaded into HDFS?
I'm new to Hadoop and I have a question: if I export a relational database into CSV files and then upload them into HDFS, how do I work on a specific part (table) of a file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume you are referring to column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job with the input path and the field positions you need. Consider using the default input format (TextInputFormat), so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data, as in the sketch below.
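A rough sketch in Scala (matching the other code in this thread); the ColumnMapper name and the column indices 0 and 2 are illustrative placeholders, not from the original post:

import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Emits only the selected CSV columns (here 0 and 2) from each input line
class ColumnMapper extends Mapper[LongWritable, Text, NullWritable, Text] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, NullWritable, Text]#Context): Unit = {
    val fields = value.toString.split(",")
    if (fields.length > 2) {
      context.write(NullWritable.get(), new Text(s"${fields(0)},${fields(2)}"))
    }
  }
}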
Cascading allows you to get started very quickly with MapReduce. It has a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use BigTable, i.e. convert your database to one big table.