Chef/Ruby: How to modify the same file on different hosts?

I need to modify one file on 20 different nodes. The problem is that the files differ in length, number of rows, and values across those nodes.
So I need to replace those files with a newly created one, but some unique data from the old files should remain.
How should I implement this on the nodes, overwriting the existing file with the new one while keeping that unique data? In this case it is the host name.
I need to do this using Chef and Ruby.
Any ideas?
Thanks.
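One common Chef pattern for this is a template resource: all nodes share one template for the new file, and the node-specific data (here, the host name) is injected from node attributes when the file is rendered. A minimal sketch, assuming a hypothetical target path and template name:

# recipes/default.rb -- minimal sketch; the file path and template name are hypothetical
template '/etc/myapp/config.conf' do
  source 'config.conf.erb'                  # one shared template for all 20 nodes
  variables(host_name: node['hostname'])    # per-node unique data comes from Ohai attributes
  owner 'root'
  group 'root'
  mode '0644'
end

Inside config.conf.erb, the preserved value would then be interpolated per node, e.g. a line like hostname=<%= @host_name %>.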

Related

How do I add multiple CSV files to the catalog in Kedro

I have 4 CSV files in Azure blob storage, with the same metadata, that I want to process. How can I add them to the data catalog under a single name in Kedro?
I checked this question:
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But that approach seems to load all the files in the given folder, whereas my requirement is to read only the given 4 out of the many files in the Azure container.
Example:
I have many files in the Azure container, among which are 4 transaction CSV files named sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro data catalog under one dataset.
For starters, PartitionedDataSet is lazy, meaning that files are not actually loaded until you explicitly call each partition's load function. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can select just the partitions that you actually load and work with.
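For example, a node that receives a PartitionedDataSet input gets a dictionary mapping partition ids to load callables, so only the selected files are ever read. A minimal sketch (the partition-name prefix is a hypothetical filter):

import pandas as pd

def load_selected_partitions(partitions: dict) -> pd.DataFrame:
    # 'partitions' maps partition id -> callable that lazily loads that partition
    wanted = [pid for pid in partitions if pid.startswith("sales_")]  # hypothetical prefix
    frames = [partitions[pid]() for pid in wanted]  # only these files are actually read
    return pd.concat(frames, ignore_index=True)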
Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to just select them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filename_suffix: _file_i_care_about.csv.
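With the example files above, the catalog entry might look like this (the path is an assumption):

my_selected_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/"
  dataset: "pandas.CSVDataSet"
  filename_suffix: "_file_i_care_about.csv"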
I don't think there's a direct way to do this, but you can add another subdirectory inside the blob storage containing just the 4 files and then use:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"
Or, if the requirement of using only 4 files is not going to change anytime soon, you might as well declare the 4 files separately in catalog.yml to avoid over-engineering it.
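For instance (file names, date ranges, and the container path are all hypothetical), the entries could be declared individually:

sales_jan:
  type: "pandas.CSVDataSet"
  filepath: "abfs://container/sales_2021-01-01_2021-01-31.csv"
sales_feb:
  type: "pandas.CSVDataSet"
  filepath: "abfs://container/sales_2021-02-01_2021-02-28.csv"
# ...and likewise for the remaining two files (credentials omitted)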

What is the best place to store multiple small files in Hadoop

I will have multiple small text files, around 10 KB in size, and I am confused about whether to store those files in HBase or in HDFS. Which will be the more efficient storage?
To store them in HBase, I would need to parse each file first and then save it against some row key.
In HDFS, I can directly create a path and save the file at that location.
But everything I have read so far says you should not have many small files, but rather fewer big files.
However, I cannot merge those files, so I can't create big files out of the small ones.
Kindly suggest.
A large number of small files doesn't fit very well with Hadoop, since each file occupies an HDFS block and, by default, each block requires one mapper to process it.
There are several options/strategies to minimize the impact of small files; all of them require processing the small files at least once and "packaging" them in a better format. If you plan to read these files several times, pre-processing them can make sense, but if you will use them only once, it doesn't matter.
To process small files, my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one mapper to process several files. It may require transferring files to the DataNode where the map is running so they can be processed together, and it can perform badly with speculative tasks, but you can disable those if your cluster is stable enough.
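A minimal driver sketch of that setup (not the linked code; the mapper and reducer are left as Hadoop's identity defaults, and the 128 MB split size is an arbitrary example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine small files");
        job.setJarByClass(CombineSmallFilesDriver.class);
        // Pack many small files into splits of up to ~128 MB each,
        // so one mapper handles many files instead of one file each.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // Identity mapper/reducer defaults: output is (offset, line).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}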
Alternative ways to repackage small files are:
1. Create SequenceFiles where each record contains one of the small files. With this option you keep the original files.
2. Use IdentityMapper and IdentityReducer, where the number of reducers is less than the number of files. This is the easiest approach, but it requires that every line in the files be uniform and independent (no headers or metadata at the beginning of a file that are needed to understand the rest of it).
3. Create an external table in Hive and then insert all the records of this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option two and requires the use of Hive; the advantage is that you don't need to write a MapReduce job.
If you cannot merge files as in options 2 or 3, my suggestion is to go with option 1; a sketch follows below.
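For option 1, a minimal sketch that packs every file in a directory into one SequenceFile keyed by file name (paths come from the command line; this is an illustration, not production code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[1])),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(new Path(args[0]))) {
                if (status.isDirectory()) continue;          // skip subdirectories
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(content);                   // each small file fits in memory
                }
                // key = original file name, value = raw file bytes
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}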
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
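For example (paths are hypothetical), everything under a directory can be archived with:

hadoop archive -archiveName small.har -p /user/me/input /user/me/archived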
There is no problem with having many small, different files. If, for example, you have a table in Hive backed by many very small files in HDFS, that is not optimal, and it is better to merge these files into fewer big ones, because a lot of mappers will be created when reading the table. But if your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as they are.

How to output multiple values with the same key in reducer?

I have a bunch of text files which are categorized, and I would like to create a sequence file for each category, in which the key is the category name and the value consists of all the textual content of all the files for that category.
I have a NoSQL database which has only two columns. Each row represents a file: the first column is the category name and the second one is the absolute address of the text file stored on HDFS. My mapper reads the database and outputs pairs in which the key is the category and the value is the file's absolute address. On the reducer side, I have the addresses of all the files for each category, and I would like to create one sequence file per category in which the key is the category name and the value consists of all the textual content of all the files belonging to that category.
A simple solution is to iterate through the pairs (in the reducer), open the files one by one, append their content to a String variable, and at the end create a sequence file using MultipleOutputs. However, as the file sizes may be large, appending the content to a single String may not be possible. Is there any way to do this without using a String variable?
Since you have all the files in the reducer, you can get the content of those files and append it using a StringBuilder to save memory, then discard that StringBuilder reference. If avoiding String is your concern, StringBuilder is a quick fix. The I/O involved in accessing and reading the files is resource-intensive; the data itself, however, should be fine given the architecture of reducers in Hadoop.
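A minimal sketch of that approach, assuming the mapper emits (category, file path) pairs and a named output "seq" has been registered in the driver via MultipleOutputs.addNamedOutput:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;
    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
        mos = new MultipleOutputs<>(context);
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void reduce(Text category, Iterable<Text> paths, Context context)
            throws IOException, InterruptedException {
        StringBuilder content = new StringBuilder();
        for (Text p : paths) {                 // each value is one file's HDFS address
            try (FSDataInputStream in = fs.open(new Path(p.toString()));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
            }
        }
        // one record per category in the "seq" named output (a SequenceFileOutputFormat)
        mos.write("seq", category, new Text(content.toString()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}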
You can also think about using a combiner. However, a combiner is mainly used to reduce the traffic between map and reduce. You could take advantage of it to prepare part of the sequence file at the combiner and the remainder at the reducer level; of course, this is valid only if the content can be appended as it arrives, not in a specific order.

HDFS: How to distribute files of small sizes across nodes?

I have a very large number of small files to be stored in HDFS. Based on the file name, I want to store them on different data nodes. This way I could make file names starting with certain letters go to specific data nodes. How can I do this in Hadoop?
Not a very good choice. Reasons:
Hadoop is not very good at handling a very large number of small files.
Storing one complete file on a single node is against one of the fundamental principles of HDFS: distributed storage.
I would like to know what benefit you would get with this approach.
In response to your comment :
HDFS doesn't do any kind of sorting like HBase does. When you put a file into HDFS, it gets split into small blocks first, and then those get stored (each block on a different node). So there is nothing like sending a whole file to a single node; your file's blocks reside on multiple nodes.
What you could do is create a directory hierarchy as per your needs and store files in those directories (in case your intention is to fetch the files directly based on their location). For example:
/dirA
/dirA/A.txt
/dirA/B.txt
/dirB
/dirB/P.txt
/dirB/Q.txt
/dirC
/dirC/Y.txt
/dirC/Z.txt
But if you really want to send the blocks of a particular file to some specific nodes, then you need to implement your own block placement policy, which is not very easy. See this for more details.

Generate multiple outputs with Hadoop Pig

I've got this file containing a list of data in Hadoop. I've built a simple Pig script which analyzes the file by the id number, and so on...
The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step... however, I haven't understood whether this is possible (maybe there is a custom store module?).
Any idea?
Thanks
Daniele
While keeping in mind what frail said, MultiStorage, in PiggyBank, seems to be what you are looking for.
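A minimal sketch, assuming the id is the first field of the relation and PiggyBank is available at a hypothetical path:

REGISTER /path/to/piggybank.jar;
data = LOAD 'input' USING PigStorage(',') AS (id:chararray, rest:chararray);
-- MultiStorage writes one output directory per distinct value of field 0 (the id)
STORE data INTO '/output/by_id' USING org.apache.pig.piggybank.storage.MultiStorage('/output/by_id', '0');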
To get an output (a file or anything else) you need to assign the data to a variable; that's how it works with STORE. If the ids are limited and finite, you can FILTER them one by one and then STORE them. (I always do that for action types, of which there are about 20-25.)
But if you really need a file for each unique id, then make 2 files: one with the whole data grouped by id, and one with just the unique ids. Then try generating 1 (or more, if you have too many ids) Pig scripts that FILTER BY each id. But it's a bad solution: assuming you grouped 10 ids per Pig script, you would have (unique id count / 10) Pig scripts to run.
Beware that HDFS isn't good at handling too many small files.
Edit:
A better solution would be to GROUP and SORT by unique id into one big file. Then, since it's sorted, you can easily split the contents with a third-party script; a sketch follows below.
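A minimal sketch of that idea (field names are hypothetical):

data = LOAD 'input' USING PigStorage(',') AS (id:chararray, rest:chararray);
sorted = ORDER data BY id;                -- one big, id-sorted output
STORE sorted INTO '/output/sorted_by_id'; -- split per id downstream with an external script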
