NiFi - MergeContent - Multiple CSV files - counter

I want to merge 6 CSV files into 1
I use
ListHDFS >> FetchHDFS >> UpdateAttribute >> MergeContent >> QueryRecord >> ...
The ListHDFS >> FetchHDFS >> UpdateAttribute part is repeated once per file to merge (6 times),
because I need to give each file a fragment.index attribute and an alias (used later for the join query in QueryRecord).
The UpdateAttribute configuration for one of the files: [screenshot]
Is there a way to avoid repeating the ListHDFS >> FetchHDFS >> UpdateAttribute chain for every file?
How can I reduce it to a single ListHDFS >> FetchHDFS >> UpdateAttribute and give a different fragment.index to each file, which should be between 0 and 6 (the maximum number of files)?
I tried nextInt() to assign a new fragment.index value, but it is incremental, so it is not suitable for multiple executions.
Thanks in advance.

Please find the solution in this thread:
Link to the solution
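For illustration only (the file names below are made up, and this is not necessarily what the linked solution does): with a single ListHDFS >> FetchHDFS >> UpdateAttribute, one UpdateAttribute property could derive fragment.index from the filename with nested ifElse calls in Expression Language, roughly like this:
fragment.index = ${filename:equals('fileA.csv'):ifElse('0', ${filename:equals('fileB.csv'):ifElse('1', '2')})}
The alias attribute could be set the same way. This is only a sketch and assumes the six file names are known in advance.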

Related

How can I use different CSVs for my JMeter script on different instances

We have 20 workers on AWS and I want to parameterize the CSV file name for each instance. Please help.
I have divided my CSV by the number of load generator hosts:
$ wc -l "youroriginalcsv.csv"   # returns the total number of rows in the CSV
$ split -l $(( $(wc -l < "youroriginalcsv.csv") / 20 )) "youroriginalcsv.csv"   # splits the CSV into chunks named xaa, xab, ...
Transfer one chunk to each of the available hosts:
$ scp xaa host1_user@host1_ip:/csvpath/csvfile.csv
$ scp xab host2_user@host2_ip:/csvpath/csvfile.csv
$ scp xaz hostN_user@hostN_ip:/csvpath/csvfile.csv
Now I want to use a specific file name for a specific host.
What do you mean by "specific file name for specific host"? Your CSV files are all named csvfile.csv so it's sufficient to specify /csvpath/csvfile.csv in the CSV Data Set Config and each JMeter slave will pick up its own file containing partial data from the "big" CSV file.
If you want to use different names for the CSV files depending on the machine IP address or DNS hostname, go for a combination of the If Controller with the __machineName() or __machineIP() function.
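As a rough sketch (the IP address is made up, and wrapping the comparison in __jexl3 is just one common way to write it; the answer above does not spell out the condition), the If Controller condition could look like:
${__jexl3("${__machineIP()}" == "10.0.1.11")}
The CSV Data Set Config under that controller would then point at the file meant for that particular host.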
Also, if you don't want the same data to be re-used by different JMeter slaves, you can consider using the Redis Data Set Config or the HTTP Simple Table Server; this way you won't have to "split" and "copy" CSV files and will be able to manage your test data centrally from a single location.

Hadoop: Using Pig to add text at the end of every line of an HDFS file

We have files in HDFS containing raw logs; each individual log entry is a line, as these logs are line separated.
Our requirement is to add a text (' 12345', for example) at the end of every log line in these files, using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks
AJ
Load the files so that each log entry is loaded into one field, i.e. line:chararray, and use CONCAT to add the text to each line. Store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into its own new file instead of using a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt
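If you want one output file per input file (the per-file variant mentioned in the first answer), a rough shell sketch along the same lines could be the following; the /user/hdfs/logs path and the _tagged suffix are assumptions, not from the thread:
# append " 12345" to every line of each .log file, writing one new file per input
for f in $(hdfs dfs -ls /user/hdfs/logs/*.log | awk '{print $NF}' | grep '\.log$'); do
  hdfs dfs -cat "$f" | sed 's/$/ 12345/' | hdfs dfs -put - "${f%.log}_tagged.log"
done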

How to fill Redis with dummy data weighing hundreds of MB using redis-cli?

I am getting my hands dirty with Redis monitoring. So far I have come up with these metrics that are useful to monitor for Redis:
memory_used
throughput
latency
connections
replication
I am a newbie at this. I am trying to fill Redis from redis-cli with dummy data like this:
for i in `seq 10000000`; do redis-cli SET users:app "{id: '$i', name: 'name$i', address: 'address$i' }" ; done
but it doesn't scale to my need to fill up the Redis DB fast enough...
Also, I need some help regarding latency and throughput monitoring. I know what they mean, but I don't know how to measure them... I don't see anything related to that in the output of redis-cli info.
Thanks for the support/guidance :D
Use the undocumented DEBUG POPULATE command.
DEBUG POPULATE count [prefix] [size]: Create count string keys named key:<num>. If a prefix is specified it's used instead of the key prefix.
The value starts with value:<num> and is filled with null chars if needed until it achieves the given size if specified.
> DEBUG POPULATE 5 test 1000000
OK
> KEYS *
1) "test:3"
2) "test:1"
3) "test:4"
4) "test:2"
5) "test:0"
> STRLEN test:0
(integer) 1000000
> STRLEN test:4
(integer) 1000000
> GETRANGE test:1 0 10
"value:1\x00\x00\x00\x00"
To "fill fast", follow the instructions in the documentation about Mass Insert - the gist is using the --pipe directive on a pre-prepared data file.
Following @leomurillo's answer, I got this to work without the last parameter, and I couldn't find the documentation for this undocumented command :)
127.0.0.1:6379> DEBUG POPULATE 10000000 PHPREDIS_SESSION
OK
(15.61s)
127.0.0.1:6379> dbsize
(integer) 10000334
Using Python
redis-dummy-data-generator.py creates 10,000 key-value pairs:
#!/usr/bin/python
# prints one SET command per line
for i in range(10000):
    print('set name' + str(i) + ' helloworld')
Run the generator script and store the output in the redis_commands.txt file:
python redis-dummy-data-generator.py > redis_commands.txt
Load generated dummy data into redis-server
redis-cli -a mypassword -h localhost -p 6379 < redis_commands.txt

Word Count Hadoop Example

I am running the word count example that comes with Hadoop (version 0.20.3-dev) on a 41 GB file, with the default configuration settings. The code gives correct output for a small file, but it gives some garbage for the 41 GB file. Why is this happening?
Thanks, everybody. It may create wrong output because Hadoop by default does not know your file format; it treats every file as a simple text file.

How to reduce number of output files in Apache Hive

Does anyone know of a tool that can "crunch" the output files of Apache Hadoop into fewer files or one file? Currently I am downloading all the files to a local machine and then concatenating them into one file. So does anyone know of an API or a tool that does the same?
Thanks in advance.
Limiting the number of output files means you want to limit the number of reducers. You could do that with the help of the mapred.reduce.tasks property from the Hive shell. Example:
hive> set mapred.reduce.tasks = 5;
But it might affect the performance of your query. Alternatively, you could use getmerge command from the HDFS shell once you are done with your query. This command takes a source directory and a destination file as input and concatenates files in src into the destination local file.
Usage :
bin/hadoop fs -getmerge <src> <localdst>
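For example (the paths here are hypothetical):
bin/hadoop fs -getmerge /user/hive/warehouse/my_table /tmp/my_table_merged.txt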
HTH
See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038
set hive.merge.mapfiles=true; -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true; -- Merge small files at the end of a map-reduce job.
set hive.merge.size.per.task=???; -- Size (bytes) of merged files at the end of the job.
set hive.merge.smallfiles.avgsize=???; -- File size (bytes) threshold
-- When the average output file size of a job is less than this number,
-- Hive will start an additional map-reduce job to merge the output files
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
