I'm new to the MapReduce framework. I want to find out the number of files under a specific directory by providing the name of that directory.
For example, suppose we have 3 directories A, B and C, containing 20, 30 and 40 part-r files respectively. I'm interested in writing a Hadoop job that counts the files/records in each directory, i.e. I want output in a .txt file formatted as below:
A is having 20 records
B is having 30 records
C is having 40 records
All these directories are present in HDFS.
The simplest/most native approach is to use the built-in HDFS commands, in this case -count:
hdfs dfs -count /path/to/your/dir >> output.txt
Or if you prefer a mixed approach via Linux commands:
hadoop fs -ls /path/to/your/dir/* | wc -l >> output.txt
Finally the MapReduce version has already been answered here:
How do I count the number of files in HDFS from an MR job?
Code:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// getConf() assumes this snippet runs inside a Configured/Tool implementation.
int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false;
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    ri.next();   // advance the iterator; we only need to count the entries
    count++;
}
System.out.println("The count is: " + count);
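If you want the exact "X is having N records" report from the question, a small script can wrap the same -count command (a rough sketch; the directory paths below are placeholders):
#!/usr/bin/env python
# Rough sketch: format the report from the question ("X is having N records")
# by wrapping the same hdfs dfs -count command shown above.
# The directory paths below are placeholders.
import subprocess

dirs = {'A': '/path/to/A', 'B': '/path/to/B', 'C': '/path/to/C'}
with open('output.txt', 'w') as out:
    for name in sorted(dirs):
        # -count prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
        fields = subprocess.check_output(['hdfs', 'dfs', '-count', dirs[name]],
                                         universal_newlines=True).split()
        out.write('%s is having %s records\n' % (name, fields[1]))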
When I put a file in the local directory (vagrant/flume/test.csv), Flume writes it to HDFS as (/user/inputs/test.csv.1591560702234). I want to know why HDFS adds 1591560702234 and how to remove it!
This is my flume.conf file:
# Flume agent config
a1.sources = r1
a1.sinks = k2
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.basenameHeader = true
a1.sources.r1.spoolDir = /vagrant/flume
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.filePrefix = %{basename}
a1.sinks.k2.hdfs.writeFormat = Text
#a1.sinks.k2.hdfs.fileSuffix =
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.path = /user/inputs/
a1.sinks.k2.rollInterval = 0
a1.sinks.k2.rollSize = 0
a1.sinks.k2.rollCount = 0
a1.sinks.k2.idleTimeout = 0
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k2.channel = c1
Flume appends the event time in epoch milliseconds.
From your example:
select from_unixtime(ceil(1591560702234 / 1000));
+----------------------+--+
| time |
+----------------------+--+
| 2020-06-07 22:11:43 |
+----------------------+--+
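For reference, the same check in plain Python (the SQL output above additionally reflects the server's local time zone and the ceil() rounding):
# The appended number is the event time as epoch milliseconds.
from datetime import datetime
print datetime.utcfromtimestamp(1591560702234 // 1000)
# -> 2020-06-07 20:11:42 (UTC)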
I don't think it's possible to remove the timestamp through Flume configuration alone.
But you could add a suffix with hdfs.fileSuffix.
From the documentation:
hdfs.fileSuffix – Suffix to append to file (eg .avro - NOTE: period is not automatically added)
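For example, with the agent and sink names from the config above (an illustrative line only; the timestamp counter still appears before the suffix, giving names like test.csv.1591560702234.csv):
a1.sinks.k2.hdfs.fileSuffix = .csv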
You could also put more events into a single file by tuning the Flume HDFS sink roll properties; please check:
hdfs.batchSize
hdfs.rollSize
hdfs.rollInterval
hdfs.rollCount
You could also merge directories with HDFS commands.
getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.
-skip-empty-file can be used to avoid unwanted newline characters in case of empty files.
Examples:
hadoop fs -getmerge -nl /src /opt/output.txt
hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt
I am using the Hadoop streaming JAR for WordCount and I want to know how I can get a globally sorted result. According to an answer to another question on SO, using just one reducer should give a globally sorted output, but in my result with numReduceTasks=1 (one reducer) it is not sorted.
For example, my input to mapper is:
file 1: A long time ago in a galaxy far far away
file 2: Another episode for Star Wars
Result is:
A 1
a 1
Star 1
ago 1
for 1
far 2
away 1
time 1
Wars 1
long 1
Another 1
in 1
episode 1
galaxy 1
But this is not a global sort!
So, what is the meaning of the sort in "shuffle and sort", and what does a global sort mean?
mapper code:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
reducer code:
#!/usr/bin/env python
import sys

word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
I use this command to run it:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new_0 \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py \
-numReduceTasks=1
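One likely explanation: with a single reducer the framework does hand keys to the reducer already sorted, but in raw byte order (so capitalized words like "Another" and "Star" come before lowercase ones like "ago"), and the reducer above then stores everything in a Python dict and iterates dict.keys(), whose order is arbitrary, which throws that sort away. A minimal sketch of a streaming reducer that keeps the incoming order, in the same Python 2 style as the code above:
#!/usr/bin/env python
# Sketch: sum counts per word while emitting words in the order they arrive
# from the shuffle; with a single reducer that order is already globally
# sorted (by raw byte order).
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)
This still gives a byte-order sort; a case-insensitive or otherwise customized global order would need something like the streaming comparator options or a TotalOrderPartitioner, which is beyond this sketch.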
I have the following directory structure in HDFS:
/data/current/population/{p_1,p_2}
/data/current/sport
/data/current/weather/{w_1,w_2,w_3}
/data/current/industry
The folders population, sport, weather & industry each correspond to a different dataset. The end-folders, for example p_1 & p_2, pertain to different data sources where available.
I'm working on PySpark code which works on these end-folders, i.e. p_1, p_2, sport, w_1, w_2, w_3 & industry. Given a path like /data/current/ to your code, how do you extract the absolute paths of just the end-folders?
The command hdfs dfs -ls -R /data/current gives the following output
/data/current
/data/current/population
/data/current/population/p_1
/data/current/population/p_2
/data/current/sport
/data/current/weather
/data/current/weather/w_1
/data/current/weather/w_2
/data/current/weather/w_3
/data/current/industry
But I want to end up with only the absolute paths of the end-folders. My output should look like the following:
/data/current/population/p_1
/data/current/population/p_2
/data/current/sport
/data/current/weather/w_1
/data/current/weather/w_2
/data/current/weather/w_3
/data/current/industry
-Thanks in advance
Why don't you write some code using an HDFS client like Snakebite?
I am attaching a Scala function below that does the same. This function takes the root folder path and gives a List of all end paths. You can do the same in Python using Snakebite.
def traverse(path: Path, col: ListBuffer[String]): ListBuffer[String] = {
  val stats = fs.listStatus(path)
  for (stat <- stats) {
    if (stat.isFile()) {
      col += stat.getPath.toString()
    } else {
      val nl = fs.listStatus(stat.getPath)
      if (nl.isEmpty)
        col += stat.getPath.toString()
      else {
        for (n <- nl) {
          if (n.isFile) {
            col += n.getPath.toString()
          } else {
            col ++= traverse(n.getPath, new ListBuffer)
          }
        }
      }
    }
  }
  col
}
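If you would rather not depend on Snakebite, below is a rough Python sketch that extracts the end-folder paths by parsing the output of the hdfs dfs -ls -R command already shown in the question; it assumes the default -ls -R output format and paths without spaces.
#!/usr/bin/env python
# Sketch: find end-folders (directories with no sub-directories) under a root
# by parsing hdfs dfs -ls -R output. Assumes default -ls -R formatting and
# that paths contain no spaces.
import subprocess

def end_folders(root):
    out = subprocess.check_output(['hdfs', 'dfs', '-ls', '-R', root],
                                  universal_newlines=True)
    # Directory entries start with 'd' in the permissions column.
    dirs = [line.split()[-1] for line in out.splitlines() if line.startswith('d')]
    # Keep only directories that no other directory lists as its parent prefix.
    return [d for d in dirs
            if not any(other != d and other.startswith(d + '/') for other in dirs)]

for path in end_folders('/data/current'):
    print path
The resulting list can then be fed to your PySpark code as the set of input paths.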
The HDFS command below might help:
hdfs getconf -confKey fs.defaultFS
I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file and so on, and the last 25 lines should go to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the number of records in the file will be in the millions, and the previous steps that generate the file to be split already use Pig.
I did a bit of digging on this, because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really. In this example I was using the Country sample database offered for download on dev.mysql.com:
grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';
Then if we have a look at the directory in hdfs:
[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r-- 3 hive hdfs 0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r-- 3 hive hdfs 3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r-- 3 hive hdfs 4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r-- 3 hive hdfs 4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
Hope that helps.
You can use some of the Pig features below to achieve your desired result.
SPLIT function http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
MultiStorage class : https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Write custom PIG storage : https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions
You have to provide some condition based on your data.
This could do it, but maybe there is a better option.
A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
My requirement changed a bit, I have to store only the first 25% of the data into one file and the rest to another file. Here is the pig script that worked for me.
ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1 <= $0/4;
rest_file = filter with_max by $1 > $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');
I have two large files. One of them is an info file (about 270MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is in standard FASTQ format (about 27G and 280,000,000 lines) like this:
@ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
@ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier. For each sequence, this part of line 1 is unique:
1101:1416:1801 and 1101:10003:75641
I want to grab line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re

count = 0
with open('info_path') as info, open('grab_path', 'w') as grab:
    for i in info:
        sample = i.strip()
        with gzip.open('fq_path') as fq:
            for j in fq:
                count += 1
                if count % 4 == 1:
                    line = j.strip()
                    m = re.search(sample, j)
                    if m != None:
                        grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                        count = 0
                        break
It works, but because both of these files have millions of lines it's inefficient (after running for one day it had only produced 20,000 lines).
UPDATE (July 6th):
I found that the info file can be read into memory (thanks @tobias_k for the reminder), so I created a dictionary whose keys are the info lines and whose values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part as the key, and if the value is 0 I write out those 4 lines. Here is my code:
import gzip

dic = {}
with open('info_path') as info:
    for i in info:
        sample = i.strip()
        dic[sample] = 0

with gzip.open('fq_path') as fq, open('grab_path', "w") as grab:
    for j in fq:
        if j[:10] == '@ST-E00126':
            line = j.split(':')
            match = line[4] + ':' + line[5] + ':' + line[6][:-2]
            if dic.get(match) == 0:
                grab.writelines(j + fq.next() + fq.next() + fq.next())
This way is much faster; it takes 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first with an external sort. Splitting it into pieces that fit in memory is fine; my trouble is how to keep the next three lines attached to the identifier line while sorting. The answer Google gives is to linearize these four lines into one first, but that alone takes 40 minutes.
Anyway thanks for your help.
You can sort both files by the identifier (the 1101:1416:1801) part. Even if files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching in the meantime. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (none of the files ended)
    if (entry1.id == entry2.id)
        record match
        entry1 = readFromFile1()
        entry2 = readFromFile2()
    else if (entry1.id < entry2.id)
        entry1 = readFromFile1()
    else
        entry2 = readFromFile2()
This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file once.
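A rough Python sketch of that merge is below. It assumes the info file has already been sorted, the FASTQ file has already been sorted as intact four-line records by the same identifier and with the same string ordering, and that the identifier sits in its usual position in the header line; the file names and the identifier extraction are placeholders, not something from the original post.
#!/usr/bin/env python
# Sketch of the merge-like matching described above. Assumptions: both inputs
# are pre-sorted by identifier with the same string ordering, the FASTQ
# records are intact four-line blocks, and the paths are placeholders.

def read_record(fq):
    # Read one four-line FASTQ record and pull out the 1101:1416:1801-style id.
    header = fq.readline()
    if not header:
        return None, None
    rest = [fq.readline() for _ in range(3)]   # sequence, '+', quality
    ident = ':'.join(header.split(':')[4:7]).split()[0]
    return ident, header + ''.join(rest)

with open('info.sorted') as info, open('fq.sorted') as fq, open('grab.out', 'w') as out:
    wanted = info.readline().strip()
    ident, record = read_record(fq)
    while wanted and ident is not None:
        if ident == wanted:
            out.write(record)
            wanted = info.readline().strip()
            ident, record = read_record(fq)
        elif ident < wanted:
            ident, record = read_record(fq)
        else:
            wanted = info.readline().strip()
Since both files are then read sequentially exactly once, the overall cost is dominated by the external sort itself.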