Reading a text file into HBase MapReduce and storing it in an HTable - hadoop

I am new to HBase MapReduce and the Hadoop database. I need to read a raw text file from a MapReduce job and store the retrieved data into an HTable using the HBase MapReduce API.
I have been googling for many days but I am not able to understand the exact flow. Can anyone please provide me with some sample code for reading data from a file?
I need to read data from text/CSV files. I can find some examples of reading data from the command prompt. Which input format can we use to read an XML file, FileInputFormat or something else? Please help me learn the MapReduce API and provide me with simple read and write examples.

You can import your CSV data into HBase using the importtsv and completebulkload tools. importtsv parses the CSV files and writes them out as HFiles on HDFS, and completebulkload loads those files into a specified HTable. You can use both tools from the command line or from Java code. Let me know if sample code or the exact commands would help.
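As a rough illustration, driving the two tools from Java might look like the sketch below. It assumes an HBase release in which ImportTsv and LoadIncrementalHFiles implement the Tool interface; the table name, column mapping and paths are only placeholders, and the target table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.ToolRunner;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("importtsv.separator", ",");                    // parse commas instead of tabs
        conf.set("importtsv.columns", "HBASE_ROW_KEY,cf:col1,cf:col2");
        conf.set("importtsv.bulk.output", "/tmp/bulk_output");   // write HFiles here first

        // Step 1: importtsv parses the CSV and writes HFiles under /tmp/bulk_output
        ToolRunner.run(conf, new ImportTsv(), new String[] { "mytable", "/user/me/input.csv" });

        // Step 2: completebulkload moves the HFiles into the live table
        ToolRunner.run(conf, new LoadIncrementalHFiles(conf),
                new String[] { "/tmp/bulk_output", "mytable" });
    }
}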

Related

How to read Parquet schema in a non-MapReduce Java program

Is there a way to read Parquet column names directly by getting the metadata, without MapReduce? Please give an example. I am using Snappy as the compression codec.
You can use either ParquetFileReader or the existing parquet-tools utility (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools) to read a Parquet file from the command line.
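A minimal sketch of the ParquetFileReader route is below, assuming the older parquet-mr API with the parquet.* package names (later releases use org.apache.parquet.*); the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;
import parquet.schema.MessageType;
import parquet.schema.Type;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("hdfs:///tmp/data.parquet");   // placeholder path

        // Only the footer is read here, so the page compression codec (e.g. Snappy) does not matter
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
        MessageType schema = footer.getFileMetaData().getSchema();

        // Print each top-level column name
        for (Type field : schema.getFields()) {
            System.out.println(field.getName());
        }
    }
}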

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop and everything that comes with them. My overall need is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files. So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created and filled with part files (part-00000, part-00001, ...). You should be able to load this into HBase simply by passing the directory name, since a directory of part files is the standard output layout in Hadoop.
I do recommend you look into saving as Parquet though; Parquet files are much more useful than plain text files.
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("Your hdfs location where you saved your tweets").map(_._1)
This gives you an RDD of the file names (wholeTextFiles returns (filename, content) pairs, so ._1 is the name), on which you can run your operations. I'm a newbie to Hadoop too, but anyway... hope that helps.

Load data into HBase table using HBase MapReduce API

I am very new to HBase and the MapReduce API, and I am confused by the MapReduce concepts. I need to load a text file into an HBase table using the MapReduce API. I googled some examples, but in them I can only find a mapper and no reducer method. I am confused about when to use a mapper and when to use a reducer. I am thinking of it like this:
To write data to HBase we use a mapper; to read data from HBase we use a mapper and a reducer. Can anyone please clear this up for me with a detailed explanation?
I am trying to load data from a text file into an HBase table. I googled and tried some code, but I don't know how to load the text file and read it in the HBase MapReduce API.
I would really be thankful for any help.
With regard to your questions:
The Mapper receives splits of data and emits <key, value> pairs.
The Reducer receives the Mapper's output, grouped by key as <key, set<values>>, and generates <key, value> pairs.
Generally, it will be your Reducer task that writes results (to the filesystem or to HBase), but the Mapper can do that too, and there are MapReduce jobs which don't require a Reducer at all. With regard to reading from HBase, it is the Mapper class that holds the configuration for which table to read from, but there is no rule that the Mapper is a reader and the Reducer a writer. The article "HBase MapReduce Examples" provides good examples of how to read from and write into HBase using MapReduce.
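As an illustrative sketch (separate from the linked article), a map-only job that reads a text file and writes each line into HBase through TableMapReduceUtil could look roughly like this; the table name "mytable", the column family "cf", the assumed "rowkey,value" line layout, and the older 0.9x-style Put.add call are all assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TextToHBase {

    // Map-only job: each input line "rowkey,value" becomes one Put
    static class LineMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));  // older API
            context.write(new ImmutableBytesWritable(Bytes.toBytes(parts[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "text-to-hbase");
        job.setJarByClass(TextToHBase.class);
        job.setMapperClass(LineMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Wire the job's output to the HBase table; no custom Reducer is needed here
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}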
In any case, if what you need is to bulk import some .csv files into HBase, you don't really need to do it with a MapReduce job. You can do it directly with the HBase API. In pseudocode:
table = hbase.createTable(tablename, fields);
foreach (File file : dir) {
    content = readFile(file);
    hbase.insert(table, content);
}
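For completeness, a rough non-pseudocode version of the same loop with the HBase client API might look like the following; it assumes the older HTable constructor, a table "mytable" with a family "cf" that already exists, and "rowkey,value" lines.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectCsvImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");            // table is assumed to exist

        // Walk the local directory given as the first argument and insert every line
        for (File file : new File(args[0]).listFiles()) {
            for (String line : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
                String[] parts = line.split(",", 2);           // assume "rowkey,value" lines
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
                table.put(put);
            }
        }
        table.close();
    }
}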
I wrote an importer of .mbox files into HBase. Take a look at the code; it may give you some ideas.
Once your data is imported into HBase, you then need to code a MapReduce job to operate on that data.
Using HFileOutputFormat with CompleteBulkLoad is the best and fastest way to load data into HBase.
You will find sample code here
Here are a couple of responses of mine that address loading data into HBase.
What is the fastest way to bulk load data into HBase programmatically?
Writing to HBase in MapReduce using MultipleOutputs
EDIT: Adding additional link based on comment
This link might help make the file available for processing.
Import external libraries in an Hadoop MapReduce script

Replace text in an input file with Hadoop MR

I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a CSV file and it is working fine.
Now I have a use case where I need to parse a CSV file and code it with the corresponding category.
ex: "11,abc,xyz,51,61,78", "11,adc,ryz,41,71,38", .............
now this has to be replaced as "1,abc,xyz,5,6,7", "1,adc,ryz,4,7,3", .............
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content of the input file in place. Is this achievable with MR?
Basically, I have not seen any file-handling or file-writing based Hadoop examples anywhere.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system sitting on top of Hadoop that translates your queries into MR jobs.
Using it is not as serious an infrastructure decision as using HBase.
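If you would rather stay in plain MapReduce, the usual pattern is a map-only job that writes the recoded lines to a new output directory, since the input cannot be edited in place. A rough sketch is below; the recoding rule applied to numeric fields is only a placeholder (the question's example suggests something like dividing each numeric field by 10).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecodeCsv {

    static class RecodeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            for (int i = 0; i < fields.length; i++) {
                // recode purely numeric fields; the rule here is a placeholder
                if (fields[i].matches("\\d+")) {
                    fields[i] = String.valueOf(Long.parseLong(fields[i]) / 10);
                }
            }
            context.write(NullWritable.get(), new Text(String.join(",", fields)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recode-csv");
        job.setJarByClass(RecodeCsv.class);
        job.setMapperClass(RecodeMapper.class);
        job.setNumReduceTasks(0);                               // map-only: no reducer
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must be a new directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}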

HDFS: Using HDFS API to append to a SequenceFile

I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.
// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
        keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();
Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.
I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.
I also wanted to avoid running a MapReduce job for this since it has high overhead for the little amount of data I'm adding to the SequenceFile.
Any thoughts or work-arounds? Thanks.
Support for appending to existing SequenceFiles has been added to Apache Hadoop from the 2.6.1 and 2.7.2 releases onwards, via the enhancement JIRA: https://issues.apache.org/jira/browse/HADOOP-7139
For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140
CDH5 users can find the same ability from version CDH 5.7.1 onwards.
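On such a version, the writer can be opened with the appendIfExists option so an existing file is reopened for append instead of being overwritten. A minimal sketch, with a placeholder path and key/value classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileAppender {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs:///tmp/data.seq");            // placeholder path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                // reopen the file for append instead of overwriting it
                SequenceFile.Writer.appendIfExists(true));

        writer.append(new Text("key"), new BytesWritable("value".getBytes("UTF-8")));
        writer.close();
    }
}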
Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.
