replace text in input file with hadoop MR

I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a CSV file, and it is working fine.
Now I have a use case where I need to parse a CSV file and recode it with the corresponding category.
ex: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
Now this has to be replaced as "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content of the input file in place. Is this achievable with MR?
Basically, I have not seen any Hadoop examples anywhere that deal with file handling or writing.
At this point I do not want to go to HBase or other DB tools.

You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs.
Using it is not as serious an infrastructure decision as adopting HBase.
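If you would rather stay with plain MapReduce, the usual pattern is to read the input, transform each line, and write the result to a new output directory, then point downstream consumers at (or rename to) the new location. A minimal map-only sketch, assuming comma-separated fields; the mod-10 rule from the question is a placeholder for whatever recoding each case needs, and the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: reads each CSV line, recodes the numeric fields,
// and writes the transformed line to a fresh output directory.
public class RecodeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i];
            if (i > 0) out.append(',');
            // Recode purely numeric fields; the rule (mod 10 here) is a placeholder.
            if (f.matches("\\d+")) {
                out.append(Long.parseLong(f) % 10);
            } else {
                out.append(f);
            }
        }
        context.write(NullWritable.get(), new Text(out.toString()));
    }
}

Configure it with job.setNumReduceTasks(0) so mapper output goes straight to the output path; the original input stays untouched, which is as close to "in place" as HDFS allows.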

Related

Small files in hadoop

I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a way to achieve this simply? For example, on day one receive 100 small files and combine them into one file, then on day two add/append more files into the previously created file, etc...
If the files all have the same "schema" (say CSV or JSON), then you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files and write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size (a plain MapReduce version of the same pass-through idea is sketched below).
You've also mentioned Hive, so use an external table for the small files and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
IMO, if possible, the optimal solution is to set up a system "upstream" of Hadoop which batches your smaller files into larger files and then dumps them out to HDFS. Apache NiFi is a useful tool for this purpose.
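For the pass-through approach, here is a minimal map-only sketch (class name and paths are illustrative) that relies on CombineTextInputFormat to pack many tiny files into a few large splits, so each split comes back out as one larger file:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileMerger {
    // Pass lines through unchanged; the NullWritable key keeps output lines clean.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge small files");
        job.setJarByClass(SmallFileMerger.class);
        // Pack many small files into splits of up to ~128 MB each.
        job.setInputFormatClass(CombineTextInputFormat.class);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                // map-only: one output file per split
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // folder of small files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Rather than appending to yesterday's file, which HDFS makes awkward, the simpler pattern is to merge each day's batch into a fresh output folder.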

show hadoop files on HDFS only created on a specific day

I want to show Hadoop files on HDFS under a specific folder which were created on a specific day. Is there a command/option to do this?
Thanks in advance,
Lin
As far as I know, the hadoop command doesn't support this.
You can write a script to achieve it, which is not a great implementation (a minimal sketch follows the links below).
My suggestions:
Organize your files in the way that is most convenient to use; in your case, a time-based partition would be better.
If you want to make data analysis easier, use a database built on HDFS, like Hive. Hive supports partitions and SQL-like queries and inserts.
More about Hive and Hive partitions:
https://hive.apache.org/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
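If the script route is unavoidable, here is a minimal Java sketch against the HDFS FileSystem API (the folder and date arguments are illustrative). It filters on modification time, which is the closest thing HDFS keeps to a creation time:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists files under an HDFS folder whose modification time falls on a given day.
public class ListFilesByDay {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
        String wanted = args[1];   // e.g. "2016-03-01"
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            String fileDay = day.format(new Date(status.getModificationTime()));
            if (fileDay.equals(wanted)) {
                System.out.println(status.getPath());
            }
        }
    }
}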

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts SQL statements into a bunch of MapReduce jobs with intermediate stages in between. Also, Hive is a high-latency system, i.e., depending on your dataset size, you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results of your MapReduce job on your website, it is highly recommended that you export the results back to an RDBMS using Sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, then another option you could think of is a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data out of HDFS, then yes, you can do so via Hive.
However, you'll benefit most from it if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the occurrence count of the word "gloubiboulga". You can achieve this with the following code:
CREATE TABLE data
(
word STRING,
`count` INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, `count` from data where word = "gloubiboulga";
Please note that while this language looks very much like SQL, you'll still have to learn a few things about it.

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using the Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single-node Hadoop cluster using the Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation, you have time-series data. Hadoop with HDFS itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem. It is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce (a small sketch follows below). HBase has built-in support for this: it can be used as the input/output of a MapReduce job.
Basic information you can get from here. For a better understanding, try the HBase: The Definitive Guide / HBase in Action books.
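For the field-splitting step, here is a minimal mapper sketch, assuming the five whitespace-separated fields shown in the question (the class name is illustrative); it re-emits each record tab-separated so the result can later be loaded into a Hive table or HBase:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits each raw log record into its five fields and re-emits them tab-separated.
public class LogParseMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final Text out = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Limit the split to 5 parts so spaces inside the message survive.
        String[] parts = value.toString().split("\\s+", 5);
        if (parts.length == 5) {
            out.set(String.join("\t", parts));
            context.write(NullWritable.get(), out);
        }
        // Malformed lines are dropped here; count them with a Counter in practice.
    }
}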

Can Hadoop MapReduce run over other filesystems?

I heard that the input for MapReduce jobs need not be in HDFS; it can be on another filesystem. Can someone please give me more details on this?
I am a little confused about it. In standalone mode, the data can be on the local filesystem. But in cluster mode, how can we point MapReduce jobs to some other filesystem?
No, it does not need to be in HDFS. For instance, jobs which target HBase using its TableInputFormat pull records over the network from HBase nodes as inputs to their map tasks. The DBInputFormat can be used to pull data from a SQL database into a job. You could also build an input format that does something like read data off of an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if your data sits locally on the nodes where the job is run, since disk throughput > network throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats, and it's not difficult to write a custom InputFormat either, say, to provide a proprietary format as input to a job (a minimal driver sketch follows below).
Along the same lines, Hadoop provides a couple of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat either.
Here is a nice article on the DBInputFormat.
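To make the cluster-mode part concrete: the filesystem a job reads from is determined by the scheme on its input path, so nothing special is needed beyond a fully qualified URI (the paths below are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The scheme on each Path decides which FileSystem serves it, so this
// job reads from a local/NFS mount and writes to HDFS with no other changes.
public class NonHdfsInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read non-HDFS input");
        job.setJarByClass(NonHdfsInputJob.class);
        job.setMapperClass(Mapper.class);        // identity mapper: emits (offset, line)
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // An NFS mount visible at the same path on every node.
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/input"));
        // Results still land in HDFS.
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Any FileSystem implementation on the classpath (file://, s3a://, and so on) resolves the same way, overriding fs.defaultFS for that path.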
Another way to achieve this is to put files into HDFS that contain information about where the real data is. The mapper will read this information and pull the real data for processing.
For example, we can have several files with the URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
