data cleaning in hdfs without using hive


Is there an option like hadoop fs -sed? Essentially, I am trying to replace "\" with "something" in my data directly in HDFS, without having to bring the data to the local file system and load it back.
Currently I use getmerge to bring the data to local, clean it, and load it back into HDFS with copyFromLocal. This takes a lot of time. Is there an easier or faster way to do this character replacement?

It's not clear why you'd use Hive for this anyway.
Pig or Spark are far better options, since they don't require an explicit schema for the data.
See the Pig REPLACE function.
In any case, the Hadoop CLI has no sed option.
Another option would be NiFi, but that requires more setup and is overkill for this task.
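
Since the CLI cannot edit files, the replacement has to run as a job that reads the data and writes a cleaned copy to a new HDFS path. As one hedged illustration (a different route from the Pig or Spark approach named above), here is a minimal Python sketch of a map-only Hadoop Streaming mapper. The replacement string and the function names are assumptions made for the example, not something taken from the question.

```python
#!/usr/bin/env python3
"""Map-only cleaning step: replace every backslash in each input line."""
import sys

# Hypothetical replacement text; the question only says "something".
REPLACEMENT = "something"


def clean_line(line, replacement=REPLACEMENT):
    """Return the line with every backslash replaced by `replacement`."""
    return line.replace("\\", replacement)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds one record per line on stdin and
    # collects whatever the mapper writes to stdout.
    for line in stdin:
        stdout.write(clean_line(line))


if __name__ == "__main__":
    main()
```

Because the script only reads stdin and writes stdout, it can be checked locally by piping a small sample file through it before submitting it to the cluster.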

Related

Save and access table-like data structure in hadoop

I want to save and access a table-like data structure in HDFS with MapReduce programming. Part of this data structure is shown in the following picture. It has tens of thousands of columns and hundreds of rows, and all nodes should have access to it.
My question is: how can I save this data structure in HDFS and access it with MapReduce programming? Should I use arrays? (Or Hive tables? Or HBase?)
Thank you.

HDFS is a distributed file system that stores your large files across distributed servers.
You can copy your files from the local system to HDFS using the command
hadoop fs -copyFromLocal /source/local/path destination/hdfs/path
Once the copy has completed, an external Hive table can be created on destination/hdfs/path.
This table can be queried using the Hive shell.

Do consider Hive for this scenario. If you want to do table-style processing as you would with a SAS dataset, an R data frame or data.table, or Python pandas, an equivalent is almost always possible in SQL. Hive provides a powerful SQL abstraction over the MapReduce and Tez engines. If you want to move to Spark at some point, you can read Hive tables into dataframes. As @sumit pointed out, you just need to transfer your data from local to HDFS (using the HDFS copyFromLocal or put command) and define an external Hive table on it.
If you want to write a custom MapReduce job on this data, access the table's underlying files (for a managed table, most likely under /user/hive/warehouse; for an external table, at the location you defined). After reading the data from stdin, parse it in the mapper (the separator can be found using describe extended <hive_table>) and emit it in key-value pair format.
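
The mapper step described above could look roughly like the following Python sketch for Hadoop Streaming. It assumes the table uses Hive's default text field delimiter (Ctrl-A, `\x01`) and that the first column serves as the key; both are assumptions to adjust once `describe extended` shows the real separator, and the function names are illustrative only.

```python
#!/usr/bin/env python3
"""Streaming mapper: split a Hive text row and emit key<TAB>value."""
import sys

# Assumption: Hive's default text-table field delimiter, Ctrl-A.
FIELD_SEP = "\x01"
# Assumption: the first column is used as the key.
KEY_INDEX = 0


def to_key_value(line, sep=FIELD_SEP, key_index=KEY_INDEX):
    """Split one row and return (key, remaining fields joined by sep)."""
    fields = line.rstrip("\n").split(sep)
    key = fields[key_index]
    rest = fields[:key_index] + fields[key_index + 1:]
    return key, sep.join(rest)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        if not line.strip():
            continue  # skip blank lines
        key, value = to_key_value(line)
        # Hadoop Streaming treats text before the first tab as the key.
        stdout.write(f"{key}\t{value}\n")


if __name__ == "__main__":
    main()
```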

Hive/Beeline, how can I set the job .staging directory?

On the cluster I'm working on, every user is given a 60GB Hadoop quota.
Historically, the project I'm working on generates a lot of Hive queries.
To make things run faster, I'm trying to parallelize these queries (which are unrelated), but as a result the directory /user/{myusername}/.staging/ fills up with job_{someid} directories, which in turn are filled with the Hive jars and consume the 60GB very quickly. While I can limit the parallelization factor, I would also like to see whether I can ask Hive to put these jars in a different directory, say /tmp/{myusername}, where I have a lot more space.
Any idea how I can tell Hive/Beeline to create the .staging directory under /tmp/{myusername}?

The easiest way is to set it when you start your beeline session.
beeline --hive.exec.stagingdir=/tmp/{myusername}
I think you can also do it via !set inside beeline, but I don't have the syntax to hand.

The above doesn't work.
We found that the following works:
beeline --hiveconf hive.exec.stagingdir=/tmp/{myusername}

Hdfs and Hbase: how it works?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database, and I use Talend Big Data (an ETL tool) to run my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter out the data I need? Is that right?
Thanks a lot for any help!

HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS. Use it when you need random access to your data.
If you want to store your files on HDFS as they are and query them, you can create an external table on top of them using Hive.

The best option is to use Hive on top of the files that are in HDFS. You can use bucketing and partitioning in Hive to improve performance.

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.

You can use HCatalog or Impala for faster querying.

From your explanation, you have time series data. Hadoop with HDFS alone is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend file system. It is good for random access.
For parsing and rearranging the data, you can use Hadoop's MapReduce. HBase has built-in support for this and can be used as the input or output of a MapReduce job.
You can get basic information from here. For a better understanding, try the books HBase: The Definitive Guide or HBase in Action.
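
The parsing step from the question (separating out the fields of each record) could be sketched as a small Python mapper for Hadoop Streaming. This is only an illustration under stated assumptions: fields are whitespace-separated, the timestamp itself contains no spaces, and everything after the fourth field is the message. The field names mirror the record layout in the question; the function names are made up for the example.

```python
#!/usr/bin/env python3
"""Streaming mapper: split raw log lines into tab-separated fields."""
import sys

# Field order taken from the record layout in the question.
FIELD_NAMES = ("timestamp", "req_id", "level", "module_name", "message")


def parse_record(line):
    """Return a dict of fields, or None if the line has too few fields."""
    # Assumption: whitespace-separated fields and a timestamp without
    # spaces; the fifth part keeps the rest of the line as the message.
    parts = line.rstrip("\n").split(None, 4)
    if len(parts) < len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, parts))


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        record = parse_record(line)
        if record is None:
            continue  # skip malformed lines
        stdout.write("\t".join(record[name] for name in FIELD_NAMES) + "\n")


if __name__ == "__main__":
    main()
```

Writing the output as tab-separated text keeps it easy to expose later through an external table or to load into HBase.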

replace text in input file with hadoop MR

I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a CSV file, and it is working fine.
Now I have a use case where I need to parse a CSV file and recode it with the corresponding category.
ex: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
Now this has to be replaced with "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different mod cases.
The data size is in GBs.
I want to know how to replace the content of the input in place. Is this achievable with MR?
Basically, I have not seen any Hadoop examples based on file handling or writing anywhere.
At this point I do not want to go to HBase or other DB tools.

You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs.
Adopting it is not as serious an infrastructure decision as adopting HBase.
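
If you would rather stay with plain MapReduce than use Hive, the same transformation can be written as a map-only job that reads the input and writes the recoded rows to a new output directory. The Python sketch below, intended for Hadoop Streaming, is a hedged illustration only. It assumes one unquoted CSV record per line. Note that the sample output in the question ("51" becomes "5", "78" becomes "7") matches integer division by 10 rather than a modulo, so that is what the sketch implements; the divisor and function names are assumptions.

```python
#!/usr/bin/env python3
"""Streaming mapper: recode the numeric fields of each CSV line."""
import sys

# Assumption: integer division by 10 reproduces the sample in the question.
DIVISOR = 10


def recode_field(field, divisor=DIVISOR):
    """Integer-divide numeric fields; leave other fields unchanged."""
    if field.isascii() and field.isdigit():
        return str(int(field) // divisor)
    return field


def recode_line(line, divisor=DIVISOR):
    """Recode every field of one comma-separated record."""
    fields = line.rstrip("\n").split(",")
    return ",".join(recode_field(f, divisor) for f in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # The output goes to a new location; the input files stay untouched.
    for line in stdin:
        stdout.write(recode_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Other recoding rules can be handled by swapping out `recode_field`, for example choosing a rule per column index.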
