I'm new to Spark, Hadoop and everything that comes with them. My overall goal is to build a real-time application that collects tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
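saveAsTextFile will create a directory of plain-text part files (part-00000, part-00001, ...), one per partition, rather than a single file. (Sequence files are what saveAsSequenceFile writes.) So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created and filled with those part files. You should be able to load this into Hive simply by passing the directory name instead of an individual file name, since Hadoop tools generally treat a directory of part files as one dataset.
I do recommend you look into saving as Parquet though: it is a columnar format that stores the schema with the data, which is usually much more useful than plain text files.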
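To illustrate what ends up in the output folder, here is a minimal Python sketch that mimics a saveAsTextFile-style directory on the local filesystem and picks out the data files. The helper names (list_part_files, make_demo_output) are hypothetical, and on a real cluster you would list the HDFS directory instead of a local one.

```python
import os
import tempfile


def list_part_files(output_dir):
    """Return the data files inside a saveAsTextFile-style output directory."""
    names = sorted(os.listdir(output_dir))
    # Keep only part-NNNNN files; skip markers such as _SUCCESS.
    return [name for name in names if name.startswith("part-")]


def make_demo_output():
    """Create a local directory that looks like the output of a 3-partition job."""
    demo_dir = tempfile.mkdtemp()
    for name in ["part-00000", "part-00001", "part-00002", "_SUCCESS"]:
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("")
    return demo_dir


if __name__ == "__main__":
    print(list_part_files(make_demo_output()))
```

The point is that the "generated filename" is really a set of part files, so downstream tools are usually given the directory rather than one file.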
From what I understand, you saved your tweets to HDFS and now want the names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("Your hdfs location where you saved your tweets").map(_._1)
wholeTextFiles returns (path, content) pairs, so this gives you an RDD of file paths in filenames, on which you can run your operations. I'm a newbie to Hadoop too, but anyway, hope that helps.
Related
I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a simple way to achieve this? For example, on day one receive 100 small files and combine them into one file, then on day two add/append more files to the previously created file, etc.
If the files all share the same "schema", let's say CSV or JSON, then you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files, then write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
You've also mentioned Hive, so use an external table for the small files, and use a CTAS (CREATE TABLE AS SELECT) query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
IMO, if possible, the optimal solution is to set up a system "upstream" of Hadoop, which will batch your smaller files into larger files, and then dump them out to HDFS. Apache NiFi is a useful tool for this purpose.
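As a rough illustration of the "batch upstream" idea, the following Python sketch groups small files into batches whose combined size stays close to a target. The function name plan_batches and the 128 MB target are assumptions for the example; this is not how NiFi does it, just the general principle.

```python
def plan_batches(files, target_bytes):
    """Group (name, size) pairs into batches whose total size stays near target_bytes."""
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    mb = 1024 * 1024
    small_files = [("a.csv", 60 * mb), ("b.csv", 50 * mb), ("c.csv", 30 * mb)]
    # Assumed target: one 128 MB HDFS block per merged file.
    print(plan_batches(small_files, 128 * mb))
```

Each batch would then be concatenated into one larger file before it is written to HDFS.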
I am very new to Hadoop and I have a requirement to scrub a file that contains account number, name and address details: I need to replace the name and address details with other names and addresses that exist in another file.
I am fine with either MapReduce or Hive.
I need help with this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information and store it in some other location. Verify the output of your job; if it is as you expected, then remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
Hadoop - MapReduce Tutorial.
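The per-record logic of such a mapper-only job could look roughly like the Python sketch below. It assumes, purely for illustration, that both files are CSV with the columns account number, name, address, and that replacements are matched on the account number; the function names are hypothetical.

```python
import csv
import io


def load_replacements(lines):
    """Build {account_no: (name, address)} from the replacement file's CSV lines."""
    table = {}
    for row in csv.reader(lines):
        account_no, name, address = row
        table[account_no] = (name, address)
    return table


def scrub_line(line, replacements):
    """Swap name and address when the account number has a replacement."""
    row = next(csv.reader([line]))
    account_no = row[0]
    if account_no in replacements:
        row[1], row[2] = replacements[account_no]
    out = io.StringIO()
    # Write the row back as CSV without a trailing newline.
    csv.writer(out, lineterminator="").writerow(row)
    return out.getvalue()
```

In a real job the replacement file would be loaded once per mapper and scrub_line applied to every input line, with the results written to a new output location.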
You can also use Hive to accomplish this task:
1. Write a Hive UDF based on your scrubbing logic.
2. Use that UDF on each column of the Hive table you want to scrub, and store the data in a new Hive table.
3. You can then remove the old Hive table.
I've uploaded 50 GB of data to a Hadoop cluster, but now I want to delete the first row of the data file.
It is time-consuming to remove that row manually and then upload the file to HDFS again.
Please reply.
HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can make the change programmatically with an M/R job that does a near-identity transformation, e.g. running a streaming shell script that does sed, but the gist of it is that you need to create new files; HDFS files cannot be edited.
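A near-identity transformation of this kind might be sketched in Python as follows. It assumes the first row is a header whose exact text is known (the value "id,name" in the usage below is made up). Comparing against the known text matters because, with several input splits, every mapper sees its own "first line", and only the one holding the start of the file should drop it.

```python
def drop_header(lines, header):
    """Yield every line except a first line that equals the known header."""
    iterator = iter(lines)
    first = next(iterator, None)
    if first is None:
        return
    # Only skip the first line if it really is the header.
    if first.rstrip("\n") != header:
        yield first
    for line in iterator:
        yield line


if __name__ == "__main__":
    sample = ["id,name\n", "1,a\n", "2,b\n"]
    for kept in drop_header(sample, "id,name"):
        print(kept, end="")
```

With Hadoop Streaming the same function would read from standard input and write to standard output, and the job's output directory becomes the new copy of the data.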
I am new to Pig.
Actually I have a use case in which I have to store data in the same file again and again at regular intervals. But having gone through some tutorials and links, I didn't see anything related to this.
How should I store the data in the same file?
It's impossible. Pig uses Hadoop, and right now there is no "recommended" solution for appending to files.
The other point is that Pig would produce a single file only if one mapper or one reducer has been used at the end of the whole data flow.
You can:
1. Give more info about the problem you are trying to solve
2. Bad solution:
2.1. process data in your Pig script
2.2. load data from the existing file
2.3. union the relations, where the first relation keeps the new data and the second relation keeps the data from the existing file
2.4. store the union result to a new output
2.5. replace the old file with the new one.
3. Good solution:
3.1. Create folder /mydata
3.2. create partitions inside the folder; they can be /yyyy/MM/dd/HH if you process data each hour
3.3. Use globs to read data:
/mydata/*/*/*/*/*
All files from the hour partitions would be read by Pig/Hive/MR or whatever Hadoop tool you use.
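The partition layout from the "good solution" can be sketched in a few lines of Python. The function names are hypothetical; the /mydata root and the yyyy/MM/dd/HH layout come from the answer above.

```python
from datetime import datetime


def partition_path(root, when):
    """Return the hourly partition folder for a timestamp, e.g. /mydata/2013/07/26/14."""
    return "{}/{}".format(root.rstrip("/"), when.strftime("%Y/%m/%d/%H"))


def hourly_glob(root):
    """Glob that matches every file in every hour partition."""
    # Four wildcard levels for yyyy/MM/dd/HH plus one for the files themselves.
    return root.rstrip("/") + "/*/*/*/*/*"


def daily_glob(root, when):
    """Glob that matches every file written during one day."""
    return "{}/{}/*/*".format(root.rstrip("/"), when.strftime("%Y/%m/%d"))


if __name__ == "__main__":
    now = datetime(2013, 7, 26, 14)
    print(partition_path("/mydata", now))
    print(daily_glob("/mydata", now))
```

Each run writes new files into the folder for its own hour, so nothing ever has to be appended to an existing file, and readers narrow or widen the glob to choose the time range.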
Make a date folder like: /abc/hadoop/20130726/
Within it, generate output files named by timestamp, like: /abc/hadoop/20130726/201307265465.gz.
Then use the getmerge command to merge all the data into a single file on the local filesystem
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Hope it helps.
I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a CSV file and it is working fine.
Now I have a use case where I need to parse a CSV file and recode it according to the relevant category.
ex: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
now this has to be replaced as "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content in place for the input. Is this achievable with MR?
Basically I have not seen any file-handling or file-writing Hadoop examples anywhere.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs.
Using it is not as serious an infrastructure decision as using HBase.
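If you would rather stay with plain MapReduce, the per-line recoding from the question could be sketched in Python like this (for example as a Hadoop Streaming mapper). The function names and the choice of numeric columns are assumptions. Note that the sample output in the question (51 becoming 5) matches integer division by 10 rather than a modulo, so the operation is passed in as a parameter.

```python
def recode_line(line, recode, numeric_columns):
    """Apply recode() to the integer fields at numeric_columns; leave the rest unchanged."""
    fields = line.rstrip("\n").split(",")
    for index in numeric_columns:
        fields[index] = str(recode(int(fields[index])))
    return ",".join(fields)


def drop_last_digit(value):
    # Integer division by 10, which is what the sample output shows (51 -> 5).
    return value // 10


def mod_ten(value):
    # A literal "mod of 10", shown for comparison (51 -> 1).
    return value % 10


if __name__ == "__main__":
    print(recode_line("11,abc,xyz,51,61,78", drop_last_digit, [0, 3, 4, 5]))
```

Either way the job writes its result to a new output directory; the input files themselves stay untouched.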