Hi, I want to ask something. I've started to learn about Apache Storm. Is it possible for Storm to read a data file from HDFS?
For example: I have a txt data file in the directory /user/hadoop on HDFS. Is it possible for Storm to read that file? Thanks in advance.
I ask because when I try to run the Storm topology I get an error message saying the file does not exist, but when I run it reading the file from my local storage it is successful.
>>> Starting to create Topology ...
---> Read Class: FileReaderSpout , 1387957892009
---> Read Class: Split , 1387958291266
---> Read Class: Identity , 247_Identity_1387957902310_1
---> Read Class: SubString , 1387964607853
---> Read Class: Trim , 1387962789262
---> Read Class: SwitchCase , 1387962333010
---> Read Class: Reference , 1387969791518
File /reff/test.txt .. not exists ..
Of course! Here is an example of how to read a file from HDFS:
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// ...stuff...
FileSystem fs = FileSystem.get(URI.create("hdfs://prodserver"), new Configuration());
String hdfspath = "/apps/storm/conf/config.json";
Path path = new Path(hdfspath);
if (fs.exists(path)) {
    InputStreamReader reader = new InputStreamReader(fs.open(path));
    // do stuff with the reader
} else {
    LOG.info("Does not exist {}", hdfspath);
}
This doesn't use anything specific to Storm, just the Hadoop API (hadoop-common.jar).
The error you're getting looks like it is because your file path is incorrect.
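If you want the spout itself to pull lines from HDFS, the same Hadoop calls can go into its open() method. Below is a minimal sketch, not your FileReaderSpout: the class name HdfsFileReaderSpout, the namenode URI hdfs://namenode:8020 and the path /user/hadoop/test.txt are placeholders, and the imports assume Storm 1.x package names (older releases use backtype.storm instead of org.apache.storm). Newer Storm releases also ship a storm-hdfs module with ready-made HDFS components, which may be worth a look.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class HdfsFileReaderSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Placeholder namenode URI and file path; point these at your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            reader = new BufferedReader(new InputStreamReader(fs.open(new Path("/user/hadoop/test.txt"))));
        } catch (Exception e) {
            throw new RuntimeException("Could not open file on HDFS", e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line));   // one tuple per line
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}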
Related
I have to add the following UDF in Hive:
package com.hadoopbook.hive;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Strip extends UDF {
    private Text result = new Text();

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
This is an example from the book "Hadoop: The Definitive Guide".
I created the .class file of the above Java file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ javac Strip.java
Then I created the jar file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ jar cvf Strip.jar Strip Strip.class
Strip : no such file or directory
added manifest
adding: Strip.class(in = 915) (out= 457)(deflated 50%)
I added the generated jar file to the HDFS directory with:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ hadoop dfs -copyFromLocal /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar /user/hduser/input
I tried to create a UDF using the following command:
hive> create function strip as 'com.hadoopbook.hive.Strip' using jar 'hdfs://localhost/user/hduser/input/Strip.jar';
But I got the following error:
converting to local hdfs://localhost/user/hduser/input/Strip.jar Added
[/tmp/hduser_resources/Strip.jar] to class path Added resources:
[hdfs://localhost/user/hduser/input/Strip.jar] Failed to register
default.strip using class com.hadoopbook.hive.Strip FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
I also tried to create a temporary function.
So I first added the jar file to Hive using:
hive> add jar hdfs://localhost/user/hduser/input/Strip.jar;
converting to local hdfs://localhost/user/hduser/input/Strip.jar
Added [/tmp/hduser_resources/Strip.jar] to class path
Added resources: [hdfs://localhost/user/hduser/input/Strip.jar]
Then I tried to add the temporary function:
hive> create temporary function strip as 'com.hadoopbook.hive.Strip';
But I got the following error:
FAILED: Class com.hadoopbook.hive.Strip not found FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
The jar file was successfully created and added to Hive. Still it is showing that the class was not found.
Can anyone please tell me what is wrong with it?
Yes, using an IDE like Eclipse is easier than making the jar from the CLI.
To create the jar file from the command line you have to follow these steps:
First make the project dirs under the project dir ch17-hive:
bin - will store .class (Strip.class) files
lib - will store required external jars
target - will store jars that you will create
[ch17-hive]$ mkdir bin lib target
[ch17-hive]$ ls
bin lib src target
Copy the required external jars to the ch17-hive/lib dir:
[ch17-hive]$ cp /usr/lib/hive/lib/hive-exec.jar lib/.
[ch17-hive]$ cp /usr/lib/hadoop/hadoop-common.jar lib/.
Now compile the Java source from the dir where your class com.hadoopbook.hive.Strip resides; in your case it's ch17-hive/src/main/java:
[java]$ pwd
/home/cloudera/ch17-hive/src/main/java
[java]$ javac -d ../../../bin -classpath ../../../lib/hive-exec.jar:../../../lib/hadoop-common.jar com/hadoopbook/hive/Strip.java
Create a manifest file as:
[ch17-hive]$ cat MANIFEST.MF
Main-Class: com.hadoopbook.hive.Strip
Class-Path: lib/hadoop-common.jar lib/hive-exec.jar
Create the jar as:
[ch17-hive]$ jar cvfm target/strip.jar MANIFEST.MF -C bin .
added manifest
adding: com/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/Strip.class(in = 915) (out= 456)(deflated 50%)
Now your project structure should look like:
[ch17-hive]$ ls *
MANIFEST.MF
bin:
com
lib:
hadoop-common.jar hive-exec.jar
src:
main
target:
strip.jar
Copy the created jar to HDFS:
hadoop fs -put /home/cloudera/ch17-hive/target/strip.jar /user/cloudera/.
Use it in Hive:
hive> create function strip_new as 'com.hadoopbook.hive.Strip' using jar 'hdfs:/user/cloudera/strip.jar';
converting to local hdfs:/user/cloudera/strip.jar
Added [/tmp/05a13d23-8051-431f-a354-793abac66160_resources/strip.jar] to class path
Added resources: [hdfs:/user/cloudera/strip.jar]
OK
Time taken: 0.071 seconds
hive>
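For a quick sanity check, call the new function from Hive; here dummy is just a placeholder for any existing single-row table (or drop the FROM clause if your Hive version allows SELECT without one). The call should return bee with the surrounding whitespace stripped:
hive> select strip_new('  bee  ') from dummy;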
The file '/home/hadoop/_user_active_score_small' definitely exists. But when I run load data local as below, I get a SemanticException:
hive> load data local inpath '/home/hadoop/_user_active_score_small' overwrite into table user_active_score_tmp ;
FAILED: SemanticException Line 1:24 Invalid path ''/home/hadoop/_user_active_score_small'': No files matching path file:/home/hadoop/_user_active_score_small
But after cp /home/hadoop/_user_active_score_small /home/hadoop/user_active_score_small, running load data again works:
hive> load data local inpath '/home/hadoop/user_active_score_small' overwrite into table user_active_score_tmp ;
Loading data to table user_bg_action.user_active_score_tmp
OK
Time taken: 0.368 seconds
The files' permissions are the same, and they are in the same directory:
-rw-rw-r-- 1 hadoop hadoop 614 Jul  5 13:49 _user_active_score_small
-rw-rw-r-- 1 hadoop hadoop 614 Jul  5 11:48 user_active_score_small
I don't know how this happens. Are file names that start with '_' not allowed by Hive?
Files and directories that start with an underscore _ are considered hidden in MapReduce; that's probably the reason for the observed behavior.
If you look at the FileInputFormat source code you can find this:
protected static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
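To see the effect in isolation, here is a small standalone sketch (not Hive code, just the same filter logic applied to your two file names; org.apache.hadoop.fs.Path comes from hadoop-common):
import org.apache.hadoop.fs.Path;

public class HiddenFileCheck {
    public static void main(String[] args) {
        String[] names = {"_user_active_score_small", "user_active_score_small"};
        for (String n : names) {
            // Same test as FileInputFormat's hiddenFileFilter
            String name = new Path("/home/hadoop/" + n).getName();
            boolean visible = !name.startsWith("_") && !name.startsWith(".");
            System.out.println(name + " -> " + (visible ? "accepted" : "treated as hidden"));
        }
    }
}
Renaming the file (as you did) or dropping the leading underscore is the usual workaround.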
I am trying to read and write a Parquet file as an RDD using Spark. I can't use Spark-Sql-Context in my current application (it needs a Parquet schema as a StructType, which, when I convert from an Avro schema, gives me a cast exception in a few cases).
So I try to save the Parquet file by overloading AvroParquetOutputFormat and sending ParquetOutputFormat to Hadoop, writing it in the following way:
def saveAsParquetFile[T <: IndexedRecord](records: RDD[T], path: String)(implicit m: ClassTag[T]) = {
  val keyedRecords: RDD[(Void, T)] = records.map(record => (null, record))
  spark.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
  val job = Job.getInstance(spark.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, m.runtimeClass.newInstance().asInstanceOf[IndexedRecord].getSchema())
  keyedRecords.saveAsNewAPIHadoopFile(
    path,
    classOf[Void],
    m.runtimeClass.asInstanceOf[Class[T]],
    classOf[ParquetOutputFormat[T]],
    job.getConfiguration
  )
}
This is throwing an error:
Exception in thread "main" java.lang.InstantiationException: org.apache.avro.generic.GenericRecord
I am calling the function as follows:
val file1: RDD[GenericRecord] = sc.parquetFile[GenericRecord]("/home/abc.parquet")
sc.saveAsParquetFile(file1, "/home/abc/")
Are there any algorithms in Apache Spark to find frequent patterns in a text file? I tried the following example but always end up with this error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:
/D:/spark-1.3.1-bin-hadoop2.6/bin/data/mllib/sample_fpgrowth.txt
Can anyone help me solve this problem?
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.textFile("...").map(_.split(" ")).cache()

val fpg = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
Try this:
file:///D:/spark-1.3.1-bin-hadoop2.6/bin/data/mllib/sample_fpgrowth.txt
or
D:/spark-1.3.1-bin-hadoop2.6/bin/data/mllib/sample_fpgrowth.txt
If that does not work, replace / with //.
I assume you are running Spark on Windows. Use a file path like
D:\spark-1.3.1-bin-hadoop2.6\bin\data\mllib\sample_fpgrowth.txt
NOTE: Escape "\" if necessary.
public String getDirs() throws IOException {
    fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/private/tmp/as"), new Path("/test"));
    LocalFileSystem lfs = LocalFileSystem.getLocal(conf);
    // System.out.println(new LocalFileSystem().ge (conf.getLocalPath("/private/tmp/as")));
    System.out.println("Local Path : " + lfs.getFileChecksum(new Path("/private/tmp/as")));
    System.out.println("HDFS PATH : " + fs.getFileChecksum(new Path("/test/as")));
    return "done";
}
Output is
Local Path : null
HDFS PATH : MD5-of-0MD5-of-512CRC32:a575c5e99b2e08605dc7c6723889519c
Not sure why the checksum is null for the local file.
Hadoop relies on the FileSystem to have a checksum ready to match against. It does not generate one on-the-fly.
By default, the LocalFileSystem (or the specific implementation used for file:// paths) does not create/store checksums for all files created through it. You can toggle this behavior via the FileSystem#setWriteChecksum API call, and subsequently retrieving the checksum post-write will then work.
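A minimal sketch of that suggestion (illustrative only: the path is a placeholder, it assumes the Hadoop 2.x FileSystem API, and getFileChecksum may still return null if the underlying implementation does not support it):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class LocalChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem lfs = LocalFileSystem.getLocal(conf);
        // Ask the local filesystem to record checksums for files it writes.
        lfs.setWriteChecksum(true);
        Path p = new Path("/tmp/checksum-demo.txt");   // placeholder path
        // Create the file through the FileSystem API so a checksum is stored alongside it.
        try (java.io.OutputStream out = lfs.create(p, true)) {
            out.write("some bytes".getBytes("UTF-8"));
        }
        FileChecksum checksum = lfs.getFileChecksum(p);
        System.out.println("Local checksum: " + checksum);
    }
}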