Using parquet tools on files in hdfs - maven

I downloaded and built parquet-1.5.0 from https://github.com/apache/parquet-mr.
I now want to run some commands on my Parquet files in HDFS. I tried this:
cd ~/parquet-mr/parquet-tools/src/main/scripts
./parquet-tools meta hdfs://localhost/my_parquet_file.parquet
and I got:
Error: Could not find or load main class parquet.tools.Main

Download jar
Download the jar from the Maven repo, or any location of your choice; just google it. At the time of this post, you can get parquet-tools from here.
If you're logged in to the Hadoop box:
wget http://central.maven.org/maven2/org/apache/parquet/parquet-tools/1.9.0/parquet-tools-1.9.0.jar
This link might stop working after a while, so get the current link from the Maven repo.
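A quick sanity check after downloading (the jar declares a main class, so it also runs standalone, as the local-file examples further down show):
java -jar parquet-tools-1.9.0.jar --help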
Build jar
If you are unable to download the jar, you can also build it from source. Clone the parquet-mr repo and build the jar:
git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn clean package
Note: you need Maven on your box to build from source.
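After the build finishes, the jar should land under the parquet-tools module's target directory (the version in the file name will vary with the branch you built):
ls parquet-tools/target/parquet-tools-*.jar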
Read parquet file
You can use these commands to view the contents of a Parquet file:
Check schema for s3/hdfs file:
hadoop jar parquet-tools-1.9.0.jar schema s3://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
Head file contents:
hadoop jar parquet-tools-1.9.0.jar head -n5 s3://path/to/file.snappy.parquet
Check contents of local file:
java -jar parquet-tools-1.9.0.jar head -n5 /tmp/path/to/file.snappy.parquet
java -jar parquet-tools-1.9.0.jar schema /tmp/path/to/file.snappy.parquet
More commands:
hadoop jar parquet-tools-1.9.0.jar --help
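The meta subcommand from the original question works the same way (the hdfs path here is the one from the question):
hadoop jar parquet-tools-1.9.0.jar meta hdfs://localhost/my_parquet_file.parquet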

The script is built on the assumption that parquet-tools-<version>.jar is located in a directory called lib next to the script file itself, like so:
$ find -type f
./parquet-tools
./lib/parquet-tools-1.10.1-SNAPSHOT.jar
You can set up such a file layout by issuing the following commands from the root of the parquet-mr git repo (of course many alternative ways and installation locations are possible):
mkdir -p ~/.local/share/parquet-tools/lib
cp parquet-tools/src/main/scripts/parquet-tools ~/.local/share/parquet-tools/
cp parquet-tools/target/parquet-tools-1.5.0.jar ~/.local/share/parquet-tools/lib
After this you can run ~/.local/share/parquet-tools/parquet-tools. (I tested this with version 1.10.1-SNAPSHOT though instead of 1.5.0.)
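If you want to invoke it without the full path, one optional convenience (not required by the script) is to put that directory on your PATH:
export PATH="$HOME/.local/share/parquet-tools:$PATH"
parquet-tools meta hdfs://localhost/my_parquet_file.parquet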

Related

Change tmp directory while running yarn jar command

I am running an MR job using the yarn jar command, and it creates a temporary jar in the /tmp folder which fills up the entire disk space. I want to redirect this jar to some other folder where I have more disk space. From this link, I came to know that we can change the path by setting the property mapred.local.dir for Hadoop 1.x. I am using the following command to run the jar:
yarn jar myjar.jar MyClass myyml.yml arg1 -D mapred.local.dir="/grid/1/uie/facts"
The mapred.local.dir argument above doesn't change the path; the jar is still created in the /tmp folder.
Found a way to avoid writing the unjarred files to the /tmp folder. Apparently this behaviour is not configurable, so instead of 'hadoop jar' or 'yarn jar' (the RunJar utility) we can invoke the main class directly with the generated classpath:
java -cp $(hadoop classpath):my-fat-jar-with-all-dependencies.jar \
    your.app.mainClass
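Applied to the command from the question (jar, class, and arguments taken from there), that looks roughly like:
java -cp "$(hadoop classpath):myjar.jar" MyClass myyml.yml arg1
The -D mapred.local.dir property is dropped because nothing gets unjarred to a local directory in the first place.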

org.apache.flink.api.java.io.jdbc.JDBCInputFormat NOT INSIDE FLINK JARS

I have created a new Java project in eclipse-jee-kepler-SR2-win32-x86_64 and included the jars from flink-0.8.1\lib.
I have created the standard WordCount and it works.
I have modified my WordCount to take input from text files and csv files and it works.
All the imports work perfectly.
Then I tried to import org.apache.flink.api.java.io.jdbc.JDBCInputFormat, and Eclipse doesn't find it.
Why does Eclipse not find the import?
Because inside the jar flink-java-0.8.1.jar there is no directory io/jdbc.
I tried the same thing with flink-0.9.0-bin-hadoop27, and in the jar flink-dist-0.9.0.jar there is no org/apache/flink/api/java/io/jdbc directory. I uncompressed the jar and searched for the string "jdbcinputformat" with 0 results. I searched for the string "jdbc" and it is only mentioned in org/apache/log4j, org/eclipse/jetty, and in other places that are not org.apache.flink.api.java.io.
So my question is: Where do I find the class JDBCInputFormat?
What can I do to access SQL Server 2012 from Flink (apart from accessing it outside Flink, creating CSV files, and then reading them in Flink, which sounds horrible to me since there should be a class specifically for that)?
The corresponding module is not included. In order to use it, you need to build Flink from scratch. Run the following commands:
git clone https://github.com/apache/flink.git
cd flink
mvn -DskipTests clean install
This builds the latest snapshot, flink-0.10-SNAPSHOT. If you want to use the stable version 0.9, run a different git clone command:
git clone -b release-0.9 https://github.com/apache/flink.git
In your current project, you need to change the Flink version used in your pom file accordingly, e.g., 0.10-SNAPSHOT or 0.9-SNAPSHOT.
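A quick way to confirm which snapshot version your pom should reference is to check what mvn clean install actually put into the local repository (standard Maven layout; flink-java is the artifact the question was looking in):
ls ~/.m2/repository/org/apache/flink/flink-java/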

Example Jar in Hadoop release

I am learning Hadoop with the book 'Hadoop in Action' by Chuck Lam. The first chapter says that the Hadoop installation will have an examples jar, and that running 'hadoop jar hadoop-*-examples.jar' will show all the examples. But when I run the command it throws the error 'Could not find or load main class org.apache.hadoop.util.RunJar'. My guess is that the installed Hadoop doesn't have the examples jar. I have installed 'hadoop-2.1.0-beta.tar.gz' on Cygwin on a Win 7 laptop. Please suggest how to get the examples jar.
Run the following command:
hadoop jar PathToYourJarFile wordcount inputPath outputPath
You can find the examples jar file in your Hadoop installation directory.
What I can suggest is that you go to the Hadoop installation directory manually and look for a jar with a name similar to hadoop-examples.jar yourself. Different distributions can have different names for the jar.
If you are in Cygwin, you can also run ls *examples*.jar from the Hadoop installation directory, narrowing the listing down to any jar whose name contains the string 'examples'.
You can then use the jar file name directly:
hadoop jar <exampleJarYourFound.jar>
Hope this takes you to a solution.
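For example, on a Hadoop 2.x tarball install the examples jar typically sits under share/hadoop/mapreduce. A sketch, assuming HADOOP_HOME points at your installation directory:
find "$HADOOP_HOME" -name '*examples*.jar'
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
Running the examples jar without a program name should print the list of available examples.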

Unable to execute Map/Reduce job

I've been trying to figure out how to execute my Map/Reduce job for almost 2 days now. I keep getting a ClassNotFoundException.
I've installed a Hadoop cluster in Ubuntu using Cloudera CDH4.3.0. The .java file (DemoJob.java which is not inside any package) is inside a folder called inputs and all required jar files are inside inputs/lib.
I followed http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_2.html for reference.
I compile the .java file using:
javac -cp "inputs/lib/hadoop-common.jar:inputs/lib/hadoop-map-reduce-core.jar" -d Demo inputs/DemoJob.java
(In the link, it says -cp should be "/usr/lib/hadoop/:/usr/lib/hadoop/client-0.20/". But I don't have those folders in my system at all)
Create jar file using:
jar cvf Demo.jar Demo
Move 2 input files to HDFS
(Now this is where I'm confused. Do I need to move the jar file to HDFS as well? It doesn't say so in the link. But if it is not in HDFS, then how does the hadoop jar .. command work? I mean, how does it combine the jar file, which is on the Linux filesystem, with the input files, which are in HDFS?)
I run my code using:
hadoop jar Demo.jar DemoJob /Inputs/Text1.txt /Inputs/Text2.txt /Outputs
I keep getting ClassNotFoundException : DemoJob.
Somebody please help.
The ClassNotFoundException only means that some class wasn't found when the class DemoJob was loaded. The missing class could have been a class referenced (imported, for example) by DemoJob. I think the problem is that you don't have the /usr/lib/hadoop/:/usr/lib/hadoop/client-0.20/ folders (classes) on your classpath. The classes that should be there but aren't are probably what triggers the ClassNotFoundException.
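If you don't have the folders from the tutorial on your system, one workaround (assuming the hadoop command is on your PATH) is to let it generate the classpath at compile time:
javac -cp "$(hadoop classpath)" -d Demo inputs/DemoJob.java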
Finally figured out what the problem was. Instead of creating a jar file from a folder, I directly created the jar file from the .class files using jar -cvf Demo.jar *.class
This resolved the ClassNotFound error. But I don't understand why it was not working earlier. Even when I created the jar file from a folder, I did mention the folder name when executing the class file, as in: hadoop jar Demo.jar Demo.DemoJob /Inputs/Text1.txt /Inputs/Text2.txt /Outputs
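A likely explanation, based on the commands shown (an inference, not something confirmed in the original post): DemoJob has no package statement, so the class loader expects DemoJob.class at the root of the jar, but jar cvf Demo.jar Demo records the entry as Demo/DemoJob.class, which would only match a class declared in package Demo. You can check the entry paths with jar tf:
jar tf Demo.jar
# jar built from the folder:  Demo/DemoJob.class  (not loadable as "DemoJob")
# jar built from *.class:     DemoJob.class       (loadable as "DemoJob")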

How to run Mahout in Action example ReutersToSparseVectors?

I want to run "ReutersToSparseVectors.java". I can compile it and create a JAR file without problems.
I compiled this file with the command below:
javac -classpath hadoop-core-0.20.205.0.jar:lucene-core-3.6.0.jar:mahout-core-0.7.jar:mahout-math-0.7.jar ReutersToSparseVectors.java
and created the JAR file with the command below:
jar cvf ReutersToSparseVectors.jar ReutersToSparseVectors.class
When I run java -jar ReutersToSparseVectors.jar, it gives me the error below:
Failed to load Main-Class manifest attribute from
ReutersToSparseVectors.jar
Can you help me solve this problem?
If this example can run with Hadoop, please tell me how to run it with Hadoop.
Instead of using the -jar option, it's better to run:
java -cp mahout-core.jar:... mia.clustering.ch09.ReutersToSparseVectors
Or you can use the mvn exec:java command, as described in the README for the examples:
mvn exec:java -Dexec.mainClass="mia.clustering.ch09.ReutersToSparseVectors"
Or you can run this file directly from your IDE (assuming that you correctly imported the Maven project).
P.S. Your command isn't working because, to run with the -jar switch, the .jar file needs a special entry in its manifest (Main-Class) that declares which class should be started by default.
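If you do want java -jar to work, a minimal sketch is to set the Main-Class manifest entry with jar's e option (this assumes the class is declared in package mia.clustering.ch09 and was compiled with -d . so the .class file sits under a matching directory; also note that -jar ignores -cp, so dependency jars would need Class-Path manifest entries, which is why the -cp invocation above is simpler):
jar cvfe ReutersToSparseVectors.jar mia.clustering.ch09.ReutersToSparseVectors mia/clustering/ch09/ReutersToSparseVectors.class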
P.P.S. It's better to use the book's examples with Mahout 0.7, as they were tested with it. If you need version 0.7, take the source code from the mahout-0.7 branch of the repository with the examples (link is above).
