I ran a clustering test on crawled pages (more than 25K docs; personal data set).
I then ran a clusterdump:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt
The output of the cluster dumper shows 25 elements of the form "VL-xxxxx {}":
VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}
How do I interpret this output?
In short: I am looking for the document ids which belong to a particular cluster.
What is the meaning of:
VL-x ?
n=y c=[z:z', ...]
r=[z'':z''', ...]
Does 0:0.017 mean that "0" is the id of a document which belongs to this cluster?
I have already read on the Mahout wiki pages what CL, n, c and r mean, but can someone explain them to me better or point to a resource where they are explained in a bit more detail?
Sorry if I am asking some stupid questions, but I am a newbie with Apache Mahout and am using it as part of my course assignment on clustering.
By default, k-means clustering uses WeightedVector, which does not include the data point name. So you will want to build the sequence files yourself using NamedVector. There is a one-to-one correspondence between the number of sequence files and the map tasks, so if your map capacity is 12, you want to chop your data into 12 pieces when making the sequence files.
NamedVector:
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]);
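A minimal sketch of how such a sequence file might be written, assuming a Text key and a VectorWritable value wrapping the NamedVector (the path, cardinality and docId below are illustrative placeholders, not part of the original answer):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
int cardinality = 10000;      // illustrative vector dimensionality
String docId = "doc-00001";   // illustrative document id

// One writer per chunk; the key is the document name, the value wraps the NamedVector
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("vectors/part-00000"), Text.class, VectorWritable.class);
NamedVector vector = new NamedVector(
        new SequentialAccessSparseVector(cardinality), docId);
// ... set the term weights on the vector here ...
writer.append(new Text(vector.getName()), new VectorWritable(vector));
writer.close();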
Basically, you then need to download the clusteredPoints directory from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster point membership.
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.NamedVector;
public class ClusterOutput {

    /**
     * @param args points folder, membership output file, cluster-size output file
     */
    public static void main(String[] args) {
        try {
            BufferedWriter bw;
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            File pointsFolder = new File(args[0]);
            File[] files = pointsFolder.listFiles();
            bw = new BufferedWriter(new FileWriter(new File(args[1])));
            // cluster id -> number of member points
            HashMap<String, Integer> clusterIds;
            clusterIds = new HashMap<String, Integer>(5000);
            for (File file : files) {
                if (file.getName().indexOf("part-m") < 0)
                    continue;
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                IntWritable key = new IntWritable();
                WeightedVectorWritable value = new WeightedVectorWritable();
                while (reader.next(key, value)) {
                    // the vector name is the document id, the key is the cluster id
                    NamedVector vector = (NamedVector) value.getVector();
                    String vectorName = vector.getName();
                    bw.write(vectorName + "\t" + key.toString() + "\n");
                    if (clusterIds.containsKey(key.toString())) {
                        clusterIds.put(key.toString(), clusterIds.get(key.toString()) + 1);
                    } else {
                        clusterIds.put(key.toString(), 1);
                    }
                }
                bw.flush();
                reader.close();
            }
            bw.flush();
            bw.close();
            // second output file: cluster id and its member count
            bw = new BufferedWriter(new FileWriter(new File(args[2])));
            Set<String> keys = clusterIds.keySet();
            for (String key : keys) {
                bw.write(key + " " + clusterIds.get(key) + "\n");
            }
            bw.flush();
            bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
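For reference, once compiled against the Hadoop and Mahout jars, a hypothetical invocation of the class above would look like this (the first argument is the local copy of the clusteredPoints directory, the second the document-to-cluster mapping output, the third the per-cluster counts):
java -cp <your classpath> ClusterOutput /local/path/to/clusteredPoints membership.txt cluster-sizes.txt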
To complete the answer:
VL-x: the identifier of the cluster (VL indicates the cluster has converged)
n=y: the number of elements in the cluster
c=[z:z', ...]: the centroid of the cluster, where each z is a dimension (term) index and z' its weight; so 0:0.017 refers to dimension 0, not to a document id
r=[z:z', ...]: the radius of the cluster
More info here:
https://mahout.apache.org/users/clustering/cluster-dumper.html
I think you need to read the source code, which you can download from http://mahout.apache.org. VL-24130 is just the identifier of a converged cluster.
You can use mahout clusterdump
https://cwiki.apache.org/MAHOUT/cluster-dumper.html
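If your version of Mahout supports the --pointsDir option, pointing it at the clusteredPoints directory produced by the k-means run will also list the points that belong to each cluster. For example (paths assumed to match the command in the question):
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1 --pointsDir output/clusteredPoints --output clusteranalyze.txt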
Related
I have written the following code, which throws a "Class not found" exception. I'm not sure what I need to do to load data from a CSV file into Spark SQL.
import org.apache.spark.SparkContext
/**
 * Loading sales csv using DataFrame API
 */
object CsvDataInput {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Csv loading example")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1), "header" -> "true"))
    df.printSchema()
    df.registerTempTable("data")
    val aggDF = sqlContext.sql("select * from data")
    println(aggDF.collectAsList())
  }
}
Try replacing this line
import org.apache.spark.SparkContext
with this
import org.apache.spark._
You are importing just part of the library but using classes from outside that part. Also, your import is actually misspelled: it should read org.apache.spark.sql.SQLContext, and you used some other package not related to the code presented.
I am trying to perform a bulk load into HBase. The input to the MapReduce job is an HDFS file (from Hive).
I use the code below in the Tool (Job) class to initiate the bulk loading process:
HFileOutputFormat.configureIncrementalLoad(job, new HTable(config, TABLE_NAME));
In the Mapper, I use the following as the Mapper output:
context.write(new ImmutableBytesWritable(Bytes.toBytes(hbaseTable)), put);
Once the mapper has completed, I perform the actual bulk loading using:
LoadIncrementalHFiles loadFfiles = new LoadIncrementalHFiles(configuration);
HTable hTable = new HTable(configuration, tableName);
loadFfiles.doBulkLoad(new Path(pathToHFile), hTable);
The job runs fine, but once LoadIncrementalHFiles starts, it hangs forever; I had to stop the job after many attempts. However, after a long wait of maybe 30 minutes, I finally got the above error. After extensive searching I found that HBase tries to access the files (HFiles) placed in the output folder, and that folder does not have write or execute permission, hence the error. So an alternative solution is to add file access permissions as below in the Java code before the bulk load is performed.
FileSystem fileSystem = FileSystem.get(config);
fileSystem.setPermission(new Path(outputPath),FsPermission.valueOf("drwxrwxrwx"));
Is this the correct approach as we move from development to production? Also, once I added the above code, I got a similar error for the folder created inside the output folder, this time the column-family folder, which is created dynamically at runtime.
As a temporary workaround, I did as below and was able to move ahead.
fileSystem.setPermission(new Path(outputPath+"/col_fam_folder"),FsPermission.valueOf("drwxrwxrwx"));
Both steps seem to be workarounds, and I need a correct solution to move to production. Thanks in advance.
Try this (it tells the Hadoop client to act as the hadoop user, which may own the HFile output directory):
System.setProperty("HADOOP_USER_NAME", "hadoop");
Secure bulk load seems to be an appropriate answer. This thread explains a sample implementation; the snippet is copied below.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.SecureBulkLoadClient;
import org.apache.hadoop.hbase.security.UserProvider;
import org.apache.hadoop.hbase.security.token.FsDelegationToken;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.security.UserGroupInformation;
String keyTab = "pathtokeytabfile";
String tableName = "tb_name";
String pathToHFile = "/tmp/tmpfiles/";
Configuration configuration = new Configuration();
configuration.set("hbase.zookeeper.quorum","ZK_QUORUM");
configuration.set("hbase.zookeeper"+ ".property.clientPort","2181");
configuration.set("hbase.master","MASTER:60000");
configuration.set("hadoop.security.authentication", "Kerberos");
configuration.set("hbase.security.authentication", "kerberos");
//Obtaining kerberos authentication
UserGroupInformation.setConfiguration(configuration);
UserGroupInformation.loginUserFromKeytab("here keytab", path to the key tab);
HBaseAdmin.checkHBaseAvailable(configuration);
System.out.println("HBase is running!");
HBaseConfiguration.addHbaseResources(configuration);
Connection conn = ConnectionFactory.createConnection(configuration);
Table table = conn.getTable(TableName.valueOf(tableName));
HRegionInfo tbInfo = new HRegionInfo(table.getName());
//path to the HFiles that need to be loaded
Path hfofDir = new Path(pathToHFile);
//acquiring user token for authentication
UserProvider up = UserProvider.instantiate(configuration);
FsDelegationToken fsDelegationToken = new FsDelegationToken(up, "name of the key tab user");
fsDelegationToken.acquireDelegationToken(hfofDir.getFileSystem(configuration));
//preparing for the bulk load
SecureBulkLoadClient secureBulkLoadClient = new SecureBulkLoadClient(table);
String bulkToken = secureBulkLoadClient.prepareBulkLoad(table.getName());
System.out.println(bulkToken);
//creating the family list (list of family names and path to the hfile corresponding to the family name)
final List<Pair<byte[], String>> famPaths = new ArrayList<>();
Pair<byte[], String> p = new Pair<>();
//name of the family
p.setFirst("nameofthefamily".getBytes());
//path to the HFile (HFile are organized in folder with the name of the family)
p.setSecond("/tmp/tmpfiles/INTRO/nameofthefilehere");
famPaths.add(p);
//bulk loading ,using the secure bulk load client
secureBulkLoadClient.bulkLoadHFiles(famPaths, fsDelegationToken.getUserToken(), bulkToken, tbInfo.getStartKey());
System.out.println("Bulk Load Completed..");
I have zip files that I would like to open 'through' Spark. I can open a .gzip file with no problem because of Hadoop's native codec support, but I am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.
There was no solution with Python code, and I recently had to read zips in PySpark. While searching for how to do that I came across this question, so hopefully this will help others.
import zipfile
import io
def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the above code I return a dictionary with each filename in the zip as a key and the data in that file as the value. You can change it however you want to suit your purposes.
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to perfectly open and read the zip file.
JavaPairRDD<Text, Text> rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as the key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could probably replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
#Override
public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
List<Tuple2<String,String>> newList = new ArrayList<Tuple2<String, String>>();
InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
Tuple2 newTuple = new Tuple2(line.split("\\t")[0],line);
newList.add(newTuple);
}
return newList;
}
});
Please try the code below, using the API:
sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class)
I had a similar issue and I solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    Stream.continually(zipInputStream.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { zipEntry => ??? }
  }
This answer only collects the previous knowledge, plus I share my experience.
ZipFileInputFormat
I tried following @Tinku's and @JeffLL's answers and used the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. And I do not know how I would put the com-cotdp-hadoop lib on my production cluster; I am not responsible for the setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer and I struggled for quite some time to actually get the decompressed output.
By the time I was able to do so, I had to prepare all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
Using the API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class), the file name should be passed in via the conf:
conf = new Job().getConfiguration()
conf.set(PROPERTY_NAME from your input formatter, "Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Please find the PROPERTY_NAME from your input formatter for setting the path.
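A minimal sketch of that pattern, assuming sparkContext is a JavaSparkContext and that the ZipFileInputFormat extends Hadoop's FileInputFormat, so the input path can be set through the standard helper instead of a custom property (path and key/value types are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

Job job = Job.getInstance(new Configuration());
// point the input format at the zip file(s) on HDFS
FileInputFormat.addInputPath(job, new Path("hdfs:/Testing/target_file.ZIP"));
JavaPairRDD<Text, Text> rdd = sparkContext.newAPIHadoopRDD(
        job.getConfiguration(), ZipFileInputFormat.class, Text.class, Text.class);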
Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")
I am getting the following error, "astyanax.connectionpool.exceptions.PoolTimeoutException", when trying to use the Astyanax client to connect to Cassandra on an EC2 instance. I need help.
The following is my code snippet.
import org.mortbay.jetty.servlet.Context;
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.model.CqlResult;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;
public class MetadataRS {

    public static void main(String[] args) {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("ClusterName")
                .forKeyspace("KeyspaceName")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.ROUND_ROBIN)
                )
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyConnectionPool")
                        .setPort(9042)
                        .setMaxConnsPerHost(40)
                        .setSeeds("<EC2-IP>:9042")
                        .setConnectTimeout(5000)
                )
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();

        Keyspace keyspace = context.getEntity();
        System.out.println(keyspace);

        ColumnFamily<String, String> CF_USER_INFO =
                new ColumnFamily<String, String>(
                        "Standard1",             // Column Family Name
                        StringSerializer.get(),  // Key Serializer
                        StringSerializer.get()); // Column

        OperationResult<ColumnList<String>> result = null;
        try {
            result = keyspace.prepareQuery(CF_USER_INFO)
                    .getKey("user_id_hash")
                    .execute();
        } catch (ConnectionException e) {
            e.printStackTrace();
        }
        ColumnList<String> columns = result.getResult();

        // Lookup columns in response by name
        String uid = columns.getColumnByName("user_id_hash").getStringValue();
        System.out.println(uid);

        // Or, iterate through the columns
        for (Column<String> c : result.getResult()) {
            System.out.println(c.getName());
        }
    }
}
Error
com.netflix.astyanax.thrift.ThriftKeyspaceImpl@1961f4
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=():9042, latency=5001(5001), attempts=1] Timed out waiting for connection
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:201)
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:158)
at com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:60)
at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:50)
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:229)
at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1.execute(ThriftColumnFamilyQueryImpl.java:180)
at com.rjil.jiodrive.rs.MetadataRS.main(MetadataRS.java:57)
Exception in thread "main" java.lang.NullPointerException
at com.rjil.jiodrive.rs.MetadataRS.main(MetadataRS.java:62)
Since you are running Cassandra on an EC2 instance, check that Cassandra's port number (which you have chosen as 9042) is in the allowed list of the EC2 security group and that you have access to it. If not, add the port number to the inbound rules of the EC2 security group and set the IP range to 0.0.0.0/0.
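For example, with the AWS CLI this could look roughly like the following (the security group id is a placeholder, and opening 0.0.0.0/0 is only advisable for testing):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9042 --cidr 0.0.0.0/0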
Also check that the firewall on the EC2 instance is turned off. By default it is off, but it's good to check anyway.
If you have done all this, then your client might be behind a firewall that blocks outbound traffic to your chosen port (9042).
Lastly, if you have not used an Elastic IP, it is better to use the EC2 instance DNS name, both in your setSeeds section and in the rpc_address of cassandra.yaml.
Your issue is not in your code. You have a connectivity problem to the node you specified as your seed. So either that node isn't running, or you can't reach it from the machine running your client.
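As a quick sanity check from the machine running the client (reusing the placeholder from the question), something like
nc -vz <EC2-IP> 9042
should tell you whether the port is reachable at all.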
I finally upgraded libthrift to 0.9 and changed my code to the following, and it is working fine now.
public Keyspace getDBConnection() {
    if (poolConfig == null) {
        poolConfig = new ConnectionPoolConfigurationImpl("CassandraPool")
                .setPort(port)
                .setMaxConnsPerHost(1)
                .setSeeds(new StringBuilder(seedHost).append(":").append(port).toString())
                // Will re-sort hosts per token partition every 10 seconds
                .setLatencyAwareUpdateInterval(latencyAwareUpdateInterval)
                // Will clear the latency every 10 seconds. In practice I set this to 0,
                // which is the default. It's better to be 0.
                .setLatencyAwareResetInterval(latencyAwareResetInterval)
                // Will sort hosts if a host is more than 100% slower than the best and always
                // assign connections to the fastest host, otherwise will use round robin
                .setLatencyAwareBadnessThreshold(latencyAwareBadnessThreshold)
                // Uses the last 100 latency samples; these samples are in a FIFO queue and
                // will just cycle themselves
                .setLatencyAwareWindowSize(latencyAwareWindowSize)
                .setTimeoutWindow(60000);
    }
    AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster(clusterName)
            .forKeyspace(keyspaceName)
            .withAstyanaxConfiguration(
                    new AstyanaxConfigurationImpl()
                            .setDiscoveryType(NodeDiscoveryType.NONE)
                            .setConnectionPoolType(ConnectionPoolType.ROUND_ROBIN)
                            .setCqlVersion("3.0.0")
                            .setTargetCassandraVersion("2.0"))
            .withConnectionPoolConfiguration(poolConfig)
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
    context.start();
    return context.getClient();
}
Hi, I want to grab a proxy list from the net and search through it to find working proxy IPs and ports. My problem is: once I grab the site, how do I search through it to identify just the IPs and ports and disregard the rest? All I have got so far doesn't work.
How do I identify just the proxy numbers and nothing else? Sorry, I am a newb; any help would be appreciated. :)
package proxytester;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
public class ProxyTester {

    public static void main(String[] args) {
        try {
            URL grab = new URL("http://www.example.com");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(grab.openStream()));
            String input;
            while ((input = in.readLine()) != null) {
                if (input.charAt(0) == 'n') { // the site starts its proxy list with "name", but this line throws an error
                    System.out.println(input);
                } else if (input.charAt(0) == ' ') {
                    System.out.println("empty");
                } else {
                    continue;
                }
            }
            in.close();
        } catch (MalformedURLException aa) {
            System.out.println("site error");
        } catch (IOException e) {
            System.out.println("io error");
        }
    } // end main
} // end class
I would suggest using regular expressions to find an IP address and the port. Here is the regular expression that is needed: java regex matching ip address and port number as captured groups
This article explains how to use regular expressions in Java: http://www.mkyong.com/regular-expressions/how-to-validate-ip-address-with-regular-expression/
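As a rough sketch (a simple illustration, not a strict IP validator and not the exact pattern from the linked question), each line read in the loop above could be scanned like this, where input is the variable from the question's read loop:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches strings that look like "123.45.67.89:8080"; captures IP and port separately
Pattern proxyPattern = Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d{1,5})");
Matcher m = proxyPattern.matcher(input);
while (m.find()) {
    String ip = m.group(1);    // group 1: the IP address
    String port = m.group(2);  // group 2: the port number
    System.out.println(ip + ":" + port);
}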