SourceTracker 2 'DataFrame' object has no attribute 'ids' error - qiime

Hi, I am trying to generate a barplot with SourceTracker 2, but I keep getting the same error and I don't know what to change.
I ran the following code:
qiime sourcetracker2 gibbs \
--i-feature-table table-dada2.qza \
--m-sample-metadata-file sample-metadata.tsv \
--p-source-sink-column sourcesink \
--p-no-loo \
--p-source-column-value Source \
--p-sink-column-value Sink \
--p-source-category-column Env \
--output-dir qiime2-results/sourcetracker-noloo
It produced four .qza files:
mixing_proportions.qza
mixing_proportion_stds.qza
per_sink_assignments.qza
per_sink_assignments_map.qza
Now I'm trying to generate the barplot with this code:
qiime sourcetracker2 barplot \
--i-proportions qiime2-results/sourcetracker-noloo/mixing_proportions.qza \
--m-sample-metadata-file sample-metadata.tsv \
--o-visualization qiime2-results/sourcetracker-noloo/proportions-barplot.qzv
And I keep getting the same error message:
Plugin error from sourcetracker2:
'DataFrame' object has no attribute 'ids'
It might be a problem with my metadata, but there is an id column in it and it is pretty much like the example given on the plugin's website.
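For reference, one way to check whether QIIME 2 can parse the metadata file, and which column it treats as the sample ID, is to load it with the Python API (a minimal diagnostic sketch, assuming the qiime2 package is available in the same environment; this is not a confirmed fix for the barplot error):

import qiime2

# Load the sample metadata the way the plugins do; this raises a descriptive
# error if the ID column or the TSV header is not what QIIME 2 expects.
md = qiime2.Metadata.load('sample-metadata.tsv')
print(md.id_header)       # e.g. 'sample-id' or '#SampleID'
print(list(md.columns))   # should include 'sourcesink' and 'Env'

If that loads cleanly, the metadata header itself is probably not the culprit.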
My metadata file is shown in the first picture and the plugin's example in the second.
(Picture 1: my metadata)
(Picture 2: example metadata from SourceTracker 2)
Thank you very much for your help ^^

Related

PySpark with io.github.spark-redshift-community: BasicAWSCredentialsProvider not found

I'm trying to load data from my Redshift database using PySpark.
I'm using "io.github.spark-redshift-community" as the connector. It requires a "tempdir" parameter pointing to an S3 location. My code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
findspark.init()
from pyspark.sql import SparkSession  # needed for the builder below

spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", S3_ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", S3_SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3a.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df_read_1 = spark.read \
.format("io.github.spark_redshift_community.spark.redshift") \
.option("url", "jdbc:redshift://IP/DATABASE?user=USER&password=PASS") \
.option("dbtable", "table") \
.option("tempdir", "s3a://url/")\
.option("forward_spark_s3_credentials", "true") \
.load()
But I'm getting an error: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
I've found some sources saying to change BasicAWSCredentialsProvider to SimpleAWSCredentialsProvider, but then I get another error: NoSuchMethodError.
Could someone help me, please?
Is there any problem with the hadoop and aws-java-sdk versions?
Thank you in advance!
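Not a definitive answer, but a sketch of what usually resolves this pair of errors: org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider does not exist in hadoop-aws 3.x (the 3.x class for static key/secret auth is org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider), and a NoSuchMethodError after switching usually points to a mismatch between the hadoop-aws jar and the Hadoop classes bundled with the PySpark install. Assuming Hadoop 3.3.x on both sides, the setup could look like this:

import findspark

findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
# hadoop-aws must match the Hadoop minor version your PySpark build ships with
# (e.g. Spark 3.4.x bundles Hadoop 3.3.4); aws-java-sdk-bundle 1.12.262 is the
# version hadoop-aws 3.3.4 was built against.
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", S3_ACCESS_KEY)   # S3_ACCESS_KEY / S3_SECRET_KEY as in the question
hconf.set("fs.s3a.secret.key", S3_SECRET_KEY)
# SimpleAWSCredentialsProvider is the hadoop-aws 3.x provider for plain key/secret pairs
hconf.set("fs.s3a.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

If the hadoop-aws version differs from the hadoop-common classes already on Spark's classpath, you get exactly the kind of NoSuchMethodError described above, regardless of which credentials provider is configured.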

How to use 'run_glue.py' in HuggingFace to finetune for classification?

Here is all the "documentation" I could find https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification
I honestly don't see how you're supposed to know how to use this thing with the resources I found online unless you made the script yourself.
I want to finetune a RoBERTa on a classification task with my own (.json) dataset and my own checkpoint.
DATA_SET = 'books'
MODEL = 'RoBERTa_small_fr_huggingface'  # I have sentencepiece.bpe.model and pytorch_model.bin, what do I use?
MAX_SENTENCES = '8'  # batch size
LR = '1e-5'
MAX_EPOCH = '5'
NUM_CLASSES = '2'
SEEDS = 3
CUDA_VISIBLE_DEVICES = 0
TASK = 'sst2'  # ??
DATA_PATH = 'data/cls-books-json/'  # with test.json, train.json, valid.json
for SEED in range(SEEDS):
    SAVE_DIR = 'checkpoints/' + TASK + '/' + DATA_SET + '/' + MODEL + '_ms' + str(MAX_SENTENCES) + '_lr' + str(LR) + '_me' + str(MAX_EPOCH) + '/' + str(SEED)
    !(python3 libs/transformers/examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path $MODEL \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --output_dir /tmp/hf)
So far I get this error:
run_glue.py: error: argument --model_name_or_path: expected one argument
But I'm sure it's not the only problem.
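For what it's worth, here is a sketch of how run_glue.py is usually pointed at a custom JSON dataset (the max_seq_length value and the assumption that your JSON has a "label" column are mine, not from the docs): --task_name only selects one of the built-in GLUE datasets, so with your own files you pass --train_file and --validation_file instead. Note also that the cell above defines TASK but interpolates $TASK_NAME, which is never defined, and that MODEL needs to be a directory containing config.json, pytorch_model.bin and the tokenizer files (or a model id on the Hub).

for SEED in range(SEEDS):
    SAVE_DIR = 'checkpoints/' + DATA_SET + '/' + MODEL + '_ms' + str(MAX_SENTENCES) + '_lr' + str(LR) + '_me' + str(MAX_EPOCH) + '/' + str(SEED)
    # --train_file/--validation_file replace --task_name for local, non-GLUE data;
    # each JSON line should hold the text field(s) plus a "label" field.
    !(python3 libs/transformers/examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path $MODEL \
        --train_file {DATA_PATH}train.json \
        --validation_file {DATA_PATH}valid.json \
        --do_train \
        --do_eval \
        --max_seq_length 128 \
        --per_device_train_batch_size $MAX_SENTENCES \
        --learning_rate $LR \
        --num_train_epochs $MAX_EPOCH \
        --seed {SEED} \
        --output_dir {SAVE_DIR})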

gremlin_python query 'object is not callable' error when trying to filter

I have some experience with using gremlin in the console but I'm fairly new to gremlin in python. I have found a query that does what I want it to do in the console but I get a 'GraphTraversal' object is not callable error when I try to convert it to gremlin python. The query merges two vertices with the same specified property into one containing the edges of both.
Here is the adapted query:
g.V().has('id', 12345) \
.fold().filter(count(local).is_(gt(1))).unfold(). \
sideEffect(properties().group("p").by(key).by(value())). \
sideEffect(outE().group("o").by(label).by(project("p","iv").by(valueMap()).by(inV()).fold())). \
sideEffect(inE().group("i").by(label).by(project("p","ov").by(valueMap()).by(outV()).fold())). \
sideEffect(drop()). \
cap("p","o","i").as_("poi"). \
addV("User").as_("u"). \
sideEffect(
select("poi").select("p").unfold().as_("kv"). \
select("u").property(select("kv").select(keys), select("kv").select(values))). \
sideEffect(
select("poi").select("o").unfold().as_("x").select(values). \
unfold().addE(select("x").select(keys)).from_(select("u")).to(select("iv"))). \
sideEffect(
select("poi").select("i").unfold().as_("x").select(values). \
unfold().addE(select("x").select(keys)).from_(select("ov")).to(select("u"))).iterate()
and this is the error I'm getting:
TypeError Traceback (most recent call last)
<ipython-input-165-9ce00a27d167> in <module>
1 g.V().has('id', 12345) \
----> 2 .fold().filter(count(local).is_(gt(1))).unfold(). \
3 sideEffect(properties().group("p").by(key).by(value())). \
4 sideEffect(outE().group("o").by(label).by(project("p","iv").by(valueMap()).by(inV()).fold())). \
5 sideEffect(inE().group("i").by(label).by(project("p","ov").by(valueMap()).by(outV()).fold())). \
TypeError: 'GraphTraversal' object is not callable
I suspect it's an issue with my gremlin_python translation. Any help would be greatly appreciated.
When using Python, certain reserved words conflict with Gremlin steps and need to be suffixed with an underscore. Also certain things like gt are part of the P enum and I prefer to write them out in full. So for the line in question it becomes:
.fold().filter_(__.count(Scope.local).is_(P.gt(1))).unfold().
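To make the rest of the traversal runnable in gremlin_python, the bare tokens also have to come from the right enums. A cut-down sketch of the imports and the opening steps (just to show the mapping, not the full merge traversal; it assumes an already-connected traversal source g):

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import Column, P, Scope, T

# Reserved words get a trailing underscore (filter_, is_, from_, not_, ...), and
# bare Groovy tokens map to enums: local -> Scope.local, gt -> P.gt, label -> T.label,
# key -> T.key, keys/values inside select() -> Column.keys / Column.values.
dupes = g.V().has('id', 12345) \
    .fold().filter_(__.count(Scope.local).is_(P.gt(1))).unfold() \
    .sideEffect(__.properties().group("p").by(T.key).by(__.value())) \
    .cap("p").toList()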

pyspark writeStream: Each Data Frame row in a separate json file

I am using pyspark to read data from a Kafka topic as a streaming dataframe as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col  # used below to parse the Kafka value

spark = SparkSession.builder \
.appName("Spark Structured Streaming from Kafka") \
.getOrCreate()
sdf = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load() \
.select(from_json(col("value").cast("string"), json_schema).alias("parsed_value"))
sdf_ = sdf.select("parsed_value.*")
My goal is to write each of the sdf_ rows as a separate json file.
The following code:
writing_sink = sdf_.writeStream \
.format("json") \
.option("path", "/Desktop/...") \
.option("checkpointLocation", "/Desktop/...") \
.start()
writing_sink.awaitTermination()
will write several rows of the dataframe into the same json file, depending on the size of the micro-batch (or at least that is my hypothesis).
What I need is to tweak the above so that each row of the dataframe is written in a separate json file.
I have also tried using partitionBy('column'), but that does not do exactly what I need either: it creates folders in which the json files may still contain multiple rows (if they share the same id).
Any ideas that could help out here? Thanks in advance.
Found out that the following option does the trick:
.option("maxRecordsPerFile", 1)
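For completeness, a sketch of the writing block with that option in place (paths are the same placeholders as above). maxRecordsPerFile caps how many rows the file sink writes per output file, so setting it to 1 gives one json file per row:

writing_sink = sdf_.writeStream \
    .format("json") \
    .option("maxRecordsPerFile", 1) \
    .option("path", "/Desktop/...") \
    .option("checkpointLocation", "/Desktop/...") \
    .start()

writing_sink.awaitTermination()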

Passing an external yml file to my spark job fails with "Can't construct a java object for tag:yaml.org,2002"

I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my spark job using spark-submit.
I am using the TypeSafe config dependency below to load my property file:
<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.1</version>
</dependency>
In my spark driver class MyDriver.java I am loading the YML file as below
String ymlFilename = args[1].toString();
Optional<QueryEntities> entities = InputYamlProcessor.process(ymlFilename);
All the code, including InputYamlProcessor.java, is here:
https://gist.github.com/BdLearnerr/e4c47c5f1dded951b18844b278ea3441
This works fine locally, but when I run it on the cluster I get this error:
Can't construct a java object for tag:yaml.org,2002:com.snp.yml.QueryEntities; exception=Class not found: com.snp.yml.QueryEntities
in 'reader', line 1, column 1:
entities:
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:444)
at com.snp.yml.InputYamlProcessor.process(InputYamlProcessor.java:62)
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: com.snp.yml.QueryEntities
at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:650)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:331)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:341)
... 12 more
My spark-submit script is:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties column_family_condition.yml
What am I doing wrong here? How can I fix this issue?
Any help is highly appreciated.
Tested: I added the following test method inside the class, just before the line where I get the error above, to check whether the issue really is that the class cannot be found.
public static void printTest() {
QueryEntity e1 = new QueryEntity();
e1.setTableName("tab1");
List<QueryEntity> li = new ArrayList<QueryEntity>();
li.add(e1);
QueryEntities ll = new QueryEntities();
ll.setEntitiesList(li);
ll.getEntitiesList().stream().forEach(e -> logger.error("e1 Name :" + e.getTableName()));
return;
}
Output :
19/09/18 04:40:33 ERROR yml.InputYamlProcessor: e1 Name :tab1
Can't construct a java object for tag:yaml.org,2002:com.snp.helpers.QueryEntities; exception=Class not found: com.snp.helpers.QueryEntities
in 'reader', line 1, column 1:
entitiesList:
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
What is wrong here ?
This turned out to have nothing to do with QueryEntities itself: the YAMLException: Class not found: com.snp.yml.QueryEntities is a SnakeYAML constructor class-loading issue. The fix was to build the Yaml instance with a CustomClassLoaderConstructor, so that SnakeYAML resolves the target class through the application's class loader, instead of the plain Constructor:
import org.yaml.snakeyaml.constructor.CustomClassLoaderConstructor;

// before
// Constructor constructor = new Constructor(com.snp.helpers.QueryEntities.class);
// Yaml yaml = new Yaml(constructor);

// after
Yaml yaml = new Yaml(new CustomClassLoaderConstructor(com.snp.helpers.QueryEntities.class.getClassLoader()));
