How to use 'run_glue.py' in HuggingFace to finetune for classification? - huggingface-transformers

Here is all the "documentation" I could find: https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification
I honestly don't see how you're supposed to figure out how to use this script from the resources I found online, unless you wrote it yourself.
I want to fine-tune a RoBERTa model on a classification task with my own (.json) dataset and my own checkpoint.
DATA_SET = 'books'
MODEL = 'RoBERTa_small_fr_huggingface'  # I have sentencepiece.bpe.model and pytorch_model.bin what do I use?
MAX_SENTENCES = '8'  # batch size
LR = '1e-5'
MAX_EPOCH = '5'
NUM_CLASSES = '2'
SEEDS = 3
CUDA_VISIBLE_DEVICES = 0
TASK = 'sst2'  # ??
DATA_PATH = 'data/cls-books-json/'  # with test.json, train.json, valid.json
for SEED in range(SEEDS):
    SAVE_DIR = 'checkpoints/' + TASK + '/' + DATA_SET + '/' + MODEL + '_ms' + str(MAX_SENTENCES) + '_lr' + str(LR) + '_me' + str(MAX_EPOCH) + '/' + str(SEED)
    !(python3 libs/transformers/examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path $MODEL \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --output_dir /tmp/hf)
So far I get this error:
run_glue.py: error: argument --model_name_or_path: expected one argument
But I'm sure it's not the only problem.
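For what it's worth, here is a sketch of how the call inside the loop might look for a custom JSON dataset, reusing the variables defined above. It assumes the --train_file/--validation_file options that the run_glue.py example script exposes for csv/json data (in which case --task_name is omitted); exact flag names can differ between transformers versions, so check python3 run_glue.py --help for your checkout.
import subprocess

# Hypothetical invocation: the model path must point at a Hugging Face-format checkpoint
# directory (config.json plus tokenizer files, not just pytorch_model.bin).
cmd = [
    "python3", "libs/transformers/examples/pytorch/text-classification/run_glue.py",
    "--model_name_or_path", MODEL,
    "--train_file", DATA_PATH + "train.json",
    "--validation_file", DATA_PATH + "valid.json",
    "--do_train",
    "--do_eval",
    "--max_seq_length", "128",
    "--per_device_train_batch_size", MAX_SENTENCES,
    "--learning_rate", LR,
    "--num_train_epochs", MAX_EPOCH,
    "--seed", str(SEED),
    "--output_dir", SAVE_DIR,
]
subprocess.run(cmd, check=True)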

Related

gremlin_python query 'object is not callable' error when trying to filter

I have some experience with using gremlin in the console but I'm fairly new to gremlin in python. I have found a query that does what I want it to do in the console but I get a 'GraphTraversal' object is not callable error when I try to convert it to gremlin python. The query merges two vertices with the same specified property into one containing the edges of both.
Here is the adapted query:
g.V().has('id', 12345) \
.fold().filter(count(local).is_(gt(1))).unfold(). \
sideEffect(properties().group("p").by(key).by(value())). \
sideEffect(outE().group("o").by(label).by(project("p","iv").by(valueMap()).by(inV()).fold())). \
sideEffect(inE().group("i").by(label).by(project("p","ov").by(valueMap()).by(outV()).fold())). \
sideEffect(drop()). \
cap("p","o","i").as_("poi"). \
addV("User").as_("u"). \
sideEffect(
select("poi").select("p").unfold().as_("kv"). \
select("u").property(select("kv").select(keys), select("kv").select(values))). \
sideEffect(
select("poi").select("o").unfold().as_("x").select(values). \
unfold().addE(select("x").select(keys)).from_(select("u")).to(select("iv"))). \
sideEffect(
select("poi").select("i").unfold().as_("x").select(values). \
unfold().addE(select("x").select(keys)).from_(select("ov")).to(select("u"))).iterate()
and this is the error I'm getting:
TypeError Traceback (most recent call last)
<ipython-input-165-9ce00a27d167> in <module>
1 g.V().has('id', 12345) \
----> 2 .fold().filter(count(local).is_(gt(1))).unfold(). \
3 sideEffect(properties().group("p").by(key).by(value())). \
4 sideEffect(outE().group("o").by(label).by(project("p","iv").by(valueMap()).by(inV()).fold())). \
5 sideEffect(inE().group("i").by(label).by(project("p","ov").by(valueMap()).by(outV()).fold())). \
TypeError: 'GraphTraversal' object is not callable
I suspect it's an issue with my gremlin_python translation. Any help would be greatly appreciated.
When using Python, certain reserved words conflict with Gremlin steps and need to be suffixed with an underscore. Also, certain things like gt are part of the P enum, and I prefer to write them out in full. So the line in question becomes:
.fold().filter_(__.count(Scope.local).is_(P.gt(1))).unfold().
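For completeness, a minimal sketch of the corrected opening of the traversal with the imports it needs (assuming g is an existing GraphTraversalSource); the remaining steps follow the same pattern: suffix Python-reserved names (as_, is_, from_, filter_), write anonymous steps as __.step(), and use Column.keys / Column.values where the console query used bare keys / values.
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P, Scope, Column

# g is assumed to be an existing GraphTraversalSource
t = (
    g.V().has('id', 12345)
    .fold()
    # keep only the case where more than one vertex shares the property
    .filter_(__.count(Scope.local).is_(P.gt(1)))
    .unfold()
    # collect the properties into the side effect "p"
    .sideEffect(__.properties().group('p').by(__.key()).by(__.value()))
)
# ... continue with the outE/inE sideEffect steps in the same style,
# using Column.keys / Column.values in the select() calls, then .iterate()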

SourceTracker 2 "'DataFrame' object has no attribute 'ids'" error

Hi, I am trying to generate a barplot with SourceTracker 2, but I keep getting the same error and I don't know what to change.
I ran the following code:
qiime sourcetracker2 gibbs \
--i-feature-table table-dada2.qza \
--m-sample-metadata-file sample-metadata.tsv \
--p-source-sink-column sourcesink \
--p-no-loo \
--p-source-column-value Source \
--p-sink-column-value Sink \
--p-source-category-column Env \
--output-dir qiime2-results/sourcetracker-noloo
It produced 4 .qza files:
mixing_proportions.qza
mixing_proportion_stds.qza
per_sink_assignments.qza
per_sink_assignments_map.qza
Now I'm trying to generate the barplot with this code:
qiime sourcetracker2 barplot \
--i-proportions qiime2-results/sourcetracker-noloo/mixing_proportions.qza \
--m-sample-metadata-file sample-metadata.tsv \
--o-visualization qiime2-results/sourcetracker-noloo/proportions-barplot.qzv
And I keep getting the same error message:
Plugin error from sourcetracker2:
'DataFrame' object has no attribute 'ids'
It might be a problem with my metadata, but there is an id column in it and it is pretty much like the example given on the plugin's website.
My metadata file is the 1st picture and the example is the 2nd picture.
Thank you very much for your help ^^
My metadata
example of metadata by sourcetracker2
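Since the error points at the metadata, one quick sanity check outside of QIIME (a sketch, assuming pandas is installed and the file name from the commands above) is to load the TSV and confirm that the first column is the sample identifier column with a header QIIME 2 recognizes, such as "sample-id", "#SampleID" or "id":
import pandas as pd

# Load the tab-separated metadata file
md = pd.read_csv("sample-metadata.tsv", sep="\t")

# The first header should be the sample ID column; the remaining columns should
# include the ones referenced on the command line (sourcesink, Env)
print(md.columns.tolist())
print(md.head())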

pyspark writeStream: Each DataFrame row in a separate JSON file

I am using pyspark to read data from a Kafka topic as a streaming dataframe as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
# json_schema (a StructType describing the message payload) is defined elsewhere
spark = SparkSession.builder \
.appName("Spark Structured Streaming from Kafka") \
.getOrCreate()
sdf = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test") \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load() \
.select(from_json(col("value").cast("string"), json_schema).alias("parsed_value"))
sdf_ = sdf.select("parsed_value.*")
My goal is to write each of the sdf_ rows as a separate JSON file.
The following code:
writing_sink = sdf_.writeStream \
.format("json") \
.option("path", "/Desktop/...") \
.option("checkpointLocation", "/Desktop/...") \
.start()
writing_sink.awaitTermination()
will write several rows of the dataframe into the same JSON file, depending on the size of the micro-batch (or that is my hypothesis, at least).
What I need is to tweak the above so that each row of the dataframe is written to a separate JSON file.
I have also tried using partitionBy('column'), but this does not do exactly what I need either: it creates folders, and the JSON files inside them may still contain multiple rows (if they share the same id).
Any ideas that could help out here? Thanks in advance.
Found out that the following option does the trick:
.option("maxRecordsPerFile", 1)

Yocto: How to overwrite a file in the Linux rootfs depending on the image recipe?

I'm trying to add a simple line to fstab within the final rootfs that Yocto builds.
My first approach was to add my own fstab in my layer at meta-mylayer/recipes-core/base-files/base-files/fstab together with the corresponding meta-mylayer/recipes-core/base-files/base-files/base-files_%.bbappend, which only has the following line:
FILESEXTRAPATHS_prepend := "${THISDIR}/${PN}:"
And it works, but as the title of my question says, I want to modify fstab based on the image recipe I want to build, i.e. dev-image & prod-image.
After some investigation I think I have 2 options.
Modify fstab within the image recipe, extending the do_install task:
dev-image.bb
--------------
DESCRIPTION = "Development Image"
[...]
inherit core-image
do_install_append () {
    echo "======= Modifying fstab ========"
    cat >> ${D}${sysconfdir}/fstab <<EOF
# The line I want to add
EOF
}
[...]
--------------
The problem is that I'm not actually seeing my modified line in the final /etc/fstab, and bitbake is not showing any build error or warning about this; actually, I'm not even able to see the echo trace I put in.
My second attempt was to handle these modifications with packages, so that depending on the image recipe I can add the package for *-dev or *-prod. This idea was taken from Oleksandr Poznyak in this answer; in summary, he suggests the following:
1) Create a *.bbappend recipe base-files_%.bbappend in your layer. It appends to the poky "base-files" recipe.
2) Create your own "python do_package_prepend" function where you should make your recipe produce two different packages.
3) Add them to DEPENDS in your image recipe.
And based on his example I made my own recipe:
base-files_%.bbappend
-------------------------
FILESEXTRAPATHS_prepend := "${THISDIR}/${PN}:"
SRC_URI += "file://fstab-dev \
file://fstab-prod \
"
PACKAGES += " ${PN}-dev ${PN}-prod"
CONFFILES_${PN}-dev = "${CONFFILES_${PN}}"
CONFFILES_${PN}-prod = "${CONFFILES_${PN}}"
pkg_preinst_${PN}-dev = "${pkg_preinst_${PN}}"
pkg_preinst_${PN}-prod = "${pkg_preinst_${PN}}"
RREPLACES_${PN}-dev = "${PN}"
RPROVIDES_${PN}-dev = "${PN}"
RCONFLICTS_${PN}-dev = "${PN}"
RREPLACES_${PN}-prod = "${PN}"
RPROVIDES_${PN}-prod = "${PN}"
RCONFLICTS_${PN}-prod = "${PN}"
python populate_packages_prepend() {
    import shutil
    packages = ("${PN}-dev", "${PN}-prod")
    for package in packages:
        # copy ${PN} content to packages
        shutil.copytree("${PKGD}", "${PKGDEST}/%s" % package, symlinks=True)
        # replace fstab
        if package == "${PN}-dev":
            shutil.copy("${WORKDIR}/fstab-dev", "${PKGDEST}/${PN}-dev/etc/fstab")
        else:
            shutil.copy("${WORKDIR}/fstab-prod", "${PKGDEST}/${PN}-prod/etc/fstab")
}
-------------------------
And in my image recipe (dev-image.bb) I added the base-files-dev package:
dev-image.bb
--------------
DESCRIPTION = "Development Image"
[...]
inherit core-image
IMAGE_INSTALL = " \
${MY_PACKETS} \
base-files-dev \
"
[...]
--------------
The problem with this is that I'm not familiar with Python indentation, so I'm probably messing things up; the error log is as follows.
DEBUG: Executing python function populate_packages
ERROR: Error executing a python function in exec_python_func() autogenerated:
The stack trace of python calls that resulted in this exception/failure was:
File: 'exec_python_func() autogenerated', lineno: 2, function: <module>
0001:
*** 0002:populate_packages(d)
0003:
File: '/home/build/share/build_2/../sources/poky/meta/classes/package.bbclass', lineno: 1138, function: populate_packages
1134:
1135: workdir = d.getVar('WORKDIR')
1136: outdir = d.getVar('DEPLOY_DIR')
1137: dvar = d.getVar('PKGD')
*** 1138: packages = d.getVar('PACKAGES').split()
1139: pn = d.getVar('PN')
1140:
1141: bb.utils.mkdirhier(outdir)
1142: os.chdir(dvar)
File: '/usr/lib/python3.6/shutil.py', lineno: 315, function: copytree
0311: destination path as arguments. By default, copy2() is used, but any
0312: function that supports the same signature (like copy()) can be used.
0313:
0314: """
*** 0315: names = os.listdir(src)
0316: if ignore is not None:
0317: ignored_names = ignore(src, names)
0318: else:
0319: ignored_names = set()
Exception: FileNotFoundError: [Errno 2] No such file or directory: '${PKGD}'
DEBUG: Python function populate_packages finished
DEBUG: Python function do_package finished
I would really appreciate any clue or direction. I'm not a Yocto expert, so the options I suggest may not be the most elegant and there is probably a better way to do it; feel free to give me any recommendation.
Thank you very much.
UPDATE:
As always, I was not the only one trying this; the way I made it work was thanks to this answer. The only inconvenience is that you need to rm what you want to install through a .bbappend, but for now that is fine for me.
I also tried to do the same with bbclasses, which for me is a more elegant way to do it, but I failed... I got the following error:
ERROR: base-files-dev-3.0.14-r89 do_packagedata: The recipe base-files-dev is trying to install files into a shared area when those files already exist. Those files and their manifest location are:
I tried to rm fstab within the .bbappend, but the same error is shown.
Maybe somebody will share what I'm doing wrong...
If you don't find this post valuable, please remove it...
Your recipe, which is based on Oleksandr's, doesn't work because support for variable expansion in Python functions was dropped in newer Poky:
https://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#migration-2.1-variable-expansion-in-python-functions
The error says it explicitly:
Exception: FileNotFoundError: [Errno 2] No such file or directory: '${PKGD}'
It didn't expand the variable.
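In other words, the literal string '${PKGD}' reached shutil unexpanded. A possible rewrite of the function with explicit d.getVar() lookups (just a sketch along the lines of the migration note, not tested; the fstab-dev / fstab-prod names are the ones shipped via SRC_URI in the bbappend above):
python populate_packages_prepend() {
    import os
    import shutil

    workdir = d.getVar('WORKDIR')
    pkgd = d.getVar('PKGD')
    pkgdest = d.getVar('PKGDEST')
    pn = d.getVar('PN')

    for suffix in ('dev', 'prod'):
        package = '%s-%s' % (pn, suffix)
        # copy the packaged base-files content into the new package's area
        shutil.copytree(pkgd, os.path.join(pkgdest, package), symlinks=True)
        # replace fstab with the variant-specific file from the bbappend
        shutil.copy(os.path.join(workdir, 'fstab-' + suffix),
                    os.path.join(pkgdest, package, 'etc', 'fstab'))
}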
P.S.
This is not a proper answer to your question, but SO blocks comments.

Passing an external yml file to my spark job/code not working, throwing "Can't construct a java object for tag:yaml.org,2002"

I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my Spark job using spark-submit.
I am using the TypeSafe config dependency below to load my property file:
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.1</version>
In my spark driver class MyDriver.java I am loading the YML file as below
String ymlFilename = args[1].toString();
Optional<QueryEntities> entities = InputYamlProcessor.process(ymlFilename);
I have all code here including InputYamlProcessor.java
https://gist.github.com/BdLearnerr/e4c47c5f1dded951b18844b278ea3441
This works fine locally, but when I run it on the cluster it gives this error:
Error:
Can't construct a java object for tag:yaml.org,2002:com.snp.yml.QueryEntities; exception=Class not found: com.snp.yml.QueryEntities
in 'reader', line 1, column 1:
entities:
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:444)
at com.snp.yml.InputYamlProcessor.process(InputYamlProcessor.java:62)
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: com.snp.yml.QueryEntities
at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:650)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:331)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:341)
... 12 more
My spark job script is
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties column_family_condition.yml
What am I doing wrong here? How can I fix this issue?
Any fix is highly appreciated.
Tested:
I printed something like this inside the class, before the line where I get the error above, to check whether the issue is really a missing class.
public static void printTest() {
    QueryEntity e1 = new QueryEntity();
    e1.setTableName("tab1");
    List<QueryEntity> li = new ArrayList<QueryEntity>();
    li.add(e1);
    QueryEntities ll = new QueryEntities();
    ll.setEntitiesList(li);
    ll.getEntitiesList().stream().forEach(e -> logger.error("e1 Name :" + e.getTableName()));
    return;
}
Output :
19/09/18 04:40:33 ERROR yml.InputYamlProcessor: e1 Name :tab1
Can't construct a java object for tag:yaml.org,2002:com.snp.helpers.QueryEntities; exception=Class not found: com.snp.helpers.QueryEntities
in 'reader', line 1, column 1:
entitiesList:
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
What is wrong here?
This has got nothing to do with QueryEntities itself; the
YAMLException: Class not found: com.snp.yml.QueryEntities
is a SnakeYAML constructor/classloader issue.
I changed from
Constructor constructor = new Constructor(com.snp.helpers.QueryEntities.class);
Yaml yaml = new Yaml(constructor);
to
Yaml yaml = new Yaml(new CustomClassLoaderConstructor(com.snp.helpers.QueryEntities.class.getClassLoader()));
