Weka's StringToWordVector filter from command line?

Is it possible to run the StringToWordVector filter in Weka from the command line and get a processed output file? I'd like to pre-process my data separately before feeding it back into Weka for training. So I'm trying to run the filter, get an output file, and then do the rest. I am using a high-end GPU virtual machine with SSH-only access, so I can't use the Weka GUI, only the command line.

See this example:
java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_`#\"" -W 10000000 -b -i input-train.arff -o output-train-vector.arff -r input-test.arff -s output-test-vector.arff
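In case weka.jar is not already on your classpath, a stripped-down version of the same call could look like this (the jar path is only a placeholder; adjust it to your installation). The -b flag runs the filter in batch mode, so the test file given with -r/-s is vectorized using the same dictionary built from the training file given with -i/-o:
java -cp /path/to/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i input-train.arff -o output-train-vector.arff -r input-test.arff -s output-test-vector.arff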

How to download URLs in a CSV and name outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV (as fast/simultaneously as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly two approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports an -i option to import a file of URLs to download, and (I think) it processes them in parallel for maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into smaller files and run a script like the following to process each one:
#!/bin/bash
# Read "serial,url" pairs from the CSV given as $1 and download each URL
# into outputs/<serial>.jpg, one aria2c call per line.
INPUT=$1
while IFS=, read -r serino url
do
    aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this restarts aria2c for every line, which seems to cost time and lower the speed.
Though one can run the script multiple times from the shell to get 'shell-level' parallelism, that does not seem like the best way.
Any suggestions?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=<FILE>
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However, this won't create files 001.jpg, 002.jpg, … but 001, 002, …, since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
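aria2c already downloads several entries of an input file concurrently (5 at a time by default); if you want more parallelism, or the outputs folder from the question, you could additionally pass -j (--max-concurrent-downloads) and --dir, for example (16 is an arbitrary value):
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i - -j 16 --dir=outputs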
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option lets xargs run as many of these commands in parallel as possible.
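If you would rather cap the number of simultaneous downloads instead of letting xargs start as many as it can, give -P a fixed number (8 is just an example):
tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -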

The same command behaves differently in the terminal and in a script

I have a file and I need to use sed to process it.
Here is my command: sed -i -e '/.*tour\.html\|.*Thumb[^\/]*\.jpg/!d'.
Now if I execute this command from the terminal, for example, sed -i -e '/.*tour\.html\|.*Thumb[^\/]*\.jpg/!d' myfile.txt, it works well. But if I write a bash script with the same command, it will delete all lines.
#!/bin/bash
sed -i -e '/.*tour\.html\|.*Thumb[^\/]*\.jpg/!d' "$1"
This script deletes all lines in the file.
My PC runs macOS.
As far as I understand, getting the output of sed --help and sed --version from both invocations showed that you actually have two different sed versions reacting to the two different ways of executing your code.
sed is a little inconsistent concerning its syntax, especially when it comes to command-line options.
For example, I know of an important difference in the -i switch: some Mac versions require a file extension for backups to be given explicitly, while others make it optional. This difference could explain why something involving -i without a backup extension works in one case and fails in another.
Anishsane suggested that different PATH variables could in turn be part of the mechanism that gets two different sed versions executed.
I invite the OP to edit the output of --help and --version (where possible; there should be a way to get the version out of both sed instances) into this answer. I do not have those details, which makes this answer feel a little like guessing.
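In the meantime, a quick check like the following, run both from an interactive terminal and from inside the script, should reveal which sed each context resolves; it only assumes that GNU sed answers --version while BSD/macOS sed does not:
type -a sed
if sed --version >/dev/null 2>&1; then
    echo "GNU sed: $(sed --version | head -n 1)"
else
    echo "BSD/macOS sed: -i needs an explicit (possibly empty) backup suffix, e.g. sed -i '' -e '...' file"
fi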

How can you use multiple input streams for input parameters?

I want to run a command line script that requires several parameters. Specifically:
perl prinseq-lite.pl -fastq file1.fq.gz -fastq2 file2.fq.gz \
    -out_good goodfile.out -out_bad badfile.out -log prin.log \
    -ns_max_n 5 ... more_params ...
The problem is that the files are zipped, and must be processed without first unzipping and storing them, because the unzipped file sizes are very large and this command will be run on a large number of files.
So what I need to do is unzip the input on the fly. Previously, user l0b0 suggested that multiple input streams might be a solution. I have tried the following, but I seem to be passing an empty input stream, as the program claims the input files are empty.
perl prinseq-lite.pl -fastq <(zcat f1.gz) -fastq2 <(zcat f2.gz) ...
perl prinseq-lite.pl -fastq 1< <(zcat f1.gz) -fastq2 2< <(zcat f2.gz) ...
So what I need to do, in short, is provide unzipped input for multiple parameters to this program.
Can someone tell me the proper way to do this, and/or what I'm doing wrong with my current attempts? Thanks in advance for your input.
Well, I think the easiest might be to make named pipes for the output of gunzip, then use those named pipes in the command:
mkfifo file1.fq file2.fq file3.fq ...
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
gunzip -c file3.fq.gz > file3.fq &
Then call your program with those pipes as file names:
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq -fastq3 file3.fq ...
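Wrapped up as a small script, it could look like this (a sketch assuming just two inputs; the temporary directory and cleanup are illustrative, and you would append whatever other prinseq options you already use):
#!/bin/bash
# Create the FIFOs in a temporary directory and remove them on exit.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

mkfifo "$tmpdir/file1.fq" "$tmpdir/file2.fq"
gunzip -c file1.fq.gz > "$tmpdir/file1.fq" &
gunzip -c file2.fq.gz > "$tmpdir/file2.fq" &

perl prinseq-lite.pl -fastq "$tmpdir/file1.fq" -fastq2 "$tmpdir/file2.fq" \
    -out_good goodfile.out -out_bad badfile.out -log prin.log
wait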

apachebench (ab) -g switch append to file?

The apachebench (ab) -g switch generates graph-friendly data. I'd like to have it run every x minutes in order to collect performance metrics and make a graph. However, every time the command runs it replaces the data in the file.
Is there a way to make it append the data to the file instead?
example:
ab -g /tmp/graph1 -n 25 -c 1 http://www.google.com/
will put the graph data in the /tmp/graph1 file.
If I run it a second time, however, the data in there will be lost and replaced by new data; I want it to append instead and keep the data from both runs in the file.
I found a rather simple way to make it work:
ab -g /tmp/graph1 -n 25 -c 1 http://www.google.com/ && cat /tmp/graph1 >> /tmp/full_graph
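To run it every few minutes, that one-liner can go into a crontab; the 5-minute interval and paths below are only examples:
*/5 * * * * ab -g /tmp/graph1 -n 25 -c 1 http://www.google.com/ && cat /tmp/graph1 >> /tmp/full_graph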

Using GNU Parallel With Split

I'm loading a pretty gigantic file into a PostgreSQL database. To do this I first use split on the file to get smaller files (30 GB each) and then I load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start to load one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and have it start loading a file the moment split finishes writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block over -N; it will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script which writes the chunk to the file and then starts carga_postgres.sh on it in the background:
#! /bin/sh
# Write the chunk that split pipes in to the file named by $FILE,
# then start the loader on it in the background.
cat > "$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv
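If you would rather keep GNU Parallel in charge of the concurrency, a variation on the same idea (a sketch, reusing the carga/2011_ prefix from the question) is to have the filter write each chunk and echo its name, and to pipe those names into parallel:
split -l 50000000 --filter='cat > $FILE && echo $FILE' 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}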
