Shell Scripting + SQLite3 - shell

Basically I need to find a few hundred *.DB3 files on three or four of our network shares and execute a SQLite3 script against all of them, extracting the output to a new DB3 file or even a CSV file; but my shell scripting is very rusty and I would appreciate any help.
Using either the Windows CLI or something that can be executed in Cygwin, how would I build a script to search up to three named shares (or mounted drive letters) for all files with a specific extension (*.DB3), store these full paths and filenames in an array or file, and then remove any duplicated file names?
For all files in this list, I then need to run a predefined SQLite3.exe script (example below) against each of them and output the data into a new DB3 file or even a CSV file.
SELECT Product.Type AS 'Product', Product.Identifier AS SKU FROM Product;

The basic footwork is to create a suitably formatted list of file names. The following will produce a list where the filename is in the first column and the full path in the second, tab-delimited.
find /share/one /path/to/share/two \
/export/third/share -type f -name '*.DB3' -printf '%f\t%p\n'
Pass that to sort | awk and create a simple script to print only unique names. Does the order of the duplicates matter? Perhaps you would like to check which duplicate is older, instead of blindly keeping only the first or the last? But here is a simple Awk script which just keeps the first occurrence of each name:
awk -F '\t' '!a[$1]++ { print $2 }'
To connect the pieces, run find | sort | awk | xargs -I{} sqlite3 {} 'your commands' (the -I{} is needed so that each database file ends up before the SQL argument, where sqlite3 expects it).
If you need something a bit more involved, you can read the list of non-duplicate files in a loop:
find /share/one /path/to/share/two \
/export/third/share -type f -name '*.DB3' -printf '%f\t%p\n' |
sort |
awk -F '\t' '!a[$1]++ { print $2 }' |
while IFS= read -r file; do
    sqlite3 "$file" "complex things; more things; commit; etc."
done >output
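If the goal is CSV output, sqlite3's -csv switch does the formatting for you. A rough sketch of the whole thing, with the table and column names taken from the query fragment in the question (adjust them to the real schema):
find /share/one /path/to/share/two \
/export/third/share -type f -name '*.DB3' -printf '%f\t%p\n' |
sort |
awk -F '\t' '!a[$1]++ { print $2 }' |
while IFS= read -r file; do
    sqlite3 -csv "$file" "SELECT Product.Type AS 'Product', Product.Identifier AS SKU FROM Product;"
done > output.csv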

Related

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep with an array as a pattern?
Background story:
I have a pipeline which I have to rerun for datasets which run into an error.
Which datasets ran into an error is saved in a tab-separated file.
I want to delete the files where the pipeline has run into an error.
To do so, I extracted the dataset names from another file containing the finished datasets and saved them in a bash array {ds1 ds2 ...}, but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Not excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' "$path/$log_pf")
echo "${finished[@]}"
which works as expected but now I am stuck in filtering the ls output using that array:
(pseudocode)
#trying to ignore the datasets in the array - not working
ls -I "${finished[@]}" -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess I could do that easily in Python,
but for training purposes I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
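Applied to the layout in the question, a sketch along those lines (assuming, as in the mapfile call above, that the finished dataset names sit in column 2 of the log file after a header line):
#list dataset directories, dropping those whose name appears in the log of finished datasets
ls -d /datasets/*/ | grep -v -f <(awk 'NR>1 {print $2}' "$path/$log_pf")
Note that grep treats each line as a regular expression and matches substrings, so ds1 would also exclude ds10; printing the names with surrounding slashes (print "/" $2 "/") in the awk part avoids that.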
I must admit I'm confused about what you're doing, but if you're just trying to produce a list of files excluding those stored in column 2 of some other file, and your file/directory names can't contain spaces, then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Concatenating CSV files in bash preserving the header only once

Imagine I have a directory containing many subdirectories each containing some number of CSV files with the same structure (same number of columns and all containing the same header).
I am aware that I can run from the parent folder something like
find ./ -name '*.csv' -exec cat {} \; > ~/Desktop/result.csv
And this will work fine, except for the fact that the header is repeated each time (once for each file).
I'm also aware that I can do something like sed 1d <filename> or tail -n +<N+1> <filename> to skip the first line of a file.
But in my case, it seems a bit more specialised. I want to preserve the header once for the first file and then skip the header for every file after that.
Is anyone aware of a way to achieve this using standard Unix tools (like find, head, tail, sed, awk etc.) and bash?
For example, input files:
/folder1
    /file1.csv
    /file2.csv
/folder2
    /file1.csv
where each file has the header A,B,C and one data row 1,2,3.
The desired output would be:
A,B,C
1,2,3
1,2,3
1,2,3
Marked As Duplicate
I feel this is different from other questions like this and this, specifically because those solutions reference file1 and file2 explicitly. My question asks about a directory structure with an arbitrary number of files, where I would not want to type out each file one by one.
You may use this find + xargs + awk:
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1'
The NR==1 || FNR>1 condition is true for the very first line of the combined input (the one header we keep) and for every non-first line of each file, so the repeated headers are skipped.
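To land everything in one file, as in the question, just redirect the whole pipeline (note that with a very large number of files xargs may split them over several awk invocations, in which case the header would reappear once per batch):
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1' > ~/Desktop/result.csv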
$ {
> cat real-daily-wages-in-pounds-engla.tsv;
> tail -n+2 real-daily-wages-in-pounds-engla.tsv;
> } | cat
You can group multiple commands with { } and pipe (or redirect) their combined output; tail -n +2 selects all lines from a file except the first. The example above concatenates the same sample file twice, once with and once without its header, just to show the effect.
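Applied to the arbitrary-number-of-files case from the question, a sketch of the same idea (assumes file names without newlines; the first file found supplies the header):
#print the first file whole, then strip the header line from every later file
first=1
find . -name '*.csv' | sort | while IFS= read -r f; do
    if [ "$first" -eq 1 ]; then
        cat "$f"
        first=0
    else
        tail -n +2 "$f"
    fi
done > ~/Desktop/result.csv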

How to delete files from directory using CSV in bash

I have 600,000+ images in a directory. The filenames look like this:
1000000-0.jpeg
1000000-1.jpeg
1000000-2.jpeg
1000001-0.jpeg
1000002-0.jpeg
1000003-0.jpeg
The first number is a unique ID and the second number is an index.
{unique-id}-{index}.jpeg
How would I load the unique-id's in from a .CSV file and remove each file whose Unique ID matches the Unique ID's in the .CSV file?
The CSV file looks like this:
1000000
1000001
1000002
... or I can have it separated by semicolons like so (if necessary):
1000000;1000001;1000002
You can set the IFS variable to ; and loop over the values read into an array:
#! /bin/bash
while IFS=';' read -r -a ids ; do
    for id in "${ids[@]}" ; do
        rm "$id"-*.jpeg
    done
done < file.csv
Try running the script with echo rm ... first to verify it does what you want.
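For example, a dry run under the same assumptions (file.csv with semicolon-separated IDs, files named {unique-id}-{index}.jpeg):
while IFS=';' read -r -a ids ; do
    for id in "${ids[@]}" ; do
        echo rm "$id"-*.jpeg   #drop the echo once the output looks right
    done
done < file.csv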
If there's exactly one ID per line, this will show you all matching file names:
ls | grep -f unique-ids.csv
If that list looks correct, you can delete the files with:
ls | grep -f unique-ids.csv | xargs rm
Caveat: This is a quick and dirty solution. It'll work if the file names are all named the way you say. Beware it could easily be tricked into deleting the wrong things by a clever attacker or a particularly hapless user.
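A slightly less fragile sketch under the same one-ID-per-line assumption, using find instead of parsing ls (keep the -print until the list looks right, then add -delete):
while IFS= read -r id; do
    find . -maxdepth 1 -type f -name "${id}-*.jpeg" -print
done < unique-ids.csv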
You could use find and sed:
find dir -regextype posix-egrep \
-regex ".*($(sed 's/\;/|/g' ids.csv))-[0-9][0-9]*\.jpeg"
Replace dir with your search directory and ids.csv with your CSV file. To delete the files you could append the -delete option.
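With -delete appended (and after checking the matches first, e.g. by running the command without it), that would be:
find dir -regextype posix-egrep \
-regex ".*($(sed 's/\;/|/g' ids.csv))-[0-9][0-9]*\.jpeg" -delete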

save filename and information from the file into a two column txt doc. ubuntu terminal

I have a question regarding the manipulation and creation of text files in the ubuntu terminal. I have a directory that contains several thousand subdirectories. In each directory, there is a file with the extension stats.txt. I want to write a piece of code that will run from the parent directory and create a file with the names of all the stats.txt files in the first column, and the information from the 5th line of the same stats.txt file in the next column. The 5th line of each stats.txt file is a sentence of six words, not a single value.
For reference, I have successfully used the sed command in combination with find and cat to make a file containing the 5th line from each stats.txt file. I then used the ls command to save a list of all my subdirectories. I assumed both files would be in alphabetical order of the subdirectories, and thus easy to merge, but I was wrong. The find and cat commands, or at least my implementation of them, resulted in a file that appeared to be in random order (see below). No need to try to remedy this code, I'm open to all solutions.
# loop through subdirectories and save the 5th line of stats.txt as a different file.
for f in ~/*; do [ -d "$f" ] && cd "$f" && sed -n 5p *stats.txt > final.stats.txt; done
# find the final.stats.txt files and save them as a single file
find ./ -name 'final.stats.txt' -exec cat {} \; > compiled.stats.txt
Maybe something like this can help you get on track:
find . -name "*stats.txt" -exec awk 'FNR==5{print FILENAME, $0}' '{}' + > compiled.stats
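If the two columns should be tab-separated, a small variant (FILENAME is the path find hands to awk, so it includes the subdirectory; strip it with sub() if only the bare file name is wanted):
find . -name "*stats.txt" -exec awk 'FNR==5{printf "%s\t%s\n", FILENAME, $0}' '{}' + > compiled.stats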

combining grep and find to search for file names from query file

I've found many similar examples but cannot find an example to do the following. I have a query file with file names (file1, file2, file3, etc.) and would like to find these files in a directory tree; these files may appear more than once in the dir tree, so I'm looking for the full path. This option works well:
find path/to/files/*/* -type f | grep -E "file1|file2|file3|fileN"
What I would like is to pass grep a file with filenames, e.g. with the -f option, but am not successful. Many thanks for your insight.
This is what the query file looks like: one column of filenames, separated by newlines:
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
grep -f FILE gets the patterns to match from FILE, one per line:
cat files_to_find.txt
n100079_seqs.fna
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
Remove any whitespace (or do it manually):
perl -i -nle 'tr/ //d; print if length' files_to_find.txt
Create some files to test:
touch `cat files_to_find.txt`
Use it:
find ~/* -type f | grep -f files_to_find.txt
output:
/home/user/tmp/7010_seqs.fna
/home/user/tmp/103128_seqs.fna
/home/user/tmp/7049_seqs.fna
/home/user/tmp/7059_seqs.fna
/home/user/tmp/7077A_seqs.fna
/home/user/tmp/7079_seqs.fna
/home/user/tmp/n100079_seqs.fna
Is this what you want?
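One caveat: grep -f treats each line of the file as a regular expression, so a dot in a name matches any character and names can match as substrings. Adding -F makes them fixed strings:
find ~/* -type f | grep -F -f files_to_find.txt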
