Bash script to add new directories into a PostgreSQL table

I'm trying to write a script that lists a directory and creates an SQL script to insert the directories it finds. The problem is that I only want to insert new directories. Here is what I have so far:
#If file doesn't exist add the search path test
if [ ! -e /home/aydin/movies.sql ]
then
echo "SET SEARCH_PATH TO noti_test;" >> /home/aydin/movies.sql;
fi
cd /media/htpc/
for i in *
do
#for each directory escape any single quotes
movie=$(echo $i | sed "s:':\\\':g" )
#build sql insert string
insertString="INSERT INTO movies (movie) VALUES (E'$movie');";
#if sql string exists in file already
if grep -Fxq "$insertString" /home/aydin/movies.sql
then
#comment out string
sed -i "s/$insertString/--$insertString/g" /home/aydin/movies.sql
else
#add sql string
echo $insertString >> /home/aydin/movies.sql;
fi
done;
#execute script
psql -U "aydin.hassan" -d "aydin_1.0" -f /home/aydin/movies.sql;
It seems to work apart from one thing, the script doesn't recognise entries with single quotes in them, so upon running the script again with no new dirs, this is what the file looks like:
--INSERT INTO movies (movie) VALUES (E'007, Moonraker (1979)');
--INSERT INTO movies (movie) VALUES (E'007, Octopussy (1983)');
INSERT INTO movies (movie) VALUES (E'007, On Her Majesty\'s Secret Service (1969)');
I'm open to suggestions on a better way to do this also, my process seems pretty elongated and inefficient :)

Script looks generally good to me. Consider the revised version (untested):
#! /bin/bash
#If file doesn't exist add the search path test
if [ ! -e /home/aydin/movies.sql ]
then
echo 'SET search_path=noti_test;' > /home/aydin/movies.sql;
fi
cd /media/htpc/
for i in *
do
#build sql insert string - single quotes work fine inside dollar-quoting
#the loop variable is $i, so use it directly as the movie name
insertString="INSERT INTO movies (movie) SELECT \$x\$$i\$x\$
WHERE NOT EXISTS (SELECT 1 FROM movies WHERE movie = \$x\$$i\$x\$);"
#no need for grep. SQL is self-contained.
#quote the variable so names are not word-split or glob-expanded
echo "$insertString" >> /home/aydin/movies.sql
done
#execute script
psql -U "aydin.hassan" -d "aydin_1.0" -f /home/aydin/movies.sql;
To start a new file, use > instead of >>
Use single quotes ' for string constants without variables to expand
Use PostgreSQL dollar-quoting so you don't have to worry about single-quotes in the strings. You'll have to escape the $ character in the shell to remove its special meaning in the shell.
Use an "impossible" string for the dollar-quote, so it cannot appear in the string. If you don't have one, you can test for the quote-string and alter it in the unlikely case it should be matched, to be absolutely sure.
Use SELECT .. WHERE NOT EXISTS for the INSERT to automatically prevent already existing entries from being re-inserted. This prevents duplicate entries in the table completely, not just among the new entries.
An index on movies.movie (possibly, but not necessarily UNIQUE) would speed up the INSERTs.
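The points about dollar-quoting and testing the quote string can be sketched as a small helper. This is a minimal sketch, not part of the answer above: make_insert is a hypothetical function name, and the rule it uses (grow the tag until "$" followed by the tag no longer occurs in the name) is one assumed way to guarantee the closing delimiter cannot match early.

```shell
#!/bin/sh
# Build one idempotent INSERT for a directory name using dollar-quoting.
# make_insert is a hypothetical helper, not taken from the original answer.
make_insert() {
    name=$1
    tag=x
    # Grow the tag while "$" followed by the tag occurs inside the name,
    # so the closing delimiter can never be matched too early.
    while case $name in *"\$$tag"*) true ;; *) false ;; esac; do
        tag="${tag}x"
    done
    q="\$$tag\$"
    printf 'INSERT INTO movies (movie) SELECT %s%s%s WHERE NOT EXISTS (SELECT 1 FROM movies WHERE movie = %s%s%s);\n' \
        "$q" "$name" "$q" "$q" "$name" "$q"
}
```

Inside the loop of the script above, make_insert "$i" >> /home/aydin/movies.sql would replace the insertString and echo lines.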

Why bother with grep and sed and not just let the database detect duplicates?
Add a unique index on movie and create a new (temporary) insert script on each run and then execute it with autocommit (default) or with the -v ON_ERROR_ROLLBACK=1 option of psql. To get a full insert script of your movie database dump it with the --column-inserts option of pg_dump.
Hope this helps.

There's a utility daemon called incron, which fires your script whenever a file is written in a watched directory. It uses kernel events (inotify) rather than polling loops, so it is Linux only.
In its config (full file path):
/media/htpc IN_CLOSE_WRITE /home/aydin/adder.sh $@/$#
Then the simplest adder.sh script, without any parameter checks:
#!/bin/bash
cat <<-EOsql | psql -U "aydin.hassan" -d "aydin_1.0"
INSERT INTO movies (movie) VALUES (E'$1');
EOsql
Unlike your original script, this has no trouble with thousands of files in one directory.

Related

Process multiple files one by one dynamically in workflow using indirect file method

My workflow uses 3 indirect files.
The indirect files can have one or more file names.
Let's say all 3 indirect files have 2 file names each.
Indirect_file1 has (file1,file2)
Indirect_file2 has (filea,fileb)
Indirect_file3 has (filex,filey)
My workflow should run in sequence.
First sequence (file1,filea,filex)
Second sequence (file2,fileb,filey)
We are on a Linux environment, so I guess it can be done using a shell script.
Any pointers will be appreciated.
Thanks in Advance.
This should work -
In the Informatica session, change the input type to 'Command'.
In the Informatica session, change the command type to 'Command generating file List'.
For the first workflow, set the command to cut -d ',' -f1 file (if your delimiter is a comma).
For the second workflow, set the command to cut -d ',' -f2 file (if your delimiter is a comma).
You might want to make small work packages first before processing. When the workflow takes a long time it is easier to (re-)start new processes.
You can start with something like this:
# Step 1, move the current set to temporary folder
combine_dir=/tmp/combine
mkdir "${combine_dir}"
mv Indirect_file1 "${combine_dir}"
mv Indirect_file2 "${combine_dir}"
mv Indirect_file3 "${combine_dir}"
# Step 2, construct work packages in other tmp dir
workload_dir=/tmp/workload
mkdir "${workload_dir}"
for file in Indirect_file1 Indirect_file2 Indirect_file3; do
    loadnr=1
    # read from the temporary folder the files were moved to in step 1
    for work in $(grep -Eo '[^(,)]*' "${combine_dir}/${file}"); do
        echo "${work}" >> "${workload_dir}/sequence${loadnr}"
        ((loadnr++))
    done
done
# The sequenceXXX files have been generated with one file on each line.
# When you must have it like (file1,filea,filex), change above loop.
# Now files are ready to be processed. Move them to some dir where files will be handled.
# Please cleanup temporary files
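If the sequences must look like (file1,filea,filex), one option is to pair the n-th entry of every indirect file with paste. This is a minimal sketch under the assumption that each indirect file lists one file name per line; build_sequences is a hypothetical helper name.

```shell
#!/bin/sh
# Print one sequence per line, e.g. "file1,filea,filex", by joining the
# n-th line of every indirect file with commas.
# Assumes one file name per line in each indirect file.
build_sequences() {
    paste -d, "$@"
}

# Example (not executed here); run_workflow is a hypothetical placeholder:
#   build_sequences Indirect_file1 Indirect_file2 Indirect_file3 |
#   while IFS=, read -r a b c; do run_workflow "$a" "$b" "$c"; done
```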

SED is giving me issues

I am working on a really basic script:
1) Grabs account keys from a text file (keyList.txt) --> key format looks like this: 1002000222,1002000400
2) For each key I am looping through and inserting them (using SED) into SQL queries held in another text file.
3) Query example:
UPSERT INTO ACCT_HIST (ACCT_KEY) SELECT ACCT_KEY FROM ACCT_HIST WHERE ACCT_KEY IN (101000033333) AND REC_ACTV_IND = 'Y' AND DT_KEY < 20191009;
My Bash snippet is below but to summarize the issue, SED is only replacing the values in the parenthesis one key at a time, rather than placing them both in the same parenthesis space. The below is now working perfectly.
#!/bin/bash
now=$(date +"%Y%m%d-%H:%M")
cp acct_transfer_soft_del_list.csv keyList_$now.txt
for key in $(<keyList_$now.txt)
do
sed "s/([^)]*)/(${key})/3" hbase.txt >> queries_$now.txt
done
hbase.txt holds the queries but I don't want to permanently change them, so I send the output to queries_$now.txt.
Note that you have IFS=, set.
This is probably splitting your key line at the commas, which would explain the unwanted behaviour.
I admit I am not sure I fully understood what you need, but I think you can do everything in the first loop.
Reusing your code, you can do something like this:
#!/bin/bash
now=$(date +"%Y%m%d-%H:%M")
IFS=","
while read -r f1 f2
do
echo "$f1,$f2"
sed "s/([^)]*)/($f1,$f2)/3" hbase.txt >> queries_$now.txt
done < acct_transfer_soft_del_list.csv > keyList_$now.txt
Anyway, I don't see the point of your while loop: it seems to simply copy the file.
You could avoid it with cp acct_transfer_soft_del_list.csv keyList_$now.txt
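Another way to keep both keys together is to read each line of the key file whole and substitute it into the IN (...) clause. This is a minimal sketch: fill_keys is a hypothetical helper, and it assumes every query template contains exactly one IN (...) list and that keys consist only of digits and commas.

```shell
#!/bin/sh
# Put a whole comma-separated key list inside the IN (...) clause.
# Matching on "IN (" avoids having to count parentheses on the line.
fill_keys() {
    keys=$1          # e.g. 1002000222,1002000400
    template=$2      # file holding the query templates, left unchanged
    sed "s/IN ([^)]*)/IN ($keys)/" "$template"
}

# Example (not executed here): one set of queries per line of the key file
#   while IFS= read -r keys; do fill_keys "$keys" hbase.txt; done < keyList.txt
```

Reading with IFS= keeps the line intact, so the commas are not treated as field separators.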

Running a number of hive queries and writing output to file

I'm trying to make use of the DESCRIBE function via Hive to output the column descriptions of each of the tables out to individual files. I've discovered the -f option so I can just read from a file and write the output back out:
hive -f nameOfSqlQueryFile.sql > out.txt
However, if I open the output file, it throws all the descriptions back to back and it's unclear where one description starts for a table and where it ends.
So, I've tried making a batch file that uses -e to describe each of the tables individually and output to a file:
#!/bin/bash
nameArr=( $(hive -e 'show tables;') )
count=0
for i in "${nameArr[@]}"
do
echo 'Working on table('$count'): '$i
hive -e 'describe '$i > $i'_.txt';
count=$(($count+1))
done
However, because this needs to reconnect for each query, it's remarkably slow, taking hours to process several hundred queries.
Does anyone have an idea of how else I might run each of these DESCRIBE functions, and ideally output to separate files?
You can probably use one of these, depending on how you process the output:
Just use the OK line as a separator and search for it using a script.
Use DESCRIBE EXTENDED which adds a line at the end with info on the table, including its location, which can be used to extract the table name (using sed, for example)
If you're just using the output file as a manual reference, insert a SQL statement that prints a separator of your choice between each table, e.g.:
DESCRIBE table;
SELECT '-----------------' FROM table;
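Once a separator is printed between tables, the single combined output can be split into one file per table afterwards. This is a minimal sketch that assumes each description is preceded by a marker line of the form "=== tablename ===". That marker format and the name split_descriptions are hypothetical choices, not something Hive produces by itself.

```shell
#!/bin/sh
# Split one combined output file into <table>_.txt files.
# Assumes every table's description is preceded by a marker line
# of the form "=== tablename ===" (a hypothetical format).
split_descriptions() {
    combined=$1
    outdir=$2
    awk -v dir="$outdir" '
        /^=== .* ===$/ {
            if (out != "") close(out)
            out = dir "/" $2 "_.txt"   # $2 is the table name
            next
        }
        out != "" { print > out }      # lines before the first marker are dropped
    ' "$combined"
}
```

This keeps the single hive -f run, so the connection cost is paid only once.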

Shell script to make many directories

I'm trying to create a file hierarchy to store data. I want to create a folder for each data acquisition session. That folder has five subfolders, which are named below. My code attempt below gives an error, but I'm not sure how to correct it.
Code
#!/bin/sh
TRACES = "/Traces"
LFPS = '/LFPS'
ANALYSIS = '/Analysis'
NOTES = '/Notes'
SPIKES = '/Spikes'
folders=($TRACES $LFPS $ANALYSIS $NOTES $SPIKES)
for folder in "${folders[@]}"
do
mkdir $folder
done
Error
I get an error when declaring the variables. As written above, bash flips the error Command not found. If, instead, I declare the file names as TRACES = $('\Traces'), bash flips the error No such file or directory.
Remove the spaces between the variable names and the values:
#!/bin/bash
# arrays are a bash feature, so ask for bash rather than plain sh
TRACES="/Traces"
LFPS='/LFPS'
ANALYSIS='/Analysis'
NOTES='/Notes'
SPIKES='/Spikes'
folders=($TRACES $LFPS $ANALYSIS $NOTES $SPIKES)
for folder in "${folders[@]}"
do
mkdir $folder
done
With spaces, bash interprets this like
COMMAND param1 param2
with = as param1
I'm taking the 'no spaces around the variable assignments' part of the fix as given.
Using array notation seems like overkill. Allowing for possible spaces in names, you can use:
for dir in "$TRACES" "$LFPS" "$ANALYSIS" "$NOTES" "$SPIKES"
do mkdir "$dir"
done
But even that is wasteful:
mkdir "$TRACES" "$LFPS" "$ANALYSIS" "$NOTES" "$SPIKES"
If you're worried that the directories might exist, you can avoid error messages for that with:
mkdir -p "$TRACES" "$LFPS" "$ANALYSIS" "$NOTES" "$SPIKES"
The -p option is also valuable if the paths are longer and some of the intermediate directories might be missing. If you're sure there won't be spaces in the names, the double quotes become optional (but they're safe and cheap, so you might as well use them).
You could check beforehand whether the folders already exist, or simply use mkdir -p, which does not complain about existing folders.
You can also always debug the shell script with set -x.
I made the following changes to get your script to run.
As a review comment it is unusual to create such folders hanging off the root file system.
#!/bin/sh
TRACES="/Traces"
LFPS='/LFPS'
ANALYSIS='/Analysis'
NOTES='/Notes'
SPIKES='/Spikes'
folders="$TRACES $LFPS $ANALYSIS $NOTES $SPIKES"
for folder in $folders
do
mkdir $folder
done
Spaces were removed from the initial variable assignments and I also simplified the for loop so that it iterated over the words in the folders string.
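For the stated goal of one folder per acquisition session with five subfolders, the pieces above can be combined. This is a minimal sketch: make_session_dirs and its base and session parameters are hypothetical, and it builds the tree under a chosen base folder rather than directly under the root file system.

```shell
#!/bin/sh
# Create one session folder containing the five subfolders from the question.
# "base" and "session" are hypothetical parameters.
make_session_dirs() {
    base=$1
    session=$2
    for sub in Traces LFPS Analysis Notes Spikes; do
        # -p creates missing parent folders and is silent if the folder exists
        mkdir -p "$base/$session/$sub" || return 1
    done
}

# Example (not executed here):
#   make_session_dirs "$HOME/data" "session_01"
```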

bash script to update postgres database

I have some html data stored in text files right now. I recently decided to store the HTML data in the pgsql database instead of flat files. Right now, the 'entries' table contains a 'path' column that points to the file. I have added a 'content' column that should now store the data in the file pointed to by 'path'. Once that is complete, the 'path' column will be deleted. The problem that I am having is that the files contain apostrophes that throw my script out of whack. What can I do to correct this issue??
Here is the script
#!/bin/sh
dbname="myDB"
username="username"
fileroot="/path/to/the/files/*"
for f in $fileroot
do
psql $dbname $username -c "
UPDATE entries
SET content='`cat $f`'
WHERE id=SELECT id FROM entries WHERE path LIKE '*`$f`';"
done
Note: The logic in the id=SELECT...FROM...WHERE path LIKE "" is not the issue. I have tested this with sample filenames in the pgsql environment.
The problem is that when I cat $f, any apostrophe in the contents of $f closes the SQL string, and I get a syntax error.
For the single quote escaping issue, a reasonable workaround might be to double the quotes, so you'd use:
`sed "s/'/''/g" < "$f"`
to include the file contents instead of the cat, and for the second invocation in the LIKE where you appeared to intend to use the file name use:
${f//"'"/"''"}
to include the literal string content of $f instead of executing it, and double the quotes. The ${varname/match/replace} expression is bash syntax and may not work in all shells; use:
`echo "$f" | sed "s/'/''/g"`
if you need to worry about other shells.
There are a bunch of other problems in that SQL.
You're trying to execute $f in your second invocation. I'm pretty sure you didn't intend that; I imagine you meant to include the literal string.
Your subquery is also wrong, it lacks parentheses; (SELECT ...) not just SELECT.
Your LIKE expression is also probably not doing what you intended; you probably meant % instead of *, since % is the SQL wildcard.
If I also change backticks to $() (because it's clearer and easier to read IMO), fix the subquery syntax, add an alias to disambiguate the columns, use % as the LIKE wildcard, and pass the SQL to psql's stdin through a here-document, the result is:
psql $dbname $username <<__END__
UPDATE entries
SET content='$(sed "s/'/''/g" < "$f")'
WHERE id=(SELECT e.id FROM entries e WHERE e.path LIKE '%$(echo "$f" | sed "s/'/''/g")');
__END__
The above assumes you're using a reasonably modern PostgreSQL with standard_conforming_strings = on. If you aren't, change the regexp to escape apostrophes with \ instead of doubling them, and prefix the string with E, so O'Brien becomes E'O\'Brien'. In modern PostgreSQL it'd instead become 'O''Brien'.
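The quote doubling can be wrapped in a small function so the SQL stays readable. This is a minimal sketch, assuming standard_conforming_strings = on as described above; sql_literal_from_file is a hypothetical helper name.

```shell
#!/bin/sh
# Print the contents of a file as a standard SQL string literal,
# doubling every single quote. Trailing newlines are dropped by $(...).
# Assumes standard_conforming_strings = on.
sql_literal_from_file() {
    printf "'%s'" "$(sed "s/'/''/g" < "$1")"
}

# Example (not executed here), inside the here-document:
#   SET content=$(sql_literal_from_file "$f")
```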
In general, I'd recommend using a real scripting language like Perl with DBD::Pg or Python with psycopg to solve scripting problems with databases. Working with the shell is a bit funky. This expression would be much easier to write with a database interface that supported parameterised statements.
For example, I'd write this as follows:
import sys
import psycopg2

try:
    connstr = sys.argv[1]
    filename = sys.argv[2]
except IndexError:
    print("Usage: %s connect_string filename" % sys.argv[0])
    print("Eg: %s \"dbname=test user=fred\" \"some_file\"" % sys.argv[0])
    sys.exit(1)

def load_file(connstr, filename):
    # Read as text so the content is sent as a string, not as bytea.
    with open(filename, "r") as f:
        content = f.read()
    conn = psycopg2.connect(connstr)
    curs = conn.cursor()
    # Parameters are bound in order: the content first, then the path suffix.
    curs.execute("""
        UPDATE entries
        SET content = %s
        WHERE id = (SELECT e.id FROM entries e WHERE e.path LIKE '%%'||%s);
    """, (content, filename))
    conn.commit()  # psycopg2 opens a transaction implicitly; commit to keep the change
    curs.close()
    conn.close()

if __name__ == '__main__':
    load_file(connstr, filename)
Note the SQL wildcard % is doubled to escape it, so it results in a single % in the final SQL. That's because psycopg2 uses % to mark its parameter placeholders, so a literal % must be doubled to escape it.
You can trivially modify the above script to accept a list of file names, connect to the database once, and loop over the list of all file names. That'll be a lot faster, especially if you do it all in one transaction. It's a real pain to do that with psql scripting; you have to use bash co-process as shown here ... and it isn't worth the hassle.
In the original post, I made it sound like there were apostrophes in the filename represented by $f. This was NOT the case, so a simple echo "$f" was able to fix my issue.
To make it clearer, the contents of my files were formatted as html snippets, typically something like <p>Blah blah <b>blah</b>...</p>. After trying the solution posted by Craig, I realized I had used single quotes in some anchor tags, and I did NOT want to change those to something else. There were only a few files where this violation occurred, so I just changed these to double quotes by hand. I also realized that instead of escaping the apostrophes, it would be better to convert them to &apos;. Here is the final script that I ended up using:
dbname="myDB"
username="username"
fileroot="/path/to/files/*"
for f in $fileroot
do
psql $dbname $username << __END__
UPDATE entries
SET content='$(sed "s/'/\&apos;/g" < "$f")'
WHERE id=(SELECT e.id FROM entries e WHERE path LIKE '%$(echo "$f")');
__END__
done
The format coloring on here might make it look like the syntax is incorrect, but I have verified that it is correct as posted.
