Okay, so I have a shell script that uses an sqlite database for storing data. However, it's a long running script that runs a lot of small queries over time.
Currently I'm calling sqlite3 'my.db' 'SELECT * FROM foo' or similar for every query; however, I don't expect this is terribly efficient, as sqlite will have to re-open the database each time (unless it sneakily keeps connections open for a while?).
However, this also means that if I need to do any setup, such as specifying the timeout option, then I will need to do this every time I call sqlite3 as well.
What I'm wondering is if there is some alternative way I can structure my script, for example opening the sqlite3 process in the background in interactive mode and somehow feeding it with queries and grabbing responses. Is there a good, robust way to do this?
I've tried something along these lines, but I don't always know how many results a query will return (if any). So I end up having to do queries like SELECT COUNT(*) FROM foo WHERE bar=1; SELECT * FROM foo WHERE bar=1, so that I always get back at least one line telling me how many results to expect. But this is no good if I (somehow) get more results than expected, as they will still be waiting when I run the next query.
Here's a shortened form of what I've tried (please forgive any typos, I'm sure to make some):
#!/bin/sh
mkfifo '/tmp/db_in'
mkfifo '/tmp/db_out'
# Create the sqlite process in the background (assume database setup already)
sqlite3 'my.db' </tmp/db_in >/tmp/db_out &
# Set connection parameters
echo '.timeout 1000' >/tmp/db_in
# Perform a query with count in first result
echo 'SELECT COUNT(*) FROM foo WHERE bar=1; SELECT * FROM foo WHERE bar=1;' >/tmp/db_in
read results </tmp/db_out
while [ $results -gt 0 ]; do
IFS='|' read -r key value </tmp/db_out
echo "$key=$value"
results=$(($results - 1))
done
Is this along the right lines, or is there a better way to do this? I haven't been able to think of a way to get sqlite to return something at the end of each query's output, which I think would be a better way to do things. I know the .echo option can be used to echo a query before its output, which I can at least use to make sure I don't read anything left over from a previous query, but I really want to be able to handle arbitrary-length results correctly (without needing a count first, since this could go out of sync).
The only other alternative seems to be dumping results into a file, but as I may be handling a large number of queries I don't really want to be doing that; at least with FIFOs everything's still in memory.
[edit]
I've just noticed that some of the shells I try the above with close sqlite3 after writing to db_in; in those cases I think the FIFOs may need to be bound to a file descriptor to keep them open. I'm not going to add this to the example just now as it makes it more complex, but if a solution uses file descriptors then it should probably be added for completeness.
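For reference, here is a minimal sketch of that sentinel idea combined with bound file descriptors. It assumes a marker string (__END__ here) that never occurs in real data, and depending on how your sqlite3 buffers output when writing to a pipe, you may also need to launch it under something like stdbuf -oL so the marker isn't held back:
#!/bin/sh
# Sketch only: keep one sqlite3 process open on two FIFOs, bind them to file
# descriptors so they stay open across queries, and end every query with a
# marker row so the reader knows when the results are finished.
mkfifo /tmp/db_in /tmp/db_out
sqlite3 'my.db' </tmp/db_in >/tmp/db_out &
exec 3>/tmp/db_in 4</tmp/db_out
# Set connection parameters once
echo '.timeout 1000' >&3
run_query() {
    printf '%s\n' "$1" "SELECT '__END__';" >&3
    while read -r line <&4; do
        [ "$line" = '__END__' ] && break
        printf '%s\n' "$line"
    done
}
run_query 'SELECT * FROM foo WHERE bar=1;' |
while IFS='|' read -r key value; do
    echo "$key=$value"
done
printf '.quit\n' >&3
exec 3>&- 4<&-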
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
set in.state1_2017
in.state1_2018
in.state1_2019
in.state1_2020
in.state2_2017
in.state2_2018
in.state2_2019
in.state2_2020;
array prcode{*} I10_PR1-I10_PR25;
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) options for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer, and I tried several times; it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to offer much room for improving efficiency.
I think the issue may be located in the DO loop. Your program runs the DO loop 25 times for each record and may assign to cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job that works through one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and then, after they're done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing, which basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions - add a parameter to the SAS program of state, then run 18 of them, have them output their results to a known location with state name or number, and then have your program collect them.
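As a rough shell-side sketch of that second approach - the program name, state list, and use of -sysparm here are just assumptions about your setup - you could launch one batch SAS session per state and wait for them all:
# run one SAS batch job per state in parallel, passing the state via SYSPARM
for state in state1 state2 state3; do
    sas pull_cohort.sas -sysparm "$state" -log "pull_$state.log" &
done
wait
# then combine the per-state outputs in a final step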
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools by Paul Dorfman for more details. On page 10, he shows the method I describe here; you use PEEKC to get the concatenated values and then INDEXW to find the thing you want.
data want;
set have;
array prcode{*} $8 I10_PR1-I10_PR25;
found = (^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ0ZZ')) or
(^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ4ZZ'))
;
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, and then some of them - say, 5 or 10 of them - are filled, then I10_PR11 and on are empty; and if you hit an empty one, you're all done for that round. So not just leaving when you hit what you are looking for, but also leaving when you hit an empty, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to leave the loop as soon as the criterion is met, to avoid wasting resources.
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
output; * cohort criteria met so output the row;
leave; * exit the loop immediately;
end;
end;
Right now, I am creating files to hold variables that need to persist. But I'm curious if there's a simpler way to create variables that don't go away when the script terminates.
I find Redis invaluable for persisting data like this. It is a quick and lightweight installation and allows you to store many types of data:
strings, including complete JSONs and binary data like JPEG/PNG/TIFF images - also with TTL (Time-to-Live) so data can be expired when no longer needed
numbers, including atomic integers, floats
lists/queues/stacks
hashes (like Python dictionaries)
sets, and sorted (ordered) sets
streams, bitfields, geospatial data and the more esoteric HyperLogLogs
PUB/SUB is also possible, where one or more machines/processes publish items and multiple consumers, who have subscribed to that topic, receive the published items.
It can also perform very fast operations on your data for you, like set intersections and unions, getting lengths of lists, moving items between lists, atomically adding/subtracting from numbers and so on.
You can also use it to pass data between processes, sub-processes, shell scripts, parent and child, child and parent (!) scripts and so on.
In addition to all that, it is networked, so you can set variables on one computer and read/alter them from another - very simply. For example, you can PUSH jobs to a queue, potentially from multiple machines, and run workers on multiple machines that wait for jobs on the queue, process them and return results to another list.
There is a discussion of the things you can store here.
Example: Store a string, then retrieve it:
redis-cli SET name fred
name=$(redis-cli GET name)
Example: Increment views of page 2 by 10, and then retrieve from different machine on network:
redis-cli INCRBY views:page:2 10
views=$(redis-cli -h 192.168.0.10 GET views:page:2)
Example: Push a value onto a list:
redis-cli LPUSH shoppingList bananas
Example: Blocking wait for the next item in a list (the trailing 0 means block indefinitely) - use RPOP for non-blocking:
item=$(redis-cli BRPOP shoppingList 0)
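Example: A very rough worker-loop sketch for the queue pattern mentioned above (the queue names jobs and results, and the process command, are placeholders):
while true; do
    # BRPOP blocks until a job arrives and prints the key name followed by
    # the value, so keep only the last line
    job=$(redis-cli BRPOP jobs 0 | tail -1)
    result=$(process "$job")          # 'process' stands in for your own work
    redis-cli LPUSH results "$result"
done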
Also, there are bindings for Python, C/C++, Java, Ruby, PHP etc. So you can "inject" dummy/test data into, or extract debug data from a running Python program using the redis-cli tool even on a different computer.
Use environment variables to store your data.
ABC="abc"; export ABC
And the other question is how to make environment variables persistent after a reboot.
Depending on your shell, you may have a different file in which to persist the variables.
If using bash, run this command, containing the variable's last value, before reboot:
echo 'export ABC="hello"' >> $HOME/.bashrc
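If the script itself needs to update the value between runs, a small file-based variant is also possible. A rough sketch, assuming ~/.myvars is a scratch file you are happy to create:
# load previous values, if the file exists yet
[ -f "$HOME/.myvars" ] && . "$HOME/.myvars"
# use and update a variable
COUNT=$(( ${COUNT:-0} + 1 ))
# write it back for the next run
printf 'COUNT=%s\n' "$COUNT" > "$HOME/.myvars"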
I think this is a good time to be using an SQL Database. It's more scalable and functional than having a fileful of "persistent variables".
It may require a little more setup, and I admit it isn't "simpler" per se, but it will probably be worth it in the long run. You will be able to do more with your variables, and that may make your future scripts simpler.
I recommend going to YouTube and finding a simple tutorial on how to set up a local MySQL or MSSQL instance. There is a guy, Mike Dane, who makes really beginner-friendly tutorials. Try searching "GiraffeAcademy SQL Beginner" and see if that helps you.
dba_scheduler_job_run_details is capable of keeping dbms_output.put_line lines in its output column, and it does so when the job runs and exits normally. However, when I explicitly call dbms_scheduler.stop_job, all the output is lost.
How can I avoid this behaviour?
I don't think you can avoid this, in the same way that if you killed off a SQL*Plus session that was running a procedure, you would not see the output either: DBMS_OUTPUT simply puts data into an array, and the calling environment is then responsible for retrieving that information and putting it somewhere (e.g. outputting it to the terminal). You kill/stop the calling environment... and there's nothing left to go and get that output.
A better option would be not to rely on DBMS_OUTPUT as your debugging/instrumentation mechanism. Much better options exist for capturing debugging information, including the ability to view it from other sessions in real time.
Check out Logger at https://github.com/OraOpenSource/Logger. It is in common use by many in the Oracle community.
I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files and how much memory it would clean up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I’ve stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit (0);
echo "working<br />";
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds, then displays only the first message “working” and not the following message “still working”.
It doesn’t seem to be a system/php memory problem as it is coming back after 60 seconds and the server respects my set_time_limit() as I’ve had to use that for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?
From the CI user guide, get_dir_file_info() is described as:
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you are saying that you have 100k files, then the best way to do it is to cut it into two steps:
First: use get_filenames('path/to/directory/') to retrieve all your files without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info, as you might not need all the file information immediately. This can be done on a file-name click or at some other relevant point.
The idea here is not to force your server to deal with a large amount of processing while in production; that would kill two things: responsiveness and performance. But the idea is clear.
Every once in a while I have to fire up a GUI program from my terminal session to do something. Usually it is Chrome to display some HTML file, or some similar task.
These programs, however, throw warnings all over the place and it can actually become ridiculous to try to type anything, so I always wanted to redirect stderr/stdout to /dev/null.
While ($PROGRAM &) &>/dev/null seems okay, I decided to create a simple Bash function for it so I don't have to repeat myself every time.
So for now my solution is something like this:
#
# silly little function
#
gui ()
{
if [ $# -gt 0 ] ; then
("$@" &) &>/dev/null
else
echo "missing argument"
fi
}
#
# silly little example
#
alias google-chrome='gui google-chrome'
So what I'm wondering about is:
Is there a way without an endless list of aliases that's still snappy?
Are there different strategies to accomplish this?
Do other shells offer different solutions?
In asking these questions I want to point out that your strategies and solutions might deviate substantially from mine. Redirecting output to /dev/null and aliasing it was the only way I knew, but there might be entirely different ways that are more efficient.
Hence this question :-)
As others have pointed out in the comments, I think the real problem is finding a way to distinguish between gui and non-gui apps on the command line. As such, the cleanest way I could think of is to put this part of your script:
#!/bin/bash
if [ $# -gt 0 ] ; then
("$@" &) &>/dev/null
else
echo "missing argument"
fi
into a file called gui, then chmod +x it and put it in your ~/bin/ (make sure ~/bin is in your $PATH). Now you can launch gui apps with:
gui google-chrome
on the prompt.
Alternatively, you can do the above, then make use of bind:
bind 'RETURN: "\e[1~gui \e[4~\n"'
This will allow you to just do:
google-chrome
on the prompt and it would automatically prepend gui to google-chrome.
Or, you can bind the above action to F12 instead of RETURN with
bind '"\e[24~": "\e[1~gui \e[4~\n"'
To separate what you want launched with gui vs. non-gui.
More discussion on binding here and here.
These alternatives offer you a way out of endless aliases; a mix of:
Putting gui in your ~/bin/, and
Binding use of gui to F12 as shown above
seems the most ideal (albeit hacky) solution.
Update - #Enno Weichert's Resultant Solution:
Rounding this solution out ...
This would take care of aliases (in a somewhat whacky way though) and different escape encodings (in a more pragmatic rather than exhaustive way).
Put this in $HOME/bin/quiet
#!/bin/bash -i
if [ $# -gt 0 ] ; then
# Expand if $1 is an alias
if alias "$1" >/dev/null 2>&1 ; then
set -- $(alias "$1" | awk -F "['']" '{print $2}') "${@:2}"
fi
("$@" &) &>/dev/null
else
echo "missing argument"
fi
And this in $HOME/.inputrc
#
# Bind prepend `quiet ` to [ALT][RETURN]
#
# The condition is of limited use actually but serves to separate
# TTY instances from Gnome Terminal instances for me.
# There might very well be other VT emulators that ID as `xterm`
# but use totally different escape codes!
#
$if $term=xterm
"\e\C-j": "\eOHquiet \eOF\n"
$else
"\e\C-m": "\e[1~quiet \e[4~\n"
$endif
Although it is an ugly hack, I sometimes use nohup for similar needs. It has the side effect of redirecting the command's output, and it makes the program independent of the terminal session.
For the case of running GUI programs in a desktop environment there is only a small risk of resource leaks, as the program will end with the window manager session anyway. Nevertheless it should be taken into account.
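A minimal sketch of that approach, reusing the gui name from the question:
gui () {
    # nohup detaches the command from the terminal; without an explicit
    # redirection it would write to nohup.out, so send everything to /dev/null
    nohup "$@" >/dev/null 2>&1 &
}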
There is also the option of opening another terminal session. In Gnome, for instance, you can use gnome-terminal --command 'yourapp'.
But this will result in opening many useless terminal windows.
First, it's a great idea to launch GUI applications from the terminal, as this reduces mouse usage. It's faster, and more convenient in terms of options and arguments. For example, take the browser. Say you have the URL in the clipboard, ready to paste. Just type, say, ice (for Iceweasel) and hit Shift-Insert (to paste) and Enter. Compare this to clicking an icon (possibly in a menu), waiting for the window to load (even worse if there is a start-up page), then clicking the URL bar (or hitting Ctrl-L), then Ctrl-V... So I understand your desire for this to work.
But, I don't see how this would require an "infinite" list of aliases and functions. Are you really using that many GUI applications? And even so, aliases are one line - functions, which may be more practical for handling arguments, are perhaps 1-5 lines of (sparse) code. And, don't feel you need to set them up once and for all - set them up one by one as you go along, when the need arises. Before long, you'll have them all.
Also, if you have a tabbed terminal, like urxvt (there is a Perl extension), you'd benefit from moving from "GUI:s" to "CLI:s": for downloading, there's rtorrent; for IRC, irssi; instead of XEmacs (or emacs), emacs -nw; there is a CLI interface to vlc for streaming music; for mail, Emacs' rmail; etc. etc.! Go hunt :)
(The only sad exception I've run across that I think is a lost cause is the browser. Lynx, W3M, etc., may be great from a technical perspective, but that won't always even matter, as modern web pages are simply not designed with those, text-only browsers in mind. In all honesty, a lot of those pages look a lot less clear in those browsers.)
Hint: To get the most out of a tabbed terminal, you'll want the "change tab" shortcuts close at hand (e.g., Alt-J for previous tab and Alt-K for next tab, not the arrow keys that'll make you reach).
Last, there is one solution that'll circumvent this problem, and that is to launch the "GUI:s" as background processes (with &) in your ~/.xinitrc (including your terminal emulator). Not very flexible, but great for the stuff you always use, every time you use your computer.
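A rough sketch of that last idea - the particular programs and window manager are only examples:
# ~/.xinitrc - launch the things you always use, backgrounded with &
urxvt &
google-chrome >/dev/null 2>&1 &
exec i3    # the window manager runs in the foreground; the session ends when it exits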
Ok, so I kept thinking that the one function should be enough. No aliases, no bashisms, no nonsense. But it seemed to me that the only way to do that without possibly affecting regular use, such as expansions and completions, was to put the function at the end of the command. This is not as easy as one might first assume.
First I considered a function tacked onto the end in which I would call, say, printf "%b" "\u" in order to cut the current line, plug in another printf, paste it back in, and a quote or two at the beginning, then do what little is needed at the end. I don't know how to make this work though, I'm sorry to say. And even if I did, I couldn't hope for any real reliability/portability with this method due to the varying ways shells interpret escape sequences, not to mention the terminal emulators they run in. Perhaps stty could offer a way forward along these lines, but if so, you won't find it here... now, anyway.
I eventually instead resorted to actually copying the current command's /proc/{PID}/cmdline to a variable, then (shamefully) killing it entirely, and finally wrapping it as I pleased. On the plus side, this is very easily done, very quickly done (though I can imagine arguing its 'efficiency' either way), and seems mostly to work, regardless of the original input, whether that be an alias, a variable, a function, etc. I believe it is also POSIX portable (though I can't remember if I need to specify the kill SIGNALS by name for POSIX or not), and is definitely no nonsense.
On the other hand, its elegance certainly leaves much to be desired, and, though it's probably not worth worrying about, it does waste entirely a single PID. And it doesn't stop completely the shell spam; that is, in my shell I have enabled background jobs reporting with set and so, when first run, the shell kindly informs me that I've just opened and wasted a PID in two lines. Also, because I copy the ../cmdline instead of interfacing directly with the 0 file descriptor, I expect pipes and ; and etc to be problematic. This I can, and likely will, fix myself very soon.
I will fix that, that is, if I cannot find a way to instead make use of, as I suspect can be done, SIGTSTP + SIGCONT by first suspending the process outside a subshell and then, within one, continuing it after redirecting the subshell's output. It seems that this works unreliably for reasons I haven't yet discovered, but I think it's promising. Perhaps nohup and a trap (to effectively rehup it, as it were) is what is needed, but I'm not really sure how to put those together either...
Anyway, without further ado, my semi-aborted, backwards gui launcher:
% _G() { (
_gui_cmd="$(tr '\0' ' ' </proc/"$!"/cmdline )" ;
kill -9 "$!" ;
exec eval "${_gui_cmd} &"
) &>/dev/null
}
% google-chrome-beta --disk-cache-dir="/tmp/cache" --disk-cache-size=100000000 &_G
[1] 2674
[1] + 2674 killed google-chrome-beta --disk-cache-dir="/tmp/cache" --disk-cache-
%
So another problem one might encounter if attempting to do something similar is the order of expansion the shell assumes. Without some trick such as mine you're sure to have some serious difficulty expanding your function before the prior command snags it as an argument. I believe I have guarded against this without unnecessary redundancy by simply tacking my _G function call onto the original command's & background instruction. Simply add &_G to the tail end of the command you wish to run and, well, good luck.
-Mike
P.S. Ok, so writing that last sentence makes me think of what might be done with tee.