Bash to multithread multiple files - bash

I have about 10 files (the number varies), each with around 2000 lines.
I have a function that will read each line of a document and do something with it.
I am trying to process all of the files at the same time with a multithreaded process. I am trying to use xargs for this, but I'm not sure how it would handle an indeterminate number of files (they all have similar names, though: segmentA, segmentB, etc.).
Any suggestions?
Edit:
For further clarification: the function takes a document, reads each line, and sends each line to a Solr server.
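A minimal sketch of the xargs approach, assuming the per-file work lives in a hypothetical script called process_segment.sh that reads its argument line by line and posts each line to Solr:
# A hedged sketch: process up to 10 segment files concurrently, one file
# per process. process_segment.sh is a hypothetical stand-in for the
# function that reads a file and sends each line to Solr.
find . -maxdepth 1 -name 'segment*' -print0 | xargs -0 -n 1 -P 10 ./process_segment.sh
With GNU xargs, -P 10 caps the number of concurrent processes and -n 1 hands each invocation a single file; -P 0 would start as many processes as possible.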

Related

Monitor A File For Additions And Get Last Added Line

I'm having trouble monitoring a file for changes. I need to be able to know when a file changes, and when it does, I need the new line that was added. I intend to parse each line and find ones that match certain criteria, and act on information in those lines. I know the expected number of matching lines ahead of time, but I do not know how many lines in total will be added to the file, or where the matching lines will be.
I've tried 2 packages so far, to no avail.
fsnotify/fsnotify
As far as I can tell, fsnotify can only tell me when a file is modified, not what the details of the modification were. Since I need to know exactly what was added to the file, this is no good for me.
(As a side-question, can this be run in a loop? The example that I tried exited after just one modification. I need to monitor for multiple modifications.)
hpcloud/tail
This package tries to mimic the Unix tail command, but it seems to have its own issues. The output that I get includes timestamps and other data - I just want the added line, nothing else. Also, it seems to think a file has been modified multiple times, even when it's just one edit. Further, the deal breaker here is that it does not output the last line if the line was not followed by a newline character.
Delegating to tail
I came across this answer, which suggests delegating this work to the tail command itself, but I need this to work cross-platform (specifically macOS, Linux, and Windows). I don't believe an equivalent command exists on Windows.
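For reference, on macOS and Linux the tail side of that delegation would boil down to something like the sketch below, which is exactly the part I can't rely on for Windows (the ERROR pattern is just a placeholder):
# A rough sketch of delegating to tail on macOS/Linux only.
# -F follows the file across rotation, -n 0 skips the existing contents.
tail -F -n 0 /path/to/watched.log | while IFS= read -r line; do
    case "$line" in
        *ERROR*) printf 'matched: %s\n' "$line" ;;   # placeholder criteria
    esac
done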
How do I go about tackling this?
@user2515526,
Usually a changed diff is out of scope for a file watcher's functionality, because, you know, you could change an image, and the watcher would need to keep track of several MB of diff in memory, and what if there are thousands of files?
However, as bad as it sounds, this may be exactly the way you want to implement this (it depends on your app, etc., but it could be fine for text files), i.e. keeping a map of diffs (one diff per file) since the last modification. I can't say I like it, but it sounds like fsnotify has no support for the changes/diffs that you need.
Also, regarding your question about running in a loop, maybe you can get some hints here: https://github.com/kataras/iris/blob/8370d76910cdd8de043753ed81ae080eae8dc798/utils/file.go
It's a framework for building a server that watches for TypeScript file changes, so it sounds similar to your case/question.
Cheers,
-D

How to optimize the file processing?

I'm working on a Perl/CGI script which reads an 8MB file with over 100k lines and displays it in chunks of 100 lines (using pagination).
Which one of the following will be faster?
Storing the entire input file into an array and extracting 100 lines for each page (using array slicing)
my @extract = @main_content[101..200];
or
For each page, using the sed command to extract any 100 lines that the user wants to view.
sed -n '101,200'p filename
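(For the second option, the sed range would be computed from the requested page, roughly like this sketch, where $page and $file stand in for whatever the CGI script receives:)
# A rough sketch only; $page and $file are placeholders for the values
# the CGI script would supply (100 lines per page).
start=$(( (page - 1) * 100 + 1 ))
end=$(( page * 100 ))
sed -n "${start},${end}p" "$file"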
If you really want performance then don't use CGI; try using something that keeps a persistent copy of the data in memory between requests. 8 MB is tiny these days, but loading it for every request would not be sensible, nor would scanning the whole file. mod_perl was the older way of doing this: it was a Perl interpreter embedded in the web server. The newer way is to use Catalyst or Dancer; instructions for those are outside the scope of this reply. You could get away with using CGI if this was only to be used occasionally and was password-protected to limit use.

In Haskell, in Windows 7, can I read a file that is already write-locked by another program?

I have a 3rd party program that is running continuously, and is logging events in a text file. I want to write a small Haskell program that reads the text file while the other program is running and warns me when certain events are logged.
I looked around and it seems as if, for Windows, readFile allows a single writer OR multiple readers - it does not allow a single writer and multiple readers at the same time. As I understand it, this is to avoid side effects like the write changing the file after/during reads.
Is there some way for me to work around this constraint on locks? The log file is only appended to, and I am only looking for specific rows in the file, so I really don't mind if I miss the most recent write; I am interested in eventual consistency and will keep checking the file.

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast, I can parallelize it with xargs. But I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one and then cat them later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is this always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I would have to deal with the temporaries and clean up. So I was wondering if there is a utility that does that.
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
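A minimal sketch of that, assuming the inputs live under a directory such as /data (a placeholder path):
# A minimal sketch: run wc on every file, keep the output in input order (-k),
# and append everything to a single result file. /data is a placeholder path.
find /data -type f -print0 | parallel -0 -k wc >> results.txt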
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions and some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs, as long as the order of output is not important.
However, note the following statement on its page:
prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.

How do you keep wget from commingling download data when running multiple concurrent instances?

I am running a script that in turn calls another script multiple times in the background with different sets of parameters.
The secondary script first does a wget on an FTP URL to get a listing of the files at that URL. It outputs that to a unique filename.
Simplified example:
Each of these is being called by a separate instance of the secondary script running in the background.
wget --no-verbose 'ftp://foo.com/' -O '/downloads/foo/foo_listing.html' >foo.log
wget --no-verbose 'ftp://bar.com/' -O '/downloads/bar/bar_listing.html' >bar.log
When I run the secondary script once at a time, everything behaves as expected. I get an html file with a list of files, links to them, and information about the files the same way I would when viewing an ftp url through a browser.
Continued simplified one at a time (and expected) example results:
foo_listing.html:
...
foo1.xml ...
foo2.xml ...
...
bar_listing.html:
...
bar3.xml ...
bar4.xml ...
...
When I run the secondary script many times in the background, some of the resulting files, although they have the correct base URL (the one that was passed in), list files from a different run of wget.
Continued simplified multiprocessing (and actual) example results:
foo_listing.html:
...
bar3.xml ...
bar4.xml ...
...
bar_listing.html
correct, as above
Oddly enough, all other files I download seem to work just fine. It's only these listing files that get jumbled up.
The current workaround is to put in a 5 second delay between backgrounded processes. With only that one change everything works perfectly.
Does anybody know how to fix this?
Please don't recommend using some other method of getting the listing files or not running concurrently. I'd like to actually know how to fix this when using wget in many backgrounded processes if possible.
EDIT:
Note:
I am not referring to the status output that wget spews to the screen. I don't care at all about that (that is actually also being stored in separate log files and is working correctly). I'm referring to the data wget is downloading from the web.
Also, I cannot show the exact code that I am using, as it is proprietary to my company. There is nothing "wrong" with my code, as it works perfectly when putting in a 5-second delay between backgrounded instances.
Log a bug with GNU, use something else for now whenever possible, and put in time delays between concurrent runs. Possibly create a wrapper for getting FTP directory listings that only allows one to be retrieved at a time.
:-/
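Along the lines of the wrapper idea above, a hedged sketch that serializes only the listing fetches with flock(1) (assuming util-linux flock is available; the lock file path and the variables are placeholders):
# A hedged sketch: only one listing fetch runs at a time; everything else
# stays concurrent. $listing_url, $listing_file and $log_file are
# placeholders, and /tmp/ftp_listing.lock is an arbitrary lock file.
(
    flock 9
    wget --no-verbose "$listing_url" -O "$listing_file" >"$log_file" 2>&1
) 9>/tmp/ftp_listing.lock
Each backgrounded instance blocks only for the short listing download, so the bulk of the transfers still runs in parallel.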

Resources