I have some pretty huge global SCSS files (3000+ lines) which are impossible to scan manually, so I have to find entries via searching. However, because there is heavy use of nesting and &-selectors, searching can be very difficult. I've been toying with the idea of de-nesting the entire global file to aid searching.
I don't want to do it manually. How can I automate this process?
I'm on Windows, writing a C++ executable that deletes and replaces some files in a directory it created during an earlier run. Maybe I'm a little panicky, but since my directory and file arguments for the deletions are generated by parsing an input file's path, I worry that an oversight in the parsing could produce a much higher or entirely different directory and end up systematically deleting unrelated files.
Is there a way to limit my executable's reach to only include write/delete access to files it has created during earlier runs, while retaining read access to everything else? Or at least provide a little extra peace of mind that, even if I really mis-speak my strings to DeleteFileA() and RemoveDirectoryA(), I'll avoid causing catastrophic damage?
It doesn't need to be a restriction on the entire executable; it's good enough if it limits the delete and remove calls in some way.
I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text that contains the data (its length varies line by line), there is a CR followed by multiple spaces and then a LF. The spaces pad each line to the same width.
Line1,Data 1,Data 2,Data 3,4,50D20202020200A
Line2,Data 11,Data 21,Data 31,41,510D2020200A
Line3,Data12,Data22,Data 32,42,520D202020200A
I am about to handle this with a stream reader/writer in C#, but 40 files come in each month, and if there is a way to convert them all at once instead of one line at a time, I would rather do that.
Any thoughts?
Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
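As a rough sketch of that setup, combined with stripping the CR + padding spaces + LF endings described in the question, the C# below watches an inbound folder, writes a cleaned copy to a processed folder and archives the original. The folder paths and the *.csv filter are placeholders, and a production version would need real error handling and logging.

using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading;

class InboundPreprocessor
{
    const string Inbound   = @"C:\data\inbound";    // placeholder locations
    const string Processed = @"C:\data\processed";
    const string Archive   = @"C:\data\archive";

    static void Main()
    {
        Directory.CreateDirectory(Processed);
        Directory.CreateDirectory(Archive);

        using var watcher = new FileSystemWatcher(Inbound, "*.csv");
        watcher.Created += (_, e) => Preprocess(e.FullPath);
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching {0}; press Enter to stop.", Inbound);
        Console.ReadLine();
    }

    static void Preprocess(string path)
    {
        // The sender may still have the file locked when the event fires,
        // so retry briefly before giving up.
        for (int attempt = 0; attempt < 10; attempt++)
        {
            try
            {
                string text = File.ReadAllText(path);

                // CR + one or more padding spaces + LF  ->  ordinary CRLF
                string cleaned = Regex.Replace(text, @"\r +\n", "\r\n");

                string name = Path.GetFileName(path);
                File.WriteAllText(Path.Combine(Processed, name), cleaned);
                File.Move(path, Path.Combine(Archive, name));
                return;
            }
            catch (IOException)
            {
                Thread.Sleep(500);
            }
        }
    }
}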
But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need; you can find them on GitHub and other places. Two examples are NPM replace and zzzprojects findandreplace.
For a quick and dirty approach that lets you preview all the changes while you develop a more robust solution, many text editors can find and replace across multiple files. I've used that approach successfully in the past. For example, Notepad++'s Find in Files dialog lets you use a regex to remove or change whatever you like in all files matching defined criteria (e.g. replacing \r +\n with \r\n to strip the padded line endings).
We have a PDF document processing system, implemented in AppleScript (we call the scripts from the shell using osascript). In some of the scripts, we call Acrobat Preflight Droplets from the AppleScript.
This usually works without problems. However, in some cases where the processed document is big and/or complex, the droplet returns control to the script before the report is written and the document is moved to the "success" or "failure" folder. The consequence is that the process continues without the moved file and eventually fails.
The workaround so far has been to add a delay after those droplet calls. This does help, but it is a waste of time for small documents, and there will always be a document big and complex enough to take longer than the delay.
We also found out that the time needed to finish writing the report and move the document depends on the speed of the system (which was to be expected…).
A better workaround would be to calculate the delay from the document size, its number of pages, and a machine-dependent parameter. Document size and number of pages are no big deal; they can be retrieved in the AppleScript.
The problem is the machine-dependent parameter, which can be determined experimentally. But how do I make that parameter available to all the scripts needing it?
Incorporating it into the scripts is not an option, because we have a number of systems installed, and if we did that, we'd end up with a maintenance nightmare. Passing it as an argument in the initial system call is also not possible, because there are many such calls, which again would lead to a maintenance nightmare.
So, is there a way to set up a place where that machine parameter can be stored and easily read from any AppleScript, no matter how the script itself is called?
Thanks a lot for your advice.
You might find the Property List Suite in System Events useful. It’s a standard means of storing and then retrieving such information. Property List files themselves are simply XML files, so you can even create them outside of AppleScript and then read them within your scripts.
There’s a description with examples at https://apple.stackexchange.com/questions/58007/how-do-i-pass-variables-values-between-subsequent-applescript-runs-persistent
A simple suggestion, if you only have one parameter to keep track of, would be to just have a text file in a known location on each machine. The only content of the text file would be the machine parameter. I like to use the Application Support folder for this kind of thing.
Assuming your machine parameter is CPU speed, you can save a text file in /Library/Application Support/Preflight Scripts/machinecpu.txt with the contents:
2.4
Then in AppleScript, you would just read the text file:
set machineParam to read file "Macintosh HD:Library:Application Support:Preflight Scripts:machinecpu.txt"
In order not to block the reactor I would like to read files asynchronously, but I've found no obvious way of doing it using EventMachine. I've tried a few different approaches, but none of them feels right:
Just read the file, it'll block the reactor, but what the hell, it's not that slow (unless it's a big file, and then it definitely is).
Open the file for reading and read a chunk on each tick (but how much to read? too much and it'll block the reactor, too little and reading will get slower than necessary).
EM.popen('cat some/file', FileReader) feels really weird, but works better than the alternatives above. In combination with the LineAndTextProtocol it reads lines pretty swiftly.
EM.attach, but I haven't found any examples of how to use it, and the only thing I've found on the mailing list is that it's deprecated in favour of…
EM.watch, which I've found no examples of how to use for reading files.
How do you read files within an EventMachine reactor loop?
EM.attach/watch cannot be used on files, as select/epoll on a disk-based file descriptor will always return readable.
Ultimately, it depends on what you're trying to do. If it's a small file, just File.read it. If it is larger, you can read small chunks over time. For example, EM::FileStreamer does this to send large files over the network.
Another common use-case is to tail a file and read in new contents when it changes. This can be achieved using EM.watch_file: http://github.com/jordansissel/eventmachine-tail
I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500 GB and was formatted using the default operating system options.
I'm finding that the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do anything again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below, I was amazed how much of a difference this made. Cheers :-)
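For anyone wanting to do something similar, a minimal C# sketch of that kind of mover might look like the following. The source and target folders are placeholders (not the paths actually used), and it simply sweeps the output folder every few seconds rather than using a file-system watcher.

using System;
using System.IO;
using System.Threading;

class TileMover
{
    const string Source = @"C:\tiles\output";  // placeholder for the GMapCreator output folder
    const string Target = @"C:\tiles\sorted";  // placeholder for the destination root

    static void Main()
    {
        while (true)
        {
            foreach (string file in Directory.GetFiles(Source, "*.gif"))
            {
                string name = Path.GetFileNameWithoutExtension(file); // e.g. "abcdefg"
                string ext  = Path.GetExtension(file);                // ".gif"
                if (name.Length < 2) continue;

                // One directory level per character; the last character becomes
                // the file name: abcdefg.gif -> a\b\c\d\e\f\g.gif
                string dir = Target;
                for (int i = 0; i < name.Length - 1; i++)
                    dir = Path.Combine(dir, name[i].ToString());

                Directory.CreateDirectory(dir);
                string dest = Path.Combine(dir, name[name.Length - 1] + ext);

                try { File.Move(file, dest); }
                catch (IOException) { /* probably still being written; retry next sweep */ }
            }
            Thread.Sleep(5000); // sweep every few seconds while GMapCreator runs
        }
    }
}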
There are several things you could/should do:
Disable automatic NTFS short file name generation (e.g. with fsutil behavior set disable8dot3 1)
Or restrict file names to the 8.3 pattern (e.g. i0000001.jpg, ...)
In any case, try making the first six characters of the filename as unique/different as possible
If you use the same folder over and over again (say adding files, removing files, re-adding files, ...)
Use contig (the Sysinternals single-file defragmenter) to keep the directory's index file as unfragmented as possible
Especially when removing many files, consider using the folder remove trick to reduce the directory index file size
As already posted, consider splitting the files up into multiple directories.
E.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their names, e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata overhead and random reads required to fetch a file are quite high, and offer no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
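This is not Facebook's actual implementation, just a minimal C# sketch of the core idea: many small blobs appended to one large store file, with an in-memory index mapping each key to an offset and length, so a read costs a single seek plus a single read. The file name and keys below are made up, and a real store would also persist the index and add per-needle headers and checksums.

using System;
using System.Collections.Generic;
using System.IO;

class HaystackStore : IDisposable
{
    private readonly FileStream _store;
    private readonly Dictionary<string, (long Offset, int Length)> _index =
        new Dictionary<string, (long Offset, int Length)>();

    public HaystackStore(string path)
    {
        _store = new FileStream(path, FileMode.OpenOrCreate,
                                FileAccess.ReadWrite, FileShare.Read);
    }

    public void Put(string key, byte[] data)
    {
        _store.Seek(0, SeekOrigin.End);       // always append
        long offset = _store.Position;
        _store.Write(data, 0, data.Length);
        _index[key] = (offset, data.Length);  // "needle" location kept in memory
    }

    public byte[] Get(string key)
    {
        var (offset, length) = _index[key];
        var buffer = new byte[length];
        _store.Seek(offset, SeekOrigin.Begin);    // one seek...
        int read = 0;
        while (read < length)                     // ...then read the needle in full
            read += _store.Read(buffer, read, length - read);
        return buffer;
    }

    public void Dispose() => _store.Dispose();
}

class Demo
{
    static void Main()
    {
        using var store = new HaystackStore("photos.haystack"); // made-up file name
        store.Put("img001", new byte[] { 1, 2, 3 });
        Console.WriteLine(store.Get("img001").Length); // prints 3
    }
}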