Java 7 parallel search for files recursively in a folder - parallel-processing

I would like to use the Visitor API in Java 7 to search for files recursively in a folder. Since I will be searching big folders, with 100,000+ files spread across many subfolders, I would like to do this in parallel.
But I can't, for example, spawn a thread for each folder. Maybe Fork/Join could be an idea, but from what I've understood, FJ is usually used when you know the data up front; for example, you have a given array and you want to process chunks of 5 elements from it, so divide and conquer works very well in that case.
So can you please share an idea that would let me search recursively for files fast (it must be parallel) and also allow cancellation if the user desires it.
Thank you,
Ryu

I bet there will be no gain from a parallel search on a single disk drive; the disk access/read time is significantly larger than any name comparisons you could make.
Did you actually write the code? Did you test it? Did you profile it? What have you deduced from profiling?
Remember that the first rule of optimization is: don't do it.

You cannot use Files.walkFileTree for that (I assume that is what you mean when you say "the Visitor API, in Java 7"); you have to implement the directory traversal yourself to be able to parallelize it.
Fork/join does, actually, fit this problem very well. There is even a relevant example in the article Fork and Join: Java Can Excel at Painless Parallel Programming Too!, whose example program "count[s] the occurrences of a word in a set of document[s]" by traversing the files in a directory and all of its subdirectories (recursively).
The author provides some seemingly positive speedup measurements in the discussion section, but you should consider what Dariusz said about the problem probably being IO bound rather than CPU bound (i.e., just throwing lots of threads at it will not produce any further speedup beyond some, probably low, number of threads). It is surprising, to me at least, that the example program from the article was faster with 12 threads than with 8.
Cancellation, as far as I can see, is an orthogonal issue, and can be implemented in some standard way (e.g., polling a volatile flag).
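To make that concrete, here is a minimal, hedged sketch of such a hand-rolled fork/join traversal with cooperative cancellation; the class name FileSearchTask, the name-matching rule, and the use of an AtomicBoolean in place of a bare volatile flag are illustrative choices, not code from the article.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Collects files whose names contain a given substring; forks one subtask per subdirectory.
class FileSearchTask extends RecursiveTask<List<File>> {
    private final File dir;
    private final String namePart;
    private final AtomicBoolean cancelled;   // polled cooperatively, set from the UI thread to cancel

    FileSearchTask(File dir, String namePart, AtomicBoolean cancelled) {
        this.dir = dir;
        this.namePart = namePart;
        this.cancelled = cancelled;
    }

    @Override
    protected List<File> compute() {
        List<File> matches = new ArrayList<>();
        if (cancelled.get()) return matches;          // user asked to stop
        File[] entries = dir.listFiles();
        if (entries == null) return matches;          // unreadable directory
        List<FileSearchTask> subTasks = new ArrayList<>();
        for (File entry : entries) {
            if (entry.isDirectory()) {
                FileSearchTask sub = new FileSearchTask(entry, namePart, cancelled);
                sub.fork();                           // descend into the subdirectory in parallel
                subTasks.add(sub);
            } else if (entry.getName().contains(namePart)) {
                matches.add(entry);
            }
        }
        for (FileSearchTask sub : subTasks) {
            matches.addAll(sub.join());
        }
        return matches;
    }
}

// Usage: new ForkJoinPool().invoke(new FileSearchTask(new File("/some/root"), "report", new AtomicBoolean(false)));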

Related

Parallel counting using a functional approach and immutable data structures?

I have heard, and bought, the argument that mutation and state are bad for concurrency, but I struggle to understand what the correct alternatives actually are.
For example, look at the simplest of all tasks: counting, e.g. word counting in a large corpus of documents. Accessing and parsing the documents takes a while, so we want to do it in parallel using k threads or actors or whatever the abstraction for parallelism is.
What would be the correct, but also practical, purely functional way to do this using immutable data structures?
The general approach to analyzing data sets in a functional way is to partition the data set in some way that makes sense; for a document you might cut it up into sections based on size, i.e. four threads means the document is sectioned into four pieces.
Each thread or process then executes its algorithm on its section of the data set and generates an output. All the outputs are gathered together and then merged. For word counts, for example, each section's word counts are sorted by word, and then the lists are stepped through looking for the same words. If a word occurs in more than one list, its counts are summed. In the end, a new list with the summed counts of all the words is output.
This approach is commonly referred to as map/reduce. The step of converting a document into word counts is a "map" and the aggregation of the outputs is a "reduce".
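A small sketch of that section-and-merge idea in Java (the sample sections, class name, and word-splitting rule are illustrative): each section is mapped to its own count map, and the maps are merged by summing, so no thread ever mutates shared state.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    // "Map" step: count the words in one section, producing a fresh map instead of touching shared state.
    static Map<String, Long> countSection(String section) {
        return Arrays.stream(section.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // "Reduce" step: merge two per-section results by summing the counts of identical words.
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        return Stream.concat(a.entrySet().stream(), b.entrySet().stream())
                     .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Long::sum));
    }

    public static void main(String[] args) {
        List<String> sections = Arrays.asList("to be or not to be", "be happy", "not now");
        Map<String, Long> totals = sections.parallelStream()     // each section is counted independently
                                           .map(WordCount::countSection)
                                           .reduce(Map.of(), WordCount::merge);
        System.out.println(totals);   // counts: be=3, to=2, not=2, or=1, happy=1, now=1 (map order unspecified)
    }
}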
In addition to eliminating the overhead of guarding against data conflicts, a functional approach lets the compiler optimize more aggressively. Not all languages and compilers do this, but because the compiler knows its variables are not going to be modified by an outside agent, it can apply transforms to the code to increase its performance.
In addition, functional programming lets systems like Spark dynamically create threads, because the boundaries of change are clearly defined. That's why you can write a single function chain in Spark and then just throw servers at it without having to change the code. Pure functional languages can do this in a general way, making every application intrinsically multi-threaded.
One of the reasons functional programming is "hot" is because of this ability to enable multiprocessing transparently and safely.
Mutation and state are bad for concurrency only if mutable state is shared between multiple threads for communication, because it is very hard to reason about impure functions and methods that silently trash some shared memory in parallel.
One possible alternative is using message passing for communication between threads/actors (as is done in Akka), and building ("reasonably pure") functional data analysis frameworks like Apache Spark on top of it. Apache Spark is known to be rather suitable for counting words in a large corpus of documents.
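For the message-passing alternative, here is a rough Java sketch (a plain queue standing in for an Akka-style mailbox, not actual Akka code): each worker counts its own document locally and sends an immutable snapshot as a message, and only the receiver aggregates.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Workers never share mutable state; they communicate only by sending immutable result messages.
public class MessagePassingCount {
    public static void main(String[] args) throws InterruptedException {
        List<String> docs = List.of("to be or not to be", "be happy");
        BlockingQueue<Map<String, Integer>> mailbox = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (String doc : docs) {
            pool.submit(() -> {
                Map<String, Integer> counts = new HashMap<>();          // local to this worker only
                for (String w : doc.split("\\s+")) counts.merge(w, 1, Integer::sum);
                mailbox.add(Map.copyOf(counts));                        // "send" an immutable message
            });
        }

        Map<String, Integer> totals = new HashMap<>();                  // touched by the receiving thread only
        for (int i = 0; i < docs.size(); i++) {
            mailbox.take().forEach((w, c) -> totals.merge(w, c, Integer::sum));
        }
        pool.shutdown();
        System.out.println(totals);   // counts: be=3, to=2, or=1, not=1, happy=1 (map order unspecified)
    }
}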

How to scale an algorithm/service/system with multiple machines?

I had some interviews recently, and it's quite normal to be asked some scaling problems.
For example: you have a long list of words (a dict) and a list of characters as the inputs; design an algorithm to find the shortest word in the dict that contains all the chars in the char list. Then the interviewer asked how to scale the algorithm to multiple machines.
Another example: you have designed a traffic light control system for an intersection in a city. How do you scale this control system to the whole city, which has many intersections?
I always have no idea about this kind of "scaling" problem; any suggestions and comments are welcome.
Your first question is completely different from your second question. In fact, the control of traffic lights in cities is a local operation: there are boxes nearby that you can tune and optical sensors on top of the lights that detect waiting cars. I guess if you need to optimize for some objective function of flow, you can route information to a server process, and then it becomes a question of how to scale that server process over multiple machines.
I am no expert in the design of distributed algorithms, which spans a whole field of research, but the questions in undergrad interviews usually are not that specialized. After all, they are not interviewing a graduate student specializing in those fields. Take your first question as an example; it is quite generic indeed.
Normally these questions involve multiple data structures (several lists and hashtables) interacting (joining, iterating, etc.) to solve a problem. Once you have worked out a basic solution, scaling is basically copying that solution onto many machines and running them on partitions of the input at the same time. (Of course, in many cases this is difficult if not impossible, but interview questions won't be that hard.)
That is, you have many identical workers splitting the input workload and working at the same time, but those workers are processes on different machines. That brings in the problems of communication protocols, network latency, and so on, but we will ignore those to get to the basics.
The most common way to scale is to let the workers hold copies of the smaller data structures and have them split the larger data structures as the workload. In your example (the first question), the list of characters is small, so you would give each worker a copy of the list and a portion of the dictionary to work on with it; a sketch of this partitioning is given after this answer. Notice that the other way around won't work, because each worker holding the whole dictionary would consume a large amount of memory in total, and it wouldn't save you anything when scaling up.
If your problem gets larger, you may need more layers of splitting, which also implies you need a way of combining the outputs from the workers that take in the split input. This is the general concept and motivation behind the MapReduce framework and its derivatives.
Hope it helps...
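As a concrete illustration of "copy the small structure, split the big one", here is a hedged Java sketch in which parallel-stream slices stand in for the machines; in a real deployment each slice would be shipped to a worker over the network, and the names used here are made up.

import java.util.List;
import java.util.Optional;
import java.util.Set;

// Every worker gets the (small) char list and one slice of the dictionary, returns its local
// shortest match, and the coordinator keeps the globally shortest answer.
public class ShortestWordSearch {
    static Optional<String> localShortest(List<String> dictionarySlice, Set<Character> chars) {
        return dictionarySlice.stream()
                .filter(word -> chars.stream().allMatch(c -> word.indexOf(c) >= 0))
                .min((a, b) -> Integer.compare(a.length(), b.length()));
    }

    public static void main(String[] args) {
        Set<Character> chars = Set.of('a', 't');
        // Each sub-list stands in for the slice of the dictionary sent to one machine.
        List<List<String>> slices = List.of(
                List.of("cat", "attic", "dog"),
                List.of("tar", "at", "banana"));

        Optional<String> shortest = slices.parallelStream()      // threads stand in for machines here
                .map(slice -> localShortest(slice, chars))
                .flatMap(Optional::stream)
                .min((a, b) -> Integer.compare(a.length(), b.length()));

        System.out.println(shortest.orElse("no match"));         // prints "at"
    }
}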
For the first question, here is how to find the words that contain all the chars in the char list, running at the same time on different machines (not yet the shortest one). I would do it with map-reduce as the base.
First, this problem actually can run on different machines at the same time, because each word in the database can be checked on another machine (to check one word, you don't have to wait for the previous or the next word; you can literally send each word to a different computer to be checked).
Using map-reduce, you can map each word as a value and then check whether it contains every char in the char list.
Map(word, keyout, valueout) {
    // 'word' comes from the database; keyout & valueout become the input of the Reduce step
    if (word contains every char in the char list) {
        sharedOutput(keyout, word)   // emit the word to a shared output file,
                                     // which a framework such as Hadoop manages for you
    }
}
After this Map step runs, all the words you want from the database end up in the shared file. As for the Reduce step, you can use a simple pass that reduces them based on length, and ta-da, you get the shortest one.
As for the second question, multithreading comes to my mind. The intersections are actually not related to each other; I mean, each intersection has its own timer, right? So to be able to handle tons of intersections, you should use multithreading.
In simple terms, use each core in the processor to control an intersection, rather than looping through all the intersections one by one; you can allocate them across the cores so that the processing is faster.
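A minimal sketch of that one-controller-per-intersection idea using a scheduled thread pool; the Intersection class, the 30-second cycle, and the pool sizing are all arbitrary illustrative choices.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Each intersection is driven by its own periodic task; no state is shared between intersections.
public class TrafficControllers {
    enum Light { RED, GREEN }

    static class Intersection implements Runnable {
        private final int id;
        private volatile Light light = Light.RED;   // private to this intersection's controller

        Intersection(int id) { this.id = id; }

        @Override
        public void run() {
            light = (light == Light.RED) ? Light.GREEN : Light.RED;   // toggle on every tick
            System.out.println("intersection " + id + " -> " + light);
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newScheduledThreadPool(Runtime.getRuntime().availableProcessors());
        for (int id = 0; id < 100; id++) {           // 100 intersections, far fewer threads
            scheduler.scheduleAtFixedRate(new Intersection(id), 0, 30, TimeUnit.SECONDS);
        }
        // A real system would keep running; the scheduler's non-daemon threads keep the JVM alive here.
    }
}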

How duplicate file search is implemented in Gemini for Mac OS

I tried to search for duplicate files on my Mac via the command line.
The process took almost half an hour for 10 GB of data files, whereas the Gemini and CleanMyMac apps take much less time to find the files.
So my point here is: how is this speed achieved in these apps? What is the concept behind it? And in which language is the code written?
I tried googling for information but did not find anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files with equal size, then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate, but it is much quicker than classical hashes.
I contacted support, asking them what algorithm they use. Their response was that they compare parts of each file to each other, rather than comparing the whole files or hashing them. As a result, they may only check 5% (or less) of each pair of files that are reasonably similar in size, and still get a reasonably accurate result. Using this method, they pay neither the cost of comparing whole files nor the cost of hashing them. They could be even more accurate if they used this method for the initial comparison and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
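A hedged sketch of that overall strategy (group by size, then compare a few sampled chunks); the chunk size, the sample offsets, and the class name are arbitrary choices for illustration, not Gemini's actual parameters.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Groups files by size first (only same-size files can be byte-identical), then compares a few
// sampled chunks of each candidate pair instead of hashing or reading the files in full.
public class LikelyDuplicates {
    static final int CHUNK = 4096;

    static boolean probablyEqual(Path a, Path b) throws IOException {
        long size = Files.size(a);
        long[] offsets = {0, size / 2, Math.max(0, size - CHUNK)};   // start, middle, end samples
        try (RandomAccessFile fa = new RandomAccessFile(a.toFile(), "r");
             RandomAccessFile fb = new RandomAccessFile(b.toFile(), "r")) {
            byte[] bufA = new byte[CHUNK], bufB = new byte[CHUNK];
            for (long off : offsets) {
                fa.seek(off); fb.seek(off);
                int readA = fa.read(bufA), readB = fb.read(bufB);
                if (readA != readB) return false;
                if (readA > 0 && !Arrays.equals(bufA, 0, readA, bufB, 0, readA)) return false;
            }
        }
        return true;   // all sampled chunks matched; report as a likely duplicate
    }

    public static void main(String[] args) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try { bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p); }
                catch (IOException ignored) { }
            });
        }
        for (List<Path> group : bySize.values())
            for (int i = 0; i < group.size(); i++)
                for (int j = i + 1; j < group.size(); j++)
                    if (probablyEqual(group.get(i), group.get(j)))
                        System.out.println("likely duplicates: " + group.get(i) + " <-> " + group.get(j));
    }
}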

Faster searching through files in Perl

I have a problem where my current approach uses a naive linear search to retrieve data from several data files by matching strings.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically, I have a large set of strings that the program uses to search through the same set of files (for example, 10 files in one instance; it could be 3 the next time the program runs). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking: since the search gets repetitive after a while, is it possible to prebuild an index of the locations of specific lines in the reference data files, without using any external Perl libraries, once the file array gets built (the files are known)? This script is going to be ported to a server that probably has only standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This will reduce IO compared with looping over the files once per pattern.
On the other hand, if the matching is what's taking the time, you're probably better off using a library that can match lots of patterns simultaneously.
You could probably replace this:
foreach file in arrayofDataReferenceFiles
    search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this stuff through tie, but that's not important; every Perl installation should have support for at least one hash-on-disk library without needing non-core packages installed.
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check a file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach: when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then searching off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
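To make the reverse-index idea concrete, here is a small sketch, written in Java only to keep the examples on this page in one language; in Perl the map below would be tied to a DBM file as described above, and the file names are hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Builds word -> set-of-files once; each later lookup is a hash probe instead of a scan of every file.
public class ReverseIndex {
    static Map<String, Set<Path>> build(List<Path> referenceFiles) throws IOException {
        Map<String, Set<Path>> index = new HashMap<>();
        for (Path file : referenceFiles) {
            for (String line : Files.readAllLines(file)) {
                for (String word : line.split("\\W+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word, k -> new HashSet<>()).add(file);
                    }
                }
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        List<Path> refs = List.of(Paths.get("ref1.txt"), Paths.get("ref2.txt"));  // hypothetical reference files
        Map<String, Set<Path>> index = build(refs);   // built once up front (the 3-5 minutes mentioned above)
        System.out.println(index.getOrDefault("needle", Set.of()));   // which files contain "needle"?
    }
}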

Algorithm to find resource-free slot

I am designing a project management app for a factory. The app is expected to produce draft project plans. To schedule a task, the app should check three conditions:
task dependency - do not start before,
machine availability, and
shift work hours
I keep track of machine engagement in machine_allocations table:
machine_allocations
+------------+--------------+-----------------+---------------+
| machine_id | operation_id | start_timestamp | end_timestamp |
+------------+--------------+-----------------+---------------+
Shift hours follow a pattern.
Now, to find the earliest possible date-time for an operation I am thinking of a function:
function earliest_slot($machine_id, $for_duration, $no_sooner_than) {
    // pseudo code
    // 1. get the records for the machine in question that fall after $no_sooner_than
    // 2. put their start and end timestamps into an $unavailable array
    // 3. add non-working (off-shift) times as new elements to the array
    // 4. in a loop, find timeslots which are not in the array
    // 5. if a timeslot is found which is equal to or bigger than $for_duration, return it
}
My question is, is this a good approach? Are there simpler ways to do this?
Finding the earliest date-time for one operation at a time may not give you the best result. Consider the example where operation A uses machine 1 for a long time, operation B uses machine 1 for a short time and operation C uses machine 2 for a short time, but operation C must be done after B.
In this case, it is better to schedule B before A on machine 1, but your approach would not achieve this. Of course, writing and using software to manage this would be more difficult than what you have suggested, so you need to decide whether the benefit is worth the extra effort.
Have a look at Scheduling, Job Shop Scheduling and Scheduling algorithm.
First you need to think about what sort of information you can collect about tasks (such as dependencies, priorities, deadlines) and then decide how best to put it together.
You may find that an approach like the one you propose is good enough in your case. My addition to your proposed algorithm would be to sort the list of existing machine operations to make searching through them faster; that way you can stop as soon as you find a time where your operation fits, because it is guaranteed to be the earliest such time.
A relatively simple extension would be a priority system that allows you to bump lower-priority tasks forward (which may require the adjustment of their dependencies as well), but more complicated algorithms would consider multiple tasks at once and try to optimise the outcome. In the end it comes down to what's appropriate for your specific problem.
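A hedged sketch of that sorted-search step (sort the busy intervals, walk them once, return the first gap big enough); the Interval record and method name are illustrative, not the question's actual schema.

import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;

public class SlotFinder {
    record Interval(LocalDateTime start, LocalDateTime end) { }

    // 'busy' should contain both the machine's allocations and its off-shift periods.
    static LocalDateTime findEarliestSlot(List<Interval> busy, Duration needed, LocalDateTime noSoonerThan) {
        List<Interval> sorted = busy.stream()
                .filter(i -> i.end().isAfter(noSoonerThan))
                .sorted(Comparator.comparing(Interval::start))
                .toList();
        LocalDateTime candidate = noSoonerThan;
        for (Interval i : sorted) {
            // Enough room before this busy interval starts? Because the list is sorted by start,
            // the first gap that fits is guaranteed to be the earliest one, so we can stop here.
            if (Duration.between(candidate, i.start()).compareTo(needed) >= 0) {
                return candidate;
            }
            if (i.end().isAfter(candidate)) {
                candidate = i.end();   // skip past this busy interval (also handles overlapping intervals)
            }
        }
        return candidate;              // the machine is free from here onwards
    }
}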
That depends on when you want to plan the work. If it is before the machines start working, then maybe a branch-and-bound algorithm, or something like it (maybe dynamic programming). If the work has to be planned while the machines are already running and you cannot tell in advance which jobs will be performed, then I don't think you can count on an optimal solution (at least I can't see one). Maybe put the next job on the machine with the smallest maximum time? Maybe a dynamic version of the Ford-Bellman algorithm (if you have a couple of layers of production)? Hard to say.
I would try a couple of approaches and determine which works best. Then you can write an article about it :)
