I want to write a simple program that recurses through a given directory, skips any entry whose complete path and filename contains one of a set of excluded directories / strings (or that does not exist / is empty), and lists the rest, along with file size and creation time.
My attempt looks like this, but it doesn't work:
dir_ignore = [".AppleDouble", ".AppleDB", ".AppleDesktop", ".DS_Store"]
require 'find'
Find.find('/Volumes/Downloads') { |dir_entry|
  next if dir_ignore.any? { |skip| dir_entry.include? skip } || !File.exist?(dir_entry)
  dir = ["filename" => dir_entry], ["filesize" => File.size(dir_entry)], ["creationdate" => File.ctime(dir_entry)]
  puts dir
}
The program still lists all the files in an ".AppleDouble" directory and finally crashes on a non-existent file. So obviously my boolean expression does not work...
Why?
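For what it's worth, here's a hedged sketch of one way to get the intended listing: Find.prune skips an entire matched subtree (where next only skips the single current entry), and File.size? returns nil for missing or empty files, so one guard covers both the "does not exist" and "is empty" cases. The helper name list_files is my own, not from the question:

```ruby
require 'find'

# List files under root, pruning any path that contains one of the
# excluded strings and skipping directories, missing files, and empty files.
def list_files(root, ignore)
  found = []
  Find.find(root) do |path|
    # prune skips this entry AND everything beneath it
    Find.prune if ignore.any? { |skip| path.include?(skip) }
    # File.size? is nil for empty or vanished files
    next unless File.file?(path) && File.size?(path)
    found << { "filename"     => path,
               "filesize"     => File.size(path),
               "creationdate" => File.ctime(path) }
  end
  found
end
```

Called as list_files('/Volumes/Downloads', dir_ignore), it returns an array of hashes you can puts or process further.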
In the HTML, the location is sometimes inside a span with the class "result-hood" and sometimes inside one with the class "nearby":
location = result.search('span.result-hood').text[2..-2]
location2 = result.search('span.nearby').text[2..-2]
so if one of the above classes is not used, the result is nil. My question is how to always get the one that is not nil. I was thinking about the ternary operator "?", but don't know how to use it.
Thanks,
You want the || ("or") operator:
location || location2
It returns the left side if that is not nil or false, and otherwise it returns the right side.
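A quick illustration with throwaway values:

```ruby
a = nil || "Brooklyn"       # left side is nil, so a is "Brooklyn"
b = "Queens" || "Brooklyn"  # left side is truthy, so b is "Queens"
```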
CSS supports logical or operations using a comma as the delimiter, so your selector can just be:
location = result.search('span.result-hood,span.nearby').text[2..-2]
XPath also supports a logical "or" operator itself; the equivalent XPath would look like:
location = result.search('//span[@class="result-hood"]|//span[@class="nearby"]').text[2..-2]
The ternary operator in Ruby:
loc = location.nil? ? location2 : location
Hope this works.
Since you're looking for one or the other, you can reduce this code to a single expression (note the trailing ||, which tells Ruby the statement continues on the next line):
location = result.search('span.result-hood').text[2..-2] ||
  result.search('span.nearby').text[2..-2]
That search operation could be fairly expensive, and || short-circuits, so the second search only runs when the first comes up nil. Now that you've minimized it like this you can take it a step further:
location = %w[ span.result-hood span.nearby ].map do |selector|
result.search(selector).text[2..-2]
end.compact.first
This looks a little complicated, but it maps each selector to the text extracted via result.search(...).text[2..-2] and then takes the first non-nil value.
That version still runs every search before picking a value, so you can make it "lazy" and evaluate each one in sequence instead, stopping at the first match:
location = %w[ span.result-hood span.nearby ].lazy.map do |selector|
result.search(selector).text[2..-2]
end.select(&:itself).first
The nice thing about this approach is you can clean it up a little by declaring a constant in advance:
LOCATIONS = %w[ span.result-hood span.nearby ]
Then later you have more minimal code that will automatically pick up any changes made to that array, both in terms of precedence and the addition of new selectors:
location = LOCATIONS.lazy.map do |selector|
result.search(selector).text[2..-2]
end.select(&:itself).first
I am trying to generate thumbnails with Ruby, on a Linux machine.
The process includes determining which of the 5 thumbnails already generated is the most meaningful (by meaningful, here, I mean the one with the largest size, since a bigger size means more detail).
Afterwards I want to rename the file having the biggest size to a generic name in order to use it later. The code doesn't seem to be working for me, and I can't understand the reason. Are there any suggestions to improve it?
Thank you in advance.
Here is my code:
For your possible needs: the variable thumb_dir contains the path of the directory we are getting the thumbnails from.
max = File.size("#{thumb_dir}/thumb01.jpg").to_f
name = "thumb01.jpg"
for i in 2..5
  if max < File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f
    max = File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f
    name = "thumb0#{i}.jpg"
  end
end
File.rename("#{thumb_dir}/#{name}", "thumbnail.jpg")
i = (1..5).map { |i| File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f }.each_with_index.max[1]
File.rename("#{thumb_dir}/thumb0#{i + 1}.jpg", "thumbnail.jpg")
What does it do ?
(1..5).map { |i| File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f }
We get an array of file sizes for thumb01.jpg up to thumb05.jpg.
array.each_with_index.max[1]
Used to get the index of the greatest value of the array.
File.rename("#{thumb_dir}/thumb0#{i + 1}.jpg", "thumbnail.jpg")
Now that we know i is the index of the greatest value in the array, thumb0#{i + 1}.jpg is the file with the greatest size, so that's the one we want to rename.
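To see why each_with_index.max[1] yields the index: each_with_index produces [value, index] pairs, and Ruby compares arrays element by element, so the pair holding the largest size wins. A small demonstration with made-up sizes:

```ruby
sizes = [1024.0, 40960.0, 2048.0]  # hypothetical thumbnail sizes in bytes
pair  = sizes.each_with_index.max  # compares [size, index] pairs
index = pair[1]
# pair  is [40960.0, 1]
# index is 1, i.e. the second file is the biggest
```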
Remember that in Ruby there's a number of useful methods in Enumerable that make this sort of thing pretty straightforward:
# Expand to a list of possible thumbnail paths
thumbnails = (1..5).map { |n| '%s/thumb%02d.jpg' % [ thumb_dir, n ] }
# Find the biggest thumbnail by...
biggest_thumbnail = thumbnails.select do |path|
# ...only dealing with those files that exist...
File.exist?(path)
end.max_by do |path|
# ...and looking for the one with the maximum size.
File.size(path)
end
That should return the largest file if one exists. If not you get nil.
You can use that to rename:
if (biggest_thumbnail)
File.rename(biggest_thumbnail, 'thumbnail.jpg')
end
You'll want to back up your original images before unleashing something like this on them, since a rename onto an existing thumbnail.jpg will silently overwrite it.
I have files that can be 19GB or greater, they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. As an example, I will have a 19GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of row to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping the file for the string and seeing if a result comes back? I don't even need the line; I just need, essentially, a boolean, true/false, for whether it is inside this file.
Actually sgrep is what I was looking for. The reason I got confused was that structured grep has the same name as sorted grep, and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a <= z) {
        m = (a+z)/2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines and keep the second piece:
        // the first complete line after offset m.
        if line == target
            return true
        else if line < target
            a = m + 1    // target sorts after this line: search the back half
        else
            z = m - 1    // target sorts before this line: search the front half
    }
    return false
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that a key like "Xerox" is going to be at roughly 98% of the file and starting the midpoint there... but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes anyway.
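Translated into runnable Ruby, that sketch might look like this (under the stated assumptions: lines are comma-separated, sorted on the first column, and no longer than ~100 characters; the extra check at the end covers the file's first line, which the chunk-splitting trick never examines):

```ruby
# Binary-search a sorted text file for a line whose first column == target.
# A 200-byte read always spans at least one complete line if no line
# exceeds ~100 characters.
def file_has_line?(path, target)
  File.open(path) do |file|
    a = 0
    z = file.size
    while a <= z
      m = (a + z) / 2
      file.seek(m)
      chunk = file.read(200) || ""
      # Drop the (probably partial) first piece; keep the next full line.
      line = chunk.split("\n")[1].to_s
      key  = line.split(",").first.to_s
      return true if key == target
      if key < target
        a = m + 1    # target sorts after this line
      else
        z = m - 1    # target sorts before this line
      end
    end
    # The split-and-take-second trick never examines the very first line,
    # so check it explicitly before giving up.
    file.rewind
    first = file.gets.to_s
    first.split(",").first == target
  end
end
```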
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()
    // shift takes the smallest remaining target, matching the ascending walk
    line = file.readln()
    hits = []
    do {
        if line < target
            line = file.readln()
        else if line > target
            target = target_array.shift()
        else // line == target
            hits.push(line)
            line = file.readln()
    } while not file.eof() and target is not null
    return hits
}
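A runnable Ruby version of that parallel walk might look like this (the helper name file_has_lines and the sample comma-separated layout are my own choices, carried over from the question's example row):

```ruby
# Walk a sorted file once, collecting every line whose first column
# matches one of the (sorted) targets.
def file_has_lines(path, targets)
  targets = targets.sort
  hits = []
  File.foreach(path) do |line|
    key = line.split(",").first
    # Targets that sort before the current line can never match again.
    targets.shift while targets.any? && targets.first < key
    hits << line.chomp if targets.first == key
  end
  hits
end
```

This reads the file exactly once regardless of how many targets you have, which is the whole point of sorting both sides first.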
I've got two arrays, and I want to find all the elements of Array0 whose string contains the full string of some element of Array1. Here is the scenario:
I've got a string array that contains the full path of all the xml files in a certain directory. I then get a list of locations and only want to return the subset of xml file paths where the filename of the xml file is the loc id.
So, my Array0 has something like:
c:\some\directory\6011044.xml
c:\some\directory\6028393.xml
c:\some\directory\6039938.xml
c:\some\directory\6028833.xml
And my Array1 has:
6011044
6028833
...and I only want to have the results from Array0 where the filepath string contains string from Array1.
Here is what I've got...
filesToLoad = (from f in Directory.GetFiles(Server.MapPath("App_Data"), "*.xml")
where f.Contains(from l in locs select l.CdsCode.ToString())
select f).ToArray();
...but I get the following compiler error...
Argument '1': cannot convert from 'System.Collections.Generic.IEnumerable<string>' to 'string'
...which I can understand from an English standpoint, but do not know how to resolve.
Am I coming at this from the wrong angle?
Am I missing just one piece?
EDIT
Here is what I've changed it to:
filesToLoad = (Directory.GetFiles(Server.MapPath("App_Data"), "*.xml"))
.Where(path => locs.Any(l => path.Contains(l.CdsCode.ToString()))
).ToArray();
...but this still gets me all the .xml files even though one of them is not in my locs entity collection. What did I put in the wrong place?
Obviously I'm missing the main concept so perhaps a little explanation as to what each piece is doing would be helpful too?
Edit 2
See Mark's comment below. The answer to my problem, was me. I had one record in my locs collection that had a zero for the CDS value and thus was matching all records in my xml collection. If only I could find a way to code without myself, then I'd be the perfect developer!
You're missing Any:
string[] result = paths.Where(x => tests.Any(y => x.Contains(y))).ToArray();
You can also join them:
var filesToLoad = (from f in Directory.GetFiles(Server.MapPath("App_Data"), "*.xml")
from l in locs
where f.Contains(l.CdsCode.ToString())
select f).ToArray();
I'm writing a small parser for Google and I'm not sure what's the best way to design it. The main problem is the way it will remember the position it stopped at.
During parsing it's going to append new searches to the end of a file and go through the file starting with the first line. Now I want to do it so that, if for some reason the execution is interrupted, the script knows the last search it has accomplished successfully.
One way is to delete a line from the file after fetching it, but in that case I have to coordinate the order in which threads access the file, and deleting the first line of a file, AFAIK, can't be done efficiently.
Another way is to write the number of each used line to a text file and skip the lines whose numbers are in that file. Or maybe I should use some database instead? TIA
There's nothing wrong with using a state file. The only catch will be that you need to ensure you have fully committed your changes to the state file before your program enters a section where it may be interrupted. Typically this is done with an IO#flush call.
For example, here's a simple state-tracking class that works on a line-by-line basis:
class ProgressTracker
  def initialize(filename)
    @filename = filename
    @file = open(@filename)
    @state_filename = File.expand_path(".#{File.basename(@filename)}.position", File.dirname(@filename))

    if (File.exist?(@state_filename))
      @state_file = open(@state_filename, File::RDWR)
      resume!
    else
      @state_file = open(@state_filename, File::RDWR | File::CREAT)
    end
  end

  def each_line
    @file.each_line do |line|
      mark_position!
      yield(line) if (block_given?)
    end
  end

protected
  def mark_position!
    @state_file.rewind
    @state_file.puts(@file.pos)
    @state_file.flush
  end

  def resume!
    if (position = @state_file.readline)
      @file.seek(position.to_i)
    end
  end
end
You use it with an IO-like block call:
test = ProgressTracker.new(__FILE__)
n = 0
test.each_line do |line|
n += 1
puts "%3d %s" % [ n, line ]
if (n == 10)
raise 'terminate'
end
end
In this case, the program reads itself and will stop after ten lines due to a simulated error. On the second run it should display the next ten lines, if there are that many, or simply exit if there's no additional data to retrieve.
One caveat is that you need to remove the .position file associated with the input data if you want the file to be reprocessed, or if the file has been reset. It's also not possible to edit the file and remove earlier lines, or it will throw off the offset tracking. As long as you're simply appending data to the file, or restarting it, everything will be fine.