Reading the last n lines from a huge text file - Windows

I've tried something like this
file_in <- file("myfile.log","r")
x <- readLines(file_in, n=-100)
but I'm still waiting...
Any help would be greatly appreciated

I'd use scan for this, if you know how many lines the log has:
scan("foo.txt",sep="\n",what="char(0)",skip=100)
If you have no clue how many you need to skip, you have no choice but to move towards either
reading in everything and taking the last n lines (if that's feasible; see the one-liners below),
using scan("foo.txt",sep="\n",what=list(NULL)) to figure out how many records there are, or
using some algorithm to go through the file, keeping only the last n lines every time
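For the first two options, a minimal sketch (foo.txt, n = 100 and the 1e7 record count are just placeholders):
# Option 1: read everything and keep only the last 100 lines
last100 <- tail(readLines("foo.txt"), 100)
# Option 2: once you know the total number of records (say 1e7),
# skip everything except the last 100
last100 <- scan("foo.txt", sep = "\n", what = "character",
                skip = 1e7 - 100, quiet = TRUE)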
The last option could look like:
ReadLastLines <- function(x, n, ...){
  con <- file(x)
  open(con)
  # read the first n lines as the initial window
  out <- scan(con, n, what = "char(0)", sep = "\n", quiet = TRUE, ...)
  while (TRUE) {
    # slide the window forward one line at a time until the file is exhausted
    tmp <- scan(con, 1, what = "char(0)", sep = "\n", quiet = TRUE)
    if (length(tmp) == 0) { close(con); break }
    out <- c(out[-1], tmp)
  }
  out
}
allowing:
ReadLastLines("foo.txt",100)
or
ReadLastLines("foo.txt",100,skip=1e+7)
in case you know you have more than 10 million lines. This can save on the reading time when you start having extremely big logs.
EDIT: In fact, given the size of your file, I'd not even use R for this. On Unix, you can use the tail command. There are Windows ports of tail as well (e.g., in the GnuWin32 coreutils, Cygwin, or Git for Windows), though I haven't tried them myself.
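For example, with a tail executable on the PATH (from one of the toolkits above; myfile.log is the file from the question), you could call it from R on Windows via shell():
# shell() hands the command to cmd.exe; intern = TRUE captures the output as a character vector
last100 <- shell("tail -n 100 myfile.log", intern = TRUE)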

You could do this with read.table by specifying the skip parameter. If your lines are not to be parsed into separate variables, specify the separator to be '\n' as @Joris Meys pointed out below, and also set as.is=TRUE to get character vectors instead of factors.
Small example (skipping the first 2000 lines):
df <- read.table('foo.txt', sep='\n', as.is=TRUE, skip=2000)

As @JorisMeys already mentioned, the Unix command tail would be the easiest way to solve this problem. However, I want to propose a seek-based R solution that starts reading the file from its end:
tailfile <- function(file, n) {
  bufferSize <- 1024L
  size <- file.info(file)$size
  if (size < bufferSize) {
    bufferSize <- size
  }
  pos <- size - bufferSize
  text <- character()
  k <- 0L
  f <- file(file, "rb")
  on.exit(close(f))
  while (TRUE) {
    seek(f, where = pos)
    chars <- readChar(f, nchars = bufferSize)
    # count the newlines seen so far (gregexpr returns -1 when there is no match)
    nl <- gregexpr(pattern = "\n", text = chars, fixed = TRUE)[[1L]]
    k <- k + sum(nl > 0L)
    # prepend: this chunk comes before everything read so far
    text <- paste0(chars, text)
    if (k > n || pos == 0L) {
      break
    }
    # step backwards without re-reading bytes near the start of the file
    bufferSize <- min(bufferSize, pos)
    pos <- pos - bufferSize
  }
  tail(strsplit(text, "\n", fixed = TRUE)[[1L]], n)
}
tailfile(file, n=100)

You can read the last n lines with the following method.
Step 1 - Read your file however you like:
df <- read.csv("hw1_data.csv")
Step 2 - Use the tail function to get the last n lines:
tail(df, 2)

Some folks have said it already, but if you have a large log, it is most efficient to read in only what you need instead of reading it all into memory and then subsetting.
For this, we use R's system() to run the Linux tail command.
Read the last 10 lines of the log:
system("tail path/to/my_file.log")
Read the last 2 lines of the log:
system("tail -n 2 path/to/my_file.log")
Read the last 2 lines of the log and capture the output in a character vector:
last_2_lines <- system("tail -n 2 path/to/my_file.log", intern = TRUE)

For seeing the last few lines:
tail(file_in,100)

Related

Is there a way to write single-band rasters from multiple raster stacks?

I have 4 subfolders, each containing 5 rasters with continuous values. So I built a loop with the "for" function to:
list these raster files,
stack these files per folder, i.e. 4 RasterStack objects (each containing 5 rasters),
apply a threshold to transform the continuous rasters into binary rasters, and
finally write the binary rasters using the writeRaster function.
My issue is with step 4. Even though I use the argument bylayer = T in the writeRaster function, what gets saved is a single stack with the 5 binary rasters, and I want to write them per raster, per file, per band.
I'd be really grateful if anyone could give me any insights.
setwd("Vole_raw_mean_Present/")
sub <- list.dirs(full.names=FALSE, recursive=FALSE)
for(j in 1:length(sub)) {
print(sub[j])
h <- list.files(path=sub[j], recursive=TRUE, full.names=TRUE, pattern='.tif')
print(h)
stack_present <- stack(h)
print(stack_present)
binary_0.2 <- stack_present >=0.2
writeRaster(binary_0.2, filename=paste0(sub[j], bylayer = T, suffix = "_bin.tif"), overwrite=TRUE)
}
This is wrong because the bylayer argument is lost: it ends up inside paste0 and becomes part of the filename:
writeRaster(binary_0.2, filename=paste0(sub[j], bylayer = T, suffix = "_bin.tif"), overwrite=TRUE)
It should be something like this (and it helps to do it in two steps):
f <- paste0(sub[j], "bin.tif")
writeRaster(binary_0.2, filename=f, bylayer=TRUE, overwrite=TRUE)
Illustrated here
library(raster)
b <- brick(system.file("external/rlogo.grd", package="raster"))
dir.create("test")
setwd("test")
writeRaster(b, filename="abc.tif", bylayer=T)
list.files()
#[1] "abc_1.tif" "abc_2.tif" "abc_3.tif"
writeRaster(b, filename="bin.tif", bylayer=T, suffix = paste0("f", 1:3))
list.files(pattern="bin")
#[1] "bin_f1.tif" "bin_f2.tif" "bin_f3.tif"
Alternatively, you can loop over the files within each folder, as in the sketch below.
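A minimal sketch of that alternative, assuming the same folder layout and 0.2 threshold as in the question (the _bin suffix is just illustrative):
library(raster)
sub <- list.dirs(full.names = FALSE, recursive = FALSE)
for (j in seq_along(sub)) {
  files <- list.files(path = sub[j], recursive = TRUE,
                      full.names = TRUE, pattern = "\\.tif$")
  for (f in files) {
    r   <- raster(f)                       # one single-band raster per file
    bin <- r >= 0.2                        # same threshold as in the question
    out <- gsub("\\.tif$", "_bin.tif", f)  # e.g. "x.tif" becomes "x_bin.tif"
    writeRaster(bin, filename = out, overwrite = TRUE)
  }
}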

Fastest way to index large sorted hash file

I am building a file-based index for the sorted haveibeenpwned passwords text file and it got me wondering what's the fastest way to do this?
I figured a good way to build a quickly grep-able index would be to split the sorted file into 256 files named with the first two hex digits (i.e. FF.txt, FE.txt, etc). I found ripgrep rg to be about 5 times faster than grep on my computer. So I tried something like this:
for i in {255..0}
do
    start=$(date +%s)
    hex="$(printf '%02x' $i | tr '[:lower:]' '[:upper:]')"
    rg "^$hex" pwned-passwords-ntlm-ordered-by-hash-v4.txt > ntlm/$hex-ntlm.txt
    echo 0x$hex completed in $(($(date +%s) - $start)) seconds
done
This is the fastest solution I could come up with. ripgrep is able to create each file in 25 seconds, so I'm looking at about 100 minutes to create this index. When I split the job in half and run the halves in parallel, each pair of files is created in 80 seconds, so it seems best to just let ripgrep work its magic and run in series.
Obviously, I won't be indexing this list too often, but it's just fun to think about. Any thoughts on a faster way (aside from using a database) to index this file?
ripgrep, like any other tool that's able to work with unsorted input files at all, is the wrong tool for this job. When you're trying to grep sorted inputs, you want something that can bisect your input file to find a position in logarithmic time. For big enough inputs, even a slow O(log n) implementation will be faster than a highly optimized O(n) one.
pts-line-bisect is one such tool, though of course you're also welcome to write your own. You'll need to write it in a language with full access to the seek() syscall, which is not exposed in bash.
You are reading through the file 256 times, doing a full file scan every time. Consider an approach that reads the file once, writing each line into an open file descriptor. I'm thinking python would be an easy choice of implementation (if that's your thing). You could optimize by keeping the file open until you hit a new hex code at the beginning of the line. If you want to be even more clever, there is no need to go through the sorted file line by line. Based on Charles Duffy's hint, you could create a heuristic for sampling the file (using seek()) to get to the next hex value. Once the program has found the byte offset of the next hex value, the block of bytes can be written to the new file. However, since this is tagged as 'bash' let's keep the solution set in that domain:
while read -r line
do
    hex=${line:0:2}
    echo "$line" >> "ntlm/$hex-ntlm.txt"
done < pwned-passwords-ntlm-ordered-by-hash-v4.txt
I wrote a Python3 script that performs fast binary-search lookups in the hash file without having to create an index. It doesn't directly address your question (indexing), but it probably solves the underlying problem that you wanted to solve with an index: quickly looking up individual hashes. This script checks hundreds of passwords in seconds.
import argparse
import hashlib

parser = argparse.ArgumentParser(description='Searches passwords in https://haveibeenpwned.com/Passwords database.')
parser.add_argument('passwords', metavar='TEST', type=str, help='text file with passwords to test, one per line, utf-8')
parser.add_argument('database', metavar='DATABASE', type=str, help='the downloaded text file with sha-1:count')
args = parser.parse_args()

def search(f: object, pattern: str) -> str:
    def search(left, right: int) -> str:
        if left >= right:
            return None
        middle = (left + right) // 2
        if middle == 0:
            f.seek(0, 0)
            test = f.readline()
        else:
            f.seek(middle - 1, 0)
            _ = f.readline()
            test = f.readline()
        if test.upper().startswith(pattern):
            return test
        elif left == middle:
            return None
        elif pattern < test:
            return search(left, middle)
        else:
            return search(middle, right)
    f.seek(0, 2)
    return search(0, f.tell())

fsource = open(args.passwords)
fdatabase = open(args.database)
source_lines = fsource.readlines()
for l in source_lines:
    line = l.strip()
    hash_object = hashlib.sha1(line.encode("utf-8"))
    pattern = hash_object.hexdigest().upper()
    print("%s:%s" % (line, str(search(fdatabase, pattern)).strip()))
fsource.close()
fdatabase.close()

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19GB or greater, they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep but I'm not sure if it's what I'm looking for. As an example, I will have a 19GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need a boolean, true/false, telling me whether it is in this file.
Actually, sgrep is what I was looking for. The reason I got confused was that structured grep has the same name as sorted grep and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a < z) {
        m = (a+z)/2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at byte offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines; the first piece is usually a partial line,
        // so keep the second piece, i.e. the first complete line after offset m.
        if line == target
            return true
        elsif line < target
            a = m + 1
        else
            z = m - 1
    }
    return false
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()   // take targets in ascending order
    line = file.readln()
    hits = []
    do {
        if line < target
            line = file.readln()
        elsif line > target {
            if target_array.empty()
                break               // no more targets to look for
            target = target_array.shift()
        }
        else {
            // line == target
            hits.push(line)
            line = file.readln()
        }
    } while not file.eof()
    return hits
}

Processing a bunch of strings efficiently

I need to read some data from a file in chunks of 128M, and then for each line I will do some processing. The naive way is to use split to convert the string into a collection of lines and then process each line, but maybe that is not efficient, as it creates a collection that simply stores a temporary result, which could be costly. Is there a way with better performance?
The file is huge, so I kicked off several threads; each thread picks up a 128M chunk. In the following snippet, rawString is one such 128M chunk.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
  ...
}
It'd be better to read text line by line:
import scala.io.Source
for (line <- Source.fromFile("file.txt").getLines()) {
  ...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line encodings, you would want to
def lines(bs: Array[Byte]): Array[String] = {
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n') xs += i
    i += 1
  }
  val ix = xs.result
  val ss = new Array[String](0 max (ix.length - 1))
  i = 1
  while (i < ix.length) {
    ss(i - 1) = new String(bs, ix(i - 1) + 1, ix(i) - ix(i - 1) - 1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)

Erlang upper case and lower case sort

The question is about comparing upper and lower case: how can I do that in my sort function? Any idea?
Example input file: "I am Happy!"
Output file:
Happy!
I
am
That's what happens with my program, but I would like to have:
am
I
Happy
My code:
-module(wp).
-compile([export_all]).

% Read the file, sort the words, write the result.
sortFile(File1, File2) ->
    {ok, File_Read} = file:read_file(File1),
    % make a list of words
    Liste = string:tokens(binary_to_list(File_Read), "\n "),
    % insertion sort the list
    Sort_List = isort(Liste),
    ISort = string:join(Sort_List, "\n"),
    % write it to the output file
    {ok, File_Write} = file:open(File2, write),
    file:write(File_Write, ISort),
    file:close(File_Write).

isort([]) -> [];
isort([X|XS]) -> insert(X, isort(XS)).

insert(Elem, []) -> [Elem];
insert(Elem, [X|XS]) when Elem =< X -> [Elem,X|XS];
insert(Elem, [X|XS]) -> [X|insert(Elem,XS)].
how about something like this:
qsort1([]) -> [];
qsort1([H|T]) ->
    qsort1([X || X <- T, string:to_lower(X) < string:to_lower(H)])
    ++ [H]
    ++ qsort1([X || X <- T, string:to_lower(X) >= string:to_lower(H)]).
7> qsort1(["I", "am","Happy"]).
["am","Happy","I"]
I believe that "happy" sorts less than "i"
8> "happy" < "i".
true
which is why my sorted order is a little different from your original post.
Since sorting performs at least N*log2(N) comparisons, it is not necessary to do N*log2(N) case transformations, only N. (Almost all Perl developers know this trick.)
{ok, Bin} = file:read_file(?INPUT_FILE),
Toks = string:tokens(binary_to_list(Bin),"\n "),
Result = [[X,$\n] || {_,X} <- lists:sort([{string:to_lower(X), X} || X<-Toks])],
file:write_file(?OUTPUT_FILE, Result).
BTW, lists:sort/1 is a merge sort with a guaranteed N*log2(N) bound and is pretty efficient, in contrast to the concise but less efficient quicksort implementation. What's worse, quicksort has an N^2 worst case.
Now, depending on whether you are on Windows or Unix/Linux, the lines in the files will be ended with different characters. Let's go with Windows, where it's normally \r\n. Assuming the input files are not too big, we can read them at once into a binary. The stream of data we get must be split into lines, then each line split into words (on spaces). If the input file is very big and cannot fit in memory, then you will have to read it line by line, in which case you might need an in-memory buffer to hold all the words ready for sorting; this would require an ETS table or Memcached (an option I won't illustrate here). Let's write the code:
-module(sick_sort).
-compile(export_all).

-define(INPUT_FILE, "C:/SICK_SORT/input.txt").
-define(OUTPUT_FILE_PATH, "C:/SICK_SORT/").
-define(OUTPUT_FILENAME, "output.txt").

start() ->
    case file:read_file(?INPUT_FILE) of
        {ok, Binary} ->
            %% input file read: split into lines, then into words
            AllLines = string:tokens(binary_to_list(Binary), "\r\n"),
            Words = lists:append([string:tokens(Line, " ") || Line <- AllLines]),
            SortedText = lists:flatten([W ++ "\r\n" || W <- lists:sort(Words)]),
            EndFile = filename:join(?OUTPUT_FILE_PATH, ?OUTPUT_FILENAME),
            file:write_file(EndFile, SortedText),
            ok;
        Error -> {error, Error}
    end.
That should work. Change the macros in the source file to suit your settings and then, just run sick_sort:start().
You have to compare lower-cased versions of the strings in your sort function:
(nitrogen#127.0.0.1)25> F= fun(X,Y) -> string:to_lower(X) < string:to_lower(Y) end.
#Fun<erl_eval.12.111823515>
(nitrogen#127.0.0.1)26> lists:sort(F,["I","am","Happy"]).
["am","Happy","I"]
(nitrogen#127.0.0.1)27>
EDIT:
In your code, what sorts the list are the operators > and < (if you want to keep duplicated strings, one of them should include =; otherwise you will effectively do a usort). If you want to use a different comparison, you can define it in a named or anonymous function and then use it in the quicksort:
mycompare(X, Y) ->
    string:to_lower(X) < string:to_lower(Y).

quicksort([]) -> [];
quicksort([X|XS]) ->
    quicksort([Y || Y <- XS, mycompare(Y, X)]) ++ [X] ++
    quicksort([Y || Y <- XS, mycompare(Y, X) == false]).
