Go: read block of lines in a zip file [closed] - go

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need to read a block of n lines in a zip file as quickly as possible.
I'm a beginner in Go. For bash lovers, I want to do the same as the following (to get a block of 500 lines between lines 199500 and 200000):
time query=$(zcat fake_contacts_200k.zip | sed '199500,200000!d')
real 0m0.106s
user 0m0.119s
sys 0m0.013s
Any idea is welcome.

1. Import archive/zip.
2. Open and read the archive file as shown in the example right there in the docs.
Note that in order to mimic the behaviour of zcat you have to first check the length of the File field of the zip.ReadCloser instance returned by a call to zip.OpenReader, and fail if it is not equal to 1, that is, if there are no files in the archive or if there are two or more files in it¹.
Note also that you have to check the error value returned by the call to zip.OpenReader for being equal to zip.ErrFormat, and if it is equal, you have to:
close the returned zip.ReadCloser, and then
try to reinterpret the file as being gzip-formatted (step 4).
3. Take the first (and sole) File member and call Open on it.
You can then read the file's contents from the returned io.ReadCloser.
After reading, you need to call Close() on that instance and then close the zip file as well. That's all. ∎
4. If step (2) failed because the file did not have the zip format, you test whether it is gzip-formatted.
To do this, you follow basically the same steps using the compress/gzip package.
Note that contrary to the zip format, gzip does not provide file archival: it is merely a compressor, so there is no meta information on any files in the gzip stream, just the compressed data.
(This fact is underlined by the difference in the names of the packages.)
If an attempt to open the same file as a gzip stream returns the gzip.ErrHeader error, you bail out; otherwise you read the data, after which you close the reader. That's all. ∎
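Putting the above together, here is a minimal sketch of the zip-then-gzip fallback. The helper name openCompressed and its exact error handling are my own invention, not an API from the standard library:

package main

import (
	"archive/zip"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// openCompressed tries the file as a single-member zip archive first and,
// if the zip header is not recognized, falls back to treating it as a bare
// gzip stream. The caller must invoke the returned cleanup function.
func openCompressed(name string) (io.Reader, func(), error) {
	zr, err := zip.OpenReader(name)
	if err == nil {
		if len(zr.File) != 1 {
			zr.Close()
			return nil, nil, fmt.Errorf("want exactly 1 file in archive, got %d", len(zr.File))
		}
		rc, err := zr.File[0].Open()
		if err != nil {
			zr.Close()
			return nil, nil, err
		}
		return rc, func() { rc.Close(); zr.Close() }, nil
	}
	if err != zip.ErrFormat {
		return nil, nil, err
	}
	// Not zip-formatted: retry as gzip (step 4 above).
	f, err := os.Open(name)
	if err != nil {
		return nil, nil, err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, nil, err // gzip.ErrHeader here means it is not gzip either
	}
	return gz, func() { gz.Close(); f.Close() }, nil
}

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s FILE", os.Args[0])
	}
	r, cleanup, err := openCompressed(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer cleanup()
	// Mimic zcat: copy the decompressed stream to standard output.
	if _, err := io.Copy(os.Stdout, r); err != nil {
		log.Fatal(err)
	}
}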
To process just the specific lines from the decompressed file, you'd need to:
1. Skip the lines before the first one to process.
2. Process the lines up to, and including, the last one to process.
3. Stop processing.
To interpret the data read from an io.Reader or io.ReadCloser,
it's best to use bufio.Scanner —
see the "Example (Lines)" there.
P.S.
Please read this essay thoroughly to try to make your next question better than this one.
¹ You might as well read all the files and interpret their contents
as a contiguous stream — that would deviate from the behaviour of zcat
but that might be better. It really depends on your data.
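If you went with the footnote's variant and concatenated every member instead of insisting on exactly one, the zip branch of the sketch above could be replaced along these lines. openAllMembers is again an invented name, not a standard API:

package zcatall // sketch only; fold it into whatever package suits you

import (
	"archive/zip"
	"io"
)

// openAllMembers opens every member of the archive and returns one reader
// that yields their contents back to back, plus a cleanup function that
// closes the members and the archive.
func openAllMembers(name string) (io.Reader, func(), error) {
	zr, err := zip.OpenReader(name)
	if err != nil {
		return nil, nil, err
	}
	var readers []io.Reader
	var closers []io.Closer
	closeAll := func() {
		for _, c := range closers {
			c.Close()
		}
		zr.Close()
	}
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			closeAll()
			return nil, nil, err
		}
		readers = append(readers, rc)
		closers = append(closers, rc)
	}
	return io.MultiReader(readers...), closeAll, nil
}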

Related

Why is in-place editing of a file slower than making a new file?

As you can see in this answer, editing a text file in place seems to take much more time than creating a new file, deleting the old file, and moving a temporary file from another file system and renaming it. Let alone creating a new file in the same file system and just renaming it. I was wondering what the reason behind that is.
Because when you edit a file in place you are opening the same file for both writing and reading, but when you use another file you only read from one file and write to another file.
When you open a file for reading, its contents are moved from disk to memory. Then, when you edit the file, you change its contents on disk, so the contents you have in memory have to be updated to prevent data inconsistency. But when you use a new file, you don't have to update the contents of the first file in memory: you just read the whole file once and write the other file once, and don't update anything. Removing a file also takes very little time because you just remove it from the file system and you don't write any bits to the location of the file on disk. The same goes for renaming. Moving can also be done very fast depending on the file system, but most likely not as fast as removing and renaming.
There is also another, more important reason.
When you remove the numbers from the beginning of the first line, all of the other characters have to be shifted back a little. Then, when you remove the numbers from the second line, again all of the characters after that point have to be shifted back, because the characters have to be consecutive. If you wanted to just change some characters, editing in place would have been a lot faster. But since you are changing the length of the file on each removal, all of the other characters have to get shifted, and that takes so much time. It's not exactly like this, and it's much more complicated depending on the implementation of your operating system and your file system, but this is the idea behind it. It's like an array operation: when you remove a single element from an array you have to shift all of the other elements, because it is an array. In contrast, if you were to remove an element from a linked list you wouldn't need to shift other elements, but files are implemented similarly to arrays, so that is that.
While tgwtdt's answer may give a few good insights, it does not explain everything. Here is a counterexample on a 140 MB file:
$ time sed 's/a/b/g' data > newfile
real 0m2.612s
$ time sed -i -- 's/a/b/g' data
real 0m9.906s
Why is this a counterexample, you may ask? Because I replace a with b, which means that the replacement text has the same length. Thus, no data needs to be moved, but it still took about four times longer.
While tgwtdt gave good reasoning for why in-place editing usually takes longer, this is a question that cannot be answered 100% for the general case, because it is implementation-dependent.
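To show the "write a new file, then rename it over the original" pattern discussed above in code, here is a minimal Go sketch (sticking with the language of the main question). The helper name rewriteFile and the data.txt path are made up for this example:

package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// rewriteFile illustrates the "new file + rename" approach: the transformed
// content is written to a temporary file in the same directory, which is then
// renamed over the original. The rename only touches directory metadata; no
// file data is shifted around.
func rewriteFile(name string, transform func(dst io.Writer, src io.Reader) error) error {
	src, err := os.Open(name)
	if err != nil {
		return err
	}
	defer src.Close()

	// Create the temp file next to the original so the final rename
	// stays on the same file system.
	tmp, err := os.CreateTemp(filepath.Dir(name), ".rewrite-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless once the rename has succeeded

	if err := transform(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), name)
}

func main() {
	// The transform here just copies the data, which is enough to
	// demonstrate the temp-file-and-rename mechanics.
	err := rewriteFile("data.txt", func(dst io.Writer, src io.Reader) error {
		_, err := io.Copy(dst, src)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}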

Textstream to read non-text files

Is using the Microsoft Scripting FileSystemObject's OpenTextFile method (assigning the result to a TextStream-typed or untyped variable), with open type = 8 (for appending), and seeing whether that line of code executes without error, a reasonably reliable way to ascertain whether or not the file is locked in any of the typical ways (i.e. another user or program has it open or locked in use, or it actually has a Read Only file attribute; that last case is not my primary goal, and yes, I already know about reading Attributes)?
I've heard of doing this, but I'm just wanting to get some input. Obviously, the documentation on OpenTextFile generally focuses on the apparent assumption that you are actually working with text files.
But my question then is two-fold:
Is the simple test of seeing if OpenTextFile (path,8) executes successfully pretty much a green light to assume it is not locked for some reason?
Will this work for other file types, like docx, PDF, etc.? I mean, I know the line of code seems to work, but is it equally applicable to the question of whether the file is locked for some reason?

List all the users who are currently logged in to the Unix server [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
In Unix, I use the who command to list all the users who are currently logged in to the system.
I wish to write a bash shell script which displays the same output as the who command.
I have tried the following:
1. vi log.sh (now there is a file log.sh)
2. type who
3. save and quit
4. give execute permission: chmod +x log.sh
5. execute: sh -vx log.sh
This will give the same output as using who.
However, is there another way to write such a shell script?
It is hard to answer, as I suspect this is homework (and I don't want to give you the full answer). Moreover, I don't know how proficient you might be in various programming areas. So, I will only try to make an answer that is in accordance with the How do I ask and answer Homework Community Wiki.
Is there another way?
Yes, there is. Obviously, who has to work somehow. At the very worst, you might search the source code to learn how it works.
Fortunately this question does not require such an extreme solution. As has been said in a comment, who reads from /var/tmp/utmp or /var/run/utmp (on my Debian system).
cat /var/run/utmp
You will see this is a binary "file". You have to somehow decode it. That's where man utmp might come to the rescue: it exposes the C structure corresponding to one record in the utmp file.
With that knowledge, you will be able to process the file with your favorite language. Please note bash (or any shell) is probably not the best language to deal with binary data structures.
As I said at first, you didn't give enough background for me (us?) to give precise advice. Anyway, if digging into the kernel data structures is ... well ... way above what can be expected from you, maybe some "simple" solution based on grep/awk/bash/whatever might be sufficient to filter the output of:
ps -edf
Taking this as a challenge, I came up with this solution:
#!/bin/bash
shopt -s extglob
while read record; do
    # ut_name at offset 44, size 32
    ut_name="${record:44:32}"
    # ut_line at offset 8, size 32
    ut_line="${record:8:32}"
    echo ${ut_name%%*(.)} ${ut_line%%*(.)}
done < <(hexdump -v -e '384/1 "%_p"' -e '"\n"' /var/run/utmp)
#                       ^^^
# according to utmp.h, sizeof(struct utmp) is 384 bytes,
# so hexdump outputs one line per record here.
As bash is not good at handling binary data (especially data containing \x00), I had to rely on hexdump with a custom format to "decode" utmp records.
Of course, this is far from perfect, and producing output really identical to that given by who might require a decent amount of effort. But this might be a good starting point...
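Since the main question on this page is about Go, here is a hedged sketch of the same idea in Go, using the 384-byte record size and the field offsets quoted in this answer (these values come from that Debian system's utmp.h and may differ on other platforms):

package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

const (
	recordSize = 384 // sizeof(struct utmp) per the utmp.h quoted above
	lineOffset = 8   // ut_line: char[32]
	nameOffset = 44  // ut_name: char[32]
	fieldLen   = 32
)

// cString trims a NUL-padded fixed-size field down to its string value.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		b = b[:i]
	}
	return string(b)
}

func main() {
	data, err := os.ReadFile("/var/run/utmp")
	if err != nil {
		log.Fatal(err)
	}
	// Walk the file one fixed-size record at a time.
	for off := 0; off+recordSize <= len(data); off += recordSize {
		rec := data[off : off+recordSize]
		name := cString(rec[nameOffset : nameOffset+fieldLen])
		line := cString(rec[lineOffset : lineOffset+fieldLen])
		if name != "" {
			fmt.Println(name, line)
		}
	}
}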

Why can't I rename a file that is in use? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I just wonder why I can't rename a file that is opened, or in use by another program.
What is the purpose of that?
The question is based on a false premise: you most certainly can rename a file that is in use on the common file systems used on Windows. There is very little a process can do to prevent this, short of changing the ACL on the file to deny access, and that is exceedingly rare.
Locking a file protects the file data, not the file metadata.
This feature has many uses; most notably, the ReplaceFile() winapi function depends on it, which is the way a program can save a file even if another process has it locked.
The one thing you cannot do is rename the file to move it to a different drive, because that requires much more work than simply altering or moving the directory entry of the file: it also requires copying the file data from one drive to another, and that of course is going to fail when the file data is locked.
Because the file is currently in use, you cannot change the name of the file. When a file is opened, a process for it is created, and you cannot change the name of the process at runtime.
Hope the question is perfectly answered.
It's a design decision resulting in less complicated behavior. When a file F is opened by process A, you must assume A works with the name of F as useful information, e.g. displays it to the user, passes it around to some other process, stores it in configuration, an MRU list, whatever, etc. Hence if process B renamed F, process A would then be working with invalid information. Hence it is generally safer to disallow such manipulations.

idea for practice with shell scripting [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm looking for a shell script idea to practice shell scripting with. Can you please suggest intermediate ideas to work on?
I'm a developer and I prefer working on an idea that deals with files.
For shell scripting, think of a task that you do frequently - and think how you would automate that task.
You can start off with a basic script that just about does what you need. Then you realize that there are small variations on the task, and you start to allow the script to handle those. And it gently becomes more complex.
Almost all of the scripts I have (some hundreds of them) started off as "I've done that before; how can I avoid having to do it again?".
Can you give an example?
No - because I don't know what tasks you do sufficiently often to be (minor) irritants that could be salved by writing a script.
Yes - because I've got scripts that I wrote incrementally, in an attempt to work around some issue or other in my environment.
One task that I'm working on - still a work in progress - is:
Identify duplicate files
Starting at some nominated directory (default, $HOME), find all the files, and for each file, establish a checksum (MD5, SHA1, SHA256 - it is not critical which) for the file; record the file name and checksum (and maybe device number and inode number).
Establish which checksums are repeated - hence identifying identical files.
Eliminate the unique checksums.
Group the duplicate files together with appropriate identifying information.
This much is fairly easy - it requires some medium-grade shell scripting and you might have to find a command to generate the checksum (but you might be OK with sum or cksum, though neither of those reaches even the level of MD5). I've done this in both shell and Perl.
The hard part, where I've not yet gotten a good solution, is then dealing with the duplicates. I have some 8,500 duplicated hashes, with about 27,000 file names in total. Some of the duplicates are images like smileys used in chat transcripts; there are a lot of copies of that particular image. Others are duplicate PDF files collected from various machines at various times; I need to organize them so I have one copy of the file on disk, with perhaps links in the other locations. But some of the other locations should go: they were convenient ways to get the material from retired machines onto my current machine.
I have not yet got a good solution to the second part.
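As a sketch of the easier first part, and staying in Go (the language of the main question on this page) rather than shell or Perl, the walk-checksum-group step might look like the following. The choice of SHA-256 and the reporting format are my own:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Default to $HOME, or start at the directory given on the command line.
	root := os.Getenv("HOME")
	if len(os.Args) > 1 {
		root = os.Args[1]
	}

	groups := make(map[string][]string) // checksum -> file names

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.Type().IsRegular() {
			return nil // skip unreadable entries, directories, symlinks, etc.
		}
		f, err := os.Open(path)
		if err != nil {
			return nil
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return nil
		}
		sum := hex.EncodeToString(h.Sum(nil))
		groups[sum] = append(groups[sum], path)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	// Report only the checksums that occur more than once, i.e. the duplicates.
	for sum, files := range groups {
		if len(files) > 1 {
			fmt.Println(sum)
			for _, f := range files {
				fmt.Println("\t" + f)
			}
		}
	}
}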
Here are two scripts from my personal library. They are simple enough not to require a full-blown programming language, but aren't trivial, particularly if you aim to get all the details right (support all flags, return the same exit code, etc.).
cvsadd
Write a script to perform a recursive cvs add so you don't have to manually add each sub-directory and its files. Make it so it detects the file types and adds the -kb flag for binary files as needed.
For bonus points: Allow the user to optionally specify a list of directories or files to restrict the search to. Handle file names with spaces correctly. If you can't figure out if a file is text or binary, ask the user.
#!/bin/bash
#
# Usage: cvsadd [FILE]...
#
# Mass `cvs add' script. Adds files and directories recursively, automatically
# figuring out if they are text or binary. If no file names are specified, looks
# for unversioned files and directories in the current directory.
svnfind
Write a wrapper around find which performs the same job, recursively finding files matching arbitrary criteria, but ignores .svn directories.
For bonus points: Allow other actions besides the default -print. Support the -H, -L, and -P options. Don't erroneously filter out files which simply happen to contain the substring .svn. Make usage identical to the regular find command.
#!/bin/bash
#
# Usage: svnfind [-H] [-L] [-P] [path...] [expression]
#
# Attempts to behave identically to a plain `find' command while ignoring .svn
# directories. Usage is identical to `find'.
You could try some simple CGI scripting. It can be done in shell and involves a lot of here documents, parsing and extracting of form values, a bit of escaping and whatever you want to do as payload. (I do not recommend exposing such a script to the hostile internet, though.)
