How to extract specific lines from a huge data file?

I have a very large data file, about 32GB. It consists of about 130k lines, each of which mainly contains numbers but also a few other characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, a possible approach is to first extract small blocks of the original file (so that each block contains one or more of the lines to extract) and then use a standard editor to find the lines within each block. In that case the question becomes: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split, which works very well with large files, but it can only split by size, not by line.

Install[1] the Babun shell (or Cygwin, but I recommend Babun), then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun actually just means unzipping it somewhere, so you don't need administrator rights on the server.
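As a sketch, with placeholder line numbers and a hypothetical bigfile.txt, sed can print exactly the wanted lines and quit at the last one, so it never reads the rest of the 32GB; the fallback idea of splitting by line count is also covered directly by the split tool:

# -n suppresses default output; each 'Np' prints one wanted line, and
# 'q' after the last wanted line stops sed from reading any further.
sed -n '1500p; 27000p; 115200{p;q}' bigfile.txt > extracted.txt
# Split by line count instead of size: 10000-line chunks named chunk_aa, chunk_ab, ...
split -l 10000 bigfile.txt chunk_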

Related

Archiving differences between a time sequence of text files

There is a sensor network from which I download measurements every ten minutes or on demand. Each download is a text file consisting of several lines, each with a timestamp and values. The name of the text file also contains a timestamp of when the download occurred, so as time progresses I collect a lot of text files, which form a sequence. Because of the physical parameters the values are taken from, there are little to no differences between adjacent text files.
I want to archive all of the downloaded text files into a (compressed) file in an efficient way, so I thought that archiving the differences between adjacent text files would be one such way.
I would like some ideas for working this out in bash, using well-known tools like tar and diff. I also know about git, but it is not useful for creating an archive file.
I will try to clarify a bit. Each text file consists of several lines of the following space-separated format:
timestamp sensor_uuid value_1 ... value_N
Not every line has exactly the same number of values (say N), but there is little variation in tokens per line. The values themselves also vary little over time. Since they come from sensors, with a single sensor per line, the number of lines in a text file depends on how many responses I got for each call; zero lines is possible.
Finally, the filename carries its own timestamp: a concatenation of a base name with a date-time string:
sensors_2019-12-11_153043.txt for today’s 15:30:43 request.
Needless to say, the timestamps in the lines of such a file are usually earlier than the one in its filename, and lines and timestamps may even be repeated from text files created before.
So my idea for efficient archiving is to put the first text file into the archive and then add only the updates, i.e. the differences between adjacent text files, each tracing back eventually to the one text file actually archived in full. But on retrieval I need to get a complete text file back, as if it had been archived itself rather than as a difference from the past.
Tar takes in the whole text files, and a few differing lines between them do not produce a repeating pattern suitable for strong compression.
A compressed tar archive already identifies repeating patterns and compresses them (the compression is done by gzip or xz, which tar invokes). But if you want to eliminate the repeated parts yourself, you can use the diff command with some simple manipulation of its output, and then redirect everything to a file.
Say we have two files, file1.txt and file2.txt. You can use this pipeline to get only the lines added in the second file (file2.txt); the second grep filters out the +++ header line that diff prints:
diff -u file1.txt file2.txt | grep -E '^\+' | grep -vE '^\+\+\+' | sed -E 's/^\+//'
Then we just need to redirect the output either to the same file (e.g. file2.txt) or to another file, and delete file2.txt before the tar operation.
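A minimal sketch of the whole archive/restore scheme described in the question, using diff together with patch (the archive/ directory name is arbitrary; the files must be processed in timestamp order):

# Archive: keep the first snapshot whole, then store only unified diffs.
mkdir -p archive
prev=""
for f in sensors_*.txt; do
  if [ -z "$prev" ]; then
    cp "$f" archive/"$f"                       # first file stored in full
  else
    diff -u "$prev" "$f" > archive/"$f.diff"   # later files as deltas only
  fi
  prev="$f"
done
tar -czf sensors_archive.tar.gz archive/
# Restore: copy the full first snapshot, then apply each .diff to it in
# timestamp order with 'patch restored.txt archive/<name>.diff' until the
# wanted file is reconstructed.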

Maximum number of input files for Ghostscript (gs)

I simply want to combine multiple EPS files into one big file using the gs command.
The command works flawlessly except when I specify more than 20 input files.
Somehow the command ignores input files starting from the 21st.
Has anyone experienced the same behavior? Is there a cap on the number of input files specified anywhere?
I looked through the site and couldn't find one.
Sample command:
gs -o output.eps -sDEVICE=eps2write file1.eps file2.eps .... file21.eps
Thank you.
Edit: added sample command
Almost certainly you have simply reached the maximum command-line length for your operating system. You can use Ghostscript's @ syntax to supply a file containing the command line instead.
https://www.ghostscript.com/doc/current/Use.htm#Input_control
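A sketch of that mechanism (args.txt is an arbitrary name): put the whole argument list in a plain text file, one argument per line, and pass the file with the @ prefix, so the operating system's command-line limit no longer applies:

# Newlines in the @-file count as argument separators; check that the
# glob expands in the order you want (file10.eps sorts before file2.eps).
printf '%s\n' -o output.eps -sDEVICE=eps2write file*.eps > args.txt
gs @args.txt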
Note that the EPS files will not be placed appropriately using that command, and this does not actually combine EPS files; it creates a new EPS file whose marking content should be the same as the input(s).
If you actually want to combine the EPS files it's easy enough, but it will require a small amount of programming to parse the EPS file headers and produce appropriate scale/translate operations, as well as stripping off any bitmap previews (which will also happen when you run them through Ghostscript).

Merge two text files

I'm using Windows and Notepad++ to work with separate txt files. I have 2 files that I have to merge side by side, line by line, for my data analysis.
Here is the example:
file1.txt
Abcdefghijk
abcdefghijk
file2.txt
123456
123456
then the output I want is like this:
Abcdefghijk123456
abcdefghijk123456
in a new output file. Does anybody here know how to do this?
Your question is answered here by TheMadTechnician. Using PowerShell, you read both source files (1 and 2) as arrays of lines. Then comes a simple loop: merge line x from file1 with line x from file2, for as long as file1 has lines.
Unfortunately it's impossible with pure cmd.
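If the Unix tools from the Babun/Cygwin setup recommended in the first answer are available, there is also a one-line alternative to the PowerShell loop: paste joins the two files line by line.

# '\0' in the delimiter list means "no separator" (POSIX), so each pair
# of corresponding lines is simply concatenated.
paste -d '\0' file1.txt file2.txt > merged.txt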
@riki: you could also write a batch program to do this programmatically. There should probably be no limit on the number of lines.
It may depend on the number of lines you have in each file. I suggest copy-pasting if it is fewer than 50 lines.
Otherwise, use a more powerful language like Python, C, or PHP, and run it before performing the data analysis.
There is a free utility you can download and run on your computer, called txtcollector. I read about it here. I used it because I had a whole folder of files to concatenate. It was a breeze. The only slight imperfection I noticed was that I couldn't paste in the path to the specific folder in the first step (choosing the folder where the files to be concatenated were). However, I could do this when choosing where to save the result.

Script to remove parts of a filename

I am looking for a way to remove parts of my file names (in a big folder). I don't want to rename the files themselves; I merely want the edited file names as output in a text document or on the clipboard.
They all follow a similar pattern. The initial part of each file name is randomized by the system. I am not sure what to use to accomplish this. Here is an example filename:
1231230#p9999_w_e_aa.jpg
I want to extract the 9999 part (the part between the #p and the first underscore).
The machine I'm currently working from is running Windows 7.
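A sketch of one way to do this, assuming a Unix-style shell such as the Babun/Cygwin setup recommended earlier (extracted.txt is an arbitrary output name):

# For every .jpg, print only the text between '#p' and the first
# underscore that follows it (9999 in the example above).
for f in *.jpg; do
  echo "$f" | sed -E 's/^.*#p([^_]*)_.*$/\1/'
done > extracted.txt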

Performing conditional change in .ods file in bash

How can I make a conditional change in an .ods document? I have two columns: one stores a string and the second a value. I want to search the document for a particular string, say "xyz". If this matches any of the strings in the first column, I would like a value of 1 to be deducted from the cell in the same row but in the second column. The data in the .ods document are separated into adjoining cells (so a tab?).
As an example, consider the following:
xyz 23
xxy 42
xzz 76
If I have the string "xxy", I would like the bash script to update the .ods file such that it looks as so:
xyz 23
xxy 41
xzz 76
Now, the strings that I am searching for are stored in a separate .txt file. I would like to iterate over all of the strings in the .txt file and repeatedly perform the described operation on the .ods file. There can be cases where there are multiple occurrences of the same string. Any help with this?
This should be a comment, but it's a bit long.
"…the strings that I am searching for are stored in a text file."
No. An MS Excel file is not a text file. It's not even a simple file but rather an embedded filesystem, where content is encapsulated in OLE or, more recently, as an XML tree. While there are both OLE and XML parsers available on Unix (I assume you want to run this on Linux/Unix/POSIX, since you've tagged this with bash, awk and sed), that just gets you access to where the data is stored. You still need a detailed understanding of the file format to be able to make changes. While it may be possible to do this in bash, it would be a lot easier in a dedicated programming language. Several come with libraries for processing Excel files, but they vary in their support for file formats. Alternatively you could load it up in OpenOffice using its UNO API.
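If the sheet is first exported to a tab-separated text file, the update itself fits in a few lines of awk, and the result can then be re-imported into the .ods. A sketch assuming hypothetical names data.tsv (the exported sheet) and targets.txt (one search string per line); a string repeated in targets.txt deducts once per occurrence:

# Pass 1 (targets.txt): count each search string.
# Pass 2 (data.tsv): subtract that count from column 2 of matching rows.
awk 'NR==FNR { cnt[$1]++; next }
     ($1 in cnt) { $2 -= cnt[$1] }
     { print }' OFS='\t' targets.txt data.tsv > updated.tsv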
