I have a large block of binary; take this as an example:
000110101100101001110001010101110010101010110101
(Not sure if the example is a multiple of 8 but...)
I'd like to split this block of text into 8 bit chunks, and output it to a file line by line, i.e. like:
00011010
11001010
01110001
etc...
Apologies if this is really simple. I've attempted to use 'split' but can't get the syntax right, and I'd ideally like to do this in bash. Thanks.
Try this with grep:
grep -Eo '.{8}' file > newfile
Output to newfile:
00011010
11001010
01110001
01010111
00101010
10110101
Same output to newfile with fold from GNU Core Utilities:
fold -w 8 file > newfile
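And since the question mentions ideally doing this in bash: a minimal pure-bash sketch (assuming the digits sit on a single line in file, as above) would be to walk the string in 8-character slices:
# read the single line and print it in 8-character slices
line=$(<file)
for ((i = 0; i < ${#line}; i += 8)); do
    printf '%s\n' "${line:i:8}"
done > newfile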
My requirement is to chop off the header and trailer records from a large file. I'm using a file of size 2.5 GB with 1.8 million records. To do so, I'm executing:
head -n $((count-1)) largeFile | tail -n $((count-2)) > outputFile
Whenever I select count >= 725,000 records (size = 1,063,577,322), the command returns an error:
tail:unable to malloc memory
I assumed that the pipe buffer had filled up and tried:
head -n 1000000 largeFile | tail -n 720000 > outputFile
which should also fail, since I'm passing count > 725,000 to head, but it generated the output.
Why is that so? head generates the same amount of data (or more) in both cases, so both commands should fail, yet whether it fails depends on the tail count. Isn't it the case that head first writes into the pipe and then tail uses the pipe as input? If not, how is parallelism supported here, since tail works from the end, which isn't known until head completes execution? Please correct me; I've assumed a lot of things here.
PS: For the time being I've used grep to remove header and trailer. Also, ulimit on my machine returns:
pipe (512 byte) 64 {32 KB}
Thanks guys...
Just do this instead:
awk 'NR>2{print prev} {prev=$0}' largeFile > outputFile
It'll only store one line in memory at a time, so there's no need to worry about memory issues.
Here's the result:
$ seq 5 | awk 'NR>2{print prev} {prev=$0}'
2
3
4
I did not test this with a large file, but it will avoid a pipe.
sed '1d;$d' largeFile > outputFile
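Quick check with the same seq test as above:
$ seq 5 | sed '1d;$d'
2
3
4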
Ed Morton and Walter A have already given workable alternatives; I'll take a stab at explaining why the original is failing. It's because of the way tail works: tail will read from the file (or pipe), starting at the beginning. It stores the last lines seen, and then when it reaches the end of the file, it outputs the stored lines. That means that when you use tail -n 725000, it needs to store the last 725,000 lines in memory, so it can print them when it reaches the end of the file. If 725,000 lines (about 1 GB of the 2.5 GB file, going by the size quoted in the question) won't fit in memory, you get a malloc ("memory allocate") error.
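If it helps to visualize that, here is a rough awk sketch of the same buffering idea (not tail's actual implementation, just an illustration of why memory use grows with the line count):
# remember only the last n lines in a rotating buffer, print them at end of input
awk -v n=725000 '
    { buf[NR % n] = $0 }
    END { for (i = NR - n + 1; i <= NR; i++) if (i > 0) print buf[i % n] }
' largeFile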
Solution: use a process that doesn't have to buffer most of the file before outputting it, as both Ed and Walter's solutions do. As a bonus, they both trim the first line in the same process.
I'm trying to perform simple literal search/replace on a large (30G) one-line file, using sed.
I would expect this to take some time but, when I run it, it returns after a few seconds and, when I look at the generated file, it's zero length.
input file has 30G
$ ls -lha Full-Text-Tokenized-Single-Line.txt
-rw-rw-r-- 1 ubuntu ubuntu 30G Jun 9 19:51 Full-Text-Tokenized-Single-Line.txt
run the command:
$ sed 's/<unk>/ /g' Full-Text-Tokenized-Single-Line.txt > Full-Text-Tokenized-Single-Line-No-unks.txt
the output file has zero length!
$ ls -lha Full-Text-Tokenized-Single-Line-No-unks.txt
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun 9 19:52 Full-Text-Tokenized-Single-Line-No-unks.txt
Things I've tried
running the very same example on a shorter file: works
using -e modifier: doesn't work
escaping "<" and ">": doesn't work
using a simple pattern line ('s/foo/bar/g') instead: doesn't work: zero-length file is returned.
EDIT (more information)
return code is 0
sed version is (GNU sed) 4.2.2
Just use awk, it's designed for handling records separated by arbitrary strings. With GNU awk for multi-char RS:
awk -v RS='<unk>' '{ORS=(RT?" ":"")}1' file
The above splits the input into records separated by <unk>, so as long as enough <unk>s are present in the input, the individual records will be small enough to fit in memory. It then prints each record followed by a blank char, so the overall effect on the data is that all <unk>s become blank chars.
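For example, on a tiny made-up input (GNU awk):
$ printf 'foo<unk>bar<unk>baz' | awk -v RS='<unk>' '{ORS=(RT?" ":"")}1'
foo bar baz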
If that direct approach doesn't work for you THEN it'd be time to start looking for alternative solutions.
With line-based editors like sed you can't expect this to work, since their unit of work (the record) is a line terminated by a line break.
One suggestion, if you have whitespace in your file (to keep the searched pattern from being split across lines), is to use:
fold -s file_with_one_long_line |
sed 's/find/replace/g' |
tr -d '\n' > output
PS: fold's default width is 80; if you have words longer than 80 characters, add -w 1000 (or at least the longest word size) to prevent word splitting.
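A small illustration of the idea, using X as the replacement just so the change is visible (the real command would substitute a space):
$ printf 'aaa <unk> bbb <unk> ccc' | fold -s -w 20 | sed 's/<unk>/X/g' | tr -d '\n'
aaa X bbb X ccc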
Officially, GNU sed has no line-length limit:
http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq6_005.html
However, the page states that:
"no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.
I tried running sed on a 7 GB single-line file and could reproduce the same issue.
This page https://community.hpe.com/t5/Languages-and-Scripting/Sed-Maximum-Line-Length/td-p/5136721 suggests using perl instead:
perl -pe 's/start=//g;s/stop=//g;s/<unk>/ /g' file > output
If the tokens are space-delimited (not all whitespace) and assuming you are only matching single words, then you could use perl with a space as the record separator:
perl -040 -pe 's/<unk>/ /' file
or GNU awk to match all whitespace
awk -v RS='[[:space:]]' '{ORS=RT; sub(/<unk>/," ")}1' file
From this question, I found the split utilty, which takes a file and splits it into evenly sized chunks. By default, it outputs these chunks to new files, but I'd like to get it to output them to stdout, separated by a newline (or an arbitrary delimiter). Is this possible?
I tried cat testfile.txt | split -b 128 - /dev/stdout
which fails with the error split: /dev/stdoutaa: Permission denied.
Looking at the help text, it seems this tells split to use /dev/stdout as a prefix for the filename, not to write to /dev/stdout itself. It does not indicate any option to write directly to a single file with a delimiter. Is there a way I can trick split into doing this, or is there a different utility that accomplishes the behavior I want?
It's not clear exactly what you want to do, but perhaps the --filter option to split will help out:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
Maybe you can use that directly. For example, this will read a file 10 bytes at a time, passing each chunk through the tr command:
split -b 10 --filter "tr '[:lower:]' '[:upper:]'" afile
If you really want to emit a stream on stdout that has separators between chunks, you could do something like:
split -b 10 --filter 'dd 2> /dev/null; echo ---sep---' afile
If afile is a file in my current directory that looks like:
the quick brown fox jumped over the lazy dog.
Then the above command will result in:
the quick ---sep---
brown fox ---sep---
jumped ove---sep---
r the lazy---sep---
dog.
---sep---
From the info page:
`--filter=COMMAND'
With this option, rather than simply writing to each output file,
write through a pipe to the specified shell COMMAND for each
output file. COMMAND should use the $FILE environment variable,
which is set to a different output file name for each invocation
of the command.
split -b 128 --filter='cat ; echo ' inputfile
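For example, reusing the afile sample from the earlier answer with a 16-byte chunk size:
$ split -b 16 --filter='cat ; echo ' afile
the quick brown 
fox jumped over 
the lazy dog.

The trailing blank line comes from the file's own final newline plus the extra newline added by echo.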
Here is one way of doing it: each 128-character chunk is read into the variable var.
You may use your preferred delimiter when printing it, or use it for further processing.
#!/bin/bash
# read the input 128 characters at a time and print each chunk on its own line
while IFS= read -r -n 128 var ; do
    printf '%s\n' "$var"
done < yourTextFile
You may use it as below at the command line:
while IFS= read -r -n 128 var ; do printf '%s\n' "$var" ; done < yourTextFile
No, the utility will not write anything to standard output. The standard specification of it says specifically that standard output is not used.
If you used split, you would need to concatenate the created files, inserting a delimiter in between them.
If you just want to insert a delimiter every N th line, you may use GNU sed:
$ sed '0~3a\-----' file
This inserts a line containing ----- every 3rd line.
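For example:
$ seq 6 | sed '0~3a\-----'
1
2
3
-----
4
5
6
-----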
To divide the file into chunks, separated by newlines, and write to stdout, use fold:
cat yourfile.txt | fold -w 128
...will write to stdout in "chunks" of 128 chars.
I have some dump files called dump_mydump_0.cfg, dump_mydump_250.cfg, ..., all the way up to dump_mydump_40000.cfg. For each dump file, I'd like to take the 16th line out, read them, and put them into one single file.
I'm using sed, but I came across some syntax errors. Here's what I have so far:
for lineNo in 16 ;
for fileNo in 0,40000 ; do
sed -n "${lineNo}{p;q;}" dump_mydump_file${lineNo}.cfg >> data.txt
done
Considering your files are named with intervals of 250, you should get it working using:
for lineNo in 16; do
for fileNo in {0..40000..250}; do
sed -n "${lineNo}{p;q;}" dump_mydump_file${fileNo}.cfg >> data.txt
done
done
Note both the bash syntax corrections (do, done, and {0..40000..250}) and the input file name, which should depend on ${fileNo} instead of ${lineNo}.
Alternatively, with (GNU) awk:
awk "FNR==16{print;nextfile}" dump_mydump_{0..40000..250}.cfg > data.txt
(I used the filenames as shown in the OP as opposed to the ones which would have been generated by the bash for loop, if corrected to work. But you can edit as needed.)
The advantage is that you don't need the for loop, and you don't need to spawn 160 processes. But it's not a huge advantage.
This might work for you (GNU sed; -s treats each input file separately, so the w command writes line 16 of every file into data.txt):
sed -ns '16wdata.txt' dump_mydump_{0..40000..250}.cfg
I have a nearly 3 GB file that I would like to add two lines to the top of. Every time I try to manually add these lines, vim and vi freeze up on the save (I let them try to save for about 10 minutes each). I was hoping that there would be a way to just append to the top, in the same way you would append to the bottom of the file. The only things I have seen so far however include a temporary file, which I feel would be slow due to the file size.
I was hoping something like:
grep -top lineIwant >> fileIwant
Does anyone know a good way to append to the top of the file?
Try
cat file_with_new_lines file > newfile
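Spelled out with the names from the question (file_with_new_lines is just a small helper file holding the two header lines; the contents below are made up), the whole sequence would look something like:
$ printf 'first new line\nsecond new line\n' > file_with_new_lines
$ cat file_with_new_lines fileIwant > fileIwant.new && mv fileIwant.new fileIwant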
I did some benchmarking to compare using sed with in-place edit (as suggested here) to cat (as suggested here).
~3GB bigfile filled with dots:
$ head -n3 bigfile
................................................................................
................................................................................
................................................................................
$ du -b bigfile
3025635308 bigfile
File newlines with two lines to insert on top of bigfile:
$ cat newlines
some data
some other data
$ du -b newlines
26 newlines
Benchmark results using dumbbench v0.08:
cat:
$ dumbbench -- sh -c "cat newlines bigfile > bigfile.new"
cmd: Ran 21 iterations (0 outliers).
cmd: Rounded run time per iteration: 2.2107e+01 +/- 5.9e-02 (0.3%)
sed with redirection:
$ dumbbench -- sh -c "sed '1i some data\nsome other data' bigfile > bigfile.new"
cmd: Ran 23 iterations (3 outliers).
cmd: Rounded run time per iteration: 2.4714e+01 +/- 5.3e-02 (0.2%)
sed with in-place edit:
$ dumbbench -- sh -c "sed -i '1i some data\nsome other data' bigfile"
cmd: Ran 27 iterations (7 outliers).
cmd: Rounded run time per iteration: 4.464e+01 +/- 1.9e-01 (0.4%)
So sed seems to be way slower (80.6% slower than sed with redirection, and roughly twice as slow as cat) when doing an in-place edit on a large file, probably because it has to move the intermediary temp file over the original file afterwards. Using I/O redirection, sed is only 11.8% slower than cat.
Based on these results I would use cat as suggested in this answer.
Try doing this:
Using sed:
sed -i '1i NewLine' file
Or using ed:
ed -s file <<EOF
1i
NewLine
.
w
q
EOF
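A quick sanity check of the sed form on a throwaway file:
$ printf 'existing first line\n' > file
$ sed -i '1i NewLine' file
$ cat file
NewLine
existing first line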
The speed of such an operation depends greatly on the underlying file system. To my knowledge there isn't a file system optimized for this particular operation. Most file systems organize files using full disk blocks, except for the last one, which may be only partially used by the end of the file. Indeed, a file of size N takes N/S blocks, where S is the block size, plus one more block for the remaining part of the file (of size N%S, % being the remainder operator) if N is not divisible by S. For example, with a 4096-byte block size, a 10,000-byte file occupies two full blocks plus a third block holding the remaining 10000 % 4096 = 1808 bytes.
Usually, these blocks are referenced by their indices on the disk (or partition), and these indices are stored within the FS metadata, attached to the file entry which allocates them.
From this description, you can see that it could be possible to prepend content whose size is a multiple of the block size just by updating the metadata with the new list of blocks used by the file. However, if the prepended content doesn't fill a whole number of blocks exactly, the existing data would have to be shifted by the excess amount.
Some file systems may allow partially used blocks within the list of a file's blocks (and not only as the last entry), but this is not a trivial thing to do.
See these other SO questions for further details:
Prepending Data to a file
Is there a file system with a low level prepend operation
At a higher level, even if that operation is supported by the FS driver, it is still possible that programs don't use the feature.
For the instance of that problem you are trying to solve, the best way is probably a program capable of concatenating the new content and the existing content into a new file.
cat file
Unix
linux
This appends two lines to the file at the same time, using the command:
sed -i '1a C \n java ' file
cat file
Unix
C
java
linux
Use i if you want to insert before a line, a to append after it (as above), and c to replace it.
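Since the original question was about adding lines to the top, the i command is what does that. A quick GNU sed check, starting again from the original two-line file:
$ printf 'Unix\nlinux\n' > file
$ sed '1i C\njava' file
C
java
Unix
linux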