Remove first line from bigfile using bash

I have a text file and want to remove the first line (the header) so that I can read the file without it into a pipeline. This seems like a trivial question that has been answered many times, but because of the size of my files, the solutions I have found so far do not work. For my test runs I used
echo "$(tail -n +2 "$FILE_NAME")" > "$FILE_NAME"
but running this on a bigger file results in the following error:
bash: xrealloc: cannot allocate 18446744071562067968 bytes (1679360 bytes allocated)
Is there any method that edits the file in place? Loading the files into memory won't work; some of them are up to 400 GB in size.
Thanks for the help!

You can use awk like this:
awk 'NR!=1 {print}' input_file > output_file
This writes everything except the first line to output_file. To feed the result straight into your operations, use this construction:
awk 'NR!=1 {print}' input_file | operation1 | operation2 ...
Alternatively, changing your command this way also does the job:
tail -n +2 "$FILE_NAME" > "${FILE_NAME}.new"
This needs twice the disk space, because the original and the new file exist side by side.

tail is reasonably efficient for this operation.
The issue is that you want to overwrite the original file.
Using bash "$()" to defer the creation of the output file means bash has to hold the entire content in memory, hence the error message. For large files you are better off writing the output to a temporary file and then using mv to move it over the original.
When sed is used in in-place mode (-i), it does exactly this: it writes to a temporary file and then renames that file over the original.
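The temporary-file approach described above could look roughly like the sketch below. The function name strip_first_line is made up for illustration, and the sketch assumes that mktemp is available and that the directory holding the file has room for a second copy.

```shell
# Hypothetical helper: remove the first line of a file by streaming the
# remainder into a temporary file and renaming it over the original.
# The function body runs in a subshell, so its variables stay local.
strip_first_line() (
    file=$1
    # Create the temp file next to the original so that mv is a rename
    # on the same filesystem rather than a second full copy.
    tmp=$(mktemp "${file}.XXXXXX") || exit 1
    if tail -n +2 -- "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        rm -f -- "$tmp"
        exit 1
    fi
)
```

Called as strip_first_line "$FILE_NAME", this never holds more than tail's buffer in memory. Note that the replacement file gets the permissions mktemp assigns, so you may need to restore the original mode afterwards.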

sed -i 1d "$FILE_NAME"
This runs sed with the very simple script 1d, which selects the first line (the 1 address) and deletes it (the d command). Thanks to the in-place option -i, your file is overwritten without you having to handle an intermediate file yourself.
Even though you do not deal with an intermediate file, sed uses its own intermediate file internally, so disk usage can reach up to twice the file size during the operation.
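One way to see that intermediate file at work is to compare the inode number of the file before and after the edit. The sketch below assumes GNU sed (BSD/macOS sed expects -i '' instead), and the function name show_inode_change is made up for illustration.

```shell
# Sketch, assuming GNU sed. Prints the inode number of a file before
# and after `sed -i 1d`. A different inode afterwards means sed wrote
# a new file and renamed it over the old one.
show_inode_change() (
    file=$1
    before=$(ls -i -- "$file" | awk '{print $1}')
    sed -i 1d -- "$file" || exit 1
    after=$(ls -i -- "$file" | awk '{print $1}')
    echo "inode before: $before, after: $after"
)
```

The two numbers differ because the temporary file is created while the original still exists, so the two cannot share an inode.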

I'm just going to address the "edit the file in place" portion of the question, although it appears that was not really what you were looking for. You will find many solutions describing features that claim to do in-place editing, but usually those solutions don't actually edit the file at all. Instead, they write to a temporary file and then replace the original with the temporary file (e.g., sed --in-place is a common solution that writes to a temporary file). Editing a file in place is something you almost never actually want to do, since mutating a file is dangerous. Truly, if you believe you want to edit a file in place, give it serious thought and assume that you are wrong. However, if for some reason you really do need to do it, it's probably safest to do it explicitly:
#include <err.h>
#include <stdio.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>
FILE *xfopen(const char *path, const char *mode);
int is_regular(int, const char *);
int
main(int argc, char **argv)
{
    const char *rpath = argc > 1 ? argv[1] : "stdin";
    const char *wpath = argc > 1 ? argv[1] : "stdout";
    /* Open the same file twice: the reader always runs ahead of the writer */
    FILE *fr = argc > 1 ? xfopen(rpath, "r") : stdin;
    FILE *fw = argc > 1 ? xfopen(wpath, "r+") : stdout;
    char buf[BUFSIZ];
    int c;
    size_t rc;
    off_t length = 0;
    /* Discard the first line */
    while( (c = getc(fr)) != EOF && c != '\n' ) {
        ;
    }
    /* Copy the rest of the file over its own beginning */
    if( c != EOF ) while( (rc = fread(buf, 1, BUFSIZ, fr)) > 0 ) {
        size_t wc;
        wc = fwrite(buf, 1, rc, fw);
        length += wc;
        if( wc != rc ) {
            /* Short write: report it instead of truncating silently */
            err(EXIT_FAILURE, "%s", wpath);
        }
    }
    if( fclose(fr) ) {
        err(EXIT_FAILURE, "%s", rpath);
    }
    /* Flush buffered output before cutting the file to its new length */
    if( fflush(fw) ) {
        err(EXIT_FAILURE, "%s", wpath);
    }
    if( is_regular(fileno(fw), wpath) && ftruncate(fileno(fw), length) ) {
        err(EXIT_FAILURE, "%s", wpath);
    }
    if( fclose(fw) ) {
        err(EXIT_FAILURE, "%s", wpath);
    }
    return EXIT_SUCCESS;
}
FILE *
xfopen(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if( fp == NULL ) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    return fp;
}
int
is_regular(int fd, const char *name)
{
    struct stat s;
    if( fstat(fd, &s) == -1 ) {
        perror(name);
        exit(EXIT_FAILURE);
    }
    /* S_ISREG checks the file type; testing the S_IFREG bit alone also matches sockets */
    return S_ISREG(s.st_mode);
}
By being explicit, it's pretty clear that you can easily lose data in the file. But if you want to avoid reading the entire file into memory, or avoid having two copies on some backing media at the same time, there's no way to avoid doing that and any solution which obscures that risk is fooling you. So making it explicit and knowing where the dangers lie is the right thing to do.
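For comparison, the same overwrite-then-truncate idea can be sketched in shell. This is not the method from the answer above, only an illustration of it: the function name drop_first_line_in_place is made up, truncate is assumed to be the GNU coreutils tool, and the sketch relies on tail reading the file sequentially from the start, so the write position always stays behind the read position.

```shell
# Sketch of a true in-place rewrite; assumes GNU coreutils `truncate`.
# WARNING: if this is interrupted midway, the file is left corrupted.
drop_first_line_in_place() (
    file=$1
    total=$(wc -c < "$file") || exit 1
    header=$(head -n 1 -- "$file" | wc -c) || exit 1
    # 1<> opens the file for reading and writing WITHOUT truncating it,
    # so tail overwrites the file from offset 0 while reading ahead.
    tail -n +2 -- "$file" 1<> "$file" || exit 1
    # Cut off the stale bytes left over at the end of the file.
    truncate -s "$((total - header))" -- "$file"
)
```

No second copy is ever created, which is exactly why there is nothing to fall back on if the process dies halfway through.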

We can use the -i (in-place) option with sed to write the change back to the input file instead of printing the result to stdout:
sed -i '1d' FILE

Related

How does > /dev/null eat up output streams?

I've used /dev/null a lot in bash programming to send unnecessary output into a black hole.
For example, this command:
$ echo 'foo bar' > /dev/null
$
Will not echo anything. I've read that /dev/null is an empty file used to dispose of unwanted output through redirection. But how exactly does this disposal take place? I can't imagine /dev/null writing the content to a file and then immediately deleting that file. So what actually happens when you redirect to this file?

>/dev/null redirects the command's standard output to the null device, a special device that discards whatever is written to it.
It's all implemented via file_operations (drivers/char/mem.c if you're curious to look yourself):
static const struct file_operations null_fops = {
    .llseek = null_lseek,
    .read = read_null,
    .write = write_null,
    .splice_write = splice_write_null,
};
write_null is what's called when you write to /dev/null. It always returns the same number of bytes that you write to it:
static ssize_t write_null(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
    return count;
}
That's it. The buffer is just ignored.
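From the shell's side this behavior can be observed with a small sketch like the following (the function name null_demo is made up for illustration). Because the write reports every byte as written, the command sees a successful write. Reading from the device returns end-of-file immediately.

```shell
# Writing to /dev/null succeeds; reading from it yields no bytes.
null_demo() {
    if printf 'foo bar\n' > /dev/null; then
        echo "write: ok"
    fi
    # tr strips the padding some wc implementations add.
    echo "bytes read back: $(wc -c < /dev/null | tr -d ' ')"
}
```

Running null_demo prints "write: ok" followed by "bytes read back: 0".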

Regarding how the parameters to the read function is passed in simple char driver

I am new to driver programming and have started writing a simple char driver. I created a special file for it with mknod /dev/simple-driver c 250 0. When I type cat /dev/simple-driver, it shows the string "Hello world from Kernel mode!". I know that the function
static const char g_s_Hello_World_string[] = "Hello world tamil_vanan!\n\0";
static const ssize_t g_s_Hello_World_size = sizeof(g_s_Hello_World_string);
static ssize_t device_file_read(
    struct file *file_ptr
    , char __user *user_buffer
    , size_t count
    , loff_t *possition)
{
    printk( KERN_NOTICE "Simple-driver: Device file is read at offset = %i, read bytes count = %u", (int)*possition , (unsigned int)count );
    if( *possition >= g_s_Hello_World_size )
        return 0;
    if( *possition + count > g_s_Hello_World_size )
        count = g_s_Hello_World_size - *possition;
    if( copy_to_user(user_buffer, g_s_Hello_World_string + *possition, count) != 0 )
        return -EFAULT;
    *possition += count;
    return count;
}
gets called. It is mapped to (*read) in the file_operations structure of my driver. My question is how this function gets called and how parameters such as struct file, the char buffer, count, and the offset are passed, because I simply typed the cat command. Please elaborate on how this happens.

In Linux, almost everything is represented as a file. Whether a file is a regular file or a device file is recorded in its inode; a node created with mknod is a character special file that stores a major and a minor number.
For example, in your case, cat /dev/simple-driver opens the device node.
From the device node simple-driver the kernel retrieves the major and minor numbers.
From those numbers (primarily the major number) it finds the character driver registered for the device.
From the driver it uses the struct file_operations structure to find the read function, which is nothing but your read function:
static ssize_t device_file_read(struct file *file_ptr, char __user *user_buffer, size_t count, loff_t *possition)
user_buffer points to a buffer in the calling process that can hold up to count bytes. The driver must never copy more than count bytes, so it is good to keep the bounds check you already have.
The string is copied to user_buffer with copy_to_user, which verifies that the user-space address is valid and returns nonzero if some bytes could not be copied.
possition is 0 for the first read and is advanced by the number of bytes returned: *possition += count.
Once the read function returns, cat writes the buffer contents to stdout, which is nothing but your console.

cat uses the POSIX read() call provided by glibc. glibc puts the arguments on the stack or in registers (this depends on your hardware architecture) and switches to kernel mode through a system call. In the kernel the values are copied to the kernel stack, the file descriptor is looked up to find the corresponding struct file, and in the end your read function is called through its file_operations.
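The way count and the position interact can be observed from user space. The sketch below uses a regular file instead of the driver, which is an assumption made for illustration, as is the function name read_in_two_calls. Each dd issues a single read() with the requested count, and the kernel keeps the offset for the open file between the two calls.

```shell
# Two dd processes share one open file description, so the second
# read starts at the offset where the first one stopped.
read_in_two_calls() {
    {
        dd bs=5 count=1 2>/dev/null   # one read of up to 5 bytes at offset 0
        dd bs=6 count=1 2>/dev/null   # one read of up to 6 bytes at offset 5
    } < "$1"
}
```

For a file containing "Hello world", the first dd returns "Hello" and the second " world". A driver's read function sees the same pattern: the caller chooses count, and the position passed in reflects how many bytes earlier calls returned.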

Using sed to transform a C struct and typedef

I have a couple structure definitions in my input code. For example:
struct node {
int val;
struct node *next;
};
or
typedef struct {
int numer;
int denom;
} Rational;
I used the following line to convert them into one line and copy it twice.
sed '/struct[^(){]*{/{:l N;s/\n//;/}[^}]*;/!t l;s/ */ /g;p;p}'
the result is this:
struct node { int val; struct node *next;};
struct node { int val; struct node *next;};
struct node { int val; struct node *next;};
typedef struct { int numer; int denom;} Rational;
typedef struct { int numer; int denom;} Rational;
typedef struct { int numer; int denom;} Rational;
This is what I want:
1. I would like the first line to be restored to the original structure block.
2. I would like the second line to turn into a function heading that looks like this:
void init_structName( structName *var, int data1, int data2 )
- structName is the name of the structure.
- var is any name you like.
- data1, data2, ... are the members of the struct.
3. I would like the third line to turn into the function body, where I initialize the data members. It would look like this:
{
var->data1 = data1;
var->data2 = data2;
}
Keep in mind that ALL the struct definitions in my input file are placed on one line and copied three times. So when the code finds a structure definition, it can assume that there will be two more copies below.
For example, this is the output I want if the input file had the repeating lines shown above.
struct node {
int val;
struct node *next;
};
void init_node(struct node *var, int val, struct node *next)
{
var->val = val;
var->next = next;
}
typedef struct {
int numer;
int denom;
} Rational;
void init_Rational( Rational *var, int numer, int denom )
{
var->numer = numer;
var->denom = denom;
}
In case someone was curious. These functions will be called from the main function to initialize the struct variables.
Can someone help? I realize this is kind of tough.
Thanks so much!!

Seeing that sed is Turing Complete, it is possible to do it in a single go, but that doesn't mean that the code is very user friendly =)
My attempt at a solution would be:
#!/bin/sed -nf
/struct/b continue
p
d
: continue
# 1st step:
s/\(struct\s.*{\)\([^}]*\)\(}.*\)/\1\
\2\
\3/
s/;\(\s*[^\n}]\)/;\
\1/g
p
s/.*//
n
# 2nd step:
s/struct\s*\([A-Za-z_][A-Za-z_0-9]*\)\s*{\([^}]*\)}.*/void init_\1(struct \1 *var, \2)/
s/typedef\s*struct\s*{\([^}]*\)}\s*\([A-Za-z_][A-Za-z_0-9]*\)\s*;/void init_\2(\2 *var, \1)/
s/;/,/g
s/,\s*)/)/
p
s/.*//
n
# 3rd step
s/.*{\s*\([^}]*\)}.*/{\
\1}/
s/[A-Za-z \t]*[\* \t]\s*\([A-Za-z_][A-Za-z_0-9]*\)\s*;/\tvar->\1 = \1;\
/g
p
I'll try to explain everything I did, but firstly I must warn that this probably isn't very generalized. For example, it assumes that the three identical lines follow each other (ie. no other line between them).
Before starting, notice that the file is a script that requires the "-n" flag to run. This tells sed not to print anything to standard output unless the script explicitly tells it to (through the "p" command, for example). The "-f" option is a "trick" to tell sed to read its script from the file that follows. When the script is executed as "./myscript.sed", the system runs "/bin/sed -nf ./myscript.sed", so sed correctly reads the rest of the script.
Step zero would be just a check to see if we have a valid line. I'm assuming every valid line contains the word struct. If the line is valid, the script branches (jumps, the "b" command is equivalent to the goto statement in C) to the continue label (differently from C, labels start with ":", rather than ending with it). If it isn't valid, we force it to be printed with the "p" command, and then delete the line from pattern space with the "d" command. By deleting the line, sed will read the next line and start executing the script from the beginning.
If the line is valid, the actions to change the lines start. The first step is to generate the struct body. This is done by a series of commands.
Separate the line into three parts: everything up to the opening bracket, everything up to the closing bracket (but without including it), and everything from the closing bracket (now including it). I should mention that one of the quirks of sed is that we search for newlines with "\n", but write newlines with a "\" followed by an actual newline. That's why this command is split into three different lines. The escaped-newline form is what POSIX sed specifies; the GNU version (present in most Linux distributions) also accepts "\n" in the replacement.
Add a newline after every semicolon. The way this works is a bit awkward: we re-insert whatever follows the semicolon after a newline placed right behind the semicolon. The g flag tells sed to do this repeatedly, and that is why it works. Also note again the newline escaping.
Force the result to be printed
Before the second step, we manually clear the lines from the pattern-space (ie. buffer), so we can start fresh for the next line. If we did this with the "d" command, sed would start reading the commands from the start of the file again. The "n" command then reads the next line into the pattern-space. After that, we start the commands to transform the line into a function declaration:
We first match the word struct, followed by zero or more white space, then followed by a C identifier that can start with underscore or alphabetic letters, and can contain underscores and alphanumeric characters. The identifier is captured into the "variable" "\1". We then match the content between brackets, which is stored into "\2". These are then used to generate the function declaration.
We then do the same process, but now for the "typedef" case. Notice that now the identifier is after the brackets, so "\1" now contains the contents inside the brackets and "\2" contains the identifier.
Now we replace all semicolons with commas, so it can start looking more like a function definition.
The last substitute command removes the extra comma before the closing parenthesis.
Finally print the result.
Again, before the last step, manually clean the pattern-space and read the next line. The step will then generate the function body:
Match and capture everything inside the brackets. Notice the ".*" before the opening bracket and after the closing bracket. This is used so that only the contents of the brackets are written afterwards. When writing the output, we place the opening bracket on a separate line.
We match alphabetic characters and spaces, so we can skip the type declaration. We require at least a white space character or an asterisk (for pointers) to mark the start of the identifier. We then proceed to capture the identifier. This only works because of what follows the capture: we explicitly require that after the identifier there are only optional white spaces followed by a semicolon. This forces the expression to get the identifier characters before the semicolon, ie. if there are more than two words, it will only get the last word. Therefore it would work with "unsigned int var", capturing "var" correctly. When writing the output, we place some indentation, followed by the desired format, including the escaped newline.
Print the final output.
I don't know if I was clear enough. Feel free to ask for any clarifications.
Hope this helps =)
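The newline-escaping quirk mentioned in the explanation is easy to try in isolation. The following is a minimal sketch rather than part of the script above, and the function name split_members is made up for illustration.

```shell
# Put each struct member on its own line by inserting a newline after
# every semicolon. The replacement ends with a backslash followed by a
# real newline, which is the portable way to emit a newline in sed.
split_members() {
    sed 's/; */;\
/g'
}
```

Piping the text "int val; struct node *next;" through split_members puts "int val;" and "struct node *next;" on separate lines.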

This should give you a few tips on how inappropriate sed actually is for this sort of task. I couldn't figure out how to do it in one pass and by the time I finished writing the scripts, I noticed you were expecting somewhat different results.
Your problem is better suited for a scripting language and a parsing library. Consider python + pyparsing (here is an example C struct parsing grammar, but you would need something much simpler than that) or perl6's rules.
Still, perhaps this will be of some use if you decide to stick to sed:
pass-one.sh
#!/bin/sed -nf
/^struct/ {
s|^\(struct[^(){]*{\)|\1\n|
s|[^}];|;\n|gp
a \\n
}
/^typedef/ {
h
# create signature
s|.*{\(.*\)} \(.*\);|void init_\2( \2 *var, \1 ) {|
# insert argument list to signature and remove trailing ;
s|\([^;]*\); ) {|\1 ) {|g
s|;|,|g
p
g
# add constructor (further substitutions follow in pass-two)
s|.*{\(.*\)}.*|\1|
s|;|;\n|g
s|\n$||p
a }
a \\n
}
pass-two.sh
#!/bin/sed -f
# fix struct indent
/^struct/ {
:loop1
n
s|^ | |
t loop1
}
# unsigned int name -> var->name = name
/^void init_/{
:loop2
n
s|.* \(.*\);| var->\1 = \1;|
t loop2
}
Usage
$ cat << EOF | ./pass-one.sh | ./pass-two.sh
struct node { int val; struct node *next;};
typedef struct { int numer; int denom;} Rational;
struct node { int val; struct node *next;};
typedef struct { int numer; unsigned int denom;} Rational;
EOF
struct node {
int va;
struct node *nex;
};
void init_Rational( Rational *var, int numer, int denom ) {
var->numer = numer;
var->denom = denom;
}
struct node {
int va;
struct node *nex;
};
void init_Rational( Rational *var, int numer, unsigned int denom ) {
var->numer = numer;
var->denom = denom;
}

gcc and lccwin32:different result

I am trying to compile this code:
#include <stdio.h>
void print(FILE *a)
{
    int main();
    int count=20;
    int c;
    int stop=0;
    char answer;
    while(!stop){
        while((c=getc(a))!=EOF){
            fprintf(stdout,"%c",c);
            if(c=='\n'){
                count--;
                if(!count){
                    printf("do you want continue:y=for continue/q=for quit");
                    fflush(stdin);
                    answer=getchar();
                    if(answer=='y' || answer=='Y')
                        count=20;
                    else if(answer=='Q' || answer=='q'){
                        printf("you quit this program,press any key and hit the enter to close");
                        stop=1;
                        break;
                    }
                    else{
                        printf("argument is unacceptable,rolling back action");
                        main();
                    }
                }
            }
        }
        if(c==EOF)
            stop=1;
    }
}
void halt()/*do nothing just for halt and waiting for input*/
{
    int a;
    scanf("%d",&a);
}
int main()
{
    FILE *in,*fopen();
    char name1[25];
    int a;
    printf("enter the name of the file you want to show:");
    scanf("%24s",name1);
    in=fopen(name1,"r");
    if(in==NULL){
        printf("the files doesnt exist or it is in another directory, try to enter again\n");
        main();
    }
    else
        print(in);
    fclose(in);
    halt();
    return 0;
}
The purpose of the program is to show the contents of a file 20 lines at a time. I compiled it on Windows XP with lccwin32 and it works as expected. The problem arises when I switch to Linux (Ubuntu 12.04 LTS Precise Pangolin, desktop) and compile it with gcc. At first it seems to work fine, but only until the 20th line, when the prompt appears: when I enter the argument (y to continue, q to quit) and hit Enter, nothing happens. It just falls through to the else part, which restarts the program. So is my gcc buggy, is my code not suited to gcc, or have I missed something?

I hate scanf. I would suggest replacing scanf("%24s",name1) with fgets(name1, sizeof name1, stdin);
(and then, unfortunately, doing if (name1[strlen(name1)-1] == '\n') name1[strlen(name1)-1] = '\0'; to get rid of the \n at the end, which also needs #include <string.h>).
I would also suggest:
Not calling main recursively
Using int main(int argc, char *argv[]) and passing the name of your file as an argument (so you would check that argc > 1, use argv[1] as the filename, and run the program as ./programname filename)
Still not using scanf

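In addition to the issues reported by @Foon, you also have these problems:
1. fflush(stdin) does not work the way you think it does. The C standard defines fflush only for output streams; some Windows C runtimes discard pending input as an extension, but glibc on Linux does not discard pending terminal input, which is why the program behaves differently under gcc.
2. scanf() leaves the newline character in the input buffer.
Your problem is that there is still a newline (\n) in the input buffer when you call getchar(), so your y/q answer is never read.
Replacing fflush(stdin) with a solution from 1., or replacing fflush()+getchar() with scanf("\n%c",&answer);, should solve that particular issue.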

Bash 'printf' equivalent for command prompt?

I'm looking to pipe some string input to a small C program in the Windows command prompt. In bash I could use
$ printf "AAAAA\x86\x08\x04\xed" | ./program
Essentially, I need a way to write those hexadecimal escapes in the command prompt.
Is there an equivalent or similar command to printf in the command prompt or PowerShell?
Thanks

In PowerShell, you would do it this way:
"AAAAA{0}{1}{2}{3}" -f 0x86,0x08,0x04,0xed | ./program

I recently came up with the same question myself and decided that for someone developing Windows exploits it is worth installing cygwin :)
Otherwise one could build a small C program mimicking printf's functionality:
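#include <stdio.h>   /* printf */
#include <stdlib.h>  /* strtol */
#include <string.h>  /* strlen, strncpy */
int main(int argc, char *argv[])
{
    size_t i;
    char tmp[3];
    tmp[2] = '\0';
    if (argc > 1) {
        /* Input looks like \xAB\xCD: skip each "\x" and convert the two hex digits */
        for (i = 2; i < strlen(argv[1]); i += 4) {
            strncpy(tmp, argv[1]+i, 2);
            printf("%c", (char)strtol(tmp, NULL, 16));
        }
    }
    else {
        printf("USAGE: printf.exe SHELLCODE\n");
        return 1;
    }
    return 0;
}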
The program only handles "\xAB\xCD" strings, but it shouldn't be difficult to extend it to handle "AAAAA\xAB\xCD" strings if one needs it.
