multi string search and replace - bash

I have 2 text files
File1 has more than 400K lines. Each line is similar to this sample:
hstor,table,"8bit string",ABCD,0000,0,4,19000101,today
File2 is a list of new 8bit strings to replace the current ones in file1, while preserving the rest of each line.
So file1 goes from
hstor,table,"OLD 8bit string",ABCD,0000,0,4,19000101,today
to
hstor,table,"NEW 8bit string",ABCD,0000,0,4,19000101,today
I can't run sed 400K times
How can I script this so that all the OLD 8bit strings in file1 are replaced with the NEW 8bit strings listed in file2?

This might work for you (GNU sed):
sed 's#.*#s/[^,]*/&/3#' file2 | cat -n | sed -f - file1
This converts file2 into a sed script file and then runs it on file1.
The first sed command turns each line of file2 into a substitution command that replaces the third comma-separated field with the contents of that line.
This is piped into cat -n, which prefixes each command with a line number; sed treats that number as an address, so each substitution applies only to the corresponding line of file1.
The final sed command reads the generated script from standard input (-f -) and runs it against the input file file1.
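For example, if file2 contained the two lines "NEW string 1" and "NEW string 2" (quotes included, since the third field is quoted), the generated script would be:

$ sed 's#.*#s/[^,]*/&/3#' file2 | cat -n
     1  s/[^,]*/"NEW string 1"/3
     2  s/[^,]*/"NEW string 2"/3

The first substitution only touches line 1 of file1, the second only line 2, and so on. Note this assumes one replacement per line of file2, in the same order as the lines of file1, and that the replacements contain no / or & characters, which sed would treat specially.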

In case you need to do this multiple times and performance is important, I wrote a program in C to do this. It's a modified version of this code. I know you did not use any C tag, but I got the impression that your main concern was just to get the job done.
NOTE:
I take no responsibility for it. It is a bit of a quick hack, and it makes a few assumptions:
- The string you want to replace does not contain any commas.
- No line is longer than 100 bytes.
- The input files are named file and rep respectively.
If you want to try it out, make sure to inspect the data afterwards. It writes to stdout, so you just redirect the output to a new file. It does the job in around two seconds.
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    /* declare file pointers */
    FILE *infile;
    FILE *replace;
    char *buffer;
    char *rep_buffer;
    long numbytes;
    long rep_numbytes;

    /* open the existing files for reading */
    infile = fopen("file", "r");
    replace = fopen("rep", "r");

    /* quit if a file does not exist */
    if (infile == NULL)
        return 1;
    if (replace == NULL)
        return 1;

    /* get the number of bytes in each file */
    fseek(infile, 0L, SEEK_END);
    numbytes = ftell(infile);
    fseek(replace, 0L, SEEK_END);
    rep_numbytes = ftell(replace);

    /* reset the file position indicators to
       the beginning of each file */
    fseek(infile, 0L, SEEK_SET);
    fseek(replace, 0L, SEEK_SET);

    /* grab sufficient memory for the
       buffers to hold the text */
    buffer = calloc(numbytes, sizeof(char));
    rep_buffer = calloc(rep_numbytes, sizeof(char));

    /* memory error */
    if (buffer == NULL)
        return 1;
    if (rep_buffer == NULL)
        return 1;

    /* copy all the text into the buffers */
    fread(buffer, sizeof(char), numbytes, infile);
    fclose(infile);
    fread(rep_buffer, sizeof(char), rep_numbytes, replace);
    fclose(replace);

    char line[100] = {0};
    char *i = buffer;
    char *r = rep_buffer;
    while (i < &buffer[numbytes - 1]) {
        int n;

        /* copy from infile up to and including the second comma */
        for (n = 0; i[n] != ','; n++);
        n++;
        for (; i[n] != ','; n++);
        n++;
        memcpy(line, i, n);

        /* copy a line from the replacement file (without its newline) */
        int m;
        for (m = 0; r[m] != '\n'; m++);
        memcpy(&line[n], r, m);

        /* skip the corresponding (old) third field in infile */
        int k;
        for (k = n; i[k] != ','; k++);

        /* copy the rest of the line, excluding the newline */
        int l;
        for (l = k; i[l] != '\n'; l++);
        memcpy(&line[n + m], &i[k], l - k);

        /* terminate the string; the buffer is reused between iterations */
        line[n + m + l - k] = '\0';

        /* advance past the newline in both files */
        i += l + 1;
        r += m + 1;

        /* print to stdout */
        printf("%s\n", line);
    }

    /* free the memory we used for the buffers */
    free(buffer);
    free(rep_buffer);
}
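If you want to try it, a build-and-run sequence along these lines should work (the source file name replace.c is my choice, not part of the answer):

gcc -O2 -o replace replace.c
./replace > file.new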

Related

only read last line of text file (C++ Builder)

Is there an efficient way to read the last line of a text file? Right now I'm simply reading each line with code like the below, so S holds the last line once the loop finishes. Is there a good way to grab that last line without looping through the entire text file?
TStreamReader* Reader;
Reader = new TStreamReader(myfile);
while (!Reader->EndOfStream)
{
    String S = Reader->ReadLine();
}
Exactly as Remy Lebeau commented:
Use the file access functions FileOpen, FileSeek, FileRead.
For an example of their usage, look here: Convert the Linux open, read, write, close functions to work on Windows
Load your file in chunks from the end into memory:
make a static buffer and load the file into it from the end, chunk by chunk ...
Stop on an eol (end of line), usually CR,LF:
just scan for the ASCII codes 13,10 or their combinations from the end of the chunk. Beware that some files have the last line terminated too, so you should skip that eol the first time ...
Known eols are:
13
10
13,10
10,13
Construct the line:
if no eol is found, add the whole chunk to the string; if one is found, add just the part after it ...
Here is a small example:
int hnd, siz, i, n;
const int bufsz = 256;                  // buffer size
char buf[bufsz + 1];
AnsiString lin;                         // last line output
buf[bufsz] = 0;                         // string terminator
hnd = FileOpen("in.txt", fmOpenRead);   // open file
siz = FileSeek(hnd, 0, 2);              // obtain size and point to its end
for (i = -1, lin = ""; siz;)
{
    n = bufsz;                          // n = chunk size to load
    if (n > siz) n = siz;
    siz -= n;
    FileSeek(hnd, siz, 0);              // point to the chunk's location (from start)
    FileRead(hnd, buf, n);              // load it into buf[]
    if (i < 0)                          // first pass: skip the file's last eol, if any
    {
        i = n - 1; if (i > 0) if ((buf[i] == 10) || (buf[i] == 13)) n--;
        i--;       if (i > 0) if ((buf[i] == 10) || (buf[i] == 13)) if (buf[i] != buf[i + 1]) n--;
    }
    buf[n] = 0;                         // terminate the chunk (matters when n < bufsz)
    for (i = n - 1; i >= 0; i--)        // scan for eol (CR,LF)
        if ((buf[i] == 10) || (buf[i] == 13))
            { siz = 0; break; }         // siz = 0 so no more chunks are read afterwards
    i++;                                // i now points to the start of the line within buf
    lin = AnsiString(buf + i) + lin;    // prepend this chunk to the line
}
FileClose(hnd); // close file
// here lin is your last line
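For what it's worth, the same backward scan can be written in portable standard C without the VCL. The following is my own rough translation, not part of the original answer; the function names are made up, and it reads one byte per seek, so it is simple rather than fast:

#include <stdio.h>
#include <stdlib.h>

/* read the byte at offset pos, or -1 if pos is out of range */
static int byte_at(FILE *f, long pos) {
    if (pos < 0 || fseek(f, pos, SEEK_SET) != 0)
        return -1;
    return fgetc(f);
}

/* Return a malloc'd copy of the last line of fname (eol stripped),
 * or NULL on error. */
char *last_line(const char *fname) {
    FILE *f = fopen(fname, "rb");
    if (f == NULL)
        return NULL;
    fseek(f, 0, SEEK_END);
    long end = ftell(f);

    /* skip one trailing eol (10, 13, 13+10 or 10+13), if present */
    int c = byte_at(f, end - 1);
    if (c == '\n' || c == '\r') {
        int c2 = byte_at(f, end - 2);
        end -= ((c2 == '\n' || c2 == '\r') && c2 != c) ? 2 : 1;
    }

    /* scan backwards for the eol that precedes the last line */
    long start = end;
    while (start > 0) {
        c = byte_at(f, start - 1);
        if (c == '\n' || c == '\r')
            break;
        start--;
    }

    char *line = malloc(end - start + 1);
    if (line != NULL) {
        fseek(f, start, SEEK_SET);
        fread(line, 1, end - start, f);
        line[end - start] = '\0';
    }
    fclose(f);
    return line;
}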

Changing tab-completion for read builtin in bash

The current tab-completion while "read -e" is active in bash seems to match only filenames:
read -e
[[TabTab]]
abc.txt bcd.txt cde.txt
I want the completion to be a set of strings defined by me, while file/dir/hostname-completion etc. should be deactivated for the duration of "read -e".
Outside of a script
complete -W 'string1 string2 string3' -E
works well, but I can't get this kind of completion to work inside a script while using "read -e".
Although it seems like a reasonable request, I don't believe that is possible.
The existing implementation of the read builtin sets the readline completion environment to a fairly basic configuration before calling readline to handle -e input.
You can see the code in builtins/read.def, in the edit_line function: it sets rl_attempted_completion_function to NULL for the duration of the call to readline. readline has several completion overrides, so it's not 100% obvious that this resets the entire completion environment, but as far as I know this is the function which is used to implement programmable completion as per the complete command.
With some work, you could probably modify the definition of the read command to allow a specific completion function instead of or in addition to the readline standard filename completion function. That would require a non-trivial understanding of bash internals, but it would be a reasonable project if you wanted to gain familiarity with those internals.
As a simpler but less efficient alternative, you could write your own little utility which just accepts one line of keyboard input with readline and echoes it to stdout. Then invoke read redirecting its stdin to your utility:
read -r < <(my_reader string1 string2 string3)
(That assumes that my_reader uses its command-line arguments to construct the potential completion list for the readline library. You'd probably want the option to present a prompt as well.)
The readline documentation includes an example of an application which does simple custom completion; once you translate it from the K&R function prototype syntax, it might be pretty easy to adapt to your needs.
Edit: After I looked at that example again, I thought it had a lot of unnecessary details, so I wrote the following example with fewer unnecessary details. I might upload it to github, but for now it's here even though it's nearly 100 lines:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <readline/readline.h>

static void version(const char* progname) {
    fprintf(stderr, "%s 0.1\n", progname);
}

static void usage(const char* progname) {
    fprintf(stderr, "Usage: %s [-fhv] [-p PROMPT] [-n PROGNAME] [COMPLETION...]\n", progname);
    fprintf(stderr,
            "Reads one line using readline, and prints it to stdout.\n"
            "Returns success if a line was read.\n"
            "    -p PROMPT    Output PROMPT before requesting input.\n"
            "    -n PROGNAME  Set application name to PROGNAME for readline config file\n"
            "                 (Default: %s).\n"
            "    -f           Use filename completion as well as specified completions.\n"
            "    -h           Print this help text and exit.\n"
            "    -v           Print version number and exit.\n"
            "    COMPLETION   word to add to the list of possible completions.\n",
            progname);
}

/* Readline really likes globals, so none of its hooks take a context
 * parameter. */
static char** completions = NULL;

static char* generate_next_completion(const char* text, int state) {
    static int index = 0;
    if (state == 0) index = 0; /* reset index if we're starting */
    size_t textlen = strlen(text);
    while (completions[index++])
        if (strncmp(completions[index - 1], text, textlen) == 0)
            return strdup(completions[index - 1]);
    return NULL;
}

/* We use this if we will fall back to filename completion */
static char** generate_completions(const char* text, int start, int end) {
    return rl_completion_matches(text, generate_next_completion);
}

int main (int argc, char **argv) {
    const char* prompt = "";
    const char* progname = strrchr(argv[0], '/');
    progname = progname ? progname + 1 : argv[0];
    rl_readline_name = progname;
    bool use_file_completion = false;
    for (;;) {
        int opt = getopt(argc, argv, "+fp:n:hv");
        switch (opt) {
            case -1:  break;
            case 'f': use_file_completion = true; continue;
            case 'p': prompt = optarg; continue;
            case 'n': rl_readline_name = optarg; continue;
            case 'h': usage(progname); return 0;
            case 'v': version(progname); return 0;
            default:  usage(progname); return 2;
        }
        break;
    }
    /* The default is stdout, which would interfere with capturing output. */
    rl_outstream = stderr;
    completions = argv + optind;
    rl_completion_entry_function = rl_filename_completion_function;
    if (*completions) {
        if (use_file_completion)
            rl_attempted_completion_function = generate_completions;
        else
            rl_completion_entry_function = generate_next_completion;
    } else {
        /* No specified strings */
        if (!use_file_completion)
            rl_inhibit_completion = true;
    }
    char* line = readline(prompt);
    if (line) {
        puts(line);
        free(line);
        return 0;
    } else {
        fputc('\n', rl_outstream);
        return 1;
    }
}
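A plausible way to build it and call it from a script (assuming the source is saved as my_reader.c and the readline development files are installed) is:

gcc -o my_reader my_reader.c -lreadline
read -r line < <(./my_reader -p 'Pick one: ' string1 string2 string3)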

Same .txt files, different sizes?

I have a program that reads from a .txt file
I use the cmd prompt to execute the program with the name of the text file to read from.
ex: program.exe myfile.txt
The problem is that sometimes it works, sometimes it doesn't.
The original file is 130KB and doesn't work.
If I copy/paste the contents, the file is 65KB and works.
If I copy/paste the file and rename it, it's 130KB and doesn't work.
Any ideas?
After more testing, this is the part that makes it not work:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    char *infile1 = NULL;
    char *outfile = NULL;
    char tmp[1024] = { 0x0 };
    int i, opt = 0, optarg1 = 0;
    FILE *in;

    for (i = 1; i < argc; i++) /* Skip argv[0] (program name). */
    {
        if (strcmp(argv[i], "-sec") == 0) /* Process optional arguments. */
        {
            opt = 1; /* This is used as a boolean value. */
            /*
             * The last argument is argv[argc-1]. Make sure there are
             * enough arguments.
             */
            if (i + 1 <= argc - 1) /* There are enough arguments in argv. */
            {
                /*
                 * Increment 'i' so that you don't check this
                 * argument again the next time through the loop.
                 */
                i++;
                optarg1 = atoi(argv[i]); /* Convert string to int. */
            }
        }
        else /* not -sec */
        {
            if (infile1 == NULL) {
                infile1 = argv[i];
            }
            else if (outfile == NULL) {
                outfile = argv[i];
            }
        }
    }
    in = fopen(infile1, "r");
    if (in == NULL)
    {
        fprintf(stderr, "Unable to open file %s: %s\n", infile1, strerror(errno));
        exit(1);
    }
    while (fgets(tmp, sizeof(tmp), in) != 0)
    {
        fprintf(stderr, "string is %s.", tmp);
        // Rest of code
    }
}
Whether it works or not, the code inside the while loop gets executed.
When it works tmp actually has a value.
When it doesn't work tmp has no value.
EDIT:
Thanks to sneftel, we know what the problem is.
For me to use fgetws() instead of fgets(), I need tmp to be a wchar_t* instead of a char*.
Typecasting doesn't seem to work.
I tried changing the declaration of tmp to
wchar_t tmp[1024] = { 0x0 };
but I realized that tmp is a parameter in strtok() used elsewhere in my code.
Here is what I tried in that function:
//tmp is passed as the first parameter in parse()
void parse(wchar_t *record, char *delim, char arr[][MAXFLDSIZE], int *fldcnt)
{
if (*record != NULL)
{
char*p = strtok((char*)record, delim);
int fld = 0;
while (p) {
strcpy(arr[fld], p);
fld++;
p = strtok('\0', delim);
}
*fldcnt = fld;
}
else
{
fprintf(stderr, "string is null");
}
}
But typecasting to char* in strtok doesn't work either.
Now I'm looking for a way to just convert the file from UTF-16 to UTF-8 so tmp can be of type char*.
I found this, which looks like it could be useful, but in the example the UTF-16 input comes from the user. How can that input be taken from the file instead?
http://www.cplusplus.com/reference/locale/codecvt/out/
It sounds an awful lot like the original file is UTF-16 encoded. When you copy/paste its contents in your text editor, you save the result out as a new text file in the editor's default encoding (ASCII or UTF-8). Since a single character takes 2 bytes in a UTF-16-encoded file but only 1 byte in a UTF-8-encoded file, the file size is roughly halved when you save it out.
UTF-16 is fine, but you'll need to use Unicode-aware functions (that is, not fgets) to work with it. If you don't want to deal with all that Unicode jazz right now, and you don't actually have any non-ASCII characters to deal with in the file, just do the manual conversion (either with your copy/paste or with a command-line utility) before running your program.
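For example, iconv can do the conversion from the command line (assuming iconv is available on your system; the file names here are placeholders):

iconv -f UTF-16 -t UTF-8 myfile.txt > myfile-utf8.txt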

How Should I read a File line-by-line in C?

I'd like to read a file line-by-line. I have fgets() working okay, but I'm not sure what to do if a line is longer than the buffer size I've passed to fgets(). And furthermore, since fgets() doesn't seem to be Unicode-aware, and I want to allow UTF-8 files, it might miss line endings and read the whole file, no?
Then I thought I'd use getline(). However, I'm on Mac OS X, and while getline() is specified in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/stdio.h, it's not in /usr/include/stdio.h, so gcc doesn't find it when I compile in the shell. And it's not particularly portable, obviously, and I'd like the library I'm developing to be generally useful.
So what's the best way to read a file line-by-line in C?
First of all, it's very unlikely that you need to worry about non-standard line terminators like U+2028. Normal text files are not expected to contain them, and the overwhelming majority of existing software that reads normal text files doesn't support them. You mention getline(), which is available in glibc but not in MacOS's libc, and it would surprise me if getline() supported such fancy line terminators. It's almost a certainty that you can get away with just supporting LF (U+000A) and maybe also CR+LF (U+000D U+000A). To do that, you don't need to care about UTF-8. That's the beauty of UTF-8's ASCII compatibility, and it is by design.
As for supporting lines that are longer than the buffer you pass to fgets(), you can do this with a little extra logic around fgets. In pseudocode:
while true {
    fgets(buffer, size, stream);
    dynamically_allocated_string = strdup(buffer);
    while the last char (before the terminating NUL) in the buffer is not '\n' {
        /* the current line is not finished. read more of it */
        fgets(buffer, size, stream);
        concatenate the contents of buffer to the dynamically allocated string
    }
    process the whole line, as found in the dynamically allocated string
}
But again, I think you will find that there's really quite a lot of software out there that simply doesn't bother with that, from software that parses system config files like /etc/passwd to (some) scripting languages. Depending on your use case, it may very well be good enough to use a "big enough" buffer (e.g. 4096 bytes) and declare that you don't support lines longer than that. You can even call it a security feature (a line length limit is protection against resource exhaustion attacks from a crafted input file).
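For concreteness, here is one way that pseudocode might be fleshed out with fgets(). This is my sketch, not part of the original answer; the function name read_line is made up, and it treats a final unterminated line as a line:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length from fp, growing the buffer as
 * needed. Returns a malloc'd string (caller frees) including the
 * trailing '\n' if there was one, or NULL on EOF or allocation failure. */
char *read_line(FILE *fp) {
    char chunk[256];
    char *line = NULL;
    size_t len = 0;

    while (fgets(chunk, sizeof chunk, fp)) {
        size_t chunklen = strlen(chunk);
        char *bigger = realloc(line, len + chunklen + 1);
        if (bigger == NULL) {
            free(line);
            return NULL;
        }
        line = bigger;
        memcpy(line + len, chunk, chunklen + 1); /* copies the NUL too */
        len += chunklen;
        if (line[len - 1] == '\n')
            break; /* the line is complete */
    }
    return line; /* NULL if EOF was hit before any data was read */
}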
Based on this answer, here's what I've come up with:
#define LINE_BUF_SIZE 1024

char *getline_from(FILE *fp) {
    char *line = malloc(LINE_BUF_SIZE), *linep = line;
    size_t lenmax = LINE_BUF_SIZE, len = lenmax;
    int c;

    if (line == NULL)
        return NULL;

    for (;;) {
        c = fgetc(fp);
        if (c == EOF) {
            if (line == linep) {
                /* nothing was read: report end of input so the
                 * while loops below terminate */
                free(linep);
                return NULL;
            }
            break;
        }
        if (--len == 0) {
            /* buffer is full: double its size */
            len = lenmax;
            char *linen = realloc(linep, lenmax *= 2);
            if (linen == NULL) {
                // Fail.
                free(linep);
                return NULL;
            }
            line = linen + (line - linep);
            linep = linen;
        }
        if ((*line++ = c) == '\n')
            break;
    }
    *line = '\0';
    return linep;
}
To read stdin:
char *line;
while ((line = getline_from(stdin)) != NULL) {
    // do stuff
    free(line);
}
To read some other file, I first open it with fopen():
FILE *fp;
fp = fopen(filename, "rb");
if (!fp) {
    fprintf(stderr, "Cannot open %s: ", filename);
    perror(NULL);
    exit(1);
}
char *line;
while ((line = getline_from(fp)) != NULL) {
    // do stuff
    free(line);
}
This works very nicely for me. I'd love to see an alternative that uses fgets() as suggested by Paul Tomblin, but I don't have the energy to figure it out tonight.

"Extra content at the end of the document" error using libxml2 to read from file handle created with shm_open

I'm trying to write a unit test that checks some xml parsing code. The unit test creates a file descriptor on an in-memory xml doc using shm_open and then passes that to xmlReaderForFd(). But I'm getting an "Extra content at the end of the document" error on the subsequent xmlTextReaderRead(). The parsing code works fine on a file descriptor created from an actual file (I've done a byte-for-byte comparison with the shm_open-created one, and it's the exact same set of bytes). Why is libxml2 choking on a file descriptor created with shm_open?
Here's my code:
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <libxml/xmlreader.h>

void unitTest() {
    int fd = shm_open("/temporary", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    char *pText = "<?xml version=\"1.0\"?><foo></foo>";
    write(fd, pText, strlen(pText) + 1);
    lseek(fd, 0, SEEK_SET);
    xmlTextReaderPtr pReader = xmlReaderForFd(
        fd,           // file descriptor
        "/temporary", // base uri
        NULL,         // encoding
        0);           // options
    int result = xmlTextReaderRead(pReader);
    // result is -1
    // Get this error at console:
    // /temporary:1: parser error : Extra content at the end of the document
    // <?xml version="1.0"?><foo></foo>
    //                                 ^
}
I figured out the problem. I was writing out the NUL terminator, and that stray byte after the closing tag is what libxml2 was choking on (although I could have sworn I already tried it without the NUL terminator, d'oh!). The fixed code should simply be:
write(fd, pText, strlen(pText));
Also, make sure you are reading the file as binary, not text. Text mode strips out CR characters, so fewer bytes are read than the file size reports, and that leaves detritus at the end of the buffer.
Example (VS 2010):
struct _stat32 stat;
char *buf;
FILE *f = fopen("123.XML", "rb"); // right
//f = fopen("123.XML", "rt"); // WRONG!
_fstat(fileno(f), &stat);
buf = (char *)malloc(stat.st_size);
int ret = fread(buf, stat.st_size, 1, f);
assert(ret == 1);
// etc.
