Does anyone have a simple shell script or C program to generate random files of a set size
with random content under Linux?
How about:
head -c SIZE /dev/random > file
(/dev/urandom is a non-blocking alternative if /dev/random stalls waiting for entropy.)
openssl rand can be used to generate random bytes.
The command is below:
openssl rand [bytes] -out [filename]
For example, openssl rand 2048 -out aaa will generate a file named aaa containing 2048 random bytes.
Here are a few ways:
Python:
RandomData = open("/dev/urandom", "rb").read(1024)
open("random.txt", "wb").write(RandomData)
Bash:
dd if=/dev/urandom of=myrandom.txt bs=1024 count=1
using C:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int byte_count = 1024;
    char data[1024];

    /* read random bytes from /dev/urandom */
    FILE *fp = fopen("/dev/urandom", "rb");
    fread(data, 1, byte_count, fp);
    fclose(fp);

    /* write them out with fwrite(); fprintf() would stop at NUL bytes
       and misinterpret '%' characters in the random data */
    FILE *out = fopen("test.txt", "wb");
    fwrite(data, 1, byte_count, out);
    fclose(out);

    return 0;
}
Python. Call it make_random.py
#!/usr/bin/env python
import random
import sys
import string

size = int(sys.argv[1])
for i in range(size):
    sys.stdout.write(random.choice(string.printable))
Use it like this
./make_random.py 1024 >some_file
That will write 1024 random characters to stdout, which you can capture into a file. Since string.printable contains only ASCII characters, the result is plain ASCII text and will open fine in any editor.
Here's a quick and dirty script I wrote in Perl. It allows you to control the range of characters that will be in the generated file.
#!/usr/bin/perl
if ($#ARGV < 1) { die("usage: <file_name> <size_in_bytes>\n"); }

open(FILE, ">" . $ARGV[0]) or die "Can't open file for writing\n";

# you can control the range of characters here
my $minimum = 32;
my $range = 96;

for ($i = 0; $i < $ARGV[1]; $i++) {
    print FILE chr(int(rand($range)) + $minimum);
}
close(FILE);
To use:
./script.pl file 2048
Here's a shorter version, based on S. Lott's idea of outputting to STDOUT:
#!/usr/bin/perl
# you can control the range of characters here
my $minimum = 32;
my $range = 96;
for ($i = 0; $i < $ARGV[0]; $i++) {
    print chr(int(rand($range)) + $minimum);
}
Warning: This is the first script I wrote in Perl. Ever. But it seems to work fine.
You can use my generate_random_file.py script (Python 3) that I used to generate test data in a project of mine.
It works both on Linux and Windows.
It is very fast, because it uses os.urandom() to generate the random data in chunks of 256 KiB instead of generating and writing each byte separately.
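If you want the same chunked idea as a standalone C program (the question did ask for a shell script or C program), a rough sketch might look like this; the 256 KiB block size mirrors the description above, and error handling is kept minimal:
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (256 * 1024)   /* copy in 256 KiB blocks, as described above */

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <size_in_bytes> <output_file>\n", argv[0]);
        return 1;
    }

    long remaining = atol(argv[1]);
    FILE *in = fopen("/dev/urandom", "rb");
    FILE *out = fopen(argv[2], "wb");
    if (in == NULL || out == NULL) {
        perror("fopen");
        return 1;
    }

    static unsigned char buf[CHUNK];
    while (remaining > 0) {
        size_t want = remaining < CHUNK ? (size_t)remaining : CHUNK;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)            /* should not happen for /dev/urandom */
            break;
        fwrite(buf, 1, got, out);
        remaining -= (long)got;
    }

    fclose(out);
    fclose(in);
    return 0;
}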
I have 2 text files
File1 has more than 400K lines. Each line is similar to this sample:
hstor,table,"8bit string",ABCD,0000,0,4,19000101,today
File2 has a list of new 8bit strings to replace the current ones in file1 while preserving the rest in file1.
So file1 goes from
hstor,table,"OLD 8bit string",ABCD,0000,0,4,19000101,today
to
hstor,table,"NEW 8bit string",ABCD,0000,0,4,19000101,today
I can't sed 400K times
How can I script this so that all the OLD 8bit strings in file1 are replaced with the NEW 8bit strings listed in file2?
This might work for you (GNU sed):
sed 's#.*#s/[^,]*/&/3#' file2 | cat -n | sed -f - file1
This converts file2 into a sed script file and then runs it on file1.
The first sed command takes each line of file2 and turns it into a substitution command that replaces the third field of its target line with the contents of that line of file2.
This is piped into cat -n, which prefixes line numbers; those numbers act as addresses in the generated sed script, so each substitution command applies only to the corresponding line of file1.
The final sed invocation reads the generated script from standard input (-f -) and runs it against file1.
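For example, if file2 held just the two replacement strings NEW1 and NEW2 (made-up values), the output of the first two stages would be a line-numbered sed script along these lines:
     1  s/[^,]*/NEW1/3
     2  s/[^,]*/NEW2/3
so the Nth substitution command carries the address N and therefore runs only against the Nth line of file1.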
In case you need to do this multiple times and performance is important, I wrote a program in C to do this. It's a modified version of this code. I know the question isn't tagged C, but I got the impression that your main concern was just to get the job done.
NOTE:
I take no responsibility for it. It is a bit of a quick hack, and I make some assumptions. One assumption is that the string you want to replace does not contain any commas. Another is that no line is longer than 100 bytes. A third is that the input files are named file and rep respectively. If you want to try it out, make sure to inspect the data afterwards. It writes to stdout, so you just redirect the output to a new file. It does the job in around two seconds.
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    /* declare file pointers */
    FILE *infile;
    FILE *replace;
    char *buffer;
    char *rep_buffer;
    long numbytes;
    long rep_numbytes;

    /* open the existing files for reading */
    infile = fopen("file", "r");
    replace = fopen("rep", "r");

    /* quit if a file does not exist */
    if(infile == NULL)
        return 1;
    if(replace == NULL)
        return 1;

    /* get the number of bytes in each file */
    fseek(infile, 0L, SEEK_END);
    numbytes = ftell(infile);
    fseek(replace, 0L, SEEK_END);
    rep_numbytes = ftell(replace);

    /* reset the file position indicators to
       the beginning of the files */
    fseek(infile, 0L, SEEK_SET);
    fseek(replace, 0L, SEEK_SET);

    /* grab sufficient memory for the
       buffers to hold the text */
    buffer = (char*)calloc(numbytes, sizeof(char));
    rep_buffer = (char*)calloc(rep_numbytes, sizeof(char));

    /* memory error */
    if(buffer == NULL)
        return 1;
    if(rep_buffer == NULL)
        return 1;

    /* copy all the text into the buffers */
    fread(buffer, sizeof(char), numbytes, infile);
    fclose(infile);
    fread(rep_buffer, sizeof(char), rep_numbytes, replace);
    fclose(replace);

    char line[100] = {0};
    char *i = buffer;
    char *r = rep_buffer;

    while(i < &buffer[numbytes-1]) {
        int n;

        /* Copy from infile until the second comma */
        for(n=0; i[n]!=','; n++);
        n++;
        for(; i[n]!=','; n++);
        n++;
        memcpy(line, i, n);

        /* Copy a line from the replacement file */
        int m;
        for(m=0; r[m]!='\n'; m++);
        memcpy(&line[n], r, m);

        /* Skip the corresponding text from infile */
        int k;
        for(k=n; i[k]!=','; k++);

        /* Copy the rest of the line */
        int l;
        for(l=k; i[l]!='\n'; l++);
        memcpy(&line[n+m], &i[k], l-k);
        line[n+m+l-k] = '\0';   /* terminate, so leftovers from a longer
                                   previous line are not printed */

        /* Next line */
        i += l;
        r += m+1;

        /* Print to stdout */
        printf("%s", line);
    }

    /* free the memory we used for the buffers */
    free(buffer);
    free(rep_buffer);

    return 0;
}
I have an array of strings of about 100,000 elements. I need to iterate through each element and replace some words with other words. This takes a few seconds in pure perl. I need to speed this up as much as I can. I'm testing using the following snippet:
use strict;
my $string = "This is some string. Its only purpose is for testing.";
for( my $i = 1; $i < 100000; $i++ ) {
    $string =~ s/old1/new1/ig;
    $string =~ s/old2/new2/ig;
    $string =~ s/old3/new3/ig;
    $string =~ s/old4/new4/ig;
    $string =~ s/old5/new5/ig;
}
I know this doesn't actually replace anything in the test string, but it's for speed testing only.
I had my hopes set on Inline::C. I've never worked with Inline::C before but after reading up on it a bit, I thought it was fairly simple to implement. But apparently, even calling a stub function that does nothing is a lot slower. Here's the snippet I tested with:
use strict;
use Benchmark qw ( timethese );
use Inline 'C';
timethese(
    5,
    {
        "Pure Perl" => \&pure_perl,
        "Inline C"  => \&inline_c
    }
);

sub pure_perl {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        $string =~ s/old1/new1/ig;
        $string =~ s/old2/new2/ig;
        $string =~ s/old3/new3/ig;
        $string =~ s/old4/new4/ig;
        $string =~ s/old5/new5/ig;
    }
}

sub inline_c {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        $string = findreplace( $string, "old1", "new1" );
        $string = findreplace( $string, "old2", "new2" );
        $string = findreplace( $string, "old3", "new3" );
        $string = findreplace( $string, "old4", "new4" );
        $string = findreplace( $string, "old5", "new5" );
    }
}

__DATA__
__C__
char *
findreplace( char *text, char *what, char *with ) {
    return text;
}
on my Linux box, the result is:
Benchmark: timing 5 iterations of Inline C, Pure Perl...
  Inline C: 6 wallclock secs ( 5.51 usr + 0.02 sys = 5.53 CPU) @ 0.90/s (n=5)
 Pure Perl: 2 wallclock secs ( 2.51 usr + 0.00 sys = 2.51 CPU) @ 1.99/s (n=5)
Pure Perl is twice as fast as calling an empty C function. Not at all what I expected! Again, I've never worked with Inline::C before so maybe I am missing something here?
In the version using Inline::C, you kept everything that was in the original pure Perl script and changed just one thing: you replaced Perl's highly optimized s/// with a worse implementation. Invoking your dummy function actually involves work, whereas none of the s/// invocations do much in this case. It is a priori impossible for the Inline::C version to run faster.
On the C side, the function
char *
findreplace( char *text, char *what, char *with ) {
    return text;
}
is not a "do nothing" function. Calling it involves unpacking arguments. The string pointed to by text has to be copied to the return value. There is some overhead which you are paying for each invocation.
Given that s/// does no replacements here, there is no copying involved in that. In addition, Perl's s/// is highly optimized. Are you sure you can write a find & replace that is enough faster to make up for the overhead of calling an external function?
If you use the following implementation, you should get comparable speeds:
sub inline_c {
    my $string = "This is some string. Its only purpose is for testing.";
    for( my $i = 1; $i < 1000000; $i++ ) {
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
        findreplace( $string );
    }
}

__END__
__C__

void findreplace( char *text ) {
    return;
}
Benchmark: timing 5 iterations of Inline C, Pure Perl...
  Inline C: 6 wallclock secs ( 5.69 usr + 0.00 sys = 5.69 CPU) @ 0.88/s (n=5)
 Pure Perl: 6 wallclock secs ( 5.70 usr + 0.00 sys = 5.70 CPU) @ 0.88/s (n=5)
The one remaining possibility for gaining speed is to exploit any special structure in the search patterns and replacements, and to write something that takes advantage of it.
On the Perl side, you should at least pre-compile the patterns.
Also, since your problem is embarrassingly parallel, you are better off looking into chopping up the work into as many chunks as you have cores to work with.
For example, take a look at the Perl entries in the regex-redux task in the Benchmarks Game:
Perl #4 (fork only): 14.13 seconds
and
Perl #3 (fork & threads): 14.47 seconds
versus
Perl #1: 34.01 seconds
That is, some primitive exploitation of the parallelization possibilities cuts the running time by roughly 60%. That problem is not exactly comparable, because there the substitutions must be done sequentially, but it still gives you an idea.
If you have eight cores, dole out the work to eight cores.
Also, consider the following script:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Fake::Text;
use List::Util qw( sum );
use Time::HiRes qw( time );
use constant INPUT_SIZE => $ARGV[0] // 1_000_000;
run();
sub run {
    my @substitutions = (
        sub { s/dolor/new1/ig },
        sub { s/fuga/new2/ig },
        sub { s/facilis/new3/ig },
        sub { s/tempo/new4/ig },
        sub { s/magni/new5/ig },
    );

    my @times;

    for (1 .. 5) {
        my $data = read_input();
        my $t0 = time;
        find_and_replace($data, \@substitutions);
        push @times, time - $t0;
    }

    printf "%.4f\n", sum(@times)/@times;
    return;
}

sub find_and_replace {
    my $data = shift;
    my $substitutions = shift;

    for ( @$data ) {
        for my $s ( @$substitutions ) {
            $s->();
        }
    }

    return;
}

{
    my @input;

    sub read_input {
        @input
            or @input = map fake_sentences(1)->(), 1 .. INPUT_SIZE;
        return [ @input ];
    }
}
In this case, each invocation of find_and_replace takes about 2.3 seconds on my laptop. The five replications run in about 30 seconds. The overhead is the combined cost of generating the 1,000,000-sentence data set and copying it four times.
I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature, modulation, etc. values.
The module uses the following API to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                          struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
Now, read_proc is implemented something like this:
static int read_value(char *buf, char **start, off_t offset, int count, int *eof, void *data) {
    int len = 0;
    int idx = (int)data;

    if(idx == TEMP_FANCTL)
        len = sprintf (buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                       fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if(idx == TEMP_CPU) {
        int i;
        len = sprintf (buf, "%d", fmon_readings[idx]);
        for( i=0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++ ) {
            len += sprintf (buf+len, " CPU%d=%d", i, fmon_cpu_temps[i]);
        }
        len += sprintf (buf+len, "\n");
    }
    else if(idx >= 0 && idx < READINGS_MAX)
        len = sprintf (buf, "%d\n", fmon_readings[idx]);

    *eof = 1;
    return len;
}
This read function definitely assumes that the user has provided enough buffer space to store the temperature value, and that is handled correctly in the userspace program. Also, every call to this function returns the value in its entirety, so there is no support for (or need of) subsequent reads of the same temperature value.
Plus, if I use the "cat" program on this procfs entry from the shell, 'cat' correctly displays the value. This works, I think, because the function sets *eof to true and returns the number of bytes read.
New Linux kernels do not support this API anymore.
My question is:
How can I port this to the new procfs API while keeping the same behaviour: every read returns the whole value, and the 'cat' program works fine instead of going into an infinite loop?
The primary user-space interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading a file in kernel space (read_proc, seq_file, etc.) is actually a (parametrized) implementation of that .read function.
The only way for the kernel to signal EOF to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation, and cat actually performs a second read to discover that the file has ended.
(Moreover, cat performs more than 2 invocations of read: the first with 1 as the count, the second with a count equal to the page size minus 1, and the last with the remaining count.)
The simplest way to get a "one-shot" read implementation is to use seq_file in single_open() mode.
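For what it's worth, a minimal sketch of that approach for one of the entries above might look like this. It reuses names from the question (fmon_readings, TEMP_SAMPLES, TEMP_FANCTL, proc_dir), targets a 4.x kernel, and shows only the read side; the old write_proc logic would move into a separate .write handler:
/* sketch only: fmon_readings, TEMP_SAMPLES, TEMP_FANCTL and proc_dir are
   assumed to exist in the driver, as in the question */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int fmon_show(struct seq_file *m, void *v)
{
    int idx = (int)(long)m->private;   /* index passed to proc_create_data() */

    seq_printf(m, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
               fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    return 0;
}

static int fmon_open(struct inode *inode, struct file *file)
{
    /* single_open() runs fmon_show() once; the following read() returns 0,
       which is the EOF that keeps `cat` from looping forever */
    return single_open(file, fmon_show, PDE_DATA(inode));
}

static const struct file_operations fmon_fops = {
    .owner   = THIS_MODULE,
    .open    = fmon_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};

/* in the module init path, replacing the old create_proc_entry() call */
static int fmon_create_proc_entries(void)
{
    if (!proc_create_data("fmon_gpu_temp", 0644, proc_dir,
                          &fmon_fops, (void *)(long)TEMP_FANCTL))
        return -ENOMEM;
    return 0;
}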
This is what I have:
#!/bin/bash
MAX=3
for((ctr = 0;ctr < MAX; ++ctr))
do
    ./make.o > out$ctr.txt
    output$ctr.txt
done
So I want to put the output of make.o into out$ctr.txt, and since make.o reads from cin, could I pass output$ctr.txt to it as input? I would rather not use input redirection, since I would have to rewrite the program.
EDIT: I do not want to use < because that would give me the contents of the file output$ctr.txt; I want the actual name of the file, not its contents.
Do you mean like this:
./make.o < output$ctr.txt > out$ctr.txt
Edit: if you want the name, then just do this:
./make.o output$ctr.txt > out$ctr.txt
or maybe this, to echo the name so it can be read from C++ cin:
echo output$ctr.txt | ./make.o > out$ctr.txt
But what you actually want is:
./make.o output$ctr.txt >out$ctr.txt
where output$ctr.txt is a command-line argument to your program.
Assuming a C++ program, since you mention cin, you handle command-line arguments like this:
int main(int argc, char *argv[])
{
    if (argc < 2) {
        // argv[0] usually contains the program name
        std::cerr << "missing argument\n"
                  << "Syntax: " << argv[0] << " input-file\n";
        return -1;
    }

    char *input = argv[1]; // = "output$ctr.txt"
    // ...
}
First Guess
It sounds like you just want:
./make.o >out$ctr.txt <output$ctr.txt
> redirects the file descriptor STDOUT_FILENO which is associated with FILE *stdout and std::cout.
< redirects the file descriptor STDIN_FILENO which is associated with FILE *stdin and std::cin.
Just a comment about your loop: a simpler way to write it is
#!/bin/bash
for i in {0..2}
do
    echo $i
done
I'd like to read a file line-by-line. I have fgets() working okay, but am not sure what to do if a line is longer than the buffer sizes I've passed to fgets()? And furthermore, since fgets() doesn't seem to be Unicode-aware, and I want to allow UTF-8 files, it might miss line endings and read the whole file, no?
Then I thought I'd use getline(). However, I'm on Mac OS X, and while getline() is specified in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/stdio.h, it's not in /usr/include/stdio.h, so gcc doesn't find it in the shell. And it's not particularly portable, obviously, and I'd like the library I'm developing to be generally useful.
So what's the best way to read a file line-by-line in C?
First of all, it's very unlikely that you need to worry about non-standard line terminators like U+2028. Normal text files are not expected to contain them, and the overwhelming majority of all existing software that reads normal text files doesn't support them. You mention getline(), which is available in glibc but not in MacOS's libc, and it would surprise me if getline() did support such fancy line terminators. It's almost a certainty that you can get away with just supporting LF (U+000A) and maybe also CR+LF (U+000D U+000A). To do that, you don't need to care about UTF-8. That's the beauty of UTF-8's ASCII compatibility, and it is by design.
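For what it's worth, stripping either of those endings from a buffer that fgets() just filled takes only a few lines; here's a tiny helper (the name chomp is mine):
#include <string.h>

/* strip a trailing "\n" or "\r\n" in place */
static void chomp(char *line)
{
    size_t len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';
}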
As for supporting lines that are longer than the buffer you pass to fgets(), you can do this with a little extra logic around fgets. In pseudocode:
while true {
    fgets(buffer, size, stream);
    dynamically_allocated_string = strdup(buffer);

    while the last char (before the terminating NUL) in the buffer is not '\n' {
        /* the current line is not finished. read more of it */
        fgets(buffer, size, stream);
        concatenate the contents of buffer to the dynamically allocated string
    }

    process the whole line, as found in the dynamically allocated string
}
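Fleshed out in real C, that might look roughly like the sketch below (the helper name read_line is mine, and error handling is deliberately minimal); the caller frees the returned buffer and stops when it gets NULL:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* read one line of arbitrary length; returns a malloc'd string
   (including the '\n' if present), or NULL on EOF/error */
char *read_line(FILE *stream)
{
    char chunk[4096];
    char *line = NULL;
    size_t len = 0;

    while (fgets(chunk, sizeof chunk, stream) != NULL) {
        size_t chunk_len = strlen(chunk);
        char *tmp = realloc(line, len + chunk_len + 1);
        if (tmp == NULL) {
            free(line);
            return NULL;
        }
        line = tmp;
        memcpy(line + len, chunk, chunk_len + 1);   /* copies the NUL too */
        len += chunk_len;
        if (len > 0 && line[len - 1] == '\n')       /* got a complete line */
            break;
    }
    return line;
}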
But again, I think you will find that there's really quite a lot of software out there that simply doesn't bother with that, from software that parses system config files like /etc/passwd to (some) scripting languages. Depending on your use case, it may very well be good enough to use a "big enough" buffer (e.g. 4096 bytes) and declare that you don't support lines longer than that. You can even call it a security feature (a line length limit is protection against resource exhaustion attacks from a crafted input file).
Based on this answer, here's what I've come up with:
#define LINE_BUF_SIZE 1024

char *getline_from(FILE *fp) {
    char *line = malloc(LINE_BUF_SIZE), *linep = line;
    size_t lenmax = LINE_BUF_SIZE, len = lenmax;
    int c;

    if(line == NULL)
        return NULL;

    for(;;) {
        c = fgetc(fp);
        if(c == EOF)
            break;

        if(--len == 0) {
            len = lenmax;
            char *linen = realloc(linep, lenmax *= 2);

            if(linen == NULL) {
                // Fail.
                free(linep);
                return NULL;
            }

            line = linen + (line - linep);
            linep = linen;
        }

        if((*line++ = c) == '\n')
            break;
    }

    // Return NULL at end of file so the caller's while loop terminates.
    if(c == EOF && line == linep) {
        free(linep);
        return NULL;
    }

    *line = '\0';
    return linep;
}
To read stdin:
char *line;

while ((line = getline_from(stdin)) != NULL) {
    // do stuff
    free(line);
}
To read some other file, I first open it with fopen():
FILE *fp;

fp = fopen(filename, "rb");
if (!fp) {
    fprintf(stderr, "Cannot open %s: ", filename);
    perror(NULL);
    exit(1);
}

char *line;

while ((line = getline_from(fp)) != NULL) {
    // do stuff
    free(line);
}
This works very nicely for me. I'd love to see an alternative that uses fgets() as suggested by @paul-tomblin, but I don't have the energy to figure it out tonight.