Why is this if statement forcing zenity's --auto-close to close immediately? - bash

I have a Debian package set up for a little application I'm working on. It isn't really relevant to the question, but for context: the application is a simple bash script that deploys some docker containers on the local machine. I wanted to add a dependency check to make sure the system has docker before it attempts anything: if docker is missing, download it; if it's there, move on. I figured it'd be nice to have a little zenity dialog alongside it to show what's going on.
As part of that, I check for an internet connection before starting, for obvious reasons. For some reason, the way I check for internet makes the entire progress dialog close instantly whenever zenity has the --auto-close flag.
Here is a little dummy example. The ping if statement is a straight copy-paste from my code; everything else is filler:
#!/bin/bash
condition=0
if [[ $condition ]]; then
    (
        echo "0"
        # Check for internet
        if ping -c 3 -W 3 gcr.io; then
            echo "# Internet detected, starting updates..."; sleep 1
            echo "10"
        else
            err_msg="# No internet detected. You may be missing some dependencies.
Services may not function as expected until they are installed."
            echo $err_msg
            zenity --error --text="$err_msg"
            echo "100"
            exit 1
        fi
        echo "15"
        echo "# Downloading a thing" ; sleep 1
        echo "50"
        if zenity --question --text="Do you want to download a special thing?"; then
            echo "# Downloading special thing" ; sleep 1
        else
            echo "# Not downloading special thing" ; sleep 1
        fi
        echo "75"
        echo "# downloading big thing" ; sleep 3
        echo "90"
        echo "# Downloading last thing" ; sleep 1
        echo "100"
    ) |
    zenity --progress --title="Dependency Management" --text="downloading dependencies, please wait..." \
        --percentage=0 --auto-close
fi
So I'm really just wondering why this makes zenity freak out. If you comment out the ping if statement, everything works as you'd expect and the zenity progress dialog closes once it hits 100.
If you keep the if statement but remove the --auto-close flag, it also executes as expected. It's as if it's initializing at 100 and then going back to 0 to progress normally. But if that were the case, --auto-close would never work, and in the little example in the help section it works just fine: https://help.gnome.org/users/zenity/stable/progress.html.en

Thank you for a fun puzzle! Spoiler is at the end, but I thought it might be helpful to look over my shoulder while I poked at the problem. 😀️ If you're more interested in the answer than the journey, feel free to scroll. I'll never know, anyway.
Following my own advice (see 1st comment beneath the question), I set out to create a small, self-contained, complete example. But, as they say in tech support: Before you can debug the problem, you need to debug the customer. (No offense; I'm a terrible witness myself unless I know ahead of time that someone's going to need to reproduce a problem I've found.)
I interpreted your comment about checking for Internet to mean "it worked before I added the ping and failed afterward," so the most sensible course of action seemed to be commenting out that part of the code... and then it worked! So what happens differently when the ping is added?
Changes in timing wouldn't make sense, so the problem must be that ping generates output that gets piped to zenity. So I changed the command to redirect its output to the bit bucket:
ping -c 3 -W 3 gcr.io &>/dev/null;
...and that worked, too! Interesting!
I explored what turned out to be a few ratholes:
I ran ping from the command line and piped its output through od -xa to check for weird control characters, but nope.
Instead of enclosing the contents of the if block in parentheses (()), which executes the commands in a sub-shell, I tried braces ({}) to execute them in the same shell. Nope, again.
I tried a bunch of other embarrassingly useless and time-consuming ideas. Nope, nope, and nope.
Then I realized I could just do
ping -c 3 -W 3 gcr.io | zenity --progress --auto-close
directly from the command line. That failed with the --auto-close flag but worked normally without it. Boy, did that simplify things! That's about as "smallest" as you can get. But it's not, actually: I used up all of my remaining intelligence points for the day by redirecting the output from ping into a file, so I could just
(cat output; sleep 1) | zenity --progress --auto-close
and not keep poking at poor gcr.io until I finally figured this thing out. (The sleep gave me enough time to see the pop-up when it worked, because zenity exits when the pipe closes at the end of the input.) So, what's in that output file?
PING gcr.io (172.253.122.82) 56(84) bytes of data.
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=1 ttl=59 time=18.5 ms
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=2 ttl=59 time=21.8 ms
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=3 ttl=59 time=21.4 ms
--- gcr.io ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 18.537/20.572/21.799/1.449 ms
The magic zenity-killer must be in there somewhere! All that was left (ha, "all"!) was to make my "smallest" example even smaller by deleting pieces of the file until it stopped breaking. Then I'd put back whatever I'd deleted last and delete something else, da capo, ad nauseam, or at least ad minimus. (Or whatever; I don't speak Latin.) Eventually the file dwindled to
64 bytes from bh-in-f82.1e100.net (172.253.122.82): icmp_seq=1 ttl=59 time=18.5 ms
and I started deleting stuff from the beginning. Eventually I found that it would break regardless of the length of the line, as long as it started with a number that wasn't 0 and had at least 3 digits somewhere within it. Huh. It'd also break if it did start with a 0 and had at least 4 digits within... unless the second digit was also 0! What's more, a period would make it even weirder: none of the digits anywhere after the period would make it break, no matter what they were.
And then, then came the ah-ha! moment. The zenity documentation says:
Zenity reads data from standard input line by line. If a line is
prefixed with #, the text is updated with the text on that line. If a
line contains only a number, the percentage is updated with that
number.
Wow, really? It can't be that ridiculous, can it?
I found the source for zenity, downloaded it, extracted it (with tar -xf zenity-3.42.1.tar.xz), opened progress.c, and found the function that checks to see if "a line contains only a number." The function is called only if the first character in the line is a number.
static float
stof(const char* s) {
    float rez = 0, fact = 1;
    if (*s == '-') {
        s++;
        fact = -1;
    }
    for (int point_seen = 0; *s; s++) {
        if (*s == '.' || *s == ',') {
            point_seen = 1;
            continue;
        }
        int d = *s - '0';
        if (d >= 0 && d <= 9) {
            if (point_seen) fact /= 10.0f;
            rez = rez * 10.0f + (float)d;
        }
    }
    return rez * fact;
}
Do you see it yet? Here, I'll give you a sscce, with comments:
    // Clear the "found a decimal point" flag and iterate
    // through the input in `s`.
    for (int point_seen = 0; *s; s++) {
        // If the next char is a decimal point (or a comma,
        // for Europeans), set the "found it" flag and check
        // the next character.
        if (*s == '.' || *s == ',') {
            point_seen = 1;
            continue;
        }
        // Sneaky C trick that converts a numeric character
        // to its integer value. Ex: char '1' becomes int 1.
        int d = *s - '0';
        // We only care if it's actually a digit; skip anything else.
        if (d >= 0 && d <= 9) {
            // If we saw a decimal point, we're looking at tenths,
            // hundredths, thousandths, etc., so we'll need to adjust
            // the final result. (Note from the peanut gallery: this is
            // just ridiculous. A progress bar doesn't need to be this
            // accurate. Just quit at the first decimal point instead
            // of trying to be "clever.")
            if (point_seen) fact /= 10.0f;
            // Tack the new digit onto the end of the "rez"ult.
            // Ex: if rez = 12.0 and d = 5, this is 12.0 * 10.0 + 5. = 125.
            rez = rez * 10.0f + (float)d;
        }
    }
    // We've scanned the entire line, so adjust the result to account
    // for the decimal point and return the number.
    return rez * fact;
Now do you see it?
The author decides "[i]f a line contains only a number" by checking (only!) that the first character is a number. If it is, then it plucks out all the digits (and the first decimal, if there is one), mashes them all together, and returns whatever it found, ignoring anything else it may have seen.
So of course it failed if there were 3 digits and the first wasn't 0, or if there were 4 digits and the first 2 weren't 0... because a 3-digit number is always at least 100, and zenity will --auto-close as soon as the progress is 100 or higher.
Spoiler:
The ping statement generates output that confuses zenity into thinking the progress has reached 100%, so it closes the dialog.
By the way, congratulations: you found one of the rookiest kinds of rookie mistakes a programmer can make... and it's not your bug! For whatever reason, the author of zenity decided to roll their own function to convert a line of text to a floating-point number, and it doesn't do at all what the doc says, or what any normal person would expect it to do. (Protip: libraries will do this for you, and they'll actually work most of the time.)
You can score a bunch of karma points if you can figure out how to report the bug, and you'll get a bonus if you submit your report in the form of a fix. 😀️

Related

Reading a folder of log files, and calculating the event durations for unique ID's

I have an air-gapped system (so limited in software access) that generates usage logs daily. The logs have unique IDs for devices, which I've managed to scrape in the past and pump out to a CSV. I would then clean that up in LibreCalc (related to this question I asked here - https://superuser.com/questions/1732415/find-next-matching-event-in-log-and-compare-timings) and get event durations for each one.
This is getting arduous as more devices are added so I wish to automate the calculation of the total durations for each device, and how many events occurred for that device. I've had some suggestions of using out/awk/sed and I'm a bit lost on how to implement it.
Log Example
message="device02 connected" event_ts=2023-01-10T09:20:21Z
message="device05 connected" event_ts=2023-01-10T09:21:31Z
message="device02 disconnected" event_ts=2023-01-10T09:21:56Z
message="device04 connected" event_ts=2023-01-10T11:12:28Z
message="device05 disconnected" event_ts=2023-01-10T15:26:36Z
message="device04 disconnected" event_ts=2023-01-10T18:23:32Z
I already have a bash script that scrapes these events from the log files in the folder and then outputs it all to a csv.
#!/bin/bash
#Just a datetime stamp for the flatfile
now=$(date +"%Y%m%d")
#Log file path, also where I define what month to scrape
LOGFILE='local.log-202301*'
#Shows what log files are getting read
echo $LOGFILE \n
#Output line by line to csv
awk '(/connect/ && ORS="\n") || (/disconnect/ && ORS=RS) {field1_var=$1" "$2" "$3","; print field1_var}' $LOGFILE > /home/user/logs/LOG_$now.csv
Ideally I'd like to keep that process so I can manually inspect the file if necessary. But ultimately I'd prefer to automate the event calculations to produce something like below:
Desired Output Example
Device    Total Connection Duration  Total Connections
device01  0h 0m 0s                   0
device02  0h 1m 35s                  1
device03  0h 0m 0s                   0
device04  7h 11m 4s                  1
device05  6h 5m 5s                   1
Hopefully that's enough info; any help or pointers would be greatly appreciated. Thanks.
This isn't based on your script at all, since I didn't get it to produce a CSV, but anyway...
Here's an AWK script that computes the desired result for the given example log file:
function time_lapsed(from, to) {
    gsub(/[^0-9 ]/, " ", from);
    gsub(/[^0-9 ]/, " ", to);
    return mktime(to) - mktime(from);
}
# Format the seconds by hand: strftime() would apply the local
# timezone offset to the duration and wrap around at 24 hours.
function format_duration(secs) {
    return sprintf("%dh %dm %ds", int(secs / 3600), int(secs % 3600 / 60), secs % 60);
}
BEGIN { OFS = "\t"; }
(/ connected/) {
    split($1, a, "=\"", _);
    split($3, b, "=", _);
    device_connected_at[a[2]] = b[2];
    device_connection_count[a[2]]++;
}
(/disconnected/) {
    split($1, a, "=\"", _);
    split($3, b, "=", _);
    device_connection_duration[a[2]] += time_lapsed(device_connected_at[a[2]], b[2]);
}
END {
    print "Device", "Total Connection Duration", "Total Connections";
    for (device in device_connection_duration) {
        print device, format_duration(device_connection_duration[device]), device_connection_count[device];
    };
}
I used it on this example log file
message="device02 connected" event_ts=2023-01-10T09:20:21Z
message="device05 connected" event_ts=2023-01-10T09:21:31Z
message="device02 disconnected" event_ts=2023-01-10T09:21:56Z
message="device04 connected" event_ts=2023-01-10T11:12:28Z
message="device06 connected" event_ts=2023-01-10T11:12:28Z
message="device05 disconnected" event_ts=2023-01-10T15:26:36Z
message="device02 connected" event_ts=2023-01-10T19:20:21Z
message="device04 disconnected" event_ts=2023-01-10T18:23:32Z
message="device02 disconnected" event_ts=2023-01-10T21:41:33Z
And it produces this output
Device    Total Connection Duration  Total Connections
device02  2h 22m 47s                 2
device04  7h 11m 4s                  1
device05  6h 5m 5s                   1
You can pass this program to awk without any flags. It relies on GNU awk extensions (mktime and the four-argument split), so use gawk. It should just work (given you didn't mess around with field and record separators somewhere in your shell session).
Let me explain what's going on:
First we define the time_lapsed function. In that function we first convert the ISO8601 timestamps into the format that mktime can handle (YYYY MM DD HH MM SS), we simply drop the offset since it's all UTC. We then compute the difference of the Epoch timestamps that mktime returns and return that result.
Next in the BEGIN block we define the output field separator OFS to be a tab.
Then we define two rules, one for log lines when the device connected and one for when the device disconnected.
Due to the default field separator the input to these rules looks like this:
$1: message="device02
$2: connected"
$3: event_ts=2023-01-10T09:20:21Z
We don't care about $2. We use split to get the device identifier and the timestamp from $1 and $3 respectively.
In the rule for a device connecting, using the device identifier as the key, we then store when the device connected and increase the connection count for that device. We don't need to initially assign 0 because the associative arrays in awk return "" for fields that contain no record which is coerced to 0 by incrementing it.
In the rule for a device disconnecting we compute the time lapsed and add that to the total time elapsed for that device.
Note that this requires every connect to have a matching disconnect in the logs. I.e., this is very fragile: a missing connect log line will mess up the calculation of the total connection time, and a missing disconnect log line will increase the connection count but not the total connection time.
In the END rule we print the desired Output header and for every entry in the associative array device_connection_duration we print the device identifier, total connection duration and total connection count.
I hope this gives you some ideas on how to solve your task.
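If Python happens to be available on the air-gapped box, the same bookkeeping can be sketched with only the standard library. This is a rough equivalent of the awk rules above, not a drop-in replacement: the names summarize and fmt are made up for this example, and it assumes every line looks exactly like the sample log. Unlike the awk version, it also reports devices that connected but never disconnected, and it ignores a disconnect with no earlier connect.

```python
import re
from datetime import datetime

# Assumed line shape: message="<device> <connected|disconnected>" event_ts=<ISO8601>Z
LINE_RE = re.compile(r'message="(\S+) (connected|disconnected)" event_ts=(\S+)')


def summarize(lines):
    """Return {device: (total_seconds_connected, connection_count)}."""
    connected_at, totals, counts = {}, {}, {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        device, event, ts = m.groups()
        when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        if event == "connected":
            connected_at[device] = when
            counts[device] = counts.get(device, 0) + 1
        elif device in connected_at:
            delta = when - connected_at.pop(device)
            totals[device] = totals.get(device, 0) + int(delta.total_seconds())
    return {d: (totals.get(d, 0), counts[d]) for d in counts}


def fmt(seconds):
    """Format a duration as e.g. '7h 11m 4s' (hours are not wrapped at 24)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


sample_log = '''\
message="device02 connected" event_ts=2023-01-10T09:20:21Z
message="device05 connected" event_ts=2023-01-10T09:21:31Z
message="device02 disconnected" event_ts=2023-01-10T09:21:56Z
message="device04 connected" event_ts=2023-01-10T11:12:28Z
message="device06 connected" event_ts=2023-01-10T11:12:28Z
message="device05 disconnected" event_ts=2023-01-10T15:26:36Z
message="device02 connected" event_ts=2023-01-10T19:20:21Z
message="device04 disconnected" event_ts=2023-01-10T18:23:32Z
message="device02 disconnected" event_ts=2023-01-10T21:41:33Z
'''.splitlines()

result = summarize(sample_log)
for device in sorted(result):
    total, count = result[device]
    print(device, fmt(total), count, sep="\t")
```

Formatting the duration with divmod rather than a clock-time formatter avoids timezone offsets and lets totals exceed 24 hours.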

How to grab the text after a newline and concatenate the lines into one, in a text file not yet cleaned of spaces and tabs

I have a text like this:
Print <javascript:PrintThis();>
www.example.com
Order Number: *912343454656548 * Date of Order: November 54 2043
------------------------------------------------------------------------
*Dicders Folcisad:
* STACKOVERFLOW
*dum FWEFaadasdd:* ‎[U+200E] ‎
STACK OVERFLOW
BLVD OF SOMEPLACENICE 434
SANTA MONICA, COUNTY
LOS ANGEKES, CALI 90210
(SW)
*Order Totals:*
Subtotal Usd$789.75
Shipping Usd$87.64
Duties & Taxes Usd$0.00 ‎
Rewards Credit Usd$0.00
*Order Total * *Usd$877.39 *
*Wordskccds:*
STACKOVERFLOW
FasntAsia
xxxx-xxxx-xxxx-
*test Method / Welcome Info *
易客满x京配个人行邮税- 运输 + 关税 & 税费 / ADHHX15892013504555636
*Order Number: 916212582744342X*
*#* *Item* *Price* *Qty.* *Discount* *Subtotal*
1
Random's Bounty, Product, 500 mg, 100 Rainsd Harrys AXK-0ew5535
Usd$141.92 4 -Usd$85.16 Usd$482.52
2
Random Product, Fast Forlang, Mayority Stonghold, Flavors, 10 mg,
60 Stresss CXB-034251
Usd$192.24 1 -Usd$28.83 Usd$163.41
3
34st Omicron, Novaccines Percent Pharmaceutical, 10 mg, 120 Tablesds XDF-38452
Usd$169.20 1 -Usd$25.38 Usd$143.82
*Extra Discounts:* Extra 15% discounts applied! Usd$139.37
*Stackoverflox Contact Information :*
*Web: *www.example.com
*Disclaimer:* something made, or service sold through this website,
have not been test by the sweden Spain norway and Dumrug
Advantage. They are not intended to treet, treat, forsee or
forshadow somw clover.
I'm trying to grab each line that starts with a number, then append the second line, and finally the third line. Example of the text I want:
1 Random's Bounty, Product, 500 mg, 100 Rainsd Harrys AXK-0ew5535 Usd$141.92 4 -Usd$85.16 Usd$482.52
2 Random Product, Fast Forlang, Mayority Stonghold, Flavors, 10 mg, 60 Stresss CXB-034251 Usd$192.24 1 -Usd$28.83 Usd$163.41 <- 1 line
3 34st Omicron, Novaccines Percent Pharmaceutical, 10 mg, 120 Wedscsd XDF-38452 Usd$169.20 1 -Usd$25.38 Usd$143.82 <- 1 lines as first
As you may notice, the second item spans 3 lines instead of 2, which makes it harder to grab.
Because of the newline and whitespace, the next command only grabs 1:
grep -E '1\s.+'
I have also tried adding alternatives to the pattern:
grep -E '1\s|[A-Z].+'
But that doesn't work: grep begins to select similar patterns in different parts of the text.
awk '{$1=$1}1' #done already
tr -s "\t\r\n\v" #done already
tr -d "\t\b\r" #done already
I'm trying to write a script that takes an uncleaned FILE as an ARGUMENT, grabs the table, and selects each number with its respective data. Sometimes an item has 4 lines, sometimes 3, so copy/paste doesn't work for me.
I think the last line to be joined is the line starting with "Usd". In that case you only need to change the formatting in
awk '
    !orderfound && /^[0-9]/ { ordernr++; orderfound=1 }
    orderfound { order[ordernr]=order[ordernr] " " $0 }
    $1 ~ "Usd" { orderfound = 0 }
    END {
        for (i=1; i<=ordernr; i++) { print order[i] }
    }' inputfile
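The same state machine can be sketched in Python, in case that is easier to adapt to the output format you want. It follows the awk logic above under the same assumption (an item starts at a line beginning with a digit and ends at the line whose first word starts with "Usd"); the function name join_order_rows is made up for this example. It differs slightly from the awk version: whitespace is normalized and the joined row has no leading space.

```python
def join_order_rows(lines):
    """Join each item's lines (number line through the 'Usd...' price line) into one row."""
    rows, current = [], None
    for raw in lines:
        line = " ".join(raw.split())          # collapse tabs and repeated spaces
        if current is None and line[:1].isdigit():
            current = []                      # a new item starts here
        if current is not None:
            current.append(line)
            if line.startswith("Usd"):        # price line closes the item
                rows.append(" ".join(current))
                current = None
    if current:                               # keep an unterminated last item
        rows.append(" ".join(current))
    return rows


sample = [
    "*#*  *Item*  *Price*  *Qty.*  *Discount*  *Subtotal*",
    "1",
    "Random's Bounty, Product, 500 mg, 100 Rainsd Harrys AXK-0ew5535",
    "Usd$141.92 4 -Usd$85.16 Usd$482.52",
    "2",
    "Random Product, Fast Forlang, Mayority Stonghold, Flavors, 10 mg,",
    "60 Stresss CXB-034251",
    "Usd$192.24 1 -Usd$28.83 Usd$163.41",
    "*Extra Discounts:* Extra 15% discounts applied! Usd$139.37",
]

rows = join_order_rows(sample)
for row in rows:
    print(row)
```

Because the digit test only applies while no item is open, the continuation line "60 Stresss CXB-034251" is appended to item 2 instead of starting a new row.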

Analyzing readdir() performance

It's bothering me that Linux takes so long to list all files in huge directories, so I wrote a little test program that recursively counts all files in a directory:
#include <stdio.h>
#include <string.h>
#include <dirent.h>
/* Recursively count the non-directory entries below path. */
int list(char *path) {
    int i = 0;
    DIR *dir = opendir(path);
    struct dirent *entry;
    char new_path[1024];
    if (dir == NULL) /* unreadable directory: count nothing */
        return 0;
    while ((entry = readdir(dir))) {
        if (entry->d_type == DT_DIR) {
            /* skip ".", ".." and hidden directories */
            if (entry->d_name[0] == '.')
                continue;
            strcpy(new_path, path);
            strcat(new_path, "/");
            strcat(new_path, entry->d_name);
            i += list(new_path);
        }
        else
            i++;
    }
    closedir(dir);
    return i;
}
int main() {
    char *path = "/home";
    printf("%i\n", list(path));
    return 0;
}
When compiled with gcc -O3, the program runs for about 15 sec (I ran the program a few times and the time is approximately constant, so the fs cache should not play a role here):
$ /usr/bin/time -f "%CC %DD %EE %FF %II %KK %MM %OO %PP %RR %SS %UU %WW %XX %ZZ %cc %ee %kk %pp %rr %ss %tt %ww %xx" ./a.out
./a.outC 0D 0:14.39E 0F 0I 0K 548M 0O 2%P 178R 0.30S 0.01U 0W 0X 4096Z 7c 14.39e 0k 0p 0r 0s 0t 1692w 0x
So it spends about S=0.3sec in kernelspace and U=0.01sec in userspace and has 7+1692 context switches.
At about 2000nsec per context switch, those account for 2000nsec * (7+1692) = 3.398msec [1]
However, there are more than 10sec left and I would like to find out what the program is doing in this time.
Are there any other tools to investigate what the program is doing all the time?
gprof just tells me the time for the (userspace) call graph, and gcov does not list the time spent in each line but only how often a line is executed...
[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
oprofile is a decent sampling profiler which can profile both user and kernel-mode code.
According to your numbers, however, approximately 14 of the 14.39 seconds are spent asleep, which is not really registered well by oprofile. Perhaps what may be more useful would be ftrace combined with a reading of the kernel code. ftrace provides trace points in the kernel which can log a message and stack trace when they occur. The event that would seem most useful for determining why your process is sleeping would be the sched_switch event. I would recommend that you enable kernel-mode stacks and the sched_switch event, set a buffer large enough to capture the entire lifetime of your process, then run your process and stop tracing immediately after. By reviewing the trace, you will be able to see every time your process went to sleep, whether it was runnable or non-runnable, a high resolution time stamp, and a call stack indicating what put it to sleep.
ftrace is controlled through debugfs. On my system, this is mounted in /sys/kernel/debug, but yours may be different. Here is an example of what I would do to capture this information:
# Enable stack traces
echo "1" > /sys/kernel/debug/tracing/options/stacktrace
# Enable the sched_switch event
echo "1" > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# Make sure tracing is enabled
echo "1" > /sys/kernel/debug/tracing/tracing_on
# Run the program and disable tracing as quickly as possible
./your_program; echo "0" > /sys/kernel/debug/tracing/tracing_on
# Examine the trace
vi /sys/kernel/debug/tracing/trace
The resulting output will have lines which look like this:
# tracer: nop
#
# entries-in-buffer/entries-written: 22248/3703779 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d..3 2113.437500: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:0 next_pid=878 next_prio=120
<idle>-0 [000] d..3 2113.437531: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> rest_init
=> start_kernel
kworker/0:0-878 [000] d..3 2113.437836: sched_switch: prev_comm=kworker/0:0 prev_pid=878 prev_prio=120 prev_state=S ==> next_comm=your_program next_pid=898 next_prio=120
kworker/0:0-878 [000] d..3 2113.437866: <stack trace>
=> __schedule
=> schedule
=> worker_thread
=> kthread
=> ret_from_fork
The lines you will care about will be when your program appears as the prev_comm task, meaning the scheduler is switching away from your program to run something else. prev_state will indicate that your program was still runnable (R) or was blocked (S, U or some other letter, see the ftrace source). If blocked, you can examine the stack trace and the kernel source to figure out why.
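Scanning a large trace by eye gets tedious, so here is a small Python sketch that picks out the sched_switch lines where a given task is being switched out and tallies the prev_state letters. The regular expression is an assumption based only on the sample lines shown above (the exact field layout can differ between kernel versions), and the function name switch_out_states is made up for this example.

```python
import re
from collections import Counter

# Field layout assumed from the sample trace lines shown above.
SWITCH_RE = re.compile(
    r"sched_switch: prev_comm=(.+?) prev_pid=\d+ prev_prio=\d+ prev_state=(\S+) ==>"
)


def switch_out_states(trace_lines, comm):
    """Count prev_state values for every switch away from the task named comm."""
    states = Counter()
    for line in trace_lines:
        m = SWITCH_RE.search(line)
        if m and m.group(1) == comm:
            states[m.group(2)] += 1
    return states


sample_trace = [
    "<idle>-0 [000] d..3 2113.437500: sched_switch: prev_comm=swapper/0 prev_pid=0 "
    "prev_prio=120 prev_state=R ==> next_comm=kworker/0:0 next_pid=878 next_prio=120",
    "<idle>-0 [000] d..3 2113.437531: <stack trace>",
    "kworker/0:0-878 [000] d..3 2113.437836: sched_switch: prev_comm=kworker/0:0 prev_pid=878 "
    "prev_prio=120 prev_state=S ==> next_comm=your_program next_pid=898 next_prio=120",
]

print(dict(switch_out_states(sample_trace, "kworker/0:0")))
```

On a real capture you would read the lines from the trace file and pass your program's name as comm; a tally dominated by non-R states points at blocking rather than preemption.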

Can one do the equivalent of `pread(fd, buf, size, offset)` from the shell prompt?

I'd like to do the equivalent of what I can do in C via:
pread(fdesc, tgtbuf, size, file_offset);
or the combination:
lseek(fd, file_offset, SEEK_SET);
read(fd, tgtbuf, size)
as a shell command.
For some sizes/offsets, one can use:
dd if=file bs=size skip=$((file_offset/size)) count=1
That works ... but only if file_offset is divisible by size. Which isn't sufficient for my usecase, unfortunately.
The device I'm attempting to read from is "blocked" in 8-byte units for read but allows (requires) byte offsets for seek. dd always reads in units of bs/ibs but also always seeks in these units, which in my case is mutually exclusive.
I know I can do this via perl/python/C/... - but is there a way to do this from a simple shell script ?
EDIT: Since it was suggested to use dd bs=1 count=8 ... here - NO THIS DOES NOT WORK. strace it and you'll see that this does:
$ strace -e lseek,read dd if=/dev/zero bs=1 skip=1234 count=8
[ ... ]
lseek(0, 1234, SEEK_CUR) = 0
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
read(0, "\0", 1) = 1
Which is not what I need - it must be a single read().
Edit2:
The device (/dev/cpu/<ID>/msr) I'm trying to read from is strange in the sense that the offset is treated as an index number, but you'll always have to read eight bytes else the driver gives EINVAL on read.
But every index returns a different 8-byte value, so you cannot "reconstruct" reading from offset x+1 by reading x and x+8 and extracting the bytes. This is highly unusual ... but it's the way /dev/cpu/<ID>/msr works in Linux.
bs=size: size can be 1 byte or more (it tells dd how to access the device, but for a file you can use whatever you need; reading in larger blocks is usually more efficient, though).
Try:
dd if=file bs=1 skip=whateveryouneed count=8 #to read 8 bytes starting at whateveryouneed
if (contrary to what you seem to state in the question) you can only seek to multiple of 8 (and read 8 bytes from there):
dd if=file bs=8 skip=X count=1 #to read 8 bytes starting at whateveryouneed
#X being: whateveryouneed / 8 (ex: echo "4000 / 8" | bc )
(As I say in my comment, I really have trouble imagining a device that allows you to seek anywhere and forces you to read 8 bytes from there, if that offset is not also a multiple of 8 ... but, hey, anything is possible ^^ If so, you'll need a tool other than dd, I'm afraid.)
If it really is that weird: extract 2 blocks of 8 bytes around the address you need, and then extract the exact part you need from them:
blockoffset=$(($address/8))
blockstart=$(($blockoffset*8))
shift=$(($address - $blockstart))
if [ "$shift" -eq 0 ]; then
    dd if=file bs=8 skip=$blockoffset count=1 > final
else
    dd if=file bs=8 skip=$blockoffset count=2 > bigger #we read 2 blocks from blockoffset
    dd if=bigger bs=1 skip=$shift count=8 > final
fi
Given the amount of trouble you've gone to already, just put the C code into an executable program. Or get really ambitious and make the program into a bash extension with "enable -f pread.so pread"
http://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html
likely over the top. A separate program is easier.
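Since the question mentions Python as a fallback, here is a minimal sketch of the "separate program" route using os.pread, which issues a single positioned read of the requested size at an arbitrary byte offset. The helper name pread_once is made up, and the demonstration runs against an ordinary temporary file; reading /dev/cpu/<ID>/msr this way (offset as the register index, size 8) is an untested assumption here and would need root.

```python
import os
import tempfile


def pread_once(path, size, offset):
    """Open path, issue one pread() for size bytes at offset, and return the bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # A short read is possible in general; the caller should check len().
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)


# Demonstration on a regular temporary file holding the bytes 0..63.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(64)))
demo = pread_once(tmp.name, 8, 13)    # 8 bytes starting at byte offset 13
os.unlink(tmp.name)
print(demo.hex())
```

From a shell script this can be called as a small helper script or a python3 -c one-liner, which keeps the rest of the logic in shell while the read itself stays a single call.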

Code Golf: Duplicate Character Removal in String

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
The challenge: The shortest code, by character count, that detects and removes duplicate characters in a String. Removal includes ALL instances of the duplicated character (so if you find 3 n's, all three have to go), and original character order needs to be preserved.
Example Input 1:
nbHHkRvrXbvkn
Example Output 1:
RrX
Example Input 2:
nbHHkRbvnrXbvkn
Example Output 2:
RrX
(the second example removes letters that occur three times; some solutions have failed to account for this)
(This is based on my other question where I needed the fastest way to do this in C#, but I think it makes good Code Golf across languages.)
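For reference while reading the golfed answers, here is an ungolfed Python 3 sketch of what every solution has to compute: count each character once, then keep only the characters whose count is exactly one, in their original order. The function name remove_duplicated is made up for this example.

```python
from collections import Counter


def remove_duplicated(s):
    """Drop every character that occurs more than once; keep the rest in order."""
    counts = Counter(s)
    return "".join(c for c in s if counts[c] == 1)


print(remove_duplicated("nbHHkRvrXbvkn"))     # RrX
print(remove_duplicated("nbHHkRbvnrXbvkn"))   # RrX
```

Counting first makes this a two-pass, linear-time approach; most of the short answers instead rescan the string for each character, trading speed for fewer keystrokes.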
LabVIEW 7.1
ONE character and that is the blue constant '1' in the block diagram.
I swear, the input was copy and paste ;-)
http://i25.tinypic.com/hvc4mp.png
http://i26.tinypic.com/5pnas.png
Perl
21 characters of perl, 31 to invoke, 36 total keystrokes (counting shift and final return):
perl -pe's/$1//gwhile/(.).*\1/'
Ruby — 61 53 51 56 35
61 chars, the ruler says. (Gives me an idea for another code golf...)
puts ((i=gets.split(''))-i.select{|c|i.to_s.count(c)<2}).join
+-------------------------------------------------------------------------+
|| | | | | | | | | | | | | | | |
|0 10 20 30 40 50 60 70 |
| |
+-------------------------------------------------------------------------+
gets.chars{|c|$><<c[$_.count(c)-1]}
... 35 by Nakilon
APL
23 characters:
(((1+ρx)-(ϕx)ιx)=xιx)/x
I'm an APL newbie (learned it yesterday), so be kind -- this is certainly not the most efficient way to do it. I'm ashamed I didn't beat Perl by very much.
Then again, maybe it says something when the most natural way for a newbie to solve this problem in APL was still more concise than any other solution in any language so far.
Python:
s=raw_input()
print filter(lambda c:s.count(c)<2,s)
This is a complete working program, reading from and writing to the console. The one-liner version can be directly used from the command line
python -c 's=raw_input();print filter(lambda c:s.count(c)<2,s)'
J (16 12 characters)
(~.{~[:I.1=#/.~)
Example:
(~.{~[:I.1=#/.~) 'nbHHkRvrXbvkn'
RrX
It only needs the parentheses to be executed tacitly. If put in a verb, the actual code itself would be 14 characters.
There certainly are smarter ways to do this.
EDIT: The smarter way in question:
(~.#~1=#/.~) 'nbHHkRvrXbvkn'
RrX
12 characters, only 10 if set in a verb. I still hate the fact that it's going through the list twice, once to count (#/.) and another to return uniques (nub or ~.), but even nubcount, a standard verb in the 'misc' library does it twice.
Haskell
There's surely shorter ways to do this in Haskell, but:
Prelude Data.List> let h y=[x|x<-y,(<2).length$filter(==x)y]
Prelude Data.List> h "nbHHkRvrXbvkn"
"RrX"
Ignoring the let, since it's only required for function declarations in GHCi, we have h y=[x|x<-y,(<2).length$filter(==x)y], which is 37 characters (this ties the current "core" Python of "".join(c for c in s if s.count(c)<2), and it's virtually the same code anyway).
If you want to make a whole program out of it,
h y=[x|x<-y,(<2).length$filter(==x)y]
main=interact h
$ echo "nbHHkRvrXbvkn" | runghc tmp.hs
RrX
$ wc -c tmp.hs
54 tmp.hs
Or we can knock off one character this way:
main=interact(\y->[x|x<-y,(<2).length$filter(==x)y])
$ echo "nbHHkRvrXbvkn" | runghc tmp2.hs
RrX
$ wc -c tmp2.hs
53 tmp2.hs
It operates on all of stdin, not line-by-line, but that seems acceptable IMO.
C89 (106 characters)
This one uses a completely different method than my original answer. Interestingly, after writing it and then looking at another answer, I saw the methods were very similar. Credits to caf for coming up with this method before me.
b[256];l;x;main(c){while((c=getchar())>=0)b[c]=b[c]?1:--l;
for(;x-->l;)for(c=256;c;)b[--c]-x?0:putchar(c);}
On one line, it's 58+48 = 106 bytes.
C89 (173 characters)
This was my original answer. As said in the comments, it doesn't work too well...
#include<stdio.h>
main(l,s){char*b,*d;for(b=l=s=0;l==s;s+=fread(b+s,1,9,stdin))b=realloc(b,l+=9)
;d=b;for(l=0;l<s;++d)if(!memchr(b,*d,l)&!memchr(d+1,*d,s-l++-1))putchar(*d);}
On two lines, it's 17+1+78+77 = 173 bytes.
C#
65 Characters:
new String(h.Where(x=>h.IndexOf(x)==h.LastIndexOf(x)).ToArray());
67 Characters with reassignment:
h=new String(h.Where(x=>h.IndexOf(x)==h.LastIndexOf(x)).ToArray());
C#
new string(input.GroupBy(c => c).Where(g => g.Count() == 1).ToArray());
71 characters
PHP (136 characters)
<?PHP
function q($x){return $x<2;}echo implode(array_keys(array_filter(
array_count_values(str_split(stream_get_contents(STDIN))),'q')));
On one line, it's 5+1+65+65 = 136 bytes. Using PHP 5.3 you could save a few bytes making the function anonymous, but I can't test that now. Perhaps something like:
<?PHP
echo implode(array_keys(array_filter(array_count_values(str_split(
stream_get_contents(STDIN))),function($x){return $x<2;})));
That's 5+1+66+59 = 131 bytes.
another APL solution
As a dynamic function (18 characters)
{(1+=/¨(ω∘∊¨ω))/ω}
line assuming that input is in variable x (16 characters):
(1+=/¨(x∘∊¨x))/x
VB.NET
For Each c In s : s = IIf(s.LastIndexOf(c) <> s.IndexOf(c), s.Replace(CStr(c), Nothing), s) : Next
Granted, VB is not the optimal language to try to save characters, but the line comes out to 98 characters.
PowerShell
61 characters. Where $s="nbHHkRvrXbvkn" and $a is the result.
$h=@{}
($c=[char[]]$s)|%{$h[$_]++}
$c|%{if($h[$_]-eq1){$a+=$_}}
Fully functioning parameterized script:
param($s)
$h=@{}
($c=[char[]]$s)|%{$h[$_]++}
$c|%{if($h[$_]-eq1){$a+=$_}}
$a
C: 83 89 93 99 101 characters
O(n²) time.
Limited to 999 characters.
Only works in 32-bit mode (not #include-ing <stdio.h>, which would cost 18 chars, means the return type of gets is interpreted as int, chopping off half of the address bits).
Shows a friendly "warning: this program uses gets(), which is unsafe." on Macs.
main(){char s[999],*c=gets(s);for(;*c;c++)strchr(s,*c)-strrchr(s,*c)||putchar(*c);}
(and this similar 82-chars version takes input via the command line:
main(char*c,char**S){for(c=*++S;*c;c++)strchr(*S,*c)-strrchr(*S,*c)||putchar(*c);}
)
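The strchr/strrchr comparison above keeps a character exactly when its first and last positions in the string coincide. A rough Python sketch of the same test, for readers who don't speak golfed C (the function name unique_by_position is made up for illustration; only str.find and str.rfind are used):

```python
def unique_by_position(s: str) -> str:
    """Keep the characters whose first and last occurrence share one index."""
    # str.find scans from the left and str.rfind from the right, so the two
    # indices agree only for characters that occur exactly once.
    # Each lookup is linear in len(s), so the whole pass is quadratic,
    # just like the C version.
    return "".join(c for c in s if s.find(c) == s.rfind(c))


if __name__ == "__main__":
    print(unique_by_position("nbHHkRvrXbvkn"))  # prints RrX
```

Unlike the C program, this sketch has no 999-character limit; it is meant to show the idea, not to compete on length.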
Golfscript(sym) - 15
.`{\{=}+,,(!}+,
+-------------------------------------------------------------------------+
|| | | | | | | | | | | | | | | |
|0 10 20 30 40 50 60 70 |
| |
+-------------------------------------------------------------------------+
Haskell
(just knocking a few characters off Mark Rushakoff's effort, I'd rather it was posted as a comment on his)
h y=[x|x<-y,[_]<-[filter(==x)y]]
which is better Haskell idiom but maybe harder to follow for non-Haskellers than this:
h y=[z|x<-y,[z]<-[filter(==x)y]]
Edit to add an explanation for hiena and others:
I'll assume you understand Mark's version, so I'll just cover the change. Mark's expression:
(<2).length $ filter (==x) y
filters y to get the list of elements that == x, finds the length of that list and makes sure it's less than two (in fact it must be length one, but ==1 is longer than <2). My version:
[z] <- [filter(==x)y]
does the same filter, then puts the resulting list into a list as the only element. Now the arrow (meant to look like set inclusion!) says "for every element of the RHS list in turn, call that element [z]". [z] is the list containing the single element z, so the element "filter(==x)y" can only be called "[z]" if it contains exactly one element. Otherwise it gets discarded and is never used as a value of z. So the z's (which are returned on the left of the | in the list comprehension) are exactly the x's that make the filter return a list of length one.
That was my second version. My first version returns x instead of z, because they're the same anyway, and renames z to _, which is the Haskell symbol for "this value isn't going to be used so I'm not going to complicate my code by giving it a name".
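For non-Haskellers, the pattern-match trick can be spelled out step by step. The following is only a loose Python analogue (keep_singletons is a made-up name, and Python has no refutable patterns in comprehensions, so the "pattern fails, discard it" step becomes an explicit length check):

```python
def keep_singletons(y: str) -> str:
    """Mimic  h y = [z | x <- y, [z] <- [filter (==x) y]]  one step at a time."""
    out = []
    for x in y:                                # x <- y
        matches = [c for c in y if c == x]     # filter (==x) y
        if len(matches) == 1:                  # the pattern [z] only fits a one-element list
            (z,) = matches                     # bind z to that single element
            out.append(z)
        # any other length: the pattern does not match, so x contributes nothing
    return "".join(out)


if __name__ == "__main__":
    print(keep_singletons("nbHHkRvrXbvkn"))  # prints RrX
```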
Javascript 1.8
s.split('').filter(function (o,i,a) a.filter(function(p) o===p).length <2 ).join('');
Or, alternatively, similar to the Python example:
[s[c] for (c in s) if (s.split("").filter(function(p) s[c]===p).length <2)].join('');
TCL
123 chars. It might be possible to get it shorter, but this is good enough for me.
proc h {i {r {}}} {foreach c [split $i {}] {if {[llength [split $i $c]]==2} {set r $r$c}}
return $r}
puts [h [gets stdin]]
C
Full program in C, 141 bytes (counting newlines).
#include<stdio.h>
c,n[256],o,i=1;main(){for(;c-EOF;c=getchar())c-EOF?n[c]=n[c]?-1:o++:0;for(;i<o;i++)for(c=0;c<256;c++)n[c]-i?0:putchar(c);}
Scala
54 chars for the method body only, 66 with (statically typed) method declaration:
def s(s:String)=(""/:s)((a,b)=>if(s.filter(c=>c==b).size>1)a else a+b)
Ruby
63 chars.
puts (t=gets.split(//)).map{|i|t.count(i)>1?nil:i}.compact.join
VB.NET / LINQ
96 characters for complete working statement
Dim p=New String((From c In"nbHHkRvrXbvkn"Group c By c Into i=Count Where i=1 Select c).ToArray)
Complete working statement, with the original string and the VB-specific "Pretty listing (reformatting of code)" option turned off, at 96 characters; the non-working statement without the original string is 84 characters.
(Please make sure your code works before answering. Thank you.)
C
(1st version: 112 characters; 2nd version: 107 characters)
k[256],o[100000],p,c;main(){while((c=getchar())!=-1)++k[o[p++]=c];for(c=0;c<p;c++)if(k[o[c]]==1)putchar(o[c]);}
That's
/* #include <stdio.h> */
/* int */ k[256], o[100000], p, c;
/* int */ main(/* void */) {
    while((c=getchar()) != -1/*EOF*/) {
        ++k[o[p++] = /*(unsigned char)*/c];
    }
    for(c=0; c<p; c++) {
        if(k[o[c]] == 1) {
            putchar(o[c]);
        }
    }
    /* return 0; */
}
Because getchar() returns int and putchar accepts int, the #include can 'safely' be removed.
Without the include, EOF is not defined, so I used -1 instead (and gained a char).
This program only works as intended for inputs with less than 100000 characters!
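The structure of this answer (one table of counts indexed by byte value, plus a record of the input order) may be easier to see without the golfing. A minimal Python sketch of the same two-pass idea, working on bytes to mirror getchar; the name unique_bytes_two_pass is invented here, and a growable list stands in for the fixed-size o array, so the length limit does not apply:

```python
def unique_bytes_two_pass(data: bytes) -> bytes:
    """Two passes: count every byte, then emit bytes whose count is 1."""
    counts = [0] * 256   # plays the role of k[256]
    order = []           # plays the role of o[...], but grows as needed

    # Pass 1: remember each byte in input order and bump its counter.
    for byte in data:
        counts[byte] += 1
        order.append(byte)

    # Pass 2: replay the input and keep only bytes seen exactly once.
    return bytes(b for b in order if counts[b] == 1)


if __name__ == "__main__":
    print(unique_bytes_two_pass(b"nbHHkRvrXbvkn"))  # prints b'RrX'
```

This runs in linear time, in contrast to the quadratic first-index/last-index approaches elsewhere in the thread.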
Version 2, with thanks to strager
107 characters
#ifdef NICE_LAYOUT
#include <stdio.h>
/* global variables are initialized to 0 */
int char_count[256]; /* k in the other layout */
int char_order[999999]; /* o ... */
int char_index; /* p */
int main(int ch_n_loop, char **dummy) /* c */
/* variable with 2 uses */
{
    (void)dummy; /* make warning about unused variable go away */
    while ((ch_n_loop = getchar()) >= 0) /* EOF is, by definition, negative */
    {
        ++char_count[ ( char_order[char_index++] = ch_n_loop ) ];
        /* assignment, and increment, inside the array index */
    }
    /* reuse ch_n_loop */
    for (ch_n_loop = 0; ch_n_loop < char_index; ch_n_loop++) {
        (char_count[char_order[ch_n_loop]] - 1) ? 0 : putchar(char_order[ch_n_loop]);
    }
    return 0;
}
#else
k[256],o[999999],p;main(c){while((c=getchar())>=0)++k[o[p++]=c];for(c=0;c<p;c++)k[o[c]]-1?0:putchar(o[c]);}
#endif
Javascript 1.6
s.match(/(.)(?=.*\1)/g).map(function(m){s=s.replace(RegExp(m,'g'),'')})
Shorter than the previously posted Javascript 1.8 solution (71 chars vs 85)
Assembler
Tested with WinXP DOS box (cmd.exe):
xchg cx,bp
std
mov al,2
rep stosb
inc cl
l0: ; to save a byte, I've encoded the instruction to exit the program into the
; low byte of the offset in the following instruction:
lea si,[di+01c3h]
push si
l1: mov dx,bp
mov ah,6
int 21h
jz l2
mov bl,al
shr byte ptr [di+bx],cl
jz l1
inc si
mov [si],bx
jmp l1
l2: pop si
l3: inc si
mov bl,[si]
cmp bl,bh
je l0+2
cmp [di+bx],cl
jne l3
mov dl,bl
mov ah,2
int 21h
jmp l3
Assembles to 53 bytes. Reads standard input and writes results to standard output, eg:
programname < input > output
PHP
118 characters actual code (plus 6 characters for the PHP block tag):
<?php
$s=trim(fgets(STDIN));$x='';while(strlen($s)){$t=str_replace($s[0],'',substr($s,1),$c);$x.=$c?'':$s[0];$s=$t;}echo$x;
C# (53 Characters)
Where s is your input string:
new string(s.Where(c=>s.Count(h=>h==c)<2).ToArray());
Or 59 with re-assignment:
var a=new string(s.Where(c=>s.Count(h=>h==c)<2).ToArray());
Haskell Pointfree
import Data.List
import Control.Monad
import Control.Arrow
main=interact$liftM2(\\)nub$ap(\\)nub
The whole program is 97 characters, but the real meat is just 23 characters. The rest is just imports and bringing the function into the IO monad. In ghci with the modules loaded it's just
(liftM2(\\)nub$ap(\\)nub) "nbHHkRvrXbvkn"
In even more ridiculous pointfree style (pointless style?):
main=interact$liftM2 ap liftM2 ap(\\)nub
It's a bit longer though at 26 chars for the function itself.
Shell/Coreutils, 37 Characters
fold -w1|sort|uniq -u|paste -s -d ''
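One thing worth noting about the pipeline above: because it goes through sort, the surviving characters come out in sorted order rather than in their original order. A small Python sketch of what the pipeline computes, assuming plain code-point ordering (roughly what sort gives under LC_ALL=C; other locales may collate differently) and using a hypothetical function name:

```python
from collections import Counter


def unique_chars_sorted(s: str) -> str:
    """Characters that occur exactly once in s, in sorted (code-point) order."""
    counts = Counter(s)                      # like sort | uniq -c, without the sorting
    return "".join(c for c in sorted(counts) if counts[c] == 1)   # uniq -u, then paste


if __name__ == "__main__":
    print(unique_chars_sorted("nbHHkRvrXbvkn"))  # prints RXr
```

So for the sample string this gives RXr, where the order-preserving answers give RrX.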
