Can I change the byte ordering from little-endian to big-endian in Qiskit?

The standard matrix representation of a CNOT gate, as found in the literature, is:
CNOT =
\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 1 & 0
\end{bmatrix}
However, in Qiskit the matrix is represented as
CNOT =
\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 1 & 0\\
0 & 1 & 0 & 0
\end{bmatrix}
Is this related to the big-endian/little-endian issue? Is there a way to represent my matrix the same way it appears in the literature?

Yes, as you mentioned, this has to do with the little-endian bit ordering in Qiskit. Most textbooks (and the first matrix you showed) use big-endian ordering.
If you want to know more you could check out these posts/documentation:
https://qiskit.org/documentation/tutorials/circuits/3_summary_of_quantum_operations.html#Basis-vector-ordering-in-Qiskit
https://quantumcomputing.stackexchange.com/questions/8244/big-endian-vs-little-endian-in-qiskit
If you want to convert your Qiskit circuit to big-endian ordering, you can use the reverse_bits method:
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator
circuit = QuantumCircuit(2)
circuit.cx(0, 1)
print('Little endian:')
print(Operator(circuit))
print('Big endian:')
print(Operator(circuit.reverse_bits()))
gives:
Little endian:
Operator([[1.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
          [0.+0.j, 0.+0.j, 0.+0.j, 1.+0.j],
          [0.+0.j, 0.+0.j, 1.+0.j, 0.+0.j],
          [0.+0.j, 1.+0.j, 0.+0.j, 0.+0.j]],
         input_dims=(2, 2), output_dims=(2, 2))
Big endian:
Operator([[1.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
          [0.+0.j, 1.+0.j, 0.+0.j, 0.+0.j],
          [0.+0.j, 0.+0.j, 0.+0.j, 1.+0.j],
          [0.+0.j, 0.+0.j, 1.+0.j, 0.+0.j]],
         input_dims=(2, 2), output_dims=(2, 2))
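If you only want the big-endian matrix and don't need a reversed circuit, newer Qiskit versions also provide Operator.reverse_qargs(), which reverses the subsystem ordering of an existing operator. A minimal sketch, assuming that method is available in your Qiskit version:
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator

circuit = QuantumCircuit(2)
circuit.cx(0, 1)

# Reverse the qubit (subsystem) ordering of the operator itself,
# leaving the circuit untouched.
print(Operator(circuit).reverse_qargs())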

Related

Creating sparse "band" matrices using PyTorch efficiently

I want to construct the precision matrix for a model that has an AR(1)-like structure. This means that the precision matrix is a sparse band-like matrix and if you were using R, you would use functionality such as bandSparse from the Matrix library.
The matrix should be of the form (sorry, can't find any LaTeX support):
$$\begin{bmatrix}
1 & -\alpha & & & \\
-\alpha & 1+\alpha^2 & -\alpha & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\alpha & 1+\alpha^2 & -\alpha \\
 & & & -\alpha & 1
\end{bmatrix}$$
Below is working code but, as I am sure you can tell, it is not the most direct way of creating this matrix. Any thoughts on how to do it better would be appreciated. Thanks!
import torch

N = 100
alpha = 0.6
dependence = torch.cat([torch.tensor([1., -alpha, 0.]).reshape(1, 3),
                        torch.tensor([-alpha, 1 + alpha**2, -alpha]).reshape(1, 3).expand(N - 2, 3),
                        torch.tensor([0., -alpha, 1.]).reshape(1, 3)], 0)
band_matrix = torch.zeros(N, N)
total_pad = N - 3
band_matrix[0] = torch.cat((dependence[0], torch.zeros(total_pad)), 0)
for i in range(1, N - 1):
    left_pad = i - 1
    right_pad = total_pad - left_pad
    band_matrix[i] = torch.cat((torch.zeros(left_pad),
                                dependence[i],
                                torch.zeros(right_pad)), 0)
band_matrix[N - 1] = torch.cat((torch.zeros(total_pad), dependence[N - 1]), 0)
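For comparison, one more direct construction (a sketch, assuming the same N and alpha as above) is to lay the three diagonals down with torch.diag and patch the two boundary entries:
import torch

N = 100
alpha = 0.6

# Main diagonal: 1 + alpha^2 everywhere except the two boundary entries.
main = torch.full((N,), 1 + alpha**2)
main[0] = 1.0
main[-1] = 1.0

# First off-diagonals: -alpha throughout.
off = torch.full((N - 1,), -alpha)

band_matrix = torch.diag(main) + torch.diag(off, 1) + torch.diag(off, -1)
For a genuinely sparse representation, the same diagonal values could be fed into torch.sparse_coo_tensor instead of assembling a dense matrix.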

How to split an 8-bit input into two 4-bit values

I'm writing code for QPSK modulation in VHDL. I need to split the 8-bit input data into odd and even bits, with each bit replicated. How can I do it?
For example, if my input is 11001001, then the odd bits are odd = 1010 and the even bits are even = 1001, and my output should be odd = 11001100 and even = 11000011.
Use the concatenation operator '&':
dbl_odds <= v(7) & v(7) & v(5) & v(5) & v(3) & v(3) & v(1) & v(1);
dbl_evens <= v(6) & v(6) & v(4) & v(4) & v(2) & v(2) & v(0) & v(0);
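To make the indexing concrete, here is a small Python model of the same split-and-replicate operation (an illustration only; the VHDL above is the actual answer):
def split_and_double(x):
    """Split an 8-bit value into odd/even-indexed bits, replicating each bit twice."""
    odd = even = 0
    for k in range(4):
        o = (x >> (2 * k + 1)) & 1        # bit at odd index 2k+1
        e = (x >> (2 * k)) & 1            # bit at even index 2k
        odd |= (0b11 * o) << (2 * k)      # each extracted bit doubled
        even |= (0b11 * e) << (2 * k)
    return odd, even

assert split_and_double(0b11001001) == (0b11001100, 0b11000011)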

Loop optimisation

I am trying to understand what cache or other optimizations could be done in the source code to make this loop faster. I think it is quite cache-friendly, but are there any experts out there who could squeeze a bit more performance out of this code?
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      SIDEBACK = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                 STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                 STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)

      SIDEOWN = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
                STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K) + &
                STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)

      SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                  STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                  STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)

      RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
    END DO
  END DO
END DO
Ok, I think I've tried everything I reasonably could, and my conclusion, unfortunately, is that there is not much room for optimization unless you are willing to go into parallelization. Let's see why, and what you can and can't do.
Compiler optimizations
Compilers nowadays are extremely good at optimizing code, much better than humans are. Relying on the optimizations done by the compiler also has the added benefit that it doesn't ruin the readability of your source code. Whatever you do (when optimizing for speed), always try it with every reasonable combination of compiler flags. You can even go as far as to try multiple compilers. Personally, I only used gfortran (included in GCC; the OS is 64-bit Windows), which I trust to have efficient and correct optimization techniques.
-O2 almost always improves the speed drastically, but even -O3 is a safe bet (among others, it includes delicious loop unrolling). For this problem I also tried -ffast-math and -fexpensive-optimizations; they didn't have any measurable effect, but -march=corei7 (CPU architecture-specific tuning, here for Core i7) did, so I did the measurements with -O3 -march=corei7.
So how fast is it, actually?
I wrote the following code to test your solution and compiled it with -O3 -march=corei7. It usually ran in 0.78-0.82 seconds.
program benchmark
  implicit none

  real :: start, finish
  integer :: I, J, K
  real :: SIDEBACK, SIDEOWN, SIDEFRONT
  integer, parameter :: NX = 600
  integer, parameter :: NY = 600
  integer, parameter :: NZ = 600
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES

  call random_number(STEN)
  call cpu_time(start)

  DO K = 1, NZ
    DO J = 1, NY
      DO I = 1, NX
        SIDEBACK = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                   STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                   STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)

        SIDEOWN = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
                  STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K) + &
                  STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)

        SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                    STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                    STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)

        RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
      END DO
    END DO
  END DO

  call cpu_time(finish)

  !Use the calculated value, so the compiler doesn't optimize away everything.
  !Print the original value as well, because one can never be too paranoid.
  print *, STEN(1,1,1), RES(1,1,1)
  print '(f6.3," seconds.")', finish - start
end program
Ok, so this is as far as the compiler can take us. What's next?
Store intermediate results?
As you might suspect from the question mark, this one didn't really work. Sorry. But let's not get ahead of ourselves.
As mentioned in the comments, your current code calculates every partial sum multiple times: one iteration's STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) will be the next iteration's STEN(I,J-1,K-1) + STEN(I,J,K-1) + STEN(I,J+1,K-1), so there is no need to fetch and calculate it again; you could store those partial results.
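To make the reuse concrete, here is a hedged 1-D Python sketch of the idea (an illustration only, not the Fortran code): each 3-element window shares a 2-element partial sum with its neighbour.
def windowed_sums(a):
    """Sum a[i-1] + a[i] + a[i+1] for interior i, reusing pairwise partial sums."""
    sums = []
    pair = a[0] + a[1]                 # partial sum shared with the next window
    for i in range(1, len(a) - 1):
        sums.append(pair + a[i + 1])   # complete this window
        pair = a[i] + a[i + 1]         # carry the shared part forward
    return sums

assert windowed_sums([1, 2, 3, 4]) == [6, 9]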
The problem is that we cannot store too many partial results. As you said, your code is already quite cache-friendly; every partial sum you store means one less array element you can store in L1 cache. We could store a few values from the last few iterations of I (values for index I-2, I-3, etc.), but the compiler almost certainly does that already. I have two pieces of evidence for this suspicion. First, my manual loop unrolling (with SIDEBACK, SIDEOWN and SIDEFRONT redeclared as arrays indexed 0 to 9) made the program slower by about 5%:
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX, 8
      SIDEBACK(0) = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1)
      SIDEBACK(1) = STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1)
      SIDEBACK(2) = STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
      SIDEBACK(3) = STEN(I+2,J-1,K-1) + STEN(I+2,J,K-1) + STEN(I+2,J+1,K-1)
      SIDEBACK(4) = STEN(I+3,J-1,K-1) + STEN(I+3,J,K-1) + STEN(I+3,J+1,K-1)
      SIDEBACK(5) = STEN(I+4,J-1,K-1) + STEN(I+4,J,K-1) + STEN(I+4,J+1,K-1)
      SIDEBACK(6) = STEN(I+5,J-1,K-1) + STEN(I+5,J,K-1) + STEN(I+5,J+1,K-1)
      SIDEBACK(7) = STEN(I+6,J-1,K-1) + STEN(I+6,J,K-1) + STEN(I+6,J+1,K-1)
      SIDEBACK(8) = STEN(I+7,J-1,K-1) + STEN(I+7,J,K-1) + STEN(I+7,J+1,K-1)
      SIDEBACK(9) = STEN(I+8,J-1,K-1) + STEN(I+8,J,K-1) + STEN(I+8,J+1,K-1)

      SIDEOWN(0) = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K)
      SIDEOWN(1) = STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K)
      SIDEOWN(2) = STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
      SIDEOWN(3) = STEN(I+2,J-1,K) + STEN(I+2,J,K) + STEN(I+2,J+1,K)
      SIDEOWN(4) = STEN(I+3,J-1,K) + STEN(I+3,J,K) + STEN(I+3,J+1,K)
      SIDEOWN(5) = STEN(I+4,J-1,K) + STEN(I+4,J,K) + STEN(I+4,J+1,K)
      SIDEOWN(6) = STEN(I+5,J-1,K) + STEN(I+5,J,K) + STEN(I+5,J+1,K)
      SIDEOWN(7) = STEN(I+6,J-1,K) + STEN(I+6,J,K) + STEN(I+6,J+1,K)
      SIDEOWN(8) = STEN(I+7,J-1,K) + STEN(I+7,J,K) + STEN(I+7,J+1,K)
      SIDEOWN(9) = STEN(I+8,J-1,K) + STEN(I+8,J,K) + STEN(I+8,J+1,K)

      SIDEFRONT(0) = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1)
      SIDEFRONT(1) = STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1)
      SIDEFRONT(2) = STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
      SIDEFRONT(3) = STEN(I+2,J-1,K+1) + STEN(I+2,J,K+1) + STEN(I+2,J+1,K+1)
      SIDEFRONT(4) = STEN(I+3,J-1,K+1) + STEN(I+3,J,K+1) + STEN(I+3,J+1,K+1)
      SIDEFRONT(5) = STEN(I+4,J-1,K+1) + STEN(I+4,J,K+1) + STEN(I+4,J+1,K+1)
      SIDEFRONT(6) = STEN(I+5,J-1,K+1) + STEN(I+5,J,K+1) + STEN(I+5,J+1,K+1)
      SIDEFRONT(7) = STEN(I+6,J-1,K+1) + STEN(I+6,J,K+1) + STEN(I+6,J+1,K+1)
      SIDEFRONT(8) = STEN(I+7,J-1,K+1) + STEN(I+7,J,K+1) + STEN(I+7,J+1,K+1)
      SIDEFRONT(9) = STEN(I+8,J-1,K+1) + STEN(I+8,J,K+1) + STEN(I+8,J+1,K+1)

      RES(I    ,J,K) = ( SIDEBACK(0) + SIDEOWN(0) + SIDEFRONT(0) + &
                         SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                         SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) ) / 27.0
      RES(I + 1,J,K) = ( SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                         SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                         SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) ) / 27.0
      RES(I + 2,J,K) = ( SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                         SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                         SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) ) / 27.0
      RES(I + 3,J,K) = ( SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                         SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                         SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) ) / 27.0
      RES(I + 4,J,K) = ( SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                         SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                         SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) ) / 27.0
      RES(I + 5,J,K) = ( SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                         SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                         SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) ) / 27.0
      RES(I + 6,J,K) = ( SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                         SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                         SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) ) / 27.0
      RES(I + 7,J,K) = ( SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                         SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) + &
                         SIDEBACK(9) + SIDEOWN(9) + SIDEFRONT(9) ) / 27.0
    END DO
  END DO
END DO
And what's worse, it's easy to show that we are already pretty close to the theoretical minimum possible execution time. In order to calculate all these averages, the absolute minimum we need to do is access every element at least once and divide it by 27.0. So you can never get faster than the following code, which executes in 0.48-0.5 seconds on my machine.
program benchmark
  implicit none

  real :: start, finish
  integer :: I, J, K
  integer, parameter :: NX = 600
  integer, parameter :: NY = 600
  integer, parameter :: NZ = 600
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES

  call random_number(STEN)
  call cpu_time(start)

  DO K = 1, NZ
    DO J = 1, NY
      DO I = 1, NX
        !This of course does not do what you want to do,
        !this is just an example of a speed limit we can never surpass.
        RES(I, J, K) = STEN(I, J, K) / 27.0
      END DO
    END DO
  END DO

  call cpu_time(finish)

  !Use the calculated value, so the compiler doesn't optimize away everything.
  print *, STEN(1,1,1), RES(1,1,1)
  print '(f6.3," seconds.")', finish - start
end program
But hey, even a negative result is a result. If just accessing every element once (and dividing by 27.0) takes up more than half of the execution time, that means memory access is the bottleneck. Then maybe you can optimize that.
Less data
If you don't need the full precision of 64-bit doubles, you can declare your array with a type of real(kind=4). But maybe your reals are already 4 bytes. In that case, I believe some Fortran implementations support non-standard 16-bit (half-precision) reals, or, depending on your data, you may be able to use integers (perhaps floats multiplied by a number and then rounded to integer). The smaller your base type is, the more elements you can fit into the cache. The ideal would be integer(kind=1), of course; it caused more than a 2x speed-up on my machine compared to real(kind=4). But it depends on the precision you need.
Better locality
Column-major arrays are slow when you need data from a neighbouring column, and row-major ones are slow for neighbouring rows.
Fortunately there is a funky way to store data, called a Z-order curve, which does have applications similar to your use case in computer graphics.
I can't promise it will help, maybe it will be terribly counterproductive, but maybe not. Sorry, I didn't feel like implementing it myself, to be honest.
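For reference, here is a minimal Python sketch of 2-D Morton (Z-order) encoding; this is just the index mapping the curve is built on, not a drop-in replacement for the Fortran arrays:
def morton2(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x supplies the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bits
    return z

# Neighbouring (x, y) cells tend to land near each other in z.
assert morton2(0, 0) == 0 and morton2(1, 1) == 3 and morton2(2, 3) == 14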
Parallelization
Speaking of computer graphics, this problem is trivially and extremely well parallelizable, maybe even on a GPU, but if you don't want to go that far, you can just use a normal multicore CPU. The Fortran Wiki seems like a good place to search for Fortran parallelization libraries.
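As a side note, the stencil also maps naturally onto array libraries, which handle the vectorized memory traffic for you; a hedged NumPy sketch of the same 27-point average (for prototyping, not a tuned Fortran replacement):
import numpy as np

N = 64                                   # small size for illustration
sten = np.random.rand(N + 2, N + 2, N + 2)

# Sum the 27 shifted views of the array, then divide once.
res = np.zeros((N, N, N))
for dk in (-1, 0, 1):
    for dj in (-1, 0, 1):
        for di in (-1, 0, 1):
            res += sten[1 + di:N + 1 + di, 1 + dj:N + 1 + dj, 1 + dk:N + 1 + dk]
res /= 27.0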

Read the quantity of carriage returns used in Word

I am calculating accuracy from a Word document by counting the total number of changes made once Track Changes is on. Incorrect use of punctuation is counted as a 1/4 mark, while for contextual or grammar errors a full 1 mark is deducted.
Right now every carriage return is being counted as 1 full mark. I want this either removed completely or passed along as a 1/4 mark deduction. I am using the following for counting ".", ";" and "," as 1/4 mark deductions:
For Each myRevision In ActiveDocument.Revisions
    myRevision.Range.Select
    If myRevision.Type = wdRevisionInsert Then
        lngRevisions = Len(Selection.Text)
        For i = 1 To lngRevisions
            Select Case Mid(Selection.Text, i, 1)
                Case ",", ".", ";"
                    punct = punct + 1
                Case ""
                    ' Never matches: Mid always returns one character,
                    ' so carriage returns fall through and end up as 1 full mark.
                    punct = punct + 1
            End Select
        Next i
        Count = Count + 1
    End If
Next
tCorrections = Count + punct * 0.25 - punct
Accuracy = ((tWords - tCorrections) / tWords) * 100
Accuracy = Round(Accuracy, 1)
Use an array of type names (aLabels) and a string of the characters occurring in your data (sC), via a mapping (aMap) of characters to counting slots, for a flexible way to classify your string(s). As in this demo:
Option Explicit
Dim aLabels : aLabels = Split("Vowels Consonants Digits Punctuations EOLs Unclassified")
ReDim aCounts(UBound(aLabels))
Dim sC : sC = "abce1,2." & vbCr
Dim aMap : aMap = Array(0, 1, 1, 0, 2, 3, 2, 3, 4)
Dim sD : sD = sC & "d" & sC & "bb111."
Dim p, i
For p = 1 To Len(sD)
    i = Instr(sC, Mid(sD, p, 1))
    If 0 = i Then
        i = UBound(aLabels)
    Else
        i = aMap(i - 1)
    End If
    aCounts(i) = aCounts(i) + 1
Next
For i = 0 To UBound(aLabels)
    WScript.Echo Right(" " & aCounts(i), 3), aLabels(i)
Next
output:
cscript 42505210.vbs
4 Vowels
6 Consonants
7 Digits
5 Punctuations
2 EOLs
1 Unclassified
Based on such raw data (frequencies of types) you can add special weights.
Update wrt comment:
As I said: Add weights after calculating the raw frequencies:
... as above ...
Dim nSum
' Std - all weights = 1
nSum = 0 : For Each i In aCounts : nSum = nSum + i : Next
WScript.Echo "all pigs are equal:", nSum
' No EOLs
nSum = 0 : For Each i In aCounts : nSum = nSum + i : Next : nSum = nSum - aCounts(4)
WScript.Echo "EOLs don't count:", nSum
nSum = 0 : aCounts(0) = aCounts(0) * 4 : For Each i In aCounts : nSum = nSum + i : Next
WScript.Echo "vowels count * 4:", nSum
additional output:
all pigs are equal: 25
EOLs don't count: 23
vowels count * 4: 37
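The same classify-then-weight idea in Python, as a hedged sketch (the character classes mirror the demo above; the weights are just an example combining two of the variations):
labels = ["Vowels", "Consonants", "Digits", "Punctuations", "EOLs", "Unclassified"]
classes = {"a": 0, "e": 0, "b": 1, "c": 1,
           "1": 2, "2": 2, ",": 3, ".": 3, "\r": 4}

def classify(text):
    counts = [0] * len(labels)
    for ch in text:
        counts[classes.get(ch, 5)] += 1   # unknown characters -> Unclassified
    return counts

sC = "abce1,2.\r"
counts = classify(sC + "d" + sC + "bb111.")   # -> [4, 6, 7, 5, 2, 1]

weights = [4, 1, 1, 1, 0, 1]                  # e.g. vowels * 4, EOLs don't count
print(sum(c * w for c, w in zip(counts, weights)))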

Uninterlace bits from a 16-bit value

I have a 16 bit value with its bits "interlaced".
I want to get an array of 8 items (values 0 to 3) that stores the bits in this order:
item 0: bits 7 and 15
item 1: bits 6 and 14
item 2: bits 5 and 13
...
item 7: bits 0 and 8
This is a trivial solution:
function uninterlace(n) {
    return [((n>>7)&1)|((n>>14)&2),  // bits 7 and 15
            ((n>>6)&1)|((n>>13)&2),  // bits 6 and 14
            ((n>>5)&1)|((n>>12)&2),  // bits 5 and 13
            ((n>>4)&1)|((n>>11)&2),  // bits 4 and 12
            ((n>>3)&1)|((n>>10)&2),  // bits 3 and 11
            ((n>>2)&1)|((n>> 9)&2),  // bits 2 and 10
            ((n>>1)&1)|((n>> 8)&2),  // bits 1 and 9
            ((n>>0)&1)|((n>> 7)&2)]; // bits 0 and 8
}
Does anyone know a better (faster) way of doing this?
Edit:
Notes:
Building a precalculated table is not an option.
Cannot use assembler or CPU-specific optimizations
Faster than a hand-written unrolled loop? I doubt it.
The code could be made less repetitive by using a for-loop, but that wouldn't make it run any faster.
function uninterlace(n) {
    var mask = 0x0101;  // 0b0000_0001_0000_0001
    var slide = 0x7f;   // 0b0111_1111
    return [(((n >> 7) & mask) + slide) >> 7,  // bits 7 and 15
            (((n >> 6) & mask) + slide) >> 7,  // bits 6 and 14
            (((n >> 5) & mask) + slide) >> 7,  // bits 5 and 13
            (((n >> 4) & mask) + slide) >> 7,  // bits 4 and 12
            (((n >> 3) & mask) + slide) >> 7,  // bits 3 and 11
            (((n >> 2) & mask) + slide) >> 7,  // bits 2 and 10
            (((n >> 1) & mask) + slide) >> 7,  // bits 1 and 9
            (((n >> 0) & mask) + slide) >> 7]; // bits 0 and 8
}
This is only four operations per entry, instead of 5. The trick is in reusing the shifted value. The addition of slide moves the relevant bits adjacent to each other, and the shift by 7 puts them in the low-order position. The use of + might be a weakness.
A bigger weakness might be that each entry's operations must be done entirely in sequence, creating a latency of 4 instructions from entering a processor's pipeline to leaving it. These can be fully pipelined, but would still have some delay. The question's version exposes some instruction-level parallelism, and could potentially have a latency of only 3 instructions per entry, given sufficient execution resources.
It might be possible to combine multiple extractions into fewer operations, but I haven't seen a way to do it yet. The interlacing does, in fact, make that challenging.
Edit: a two-pass approach that treats the low- and high-order bits symmetrically (with interleaved 0's), shifts them off from each other by one, and ors the results could be much faster, and would scale to longer bit strings.
Edited to correct the slide per Pedro's comment. Sorry to take up your time on my poor base-conversion skills. It was originally 0xef, which puts the 0 bit in the wrong place.
Ok, now with 3 operations per item (tested and works).
This is a variation of Novelocrat's answer. It uses variable masks and slides.
function uninterlace(n) {
    return [((n & 0x8080) + 0x3FFF) >> 14,
            ((n & 0x4040) + 0x1FFF) >> 13,
            ((n & 0x2020) + 0x0FFF) >> 12,
            ((n & 0x1010) + 0x07FF) >> 11,
            ((n & 0x0808) + 0x03FF) >> 10,
            ((n & 0x0404) + 0x01FF) >> 9,
            ((n & 0x0202) + 0x00FF) >> 8,
            ((n & 0x0101) + 0x007F) >> 7];
}
How about a small precalculated table of 129 entries times 2? (The indices used below go up to 128, i.e. n & 0x80, hence 129 entries: index 0 maps to the low value, any nonzero index to the high value.)
int b1[129] = { 2, 3, 3, .. 3 };  /* b1[0] = 2, every other entry 3 */
int b0[129] = { 0, 1, 1, .. 1 };  /* b0[0] = 0, every other entry 1 */

function uninterlace(n) {
    return [((n & 0x8000) ? b1 : b0)[n & 0x80],
            ((n & 0x4000) ? b1 : b0)[n & 0x40],
            ((n & 0x2000) ? b1 : b0)[n & 0x20],
            ((n & 0x1000) ? b1 : b0)[n & 0x10],
            ((n & 0x0800) ? b1 : b0)[n & 0x08],
            ((n & 0x0400) ? b1 : b0)[n & 0x04],
            ((n & 0x0200) ? b1 : b0)[n & 0x02],
            ((n & 0x0100) ? b1 : b0)[n & 0x01]];
}
This uses bit masking and table lookup instead of shifts and additions and might be faster.
