FFT Formal Proof  Auxiliary Info
Contents
 1 Housekeeping/Schedule (or: how/when I'm gonna be done with this)
 2 Takala equation and paper
 3 TODO/NEXT/CONTINUE FROM HERE 
 4 Notes  bank num encoding table
 5 The Proof (old)
 6 Sketching out the remainder of the proof
 7 Trash
 7.1 trash 1/27/2014
 7.2 trash
 7.3 Trash: clean example table
 7.4 Trash: nonadjacent toggle bits table
 7.5 Trash: Algorithm 1: One radix2 butterfly (R==2 and B==1) and even number of address bits
 7.6 TRASH/OLD Boundary condition, odd number of bits
 7.7 TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1
 7.8 TRASH/OLD The ... REST ... of the story?
 7.9 TRASH/OLD Extension to more than one butterfly unit (B > 1).
Housekeeping/Schedule (or: how/when I'm gonna be done with this)
Checkoff list:
Target Actual Finish Finish Task    Tue 11/19 Tue 11/19 Move the old stuff to an auxiliary page. DONE! ... Tue 11/19 Tue 11/19 Clean up the outline, figure out where to put the "nomenclature" section (see below) ... Tue 11/19 xxx xx/xx Build a timeline for what gets done when, working like twelve hours a week or whatever. ... Tue 11/19 Tue 11/19 Start on nomenclature section: what is N, n, N_B etc. ... Thu 11/19 Tue 11/19 Finish nomenclature section: what is N, n, N_B etc. ... Thu 11/19 xxx xx/xx Continue with the proof, in order of sections not finished ... TBD List of sections, including when each one will be done ... xxx xx/xx xxx xx/xx Section 5 xxx xx/xx xxx xx/xx 5.1 The equation so far xxx xx/xx xxx xx/xx 5.2 Boundary condition, odd number of bits xxx xx/xx xxx xx/xx 5.3 Boundary condition, even number of bits xxx xx/xx xxx xx/xx 5.4 Final equation for r==2 and B==1 xxx xx/xx xxx xx/xx Final polish of section 5 ... xxx xx/xx xxx xx/xx Section 6 xxx xx/xx xxx xx/xx 6.1 Extension to more than one butterfly unit (B > 1). xxx xx/xx xxx xx/xx 6.2 Extension to radix r > 2. xxx xx/xx xxx xx/xx 6.3 Extension to longer pipelines.
Takala equation and paper
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1
http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg
TODO/NEXT/CONTINUE FROM HERE 
SCRUBBED TO HERE MAYBE
Notes  bank num encoding table
Bank num m  Bits m_{n+1}m_{n}m_{n1}...m_{0} 

m0  m_{n+1}m_{n}m_{n1}...m_{0} = 00...00_{2} = 0 
m1  m_{n+1}m_{n}m_{n1}...m_{1} = 00...01_{2} = 1 
↓  ↓ 
mn  m_{n+1}m_{n}m_{n1}...m_{0} = 11...10_{2} = 4B2 
m(n+1)  m_{n+1}m_{n}m_{n1}...m_{0} = 11...11_{2} = 4B1 
 bank m0: m_{n+1}m_{n}m_{n1}...m_{0} = 00...00_{2}
 bank m1: m_{n+1}m_{n}m_{n1}...m_{1} = 00...01_{2}
 ...
 bank mn: m_{n+1}m_{n}m_{n1}...m_{0} = 11...10_{2}
 bank m(n+1): m_{n+1}m_{n}m_{n1}...m_{0} = 11...11_{2}
OLD
 m_{n+1}m_{n}m_{n1}...m_{0}={(00...00_{2}),(00...01_{2}), ... (11...10_{2}),(11...11_{2})} i.e.
 m0=00..00_{2}, m1=00..01_{2}, mn=11..10_{2} and m(n+1)=11..11_{2}.
The Proof (old)
Note this part of the proof is old and possibly unnecessary; all the key information is in the section currently labeled "The Problem," above.
Here is an algorithm for an FFT schedule free of interstage memory access conflicts along with an embedded formal proof of correctness. The proof is developed for an FFT implemented as a single timesliced and pipelined radix2 butterfly unit, e.g. one radixtwo butterfly transforming eight datapoints. The proof will later be extended to cover any powerof2 number of butterfly units at any powerof2 radix, e.g. four radixfour butterflies transforming 1024 datapoints.
The initial proof posits an FFT based on the CooleyTukey [verify] algorithm, implemented as a single radix2 butterfly. The butterfly unit repeatedly reads two input operands, calculates two results, then writes the results back to memory, repeating with new input operands until the FFT operation is complete. The butterfly unit is pipelined such that in continuous operation, the read of two new operands happens at the same time as the writeback of the two previous results. [maybe include pipeline diagram here] This obviously requires a minimum of four memory accesses per cycle, using say a single fourported memory bank, two dualported memory banks, or four independent singleported memory banks. Our schedule targets the most efficient of these choices, four independent banks of singleported memory.
Given the constraints outlined above, our schedule must place operands in memory and schedule the operand accesses such that any given group of four operands (two reads plus two writes) reside in four separate memory banks. As an example, consider the FFT schedule in Fig. 1 below, for N=8 datapoints. When implemented as a single timesliced butterfly unit, the memory accesses occur as shown in Fig 2. For simplicity we assume that datapoint dp0 resides at memory location m0, dp1 at m1 and so on.
Fig. 1 goes here: std butterfly network for eight datapoints
Fig. 2: stretchedout version of Fig. 1 I guess
In cycle 0 of Fig. 1, the butterfly unit accesses datapoints from memory locations m0 and m1. In cycle 1, the butterfly unit writes its results back to m0 and m1 and reads new datapoints from m2 and m3. For this to happen without a stall, m0, m1, m2 and m3 must all reside in separate memory banks. Working through the entire diagram, we find the following nine groups of memory locations such that, during some cycle of execution, each memory location within the group gets accessed at the same time as each of the other three members of the group, and thus all four must live in separate banks for conflictfree operation.
{m0,m1,m2,m3} {m2,m3,m4,m5} {m4,m5,m6,m7} {m0,m2,m4,m6} {m4,m6,m1,m5} {m1,m5,m3,m7} {m0,m4,m1,m5} {m1,m5,m2,m6} {m2,m6,m3,m7}
So, for example, locations m0, m1, m2 and m3 can live in banks b0, b1, b2 and b3 respectively, thus satisfying the separateness requirements for group 1. In group 2, locations m2 and m3 have already been assigned to b2 and b3, and thus m4 and m5 must be assigned to b0 and b1. If we follow this through to group 3, we find that m6 and m7 have to reside in banks b2 and b3. By the time we get to group 4, we've already assigned m0, m2, m4 and m6 to b0, b2, b0 and b2, which is a problem because this causes collisions in both banks b0 and b2.
Finding the correct assignment, such that no collisions occur in any of the nine groups, is a very hard problem that may not be solvable for all values of N, much less for our little example where N=8. However, we can change the rules slightly. Note that the operations within a given FFT stage can occur in any order and still yield correct results. Thus in place of the hardtoschedule network of Fig. 2 we could substitute a different ordering, such as that given by Fig. 3.
Fig. 3: tractable reordered schedule
Now the groups that must reside in separate banks are as follows:
m0 m1 m2 m3 m2 m3 m5 m4 m5 m4 m7 m6 m0 m2 m4 m6 m4 m6 m5 m1 m5 m1 m7 m3 m0 m4 m7 m3 m7 m3 m5 m1 m5 m1 m2 m6
Given this schedule, we can achieve our goal by placing datapoints m0 and m5 in bank b0, m1 and m4 in bank b1, m2 and m7 in bank b2, and m3 and m6 in bank b3:
[Below, replace A, B, C and D with b0, b1, b2 and b3 respectively maybe]
A B C D C D A B A B C D m0 m1 m2 m3 m2 m3 m5 m4 m5 m4 m7 m6 A C B D B D A C A C B D m0 m2 m4 m6 m4 m6 m5 m1 m5 m1 m7 m3 A B C D C D A B A B C D m0 m4 m7 m3 m7 m3 m5 m1 m5 m1 m2 m6
[maybe this (above) is gonna work better graphically...???]
The remainder of this proof will show how to deterministically achieve a schedule and a mapping for similar conflictfree operation given any number of datapoints N=2^i, where i is any integer greater than 2 (cases for i <= 2 are trivial and obvious).
Sketching out the remainder of the proof
The CooleyTukey algorithm [check] for sixteen datapoints yields the following schedule:
stage 0: 0 1 2 3 4 5 6 7 8 9 a b c d e f stage 1: 0 2 4 6 8 a c e 1 3 5 7 9 b d f stage 2: 0 4 8 c 1 5 9 d 2 6 a e 3 7 b f stage 3: 0 8 1 9 2 a 3 b 4 c 5 d 6 e 7 f
If we write these operands in binary form it looks something like this:
stage 0 stage 1 stage 2 stage 3     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Note that, in any given group of four within any given stage, only two bits toggle while the remaining bits stay the same. For the purposes of this discussion, we'll number the bits 0 to 3, right to left, where bit 0 is the least significant bit (LSB) and bit 3 is the most significant bit (MSB).
In stage 0, bits 0 and 1 toggle while bits 2 and 3 stay at 00 (in the first group of four), then 01 (in the second group), then 10 and then 11.
In stage 1 it is bits 1 and 2 that toggle while bits 3 and 0 remain constant within each group of four.
In general, for stage s, bits s+1 and s within any group of four will count 00,01,10,11 while the remaining bits s1, s2, ... increment only at groupoffour boundaries, where si wraps back to the MSB if si is less than 0. More precisely, given S stages [0..(S1]], then within each group of four in stage s, bits
b_{[(s+1) mod S]} b_{s}
count (00,01,10,11) while the remaining bits
b_{[(s+S) mod S]} b_{[(s+(S1)) mod S]} b_{[(s+(S2)) mod S]} ... b_{[(s+2) mod S]}
increment (0,1,10,11,100,101...) at each groupoffour boundary.
Our example illustrates this for N=16 datapoints, but the principle holds for any arbitrarily large number of datapoints. Bits s+1 and s count 00 to 11 while the remaining bits increment at the end of each group of four, counting up e.g. if N=1024 then the second groupoffour in stage 2 would be 0000001(00)0, 0000001(01)0, 0000001(10)0, 0000001(11)0.
Note that things get a little "funny" in the final stage of execution, when we run out of bits. In our example, at stage 3, it is not bits 3 and 4 that count up from 00 to 11 in each group of four (bit 4 doesn't exist); it is bits 3 and 0. That is, the toggling pair has wrapped back around to the beginning of the word. Things get even "funnier" when there is an odd number of bits, we'll talk about this later.
We're going to use the property we've just discovered to produce a conflictfree schedule. In particular, we're going to make sure that 1) within each group of four, the four operands will map to the four separate banks b0, b1, b2 and b3 in some order. And then 2) we're going to reorder the operations within each group of four such that bank accesses are strictly ordered within the group, namely b0 then b1 then b2 then b3 for evennumbered stages and b0 then b2 then b1 then b3 for oddnumbered stages(!)
extending to nbutts > 1
Note that in any group of EIGHT operands, bits (s, s+1, s+2) count 000,001,010,011,100,101,110,111 while the remaining bits stay the same. Similarly for groups of 16, 32, 64, etc. etc. This allows us to extend our earlier proof in the following way...
extending to radix > 2
Note that the formulation for a radixfour schedule is similar to that for two radix2 butterflies operating in parallel. This is the basis for an obvious extension to any poweroftwo radix greater than two etc. etc.
Trash
trash 1/27/2014
Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:
 (d1,d3)(d5,d7); (d1,d3)(d7,d5); (d3,d1)(d5,d7); (d3,d1)(d7,d5);
 (d5,d7)(d1,d3); (d5,d7)(d3,d1); (d7,d5)(d1,d3); (d7,d5)(d3,d1).
Further suppose that, when stage number s is even and when P_{od} is about to count (1,1,0,0), we change d_{s+1} (a component of P_{od}, because we know s+1 is odd) such that it counts (1,1,0,0) instead of (0,0,1,1). (In our example sequence (d1,d3,d5,d7) we would not need to make this change, because P_{od} already counts the way we want it to.) Is it okay to change the sequence like this? Yes it is because this is a valid reordering, as discussed above. And now P_{od} for the new ordering counts (0,0,1,1) instead of the original (1,1,0,0).
When the stage number is odd, we know that d_{s} is an odd bit and thus it is P_{od} that will be either (0,1,0,1) or (1,0,1,0) while P_{ev} is (0,0,1,1) or (1,1,0,0). Bank number P_{od}P_{ev} for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).
Now let's conspire to make the memory banks count consistently (m0,m1,m2,m3, m0,m1,m2,m3,...) for evennumbered stages. For this to happen, Pev will need to consistently count (0,1,0,1, 0,1,0,1, ...) and Pod will need to consistently count (0,0,1,1, 0,0,1,1, ...) such that memory bank number PodPev will count (00,01,10,11, 00,01,10,11, ...)
If, however, we replace b(s) with P_{ev} then we know that P_{ev} will always count (0,1,0,1) (why?)
in particular if po/pe==00 then the bank order will be (00,01,10,11); if pope==01 the order will be (01,00,11,10); if pope==10 the order will be (10,11,00,01); and if pope==11 the order will be (11,10,01,00).
IF WE REPLACE BS+1/BS WITH POPE WE ALWAYS GET THE ORDER (00,01,10,11) (WHY?)
Note that IN ALL CASES within any group of four, the operands map one each to the four separate banks 00, 01, 10 and 11 in some order.
trash
A B C D m0 m1 m2 m3 m5 m4 m7 m6 m0 m2 m4 m6 m5 m1 m7 m3 m0 m4 m7 m3 m5 m1 m2 m6
INSIGHT NUMBER TWO: Within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in stage 1 or our example, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:
* (d1,d3)(d5,d7) (d1,d3)(d7,d5) (d3,d1)(d5,d7) (d3,d1)(d7,d5) * (d5,d7)(d1,d3) (d5,d7)(d3,d1) (d7,d5)(d1,d3) (d7,d5)(d3,d1)
When we change the order of operands, we change the order of bank numbers being accessed; in particular, because stage 1 is odd, our eight reorderings correspond to the following eight possible bank sequences:
* (01,11)(00,10) (01,11)(10,00) (11,01)(00,10) (11,01)(10,00) * (00,10)(01,11) (00,10)(11,01) (10,00)(01,11) (10,00)(11,01)
IN PARTICULAR IF WE REPLACE THE TOGGLING BITS B(S+1),B(S) WITH THEIR PARITAL EQUIVALENTS P(S+1)P(S) WE ALWAYS GET A GROUPOFFOUR BANK ORDER (00,01,10,11) FOR THE EVEN STAGES AND (00,10,01,11) FOR THE ODD STAGES (IS THIS TRUE!!??)
In the general case, there are only four cases possible for the nontoggling bits, based on the parity of the evennumbered bits p(e) vs. the oddnumbered bits p(o). The four cases are p(o)p(e) = 00, 01, 10 and 11 respectively. We will assign banks to operands such that banknum = f(Po,Pe) where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks. In our example,
 p(o) = parity(b_{3}) = parity(0) = 0 and
 p(e) = parity(b_{4}b_{0} = parity(01) = 1
Next we will come up with four formulas to handle the four cases
(po,pe)(b(s+1)b(s)) = 00(00,01,10,11) 01(00,01,10,11) 10(00,01,10,11) and 11(00,01,10,11)
respectively, such that in each case the banknum counts (00,01,10,11) when s is even and consequently will count (00,10,01,11) when s is odd.
In our example (d1,d3,d5,d7) it is bits 1 and 2 that count while the remaining bit b0 is always 1. For this group, then, po is 0 (the empty set, actually) and pe is 1. Solving for group (d1,d3,d5,d7) solves for all groups such that s is odd, po is 0 and p1 is 1, such as
(d1,d3,d5,d7) (d25,d28,d30,d32) (41,d43,d45,d47) (d1,d9,d17,d25) (d2,d10,d18,d26) (d7,d15,d23,d31) etc.
[maybe go for a simpler example above?]
Trash: clean example table
Datapoint  Address (binary)  P_{1},P_{0}  bank 

d00 d01 d02 d03 
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d04 d05 d06 d07 
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 
0,1 0,0 1,1 1,0 
m1 m0 m3 m2 
d08 d09 d10 d11 
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 
1,1 1,0 0,1 0,0 
m3 m2 m1 m0 
d12 d13 d14 d15 
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 
1,0 1,1 0,0 0,1 
m2 m3 m0 m1 
Trash: nonadjacent toggle bits table
<=== (Stages 0, 1, 2, same as before) ===>  (new)  

Stage 0  Stage 1  Stage 2  Stage 3 




d0=0000  d0=0000  d0=0000  d0=0000 
d1=0001  d2=0010  d2=0100  d4=1000 
d2=0010  d4=0100  d4=1000  d1=0001 
d3=0011  d6=0110  d6=1100  d5=1001 




d4=0100  d1=0001  d1=0001  d2=0010 
d5=0101  d3=0011  d3=0101  d6=1010 
d6=0110  d5=0101  d5=1001  d3=0011 
d7=0111  d7=0111  d7=1101  d7=1011 




...  ...  ...  ... 
d7=1110  d7=1101  d7=1011  d7=0111 
d7=1111  d7=1111  d7=1111  d7=1111 
Trash: Algorithm 1: One radix2 butterfly (R==2 and B==1) and even number of address bits
Given
 a transform with N datapoints such that N is a power of two and N ≥ 16; and
 an even number of stages S, that is, S=log_{2}(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
For the algorithm to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 00_{2} to 11_{2} i.e. m_{1}m_{0}={00_{2},01_{2},10_{2},11_{2}} i.e. m0=00_{2}, m1=01_{2}, m2=10_{2} and m3=11_{2}.
The following equation(s) transform a given stage's datapoint sequence
 (dp_{(N1)} dp_{(N2)}...dp_{1},dp_{0})
into an equivalent sequence
 (dp'_{(N1)} dp'_{(N2)}...dp'_{1},dp'_{0})
such that each datapoint dp'_{i} maps to a memory bank mb_{i} where the corresponding memory bank sequence
 (mb'_{(N1)} mb'_{(N2)}...mb'_{0},mb'_{1})
is the repeating series
 (m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),
where m0 = memory bank 00_{2}, m1 = memory bank 01_{2}, m2 = memory bank 10_{2} and m3 = memory bank 11_{2}.
So:
For each Sbit datapoint dp_{s,i} in a given transform stage s ∈ (0..S1),
 dp_{s,i} = d_{S1}d_{S2}...d_{1}d_{0}
and i ∈ (0..N1), redefine it as a new datapoint
 dp'_{s,i} = d'_{S1}d'_{S2}...d'_{1}d'_{0}
where
if (s==S1) then  d'_{0} = P_{0} and d'_{s} = P_{1}  and all other d'_{j} = d_{j} , where j ∈ (1..S2) 
else  d'_{s} = P_{s} and d'_{(s+1)} = P_{(s+1)}  and all other d'_{j} = d_{j} , where j ∈ (0..S1) and j ∉ {s,s+1} 
and
P_{k} = P_{0} = P_{even}  if k is even and 
P_{k} = P_{1} = P_{odd}  if k is odd 
and
 P_{0} = P_{even} = Parity(d_{S2},d_{S4}...d_{2},d_{0})
is the parity of the datapoint's even bits, that is all those bits d_{j} such that j mod 2 = 0; and
 P_{1} = P_{odd} = Parity(d_{S1},d_{S3}...d_{3},d_{1})
is the parity of the datapoint's odd bits, that is all those bits d_{j} such that j mod 2 = 1.
The equations work for
 a transform with N datapoints such that N is a power of two and N ≥ 16; and
 an even number of stages S, that is, S=log_{2}(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
That is, they work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.
Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)
Datapoint d (decimal) 
Address (binary) = d_{3}d_{2}d_{1}d_{0} 
P_{1},P_{0}  Datapoint d' = Replace d_{s+1}d_{s} (d_{1}d_{0}) with P_{1}P_{0} (reorder) 
P'_{1},P'_{0}  bank = P'_{1}P'_{0} 

d00 d01 d02 d03 
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 
0,0 0,1 1,0 1,1 
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d04 d05 d06 d07 
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 
0,1 0,0 1,1 1,0 
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d08 d09 d10 d11 
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 
1,1 1,0 0,1 0,0 
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d12 d13 d14 d15 
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 
1,0 1,1 0,0 0,1 
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new)
Datapoint d=d' (decimal) 
Address (binary) = d_{3}d_{2}d_{1}d_{0} = d'_{3}d'_{2}d'_{1}d'_{0} 
P'_{1},P'_{0}  Reorder: d" = d'_{3}d'_{2}P'_{1}P'_{0} 
P"_{1},P"_{0}  bank = P"_{1}P"_{0} 

d00 d01 d02 d03 
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 
0,0 0,1 1,0 1,1 
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d04 d05 d06 d07 
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 
0,1 0,0 1,1 1,0 
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d08 d09 d10 d11 
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 
1,1 1,0 0,1 0,0 
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d12 d13 d14 d15 
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 
1,0 1,1 0,0 0,1 
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old)
Datapoint d (decimal) 
Address (binary) = d_{3}d_{2}d_{1}d_{0} 
P_{1},P_{0}  Datapoint d' = Replace d_{s+1}d_{s} (d_{1}d_{0}) with P_{1}P_{0} (reorder) 
P'_{1},P'_{0}  bank = P'_{1}P'_{0} 

d00 d01 d02 d03 
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 
0,0 0,1 1,0 1,1 
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d04 d05 d06 d07 
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 
0,1 0,0 1,1 1,0 
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d08 d09 d10 d11 
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 
1,1 1,0 0,1 0,0 
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
d12 d13 d14 d15 
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 
1,0 1,1 0,0 0,1 
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 
0,0 0,1 1,0 1,1 
m0 m1 m2 m3 
TRASH/OLD Boundary condition, odd number of bits
If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S2. But then things change for the final stage S1, where the toggling bits are the two outside bits d_{S1} (fast toggle 0,1,0,1) and d_{0} (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that d_{S1} was odd and d_{0} was even, and thus P_{odd} would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the nontoggling bits) while P_{even} would toggle 0,0,1,1 (or 1,1,0,0).
In the case where the number of bits is odd, however, the two outside (toggling) bits d_{S1} and d_{0} are both even. Thus, as d_{0}d_{S1} count (00,01,10,11), P_{odd} does not change at all (because the toggling bits are both even) while P_{even} counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the nontoggling bits) instead of its usual 0011 (or 1100). Thus, if base the memory bank number on P_{odd}P_{even} we only get two memory banks repeated twice, e.g. 00010001, instead of the desired mapping to four separate memory banks 00011011
Talk about what happens in final stage when s==nbits and nbits is odd.
Stage 0  Stage 1  Stage 2 

d0=000  d0=000  d0=000 
d1=001  d2=010  d4=100 
d2=010  d4=100  d1=001 
d3=011  d6=110  d5=101 
d4=100  d1=001  d2=010 
d5=101  d3=011  d6=110 
d6=110  d5=101  d3=011 
d7=111  d7=111  d7=111 
Try this: for stage s=S1, instead of
 d_s = d_(S1) = P_s = P_1 (because S is even and thus S1 is odd)
and
 d_(s+1) = d_0 = P_(s+1) = P_0,
do this maybe
 d_s = d_(S1) = xor(P_s,d_0) = xor(P_(S1),d_0) = xor(P_0,d_0) (because S is odd and thus (S1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)
and
 d_(s+1) = d_S = d_1 (why? because d_(S1)==d_0 sorta?) (because
and
 d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)
Final equation for when nbits is odd and R==2 and B==1
TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1
Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)
TRASH/OLD The ... REST ... of the story?
FFT_Formal_Proof__Auxiliary_Info#Sketching_out_the_remainder_of_the_proof
TRASH/OLD Extension to more than one butterfly unit (B > 1).
Generalization of Algorithm 1, extended for any B=2^{i} and any number of address bits S: mod_{Mbits}S=0
Eventually we will find and show that, by extension, if we have
 an FFT with B butterflies (and thus M=4B memory banks);
 a transform with S=log_{2}(N) stages such that mod_{Mbits}(S)=0, where Mbits=log_{2} (M)
 (e.g. B=1⇒M=4⇒Mbits=2 and mod_{Mbits}(S)=mod_{2}(S)=0⇒number of stages S is even);
everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above
The mapping of any given datapoint
 d =d_{S1}d_{S2}...d_{1}d_{0}
to its proper memory bank
 m=m_{Mbits1}m_{Mbits2}...m_{1}m_{0}
is as follows:
 m_{Mbits1} = P_{(Mbits1)/Mbits}
 m_{Mbits2} = P_{(Mbits2)/Mbits}
 m_{1} = P_{1/Mbits}
 m_{0} = P_{0/Mbits}
where
 P_{Mbits1/Mbits} = Parity(d_{S1},d_{SMbits1}...d_{2Mbits1},d_{Mbits1})
is the parity of all those bits d_{i} such that i mod Mbits = (Mbits1); and
 P_{Mbits2/Mbits} = Parity(d_{S2},d_{SMbits2}...d_{2Mbits2},d_{Mbits2})
is the parity of all those bits d_{i} such that i mod Mbits = (Mbits2); and
 P_{1/Mbits} = Parity(d_{S(Mbits1)},d_{S(2Mbits1)}...d_{Mbits+1},d_{1})
is the parity of all those bits d_{i} such that i mod Mbits = 1; and
 P_{0/Mbits} = Parity(d_{S(Mbits)},d_{S(2Mbits)}...d_{Mbits},d_{0})
is the parity of all those bits d_{i} such that i mod Mbits = 0; and The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.