FFT Formal Proof - Auxiliary Info

From Genesis2
Jump to: navigation, search

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Checkoff list:

  Target     Actual
  Finish     Finish    Task
 ---------  ---------  ----------------------------------------------
 Tue 11/19  Tue 11/19  Move the old stuff to an auxiliary page. DONE!
 ...
 Tue 11/19  Tue 11/19  Clean up the outline, figure out where to put
                       the "nomenclature" section (see below)
 ...
 Tue 11/19  xxx xx/xx  Build a timeline for what gets done when, 
                       working like twelve hours a week or whatever.
 ...
 Tue 11/19  Tue 11/19  Start on nomenclature section: what is N, n, N_B etc. 
 ...
 Thu 11/19  Tue 11/19  Finish nomenclature section: what is N, n, N_B etc. 
 ...
 Thu 11/19  xxx xx/xx  Continue with the proof, in order of sections not finished
 ...
 TBD List of sections, including when each one will be done
 ...
 xxx xx/xx  xxx xx/xx  Section 5
 xxx xx/xx  xxx xx/xx    5.1 The equation so far
 xxx xx/xx  xxx xx/xx    5.2 Boundary condition, odd number of bits
 xxx xx/xx  xxx xx/xx    5.3 Boundary condition, even number of bits
 xxx xx/xx  xxx xx/xx    5.4 Final equation for r==2 and B==1
 xxx xx/xx  xxx xx/xx    Final polish of section 5
 ...
 xxx xx/xx  xxx xx/xx  Section 6
 xxx xx/xx  xxx xx/xx    6.1 Extension to more than one butterfly unit (B > 1).
 xxx xx/xx  xxx xx/xx    6.2 Extension to radix r > 2.
 xxx xx/xx  xxx xx/xx    6.3 Extension to longer pipelines.

Takala equation and paper

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1

http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg

Takala equations.jpg

TODO/NEXT/CONTINUE FROM HERE ----------------------------

SCRUBBED TO HERE MAYBE

Notes - bank num encoding table

  Bank num m    Bits mn+1mnmn-1...m0
m0 mn+1mnmn-1...m0    =    00...002    =    0
m1 mn+1mnmn-1...m1    =    00...012    =    1
mn mn+1mnmn-1...m0    =    11...102    =    4B-2
m(n+1) mn+1mnmn-1...m0    =    11...112    =    4B-1


  • bank m0: mn+1mnmn-1...m0 = 00...002
  • bank m1: mn+1mnmn-1...m1 = 00...012
  • ...
  • bank mn: mn+1mnmn-1...m0 = 11...102
  • bank m(n+1): mn+1mnmn-1...m0 = 11...112


OLD

  • mn+1mnmn-1...m0={(00...002),(00...012), ... (11...102),(11...112)} i.e.
  • m0=00..002, m1=00..012, mn=11..102 and m(n+1)=11..112.

The Proof (old)

Note this part of the proof is old and possibly unnecessary; all the key information is in the section currently labeled "The Problem," above.

Here is an algorithm for an FFT schedule free of interstage memory access conflicts along with an embedded formal proof of correctness. The proof is developed for an FFT implemented as a single time-sliced and pipelined radix-2 butterfly unit, e.g. one radix-two butterfly transforming eight datapoints. The proof will later be extended to cover any power-of-2 number of butterfly units at any power-of-2 radix, e.g. four radix-four butterflies transforming 1024 datapoints.

The initial proof posits an FFT based on the Cooley-Tukey [verify] algorithm, implemented as a single radix-2 butterfly. The butterfly unit repeatedly reads two input operands, calculates two results, then writes the results back to memory, repeating with new input operands until the FFT operation is complete. The butterfly unit is pipelined such that in continuous operation, the read of two new operands happens at the same time as the writeback of the two previous results. [maybe include pipeline diagram here] This obviously requires a minimum of four memory accesses per cycle, using say a single four-ported memory bank, two dual-ported memory banks, or four independent single-ported memory banks. Our schedule targets the most efficient of these choices, four independent banks of single-ported memory.

Given the constraints outlined above, our schedule must place operands in memory and schedule the operand accesses such that any given group of four operands (two reads plus two writes) reside in four separate memory banks. As an example, consider the FFT schedule in Fig. 1 below, for N=8 datapoints. When implemented as a single time-sliced butterfly unit, the memory accesses occur as shown in Fig 2. For simplicity we assume that datapoint dp0 resides at memory location m0, dp1 at m1 and so on.

Fig. 1 goes here: std butterfly network for eight datapoints

Fig. 2: stretched-out version of Fig. 1 I guess

In cycle 0 of Fig. 1, the butterfly unit accesses datapoints from memory locations m0 and m1. In cycle 1, the butterfly unit writes its results back to m0 and m1 and reads new datapoints from m2 and m3. For this to happen without a stall, m0, m1, m2 and m3 must all reside in separate memory banks. Working through the entire diagram, we find the following nine groups of memory locations such that, during some cycle of execution, each memory location within the group gets accessed at the same time as each of the other three members of the group, and thus all four must live in separate banks for conflict-free operation.

{m0,m1,m2,m3} {m2,m3,m4,m5} {m4,m5,m6,m7}
{m0,m2,m4,m6} {m4,m6,m1,m5} {m1,m5,m3,m7}
{m0,m4,m1,m5} {m1,m5,m2,m6} {m2,m6,m3,m7}

So, for example, locations m0, m1, m2 and m3 can live in banks b0, b1, b2 and b3 respectively, thus satisfying the separateness requirements for group 1. In group 2, locations m2 and m3 have already been assigned to b2 and b3, and thus m4 and m5 must be assigned to b0 and b1. If we follow this through to group 3, we find that m6 and m7 have to reside in banks b2 and b3. By the time we get to group 4, we've already assigned m0, m2, m4 and m6 to b0, b2, b0 and b2, which is a problem because this causes collisions in both banks b0 and b2.

Finding the correct assignment, such that no collisions occur in any of the nine groups, is a very hard problem that may not be solvable for all values of N, much less for our little example where N=8. However, we can change the rules slightly. Note that the operations within a given FFT stage can occur in any order and still yield correct results. Thus in place of the hard-to-schedule network of Fig. 2 we could substitute a different ordering, such as that given by Fig. 3.

Fig. 3: tractable reordered schedule

Now the groups that must reside in separate banks are as follows:

m0 m1 m2 m3    m2 m3 m5 m4    m5 m4 m7 m6
m0 m2 m4 m6    m4 m6 m5 m1    m5 m1 m7 m3
m0 m4 m7 m3    m7 m3 m5 m1    m5 m1 m2 m6


Given this schedule, we can achieve our goal by placing datapoints m0 and m5 in bank b0, m1 and m4 in bank b1, m2 and m7 in bank b2, and m3 and m6 in bank b3:

[Below, replace A, B, C and D with b0, b1, b2 and b3 respectively maybe]

  A  B  C  D     C  D  A  B     A  B  C  D
 m0 m1 m2 m3    m2 m3 m5 m4    m5 m4 m7 m6

  A  C  B  D     B  D  A  C     A  C  B  D
 m0 m2 m4 m6    m4 m6 m5 m1    m5 m1 m7 m3

  A  B  C  D     C  D  A  B     A  B  C  D
 m0 m4 m7 m3    m7 m3 m5 m1    m5 m1 m2 m6

[maybe this (above) is gonna work better graphically...???]

The remainder of this proof will show how to deterministically achieve a schedule and a mapping for similar conflict-free operation given any number of datapoints N=2^i, where i is any integer greater than 2 (cases for i <= 2 are trivial and obvious).

Sketching out the remainder of the proof

The Cooley-Tukey algorithm [check] for sixteen datapoints yields the following schedule:

 stage 0:  0 1 2 3  4 5 6 7  8 9 a b  c d e f
 stage 1:  0 2 4 6  8 a c e  1 3 5 7  9 b d f
 stage 2:  0 4 8 c  1 5 9 d  2 6 a e  3 7 b f
 stage 3:  0 8 1 9  2 a 3 b  4 c 5 d  6 e 7 f

If we write these operands in binary form it looks something like this:

  stage 0    stage 1    stage 2    stage 3
  -------    -------    -------    -------    
  0 0 0 0    0 0 0 0    0 0 0 0    0 0 0 0
  0 0 0 1    0 0 1 0    0 1 0 0    1 0 0 0
  0 0 1 0    0 1 0 0    1 0 0 0    0 0 0 1
  0 0 1 1    0 1 1 0    1 1 0 0    1 0 0 1

  0 1 0 0    1 0 0 0    0 0 0 1    0 0 1 0
  0 1 0 1    1 0 1 0    0 1 0 1    1 0 1 0
  0 1 1 0    1 1 0 0    1 0 0 1    0 0 1 1
  0 1 1 1    1 1 1 0    1 1 0 1    1 0 1 1

  1 0 0 0    0 0 0 1    0 0 1 0    0 1 0 0
  1 0 0 1    0 0 1 1    0 1 1 0    1 1 0 0
  1 0 1 0    0 1 0 1    1 0 1 0    0 1 0 1
  1 0 1 1    0 1 1 1    1 1 1 0    1 1 0 1

  1 1 0 0    1 0 0 1    0 0 1 1    0 1 1 0
  1 1 0 1    1 0 1 1    0 1 1 1    1 1 1 0
  1 1 1 0    1 1 0 1    1 0 1 1    0 1 1 1
  1 1 1 1    1 1 1 1    1 1 1 1    1 1 1 1

Note that, in any given group of four within any given stage, only two bits toggle while the remaining bits stay the same. For the purposes of this discussion, we'll number the bits 0 to 3, right to left, where bit 0 is the least significant bit (LSB) and bit 3 is the most significant bit (MSB).

In stage 0, bits 0 and 1 toggle while bits 2 and 3 stay at 00 (in the first group of four), then 01 (in the second group), then 10 and then 11.

In stage 1 it is bits 1 and 2 that toggle while bits 3 and 0 remain constant within each group of four.

In general, for stage s, bits s+1 and s within any group of four will count 00,01,10,11 while the remaining bits s-1, s-2, ... increment only at group-of-four boundaries, where s-i wraps back to the MSB if s-i is less than 0. More precisely, given S stages [0..(S-1]], then within each group of four in stage s, bits

     b[(s+1) mod S] bs

count (00,01,10,11) while the remaining bits

     b[(s+S) mod S] b[(s+(S-1)) mod S] b[(s+(S-2)) mod S] ... b[(s+2) mod S]

increment (0,1,10,11,100,101...) at each group-of-four boundary.

Our example illustrates this for N=16 datapoints, but the principle holds for any arbitrarily large number of datapoints. Bits s+1 and s count 00 to 11 while the remaining bits increment at the end of each group of four, counting up e.g. if N=1024 then the second group-of-four in stage 2 would be 0000001(00)0, 0000001(01)0, 0000001(10)0, 0000001(11)0.

Note that things get a little "funny" in the final stage of execution, when we run out of bits. In our example, at stage 3, it is not bits 3 and 4 that count up from 00 to 11 in each group of four (bit 4 doesn't exist); it is bits 3 and 0. That is, the toggling pair has wrapped back around to the beginning of the word. Things get even "funnier" when there is an odd number of bits, we'll talk about this later.

We're going to use the property we've just discovered to produce a conflict-free schedule. In particular, we're going to make sure that 1) within each group of four, the four operands will map to the four separate banks b0, b1, b2 and b3 in some order. And then 2) we're going to reorder the operations within each group of four such that bank accesses are strictly ordered within the group, namely b0 then b1 then b2 then b3 for even-numbered stages and b0 then b2 then b1 then b3 for odd-numbered stages(!)

extending to nbutts > 1

Note that in any group of EIGHT operands, bits (s, s+1, s+2) count 000,001,010,011,100,101,110,111 while the remaining bits stay the same. Similarly for groups of 16, 32, 64, etc. etc. This allows us to extend our earlier proof in the following way...

extending to radix > 2

Note that the formulation for a radix-four schedule is similar to that for two radix-2 butterflies operating in parallel. This is the basis for an obvious extension to any power-of-two radix greater than two etc. etc.

Trash

trash 1/27/2014

Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

  • (d1,d3)(d5,d7);    (d1,d3)(d7,d5);    (d3,d1)(d5,d7);    (d3,d1)(d7,d5);
  • (d5,d7)(d1,d3);    (d5,d7)(d3,d1);    (d7,d5)(d1,d3);    (d7,d5)(d3,d1).

Further suppose that, when stage number s is even and when Pod is about to count (1,1,0,0), we change ds+1 (a component of Pod, because we know s+1 is odd) such that it counts (1,1,0,0) instead of (0,0,1,1). (In our example sequence (d1,d3,d5,d7) we would not need to make this change, because Pod already counts the way we want it to.) Is it okay to change the sequence like this? Yes it is because this is a valid reordering, as discussed above. And now Pod for the new ordering counts (0,0,1,1) instead of the original (1,1,0,0).

When the stage number is odd, we know that ds is an odd bit and thus it is Pod that will be either (0,1,0,1) or (1,0,1,0) while Pev is (0,0,1,1) or (1,1,0,0). Bank number PodPev for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).

Now let's conspire to make the memory banks count consistently (m0,m1,m2,m3, m0,m1,m2,m3,...) for even-numbered stages. For this to happen, Pev will need to consistently count (0,1,0,1, 0,1,0,1, ...) and Pod will need to consistently count (0,0,1,1, 0,0,1,1, ...) such that memory bank number PodPev will count (00,01,10,11, 00,01,10,11, ...)

If, however, we replace b(s) with Pev then we know that Pev will always count (0,1,0,1) (why?)

in particular if po/pe==00 then the bank order will be (00,01,10,11); if pope==01 the order will be (01,00,11,10); if pope==10 the order will be (10,11,00,01); and if pope==11 the order will be (11,10,01,00).

IF WE REPLACE BS+1/BS WITH POPE WE ALWAYS GET THE ORDER (00,01,10,11) (WHY?)

Note that IN ALL CASES within any group of four, the operands map one each to the four separate banks 00, 01, 10 and 11 in some order.

trash

 A  B  C  D
m0 m1 m2 m3
m5 m4 m7 m6
m0 m2 m4 m6
m5 m1 m7 m3
m0 m4 m7 m3
m5 m1 m2 m6



INSIGHT NUMBER TWO: Within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in stage 1 or our example, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

* (d1,d3)(d5,d7)  (d1,d3)(d7,d5)  (d3,d1)(d5,d7)  (d3,d1)(d7,d5)
* (d5,d7)(d1,d3)  (d5,d7)(d3,d1)  (d7,d5)(d1,d3)  (d7,d5)(d3,d1)


When we change the order of operands, we change the order of bank numbers being accessed; in particular, because stage 1 is odd, our eight reorderings correspond to the following eight possible bank sequences:

* (01,11)(00,10)  (01,11)(10,00)  (11,01)(00,10)  (11,01)(10,00)
* (00,10)(01,11)  (00,10)(11,01)  (10,00)(01,11)  (10,00)(11,01)



IN PARTICULAR IF WE REPLACE THE TOGGLING BITS B(S+1),B(S) WITH THEIR PARITAL EQUIVALENTS P(S+1)P(S) WE ALWAYS GET A GROUP-OF-FOUR BANK ORDER (00,01,10,11) FOR THE EVEN STAGES AND (00,10,01,11) FOR THE ODD STAGES (IS THIS TRUE!!??)


In the general case, there are only four cases possible for the non-toggling bits, based on the parity of the even-numbered bits p(e) vs. the odd-numbered bits p(o). The four cases are p(o)p(e) = 00, 01, 10 and 11 respectively. We will assign banks to operands such that banknum = f(Po,Pe) where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks. In our example,

  • p(o) = parity(b3) = parity(0) = 0 and
  • p(e) = parity(b4b0 = parity(01) = 1


Next we will come up with four formulas to handle the four cases

 (po,pe)(b(s+1)b(s)) = 00(00,01,10,11) 01(00,01,10,11) 10(00,01,10,11) and 11(00,01,10,11)

respectively, such that in each case the banknum counts (00,01,10,11) when s is even and consequently will count (00,10,01,11) when s is odd.

In our example (d1,d3,d5,d7) it is bits 1 and 2 that count while the remaining bit b0 is always 1. For this group, then, po is 0 (the empty set, actually) and pe is 1. Solving for group (d1,d3,d5,d7) solves for all groups such that s is odd, po is 0 and p1 is 1, such as

 (d1,d3,d5,d7) (d25,d28,d30,d32) (41,d43,d45,d47)
 (d1,d9,d17,d25) (d2,d10,d18,d26) (d7,d15,d23,d31)
  etc.

[maybe go for a simpler example above?]


Trash: clean example table

An example table
   Datapoint       Address (binary)       P1,P0       bank   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
m1
m0
m3
m2
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
m3
m2
m1
m0
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
m2
m3
m0
m1

Trash: non-adjacent toggle bits table

<=== (Stages 0, 1, 2, same as before) ===> (new)
Stage 0 Stage 1 Stage 2 Stage 3




d0=0000 d0=0000 d0=0000 d0=0000
d1=0001 d2=0010 d2=0100 d4=1000
d2=0010 d4=0100 d4=1000 d1=0001
d3=0011 d6=0110 d6=1100 d5=1001




d4=0100 d1=0001 d1=0001 d2=0010
d5=0101 d3=0011 d3=0101 d6=1010
d6=0110 d5=0101 d5=1001 d3=0011
d7=0111 d7=0111 d7=1101 d7=1011




... ... ... ...
d7=1110 d7=1101 d7=1011 d7=0111
d7=1111 d7=1111 d7=1111 d7=1111

Trash: Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Given

  • a transform with N datapoints such that N is a power of two and N ≥ 16; and
  • an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

For the algorithm to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 002 to 112 i.e. m1m0={002,012,102,112} i.e. m0=002, m1=012, m2=102 and m3=112.

The following equation(s) transform a given stage's datapoint sequence

(dp(N-1) dp(N-2)...dp1,dp0)

into an equivalent sequence

(dp'(N-1) dp'(N-2)...dp'1,dp'0)

such that each datapoint dp'i maps to a memory bank mbi where the corresponding memory bank sequence

(mb'(N-1) mb'(N-2)...mb'0,mb'1)

is the repeating series

(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),

where m0 = memory bank 002, m1 = memory bank 012, m2 = memory bank 102 and m3 = memory bank 112.

So:

For each S-bit datapoint dps,i in a given transform stage s ∈ (0..S-1),

dps,i = dS-1dS-2...d1d0

and i ∈ (0..N-1), redefine it as a new datapoint

dp's,i = d'S-1d'S-2...d'1d'0

where

    if (s==S-1) then      d'0 = P0    and d's = P1    and all other d'j = dj , where j ∈ (1..S-2)
    else      d's = Ps    and d'(s+1) = P(s+1)    and all other d'j = dj , where j ∈ (0..S-1) and j ∉ {s,s+1}

and

    Pk = P0 = Peven    if k is even and
    Pk = P1 = Podd    if k is odd

and

  • P0 = Peven = Parity(dS-2,dS-4...d2,d0)

is the parity of the datapoint's even bits, that is all those bits dj such that j mod 2 = 0; and

  • P1 = Podd = Parity(dS-1,dS-3...d3,d1)

is the parity of the datapoint's odd bits, that is all those bits dj such that j mod 2 = 1.

The equations work for

  • a transform with N datapoints such that N is a power of two and N ≥ 16; and
  • an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

That is, they work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)
   Datapoint d   
(decimal)
   Address (binary)
= d3d2d1d0   
   P1,P0       Datapoint d' =
Replace ds+1ds (d1d0)
with P1P0 (reorder)
   P'1,P'0       bank = P'1P'0   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
0 1 0 1 = d05
0 1 0 0 = d04
0 1 1 1 = d07
0 1 1 0 = d06
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
1 0 1 1 = d11
1 0 1 0 = d10
1 0 0 1 = d09
1 0 0 0 = d08
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
1 1 1 0 = d14
1 1 1 1 = d15
1 1 0 0 = d12
1 1 0 1 = d13
0,0
0,1
1,0
1,1
m0
m1
m2
m3

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)
   Datapoint d=d'   
(decimal)
   Address (binary)
= d3d2d1d0   
= d'3d'2d'1d'0   
   P'1,P'0       Reorder:
d" = d'3d'2P'1P'0
   P"1,P"0       bank = P"1P"0   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
0 1 0 1 = d05
0 1 0 0 = d04
0 1 1 1 = d07
0 1 1 0 = d06
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
1 0 1 1 = d11
1 0 1 0 = d10
1 0 0 1 = d09
1 0 0 0 = d08
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
1 1 1 0 = d14
1 1 1 1 = d15
1 1 0 0 = d12
1 1 0 1 = d13
0,0
0,1
1,0
1,1
m0
m1
m2
m3

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)
   Datapoint d   
(decimal)
   Address (binary)
= d3d2d1d0   
   P1,P0       Datapoint d' =
Replace ds+1ds (d1d0)
with P1P0 (reorder)
   P'1,P'0       bank = P'1P'0   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
0 1 0 1 = d05
0 1 0 0 = d04
0 1 1 1 = d07
0 1 1 0 = d06
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
1 0 1 1 = d11
1 0 1 0 = d10
1 0 0 1 = d09
1 0 0 0 = d08
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
1 1 1 0 = d14
1 1 1 1 = d15
1 1 0 0 = d12
1 1 0 1 = d13
0,0
0,1
1,0
1,1
m0
m1
m2
m3

TRASH/OLD Boundary condition, odd number of bits

If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S-2. But then things change for the final stage S-1, where the toggling bits are the two outside bits dS-1 (fast toggle 0,1,0,1) and d0 (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that dS-1 was odd and d0 was even, and thus Podd would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the non-toggling bits) while Peven would toggle 0,0,1,1 (or 1,1,0,0).

In the case where the number of bits is odd, however, the two outside (toggling) bits dS-1 and d0 are both even. Thus, as d0dS-1 count (00,01,10,11), Podd does not change at all (because the toggling bits are both even) while Peven counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the non-toggling bits) instead of its usual 0-0-1-1 (or 1-1-0-0). Thus, if base the memory bank number on PoddPeven we only get two memory banks repeated twice, e.g. 00-01-00-01, instead of the desired mapping to four separate memory banks 00-01-10-11

Talk about what happens in final stage when s==nbits and nbits is odd.

note stage 1 is WRONG should be 0,2,4,6,8,10,12,14 not 0,2,4,6,1,3,...
Stage 0      Stage 1      Stage 2     
d0=000 d0=000 d0=000
d1=001 d2=010 d4=100
d2=010 d4=100 d1=001
d3=011 d6=110 d5=101
d4=100 d1=001 d2=010
d5=101 d3=011 d6=110
d6=110 d5=101 d3=011
d7=111 d7=111 d7=111

https://skydrive.live.com/view.aspx?cid=8595D345F1A90105&resid=8595D345F1A90105%213371&app=WordPdf&wdo=1

Try this: for stage s=S-1, instead of

  • d_s = d_(S-1) = P_s = P_1 (because S is even and thus S-1 is odd)

and

  • d_(s+1) = d_0 = P_(s+1) = P_0,


do this maybe

  • d_s = d_(S-1) = xor(P_s,d_0) = xor(P_(S-1),d_0) = xor(P_0,d_0) (because S is odd and thus (S-1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)

and

  • d_(s+1) = d_S = d_1 (why? because d_(S-1)==d_0 sorta?) (because

and

  • d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)


Final equation for when nbits is odd and R==2 and B==1

TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1

Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)

TRASH/OLD The ... REST ... of the story?

FFT_Formal_Proof_-_Auxiliary_Info#Sketching_out_the_remainder_of_the_proof

TRASH/OLD Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2i and any number of address bits S: modMbitsS=0

Eventually we will find and show that, by extension, if we have

  • an FFT with B butterflies (and thus M=4B memory banks);
  • a transform with S=log2(N) stages such that modMbits(S)=0, where Mbits=log2 (M)
  • (e.g. B=1⇒M=4⇒Mbits=2 and modMbits(S)=mod2(S)=0⇒number of stages S is even);

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

The mapping of any given datapoint

d =dS-1dS-2...d1d0

to its proper memory bank

m=mMbits-1mMbits-2...m1m0

is as follows:

  • mMbits-1 = P(Mbits-1)/Mbits
  • mMbits-2 = P(Mbits-2)/Mbits
  • m1 = P1/Mbits
  • m0 = P0/Mbits

where

  • PMbits-1/Mbits = Parity(dS-1,dS-Mbits-1...d2Mbits-1,dMbits-1)

is the parity of all those bits di such that i mod Mbits = (Mbits-1); and

  • PMbits-2/Mbits = Parity(dS-2,dS-Mbits-2...d2Mbits-2,dMbits-2)

is the parity of all those bits di such that i mod Mbits = (Mbits-2); and

  • P1/Mbits = Parity(dS-(Mbits-1),dS-(2Mbits-1)...dMbits+1,d1)

is the parity of all those bits di such that i mod Mbits = 1; and

  • P0/Mbits = Parity(dS-(Mbits),dS-(2Mbits)...dMbits,d0)

is the parity of all those bits di such that i mod Mbits = 0; and The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.