April 26-29, 2004 • Hyatt Regency Dallas www.motorola.com/sndf

## H1119 - Introduction to AltiVec - Ten easy ways to Vectorize your code

Sergei Larin
Sr. SW Engineer
CPD Applications Engineering


## What is a Vector Architecture?

- A vector architecture allows the simultaneous processing of multiple data items in parallel
- Operations are performed on multiple data elements by a single instruction
- Referred to as Single Instruction Multiple Data (SIMD) parallel processing



## What is AltiVec?

- SIMD extension to PowerPC Architecture
- ... no tradeoffs ... just additions
- Provides a high-performance RISC microprocessor with DSP-like compute power
- Allows highly parallel operations for the simultaneous execution of up to 16 operations in a single clock cycle
- Offers a programmable solution for controller and signal processing functions
- ...which can easily migrate via software upgrades to follow changing standards and customer requirements


## AltiVec's Vector Execution Unit



- Concurrent with PowerPC integer and floating-point units
- Separate, dedicated 32 128-bit vector registers
- Approximately 11\% of the silicon area
- No penalty for mixing integer, floating point and AltiVec operations


## SIMD Intra-element Instructions



## AltiVec Instruction Set Features

- 162 new instructions added to the PowerPC ISA
- Intra and inter-element arithmetic instructions
- Intra and inter-element conditional instructions
- Powerful Permute, Shift and Rotate, Splat, Pack/Unpack and Merge instructions
- 4-operand, non-destructive instructions
- Up to three source operands and a single destination operand
- Supports advanced "multiply-add/sum" and permute primitives
- All instructions fully pipelined with single-cycle throughput
- Simple ops: 1 cycle latency
- Compound ops: 3-4 cycle latency
- No restriction on issue with scalar instructions


## Enabling AltiVec in your applications

- Let Compiler do the job
- Currently there are no practical Autovectorizers out-there
- It is a guess how much performance will be actually extracted
- and they still require certain expertise to work with
- Using C intrinsic
- There is a standardized list of intrinsics supported in all PowerPC enabled compilers
- Can actually guarantee good level of control, up to the level of register assignment
- Using Assembly Language programming
- The most effective, and the most laborious way
- Can provide 100\% performance extraction

April 26-29, 2004 • Hyatt Regency Dallas www.motorola.com/sndf

The ways...


## Method 0 - Where to start

- There are two possible starting points:
- Mathematical description of the algorithm
- Existing C code written for serial execution
- Designing algorithms in Vector form from scratch is the best approach...
- ... but the practice shows that much more often it is not the case
- Most of the time starting point is C code written for "serial" execution
- C code is an excellent starting point
- It guarantees model execution
- ... but you should remember that vectorization is not trivial
- Vectorized code often needs completely different approach
- Finally set your goals clear
- Do you want the fastest code possible?
- Do you want the most compact code possible?
- Combination of both?


## Method 1 - Beware of your Bottlenecks

- One of the first things to realize is whether the algorithm is Compute or I/O bound
- If the code entirely bounded by memory performance it would not help to reduce the latency of computations
- If the code is computation intensive, we probably want to reduce that latency and then revisit memory bandwidth
- If the code is control intensive, it might not be the best candidate for vectorization
- unless you can use predication and convert control dependency to data dependency
- or you can use MULTICHANNEL processing
- If memory is the bottleneck the question is -
- Is it streaming data case (system bus is the bottleneck)
- ... or is it cache resident scenario (same data processed more than once, and then flushed back to main memory)
- If system bus is the bottleneck...
- Is it saturated?
- The theoretical 60x bus throughput is 640 Mbytes/sec @ 100MHz
- The theoretical MPX bus throughput is $800 \mathrm{Mbytes} / \mathrm{sec} @ 100 \mathrm{MHz}$
- If you have already reached your bus capacity, and you are sure you are doing just the necessary work - what else can you do?
© Freescale Semiconductor, Inc. 2004


## Method 1 - Beware of your Bottlenecks

- If we are dealing with cache resident data, you should strongly consider using AltiVec
- AltiVec has FOUR TIMES the bandwidth
- One vector load brings 128 bit of data
- One FP load - 64 bit
- Once scalar load only 32 bit
- Beware, that there is no direct "link" between GPRs and Vector Registers (VRs)
- To "copy" a value between the two you need to store-load it
- This means that you do need to keep majority of computations in one place
- One exception from this rule is memcpy/memmove/memset kind of operation
- If you copy large amounts of data between two locations, it might be useful to use AltiVec just as a transport manager
- If computations are the critical path...
- Is there true data dependency between values being computed?
- If there is not, you could be almost certain that AltiVec will deliver performance improvement
© Freescale Semiconductor, Inc. 2004


## Method 2 - Look at the loops

- The easiest way to reduce computation latency is to restructure loops
- This is the main optimization performed by autovectorizers
- Loop Unrolling
- Do multiple iterations in one pass

| int res[n], a[n],b[n]; | int res[n], a[n],b[n]; | ctor int vres[q_n],va[g_n],vb[q_n]; |
| :---: | :---: | :---: |
| $\begin{aligned} & \text { for }(\mathbf{i}=\mathbf{0} ; \mathbf{i}<\mathbf{n} ; \mathbf{i}++)\{ \\ & \text { res }[\mathbf{i}]=\mathbf{a}[\mathbf{i}]+\mathbf{b}[\mathbf{i}] ; \\ & \} \end{aligned}$ | for ( $\mathrm{i}=0$; $\mathrm{i}<\mathrm{n} / 4$; $\mathrm{i}+=4$ ) \{ | // q_n == n/4 |
|  | $\operatorname{res}[\mathbf{i}]=\mathbf{a}[\mathbf{i}]+\mathbf{b}[\mathbf{i}]$; | for (i=0; $\mathrm{i}<\mathrm{q} \_\mathrm{n}$; $\mathrm{i}+=4$ ) ( |
|  | $\mathbf{r e s}[\mathbf{i}+1]=\mathrm{a}[\mathrm{i}+1]+\mathrm{b}[\mathbf{i}+1]$; | vres[i] = vec_add(va[i],vb[i]); |
|  | $\mathbf{r e s}[\mathbf{i}+2]=\mathrm{a}[\mathrm{i}+2]+\mathrm{b}[\mathbf{i}+2]$; | \} |
|  | $\mathbf{r e s}[\mathbf{i}+3]=\mathbf{a}[\mathbf{i}+3]+\mathbf{b}[\mathbf{i}+3]$; |  |
|  | \} |  |

- Change the amount of computations in the loop
- Traditionally you might want to remove loop invariant computations from the loops and do several smaller loops as opposed to single large one...
- But paradoxically it might sometimes HELP to move MORE computations into loops

All other product or service names are
© Freescale Semiconductor, Inc. 2004

## Method 3 - Look at data dependency

- True Data dependency is often preventing vectorization
- as well as some classical code optimizations
- But in some cases it could be eliminated
- Let us consider Dot Product of two Matrixes
- X,Y vectors size N

```
Dot_Product \((x[n], y[n])=\sum_{i=1}^{n} x[i] * y[i]\)
int DotProduct( int *X, int * \(\mathbf{Y}\), int length )\{
    int temp \(=0\);
    // N Iterations
    for( int \(\mathbf{i}=\mathbf{0} ; \mathbf{i}<\) length; \(\mathbf{i + +}\) ) \{
        temp \(=\mathrm{X}[\mathrm{i}] * \mathrm{Y}[\mathrm{i}]+\) temp;
    \}
    return temp;
\}
```


## Method 3 - Look at data dependency

- Same function could be written in vector form
- Note that v1 and v2 are size of N/4
- so is the "length" == four times fewer iterations

```
int VectorDotProduct( vector int *v1, vector int *v2, int length ){
    vector int temp = (vector int) vec_splat_u32(0);
    int result;
    // Loop over the length of the vectors multiplying like terms and summing
    // Number of iterations is N/4
    for( int i = 0; i < length; i++)
        temp = vec_madd( v1[i], v2[i], temp);
    // Still have four ints splat across a vector
    // Add across the vector
    temp = vec_add( temp, vec_sld( temp, temp, 4 )); // Vector Shift Left Double
    temp = vec_add( temp, vec_sld( temp, temp, 8 ));
    vec_ste( temp, 0, &result );
    return result;
}
```

All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004

## Method 3 - Look at data dependency

- But is this the best possible way of doing it?
- vec_madd takes 4 cycles to complete...

```
int VectorDotProduct( vector int *v1, vector int *v2, int length ){
    vector int temp = (vector int) vec_splat_u32(0);
    int result;
    // Loop over the length of the vectors multiplying like terms and summing
    // Number of iterations is N/4
    for( int i = 0; i < length; i+t.).
        temip = vec_madd( v1[i], v2[i], temp); // true data dependency
                            // only 1 madd every 4 cycles
    temp = vec_add( temp, vec_sld( temp, temp, 4 )); // Vector Shift Left Double
    temp = vec_add( temp, vec_sld( temp, temp, 8 ));
    vec_ste( temp, 0, &result );
    return result;
}
```

All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004

## Method 3 - Look at data dependency

- Now eliminate the data dependency...
int FastVectorDotProduct( vector float *v1, vector float *v2, int length )\{
vector float temp = (vector float) vec_splat_s8(0);
vector float temp2 = temp; vector float temp3 = temp;
vector float temp4 = temp; vector float result;
for ( int $i=0 ; i<$ length; $i+=4)\{\quad / / L o o p$ over the length of the vectors,
temp = vec_madd( $\mathbf{v 1}[\mathrm{i}]$, v2[i], temp); //this time doing 4 vectors in parallel
temp2 = vec_madd( v1[i+1], v2[i+1], temp2); // to fill the pipeline
temp3 $=$ vec_madd( $\mathbf{v} 1[i+2]$, v2[i+2], temp3);
temp4 $=$ vec_madd( $\mathbf{v} 1[i+3]$, v2[i+3], temp4);
\}
//Sum our temp vectors
temp = vec_add( temp, temp2 );
temp3 = vec_add( temp3, temp4 );
temp = vec_add( temp, temp3);
//Add across the vector
temp = vec_add( temp, vec_sld( temp, temp, 4 ));
temp = vec_add(temp, vec_sld( temp, temp, 8 ));
//Copy the result to the stack so we can return it via the IPU
vec_ste( temp, 0 , \&result );
return result;
Many thanks to Ian Ollmann
\}

All other product or service names are the
© Freescale Semiconductor, Inc. 2004

## Method 4 - Look at your Data Layout

- Often algorithm calls for loading of blocks of data in certain order
- Images (pixels orders) are good example
- Let us look at RGB to YCbCr conversion

Layout 1

| 1 | 1 |  | 2 | 2 |  |  | 10 | 10 | 10 | 11 | 11 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 11 | 12 | 12 | 12 | 13 | 13 | ... | 20 | 21 | 21 | 21 | 22 |
| 22 |  | 23 | 23 |  | 24 |  | 31 |  | 32 | 32 |  |

Layout 2


Layout 3


## Method 4 - Look at your Data Layout

- One vector load from Layout 1 yields this:
$\square$

- One solution (and maybe the only) is to get 3 vectors worth, and then use vec_perm instruction

| 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 4 | 5 | 5 | 5 | 6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 6 | 6 | 7 | 7 | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 |
| 11 | 12 | 12 | 12 | 13 | 13 | 13 | 14 | 14 | 14 | 15 | 15 | 15 | 16 | 16 | 16 |


| $\mathbf{0}$ | $\mathbf{3}$ | $\mathbf{6}$ | $\mathbf{9}$ | $\mathbf{1 2}$ | $\mathbf{1 5}$ | $\mathbf{1 8}$ | $\mathbf{2 1}$ | $\mathbf{1}$ | $\mathbf{4}$ | $\mathbf{7}$ | $\mathbf{1 0}$ | $\mathbf{1 3}$ | $\mathbf{1 6}$ | $\mathbf{1 9}$ | $\mathbf{2 2}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathbf{2}$ | $\mathbf{5}$ | $\mathbf{8}$ | $\mathbf{1 1}$ | $\mathbf{1 4}$ | $\mathbf{1 7}$ | $\mathbf{2 0}$ | $\mathbf{2 4}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ |


| $\mathbf{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | $\mathbf{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $\mathbf{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ |

## Method 4 - Look at your Data Layout

- One vector load from Layout 2 yields 16 bytes of one "color"

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

- Which means that only three vector loads will yield 16 full pixels
- In three vector registers

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

- This is the fastest way to get data "in" BUT only if processing is done on "chars" and no extra precision is needed


## Method 4 - Look at your Data Layout

- The actual formula for RGB to YCbCr conversion is:

$$
\begin{aligned}
& Y=\frac{77}{256}\left(219 \cdot \frac{R_{y}}{256}+16\right)+\frac{150}{256}\left(219 \cdot \frac{C_{y}}{256}+16\right)+\frac{29}{256}\left(219 \cdot \frac{B_{y}}{256}+16\right) \\
& C b=-\frac{44}{256}\left(219 \cdot \frac{R_{y}}{256}+16\right)-\frac{87}{256}\left(219 \cdot \frac{G_{y}}{256}+16\right)+\frac{131}{256}\left(219 \cdot \frac{B_{\gamma}}{256}+16\right)+128 \\
& C^{r}=\frac{131}{256}\left(219 \cdot \frac{R_{y}}{256}+16\right)-\frac{110}{256}\left(219 \cdot \frac{G_{y}}{256}+16\right)-\frac{21}{256}\left(219 \cdot \frac{B_{y}}{256}+16\right)+128
\end{aligned}
$$

- Which is equivalent to...

$$
\begin{aligned}
& Y=\frac{8432}{32768}\left(R_{y}\right)+\frac{16425}{32768}\left(G_{y}\right)+\frac{3176}{32768}\left(B_{y}\right)+16 \\
& C b=-\frac{4818}{32768}\left(R_{y}\right)-\frac{9527}{32768}\left(G_{y}\right)+\frac{14345}{32768}\left(B_{y}\right)+128 \\
& C_{r}=\frac{14345}{32768}\left(R_{y}\right)-\frac{12045}{32768}\left(G_{y}\right)-\frac{2300}{32768}\left(a_{y}\right)+128
\end{aligned}
$$

- Which needs 16 bits precision for accurate computation... All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004


## Method 4 - Look at your Data Layout

- This could be done by "unpacking" bytes to shorts:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | $\mathbf{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ | $\mathbf{x}$ |

- vec_unpack_2sh and vec_unpack_2sl

| $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | 4 | 5 | $\mathbf{6}$ | 7 | 8 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $\mathbf{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

- On which the computations are performed... and packed back to bytes by vec_packsu()
- This means that by "reverse engineering" the necessary order of bytes back to memory we will get Layout3, so there is no need for permute instructions after vector loads



## Method 4 - Look at your Data Layout

- Do the global Data Layout analysis
- Caches are only working well on data streams which exhibit spatial and time locality
- Remember that two vector loads == one cache line
- If loading from multiple cache lines, do one load from each line, then go back and load the second half (also works for scalar access)
- Do not group too many loads and stores
- 8-16 vector stores in a row can overflow CSQ (completed store queue) and cause processor to stall
- but remember store merging - put two stores to the same cache line together
- See if "in place" computations are possible
- sometimes reduces memory traffic in half


## Method 5 - Look at the Data Types

- Similarly to the optimization method 4 we can see that Data Type analysis can affect algorithm mapping
- In addition to "normal" or forward data type analysis...
- If you multiply two bytes, you better use short as result
- ... there could be a "reverse" data analysis
- If the actual precision of the result being used is LESS then the precision provided by extended data types, maybe simple rounding will suffice
- A good example of this rule is use of double precision floating point in many embedded algorithms
- Often original algorithms are developed for "generic" conditions, which might not meet exact use of the algorithm in this specific instance
- In this case it is a variation of the Method 0 - know (profile) your application and possible data set


## Method 5 - Look at the Data Types

- One case of data type consideration (and partially data layout) is aligning allocated data to quad word boundaries
- Different compilers have different means of achieving it, but all of them DO
- Here is GCC example...
typedef union\{
vector unsigned int vec;
int elements[4];
\}LongVector __attribute__ ((aligned (16))) ;
unsigned char 8bitBuf[] __attribute__ ((aligned (16))) $=\{$
\#include "attribute_table.txt"
\};
- In this example every variable of data type LongVector will be aligned on quadword boundaries and 8bitBuf is already aligned
- Data Alignment is absolutely critical for mapping algorithms on AltiVec


## Method 5 - Look at the Data Types

- Why?


For a series of array elements: A0, A1, A2, A3

| lvx | v00 ,\&A0 |
| :--- | :--- |
| lvx | v01,\&A1 |
| lvsl | v02 ,\&A0 |
| vperm | v10,v00,v01,v02 |



| lvx | v00 ,\&A0 |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| lvx | v01 ,\&A1 | v00 | v01 | v02 |  |  |
| lvsl | v02 ,\&A0 |  |  |  |  |  |
| vperm | v10, v00,v01,v02 |  |  |  | $\Rightarrow$ | A0 |
| lvx | v00 , \& ${ }^{\text {v11 }}$ 201, 000 |  | A0) A1 | A1 A2 | $\Rightarrow$ | A1 |
| vperm | v11, v01,v00,v02 |  | A0) A1 | A2 | $\Rightarrow$ | A1 |
| lvx vperm | $\begin{aligned} & \text { v01, \&A3 } \\ & \text { v12,v00,v01,v02 } \end{aligned}$ | A1 A2 | A2) A3 |  | $\Rightarrow$ | A2 |
| lvx | v00 ,\&(A3+16) |  |  |  |  |  |
| vperm | v13, v01,v00,v02 |  | A2) A3 | 3 | $\Rightarrow$ | A3 |

## Method 5 - Look at the Data Types

- Loading Unaligned Data requires getting twice (or more) the data you really need

```
vector unsigned char vectorLoadUnaligned( vector unsigned char *v ){
    vector unsigned char permuteVector = vec_lvsl( 0, (int*) v );
    vector unsigned char low = vec_ld( 0,v );
    vector unsigned char high = vec_ld( 16, v );
    return vec_perm( low, high, permuteVector );
}
```


## Method 5 - Look at the Data Types

- Store Unaligned Data is even 'better' ...
- You need to LOAD in order to be able to Store!

```
void vectorStoreUnaligned( vector unsigned char v , vector unsigned char *where)\{
    vector unsigned char permuteVector = vec_lvsr( 0 , (int*) where );
    vector unsigned char low,high,tmp,mask;
    vector signed char ones \(\quad\) vec_splat_s8( -1 );
    vector signed char zeroes \(=\) vec_splat_s8( 0 );
    vector unsigned char low = vec_ld ( 0 , where ); //Load the surrounding area
    vector unsigned char high = vec_ld ( 16 , where );
    //Make a mask for which parts of the vectors to swap out
    mask = vec_perm( zeros, ones, permuteVector );
    tmp = vec_perm( tmp, tmp, permuteVector ); //Right rotate our input data
    low = vec_sel( tmp, low, mask ); // Insert masked data to aligned vector
    high = vec_sel( high, v, mask );
    vec_st ( low, \(\mathbf{0}\), where ); //Store aligned results
    vec_st ( high, 16, where );
\}
```

All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004

## Method 6 - Eliminate Branching

- You cannot proceed at full speed unless you know exactly where you are going...
- When processor encounters a branch instruction, and condition data is not available processor guesses...
- And if it guessed wrong, it will back track to the decision point and start over
- There are some general guidelines on how processor will guess...
- Static branch prediction: Forward branch - not taken, backward branch is taken
- Which means in if-then-else place LIKELY section in "then"
- Dynamic branch prediction - after one or two invocations of the same branch instructions enough HISTORY is accumulated to make good prediction next time around...
- But branch predictor is vulnerable to aliasing...
- Try to avoid branches even if it means more computations - it is likely to be faster!


## Method 6 - Eliminate Branching

- A simple example of finding maximum of two numbers
- Assuming it is used to compare two arrays

```
int Max( int a, int b ) {
    int result;
    if( a < b ) result = b;
    else result = a;
    return result;
}
```


//Return the maximum of two vector integers: result = (a \& ~mask) | (b \& mask);
vector signed int Max(vector signed int $a$, vector signed int $b$ ) \{
vector bool int mask = vec_cmplt ( $\mathbf{a}, \mathbf{b}$ );
//If ( $\mathbf{a}<\mathbf{b}$ )...
vector signed int result = vec_sel( $\mathbf{a}, \mathbf{b}$, mask ); //Select a or b
return result;
\}


## Method 6 - Eliminate Branching

- Some compare instructions write their results to integer unit registers
- vec_any_xx() functions that return 1 if any element satisfies the test
- vec_all_xx() functions that return 1 if all of the elements in the vector satisfy the test
//Return true if the second and third floats in v are greater than $\mathbf{0 . 0}$
Boolean AreSecondAndThirdElementsPositive( vector float $\mathbf{v}$ )\{ vector unsigned int compare = (vector unsigned int)( QNaN, 0, 0, QNaN); return vec_all_nle( $\mathbf{v}$, (vector float) compare );
\}


All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004

## Method 7 - Look at Memory data rate

- Memory bandwidth is the natural (and actually desirable) limit of productivity for I/O based applications
- If it is a streaming application we are mainly talking about system bus throughput
- Maximum Actual 60x bus throughput is about $640 \mathrm{Mbytes} / \mathrm{sec} @ 100 \mathrm{MHz}$
- Maximum Actual MPX bus throughput is about $800 \mathrm{Mbytes} / \mathrm{sec} @ 100 \mathrm{MHz}$
- If we are reusing data already fetched, we are probably dealing with Cache hierarchy
- Approximate L1 cache scalar throughput is around 3.7 Gbytes/sec @ 1GHz
- AltiVec provides $4 x$ the bandwidth accessing it at 15 GBytes/sec @ 1GHz
- For the case when system bus is the bottleneck, we need to guarantee that it is used all the time
- In AltiVec it is achieved with Data Stream Touch instruction (dstx)
- For the cases when we mainly depend on Cache performance, we need to improve locality and eliminate unnecessary requests
- This is done in part with careful layout
- and dcba/dcbt instructions


## Method 7 - Look at Memory data rate

- AltiVec allows up to four prefetch streams, independent and asynchronous
- addressed by a two bit ID tag
- Vec_dst(a,b,c); where $a$ is initial address, $b$ is control constant, $c$ is ID tag

Data Stream Prefetch
Block Size $=0-32$ Vectors


Stride $= \pm 32 \mathrm{KBytes}$


## Method 7 - Look at Memory data rate

- One to four cache streams could be set using
- vec_dst (Data Stream Touch for load) and...
- vec_dstst (Vector Data Stream Touch for Store == load+store)
- Transient variants vec_dstt() and vec_dststt()
- Mark their blocks to be flushed straight to RAM instead of L2 and L3 caches
- Use vec_dst() and vec_dstt() just to read a block of data
- Use vec_dstst() and vec_dststt() to read and modify a block of data
- Do not use prefetch for write only
- All writes go into a Store Miss Merge Queue
- When making two adjacent vector stores to the same cache line, they could be merged into one cache line store, so no memory read is needed
- This is also more efficient then using dcbz (less an instruction)


## Method 7 - Look at Memory data rate

- Prefetch thread is a low priority process
- There are multiple events that can delay or stop prefetch
- Prefetch streams will also stop silently if they step on memory that is either unmapped or would cause a protection violation
- Prefetch engine could be shared by user and OS



## Method 7 - Look at Memory data rate

- To improve Cache dependent case, we might want to "help" it
- AltiVec is capable of 'hinting' Cache on future usage of data
- vec_IdI() and vec_stl() (Vector Load/Store Indexed LRU) mark the new cache blocks as those least recently used
- They will be the first to be flushed when more space in the cache is needed
- They also mark their cache blocks as transient, which means that they will be flushed directly to memory rather than take up space in the L2
- Next step is to use cache management instructions - dcba/dcbt/dcbz
- dcba (Data Cache Block Allocate) - allocates cache block without fetching it from memory
- this is telling Cache "I will soon store into this cache block"
- dcbt (Data Cache Block Touch) - fetches the cache block from memory
- this means "I will soon load from this location"
- dcbz (Data Cache Block Clear to Zero) - same as dcba, but zeroes out allocated block
- There are many more dcbx instructions for a variety of scenarios

All other product or service names are the
© Freescale Semiconductor, Inc. 2004

## Method 8 - Get rid of memory accesses...

- If you do not like it... do not do it!
- All the color conversions do the simple calculation for each color of output space using values of colors from input space
- cmyk_ycck_convert() from GNU GhostScript
- The equation looks like the following
- OutColor1 = C1 * InColor1 + C2 * InColor2 + C3 * InColor3
- where C1, C2 and C3 are the constants
- To speed up the calculation original GNU code uses the pre calculated table to exchange C1 * InColor1 with Table[InColor1]
- The same trick is used to limit range of the output value (to one byte length)
- AltiVec version calculate the output value "as is" using original equation and "mradd" instructions.
- It is match faster than memory byte access operations
- And we do such calculation for eight pixels simultaneously
- The result is $5 x$ Faster...


## Method 8 - Get rid of memory accesses...

- GNU optimized implementation with Lookup Table

```
while (--num_rows >= 0) {
    for (col = 0; col < num_cols; col++) {
                    r = MAXJSAMPLE - GETJSAMPLE(inptr[0]);
                    g = MAXJSAMPLE - GETJSAMPLE(inptr[1]);
                    b = MAXJSAMPLE - GETJSAMPLE(inptr[2]);
                            /* K passes through as-is */
                    outptr3[col] = inptr[3];
                            inptr += 4;
                            /* Y */
                            outptr0[col] = ((ctab[r+R_Y_OFF] + ctab[g+G_Y_OFF] + ctab[b+B_Y_OFF]) >> SCALEBITS);
                            /* Cb */
                            outptr1[col] = ((ctab[r+R_CB_OFF] + ctab[g+G_CB_OFF] + ctab[b+B_CB_OFF]) >> SCALEBITS);
                            /* Cr */
                            outptr2[col] = ((ctab[r+R_CR_OFF] + ctab[g+G_CR_OFF] + ctab[b+B_CR_OFF]) >> SCALEBITS);
    }
}
```


## Method 8 - Get rid of memory accesses...

```
while (--num_rows >= 0) {
    source0 = vec_ld(0, inptr);
    for (col = 0; col < num_cols; source0 = source4, col+=16) {
    /* Perform 16-bit arithmetic conversion of R,G,B to Y,Cb,Cr
    Y = 0.29900 * R + 0.58700 * G + 0.11400 *B
    Cb}=-0.16874*R-0.33126*G+0.50000*B + CENTERJSAMPLE
    Cr}=0.50000*R-0.41869*G-0.08131*B + CENTERJSAMPLE */ 
    y0 = vec_mradds(ss_ry, r0, vec_mradds(ss_gy, g0, vec_mradds(ss_by, b0, ss_zero)));
    y1 = vec_mradds(ss_ry, r1, vec_mradds(ss_gy, g1, vec_mradds(ss_by, b1, ss_zero)));
    cb0 = vec_mradds(ss_rcb, r0, vec_mradds(ss_gcb, g0, vec_mradds(ss_bcb, b0, ss_center)));
    cb1 = vec_mradds(ss_rcb, r1, vec_mradds(ss_gcb, g1, vec_mradds(ss_bcb, b1, ss_center)));
    cr0 = vec_mradds(ss_rcr, r0, vec_mradds(ss_gcr, g0, vec_mradds(ss_bcr, b0, ss_center)));
    cr1 = vec_mradds(ss_rcr, r1, vec_mradds(ss_gcr, g1, vec_mradds(ss_bcr, b1, ss_center)));
    /* Pack results into 8-bit format and store to non-interleaved data streams:
    output arrays are aligned, but the size of the image
    could be unaligned with vector size (num_cols % 16 != 0)! */
    y = vec_packsu(y0, y1);
    cb = vec_packsu(cb0, cb1);
    cr = vec_packsu(cr0, cr1);
    *outptr0++ = y;
    *outptr1++ = cb;
    *outptr2++ = cr;
    *outptr3++ = k;
    }
}
```


## Method 9 - Get rid of computations

- Well, the same idea, do not like it, don't do it...
- You have some control over where the bottleneck is, so move it around a bit
- Data slicing is a very effective technique to be used in AltiVec
- It is best explained with an example
- Let's consider Byte-wise Bit Reversal algorithm
- For each byte return bit reversal version of the input:

```
unsigned char reverse (unsigned char in){
    unsigned char out = ((in & 0x01)<<7)|
        ((in & 0x02) <<5) |
        ((in & 0x04)<<<)|
        ((in & 0x08)<<1)|
        ((in & 0x10) >>1)|
        ((in & 0x20) >>3)
        ((in & 0x40) >>5)
    ((in & 0x80) >>7);
    return out;
}
```

- This straightforward method yields 0.10 Bytes/Cycle


## Method 9 - Get rid of computations...

- Alternative implementation could be Big Lookup Table:
- 256 entry byte table holding the "reversed" values
- So, the computation for each byte is converted into a single "load"
- reversed[j] = big_lookup[in[i]];
unsigned char big_lookup[256] = \{
$0 \times 00,0 \times 80,0 \times 40,0 \times c 0,0 \times 20,0 \times 20,0 \times 60,0 \times e 0,0 \times 10,0 \times 90,0 \times 50,0 x d 0,0 \times 30,0 \times 60,0 \times 70,0 x f 0$,
$0 \times 08,0 \times 88,0 \times 48,0 \times c 8,0 \times 28,0 \times 28,0 \times 68,0 \times e 8,0 \times 18,0 \times 98,0 \times 58,0 \times d 8,0 \times 38,0 \times b 8,0 \times 78,0 \times 58$,
$0 \times 04,0 \times 84,0 \times 44,0 \times c 4,0 \times 24,0 \times 24,0 \times 64,0 \times 44,0 \times 14,0 \times 94,0 \times 54,0 \times d 4,0 \times 34,0 \times b 4,0 \times 74,0 \times f 4$,
$0 \times 0 \mathrm{c}, 0 \times 8 \mathrm{c}, 0 \times 4 \mathrm{c}, 0 \times \mathrm{cc}, 0 \times 2 \mathrm{c}, 0 \times \mathrm{xac}, 0 \times 6 \mathrm{c}, 0 \mathrm{xec}, 0 \times 1 \mathrm{c}, 0 \times 9 \mathrm{c}, 0 \times 5 \mathrm{c}, 0 \times \mathrm{xc}, 0 \times 3 \mathrm{c}, 0 \times b \mathrm{~b}, 0 \times 7 \mathrm{c}, 0 \times \mathrm{fc}$,
$0 \times 02,0 \times 82,0 \times 42,0 \times c 2,0 \times 22,0 \times 22,0 x 62,0 \times 22,0 \times 12,0 \times 92,0 \times 52,0 x d 2,0 \times 32,0 x b 2,0 \times 72,0 x f 2$,
$0 \times 0 a, 0 \times 8 a, 0 \times 4 a, 0 \times c a, 0 \times 2 a, 0 \times 2 a, 0 \times 6 a, 0 \times e a, 0 \times 1 a, 0 \times 9 a, 0 \times 5 a, 0 \times d a, 0 \times 3 a, 0 \times b a, 0 \times 7 a, 0 \times f a$,
$0 \times 06,0 \times 86,0 \times 46,0 \times c 6,0 \times 26,0 \times 26,0 \times 66,0 \times 66,0 \times 16,0 \times 96,0 \times 56,0 \times d 6,0 \times 36,0 \times 66,0 \times 76,0 \times f 6$,
$0 \times 0 \mathrm{e}, 0 \times 8 \mathrm{e}, 0 \times 4 \mathrm{e}, 0 \times \mathrm{ce}, 0 \times 2 \mathrm{e}, 0 \times \mathrm{xae}, 0 \times 6 \mathrm{e}, 0 \times e,, 0 \times 1 \mathrm{e}, 0 \times 9 \mathrm{e}, 0 \times 5 \mathrm{e}, 0 \times \mathrm{de}, 0 \times 3 \mathrm{e}, 0 \times \mathrm{be}, 0 \times 7 \mathrm{e}, 0 \times f e$,
$0 \times 01,0 \times 81,0 \times 41,0 \times c 1,0 \times 21,0 \times 21,0 \times 61,0 \times 1,0 \times 11,0 \times 91,0 \times 51,0 \times d 1,0 \times 31,0 \times b 1,0 \times 71,0 \times f 1$,
$0 \times 09,0 \times 89,0 \times 49,0 \times c 9,0 \times 29,0 \times 29,0 \times 69,0 \times 99,0 \times 19,0 \times 99,0 \times 59,0 \times d 9,0 \times 39,0 \times 69,0 \times 79,0 \times f 9$,
$0 \times 05,0 \times 85,0 \times 45,0 \times c 5,0 \times 25,0 \times 25,0 \times 65,0 \times 55,0 \times 15,0 \times 95,0 \times 55,0 \times d 5,0 \times 35,0 \times 65,0 \times 75,0 \times 55$,
0x0d,0x8d,0x4d,0xcd,0x2d,0xad,0x6d,0xed,0x1d,0x9d,0x5d,0xdd,0x3d,0xbd,0x7d,0xfd,
$0 \times 03,0 \times 83,0 \times 43,0 \times c 3,0 \times 23,0 \times 33,0 \times 63,0 \times e 3,0 \times 13,0 \times 93,0 \times 53,0 \times d 3,0 \times 33,0 \times 63,0 \times 73,0 \times f 3$,
0x0b,0x8b,0x4b,0xcb,0x2b,0xab,0x6b,0xeb,0x1b,0x9b,0x5b,0xdb,0x3b,0xbb,0x7b,0xfb,
0x07,0x87,0x47,0xc7,0x27, 0xa7,0x67,0xe7,0x17,0x97,0x57,0xd7,0x37,0xb7,0x77,0xf7,
$0 x 0 f, 0 \times 8 f, 0 x 4 f, 0 x c f, 0 \times 2 f, 0 x a f, 0 x 6 f, 0 x e f, 0 \times 1 f, 0 \times 9 f, 0 \times 5 f, 0 x d f, 0 x 3 f, 0 x b f, 0 x 7 f, 0 x f f$
\};
- This method yields 0.19 Bytes/Cycle or $2 x$ faster then original

All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2004

## Method 9 - Get rid of computations...

- Another method is Small Lookup Table
- based on splitting each byte into two nibbles
- looking up values for both of them independently, and merging result later

```
unsigned char small_lookup_l[16] __attribute__ ((aligned (16))) = {
    0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f
};
unsigned char small_lookup_h[16] __attribute__ ((aligned (16))) = {
    0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0
};
reversed[j] = small_lookup_l[(in[j]&0xf0)>>4] | small_lookup_h[(in[j]&0x0f)];
```

- This method uses less memory, but runs a bit slower: 0.11 Bytes/Cycle


## Method 9 - Get rid of computations

- The true advantages comes from observation that small tables will fit into two ALtiVec registers...
- ... and ALL lookups are completely independent, so 16 of them could be performed in parallel !!!

```
void reverse_vector(vector unsigned char *in,vector unsigned char *out, int num_elements){
    int i;
    vector unsigned char st_l, st_h;
    vector unsigned char four = vec_splat_u8(4);
    vector unsigned char v_in,vl,vh, v_out;
    st_l = vec_ld (0,(vector unsigned char *) small_lookup_l);
    st_h = vec_ld (0,(vector unsigned char *) small_lookup_h);
    for(i=0; i<num_elements; i+=16){
        v_in = vec_ld (i,in);
    vh = vec_sr(v_in,four);
    vh = vec_perm(st_l,st_l,vh);
    vl = vec_perm(st_h,st_h,v_in);
    v_out = vec_or(vh,vl);
    vec_st(v_out,i,out);
}
```

- This method for the same conditions gets
- 2.7 Bytes/Cycle
- It is $30 x$ faster then original Scalar
- ...and 15x faster then BigLookupTable

All other product or service names are
© Freescale Semiconductor, Inc. 2004

## Method 10 - Constant Generations

- Often a "standard" C declaration of an initialized variable is interpreted by compiler as declare + store...

```
vector signed int zero_vec_32 = { 0,0,0,0 };
vector float
zero_vec_fp = { 0.0,0.0,0.0,0.0 };
```

- In this case it is much more practical to generate these constants

```
vector signed int zero_vec_32 = vec_splat_s32(0);
vector float zero_vec_fp = vec_ctf( vec_splat_u32(0), 0 ));
    // Vector Convert from Fixed-Point Word
```

- These are trivial cases, but what about "fancy" constants...

```
vector float vec_neg_zero( void ){ //Generate a vector full of -0.0.
    vector unsigned int result = vec_splat_u32(-1); // Vector Splat
    return (vector float ) vec_sl( result, result ); // Vector Shift Left
}
```

All other product or service names are the property of their respective owners
© Freescale Semiconductor, Inc. 2004

## Method 10 - Constant Generations

- A very common reason for constant generation is their use for vector permute instruction



## vector unsigned char <br> vectorLoadUnaligned( vector unsigned char *v )\{

vector unsigned char permuteVector $=$
vec_lvsl( $0, \mathrm{v}$ );
vector unsigned char low $=$ vec_ld $\mathbf{0 , v}$ );
vector unsigned char high = vec_ld( $16, \mathrm{v}$ );
return vec_perm( low, high, permuteVector ); \}
$\mathbf{d}=$ vec_lvsl $(\mathbf{a}, \mathbf{b})$
$\mathrm{EA} \leftarrow \mathbf{a}+\mathbf{b}$
$\mathrm{sh} \leftarrow \mathrm{EA}[28: 31]$
$\mathrm{sh} \leftarrow \mathrm{EA}[28: 31]$
if sh $=0 \times 0$ then $d \leftarrow 0 \times 000102030405060708090$ AOBOCODOEOF if $\mathrm{sh}=0 \times 1$ then $\mathbf{d} \leftarrow 0 \times 0102030405060708090$ AOBOCODOEOF10 if $\mathrm{sh}=0 \times 2$ then $\mathbf{d} \leftarrow 0 \times 02030405060708090$ AOBOCODOEOF1011 if $\mathrm{sh}=0 \times 3$ then $\mathrm{d} \leftarrow 0 \times 030405060708090$ AOBOCODOEOF101112 if $\mathrm{sh}=0 \times 4$ then $\mathbf{d} \leftarrow 0 \times 0405060708090$ AOB0CODOEOF10111213 if sh $=0 \times 5$ then $\mathbf{d} \leftarrow 0 \times 05060708090$ A0B0CODOEOF1011121314 if $\mathrm{sh}=0 \times 6$ then $\mathbf{d} \leftarrow 0 \times 060708090$ A0B0CODOE0F101112131415 if sh $=0 \times 7$ then $\mathbf{d} \leftarrow 0 \times 0708090$ A0B0COD0E0F10111213141516 if sh $=0 \times 8$ then $\mathbf{d} \leftarrow 0 \times 08090$ A0B0C0D0E0F1011121314151617 if sh $=0 \times 9$ then $\mathbf{d} \leftarrow 0 \times 090$ A0B0C0D0E0F101112131415161718 if sh $=0 \times \mathrm{A}$ then $\mathbf{d} \leftarrow 0 \times 0 \mathrm{~A}$ OBOCODOEOF10111213141516171819 if sh $=0 \times B$ then $\mathbf{d} \leftarrow 0 \times 0$ B0CODOE0F101112131415161718191A if $\mathrm{sh}=0 \mathrm{xC}$ then $\mathbf{d} \leftarrow 0 \times 0 \mathrm{C} 0$ D0EOF101112131415161718191A1B if $\mathrm{sh}=0 \times \mathrm{D}$ then $\mathbf{d} \leftarrow 0 \times 0 \mathrm{D} 0 \mathrm{E} 0 \mathrm{~F} 101112131415161718191 \mathrm{~A} 1 \mathrm{~B} 1 \mathrm{C}$ if $\mathrm{sh}=0 \times \mathrm{E}$ then $\mathrm{d} \leftarrow 0 \times 0 \mathrm{E} 0 \mathrm{~F} 101112131415161718191 \mathrm{~A} 1 \mathrm{~B} 1 \mathrm{C} 1 \mathrm{D}$ if $\mathrm{sh}=0 \times \mathrm{F}$ then $\mathrm{d} \leftarrow 0 \times 0 \mathrm{~F} 101112131415161718191 \mathrm{~A} 1 \mathrm{~B} 1 \mathrm{C} 1 \mathrm{D} 1 \mathrm{E}$

## $\mathbf{d}=\mathrm{vec} \_\operatorname{lvsr}(\mathbf{a}, \mathbf{b})$


$\mathrm{sh} \leftarrow \operatorname{EA}[28: 31]$
if sh=0x0 then $\mathbf{d} \leftarrow 0 \times 101112131415161718191$ A1B1C1D1E1F
if $\mathrm{sh}=0 \times 1$ then $\mathrm{d} \leftarrow 0 \times 0 \mathrm{~F}$. $\leftarrow 01112131415161718191 \mathrm{~A} 1 \mathrm{~B} 1 \mathrm{C} 1 \mathrm{D} 1 \mathrm{E}$
if sh=0x1 then $d \leftarrow 0 \times 0$ F101112131415161718191A1B1C1D1E
if sh=0x2 then $d \leftarrow 0 \times 0$ E0F101112131415161718191A1B1C1D
if $\mathrm{sh}=0 \times 2$ then $\mathrm{d} \leftarrow 0 \times 0$ ex0F10F1112131415161718191A1B1C1D
if sh=0x4 then $d \leftarrow 0 \times 0$ CODOEOF101112131415161718191A1B
if sh=0x4 then $\mathbf{d} \leftarrow 0 \times 0$ then $d \leftarrow 0 \times 0$ B0CODOEOF101112131415161718191A
if sh=0x6 then $\mathbf{d} \leftarrow 0 \times 0$ AOBOCODOEOF10111213141516171819
if sh=0x7 then $\mathbf{d} \leftarrow 0 \times 090$ AOBOCODOE0F101112131415161718
if $\mathrm{sh}=0 \times 8$ then $\mathrm{d} \leftarrow 0 \times 08090$ AOBOCODOE0F1011121314151617
if sh=0x9 then $\mathbf{d} \leftarrow 0 \times 0708090$ A0BOCODOE0F10111213141516
if sh=0xA then $\mathbf{d} \leftarrow 0 \times 060708090$ A0B0CODOEOF101112131415
if sh=0xB then $\mathbf{d} \leftarrow 0 \times 05060708090$ A0B0CODOE0F1011121314
if sh=0xC then $d \leftarrow 0 \times 0405060708090$ A0BOCODOE0F10111213
if sh=0xD then $\mathbf{d} \leftarrow 0 \times 030405060708090$ AOBOCODOE0F101112
if sh=0xE then $\mathbf{d} \leftarrow 0 \times 02030405060708090$ A0BOCODOE0F1011
if sh=0xF then $\mathbf{d} \leftarrow 0 \times 0102030405060708090$ AOBOCODOEOF10

## Method 10 - Constant Generations

- But there are number of cases that are much more complex...
vector unsigned char vec_perm1 =
(vector unsigned char)\{0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23\};
vector unsigned char $\mathrm{a}=$ vec_lvsl( 0,0 );
vector unsigned char $b=$ vec_lvsr( 0,0 ); vector unsigned char vec_perm1 =
(vector unsigned char)vec_mergeh((vector unsigned int)a, (vector unsigned int)b);



## Method 10 - Constant Generations



00: (1) splat 00
01: (1) splat 01
10: (2) splat 08, add self
11: (3) splat 02, splat 0F, add together
12: (2) splat 09 , add self

3D: (3) splat F4, splat 06, rol by
3E: (2) splat FA, srl self
3F: (3) splat F3, splat 04, rol by
B1: (4) splat F6, rol self; splat F4; add together
B2: (4) splat F6, rol self; splat F5; add together
B3: (3) splat $0 B$, rol self, subtract

FE: (1) splat FE
FF: (1) splat FF
32 sequences consist of 1 instruction
44 sequences consist of 2 instructions
152 sequences consist of 3 instructions 28 sequences consist of 4 instructions
Many thanks to Holger Bettag

## Put it all together

- Step back and take a 10,000 foot view
- There is a logical sequence to be observed in implementation of these methods...
- One can look at the optimization process as on moving the bottleneck around the processor -
- if computation takes longer then anything else - speed them up
- if system bus is underutilized - use prefetching
- if bus is $100 \%$ full, computations are at the minimum... reduce the code and data size?
- But the truly superior goal is to reach computational entropy -
- get rid of all the unnecessary computations through algorithm modifications
- and balance added memory bandwidth with real data I/O
- use predictability of the data streams to the full extent
- Concentrate your effort, in large applications work with $10 \%$ of the code which accounts for $90 \%$ of execution time


## General Coding Strategy

- Use Vector algorithms
- Aim for high throughput
- Align your Data
- 16 bytes
- Never hurts scalar code
- Keep all data in close proximity
- Helps to improve memory performance
- Try not to mix different data types in the same vector
- Do more work
- On a cold cache assume having 40 cycles for each 32byte chunk of data
- You are likely to achieve TOP performance when processing time exactly equal to the fetch time
- Use prefetch and 'hinting' instructions



## More ways...



Freescale ${ }^{T M}$ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners.
All other product or service names are ther,
© Freescale Semiconductor, Inc. 2004

## AltiVec Library Offering

- Telecomm
- FFT/IFFT, FIR, Autocorrelation, Convolution Encoder/Viterbi Decoder (GSM,3G), Error Correction Codes (CRC 8,12,16,24)
- Voice Over IP (G723, G729) elements
- MultiMedia
- DCT/IDCT, MPEG2, MPEG4, H.26x, AC3, MP3, JPEGIJPEG 2000, Quantization/Dequantization, SAD
- Voice Recognition, Pattern Recognition,
- Printer
- GhostScript Library elements, Color Management routines, Color Conversion (RGB to YCbCr), Scaling/Rotation, Filtering routines, FS Dithering
- Networking
- OSPF, QOS, NAT, Route Lookup, IP Reassembly, TCP/IP,
- Encryption (AES, DES, 3DES, MD5, SHA, RSA, Kasumi)
- Wireless network (802.11), LZO
- LibC (means could be "Linked" at compilation)
- Link level support for standard C functions (memcpy, strcmp etc.)
- Mathematical primitives (Extension of LibC+)
- Matrix math, LargeNumber Lib
- Math.h - Log, Exp, Sin, Cos, Sqrt
- OS enablement
- Linux (TCP/IP),
- VxWorks elements


## Legend

Green color - available on the AV web site now (motorola.com/altivec)
Orange color - available upon request
Light blue - available soon

## To summarize:

- AltiVec $^{\text {TM }}$ Technology transparently adds SIMD functionality to a high speed RISC engine
- AltiVec enables a broad range of embedded and computing applications
- C level programming offers certain level of comfort while providing powerful way to extract parallelism from applications
- You must think in terms of Vector Processing throughout design cycle of an application
- AltiVec is not a pixie dust to be sprinkled on an existing code
- Given that $-2 x-4 x-11 x$ speedup is possible



## MPC7400



## MPC7450



## 60x vs. MPX

## - Bandwidth

| PLATFORMS | DESCRIPTION | LATENCY | BANDWDTH |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  | THEORETICAL | MEASURED |  |
|  |  |  |  | SW | LA |
| EximerMaximer | FPGA 60x SRAM, 100MHz | 2 | $460 \mathrm{MB} / \mathrm{sec}$ |  |  |
| Sandpoint | 60x, SDRAM, 1 level deep pipelining, 101MHz | 8 | $640 \mathrm{MB} / \mathrm{sec}$ | $427 \mathrm{MB} / \mathrm{sec}$ | $N / A$ |
| M MP | 60x pipelined, SDRAM, 100 MHz | 19 | E40 MB/sec | $591 \mathrm{MB} / \mathrm{sec}$ | $589 \mathrm{MB} / \mathrm{sec}$ |
| M VP | MPX pipelined, SDRAM, 100 MHz | 19 | $800 \mathrm{MB} / \mathrm{sec}$ | $593 \mathrm{MB} / \mathrm{sec}$ | $590 \mathrm{MB} / \mathrm{sec}$ |
| Ideal | MPX, SDRAM, data streaming, data intervention, out of order, 100 MHz | 6-8 | $800 \mathrm{ME} / \mathrm{sec}$ | N/A | $\mathrm{N} / \mathrm{A}$ |
| Simple Memory Controller | 60x non-pipelined, SDRAM, 100 MHz | 9 | $320 \mathrm{ME} / \mathrm{sec}$ | $178 \mathrm{ME} / \mathrm{sec}$ | $177 \mathrm{MB} / \mathrm{sec}$ |

- PCI I/O

| PLATFORMS | DESCRIPTION | LATENCY | BAND WIDTH |
| :---: | :---: | :---: | :---: |
| Sandpoint | 60x, SDRAM, 1 level deep pipelining, 100 WHz | 61 | $290 \mathrm{ME} / \mathrm{sec}$ |
| MVP | MPX pipelined, SDRAM, 100 MHz | 267 | 172 MEv ec |

