#### Parallel AutoClass

Kevin R Keane CSE 633 Spring 2009

#### **Given Observed Data**



#### Estimate Class Distribution Parameters

$$\mu_{j,k} = \frac{\sum_{i=1}^{N} w_{i,j} x_{i,k}}{\sum_{i=1}^{N} w_{i,j}}$$

$$\sigma_{j,k}^2 = \frac{\sum_{i=1}^N w_{i,j} (x_{i,k} - \mu_{j,k})^2}{\sum_{i=1}^N w_{i,j}}$$

# Original update\_parameters()

```
void update parameters( clsf DS clsf){
   11 ...
   for (n cl=0; n cl<n classes; n cl++) {</pre>
      cl = classes[n cl];
      update_params_fn(cl, n_classes, database, collect);
   }
 // ...
}
void update params fn( class DS class, int n classes,
  database DS data base, int collect){
  int i, n atts;
  tparm DS tparm;
  class->pi j = (class->w j + (1.0 / n classes)) / (data base->n data + 1.0);
  class->log pi j = (float) safe log((double) class->pi j);
  n atts = data base->n atts;
  for (i=0; i<n atts; i++) {</pre>
    tparm=class->tparms[i];
  11 ...
  }
```



### New update\_parameters()

```
int bx=16, by=16, gx, gy;
cudaError_t err;
gx = (n_classes -1)/ bx + 1;
gy = (n_atts -1)/by + 1;
dim3 dimBlock(bx, by);
dim3 dimGrid(gx, gy);
update params fn<<< dimGrid, dimBlock >>>
```

```
(d_classes, n_classes, d_database, collect);
```

#### New - inside the former loops

```
global void update params fn( class DS *classes, int n classes,
                database DS data base, int collect)
{
 int i, j;
 tparm DS tparm;
 int ix = blockDim.x * blockIdx.x + threadIdx.x;
 int iy = blockDim.y * blockIdx.y + threadIdx.y;
 if(!(ix < n classes && iy < classes[0]->model->n terms))return;
 j = ix; // class
 i = iy; // attribute
 class DS xclass = classes[j];
 // this looks like trouble - lots of threads updating common pi j / log pi j
 if(iy==0){
         xclass->pi j = (xclass->w j + (1.0 / n classes)) / (data base->n data + 1.0);
          xclass->log pi j = (float) safe log((double) xclass->pi j);
  }
```

```
// allocate storage for classes on device
class DS d classes buffer, *d classes;
cudaMalloc((void**)&d classes buffer, n classes*sizeof(d classes buffer[0]));
cudaMalloc((void**)&d classes, n classes*sizeof(d classes[0]));
for(i=0;i<n classes;i++){</pre>
        d classes[i] = d classes buffer + i;
        cudaMemcpy(d classes[i],classes[i],
          sizeof(d_classes_buffer[0]), cudaMemcpyHostToDevice);
}
// allocate storage for db on device - MOVE THIS - needed only once
float *d data buffer, **d data;
cudaMalloc((void**)&d_data_buffer, n_data*n_atts*sizeof(d_data_buffer[0]));
cudaMalloc((void**)&d data, n data*sizeof(d data[0]));
for(i=0;i<n data;i++){</pre>
        d data[i] = d data buffer + i*n atts;
        cudaMemcpy(d_data[i],data[i],
          sizeof(d data buffer[0])*n atts, cudaMemcpyHostToDevice);
}
database DS d database;
cudaMalloc((void**)&d database, sizeof(d database[0]));
d database->n data = n data;
d database->n atts = n atts;
d database->data = d data;
// allocate space for wts
float *d wts buffer, **d wts;
cudaMalloc((void**)&d wts buffer, n data*n classes*sizeof(d wts buffer[0]));
cudaMalloc((void**)&d_wts, n_classes*sizeof(d_wts[0]));
for(i=0;i<n classes;i++){</pre>
        d wts[i] = d wts buffer + i*n data;
        d classes[i]->wts = d wts[i];
        cudaMemcpy(d classes[i]->wts,classes[i]->wts,
          sizeof(d wts buffer[0])*n data, cudaMemcpyHostToDevice);
}
int bx=16,by=16,gx,gy;
cudaError t err;
gx = (n classes -1)/bx + 1;
gy = (n_{atts} - 1)/by + 1;
dim3 dimBlock(bx,by);
dim3 dimGrid(gx,gy);
update params fn<<< dimGrid, dimBlock >>>
  (d classes, n classes, d database, collect);
```

#### Estimate Class Membership Weights

$$p(x_i|i \in C_j, \theta_j) = \prod_{k=1}^K \pi_j p(x_{i,k}|i \in C_j, \theta_{j,k})$$

$$w_i = \frac{\pi_j p(x_i | i \in C_j, \theta_j)}{\sum_{j=1}^J \pi_j p(x_i | i \in C_j, \theta_j)}$$



## So much code ... so little time the speed bumps

- It took *8.5 seconds* to push hello world to the device.
  - After that, hellos arrived 12/millisecond.
- Apparently *try* and *class* are keywords in C++.
- The NVIDIA compiler, nvcc compiles the \*.cu files in C++ mode. So, for the C portion of the program to link properly, you need to wrap headers: extern "C" { int myCoolDemo(int argc,char \*\*argv);

## So much code ... so little time the speed bumps

- 3-D dimensions have modest limits in the third dimension.
- Maximum dimensions for a *block* 512, 512, and 64 for the x, y, and z
- A grid is at most two dimensional
  - Blocks can be arrayed in two dimensional grids to large size.
  - Third dimension is limited to 64. So, need to pick a dimension to be relegated to the 64 count limit.
- Perhaps that's ok for the class dimension, and then again maybe not.

## So much code ... so little time the speed bumps

- There is not an intrinsic *all-reduce* function in CUDA
  - Required along the summation dimensions for parallel code
- Implementation
  - thread per each class / attribute combination for parameter estimation
  - Thread per datum for weight estimation
  - Summation and product dimensions handled serially
- Not so bad ...
  - #classes x #attributes frequently 50-100 or more
  - Number of observations typically large, 1000 100,000+
- Significant parallel speedup still possible

## Optimizing reduction code provides further opportunity for speedup

|                                                                 | Time (2 <sup>22</sup> ints) | Bandwidth   | Step<br>Speedup | Cumulative<br>Speedup |
|-----------------------------------------------------------------|-----------------------------|-------------|-----------------|-----------------------|
| Kernel 1:<br>interleaved addressing<br>with divergent branching | 8.054 ms                    | 2.083 GB/s  |                 |                       |
| Kernel 2:<br>interleaved addressing<br>with bank conflicts      | 3.456 ms                    | 4.854 GB/s  | 2.33x           | 2.33x                 |
| Kernel 3:<br>sequential addressing                              | 1.722 ms                    | 9.741 GB/s  | 2.01x           | 4.68x                 |
| Kernel 4:<br>first add during global load                       | 0.965 ms                    | 17.377 GB/s | 1.78x           | 8.34x                 |
| Kernel 5:<br>unroll last warp                                   | 0.536 ms                    | 31.289 GB/s | 1.8x            | 15.01x                |
| Kernel 6:<br>completely unrolled                                | 0.381 ms                    | 43.996 GB/s | 1.41x           | 21.16x                |
| Kernel 7:<br>multiple elements per thread                       | 0.268 ms                    | 62.671 GB/s | 1.42x           | 30.04x                |

Kernel 7 on 32M elements: 73 GB/s!

Source: developer.download.nvidia.com

**^** . . . . .

### Optimizing reduction code provides further opportunity for speedup



#### Source: developer.download.nvidia.com

### Lessons learned ...

- Porting legacy code is not pretty ...
  - Loops spread widely across functions
  - Data structures not compactly allocated
    - Copy & ship is a pain
- Probably best to design software directly to take advantage of architecture
- On the other hand
  - Software per architecture is probably a bad idea
  - Coding is frequently the bottleneck