1. Parallel computation

The first version used three nested for loops, assigning Z_add and Z_grad element by element. It was extremely slow. I then replaced the innermost for with ArrayFire's parallel gfor, as follows:

for (size_t iloop = 0; iloop < K; ++iloop){
  for (size_t jloop = 0; jloop < T; ++jloop){
    absinput_after_blur(iloop,jloop,af::span,af::span) = absinput(iloop,jloop,af::span,af::span);

    // gfor (af::seq ploop, std::max(iloop-int(m_p_j),0), std::min(iloop+int(m_p_j),K-1)){
    gfor (af::seq ploop, K){
      auto m_p_j   = af::moddims(m(ploop,jloop,0,0),K); // dim of K*1*1*1
      auto m_floor = af::floor(m_p_j);                  // dim of K*1*1*1

      auto sum_m_p_j = m_floor*(2*m_p_j - m_floor - 1) + m_p_j; // dim of K*1*1*1
      auto sum_mpj_partial_to_mpj = 2*m_p_j;                    // dim of K*1*1*1

      // Only the part where abs(ploop - iloop) < m_p_j contributes.
      auto condition1 = (af::abs(ploop - iloop) < m_p_j);
      auto condition2 = ((ploop - iloop) == 0);

      auto abs_p_j = af::moddims(absinput(ploop,jloop,0,0),K); // dim of K*1*1*1

      auto Z_add_pji = condition1.as(f32) *
          ((!condition2).as(f32) * (abs_p_j*(m_p_j - af::abs(iloop - ploop))/sum_m_p_j)
          +  condition2.as(f32)  * (abs_p_j*(m_p_j - sum_m_p_j)/sum_m_p_j));
      auto Z_grad_pji = condition1.as(f32) *
          ((!condition2).as(f32) * (abs_p_j*(sum_m_p_j - sum_mpj_partial_to_mpj*(m_p_j - af::abs(iloop - ploop)))/(sum_m_p_j*sum_m_p_j))
          +  condition2.as(f32)  * (abs_p_j*((1 - sum_mpj_partial_to_mpj)*sum_m_p_j - sum_mpj_partial_to_mpj*(m_p_j - sum_m_p_j))/(sum_m_p_j*sum_m_p_j)));

      Z_add(ploop,jloop,iloop,af::span) = Z_add_pji;
      Z_grad(ploop,jloop,iloop,af::span) = Z_grad_pji;
    }

    absinput_after_blur(iloop,jloop,af::span) += af::sum(Z_add(af::span,jloop,iloop),0);
  }
  // printf("i=%d\n",iloop);
}

But it was still far too slow, about 10 seconds per epoch, which was unacceptable.

The purpose of this code is to generate two [K T K] tensors from two [K T] matrices, with nontrivial arithmetic and conditional logic interleaved, and each output point depends on an entire column of the input. So my first idea for speeding it up was to view the whole process as a single convolution or erosion, but I could not find a good way to do it.
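Written out explicitly (a reconstruction from the gfor body above, writing S(p,j) for sum_m_p_j), the element being computed is:

S(p,j) = floor(m(p,j))·(2·m(p,j) − floor(m(p,j)) − 1) + m(p,j)

Z_add(p,j,i) = absinput(p,j)·(m(p,j) − |i − p|)/S(p,j)   if p ≠ i and |i − p| < m(p,j)
Z_add(p,j,i) = absinput(p,j)·(m(p,j) − S(p,j))/S(p,j)    if p = i (and m(p,j) > 0)
Z_add(p,j,i) = 0                                         otherwise

Z_grad(p,j,i) is the corresponding ∂Z_add(p,j,i)/∂m(p,j), obtained by the quotient rule with the code's ∂S/∂m = 2·m(p,j) (sum_mpj_partial_to_mpj).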

The final solution was this: first use tile to bring the source matrices to the common [K T K] shape, then materialize the loop indices iloop and ploop as index matrices as well, encode the equality and bounds tests as 0/1 mask matrices, and finally express the loops and conditionals as elementwise multiplications. As follows:

absinput_after_blur(af::span,af::span,af::span,af::span) = absinput(af::span,af::span,af::span,af::span);

af::array MTiled = af::tile(m, af::dim4(1, 1, K));
af::array absTiled = af::tile(absinput, af::dim4(1, 1, K));
af::array iloop = af::range(af::dim4(K, T, K), 2);
af::array ploop = af::range(af::dim4(K, T, K), 0);

af::array i_e_p = (iloop == ploop);
af::array cond  = af::abs(iloop - ploop) < MTiled;

auto m_floor = af::floor(MTiled);
auto sum_m_p_j = m_floor*(2*MTiled - m_floor - 1) + MTiled;
auto sum_mpj_partial_to_mpj = 2*MTiled;

af::array f1_1 = absTiled*(MTiled - af::abs(iloop - ploop))/sum_m_p_j; // i != p, add
af::array f1_2 = absTiled*(sum_m_p_j - sum_mpj_partial_to_mpj*(MTiled - af::abs(iloop - ploop)))/(sum_m_p_j*sum_m_p_j); // i != p, grad

af::array f2_1 = absTiled*(MTiled - sum_m_p_j)/sum_m_p_j; // i == p, add
af::array f2_2 = absTiled*((1 - sum_mpj_partial_to_mpj)*sum_m_p_j - sum_mpj_partial_to_mpj*(MTiled - sum_m_p_j))/(sum_m_p_j*sum_m_p_j); // i == p, grad

Z_add = cond * (i_e_p * f2_1 + (1 - i_e_p) * f1_1);
Z_grad = cond * (i_e_p * f2_2 + (1 - i_e_p) * f1_2);
        
absinput_after_blur += af::transpose(af::moddims(af::sum(Z_add,0), af::dim4(T, K, 1, 1)));

Tip: in general, stacking with tile does not consume additional memory.
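As a sanity check for the mask formulation, here is a minimal self-contained sketch (the toy sizes K = 4, T = 3 and the stand-in data are my assumptions, not the original training setup) that compares the tiled, masked Z_add against a naive host-side triple loop:

#include <arrayfire.h>
#include <cmath>
#include <cstdlib>
#include <cstdio>
#include <vector>

int main() {
    const int K = 4, T = 3;
    af::array m = af::constant(2.5, K, T);   // stand-in for the learned m
    af::array absinput = af::randu(K, T);    // stand-in for the [K T] input

    // Vectorized path, same recipe as above.
    af::array MTiled   = af::tile(m, af::dim4(1, 1, K));
    af::array absTiled = af::tile(absinput, af::dim4(1, 1, K));
    af::array iloop = af::range(af::dim4(K, T, K), 2);
    af::array ploop = af::range(af::dim4(K, T, K), 0);
    af::array i_e_p = (iloop == ploop);
    af::array cond  = af::abs(iloop - ploop) < MTiled;

    af::array m_floor   = af::floor(MTiled);
    af::array sum_m_p_j = m_floor*(2*MTiled - m_floor - 1) + MTiled;
    af::array f1_1 = absTiled*(MTiled - af::abs(iloop - ploop))/sum_m_p_j;
    af::array f2_1 = absTiled*(MTiled - sum_m_p_j)/sum_m_p_j;
    af::array Z_add = cond * (i_e_p * f2_1 + (1 - i_e_p) * f1_1);

    // Naive reference: the triple loop, evaluated on the host.
    std::vector<float> mh(K * T), ah(K * T), zh(K * T * K, 0.0f);
    m.host(mh.data());
    absinput.host(ah.data());
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < T; ++j)
            for (int p = 0; p < K; ++p) {
                float mp = mh[p + j * K];            // m(p, j)
                float mf = std::floor(mp);
                float S  = mf*(2*mp - mf - 1) + mp;  // sum_m_p_j
                if (std::abs(p - i) < mp)            // condition1
                    zh[p + j*K + i*K*T] = (p == i)
                        ? ah[p + j*K] * (mp - S) / S
                        : ah[p + j*K] * (mp - std::abs(p - i)) / S;
            }
    af::array Z_ref(K, T, K, zh.data());

    printf("max |Z_add - Z_ref| = %g\n",
           af::max<float>(af::abs(Z_add - Z_ref)));
    return 0;
}

The printed max-abs difference should sit at float rounding level; the same comparison works for Z_grad.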

2. Gradient memory explosion

The first version followed the theoretical derivation directly and declared a gradient tensor of shape [K T T K]. Even stored as float, this tensor works out to more than 500 GB of memory, which is clearly infeasible. Logically, though, the gradient of the loss with respect to m(i,j) equals the sum over every point newfft(p,k) of the intermediate term newfft, applying the chain rule point by point: ∑_{p,k} ∂myloss/∂newfft(p,k) · ∂newfft(p,k)/∂m(i,j). As follows:

mGrad = af::transpose(af::moddims(af::sum(af::tile(xGrad, af::dim4(1, 1, K)) * Z_grad, 0), af::dim4(T, K, 1, 1)));
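For reference, the shapes of the intermediates in this one-liner, read off the calls above (xGrad holds ∂myloss/∂newfft and has shape [K T]):

// af::tile(xGrad, af::dim4(1, 1, K))     -> [K T K]  broadcast over the i axis
// ... * Z_grad                           -> [K T K]  chain-rule product, one term per (p, j, i)
// af::sum(..., 0)                        -> [1 T K]  sum over p
// af::moddims(..., af::dim4(T, K, 1, 1)) -> [T K]
// af::transpose(...)                     -> [K T]    = mGrad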

With all of the above optimizations, it now runs at roughly 100 epochs per second.