Thoughts & Views: Distributed Databases

這個演算法算是在去唸書期間，以第一學期來說，算是最麻煩的，即使公式都給了。這門課是講關於分散式系統資料庫，對我來說跟天書沒什麼兩樣，因為那時連資料庫碰都沒碰過，一點概念都沒有。而且現在才紀錄下來好像太晚了點.....都忘了差不多了XDD

總而言之，這門課的第一作業就是叫我們實現Bond Energy Algorithm。在講它之前，有必要先講一些分散式資料庫的簡單觀念說一下。

分散式資料庫是一群資料庫系統(或是所謂的節點)的集合，它們都是透過網路來連結，而且彼此互相之間是有關聯的。它被應用在非常廣的範圍，像是Google 就是用分散式系統管理資料庫，不然一個資料點掛了就全部陣亡。舉例來說，小明在台北市查找一本書。實際上這連線是不止會傳回台北所在的資料，也會向其它縣市的資料庫進行查尋(如果沒有或是全部查找)。

在分散式資料庫系統中，我們通常須將存放資料的表格進行分割，並放在多個資料庫節點上。而這分割又被分為Horizontal / Vertical fragmentation。而BEA就是在設計Vertical Fragmentation Algorithm時，基於Attribute affinity values，找出一組Attirbutes的關聯的意義。

BEA的演算法公式如下 :

/* Bond Energy algorithm
        input: AA (Attribute Affinity matrix)
        output: CA (Clustered Affinity matrix)
     */

     //copy old AA matrix to new AA in order to avoid index mistake
     AAnew = new double[MATRIX_SIZE+1][MATRIX_SIZE+1];
     CA = new double[MATRIX_SIZE+2][MATRIX_SIZE+2];
     tempColumn = new double[MATRIX_SIZE+1];

     for(int i = 1; i < MATRIX_SIZE+1; i++) {
       for(int j = 1; j < MATRIX_SIZE+1; j++) {
         AAnew[i][j] = AA[i-1][j-1];
        // System.out.println("AAnew["+i+"]["+j+"] = "+AAnew[i][j]);
       }
     }

     //fill columns and rows
     for(int i = 0; i < MATRIX_SIZE+1; i++) {
       CA[0][i] = i;
      for(int j = 0; j < MATRIX_SIZE+1; j++) {
         CA[j][0] = j;
       }
     }

     for(int i = 0; i < MATRIX_SIZE+1; i++) {
       AAnew[0][i] = i;
       for(int j = 0; j < MATRIX_SIZE+1; j++) {
         AAnew[j][0] = j;
       }
     }


     // Copy the first and second columns from AA
     for(int i = 1; i < MATRIX_SIZE+1; i++) {
       CA[i][1] = AAnew[i][1];
       CA[i][2] = AAnew[i][2];
     }

     index = 3; // start from third index
     RightMostIndex = 2; //current rightmost index of AA.
     while(index <= MATRIX_SIZE) {
       double contrib = 0;
       double maxContrib = 0;
       //System.out.println("------ index = "+index);

       for(int i = 1; i < index; i++) {
          contrib = cont(CA[0][i-1], index, CA[0][i], AAnew);
          //if found the highest contribution, record it
          if(contrib > maxContrib) {
            maxContrib = contrib;
            record(CA[0][i-1], index, CA[0][i]);
          }
        }
        contrib = cont(CA[0][index-1], index, index+1, AAnew);

        if(contrib > maxContrib) {
          maxContrib = contrib;
          record(CA[0][index-1], index, index+1);
        }
       /* place the highest contribution */
       CAplacement();
       index++;
     }
     CArowplace();
     CAprint();
   }

   /* This function is to order rows according the placement of columns */
   public static void CArowplace() {
     //double[][] temp;
     CAfinal =  new double[MATRIX_SIZE+2][MATRIX_SIZE+2];

     for(int i = 0; i <= MATRIX_SIZE; i++) {
       double row = CA[0][i];
       CAfinal[0][i] = row;

       for(int j = 0; j <= MATRIX_SIZE; j++) {
         //temp[i][j] = CA[(int)row][j];
         //System.out.println("row ="+(int)row);
         CAfinal[i][j] = CA[(int)row][j];
       }
     }
   }

看起來雖然滿複雜，但實現這公式是滿簡單的。麻煩的是要算出AA matrix。當時老師不打算用課本上的演算法，改用一個叫做 Cosine-based similarity 的算法。所以AA matrix 的算法變成如下 :

/* Start - Attribute Affinity algorithm */
     AA = new double[MATRIX_SIZE][MATRIX_SIZE];
     DecimalFormat df = new DecimalFormat("0.0000");

     for(int i = 0; i < att.size(); i++) {
       for(int j = 0; j < att.size(); j++) {
         double Aik = 0;
         double Ajk = 0;
         double AffNumerator = 0;
         double AffDenominator = 0;
         double temp = 0;
         double temp1 = 0;
         double TempSqrt = 0;
         double Temp1Sqrt = 0;
         for(int k = 0; k < query.size(); k++) {
           double SumAcc = 0;
           /* sum all accMatrix then multiply use(qk,Ai)*/
           for(int site = 0; site < 3; site++) {
             SumAcc = SumAcc + accMatrix[k][site];
           }
           Aik = beQueried[k][i] * SumAcc;
           Ajk = beQueried[k][j] * SumAcc;

           AffNumerator = AffNumerator + (Aik*Ajk);
           temp = temp + Math.pow(Aik,2);
           temp1 = temp1 + Math.pow(Ajk,2);
         }
         TempSqrt = Math.sqrt(temp);
         Temp1Sqrt = Math.sqrt(temp1);
         AffDenominator = TempSqrt * Temp1Sqrt;
         //System.out.println("Affn = "+AffNumerator);
         //System.out.println("Affd = "+AffDenominator);
         AA[i][j] = Double.parseDouble(df.format(AffNumerator / AffDenominator));
         System.out.println("AA["+i+"]["+j+"] = "+AA[i][j]);
       }
     }

這個程式都放在Github，需要的人請自取

整個過程都有註解，所以應該還滿好理解，只是執行速度太慢就是了...而且當時寫這作業沒想其它的，看起來真有點亂...

Thoughts & Views

1/27/2017

Distributed Databases - Bond Energy Algorithm

No comments:

Post a Comment