1/27/2017

Distributed Databases - Bond Energy Algorithm

這個演算法算是在去唸書期間,以第一學期來說,算是最麻煩的,即使公式都給了。這門課是講關於分散式系統資料庫,對我來說跟天書沒什麼兩樣,因為那時連資料庫碰都沒碰過,一點概念都沒有。而且現在才紀錄下來好像太晚了點.....都忘了差不多了XDD

總而言之,這門課的第一作業就是叫我們實現Bond Energy Algorithm。在講它之前,有必要先講一些分散式資料庫的簡單觀念說一下。

分散式資料庫是一群資料庫系統(或是所謂的節點)的集合,它們都是透過網路來連結,而且彼此互相之間是有關聯的。它被應用在非常廣的範圍,像是Google 就是用分散式系統管理資料庫,不然一個資料點掛了就全部陣亡。舉例來說,小明在台北市查找一本書。實際上這連線是不止會傳回台北所在的資料,也會向其它縣市的資料庫進行查尋(如果沒有或是全部查找)。

在分散式資料庫系統中,我們通常須將存放資料的表格進行分割,並放在多個資料庫節點上。而這分割又被分為Horizontal / Vertical fragmentation。而BEA就是在設計Vertical Fragmentation Algorithm時,基於Attribute affinity values,找出一組Attirbutes的關聯的意義。

BEA的演算法公式如下 :




1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
/* Bond Energy algorithm
        input: AA (Attribute Affinity matrix)
        output: CA (Clustered Affinity matrix)
     */

     //copy old AA matrix to new AA in order to avoid index mistake
     AAnew = new double[MATRIX_SIZE+1][MATRIX_SIZE+1];
     CA = new double[MATRIX_SIZE+2][MATRIX_SIZE+2];
     tempColumn = new double[MATRIX_SIZE+1];

     for(int i = 1; i < MATRIX_SIZE+1; i++) {
       for(int j = 1; j < MATRIX_SIZE+1; j++) {
         AAnew[i][j] = AA[i-1][j-1];
        // System.out.println("AAnew["+i+"]["+j+"] = "+AAnew[i][j]);
       }
     }

     //fill columns and rows
     for(int i = 0; i < MATRIX_SIZE+1; i++) {
       CA[0][i] = i;
      for(int j = 0; j < MATRIX_SIZE+1; j++) {
         CA[j][0] = j;
       }
     }

     for(int i = 0; i < MATRIX_SIZE+1; i++) {
       AAnew[0][i] = i;
       for(int j = 0; j < MATRIX_SIZE+1; j++) {
         AAnew[j][0] = j;
       }
     }


     // Copy the first and second columns from AA
     for(int i = 1; i < MATRIX_SIZE+1; i++) {
       CA[i][1] = AAnew[i][1];
       CA[i][2] = AAnew[i][2];
     }

     index = 3; // start from third index
     RightMostIndex = 2; //current rightmost index of AA.
     while(index <= MATRIX_SIZE) {
       double contrib = 0;
       double maxContrib = 0;
       //System.out.println("------ index = "+index);

       for(int i = 1; i < index; i++) {
          contrib = cont(CA[0][i-1], index, CA[0][i], AAnew);
          //if found the highest contribution, record it
          if(contrib > maxContrib) {
            maxContrib = contrib;
            record(CA[0][i-1], index, CA[0][i]);
          }
        }
        contrib = cont(CA[0][index-1], index, index+1, AAnew);

        if(contrib > maxContrib) {
          maxContrib = contrib;
          record(CA[0][index-1], index, index+1);
        }
       /* place the highest contribution */
       CAplacement();
       index++;
     }
     CArowplace();
     CAprint();
   }

   /* This function is to order rows according the placement of columns */
   public static void CArowplace() {
     //double[][] temp;
     CAfinal =  new double[MATRIX_SIZE+2][MATRIX_SIZE+2];

     for(int i = 0; i <= MATRIX_SIZE; i++) {
       double row = CA[0][i];
       CAfinal[0][i] = row;

       for(int j = 0; j <= MATRIX_SIZE; j++) {
         //temp[i][j] = CA[(int)row][j];
         //System.out.println("row ="+(int)row);
         CAfinal[i][j] = CA[(int)row][j];
       }
     }
   }


看起來雖然滿複雜,但實現這公式是滿簡單的。麻煩的是要算出AA matrix。當時老師不打算用課本上的演算法,改用一個叫做 Cosine-based similarity 的算法。所以AA matrix 的算法變成如下 :




1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
/* Start - Attribute Affinity algorithm */
     AA = new double[MATRIX_SIZE][MATRIX_SIZE];
     DecimalFormat df = new DecimalFormat("0.0000");

     for(int i = 0; i < att.size(); i++) {
       for(int j = 0; j < att.size(); j++) {
         double Aik = 0;
         double Ajk = 0;
         double AffNumerator = 0;
         double AffDenominator = 0;
         double temp = 0;
         double temp1 = 0;
         double TempSqrt = 0;
         double Temp1Sqrt = 0;
         for(int k = 0; k < query.size(); k++) {
           double SumAcc = 0;
           /* sum all accMatrix then multiply use(qk,Ai)*/
           for(int site = 0; site < 3; site++) {
             SumAcc = SumAcc + accMatrix[k][site];
           }
           Aik = beQueried[k][i] * SumAcc;
           Ajk = beQueried[k][j] * SumAcc;

           AffNumerator = AffNumerator + (Aik*Ajk);
           temp = temp + Math.pow(Aik,2);
           temp1 = temp1 + Math.pow(Ajk,2);
         }
         TempSqrt = Math.sqrt(temp);
         Temp1Sqrt = Math.sqrt(temp1);
         AffDenominator = TempSqrt * Temp1Sqrt;
         //System.out.println("Affn = "+AffNumerator);
         //System.out.println("Affd = "+AffDenominator);
         AA[i][j] = Double.parseDouble(df.format(AffNumerator / AffDenominator));
         System.out.println("AA["+i+"]["+j+"] = "+AA[i][j]);
       }
     }


這個程式都放在Github,需要的人請自取

整個過程都有註解,所以應該還滿好理解,只是執行速度太慢就是了...而且當時寫這作業沒想其它的,看起來真有點亂...

1/17/2017

I'm back

On 17th of November, I finally completed my degree of Master in Australia. This three-year journey took me a lot of efforts on assignments and projects, but really benefited from those activities during the study. To be honest, I'd never thought that one day I would have been able to graduate from foreign schools and even worked on there. I've truly appreciated my parents who give me such an opportunity to study abroad, and also thankful my girlfriend company with me in the long distance.  Of course, there are many people who care about me that I really appreciate as well.

Although I'm still studying IELTS test in order to apply for PR visa after the graduation ( under 0.5 in part of listening...), updating what I've learnt from the university on this blog is my next step! I hope that by writing the blog, I could be able to organise my thoughts and review the things I learnt.

By the way, if anyone would like to look for a partner for practising IELTS speaking or something else, you can just simply send an email to me :)