總而言之,這門課的第一作業就是叫我們實現Bond Energy Algorithm。在講它之前,有必要先講一些分散式資料庫的簡單觀念說一下。
分散式資料庫是一群資料庫系統(或是所謂的節點)的集合,它們都是透過網路來連結,而且彼此互相之間是有關聯的。它被應用在非常廣的範圍,像是Google 就是用分散式系統管理資料庫,不然一個資料點掛了就全部陣亡。舉例來說,小明在台北市查找一本書。實際上這連線是不止會傳回台北所在的資料,也會向其它縣市的資料庫進行查尋(如果沒有或是全部查找)。
在分散式資料庫系統中,我們通常須將存放資料的表格進行分割,並放在多個資料庫節點上。而這分割又被分為Horizontal / Vertical fragmentation。而BEA就是在設計Vertical Fragmentation Algorithm時,基於Attribute affinity values,找出一組Attirbutes的關聯的意義。
BEA的演算法公式如下 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | /* Bond Energy algorithm input: AA (Attribute Affinity matrix) output: CA (Clustered Affinity matrix) */ //copy old AA matrix to new AA in order to avoid index mistake AAnew = new double[MATRIX_SIZE+1][MATRIX_SIZE+1]; CA = new double[MATRIX_SIZE+2][MATRIX_SIZE+2]; tempColumn = new double[MATRIX_SIZE+1]; for(int i = 1; i < MATRIX_SIZE+1; i++) { for(int j = 1; j < MATRIX_SIZE+1; j++) { AAnew[i][j] = AA[i-1][j-1]; // System.out.println("AAnew["+i+"]["+j+"] = "+AAnew[i][j]); } } //fill columns and rows for(int i = 0; i < MATRIX_SIZE+1; i++) { CA[0][i] = i; for(int j = 0; j < MATRIX_SIZE+1; j++) { CA[j][0] = j; } } for(int i = 0; i < MATRIX_SIZE+1; i++) { AAnew[0][i] = i; for(int j = 0; j < MATRIX_SIZE+1; j++) { AAnew[j][0] = j; } } // Copy the first and second columns from AA for(int i = 1; i < MATRIX_SIZE+1; i++) { CA[i][1] = AAnew[i][1]; CA[i][2] = AAnew[i][2]; } index = 3; // start from third index RightMostIndex = 2; //current rightmost index of AA. while(index <= MATRIX_SIZE) { double contrib = 0; double maxContrib = 0; //System.out.println("------ index = "+index); for(int i = 1; i < index; i++) { contrib = cont(CA[0][i-1], index, CA[0][i], AAnew); //if found the highest contribution, record it if(contrib > maxContrib) { maxContrib = contrib; record(CA[0][i-1], index, CA[0][i]); } } contrib = cont(CA[0][index-1], index, index+1, AAnew); if(contrib > maxContrib) { maxContrib = contrib; record(CA[0][index-1], index, index+1); } /* place the highest contribution */ CAplacement(); index++; } CArowplace(); CAprint(); } /* This function is to order rows according the placement of columns */ public static void CArowplace() { //double[][] temp; CAfinal = new double[MATRIX_SIZE+2][MATRIX_SIZE+2]; for(int i = 0; i <= MATRIX_SIZE; i++) { double row = CA[0][i]; CAfinal[0][i] = row; for(int j = 0; j <= MATRIX_SIZE; j++) { //temp[i][j] = CA[(int)row][j]; //System.out.println("row ="+(int)row); CAfinal[i][j] = CA[(int)row][j]; } } } |
看起來雖然滿複雜,但實現這公式是滿簡單的。麻煩的是要算出AA matrix。當時老師不打算用課本上的演算法,改用一個叫做 Cosine-based similarity 的算法。所以AA matrix 的算法變成如下 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | /* Start - Attribute Affinity algorithm */ AA = new double[MATRIX_SIZE][MATRIX_SIZE]; DecimalFormat df = new DecimalFormat("0.0000"); for(int i = 0; i < att.size(); i++) { for(int j = 0; j < att.size(); j++) { double Aik = 0; double Ajk = 0; double AffNumerator = 0; double AffDenominator = 0; double temp = 0; double temp1 = 0; double TempSqrt = 0; double Temp1Sqrt = 0; for(int k = 0; k < query.size(); k++) { double SumAcc = 0; /* sum all accMatrix then multiply use(qk,Ai)*/ for(int site = 0; site < 3; site++) { SumAcc = SumAcc + accMatrix[k][site]; } Aik = beQueried[k][i] * SumAcc; Ajk = beQueried[k][j] * SumAcc; AffNumerator = AffNumerator + (Aik*Ajk); temp = temp + Math.pow(Aik,2); temp1 = temp1 + Math.pow(Ajk,2); } TempSqrt = Math.sqrt(temp); Temp1Sqrt = Math.sqrt(temp1); AffDenominator = TempSqrt * Temp1Sqrt; //System.out.println("Affn = "+AffNumerator); //System.out.println("Affd = "+AffDenominator); AA[i][j] = Double.parseDouble(df.format(AffNumerator / AffDenominator)); System.out.println("AA["+i+"]["+j+"] = "+AA[i][j]); } } |
這個程式都放在Github,需要的人請自取
整個過程都有註解,所以應該還滿好理解,只是執行速度太慢就是了...而且當時寫這作業沒想其它的,看起來真有點亂...