
close all
% Lamrous's PCA program from the notes
[dataNum, datatext, alldata] = xlsread('voiture.xls','voiture2');
Data=dataNum(:,1:7);
Variables = datatext(1,2:8)
Individuals = datatext(2:31,1)

%This allows you to run a PCA
%Note: in recent versions of MATLAB the "princomp" function has been
%renamed; use "pca" instead
Variables =

  1×7 cell array

  Columns 1 through 4

    {'puissance'}    {'cylindree'}    {'vitesse'}    {'longueur'}

  Columns 5 through 7

    {'largeur'}    {'hauteur'}    {'poids'}


Individuals =

  30×1 cell array

    {'PANDA'    }
    {'TWINGO'   }
    {'YARIS'    }
    {'CITRONC2' }
    {'CORSA'    }
    {'FIESTA'   }
    {'CLIO'     }
    {'P1007'    }
    {'MODUS'    }
    {'MUSA'     }
    {'GOLF'     }
    {'MERC_A'   }
    {'AUDIA3'   }
    {'CITRONC4' }
    {'AVENSIS'  }
    {'VECTRA'   }
    {'PASSAT'   }
    {'LAGUNA'   }
    {'MEGANECC' }
    {'P407'     }
    {'P307CC'   }
    {'PTCRUISER'}
    {'MONDEO'   }
    {'MAZDARX8' }
    {'VELSATIS' }
    {'CITRONC5' }
    {'P607'     }
    {'MERC_E'   }
    {'ALFA156'  }
    {'BMW530'   }

question 5

5) Run the k-means clustering algorithm on the full data matrix with the following parameters:

- Initialization: uniform
- Number of groups: 2
- Number of replications: 1

[idx,centres]=kmeans(Data,2,'Replicates',1,'Start','uniform');

question 5 (continued)

Is your algorithm stable from one execution to the next? If not, under what condition would it be stable? Our algorithm is unstable because the initial centers are drawn at random on each run ('Start','uniform' picks them uniformly from the range of the data). The code below sets the initial centers manually, which makes the clustering reproducible from run to run.

[dataNum, datatext, alldata] = xlsread('voiture.xls','voiture2');
DataKmeans=dataNum(:,1:7);

%for i=1:10
%fixedCenter = randi([1 5],2,7)
% Fixed initial centers, reused for every call (seed the RNG, e.g. rng(0),
% to make them reproducible across sessions as well)
fixedCenter = rand(2,7)
[idx,centres]=kmeans(DataKmeans,2,'Replicates',1,'Start',fixedCenter);


figure;
plot(DataKmeans(idx==1,1),DataKmeans(idx==1,2),'r.','MarkerSize',12)
hold on
plot(DataKmeans(idx==2,1),DataKmeans(idx==2,2),'b.','MarkerSize',12)
plot(centres(:,1),centres(:,2),'kx',...
     'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
       'Location','NW')
title('Clusters without PCA')
text(DataKmeans(:,1),DataKmeans(:,2),Individuals);
hold off
% end
fixedCenter =

    0.5942    0.2084    0.5252    0.3057    0.9568    0.6515    0.6377
    0.0031    0.4844    0.2100    0.5018    0.6035    0.3047    0.8723
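
With the starting centers fixed, the partition should no longer change between executions. A minimal check of this (a sketch reusing fixedCenter and DataKmeans from above):

% Sanity check (sketch): the same fixed start should give the same partition
idx1 = kmeans(DataKmeans,2,'Start',fixedCenter);
idx2 = kmeans(DataKmeans,2,'Start',fixedCenter);
isequal(idx1,idx2)   % expected: logical 1, i.e. identical partitions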

question 6

What is the best grouping? Answer: 2. See how evalclusters works here: https://fr.mathworks.com/help/stats/evalclusters.html

% The number of groups could in principle go up to 30 (one per car),
% but we only need to compare up to two.
eva = evalclusters(Data,'kmeans','CalinskiHarabasz','KList',1:2)
eva = 

  CalinskiHarabaszEvaluation with properties:

    NumObservations: 30
         InspectedK: [1 2]
    CriterionValues: [NaN 42.2606]
           OptimalK: 2
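
To see why 2 is the best grouping rather than only ruling out 1, the scan can be widened; a sketch on the same Data (the variable name evaWide is ours, and KList starts at 2 because the Calinski-Harabasz index is undefined for a single cluster):

evaWide = evalclusters(Data,'kmeans','CalinskiHarabasz','KList',2:6);
evaWide.CriterionValues   % the largest value marks the best K
evaWide.OptimalK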

[coeff,scores,eigen_values] = pca(zscore(Data));

 coeff % coordinates of variables in the projection space
 scores % coordinates of individuals in the projection space
 eigen_values % Eigenvalues
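
% Sanity check (a sketch): with standardized data the scores are the projection
% of zscore(Data) onto the loadings, and each eigenvalue is the variance of the
% corresponding score column.
Z = zscore(Data);
norm(scores - Z*coeff)              % ~0 up to round-off
norm(eigen_values - var(scores)')   % ~0 up to round-off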



%The following command is used to create the graph of the eigenvalues.

figure('Name','Eigenvalue spectrum','NumberTitle','off');
bar(eigen_values);

% And this gives us the cumulative inertias of the eigenvalues.

Information_Percentage = cumsum(eigen_values)./sum(eigen_values)
figure('Name','% of cumulative inertia','NumberTitle','off');
bar(Information_Percentage);
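
% Sketch: the number of components can also be chosen programmatically from the
% cumulative inertia. Note that a strict 94% threshold selects 4 components here
% (0.9394 < 0.94), while question 7 keeps 3, i.e. "almost 94%" of the information.
nComp = find(Information_Percentage >= 0.94,1)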


% This command is used to produce the factorial map of individuals on
% the two main axes

figure('Name','individuals','NumberTitle','off');
plot(scores(:,1),scores(:,2),'r+');
text(scores(:,1),scores(:,2),Individuals);
a  = axis;
xl = a(1);xu = a(2);yl = a(3);yu = a(4);
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
hold on
line([xl xu],[0 0])
line([0 0],[yl yu])
% axes 1 and 3
figure('Name','individuals 1/3','NumberTitle','off');
plot(scores(:,1),scores(:,3),'r+');
text(scores(:,1),scores(:,3),Individuals);
a  = axis;
xl = a(1);xu = a(2);yl = a(3);yu = a(4);
xlabel('1st Principal Component')
ylabel('3rd Principal Component')
hold on
line([xl xu],[0 0])
line([0 0],[yl yu])
% axes 2 and 3
figure('Name','individuals 2/3','NumberTitle','off');
plot(scores(:,2),scores(:,3),'r+');
text(scores(:,2),scores(:,3),Individuals);
a  = axis;
xl = a(1);xu = a(2);yl = a(3);yu = a(4);
xlabel('2nd Principal Component')
ylabel('3rd Principal Component')
hold on
line([xl xu],[0 0])
line([0 0],[yl yu])
% In 3D
figure('Name','individuals 1 2 3','NumberTitle','off');
plot3(scores(:,1),scores(:,2),scores(:,3),'r+');
text(scores(:,1),scores(:,2),scores(:,3),Individuals);
a  = axis;
xl = a(1);xu = a(2);yl = a(3);yu = a(4);zl = a(5);zu = a(6);
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
zlabel('3rd Principal Component')
hold on
line([xl xu],[0, 0],[0, 0])
line([0 0],[yl yu],[0,0])
line([0 0],[0,0],[zl zu])

% This command is used to create the circle of variable correlations
% on the two main axes.

figure('Name','Variables','NumberTitle','off');
plot(coeff(:,1),coeff(:,2),'*');
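% Sketch (our addition): complete the figure into a readable correlation circle
% by labelling the variables and drawing the unit circle around the loadings.
text(coeff(:,1),coeff(:,2),Variables');
theta = linspace(0,2*pi,200);
hold on
plot(cos(theta),sin(theta),'k--')   % unit circle reference
axis equal
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
hold off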
coeff =

    0.4044   -0.0790   -0.5043   -0.4002    0.0680   -0.1920    0.6118
    0.3917    0.2147   -0.3307    0.8220    0.1197   -0.0303    0.0112
    0.4213   -0.1404   -0.2517   -0.2918    0.3433    0.4170   -0.6022
    0.4167   -0.0158    0.3717    0.0392   -0.4468    0.6291    0.3017
    0.3823    0.0587    0.6535   -0.0298    0.5698   -0.2889    0.1199
   -0.1179    0.9301   -0.0571   -0.2221    0.1345    0.2164    0.0587
    0.4146    0.2432    0.0676   -0.1649   -0.5664   -0.5121   -0.3926


scores =

   -4.0472    0.2824   -0.5296    0.0481   -0.1216    0.3029    0.1078
   -3.7200   -1.0909   -0.2811    0.3755    0.1383   -0.2749    0.0884
   -3.4453   -0.2319    0.0988   -0.1354    0.2159    0.0647    0.2326
   -3.1877   -0.5250    0.1005    0.0959    0.0690   -0.0397    0.0504
   -2.6235   -0.8177    0.0059    0.1814   -0.3447   -0.0025   -0.0766
   -2.1625   -0.6374    0.2909    0.3407   -0.4203   -0.2245   -0.1353
   -2.1306   -1.1847   -0.6300    0.1808   -0.0105    0.2220   -0.1137
   -2.4464    1.5375    0.0590   -0.3301    0.0567   -0.1076   -0.1199
   -1.5978    1.2096   -0.4389   -0.4081    0.4058    0.0795   -0.1168
   -1.4170    2.7378   -0.3200   -0.1589    0.1118    0.3307   -0.0248
   -1.0473    0.3197    0.8125    0.8790   -0.1021   -0.1917    0.2012
   -0.3368    1.5964   -0.3389   -0.3367    0.7666   -0.3685   -0.2073
   -0.6814   -0.7129    0.7360    0.1595    0.0895   -0.1510    0.0283
    0.4648   -0.1627    0.1755    0.0566    0.1404   -0.2251   -0.2237
    0.3866    0.1650    0.5991    0.2901   -0.5390    0.2982   -0.0223
    1.1160   -0.2073    0.4998   -0.2477   -0.0494    0.1825   -0.1116
    0.8168   -0.2523    0.1871   -0.4205   -0.3323    0.7483   -0.0814
    1.0447   -0.6974    0.1265   -0.0705    0.0584    0.3193    0.1251
    1.1246   -0.8829   -0.0716   -0.1577    0.1056   -0.1447   -0.3229
    1.1753   -0.2613    0.8485    0.0521   -0.0272    0.1909   -0.0602
    1.2105   -0.5931   -0.3689   -0.3629   -0.1441   -0.2267   -0.3059
    1.1419    1.0743   -1.2928   -0.2078   -0.6249   -0.5160    0.4434
    1.8907   -0.4868    1.8009    0.0097    0.8752   -0.1395    0.2392
    1.3463   -2.1830   -0.3105   -1.4109   -0.0593   -0.2084    0.1643
    1.9100    1.8172    1.1883   -0.2660   -0.4744   -0.1885    0.1114
    2.3613    0.2514   -0.4568   -0.1139   -0.3363    0.1506    0.1002
    3.1763    0.1294    0.1606    0.2307   -0.3763   -0.1982    0.0095
    3.5546    0.3838   -0.3855    0.7553   -0.1227   -0.1314   -0.3219
    2.7322   -0.7153   -1.8469    0.6783    0.4428    0.0772    0.0960
    3.3909    0.1379   -0.4185    0.2934    0.6091    0.3722    0.2466


eigen_values =

    5.0224
    1.0559
    0.4972
    0.1842
    0.1327
    0.0734
    0.0341


Information_Percentage =

    0.7175
    0.8683
    0.9394
    0.9657
    0.9846
    0.9951
    1.0000

question 7

Run your clustering algorithm on the results of a PCA retaining at least 94% of the information. What do you observe?

The eigenvalue-spectrum figure shows a total inertia of 7.0 (100%, one unit per standardized variable), so 94% of the information corresponds to roughly 6.58. The first three components reach this threshold (5.02 + 1.06 + 0.50 ≈ 6.58), so we keep three. With k-means on these three components and 2 groups, the SilhouetteEvaluation criterion value is 0.6905, against 0.6636 for the same clustering without PCA. We conclude that applying k-means after PCA gives a better-quality clustering.

fixedCenter = rand(2,3)
% Retain first three principal components
selectedThree = scores(:,1:3);
[idx,centres]=kmeans(selectedThree,2,'Replicates',1,'Start',fixedCenter);


figure;
plot(selectedThree(idx==1,1),selectedThree(idx==1,2),'r.','MarkerSize',12)
hold on
plot(selectedThree(idx==2,1),selectedThree(idx==2,2),'b.','MarkerSize',12)
plot(centres(:,1),centres(:,2),'kx',...
     'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
       'Location','NW')
title('Clusters with k-means on PCA scores (almost 94% of the information)')
text(scores(:,1),scores(:,2),Individuals);
hold off


% Compare clustering quality without and with PCA. Again, the number of
% groups could go up to 30, but we evaluate two.
eva = evalclusters(Data,'kmeans','Silhouette','KList',1:2)

eva = evalclusters(selectedThree,'kmeans','Silhouette','KList',1:2)
fixedCenter =

    0.8929    0.4086    0.3285
    0.4786    0.8509    0.7848


eva = 

  SilhouetteEvaluation with properties:

    NumObservations: 30
         InspectedK: [1 2]
    CriterionValues: [NaN 0.6636]
           OptimalK: 2


eva = 

  SilhouetteEvaluation with properties:

    NumObservations: 30
         InspectedK: [1 2]
    CriterionValues: [NaN 0.6905]
           OptimalK: 2
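
The two silhouette values can also be read side by side; a minimal sketch reusing the data above (the variable names evaRaw and evaPCA are ours):

evaRaw = evalclusters(Data,'kmeans','Silhouette','KList',1:2);
evaPCA = evalclusters(selectedThree,'kmeans','Silhouette','KList',1:2);
fprintf('Silhouette without PCA: %.4f, with PCA: %.4f\n', ...
        evaRaw.CriterionValues(2), evaPCA.CriterionValues(2));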