PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO GRANDE DO SUL
FACULDADE DE INFORMÁTICA
COURSE: Aspectos Avançados em Banco de Dados (Advanced Topics in Databases), SECTION: 168, Prof. Duncan

Data mining exercises with Classification and Association Rules

These exercises use the .arff files available on the course page. The tool used for the practice is WEKA, version 3.5.7. Start WEKA and go to Applications -> Explorer. Select Open file...

Exercise 1: file TAN_Ex_DT.arff. This is the dataset used in the slides for chapter 4 of TAN.

- open the .arff file
- select the Classify tab
- press the Choose button and select trees -> SimpleCart
- press Start; the result should be the following:

    === Run information ===

    Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 2.0 -N 5 -C 1.0
    Relation:     TAN_DecisionTree
    Instances:    10
    Attributes:   5
                  TID
                  Refund
                  Marital_Status
                  Taxable_Income
                  Cheat
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    CART Decision Tree

    : NO(7.0/3.0)

    Number of Leaf Nodes: 1
    Size of the Tree: 1

    Time taken to build model: 0.22 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances           6               66.6667 %
    Incorrectly Classified Instances         3               33.3333 %
    Kappa statistic                         -0.1739
    Mean absolute error                      0.5062
    Root mean squared error                  0.5653
    Relative absolute error                113.8889 %
    Root relative squared error            120.4266 %
    Total Number of Instances                9

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.857     1         0.75        0.857    0.8         0          NO
    0         0.143     0           0        0           0          YES

    === Confusion Matrix ===

     a b   <-- classified as
     6 1 | a = NO
     2 0 | b = YES

Note that the "tree" has a single node, ": NO(7.0/3.0)", which means that the algorithm classified every object as Cheat = NO, getting 7 right and 3 wrong.
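The accuracy and the Kappa statistic in the Summary can be recomputed by hand from the confusion matrix. The sketch below is only an illustration, assuming the usual definition of Cohen's kappa (observed agreement versus the agreement expected from the row and column totals); the function name accuracy_and_kappa is hypothetical and is not part of WEKA.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are the actual classes, columns are the predicted classes,
    in the same layout WEKA prints under "Confusion Matrix".
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the main diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Confusion matrix of the first SimpleCart run (rows: NO, YES).
first_run = [[6, 1],
             [2, 0]]
acc, kappa = accuracy_and_kappa(first_run)
print(f"accuracy = {acc:.4%}, kappa = {kappa:.4f}")
# prints: accuracy = 66.6667%, kappa = -0.1739
```

A negative kappa, as here, indicates agreement slightly below what chance alone would give for these class totals.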
- click on the algorithm name: SimpleCart;
- change minNumObj to 1 and usePrune to False;
- press Start again; the tree shown should be this one:

    CART Decision Tree

    TID < 4.5: NO(4.0/0.0)
    TID >= 4.5
    |  Marital_Status=(Single): YES(2.0/0.0)
    |  Marital_Status!=(Single)
    |  |  TID < 5.5: YES(1.0/0.0)
    |  |  TID >= 5.5: NO(3.0/0.0)

    Number of Leaf Nodes: 4
    Size of the Tree: 7

Notice that TID has become part of the tree, although it is only an identifier and carries no meaning for the objects. To leave it out, do the following:

- click on the Preprocess tab, select the TID attribute under Attributes and press the Remove button;
- click on the Classify tab and press Start; the output should be this:

    === Run information ===

    Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
    Relation:     TAN_DecisionTree-weka.filters.unsupervised.attribute.Remove-R1
    Instances:    10
    Attributes:   4
                  Refund
                  Marital_Status
                  Taxable_Income
                  Cheat
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    CART Decision Tree

    Marital_Status=(Single)|(Divorced)
    |  Refund=(NO)
    |  |  Taxable_Income < 77.5: NO(1.0/0.0)
    |  |  Taxable_Income >= 77.5: YES(3.0/0.0)
    |  Refund!=(NO): NO(2.0/0.0)
    Marital_Status!=(Single)|(Divorced): NO(4.0/0.0)

    Number of Leaf Nodes: 4
    Size of the Tree: 7

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances           7               70      %
    Incorrectly Classified Instances         3               30      %
    Kappa statistic                          0.3478
    Mean absolute error                      0.3
    Root mean squared error                  0.5477
    Relative absolute error                 63.4615 %
    Root relative squared error            109.2739 %
    Total Number of Instances               10

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.714     0.333     0.833       0.714    0.769       0.69       NO
    0.667     0.286     0.5         0.667    0.571       0.69       YES

    === Confusion Matrix ===

     a b   <-- classified as
     5 2 | a = NO
     1 2 | b = YES

Now the resulting tree is consistent with the one seen in class.
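To make the reading of the tree concrete, the SimpleCart tree obtained after removing TID can be transcribed as nested conditions. This is just a hand-written sketch: the function name simplecart_predict and the sample records are hypothetical, and the attribute values (NO/YES, Single/Married/Divorced, numeric income) are assumed to be spelled as they appear in the WEKA output.

```python
def simplecart_predict(refund, marital_status, taxable_income):
    """Walk the SimpleCart tree (TID removed) and return the Cheat label."""
    if marital_status in ("Single", "Divorced"):
        if refund == "NO":
            # Leaves: Taxable_Income < 77.5 -> NO, otherwise YES.
            return "NO" if taxable_income < 77.5 else "YES"
        # Refund != NO
        return "NO"
    # Marital_Status is neither Single nor Divorced
    return "NO"


# Hypothetical records, only to exercise each branch of the tree.
print(simplecart_predict("NO", "Single", 90))     # YES
print(simplecart_predict("NO", "Divorced", 70))   # NO
print(simplecart_predict("YES", "Single", 90))    # NO
print(simplecart_predict("NO", "Married", 200))   # NO
```

Each root-to-leaf path corresponds to one classification rule, so a tree with 4 leaves encodes 4 rules.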
The changes to the algorithm's settings (minNumObj set to 1 and usePrune set to False) made it grow the tree down to the last level of the hierarchy and prevented it from shrinking the tree by any pruning criterion.

IMPORTANT: Removing attributes that are irrelevant or meaningless for the problem is a data preparation task that the algorithms do not necessarily perform. In other words, the user has to do this preparation manually.

Using the same file, let us run another classification algorithm: J48 (WEKA's implementation of the C4.5 algorithm).

- click on the Choose button and select trees -> J48;
- configure it (click on the algorithm name) with minNumObj = 1 and unpruned = true;
- press Start; the output should be this:

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -U -M 1
    Relation:     TAN_DecisionTree-weka.filters.unsupervised.attribute.Remove-R1
    Instances:    10
    Attributes:   4
                  Refund
                  Marital_Status
                  Taxable_Income
                  Cheat
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    J48 unpruned tree
    ------------------

    Refund = NO
    |   Marital_Status = Single
    |   |   Taxable_Income <= 75: NO (1.0)
    |   |   Taxable_Income > 75: YES (2.0)
    |   Marital_Status = Married: NO (3.0)
    |   Marital_Status = Divorced: YES (1.0)
    Refund = YES: NO (3.0)

    Number of Leaves  :     5
    Size of the tree :      8

    Time taken to build model: 0.03 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances           5               50      %
    Incorrectly Classified Instances         5               50      %
    Kappa statistic                          0.0741
    Mean absolute error                      0.5
    Root mean squared error                  0.7071
    Relative absolute error                105.7692 %
    Root relative squared error            141.072  %
    Total Number of Instances               10

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.429     0.333     0.75        0.429    0.545       0.548      NO
    0.667     0.571     0.333       0.667    0.444       0.548      YES

    === Confusion Matrix ===

     a b   <-- classified as
     3 4 | a = NO
     1 2 | b = YES

The results clearly differ: the two algorithms build different trees!
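The Precision, Recall and F-Measure columns of "Detailed Accuracy By Class" follow directly from the confusion matrix. The sketch below recomputes them for the J48 run, assuming the standard definitions (precision = TP / predicted as the class, recall = TP / actual members of the class, F-Measure = harmonic mean of the two); the helper name per_class_metrics is hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f_measure)} from a confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                    # row total
        predicted = sum(row[i] for row in matrix)  # column total
        recall = true_positives / actual if actual else 0.0
        precision = true_positives / predicted if predicted else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


# Confusion matrix of the J48 run on the TAN data (rows: NO, YES).
j48_matrix = [[3, 4],
              [1, 2]]
for label, (p, r, f) in per_class_metrics(j48_matrix, ["NO", "YES"]).items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
# prints:
# NO: precision=0.750 recall=0.429 f-measure=0.545
# YES: precision=0.333 recall=0.667 f-measure=0.444
```

Recall is the same quantity WEKA reports as TP Rate, which is why the two columns always match.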
The tree produced by J48 can be viewed graphically: in the Result list, right-click the trees.J48 entry and select Visualize Tree. The SimpleCart algorithm has no tree visualization.

Compare the results of the 2 algorithms in terms of accuracy (hit rate) and confusion matrix:

SimpleCart:

    Correctly Classified Instances           7               70      %

    === Confusion Matrix ===

     a b   <-- classified as
     5 2 | a = NO
     1 2 | b = YES

J48:

    Correctly Classified Instances           5               50      %

    === Confusion Matrix ===

     a b   <-- classified as
     3 4 | a = NO
     1 2 | b = YES

"Which is the better classifier, in your opinion?"

Repeat the analysis above for the file academico4.arff. For this analysis, use WEKA to remove the NOTA (grade) column. The results are shown first with the nota attribute still present and then with it removed.

SimpleCart result:

    === Run information ===

    Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
    Relation:     acadêmico
    Instances:    40
    Attributes:   10
                  disciplina
                  turma
                  ano_sem
                  professor
                  matricula
                  sexo
                  idade
                  curso
                  nota
                  desempenho
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    CART Decision Tree

    nota < 7.5
    |  nota < 5.5
    |  |  disciplina=(46258-04): REGULAR(1.0/0.0)
    |  |  disciplina!=(46258-04): RUIM(4.0/0.0)
    |  nota >= 5.5
    |  |  turma=(138): BOM(1.0/0.0)
    |  |  turma!=(138): REGULAR(14.0/0.0)
    nota >= 7.5
    |  professor=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)|(Arruda)|(Karin)|(Hubert)|(Egidio): BOM(19.0/0.0)
    |  professor!=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)|(Arruda)|(Karin)|(Hubert)|(Egidio): REGULAR(1.0/0.0)

    Number of Leaf Nodes: 6
    Size of the Tree: 11

    Time taken to build model: 0.05 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          36               90      %
    Incorrectly Classified Instances         4               10      %
    Kappa statistic                          0.8305
    Mean absolute error                      0.0667
    Root mean squared error                  0.2582
    Relative absolute error                 16.9565 %
    Root relative squared error             58.3717 %
    Total Number of Instances               40

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.9       0.05      0.947       0.9      0.923       0.925      BOM
    0.875     0.083     0.875       0.875    0.875       0.896      REGULAR
    1         0.028     0.8         1        0.889       0.986      RUIM

    === Confusion Matrix ===

      a  b  c   <-- classified as
     18  2  0 |  a = BOM
      1 14  1 |  b = REGULAR
      0  0  4 |  c = RUIM

J48 result:

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -U -M 1
    Relation:     acadêmico
    Instances:    40
    Attributes:   10
                  disciplina
                  turma
                  ano_sem
                  professor
                  matricula
                  sexo
                  idade
                  curso
                  nota
                  desempenho
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    J48 unpruned tree
    ------------------

    nota <= 7
    |   nota <= 5
    |   |   disciplina = 46250-02: RUIM (1.0)
    |   |   disciplina = 46251-04: RUIM (1.0)
    |   |   disciplina = 46252-04: RUIM (0.0)
    |   |   disciplina = 46256-04: RUIM (2.0)
    |   |   disciplina = 46257-04: RUIM (0.0)
    |   |   disciplina = 46258-04: REGULAR (1.0)
    |   |   disciplina = 46266-04: RUIM (0.0)
    |   |   disciplina = 46267-04: RUIM (0.0)
    |   nota > 5
    |   |   turma = 128: REGULAR (14.0)
    |   |   turma = 138: BOM (1.0)
    nota > 7
    |   professor = Karin: BOM (2.0)
    |   professor = Duncan: BOM (3.0)
    |   professor = Arruda: BOM (2.0)
    |   professor = Yamaguti: REGULAR (1.0)
    |   professor = Bastos: BOM (1.0)
    |   professor = Afonso: BOM (3.0)
    |   professor = Hubert: BOM (1.0)
    |   professor = Ana-Paula: BOM (7.0)
    |   professor = Egidio: BOM (0.0)

    Number of Leaves  :     19
    Size of the tree :      24

    Time taken to build model: 0.03 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          33               82.5    %
    Incorrectly Classified Instances         7               17.5    %
    Kappa statistic                          0.7021
    Mean absolute error                      0.1292
    Root mean squared error                  0.3461
    Relative absolute error                 32.8533 %
    Root relative squared error             78.2459 %
    Total Number of Instances               40

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.8       0.05      0.941       0.8      0.865       0.874      BOM
    0.875     0.208     0.737       0.875    0.8         0.826      REGULAR
    0.75      0.028     0.75        0.75     0.75        0.854      RUIM

    === Confusion Matrix ===

      a  b  c   <-- classified as
     16  4  0 |  a = BOM
      1 14  1 |  b = REGULAR
      0  1  3 |  c = RUIM

With the Nota column removed, SimpleCart result:

    === Run information ===

    Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
    Relation:     acadêmico-weka.filters.unsupervised.attribute.Remove-R9
    Instances:    40
    Attributes:   9
                  disciplina
                  turma
                  ano_sem
                  professor
                  matricula
                  sexo
                  idade
                  curso
                  desempenho
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    CART Decision Tree

    professor=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
    |  matricula=(93106842)|(94201018)|(95280018)|(94206067)|(94108293)|(95280023)|(94112046)|(94112192)|(94103839)|(96104543)
    |  |  ano_sem=(1997/2)
    |  |  |  matricula=(94112192): BOM(1.0/0.0)
    |  |  |  matricula!=(94112192): REGULAR(2.0/0.0)
    |  |  ano_sem!=(1997/2): BOM(14.0/0.0)
    |  matricula!=(93106842)|(94201018)|(95280018)|(94206067)|(94108293)|(95280023)|(94112046)|(94112192)|(94103839)|(96104543)
    |  |  ano_sem=(1999/2)|(1997/2)|(1998/1)|(1998/2): REGULAR(1.0/0.0)
    |  |  ano_sem!=(1999/2)|(1997/2)|(1998/1)|(1998/2): RUIM(1.0/0.0)
    professor!=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
    |  matricula=(95280018)|(94206067)|(94108293)
    |  |  professor=(Karin)|(Arruda)|(Hubert)|(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
    |  |  |  disciplina=(46257-04): REGULAR(1.0/0.0)
    |  |  |  disciplina!=(46257-04): BOM(4.0/0.0)
    |  |  professor!=(Karin)|(Arruda)|(Hubert)|(Duncan)|(Bastos)|(Afonso)|(Ana-Paula): REGULAR(2.0/0.0)
    |  matricula!=(95280018)|(94206067)|(94108293)
    |  |  disciplina=(46266-04)|(46258-04): REGULAR(4.0/0.0)
    |  |  disciplina!=(46266-04)|(46258-04)
    |  |  |  disciplina=(46257-04)
    |  |  |  |  ano_sem=(1999/2)|(1997/2)|(1998/1)|(1998/2): REGULAR(1.0/0.0)
    |  |  |  |  ano_sem!=(1999/2)|(1997/2)|(1998/1)|(1998/2): BOM(1.0/0.0)
    |  |  |  disciplina!=(46257-04)
    |  |  |  |  ano_sem=(1999/1): REGULAR(1.0/0.0)
    |  |  |  |  ano_sem!=(1999/1)
    |  |  |  |  |  disciplina=(46256-04)|(46250-02)|(46252-04)|(46257-04)|(46258-04)|(46266-04)|(46267-04)
    |  |  |  |  |  |  matricula=(94112046): REGULAR(1.0/0.0)
    |  |  |  |  |  |  matricula!=(94112046)
    |  |  |  |  |  |  |  ano_sem=(1998/2)|(1997/2)|(1999/1)|(1999/2)
    |  |  |  |  |  |  |  |  matricula=(94112192)|(93106842)|(94108293)|(94112046)|(94201018)|(94206067)|(95280018)|(95280023)|(95280027)|(96104543): REGULAR(1.0/0.0)
    |  |  |  |  |  |  |  |  matricula!=(94112192)|(93106842)|(94108293)|(94112046)|(94201018)|(94206067)|(95280018)|(95280023)|(95280027)|(96104543): RUIM(1.0/0.0)
    |  |  |  |  |  |  |  ano_sem!=(1998/2)|(1997/2)|(1999/1)|(1999/2): RUIM(1.0/0.0)
    |  |  |  |  |  disciplina!=(46256-04)|(46250-02)|(46252-04)|(46257-04)|(46258-04)|(46266-04)|(46267-04)
    |  |  |  |  |  |  matricula=(94112046): RUIM(1.0/0.0)
    |  |  |  |  |  |  matricula!=(94112046): REGULAR(2.0/0.0)

    Number of Leaf Nodes: 18
    Size of the Tree: 35

    Time taken to build model: 0.05 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          17               42.5    %
    Incorrectly Classified Instances        23               57.5    %
    Kappa statistic                          0.0086
    Mean absolute error                      0.3833
    Root mean squared error                  0.6191
    Relative absolute error                 97.5    %
    Root relative squared error            139.9705 %
    Total Number of Instances               40

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.7       0.45      0.609       0.7      0.651       0.625      BOM
    0.188     0.375     0.25        0.188    0.214       0.406      REGULAR
    0         0.139     0           0        0           0.431      RUIM

    === Confusion Matrix ===

      a  b  c   <-- classified as
     14  5  1 |  a = BOM
      9  3  4 |  b = REGULAR
      0  4  0 |  c = RUIM

With the Nota column removed, J48 result:

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -U -M 1
    Relation:     acadêmico-weka.filters.unsupervised.attribute.Remove-R9
    Instances:    40
    Attributes:   9
                  disciplina
                  turma
                  ano_sem
                  professor
                  matricula
                  sexo
                  idade
                  curso
                  desempenho
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    J48 unpruned tree
    ------------------

    professor = Karin
    |   matricula = 93106842: REGULAR (0.0)
    |   matricula = 94103839: RUIM (1.0)
    |   matricula = 94108293: BOM (1.0)
    |   matricula = 94112046: REGULAR (1.0)
    |   matricula = 94112192
    |   |   ano_sem = 1997/2: REGULAR (0.0)
    |   |   ano_sem = 1998/1: RUIM (1.0)
    |   |   ano_sem = 1998/2: REGULAR (1.0)
    |   |   ano_sem = 1999/1: REGULAR (0.0)
    |   |   ano_sem = 1999/2: REGULAR (0.0)
    |   matricula = 94201018: REGULAR (0.0)
    |   matricula = 94206067: REGULAR (0.0)
    |   matricula = 95280018: BOM (1.0)
    |   matricula = 95280023: REGULAR (1.0)
    |   matricula = 95280027: REGULAR (0.0)
    |   matricula = 96104543: REGULAR (0.0)
    professor = Duncan: BOM (3.0)
    professor = Arruda
    |   disciplina = 46250-02: BOM (0.0)
    |   disciplina = 46251-04: BOM (0.0)
    |   disciplina = 46252-04: BOM (0.0)
    |   disciplina = 46256-04: BOM (0.0)
    |   disciplina = 46257-04
    |   |   matricula = 93106842: REGULAR (0.0)
    |   |   matricula = 94103839: REGULAR (0.0)
    |   |   matricula = 94108293: REGULAR (1.0)
    |   |   matricula = 94112046: BOM (1.0)
    |   |   matricula = 94112192: REGULAR (1.0)
    |   |   matricula = 94201018: REGULAR (0.0)
    |   |   matricula = 94206067: REGULAR (0.0)
    |   |   matricula = 95280018: REGULAR (0.0)
    |   |   matricula = 95280023: REGULAR (0.0)
    |   |   matricula = 95280027: REGULAR (0.0)
    |   |   matricula = 96104543: REGULAR (0.0)
    |   disciplina = 46258-04: BOM (0.0)
    |   disciplina = 46266-04: BOM (0.0)
    |   disciplina = 46267-04: BOM (1.0)
    professor = Yamaguti
    |   matricula = 93106842: REGULAR (0.0)
    |   matricula = 94103839: REGULAR (1.0)
    |   matricula = 94108293: REGULAR (0.0)
    |   matricula = 94112046: RUIM (1.0)
    |   matricula = 94112192: REGULAR (1.0)
    |   matricula = 94201018: REGULAR (0.0)
    |   matricula = 94206067: REGULAR (1.0)
    |   matricula = 95280018: REGULAR (0.0)
    |   matricula = 95280023: REGULAR (0.0)
    |   matricula = 95280027: REGULAR (0.0)
    |   matricula = 96104543: REGULAR (0.0)
    professor = Bastos: BOM (1.0)
    professor = Afonso: BOM (4.0)
    professor = Hubert
    |   matricula = 93106842: REGULAR (0.0)
    |   matricula = 94103839: REGULAR (1.0)
    |   matricula = 94108293: BOM (1.0)
    |   matricula = 94112046: REGULAR (1.0)
    |   matricula = 94112192: REGULAR (1.0)
    |   matricula = 94201018: REGULAR (0.0)
    |   matricula = 94206067: REGULAR (0.0)
    |   matricula = 95280018: REGULAR (0.0)
    |   matricula = 95280023: REGULAR (0.0)
    |   matricula = 95280027: REGULAR (0.0)
    |   matricula = 96104543: REGULAR (0.0)
    professor = Ana-Paula
    |   ano_sem = 1997/2
    |   |   sexo = M: REGULAR (2.0)
    |   |   sexo = F: BOM (1.0)
    |   ano_sem = 1998/1: BOM (2.0)
    |   ano_sem = 1998/2: BOM (2.0)
    |   ano_sem = 1999/1
    |   |   idade <= 20: RUIM (1.0)
    |   |   idade > 20: BOM (2.0)
    |   ano_sem = 1999/2: REGULAR (1.0)
    professor = Egidio: REGULAR (2.0)

    Number of Leaves  :     66
    Size of the tree :      76

    Time taken to build model: 0.02 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          19               47.5    %
    Incorrectly Classified Instances        21               52.5    %
    Kappa statistic                          0.0749
    Mean absolute error                      0.3632
    Root mean squared error                  0.5379
    Relative absolute error                 92.3777 %
    Root relative squared error            121.5963 %
    Total Number of Instances               40

    === Detailed Accuracy By Class ===

    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
    0.7       0.4       0.636       0.7      0.667       0.714      BOM
    0.313     0.417     0.333       0.313    0.323       0.488      REGULAR
    0         0.083     0           0        0           0.438      RUIM

    === Confusion Matrix ===

      a  b  c   <-- classified as
     14  6  0 |  a = BOM
      8  5  3 |  b = REGULAR
      0  4  0 |  c = RUIM

Try varying the minimum number of objects allowed in a leaf (minNumObj) and observe how the results change.
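When several configurations are tried, it helps to keep the confusion matrix of each run and compare the accuracies side by side. The sketch below is one possible way to do that: it records the four academico4.arff runs reported above (rows and columns in the order BOM, REGULAR, RUIM) and ranks them by accuracy. The names runs, accuracy and ranked are hypothetical; runs with other minNumObj values would simply be added to the dictionary.

```python
# Confusion matrices copied from the four runs on academico4.arff.
runs = {
    ("SimpleCart", "with nota"):    [[18, 2, 0], [1, 14, 1], [0, 0, 4]],
    ("J48", "with nota"):           [[16, 4, 0], [1, 14, 1], [0, 1, 3]],
    ("SimpleCart", "without nota"): [[14, 5, 1], [9, 3, 4], [0, 4, 0]],
    ("J48", "without nota"):        [[14, 6, 0], [8, 5, 3], [0, 4, 0]],
}


def accuracy(matrix):
    """Fraction of instances on the main diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def ranked(all_runs):
    """Return (accuracy, algorithm, setup) tuples, best accuracy first."""
    rows = [(accuracy(m), algo, setup) for (algo, setup), m in all_runs.items()]
    return sorted(rows, reverse=True)


for acc, algo, setup in ranked(runs):
    print(f"{algo:<10} {setup:<13} {acc:.1%}")
# prints:
# SimpleCart with nota     90.0%
# J48        with nota     82.5%
# J48        without nota  47.5%
# SimpleCart without nota  42.5%
```

The listing makes the effect of the nota attribute visible at a glance: both algorithms lose roughly half of their accuracy once it is removed.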