PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO GRANDE DO SUL

FACULDADE DE INFORMÁTICA
COURSE: Aspectos Avançados em Banco de Dados (Advanced Topics in Databases)
SECTION: 168, Prof. Duncan
Data mining exercises on Classification and Association Rules:
This exercise uses the .arff files available on the course page.
The tool used for the practice is WEKA, version 3.5.7.
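For readers who have not seen an .arff file before, the following is a minimal sketch of the format and of how it could be read outside WEKA. The file contents are an assumption: the attribute names and the relation name come from the WEKA output in this handout, while the ten records are the textbook table this data set is based on, which is consistent with the trees printed here. The helper parse_arff is hypothetical and handles only dense files with numeric and nominal attributes.

```python
# Assumed contents of TAN_Ex_DT.arff (not copied from the actual file).
ARFF_TEXT = """\
@relation TAN_DecisionTree

@attribute TID numeric
@attribute Refund {NO,YES}
@attribute Marital_Status {Single,Married,Divorced}
@attribute Taxable_Income numeric
@attribute Cheat {NO,YES}

@data
1,YES,Single,125,NO
2,NO,Married,100,NO
3,NO,Single,70,NO
4,YES,Married,120,NO
5,NO,Divorced,95,YES
6,NO,Married,60,NO
7,YES,Divorced,220,NO
8,NO,Single,85,YES
9,NO,Married,75,NO
10,NO,Single,90,YES
"""


def parse_arff(text):
    """Parse a dense ARFF text with numeric and nominal attributes only."""
    relation, attributes, rows, in_data = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([
                float(v) if kind == "numeric" else v
                for v, (_, kind) in zip(values, attributes)
            ])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            numeric = kind.lower() in ("numeric", "real", "integer")
            attributes.append((name, "numeric" if numeric else "nominal"))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

The header declares one @attribute line per column, and everything after @data is one comma-separated instance per line, which is what WEKA reports as Attributes and Instances in the run information.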
Start WEKA and go to Applications -> Explorer. Select Open file...
Practice 1: file TAN_Ex_DT.arff. This is the data set used in the slides for chapter 4 of TAN.
- open the .arff file
- select the Classify tab
- press the Choose button and select trees -> SimpleCart
- press Start; the result should be as follows:
=== Run information ===
Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 2.0 -N 5 -C 1.0
Relation:     TAN_DecisionTree
Instances:    10
Attributes:   5
              TID
              Refund
              Marital_Status
              Taxable_Income
              Cheat
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
CART Decision Tree
: NO(7.0/3.0)
Number of Leaf Nodes: 1
Size of the Tree: 1
Time taken to build model: 0.22 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances           6               66.6667 %
Incorrectly Classified Instances         3               33.3333 %
Kappa statistic                         -0.1739
Mean absolute error                      0.5062
Root mean squared error                  0.5653
Relative absolute error                113.8889 %
Root relative squared error            120.4266 %
Total Number of Instances                9
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.857     1         0.75        0.857    0.8         0          NO
0         0.143     0           0        0           0          YES
=== Confusion Matrix ===
 a b   <-- classified as
 6 1 | a = NO
 2 0 | b = YES
Note that the “tree” has a single node, : NO(7.0/3.0), which means that the algorithm classified every
object as Cheat = NO, getting 7 right and 3 wrong.
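The accuracy and Kappa figures in the summary can be recomputed from the confusion matrix. The sketch below assumes the usual definition of Cohen's kappa (observed agreement corrected for the agreement expected by chance) and takes the matrix exactly as printed above; the function name accuracy_and_kappa is only illustrative.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] = number of instances of actual class i classified as j.
    """
    size = len(matrix)
    total = sum(map(sum, matrix))
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


matrix = [[6, 1],   # actual NO:  6 classified as NO, 1 as YES
          [2, 0]]   # actual YES: 2 classified as NO, 0 as YES
acc, kappa = accuracy_and_kappa(matrix)
print(f"accuracy = {acc:.4%}, kappa = {kappa:.4f}")
```

A negative kappa, as here, indicates that the classifier agrees with the true labels less often than chance agreement alone would predict.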
- Click the algorithm name: SimpleCart;
- Change minNumObj to 1 and usePrune to False;
- Press Start again; the tree shown should be this:
CART Decision Tree
TID < 4.5: NO(4.0/0.0)
TID >= 4.5
| Marital_Status=(Single): YES(2.0/0.0)
| Marital_Status!=(Single)
| | TID < 5.5: YES(1.0/0.0)
| | TID >= 5.5: NO(3.0/0.0)
Number of Leaf Nodes: 4
Size of the Tree: 7
Note that TID has become part of the tree even though it carries no meaning for the objects. To disregard it, do the following:
- click the Preprocess tab, select the TID attribute under Attributes and press the Remove button;
- click the Classify tab and press Start; the output should be this:
=== Run information ===
Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
Relation:     TAN_DecisionTree-weka.filters.unsupervised.attribute.Remove-R1
Instances:    10
Attributes:   4
              Refund
              Marital_Status
              Taxable_Income
              Cheat
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
CART Decision Tree
Marital_Status=(Single)|(Divorced)
| Refund=(NO)
| | Taxable_Income < 77.5: NO(1.0/0.0)
| | Taxable_Income >= 77.5: YES(3.0/0.0)
| Refund!=(NO): NO(2.0/0.0)
Marital_Status!=(Single)|(Divorced): NO(4.0/0.0)
Number of Leaf Nodes: 4
Size of the Tree: 7
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances           7               70      %
Incorrectly Classified Instances         3               30      %
Kappa statistic                          0.3478
Mean absolute error                      0.3
Root mean squared error                  0.5477
Relative absolute error                 63.4615 %
Root relative squared error            109.2739 %
Total Number of Instances               10
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.714     0.333     0.833       0.714    0.769       0.69       NO
0.667     0.286     0.5         0.667    0.571       0.69       YES
=== Confusion Matrix ===
 a b   <-- classified as
 5 2 | a = NO
 1 2 | b = YES
The resulting tree is now consistent with the one seen in class. The changes to the algorithm settings
(minNumObj set to 1 and usePrune set to False) made it grow the tree down to the
last level of the hierarchy without trying to shrink it by any pruning criterion.
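The choice of Marital_Status at the root can be checked by hand. This is a minimal sketch, assuming the split is scored with the Gini index (the usual impurity measure for CART-style trees) and leaving out Taxable_Income. The class counts are given as (NO, YES): the root and Marital_Status counts are read off the SimpleCart leaves above, and the Refund counts off the J48 tree printed later in this handout.

```python
def gini(counts):
    """Gini index of a node given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Average Gini of the child nodes, weighted by branch size."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)


# Class counts as (NO, YES).
root = (7, 3)
by_marital = [(3, 3), (4, 0)]  # Single|Divorced versus Married
by_refund = [(4, 3), (3, 0)]   # Refund=NO versus Refund=YES

print(f"root node      : {gini(root):.3f}")
print(f"Marital_Status : {weighted_gini(by_marital):.3f}")
print(f"Refund         : {weighted_gini(by_refund):.3f}")
```

The lower the weighted Gini after the split, the purer the children, so under these assumptions the Marital_Status split is preferred over Refund at the root.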
IMPORTANT: Removing attributes that are irrelevant or meaningless for the problem is a data preparation task
that the algorithms do not necessarily perform. In other words, the user has to do this preparation manually.
Using the same file, let us now run another classification algorithm: J48 (WEKA's
implementation of the C4.5 algorithm).
- click the Choose button and select trees -> J48;
- configure it (click the algorithm name) with minNumObj = 1 and unpruned = True;
- press Start; the output should be this:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -U -M 1
Relation:     TAN_DecisionTree-weka.filters.unsupervised.attribute.Remove-R1
Instances:    10
Attributes:   4
              Refund
              Marital_Status
              Taxable_Income
              Cheat
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
J48 unpruned tree
-----------------
Refund = NO
|   Marital_Status = Single
|   |   Taxable_Income <= 75: NO (1.0)
|   |   Taxable_Income > 75: YES (2.0)
|   Marital_Status = Married: NO (3.0)
|   Marital_Status = Divorced: YES (1.0)
Refund = YES: NO (3.0)
Number of Leaves  : 5
Size of the tree : 8
Time taken to build model: 0.03 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances           5               50      %
Incorrectly Classified Instances         5               50      %
Kappa statistic                          0.0741
Mean absolute error                      0.5
Root mean squared error                  0.7071
Relative absolute error                105.7692 %
Root relative squared error            141.072  %
Total Number of Instances               10
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.429     0.333     0.75        0.429    0.545       0.548      NO
0.667     0.571     0.333       0.667    0.444       0.548      YES
=== Confusion Matrix ===
 a b   <-- classified as
 3 4 | a = NO
 1 2 | b = YES
The results differ; the trees are different! The tree produced by J48 can be viewed
graphically: in the Result list, right-click trees.J48 and select Visualize tree.
The SimpleCart algorithm offers no tree visualization.
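One reason the trees differ is the split criterion: C4.5, which J48 implements, ranks attributes by gain ratio and creates one branch per nominal value, whereas CART-style learners use binary splits. The sketch below is a simplified view of the root choice that leaves out Taxable_Income and J48's further selection heuristics. Class counts are (NO, YES); the Refund counts come from the J48 tree above, while the per-value Marital_Status counts are an assumption taken from the textbook table behind this data set.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(parent, branches):
    """Information gain of a split divided by its split information."""
    total = sum(parent)
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(parent) - remainder
    split_info = entropy([sum(b) for b in branches])
    return gain / split_info


# Class counts as (NO, YES).
root = (7, 3)
refund = [(4, 3), (3, 0)]            # Refund = NO, Refund = YES
marital = [(2, 2), (4, 0), (1, 1)]   # Single, Married, Divorced (assumed)

print(f"Refund         : {gain_ratio(root, refund):.4f}")
print(f"Marital_Status : {gain_ratio(root, marital):.4f}")
```

Marital_Status has the larger raw information gain, but its three-way split is penalized by a larger split information, so under these assumptions Refund ends up with the higher gain ratio, in line with the root of the J48 tree.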
Compare the results of the two algorithms in terms of accuracy (rate of correct classifications) and
confusion matrix:
SimpleCart: Correctly Classified Instances           7               70      %
=== Confusion Matrix ===
 a b   <-- classified as
 5 2 | a = NO
 1 2 | b = YES
J48: Correctly Classified Instances           5               50      %
=== Confusion Matrix ===
 a b   <-- classified as
 3 4 | a = NO
 1 2 | b = YES
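To make the comparison concrete, the per-class rates can be derived from the two confusion matrices. This is a minimal sketch using the standard definitions of TP rate (recall), FP rate and precision; the function name per_class_metrics is only illustrative.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (tp_rate, fp_rate, precision)} from a confusion matrix.

    matrix[i][j] = number of instances of actual class i classified as j.
    """
    total = sum(map(sum, matrix))
    result = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        actual = sum(matrix[k])                  # row total
        predicted = sum(row[k] for row in matrix)  # column total
        fp = predicted - tp
        negatives = total - actual
        tp_rate = tp / actual if actual else 0.0
        fp_rate = fp / negatives if negatives else 0.0
        precision = tp / predicted if predicted else 0.0
        result[label] = (tp_rate, fp_rate, precision)
    return result


LABELS = ["NO", "YES"]
simple_cart = [[5, 2], [1, 2]]
j48 = [[3, 4], [1, 2]]

for name, matrix in (("SimpleCart", simple_cart), ("J48", j48)):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    print(f"{name}: accuracy {correct / sum(map(sum, matrix)):.0%}")
    for label, rates in per_class_metrics(matrix, LABELS).items():
        tp_rate, fp_rate, precision = rates
        print(f"  {label}: TP rate {tp_rate:.3f}, "
              f"FP rate {fp_rate:.3f}, precision {precision:.3f}")
```

Both classifiers find the same two YES instances; the difference lies in how many NO instances each one wrongly flags as YES, which is what the FP rate and the precision for YES capture.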
“Which is the better classifier, in your opinion?”
Carry out the same analysis for the file academico4.arff. To do so, use WEKA to remove the
NOTA column.
SimpleCart result:
=== Run information ===
Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
Relation:     acadêmico
Instances:    40
Attributes:   10
              disciplina
              turma
              ano_sem
              professor
              matricula
              sexo
              idade
              curso
              nota
              desempenho
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
CART Decision Tree
nota < 7.5
| nota < 5.5
| | disciplina=(46258-04): REGULAR(1.0/0.0)
| | disciplina!=(46258-04): RUIM(4.0/0.0)
| nota >= 5.5
| | turma=(138): BOM(1.0/0.0)
| | turma!=(138): REGULAR(14.0/0.0)
nota >= 7.5
| professor=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)|(Arruda)|(Karin)|(Hubert)|(Egidio): BOM(19.0/0.0)
| professor!=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)|(Arruda)|(Karin)|(Hubert)|(Egidio): REGULAR(1.0/0.0)
Number of Leaf Nodes: 6
Size of the Tree: 11
Time taken to build model: 0.05 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          36               90      %
Incorrectly Classified Instances         4               10      %
Kappa statistic                          0.8305
Mean absolute error                      0.0667
Root mean squared error                  0.2582
Relative absolute error                 16.9565 %
Root relative squared error             58.3717 %
Total Number of Instances               40
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.9       0.05      0.947       0.9      0.923       0.925      BOM
0.875     0.083     0.875       0.875    0.875       0.896      REGULAR
1         0.028     0.8         1        0.889       0.986      RUIM
=== Confusion Matrix ===
  a  b  c   <-- classified as
 18  2  0 |  a = BOM
  1 14  1 |  b = REGULAR
  0  0  4 |  c = RUIM
J48 result:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -U -M 1
Relation:     acadêmico
Instances:    40
Attributes:   10
              disciplina
              turma
              ano_sem
              professor
              matricula
              sexo
              idade
              curso
              nota
              desempenho
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
J48 unpruned tree
-----------------
nota <= 7
|   nota <= 5
|   |   disciplina = 46250-02: RUIM (1.0)
|   |   disciplina = 46251-04: RUIM (1.0)
|   |   disciplina = 46252-04: RUIM (0.0)
|   |   disciplina = 46256-04: RUIM (2.0)
|   |   disciplina = 46257-04: RUIM (0.0)
|   |   disciplina = 46258-04: REGULAR (1.0)
|   |   disciplina = 46266-04: RUIM (0.0)
|   |   disciplina = 46267-04: RUIM (0.0)
|   nota > 5
|   |   turma = 128: REGULAR (14.0)
|   |   turma = 138: BOM (1.0)
nota > 7
|   professor = Karin: BOM (2.0)
|   professor = Duncan: BOM (3.0)
|   professor = Arruda: BOM (2.0)
|   professor = Yamaguti: REGULAR (1.0)
|   professor = Bastos: BOM (1.0)
|   professor = Afonso: BOM (3.0)
|   professor = Hubert: BOM (1.0)
|   professor = Ana-Paula: BOM (7.0)
|   professor = Egidio: BOM (0.0)
Number of Leaves  : 19
Size of the tree : 24
Time taken to build model: 0.03 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          33               82.5    %
Incorrectly Classified Instances         7               17.5    %
Kappa statistic                          0.7021
Mean absolute error                      0.1292
Root mean squared error                  0.3461
Relative absolute error                 32.8533 %
Root relative squared error             78.2459 %
Total Number of Instances               40
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.8       0.05      0.941       0.8      0.865       0.874      BOM
0.875     0.208     0.737       0.875    0.8         0.826      REGULAR
0.75      0.028     0.75        0.75     0.75        0.854      RUIM
=== Confusion Matrix ===
  a  b  c   <-- classified as
 16  4  0 |  a = BOM
  1 14  1 |  b = REGULAR
  0  1  3 |  c = RUIM
With the Nota column removed; SimpleCart result:
=== Run information ===
Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 1.0 -N 5 -U -C 1.0
Relation:     acadêmico-weka.filters.unsupervised.attribute.Remove-R9
Instances:    40
Attributes:   9
              disciplina
              turma
              ano_sem
              professor
              matricula
              sexo
              idade
              curso
              desempenho
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
CART Decision Tree
professor=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
| matricula=(93106842)|(94201018)|(95280018)|(94206067)|(94108293)|(95280023)|(94112046)|(94112192)|(94103839)|(96104543)
| | ano_sem=(1997/2)
| | | matricula=(94112192): BOM(1.0/0.0)
| | | matricula!=(94112192): REGULAR(2.0/0.0)
| | ano_sem!=(1997/2): BOM(14.0/0.0)
| matricula!=(93106842)|(94201018)|(95280018)|(94206067)|(94108293)|(95280023)|(94112046)|(94112192)|(94103839)|(96104543)
| | ano_sem=(1999/2)|(1997/2)|(1998/1)|(1998/2): REGULAR(1.0/0.0)
| | ano_sem!=(1999/2)|(1997/2)|(1998/1)|(1998/2): RUIM(1.0/0.0)
professor!=(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
| matricula=(95280018)|(94206067)|(94108293)
| | professor=(Karin)|(Arruda)|(Hubert)|(Duncan)|(Bastos)|(Afonso)|(Ana-Paula)
| | | disciplina=(46257-04): REGULAR(1.0/0.0)
| | | disciplina!=(46257-04): BOM(4.0/0.0)
| | professor!=(Karin)|(Arruda)|(Hubert)|(Duncan)|(Bastos)|(Afonso)|(Ana-Paula): REGULAR(2.0/0.0)
| matricula!=(95280018)|(94206067)|(94108293)
| | disciplina=(46266-04)|(46258-04): REGULAR(4.0/0.0)
| | disciplina!=(46266-04)|(46258-04)
| | | disciplina=(46257-04)
| | | | ano_sem=(1999/2)|(1997/2)|(1998/1)|(1998/2): REGULAR(1.0/0.0)
| | | | ano_sem!=(1999/2)|(1997/2)|(1998/1)|(1998/2): BOM(1.0/0.0)
| | | disciplina!=(46257-04)
| | | | ano_sem=(1999/1): REGULAR(1.0/0.0)
| | | | ano_sem!=(1999/1)
| | | | | disciplina=(46256-04)|(46250-02)|(46252-04)|(46257-04)|(46258-04)|(46266-04)|(46267-04)
| | | | | | matricula=(94112046): REGULAR(1.0/0.0)
| | | | | | matricula!=(94112046)
| | | | | | | ano_sem=(1998/2)|(1997/2)|(1999/1)|(1999/2)
| | | | | | | | matricula=(94112192)|(93106842)|(94108293)|(94112046)|(94201018)|(94206067)|(95280018)|(95280023)|(95280027)|(96104543): REGULAR(1.0/0.0)
| | | | | | | | matricula!=(94112192)|(93106842)|(94108293)|(94112046)|(94201018)|(94206067)|(95280018)|(95280023)|(95280027)|(96104543): RUIM(1.0/0.0)
| | | | | | | ano_sem!=(1998/2)|(1997/2)|(1999/1)|(1999/2): RUIM(1.0/0.0)
| | | | | disciplina!=(46256-04)|(46250-02)|(46252-04)|(46257-04)|(46258-04)|(46266-04)|(46267-04)
| | | | | | matricula=(94112046): RUIM(1.0/0.0)
| | | | | | matricula!=(94112046): REGULAR(2.0/0.0)
Number of Leaf Nodes: 18
Size of the Tree: 35
Time taken to build model: 0.05 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          17               42.5    %
Incorrectly Classified Instances        23               57.5    %
Kappa statistic                          0.0086
Mean absolute error                      0.3833
Root mean squared error                  0.6191
Relative absolute error                 97.5    %
Root relative squared error            139.9705 %
Total Number of Instances               40
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.7       0.45      0.609       0.7      0.651       0.625      BOM
0.188     0.375     0.25        0.188    0.214       0.406      REGULAR
0         0.139     0           0        0           0.431      RUIM
=== Confusion Matrix ===
  a  b  c   <-- classified as
 14  5  1 |  a = BOM
  9  3  4 |  b = REGULAR
  0  4  0 |  c = RUIM
With the Nota column removed, using J48:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -U -M 1
Relation:     acadêmico-weka.filters.unsupervised.attribute.Remove-R9
Instances:    40
Attributes:   9
              disciplina
              turma
              ano_sem
              professor
              matricula
              sexo
              idade
              curso
              desempenho
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
J48 unpruned tree
-----------------
professor = Karin
|   matricula = 93106842: REGULAR (0.0)
|   matricula = 94103839: RUIM (1.0)
|   matricula = 94108293: BOM (1.0)
|   matricula = 94112046: REGULAR (1.0)
|   matricula = 94112192
|   |   ano_sem = 1997/2: REGULAR (0.0)
|   |   ano_sem = 1998/1: RUIM (1.0)
|   |   ano_sem = 1998/2: REGULAR (1.0)
|   |   ano_sem = 1999/1: REGULAR (0.0)
|   |   ano_sem = 1999/2: REGULAR (0.0)
|   matricula = 94201018: REGULAR (0.0)
|   matricula = 94206067: REGULAR (0.0)
|   matricula = 95280018: BOM (1.0)
|   matricula = 95280023: REGULAR (1.0)
|   matricula = 95280027: REGULAR (0.0)
|   matricula = 96104543: REGULAR (0.0)
professor = Duncan: BOM (3.0)
professor = Arruda
|   disciplina = 46250-02: BOM (0.0)
|   disciplina = 46251-04: BOM (0.0)
|   disciplina = 46252-04: BOM (0.0)
|   disciplina = 46256-04: BOM (0.0)
|   disciplina = 46257-04
|   |   matricula = 93106842: REGULAR (0.0)
|   |   matricula = 94103839: REGULAR (0.0)
|   |   matricula = 94108293: REGULAR (1.0)
|   |   matricula = 94112046: BOM (1.0)
|   |   matricula = 94112192: REGULAR (1.0)
|   |   matricula = 94201018: REGULAR (0.0)
|   |   matricula = 94206067: REGULAR (0.0)
|   |   matricula = 95280018: REGULAR (0.0)
|   |   matricula = 95280023: REGULAR (0.0)
|   |   matricula = 95280027: REGULAR (0.0)
|   |   matricula = 96104543: REGULAR (0.0)
|   disciplina = 46258-04: BOM (0.0)
|   disciplina = 46266-04: BOM (0.0)
|   disciplina = 46267-04: BOM (1.0)
professor = Yamaguti
|   matricula = 93106842: REGULAR (0.0)
|   matricula = 94103839: REGULAR (1.0)
|   matricula = 94108293: REGULAR (0.0)
|   matricula = 94112046: RUIM (1.0)
|   matricula = 94112192: REGULAR (1.0)
|   matricula = 94201018: REGULAR (0.0)
|   matricula = 94206067: REGULAR (1.0)
|   matricula = 95280018: REGULAR (0.0)
|   matricula = 95280023: REGULAR (0.0)
|   matricula = 95280027: REGULAR (0.0)
|   matricula = 96104543: REGULAR (0.0)
professor = Bastos: BOM (1.0)
professor = Afonso: BOM (4.0)
professor = Hubert
|   matricula = 93106842: REGULAR (0.0)
|   matricula = 94103839: REGULAR (1.0)
|   matricula = 94108293: BOM (1.0)
|   matricula = 94112046: REGULAR (1.0)
|   matricula = 94112192: REGULAR (1.0)
|   matricula = 94201018: REGULAR (0.0)
|   matricula = 94206067: REGULAR (0.0)
|   matricula = 95280018: REGULAR (0.0)
|   matricula = 95280023: REGULAR (0.0)
|   matricula = 95280027: REGULAR (0.0)
|   matricula = 96104543: REGULAR (0.0)
professor = Ana-Paula
|   ano_sem = 1997/2
|   |   sexo = M: REGULAR (2.0)
|   |   sexo = F: BOM (1.0)
|   ano_sem = 1998/1: BOM (2.0)
|   ano_sem = 1998/2: BOM (2.0)
|   ano_sem = 1999/1
|   |   idade <= 20: RUIM (1.0)
|   |   idade > 20: BOM (2.0)
|   ano_sem = 1999/2: REGULAR (1.0)
professor = Egidio: REGULAR (2.0)
Number of Leaves  : 66
Size of the tree : 76
Time taken to build model: 0.02 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          19               47.5    %
Incorrectly Classified Instances        21               52.5    %
Kappa statistic                          0.0749
Mean absolute error                      0.3632
Root mean squared error                  0.5379
Relative absolute error                 92.3777 %
Root relative squared error            121.5963 %
Total Number of Instances               40
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.7       0.4       0.636       0.7      0.667       0.714      BOM
0.313     0.417     0.333       0.313    0.323       0.488      REGULAR
0         0.083     0           0        0           0.438      RUIM
=== Confusion Matrix ===
  a  b  c   <-- classified as
 14  6  0 |  a = BOM
  8  5  3 |  b = REGULAR
  0  4  0 |  c = RUIM
Try varying the minimum number of objects per leaf (minNumObj) and observe how the
results change.
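The effect of a minimum leaf size can be explored with a toy tree learner. The sketch below is not WEKA's SimpleCart or J48: it is a simplified greedy learner with binary Gini splits and no pruning, in which the hypothetical parameter min_leaf plays the role of minNumObj by rejecting any split that would leave fewer than min_leaf objects in a branch. The ten rows are the assumed contents of TAN_Ex_DT.arff without TID.

```python
from collections import Counter

# (Refund, Marital_Status, Taxable_Income, Cheat); assumed file contents.
ROWS = [
    ("YES", "Single", 125, "NO"), ("NO", "Married", 100, "NO"),
    ("NO", "Single", 70, "NO"), ("YES", "Married", 120, "NO"),
    ("NO", "Divorced", 95, "YES"), ("NO", "Married", 60, "NO"),
    ("YES", "Divorced", 220, "NO"), ("NO", "Single", 85, "YES"),
    ("NO", "Married", 75, "NO"), ("NO", "Single", 90, "YES"),
]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def candidate_tests(rows):
    """Yield boolean tests: 'nominal value == v' and 'income < threshold'."""
    for col in (0, 1):
        for value in sorted({r[col] for r in rows}):
            yield lambda r, c=col, v=value: r[c] == v
    incomes = sorted({r[2] for r in rows})
    for low, high in zip(incomes, incomes[1:]):
        yield lambda r, t=(low + high) / 2: r[2] < t


def build(rows, min_leaf):
    """Grow a tree; a leaf is a class label, an inner node is (test, yes, no)."""
    labels = [r[3] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    need = max(1, min_leaf)  # never allow an empty branch
    best = None
    for test in candidate_tests(rows):
        yes = [r for r in rows if test(r)]
        no = [r for r in rows if not test(r)]
        if len(yes) < need or len(no) < need:
            continue  # this split would create a branch that is too small
        score = (len(yes) * gini([r[3] for r in yes])
                 + len(no) * gini([r[3] for r in no])) / len(rows)
        if best is None or score < best[0]:
            best = (score, test, yes, no)
    if best is None:
        return majority
    _, test, yes, no = best
    return (test, build(yes, min_leaf), build(no, min_leaf))


def count_leaves(tree):
    if isinstance(tree, str):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])


def predict(tree, row):
    while not isinstance(tree, str):
        test, yes, no = tree
        tree = yes if test(row) else no
    return tree


for min_leaf in (1, 2, 3, 6):
    tree = build(ROWS, min_leaf)
    hits = sum(predict(tree, r) == r[3] for r in ROWS)
    print(f"min_leaf={min_leaf}: {count_leaves(tree)} leaves, "
          f"{hits}/{len(ROWS)} correct on the training data")
```

With min_leaf = 1 the sketch keeps splitting until every leaf is pure, so it fits all ten training rows; with min_leaf = 6 no split of ten rows is allowed and the tree collapses to a single majority-class leaf.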