This is the third article in a series (links to the first and second articles) describing a machine learning system based on lattice theory, called the "VKF system". It uses a structural (lattice-theoretic) approach to representing training examples and their fragments, which are regarded as causes of the target property. The system computes these fragments as similarities between certain subsets of training examples. There is an algebraic theory of such representations called Formal Concept Analysis (FCA).


However, the described system uses probabilistic algorithms to eliminate the drawbacks of the unrestricted approach. The details are below.


Areas of application of FCA


Introduction


We begin by demonstrating our approach on a school-level problem.


The task is to find sufficient conditions for a convex quadrilateral to admit a circumscribed circle, and to predict this property for the rectangle.


Hence there are two classes: positive (a circle can be circumscribed around the quadrilateral) and negative.


The training set consists of a square, an isosceles trapezoid, a rhombus and a deltoid (see the row labels in the table).


The only test case is the rectangle.


We represent each quadrilateral by a set of features associated with its symmetries:


"There is a center of symmetry" ( A ),
"The rotation group is trivial" ( B ),
"rotation group is nontrivial" ( C ),
"There is a diagonal axis of symmetry" ( D ),
"There is an off-diagonal axis of symmetry" ( E ).
They correspond to the column names in the table.


quadrilateral   goal   A   B   C   D   E
square           1     1   0   1   1   1
trapezoid        1     0   1   0   0   1
rhombus          0     1   0   1   1   0
deltoid          0     0   1   0   1   0
rectangle        ?     1   0   1   0   1

To detect possible causes (in terms of symmetries), the program computes the similarities (common features) of training examples of the same sign. For the positive examples we obtain


$\langle\{square, trapezoid\}, \{E\}\rangle,$


where the first component contains the parents (the training examples whose similarity was computed), and the second one is the common fragment of these examples.


Since the common fragment $\{E\}$ is a subset of the description $\{A, C, E\}$ of the rectangle, the program predicts its target property to be positive, i.e. a circle can be circumscribed around the rectangle. This corresponds to the extension-by-analogy procedure of the JSM method. The parents (the square and the isosceles trapezoid) whose common fragment also occurs in the rectangle become analogues of the rectangle.


However, we can also swap the signs: the similarity of the negative examples is


$\langle\{rhombus, deltoid\}, \{D\}\rangle.$


This observation leads to the logic of argumentation, but we prefer not to go into details here. The interested reader is referred to the author's articles in the collection:
Finn V.K., Anshakov O.M., Vinogradov D.V. (Eds.). Multivalued Logics and Their Applications. Volume 2: Logics in Artificial Intelligence Systems. M.: URSS, 2020, 238 p. ISBN 978-5-382-01977-2.


It is important that the similarity of the negative examples demonstrates the principle of "banning counter-examples": its fragment $\{D\}$ is contained in the description $\{A, C, D, E\}$ of an example of the opposite sign (the square).
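
The whole toy example fits into a few lines of Python; this is only an illustrative sketch with ad hoc names, not code of the VKF system itself:

# descriptions from the table above: each quadrilateral as its set of features
positive = {"square": {"A", "C", "D", "E"}, "trapezoid": {"B", "E"}}
negative = {"rhombus": {"A", "C", "D"}, "deltoid": {"B", "D"}}
rectangle = {"A", "C", "E"}

# similarity of the positive examples = their common fragment
fragment = positive["square"] & positive["trapezoid"]       # {'E'}

# ban of counter-examples: the fragment must not occur in any negative example
banned = any(fragment <= neg for neg in negative.values())  # False

# prediction by analogy: the fragment occurs in the rectangle's description
print((not banned) and fragment <= rectangle)               # True -> positive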


1. Formal Concept Analysis


Initially, the author planned to present the theory in terms of the JSM method of automatic hypothesis generation. But its creator expressed doubt that "it is possible to present the deep ideas of the JSM method in a popular language." Therefore, the author decided to use the FCA language for this article. However, he will use some of his own terms (with the original ones in brackets) where a change of terminology seems preferable.


A sample (= formal context) is a triple $(G, M, I)$, where $G$ and $M$ are finite sets and $I\subseteq G\times M$. The elements of $G$ and $M$ are called objects and attributes (features), respectively. As usual, we write $gIm$ instead of $\langle g, m\rangle\in I$ to indicate that object $g$ has attribute $m$.


For $A\subseteq G$ and $B\subseteq M$ define


$A'=\{m\in M\mid \forall g\in A\; (gIm)\}, \\ B'=\{g\in G\mid \forall m\in B\; (gIm)\};$


so $A'$ is the set of all attributes common to all objects from $A$, and $B'$ is the set of all objects that possess all attributes from $B$. The mappings $(\cdot)': 2^{G}\rightarrow 2^{M}$ and $(\cdot)': 2^{M}\rightarrow 2^{G}$ are called polars (= derivation operators) of the sample $(G, M, I)$.
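
As a sketch, the two polars can be written in Python, assuming the sample is stored as a mapping from objects to their attribute sets (the function names below are chosen for illustration only):

context = {"square": {"A", "C", "D", "E"}, "trapezoid": {"B", "E"}}

def polar_objects(A, context):
    # A': attributes common to all objects of A (all of M when A is empty)
    M = set().union(*context.values())
    return M.intersection(*(context[g] for g in A))

def polar_attributes(B, context):
    # B': objects possessing every attribute of B
    return {g for g, attrs in context.items() if B <= attrs}

print(polar_objects({"square", "trapezoid"}, context))  # {'E'}
print(polar_attributes({"E"}, context))                 # {'square', 'trapezoid'}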


A candidate (= formal concept) of the sample $(G, M, I)$ is a pair $\langle A, B\rangle$, where $A\subseteq G$, $B\subseteq M$, $A'=B$ and $B'=A$. The first component $A$ of the candidate $\langle A, B\rangle$ is called its list of parents (= extent), and the second component $B$ is called its fragment (= intent). The set of all candidates of the sample $(G, M, I)$ is denoted by $L(G, M, I)$.


It is an easy exercise to verify that $L(G, M, I)$ forms a lattice with the operations


$\langle A_{1}, B_{1}\rangle\vee\langle A_{2}, B_{2}\rangle=\langle (A_{1}\cup A_{2})'', B_{1}\cap B_{2}\rangle, \\ \langle A_{1}, B_{1}\rangle\wedge\langle A_{2}, B_{2}\rangle=\langle A_{1}\cap A_{2}, (B_{1}\cup B_{2})''\rangle.$


We use a special case: for $\langle A, B\rangle\in L(G, M, I)$, $g\in G$ and $m\in M$ define


$CbO(\langle A, B\rangle, g)=\langle (A\cup\{g\})'', B\cap\{g\}'\rangle, \\ CbO(\langle A, B\rangle, m)=\langle A\cap\{m\}', (B\cup\{m\})''\rangle.$


We call these operations CbO, since the first one was used in the well-known Close-by-One (CbO) algorithm to generate all elements of $L(G, M, I)$.
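
On top of the polars sketched above, the CbO operations become short functions; the simplification relies on $\langle A, B\rangle$ being a candidate, so that $(A\cup\{g\})'=B\cap\{g\}'$ and $(B\cup\{m\})'=A\cap\{m\}'$ (again only a sketch):

def cbo_object(A, B, g, context):
    # CbO(<A,B>, g) = <(A U {g})'', B n {g}'>
    new_B = B & context[g]
    return polar_attributes(new_B, context), new_B

def cbo_attribute(A, B, m, context):
    # CbO(<A,B>, m) = <A n {m}', (B U {m})''>
    new_A = A & polar_attributes({m}, context)
    return new_A, polar_objects(new_A, context)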


The most important property of the CbO operations, their monotonicity, is stated in the following lemma.


Let $(G,M,I)$ be a sample, $\langle A_{1},B_{1}\rangle, \langle A_{2},B_{2}\rangle\in L(G,M,I)$, $g\in G$ and $m\in M$. Then


$\langle A_{1},B_{1}\rangle\leq \langle A_{2},B_{2}\rangle\Rightarrow CbO(\langle A_{1},B_{1}\rangle,g)\leq CbO(\langle A_{2},B_{2}\rangle,g), \\ \langle A_{1},B_{1}\rangle\leq\langle A_{2},B_{2}\rangle\Rightarrow CbO(\langle A_{1},B_{1}\rangle,m)\leq CbO(\langle A_{2},B_{2}\rangle,m).$


2. Problems with FCA


Unfortunately, the author and his colleagues discovered some theoretical shortcomings of applying FCA to machine learning directly:


  1. In the worst case, the number of hypotheses can be exponentially large with respect to the size of the input data (the training sample).


  2. The problem of detecting large hypotheses is computationally (NP-)hard.


  3. Overfitting is unavoidable and does occur in practice.


  4. There exist 'phantom' similarities between training examples, where each such parent has an alternative hypothesis for the cause of the target property.



To demonstrate shortcoming 1, consider the sample formed by the coatoms of a Boolean algebra (as positive examples):


$M\backslash G$   $m_{1}$    $m_{2}$    $\ldots$   $m_{n}$
$g_{1}$           0          1          $\ldots$   1
$g_{2}$           1          0          $\ldots$   1
$\vdots$          $\vdots$   $\vdots$   $\ddots$   $\vdots$
$g_{n}$           1          1          $\ldots$   0

It is easy to verify that every pair $\langle G\setminus\{g_{i_{1}},\ldots, g_{i_{k}}\}, \{m_{i_{1}},\ldots, m_{i_{k}}\}\rangle$ is a candidate. Hence there are $2^{n}$ candidates.


To feel the exponential explosion of the output relative to the input size, note that for $n=32$ the sample itself occupies only 128 bytes of memory, whereas storing all $2^{n}$ candidates (the fragment alone takes $n=32$ bits each) would require $2^{37}$ bits, i.e. 16 Gigabytes!
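
Both claims are easy to check by brute force for small $n$ (the code below is only an illustrative sketch; for $n=32$ only the memory arithmetic is evaluated):

from itertools import combinations

n = 4
context = {i: set(range(n)) - {i} for i in range(n)}   # coatoms: g_i lacks only m_i

def intent(A):   # A'
    return set(range(n)).intersection(*(context[g] for g in A))

def extent(B):   # B'
    return {g for g in context if B <= context[g]}

closed = sum(1 for k in range(n + 1) for A in combinations(range(n), k)
             if extent(intent(set(A))) == set(A))
print(closed)                                          # 16 = 2**n candidates

bits = 2**32 * 32                                      # 2**32 fragments of 32 bits each
print(bits == 2**37, bits / 8 / 2**30, "GiB")          # True 16.0 GiB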


Shortcoming 2 was discovered by Prof. S.O. Kuznetsov (HSE, Moscow).


Shortcomings 3 and 4 were discovered by the author while working on his dissertation. He examined several probabilistic models of generating phantom similarities together with counter-examples that could reject them. The most transparent result is an asymptotic theorem which states that the probability of generating a "phantom" similarity of two parents that is rejected by no counter-example tends to


$1-e^{-a}-a\cdot e^{-a}\cdot\left[1-e^{-c\cdot\sqrt{a}}\right],$


when the probability of occurrence of each attribute (assumed i.i.d.) is $p=\sqrt{a/n}\to 0$, the number of counter-examples is $m=c\cdot\sqrt{n}\to\infty$, and the number of attributes is $n\to\infty$.


Note that even the smaller number $1-e^{-a}-a\cdot e^{-a}$ is positive, since it coincides with the probability that a Poisson random variable with mean $a$ takes a value $>1$.
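
Indeed, for a Poisson random variable $X$ with mean $a$

$P\{X>1\}=1-P\{X=0\}-P\{X=1\}=1-e^{-a}-a\cdot e^{-a}.$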


The reader can refer to the author's dissertation for more details and results.


3. Probabilistic Algorithms


The key idea of the VKF method is to randomly generate a small subset of the candidate lattice and use its elements as hypothetical causes of the target property. With this trick we avoid the exponentially high computational complexity of the classical FCA algorithms (and of the JSM method, too).


So we need algorithms similar to random walks on a huge lattice that generate a candidate only when we need it.


The author invented and investigated the mathematical properties of several such algorithms (non-monotonic, monotonic, paired, lazy paired and inverted paired Markov chains). Details can be found in the author's dissertation.


We now present the paired Markov chain algorithm, which is the basis of the probabilistic approach to FCA-based machine learning (the VKF method).


input: sample (G,M,I), external operations CbO( , )
result: random candidate <A,B>

X = G U M;
A = M'; B = M;
C = G; D = G';
while (A != C || B != D) {
    pick a random element x from X;
    <A,B> = CbO(<A,B>, x);
    <C,D> = CbO(<C,D>, x);
}
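
A Python sketch of this pair of walks, reusing the polars and CbO operations sketched above (the function name random_candidate is illustrative; this is not the author's implementation):

import random

def random_candidate(context):
    G, M = set(context), set().union(*context.values())
    A, B = polar_attributes(M, context), M     # lower candidate <M', M>
    C, D = G, polar_objects(G, context)        # upper candidate <G, G'>
    X = [("g", g) for g in G] + [("m", m) for m in M]
    while (A, B) != (C, D):
        kind, x = random.choice(X)
        if kind == "g":
            A, B = cbo_object(A, B, x, context)
            C, D = cbo_object(C, D, x, context)
        else:
            A, B = cbo_attribute(A, B, x, context)
            C, D = cbo_attribute(C, D, x, context)
    return A, B                                # the candidate where both chains met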

There is a lazy version of the paired Markov chain. The author proved that lazy evaluation gives a speed-up (compared with the classical scheme) of


$\frac{(n+k)^{2}}{2k\cdot n}=2+\frac{(n-k)^{2}}{2k\cdot n}\geq 2$


times, where $n$ is the number of attributes and $k$ is the number of training examples.
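
For example, with purely illustrative values of $k=1000$ training examples and $n=100$ attributes, the lazy version wins by a factor of

$\frac{(1000+100)^{2}}{2\cdot 1000\cdot 100}=\frac{1210000}{200000}=6.05.$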


This result is in excellent agreement with the experimental estimates obtained by L.A. Yakimova, a former student of the RSUH.


4. General structure of the VKF method


In supervised machine learning there are usually two samples, called the training and the test sample, respectively.


The positive examples of the training sample form the sample $(G, M, I)$. The negative training examples form the set $O$ of counter-examples (obstacles that prevent candidates from becoming VKF hypotheses).


The set $T$ of tests contains all elements of the test sample whose target property has to be predicted.


The program calls the lazy paired Markov chain algorithm to generate a random candidate $\langle A, B\rangle\in L(G, M, I)$. The program saves it as a VKF hypothesis $\langle A, B\rangle$ if there is no counter-example $o\in O$ such that $B\subseteq\{o\}'$.


The basic inductive generalization algorithm is as follows


input: number N of VKF hypotheses to generate
result: random sample S of the requested hypotheses

while (i < N) {
    generate a random candidate <A,B> for (G,M,I);
    hasObstacle = false;
    for (o in O) {
        if (B is contained in {o}') hasObstacle = true;
    }
    if (hasObstacle == false) {
        S = S U {<A,B>};
        i = i + 1;
    }
}
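
The same loop as a Python sketch on top of the random_candidate function above; here O is assumed to be a collection of attribute sets of the negative examples:

def induce_hypotheses(context, O, N):
    S = []
    while len(S) < N:
        A, B = random_candidate(context)
        # ban of counter-examples: reject <A,B> if its fragment B
        # is contained in the description of some negative example o
        if not any(B <= o for o in O):
            S.append((A, B))
    return S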

The condition $B\subseteq\{o\}'$ means that the fragment $B$ of the candidate $\langle A, B\rangle$ is contained in the description (set of attributes) of the counter-example $o$.


If the candidate avoids all such obstacles, it is added to the list of generated VKF hypotheses.


We have replaced a deterministic, time-consuming algorithm that generates all candidates (for example, the well-known Close-by-One algorithm) with a probabilistic one that generates a random subset of VKF hypotheses of a given size.


After this, the machine learning system predicts the target property of the tests and compares the prediction results with the original values of the target property. Here is the prediction-by-analogy algorithm:


input: list T of test examples for prediction of the target property
input: random sample S of VKF hypotheses generated by induction

for (x in T) {
    target(x) = false;
    for (<A,B> in S) {
        if (B is a part of {x}') target(x) = true;
    }
}
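
The prediction step as a Python sketch, assuming tests maps each test example x to its attribute set {x}':

def predict_by_analogy(tests, S):
    # target(x) = true iff the fragment B of some VKF hypothesis <A,B> is contained in {x}'
    return {x: any(B <= attrs for (A, B) in S) for x, attrs in tests.items()}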

The worst situation occurs when some important positive test example misses all the generated VKF hypotheses and is therefore declared negative.


A test example $x$ is called $\varepsilon$-important if the probability of generating a VKF hypothesis $\langle A, B\rangle$ with $B\subseteq\{x\}'$ exceeds $\varepsilon$.


The author proved a theorem estimating the parameter $N$ of the inductive generalization algorithm that is needed to avoid the worst case.


For $n=\left|{M}\right|$, any $\varepsilon>0$ and any $1>\delta>0$, a random sample $S$ of VKF hypotheses of size


$N\geq{\frac{2\cdot(n+1)-2\cdot\log_{2}{\delta} }{\varepsilon}} $


has with probability $>{1-\delta}$ the property that every $\varepsilon$-important object $x$ contains a fragment of some VKF hypothesis $\langle A,B\rangle\in{S}$, i.e. $B\subseteq\{x\}'$.
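
For concreteness, with purely illustrative values $n=100$, $\varepsilon=0.01$ and $\delta=0.05$ the bound evaluates to:

from math import ceil, log2

n, eps, delta = 100, 0.01, 0.05
N = ceil((2 * (n + 1) - 2 * log2(delta)) / eps)
print(N)   # 21065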


This theorem is an analogue of the well-known results of Prof. V.N. Vapnik and Prof. A.Ya. Chervonenkis from statistical learning theory.


Conclusion


This note describes the main mathematical features of a machine learning system based on lattice theory. The author named it the "VKF system" in honor of his teacher, Prof. V.K. Finn.


The final article of the series will be devoted to representing objects with attributes of various types for applying the machine learning approach described here.


Discrete attributes will again require some techniques from FCA. Continuous attributes will call for logistic regression, entropy-based principles for splitting ranges into subintervals, and a representation that generates the convex hull of the intervals whose similarity is computed.


The author is glad to take this opportunity to thank his colleagues and students for their support and encouragement.
