Mathematics of Machine Learning Based on Lattice Theory
This is the third article in a series of works (links to the first and second articles) describing a machine learning system based on lattice theory, called the "VKF system". It uses a structural (lattice-theoretic) approach to represent training examples and their fragments, which are considered as causes of the target property. The system computes these fragments as similarities between certain subsets of training examples. There is an algebraic theory of such representations, called Formal Concept Analysis (FCA).
However, the described system uses probabilistic algorithms to avoid the drawbacks of that unrestricted (exhaustive) approach. Details are below.
Introduction
We begin by demonstrating our approach on a school-level problem.
It consists in finding sufficient conditions for a convex quadrilateral to admit a circumscribed circle, and in predicting this property for the rectangle.
Hence there are two classes: positive (a circle can be circumscribed about the quadrilateral) and negative.
The training sample consists of a square, an isosceles trapezoid, a rhombus, and a deltoid (see the row labels of the table).
The only test example is the rectangle.
We represent each quadrilateral by a set of features related to its symmetries:
"There is a center of symmetry" ( A ),
"The rotation group is trivial" ( B ),
"The rotation group is non-trivial" ( C ),
"There is a diagonal axis of symmetry" ( D ),
"There is an off-diagonal axis of symmetry" ( E ).
They correspond to the column names in the table.
quadrilateral  goal  A  B  C  D  E
square         1     1  0  1  1  1
trapezoid      1     0  1  0  0  1
rhombus        0     1  0  1  1  0
deltoid        0     0  1  0  1  0
rectangle      ?     1  0  1  0  1
To detect possible causes (in terms of symmetries), the program computes similarities (common features) between training examples of the same sign. For the positive examples we have

$$\langle \{\mathrm{square}, \mathrm{trapezoid}\}, \{E\} \rangle,$$

where the first component contains the parents (all the training examples whose similarity was computed), and the second one is the common fragment of these examples.
Since the common fragment $\{E\}$ is a subset of the description of the rectangle, the program predicts its target property to be positive, i.e. a circle can be circumscribed about the rectangle. This corresponds to the extension-by-analogy procedure of the JSM method. The parents (the square and the isosceles trapezoid), whose common fragment also occurs in the rectangle, are the analogues of the rectangle.
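On this toy sample the computation can be sketched in Python (the feature sets are transcribed from the table above; this is only an illustration, not the VKF system's code):

```python
# Feature sets transcribed from the table: a feature is present
# iff its column holds 1 for the given quadrilateral.
positive = {
    "square":    {"A", "C", "D", "E"},
    "trapezoid": {"B", "E"},
}
negative = {
    "rhombus": {"A", "C", "D"},
    "deltoid": {"B", "D"},
}
rectangle = {"A", "C", "E"}

# Similarity of the positive examples = their common fragment.
fragment = set.intersection(*positive.values())
print(fragment)               # {'E'}

# The fragment is contained in the rectangle's description,
# so the target property is predicted to be positive.
print(fragment <= rectangle)  # True
```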
However, we can swap the signs: the similarity of the negative examples is

$$\langle \{\mathrm{rhombus}, \mathrm{deltoid}\}, \{D\} \rangle.$$
This observation leads to the logic of argumentation, but we prefer not to go into details here. The interested reader is referred to the author's articles in the collection
Finn V.K., Anshakov O.M., Vinogradov D.V. (Eds.). Multi-valued Logics and Their Applications. Volume 2: Logics in Artificial Intelligence Systems. Moscow: URSS, 2020, 238 p. ISBN 978-5-382-01977-2.
It is important that the similarity of the negative examples demonstrates the principle of "prohibition of counterexamples": its fragment $\{D\}$ is embedded in the description of an example of the opposite sign (the square).
1. Formal Concept Analysis
Initially, the author planned to present the theory in terms of the JSM method of automatic hypothesis generation. But its creator expressed doubt that "it is possible to present the deep ideas of the JSM method in popular language." Therefore the author decided to use the language of FCA for this article. However, he will use some of his own terms (with the original ones in brackets) where he prefers to change the terminology.
A sample (= formal context) is a triple $(G, M, I)$, where $G$ and $M$ are finite sets and $I \subseteq G \times M$. The elements of $G$ and $M$ are called objects and features, respectively. As usual, we write $gIm$ instead of $\langle g, m \rangle \in I$ to indicate that object $g$ has feature $m$.
For $A \subseteq G$ and $B \subseteq M$ define

$$A' = \{ m \in M \mid \forall g \in A \; (gIm) \},$$
$$B' = \{ g \in G \mid \forall m \in B \; (gIm) \};$$

so $A'$ is the set of all features common to all objects from $A$, and $B'$ is the set of all objects that possess all the features from $B$. The mappings $A \mapsto A'$ and $B \mapsto B'$ are called the polars (= derivation operators) of the sample $(G, M, I)$.
A candidate (= formal concept) of the sample $(G, M, I)$ is a pair $\langle A, B \rangle$, where $A \subseteq G$, $B \subseteq M$, $A' = B$, and $B' = A$. The first component $A$ of the candidate $\langle A, B \rangle$ is called its list of parents (= extent), and the second component $B$ is called its fragment (= intent). The set of all candidates of the sample $(G, M, I)$ is denoted by $L(G, M, I)$.
It is an easy exercise to verify that $L(G, M, I)$ forms a lattice with the operations

$$\langle A_1, B_1 \rangle \wedge \langle A_2, B_2 \rangle = \langle A_1 \cap A_2, (B_1 \cup B_2)'' \rangle,$$
$$\langle A_1, B_1 \rangle \vee \langle A_2, B_2 \rangle = \langle (A_1 \cup A_2)'', B_1 \cap B_2 \rangle.$$
We use a special case: for $\langle A, B \rangle \in L(G, M, I)$, $g \in G$, and $m \in M$ define

$$CbO(\langle A, B \rangle, g) = \langle (A \cup \{g\})'', (A \cup \{g\})' \rangle,$$
$$CbO(\langle A, B \rangle, m) = \langle (B \cup \{m\})', (B \cup \{m\})'' \rangle.$$

We call these operations CbO, since the first one is used in the well-known Close-by-One (CbO) algorithm to generate all the elements of $L(G, M, I)$.
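Assuming a dict-of-sets encoding of the sample (the helper names `objects_prime`, `attrs_prime`, `cbo_object`, and `cbo_attr` are mine), the polars and the CbO operations can be sketched as:

```python
# The quadrilateral sample from the introduction as a formal context:
# object -> set of its features.
context = {
    "square":    {"A", "C", "D", "E"},
    "trapezoid": {"B", "E"},
    "rhombus":   {"A", "C", "D"},
    "deltoid":   {"B", "D"},
}
G = set(context)
M = {"A", "B", "C", "D", "E"}

def objects_prime(A):
    """A' : all features shared by every object of A."""
    return set.intersection(*(context[g] for g in A)) if A else set(M)

def attrs_prime(B):
    """B' : all objects possessing every feature of B."""
    return {g for g in G if set(B) <= context[g]}

def cbo_object(A, g):
    """CbO(<A,B>, g) = <(A ∪ {g})'', (A ∪ {g})'> : add an object, then close."""
    B = objects_prime(A | {g})
    return attrs_prime(B), B

def cbo_attr(B, m):
    """CbO(<A,B>, m) = <(B ∪ {m})', (B ∪ {m})''> : add a feature, then close."""
    A = attrs_prime(B | {m})
    return A, objects_prime(A)

ext, intent = cbo_attr(set(), "E")
print(sorted(ext), sorted(intent))   # ['square', 'trapezoid'] ['E']
```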
The most important property of the CbO operations, monotonicity, is the subject of the following lemma.
Lemma. Let $(G, M, I)$ be a sample, $\langle A_1, B_1 \rangle \leq \langle A_2, B_2 \rangle$ be candidates from $L(G, M, I)$, and $x \in G \cup M$. Then $CbO(\langle A_1, B_1 \rangle, x) \leq CbO(\langle A_2, B_2 \rangle, x)$.
2. Problems with FCA
Unfortunately, the author and his colleagues discovered several theoretical shortcomings of a direct application of FCA to machine learning:

1. The number of hypotheses can be exponentially large with respect to the size of the input data (the training sample) in the worst case.

2. The problem of detecting large hypotheses is computationally (NP-)hard.

3. Overfitting is unavoidable and occurs in practice.

4. There are "phantom" similarities between training examples, where each such parent has an alternative hypothesis on the cause of the target property.
To demonstrate flaw 1, take the sample whose objects are the coatoms of a Boolean algebra (as positive examples); for $n = 3$ its table is
0  1  1  
1  0  1  
1  1  0 
It is easy to verify that every pair $\langle A, A' \rangle$ with $A \subseteq G$ is a candidate. Therefore, there are $2^n$ candidates.
To appreciate the exponential explosion of the output with respect to the input size, estimate the memory: for $n = 32$ the sample occupies $32 \times 32$ bits, i.e. 128 bytes, whereas storing all $2^{32}$ candidates requires $2^{32} \cdot 32 = 2^{37}$ bits, i.e. 16 Gigabytes!
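The explosion is easy to observe for small $n$. In the context above (the so-called contranominal scale: object $g$ lacks exactly feature $g$), every subset of objects is closed; a small check, with 0-based indices as my own convention:

```python
from itertools import combinations

n = 3
# Contranominal scale: object g has every feature except g itself.
context = {g: {m for m in range(n) if m != g} for g in range(n)}

def objects_prime(A):
    """A' : features shared by all objects of A."""
    return set.intersection(*(context[g] for g in A)) if A else set(range(n))

def attrs_prime(B):
    """B' : objects possessing all features of B."""
    return {g for g in context if B <= context[g]}

# Close every subset of objects; each one yields a distinct candidate.
candidates = set()
for k in range(n + 1):
    for A in combinations(range(n), k):
        B = objects_prime(set(A))
        candidates.add((frozenset(attrs_prime(B)), frozenset(B)))

print(len(candidates))   # 8 == 2**3
```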
Deficiency 2 was discovered by Prof. S.O. Kuznetsov (HSE, Moscow).
Deficiencies 3 and 4 were discovered by the author during his work on his dissertation. He examined several probabilistic models that give rise to phantom similarities, together with counterexamples that could reject them. The most transparent result is an asymptotic theorem, which states that the probability of generating a "phantom" similarity between two parents that avoids all counterexamples tends to

$$1 - (1 + a) \cdot e^{-a}$$

when the probability of occurrence of each feature (the features are assumed i.i.d.) is $p = \sqrt{a/n} \to 0$, the number of counterexamples is fixed, and the number of features is $n \to \infty$.
Note that even a smaller number is positive, since it coincides with the probability that a Poisson random variable with mean $a$ takes a value greater than 1.
You can refer to the author's dissertation for more details and results.
3. Probabilistic Algorithms
The key idea of the VKF method is to randomly generate a small subset of the candidate lattice and to use its elements as hypothetical causes of the target property. With this trick we avoid the exponentially high computational complexity of conventional FCA algorithms (and of the JSM method, too).
So we need algorithms resembling random walks on a huge lattice, generating a candidate only when we need it.
The author invented and investigated the mathematical properties of several such algorithms (non-monotonic, monotonic, paired, lazy paired, and inverted paired Markov chains). Details can be found in the author's dissertation.
We now present the paired Markov chain algorithm, which is the basis of the probabilistic approach to machine learning based on FCA (the VKF method).
input: sample (G, M, I), external operations CbO(·, ·)
result: random candidate <A,B>

X = G ∪ M;
A = M'; B = M; C = G; D = G';
while (A != C || B != D) {
    pick a random element x from X;
    <A,B> = CbO(<A,B>, x);
    <C,D> = CbO(<C,D>, x);
}
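A runnable Python sketch of the paired chain on the toy quadrilateral context (the helper names and the choice of context are my assumptions; the real VKF system works on encoded bit strings):

```python
import random

# Toy context from the introduction: object -> feature set.
context = {
    "square":    {"A", "C", "D", "E"},
    "trapezoid": {"B", "E"},
    "rhombus":   {"A", "C", "D"},
    "deltoid":   {"B", "D"},
}
G = set(context)
M = {"A", "B", "C", "D", "E"}

def objects_prime(A):
    return set.intersection(*(context[g] for g in A)) if A else set(M)

def attrs_prime(B):
    return {g for g in G if B <= context[g]}

def cbo(A, B, x):
    """One CbO step: mix object or feature x into candidate <A,B> and close."""
    if x in G:
        B = objects_prime(A | {x})
        return attrs_prime(B), B
    A = attrs_prime(B | {x})
    return A, objects_prime(A)

def random_candidate(rng=random):
    """Paired Markov chain: drive the bottom and the top candidates
    with the same random elements until they coalesce."""
    X = sorted(G | M)
    A, B = attrs_prime(M), set(M)    # bottom candidate <M', M>
    C, D = set(G), objects_prime(G)  # top candidate <G, G'>
    while A != C or B != D:
        x = rng.choice(X)
        A, B = cbo(A, B, x)
        C, D = cbo(C, D, x)
    return A, B

A, B = random_candidate()
print(sorted(A), sorted(B))   # extent and fragment of a random candidate
```

The monotonicity lemma guarantees that the lower chain always stays below the upper one, so the loop stops exactly when both chains reach the same candidate.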
There is a lazy version of the paired Markov chain. The author proved that lazy evaluation leads to an acceleration (compared with the classical scheme) by a factor that grows with the number of features and the number of training examples.
This result is in excellent agreement with experimental estimates obtained by L.A. Yakimova, a former student of RSUH.
4. General structure of the VKF method
In supervised machine learning there are usually two samples, called the training and the test sample, respectively.
The positive examples of the training sample form the sample $(G, M, I)$. The negative training examples form the set $O$ of counterexamples (obstacles that prevent candidates from becoming VKF hypotheses).
The test set contains all the elements of the test sample whose target property must be predicted.
The program calls the lazy paired Markov chain algorithm to generate a random candidate $\langle A, B \rangle$. The program saves it as a VKF hypothesis if there is no counterexample $o \in O$ such that $B \subseteq \{o\}'$.
The basic inductive generalization algorithm is as follows:
input: number N of VKF hypotheses to generate
result: random sample S of the requested hypotheses

i = 0; S = ∅;
while (i < N) {
    generate a random candidate <A,B> for (G,M,I);
    hasObstacle = false;
    for (o in O) {
        if (B ⊆ {o}') hasObstacle = true;
    }
    if (hasObstacle == false) {
        S = S ∪ {<A,B>};
        i = i + 1;
    }
}
The condition $B \subseteq \{o\}'$ means that the fragment of the candidate (a subset of features) is included in the description of the counterexample $o$.
If the candidate avoids all such obstacles, it is added to the list of generated VKF hypotheses.
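The induction step can be sketched in Python; for brevity, closing a random feature subset stands in here for the paired Markov chain, and the toy data and names are my assumptions:

```python
import random

# Positive training examples form the sample; negative ones are obstacles.
positive = {
    "square":    {"A", "C", "D", "E"},
    "trapezoid": {"B", "E"},
}
obstacles = [{"A", "C", "D"},   # rhombus
             {"B", "D"}]        # deltoid
M = {"A", "B", "C", "D", "E"}

def objects_prime(A):
    return set.intersection(*(positive[g] for g in A)) if A else set(M)

def attrs_prime(B):
    return {g for g in positive if B <= positive[g]}

def random_candidate(rng):
    """Stand-in for the paired Markov chain: close a random feature subset."""
    B = {m for m in M if rng.random() < 0.5}
    A = attrs_prime(B)
    return A, objects_prime(A)

def induce(N, rng=random.Random(2020)):
    """Keep a random candidate only if no counterexample contains its fragment."""
    S = []
    while len(S) < N:
        A, B = random_candidate(rng)
        if not any(B <= o for o in obstacles):   # forbidden-counterexample test
            S.append((frozenset(A), frozenset(B)))
    return S

hypotheses = induce(3)
print(len(hypotheses))   # 3
```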
We have replaced the deterministic, time-consuming algorithm that generates all candidates (for example, the well-known Close-by-One algorithm) with a probabilistic one that generates a random subset of VKF hypotheses of a given size.
After that, the machine learning system predicts the target property of the test examples and compares the predictions with the original values of the target property. Here is the prediction-by-analogy algorithm:
input: list T of test examples for the prediction of the target property
input: random sample S of VKF hypotheses generated by induction

for (x in T) {
    target(x) = false;
    for (<A,B> in S) {
        if (B ⊆ {x}') target(x) = true;
    }
}
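The prediction loop in runnable form (the single hypothesis fragment $\{E\}$ used here was computed by hand from the training table in the introduction):

```python
# VKF hypotheses: here only the fragment of the similarity of the
# positive examples (square and isosceles trapezoid).
hypotheses = [frozenset({"E"})]

# Test sample: the rectangle with its feature set from the table.
tests = {"rectangle": {"A", "C", "E"}}

target = {}
for x, description in tests.items():
    # x is predicted positive iff some hypothesis fragment is part of {x}'.
    target[x] = any(B <= description for B in hypotheses)

print(target)   # {'rectangle': True}
```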
The worst case occurs when some important positive test example misses all the generated VKF hypotheses and is predicted to be negative.
A test example $x$ is called $\varepsilon$-important if the probability of the set of all VKF hypotheses $\langle A, B \rangle$ with $B \subseteq \{x\}'$ exceeds $\varepsilon$.
The author proved a theorem that estimates the parameter $N$ of the inductive generalization algorithm so as to avoid the worst case.
For any $\varepsilon > 0$ and any $0 < \delta < 1$, a random sample $S$ of VKF hypotheses of sufficiently large size $N$ (the explicit bound depends on $\varepsilon$ and $\delta$) has, with probability at least $1 - \delta$, the property that every $\varepsilon$-important object $x$ contains the fragment of some VKF hypothesis $\langle A, B \rangle \in S$, i.e. $B \subseteq \{x\}'$.
This theorem is an analogue of well-known results of Prof. V.N. Vapnik and Prof. A.Ya. Chervonenkis from statistical learning theory.
Conclusion
This note describes the main mathematical features of a machine learning system based on lattice theory. The author named it the "VKF system" in honor of his teacher, Prof. V.K. Finn.
The last article of the series will be devoted to representations of objects with features of various types, needed to apply the machine learning system described here.
Discrete features will again require some techniques from FCA. Continuous features will call for logistic regression, entropy-based principles of splitting ranges into subintervals, and a representation generating the convex hull of the intervals whose similarity is computed.
The author is glad to have the opportunity to thank his colleagues and students for their support and encouragement.