In the author's previous note, a web server was described for experiments with the VKF method of machine learning based on lattice theory. As an alternative to the web server, this note shows how to use the CPython library directly. We reproduce working sessions of experiments with the Mushroom and Wine Quality datasets from the UCI machine learning repository, and then explain the input data formats.



Introduction


This article describes a new version of a machine learning program based on lattice theory. The main advantage of this version is a Python interface to efficient algorithms programmed in C++.

Today, machine learning procedures (such as convolutional neural networks, random forests and support vector machines) have reached a very high level, surpassing humans at recognizing speech, video and images. They cannot, however, provide arguments to justify the correctness of their conclusions.

On the other hand, symbolic approaches to machine learning (inductive logic programming, cover learning using integer programming) have provably very high computational complexity and are practically inapplicable to samples of even comparatively small size.

The approach described here uses probabilistic algorithms to avoid these problems. The VKF method of machine learning relies on techniques from the modern algebraic theory of lattices (Formal Concept Analysis) and from probability theory (especially Markov chains). No knowledge of advanced mathematics is now required to use the VKF system or to build programs on top of it. The author created a library (vkf.cp38-win32.pyd for Windows or vkf.cpython-38-x86_64-linux-gnu.so for Linux) that provides access to the program through a Python-friendly interface.

The described approach has a parent: the JSM method of automatic hypothesis generation, invented in the early 1980s by Prof. V.K. Finn, Doctor of Engineering. It has since become, in its creator's words, a method of automated research support. It allows one to apply methods of argumentation logic, to check the stability of the discovered hypotheses when the training sample is extended, and to combine the results of predictions obtained by various strategies of JSM reasoning.

Unfortunately, the author of this note and his colleagues discovered and investigated some theoretical flaws of the JSM method:

  1. In the worst case, the number of generated similarities may be exponentially larger than the size of the input data (the training sample).
  2. The problem of detecting large similarities is computationally hard (NP-hard).
  3. Overfitting is unavoidable and occurs in practice.
  4. "Phantom" similarities between training examples occur, where each parent example has an alternative candidate cause of the target property.
  5. The probability of detecting stable similarities tends to zero as the number of extensions of the original sample grows.


The author's research is summarized in chapter 2 of his dissertation. The last point was discovered by the author recently and, in his opinion, puts an end to the sample-extension approach.

Finally, the JSM community does not offer access to the source code of its systems. Moreover, the programming languages used (Forth and C#) prevent their use by the general public. The only freeware version of a JSM solver in C++ known to the author (by T.A. Volkova robofreak) contains an annoying error that sometimes leads to aborted computations.

Initially, the author planned to share the encoder developed for the VKF-method system with the JSM community. He therefore moved all the algorithms applicable to both JSM and VKF systems into a separate library (vkfencoder.cp38-win32.pyd on Windows or vkfencoder.cpython-38-x86_64-linux-gnu.so on Linux). Unfortunately, these algorithms went unclaimed by the JSM community. The vkf library also implements these algorithms (for example, in the vkf.FCA class), but fills the tables through the web interface rather than from files. Here we will use the vkfencoder library.

1 Procedure for working with the library for discrete attributes



We will assume that the reader knows how to install the MariaDB server together with MariaDB Connector/C (by default they use IP address 127.0.0.1, port 3306, and the user 'root' with the password 'toor'). We start by installing the vkfencoder and vkf libraries in the 'demo' virtual environment and by creating an empty MariaDB database named 'mushroom'.

krrguest@amd2700vii:~/src$ python3 -m venv demo
krrguest@amd2700vii:~/src$ source demo/bin/activate
(demo) krrguest@amd2700vii:~/src$ cd vkfencoder
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkfencoder$ cd ../vkf
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkf$ cd ../demo
(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS mushroom;
MariaDB [(none)]> exit;


As a result, the 'mushroom' database is created, the file vkfencoder.cpython-38-x86_64-linux-gnu.so appears in the folder ~/src/demo/lib/python3.8/site-packages/vkfencoder-1.0.3-py3.8-linux-x86_64.egg/, and the file vkf.cpython-38-x86_64-linux-gnu.so appears in the folder ~/src/demo/lib/python3.8/site-packages/vkf-2.0.1-py3.8-linux-x86_64.egg/.

We launch python3 and run a VKF experiment on the Mushroom dataset. It is assumed that the folder ~/src/demo/files/ contains three files (mushrooms.xml, MUSHROOMS.train, MUSHROOMS.rest). The structure of these files is described in the next section. The first file, 'mushrooms.xml', defines a lower semilattice on the values of each attribute describing a mushroom. The second and third files are the file 'agaricus-lepiota.data' — a digitized version of a mushroom field guide published in New York in 1981 — split roughly in half by a random number generator. The names 'encoder', 'lattices', 'trains' and 'tests' below are names of tables in the 'mushroom' database holding, respectively, the encodings of attribute values by bit strings, the covering relations of the semilattice diagrams on these values, and the training and test examples.

(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> xml=vkfencoder.XMLImport('./files/mushrooms.xml', 'encoder', 'lattices', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> trn=vkfencoder.DataImport('./files/MUSHROOMS.train', 'e', 'encoder', 'trains', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> tst=vkfencoder.DataImport('./files/MUSHROOMS.rest', 'e', 'encoder', 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')


The 'e' in the last two lines corresponds to edibility (this is how the target attribute is encoded in this dataset).

Note that there is also a class vkfencoder.XMLExport, which saves the information from the two tables 'encoder' and 'lattices' into an XML file; after edits, that file can be processed again by vkfencoder.XMLImport.

Now we turn to the VKF experiment itself: we attach the encoders, load previously computed hypotheses (if any), compute an additional number (100) of hypotheses in the specified number (4) of threads, and save them into the table 'vkfhyps'.

>>> import vkf
>>> enc=vkf.Encoder('encoder', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> ind=vkf.Induction()
>>> ind.load_discrete_hypotheses(enc, 'trains', 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> ind.add_hypotheses(100, 4)
>>> ind.save_discrete_hypotheses(enc, 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor')


You can get a Python list of all nontrivial pairs (attribute_name, attribute_value) for the VKF hypothesis with index ndx:

>>> ind.show_discrete_hypothesis(enc, ndx) 


One of the hypotheses for deciding that a mushroom is edible has the form

[('gill_attachment', 'free'), ('gill_spacing', 'close'), ('gill_size', 'broad'), ('stalk_shape', 'enlarging'), ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'), ('veil_color', 'white'), ('ring_number', 'one'), ('ring_type', 'pendant')]

Due to the probabilistic nature of the VKF algorithms, this particular hypothesis may fail to be generated; however, the author proved that with a sufficiently large number of generated VKF hypotheses, a very similar hypothesis will arise that predicts the target property of important test examples almost as well. Details are in chapter 4 of the author's dissertation.
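How a discrete hypothesis of this form applies to an example can be illustrated by a toy sketch (ignoring the value semilattices for simplicity; the library's actual prediction logic lives in the C++ code and may differ):

```python
# Toy illustration (not the library's C++ code): a discrete VKF hypothesis
# predicts the target property for a test example when every
# (attribute, value) pair of the hypothesis occurs in the example.

def hypothesis_fires(hypothesis, example):
    """hypothesis: list of (attribute, value) pairs; example: dict attribute -> value."""
    return all(example.get(attr) == val for attr, val in hypothesis)

hyp = [('gill_size', 'broad'), ('ring_type', 'pendant')]
mushroom = {'gill_size': 'broad', 'ring_type': 'pendant', 'cap_color': 'brown'}
print(hypothesis_fires(hyp, mushroom))  # True: the example is predicted positive
```

In the real system the check is not plain equality but similarity in the value semilattice, which is why the XML lattice descriptions of the next section matter.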

Finally, we turn to prediction. We create a test sample to assess the quality of the generated hypotheses and, at the same time, predict the target property of its elements:

>>> tes=vkf.TestSample(enc, ind, 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()


2 Description of the data structure


2.1 Lattice structures on discrete attributes



The vkfencoder.XMLImport class parses an XML file describing the order between attribute values, then creates and fills two tables: 'encoder' (converting each value into a bit string) and 'lattices' (storing the relations between the values of each attribute).
The structure of the input file looks like this:

<?xml version="1.0"?>
<document name="mushrooms_db">
  <attribute name="cap_color">
    <vertices>
      <node string="null" char='_'></node>
      <node string="brown" char='n'></node>
      <node string="buff" char='b'></node>
      <node string="cinnamon" char='c'></node>
      <node string="gray" char='g'></node>
      <node string="green" char='r'></node>
      <node string="pink" char='p'></node>
      <node string="purple" char='u'></node>
      <node string="red" char='e'></node>
      <node string="white" char='w'></node>
      <node string="yellow" char='y'></node>
    </vertices>
    <edges>
      <arc source="brown" target="null"></arc>
      <arc source="buff" target="brown"></arc>
      <arc source="buff" target="yellow"></arc>
      <arc source="cinnamon" target="brown"></arc>
      <arc source="cinnamon" target="red"></arc>
      <arc source="gray" target="null"></arc>
      <arc source="green" target="null"></arc>
      <arc source="pink" target="red"></arc>
      <arc source="pink" target="white"></arc>
      <arc source="purple" target="red"></arc>
      <arc source="red" target="null"></arc>
      <arc source="white" target="null"></arc>
      <arc source="yellow" target="null"></arc>
    </edges>
  </attribute>
</document>


The above example represents the order on the values of the third attribute 'cap_color' for the Mushroom dataset from the Machine Learning Repository (University of California, Irvine). Each 'attribute' element represents a lattice structure on the values of the corresponding attribute; in our example it corresponds to the discrete attribute 'cap_color'. The list of all attribute values forms the 'vertices' sub-element. We added the value 'null' to indicate a trivial (missing) similarity. The remaining values correspond to the descriptions in the accompanying file 'agaricus-lepiota.names'.
The order between them is represented by the contents of the 'edges' sub-element. Each 'arc' element describes a "more specific/more general" relation between a pair of attribute values. For example, the arc with source="pink" and target="red" means that a pink mushroom cap is more specific than a red one.
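The role of such a diagram can be sketched in pure Python (an illustration only; the actual bit-string encoding produced by vkfencoder may differ): each value determines the set of more general values reachable from it along the arcs, and the similarity of two values is then simply the intersection of these sets.

```python
# Illustration only: encode each attribute value by the set of values
# reachable upward from it in the semilattice diagram (its "ancestors"),
# excluding the trivial top element 'null'. The similarity of two values
# is then just set intersection.

ARCS = {  # source -> list of more general targets, taken from the XML above
    'brown': ['null'], 'buff': ['brown', 'yellow'], 'cinnamon': ['brown', 'red'],
    'gray': ['null'], 'green': ['null'], 'pink': ['red', 'white'],
    'purple': ['red'], 'red': ['null'], 'white': ['null'], 'yellow': ['null'],
}

def ancestors(value):
    """All values reachable upward from `value`, including itself, except 'null'."""
    seen = set()
    stack = [value]
    while stack:
        v = stack.pop()
        if v == 'null' or v in seen:
            continue
        seen.add(v)
        stack.extend(ARCS.get(v, []))
    return seen

print(ancestors('pink'))                        # pink, red, white (set order may vary)
print(ancestors('pink') & ancestors('purple'))  # {'red'}: their similarity
```

Replacing each set by a fixed-width bit string turns the intersection into a bitwise AND, which is essentially what the 'encoder' table makes possible.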

The author proved a theorem on the correctness of the representation generated by the vkfencoder.XMLImport constructor and stored in the 'encoder' table. His algorithm and the proof of the correctness theorem use a modern branch of lattice theory known as Formal Concept Analysis. For details, the reader is again referred to the author's dissertation (chapter 1).

2.2 Sample structure for discrete attributes



First of all, note that the reader can implement his own loader of training and test examples into the database or use the vkfencoder.DataImport class available in the library. In the latter case, the reader should take into account that the target attribute must occupy the first position and consist of a single character (for example, '+'/'-', '1'/'0' or 'e'/'p').

The training sample must be a CSV file (with comma-separated values) describing training examples and counter-examples (examples lacking the target property).
The structure of the input file looks like this:

e, f, f, g, t, n, f, c, b, w, t, b, s, s, p, w, p, w, o, p, k, v, d
p, x, s, p, f, c, f, c, n, u, e, b, s, s, w, w, p, w, o, p, n, s, d
p, f, y, g, f, f, f, c, b, p, e, b, k, k, b, n, p, w, o, l, h, v, g
p, x, y, y, f, f, f, c, b, p, e, b, k, k, n, n, p, w, o, l, h, v, p
e, x, y, b, t, n, f, c, b, e, e,?, s, s, e, w, p, w, t, e, w, c, w
p, f, f, g, f, f, f, c, b, g, e, b, k, k, n, p, p, w, o, l, h, y, g
p, x, f, g, f, f, f, c, b, p, e, b, k, k, p, n, p, w, o, l, h, y, g
p, x, f, y, f, f, f, c, b, h, e, b, k, k, n, b, p, w, o, l, h, y, g
p, f, f, y, f, f, f, c, b, h, e, b, k, k, p, p, p, w, o, l, h, y, g
p, x, y, g, f, f, f, c, b, h, e, b, k, k, p, p, p, w, o, l, h, v, d
p, x, f, g, f, c, f, c, n, u, e, b, s, s, w, w, p, w, o, p, n, v, d
p, x, y, g, f, f, f, c, b, h, e, b, k, k, p, b, p, w, o, l, h, v, g
e, f, y, g, t, n, f, c, b, n, t, b, s, s, p, g, p, w, o, p, k, y, d
e, f, f, e, t, n, f, c, b, p, t, b, s, s, p, p, p, w, o, p, k, v, d
p, f, y, g, f, f, f, c, b, p, e, b, k, k, p, b, p, w, o, l, h, y, p

Each line here describes one training example. The first position contains 'e' or 'p' depending on the edibility/toxicity of the described mushroom. The remaining positions (separated by commas) contain either letters (short names) or strings (full names of the values of the corresponding attribute). For example, the fourth position corresponds to the attribute 'cap_color', where 'g', 'p', 'y', 'b' and 'e' denote "gray", "pink", "yellow", "buff" and "red", respectively. A question mark indicates a missing value. Note that the values of the remaining (non-target) attributes may be arbitrary strings; when they are saved to the database, spaces in their names are replaced by the character '_'.

The file with test examples has the same form; the system is simply not told the real label of an example, and its prediction is compared with that label to measure the quality of the generated hypotheses and their predictive power.

2.3 Sample structure for continuous features



In this case there are again two options: the reader can implement his own loader of training and test examples into the database or use the vkfencoder.DataLoad class available in the library. In the latter case, the reader should take into account that the target attribute must occupy the first position and consist of a natural number (for example, 0, 7, 15).

The training sample must be a CSV file (with values separated by a delimiter) describing training examples and counter-examples.
The structure of the input file looks like this:

"quality";"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol"
5; 7.4; 0.7; 0; 1.9; 0.076; 11; 34; 0.9978; 3.51; 0.56; 9.4
5; 7.8; 0.88; 0; 2.6; 0.098; 25; 67; 0.9968; 3.2; 0.68; 9.8
5; 7.8; 0.76; 0.04; 2.3; 0.092; 15; 54; 0.997; 3.26; 0.65; 9.8
6; 11.2; 0.28; 0.56; 1.9; 0.075; 17; 60; 0.998; 3.16; 0.58; 9.8
5; 7.4; 0.7; 0; 1.9; 0.076; 11; 34; 0.9978; 3.51; 0.56; 9.4
5; 7.4; 0.66; 0; 1.8; 0.075; 13; 40; 0.9978; 3.51; 0.56; 9.4
5; 7.9; 0.6; 0.06; 1.6; 0.069; 15; 59; 0.9964; 3.3; 0.46; 9.4
7; 7.3; 0.65; 0; 1.2; 0.065; 15; 21; 0.9946; 3.39; 0.47; 10
7; 7.8; 0.58; 0.02; 2; 0.073; 9; 18; 0.9968; 3.36; 0.57; 9.5

In our case, these are the first few lines of the file 'vv_red.csv', generated from the file 'winequality-red.csv' of Portuguese red wines (the Wine Quality dataset of the UCI machine learning repository) by moving the last column to the very beginning. Note that when the data is saved to the database, spaces in column names are replaced by the character '_'.

3 VKF experiment with continuous attributes



As noted above, we demonstrate the procedure on the Wine Quality dataset from the UCI repository. We start by creating an empty MariaDB database named 'red_wine'.

(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS red_wine;
MariaDB [(none)]> exit;


As a result, an empty database 'red_wine' appears.

We launch python3 and run a VKF experiment on the Wine Quality dataset. It is assumed that the folder ~/src/demo/files/ contains the file vv_red.csv, whose structure was described in the previous section. The names 'verges', 'complex', 'trains' and 'tests' that occur below are names of tables in the database 'red_wine' holding, respectively, the thresholds (both for the original attributes and for those computed by regression), the coefficients of significant logistic regressions between pairs of attributes, and the training and test examples.

(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> nam=vkfencoder.DataLoad('./files/vv_red.csv', 7, 'verges', 'trains', 'red_wine', ';', '127.0.0.1', 'root', 'toor')


The second argument sets the threshold (7 in our case) above which an example is declared positive. The argument ';' is the delimiter (the comma ',' is the default).

As stated in the author's previous note, the procedure differs from the discrete case. First we compute logistic regressions with the vkf.Join class, whose coefficients are stored in the 'complex' table.

>>> import vkf
>>> join=vkf.Join('trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> join.compute_save('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
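The details of vkf.Join are internal to the C++ library; as a rough illustration of what fitting a logistic regression for a feature involves, the following sketch fits y ≈ sigmoid(a*x + b) by plain gradient descent (illustrative only, not the library's algorithm):

```python
import math

# Illustration only (the actual vkf.Join algorithm is inside the C++ library):
# fit a logistic regression y ~ sigmoid(a*x + b) for one feature by
# plain batch gradient descent on the log-loss.

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))  # predicted probability
            ga += (p - y) * x / n                     # gradient w.r.t. slope
            gb += (p - y) / n                         # gradient w.r.t. intercept
        a -= lr * ga
        b -= lr * gb
    return a, b

a, b = fit_logistic([0.0, 0.2, 0.8, 1.0], [0, 0, 1, 1])
print(a > 0)  # True: a positive slope means higher x predicts the positive class
```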


Now, using information theory, we compute thresholds with the vkf.Generator class; together with the maxima and minima, they are stored in the 'verges' table.

>>> gen=vkf.Generator('complex', 'trains', 'verges', 'red_wine', 7, '127.0.0.1', 'root', 'toor') 

The fifth argument specifies the number of thresholds on the original attributes. By default (and when the value equals 0) it is computed as the logarithm of the number of attributes. A single threshold is sought on each regression.
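As an illustration of an information-theoretic threshold choice (the exact criterion used by vkf.Generator is not documented in this note), the following sketch picks the single cut point that maximizes the information gain of the induced binary split:

```python
import math

# Sketch of an entropy-based threshold choice (not necessarily the
# vkf.Generator criterion): pick the midpoint between consecutive sorted
# values that maximizes the information gain of the binary split.

def entropy(labels):
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best, best_gain = cut, gain
    return best

print(best_threshold([9.4, 9.5, 10.0, 11.2], [0, 0, 1, 1]))  # 9.75
```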

Now we turn to the VKF experiment itself: we attach the encoders, load previously computed hypotheses (if any), compute an additional number (300) of hypotheses in the specified number (8) of threads, and save them into the table 'vkfhyps'.

>>> qual=vkf.Qualifier('verges', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> beg=vkf.Beget('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> ind=vkf.Induction()
>>> ind.load_continuous_hypotheses(qual, beg, 'trains', 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> ind.add_hypotheses(300, 8)
>>> ind.save_continuous_hypotheses(qual, 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor')


You can get a Python list of triples (attribute_name, (lower_bound, upper_bound)) for the VKF hypothesis with index ndx:

>>> ind.show_continuous_hypothesis(qual, ndx) 


Finally, we turn to prediction. We create a test sample to assess the quality of the generated hypotheses and, at the same time, predict the target property of its elements:

>>> tes=vkf.TestSample(qual, ind, beg, 'trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()


Since we have only one file here, we used the training sample as the test examples.

4 Note on file uploads



The author used the aiofiles library to receive files (such as 'mushrooms.xml') uploaded to the web server. Here is a code snippet that may prove useful:

import os

import aiofiles
from sanic import response  # assumption: the web server from the previous note is Sanic

async def write_file(path, body):
    # 'async with' closes the file automatically; no explicit close() is needed
    async with aiofiles.open(path, 'wb') as file_write:
        await file_write.write(body)

async def xml_upload(request):
    # determine the names of the tables
    encoder_table_name = request.form.get('encoder_table_name')
    lattices_table_name = request.form.get('lattices_table_name')
    # create the upload folder if it does not exist
    # (Settings, with its UPLOAD_DIR, is defined elsewhere in the server code)
    if not os.path.exists(Settings.UPLOAD_DIR):
        os.makedirs(Settings.UPLOAD_DIR)
    # ensure a file was sent
    upload_file = request.files.get('file_name')
    if not upload_file:
        return response.redirect("/?error=no_file")
    # write the file to disk and redirect back to the main page
    short_name = upload_file.name.split('/')[-1]
    full_path = f"{Settings.UPLOAD_DIR}/{short_name}"
    try:
        await write_file(full_path, upload_file.body)
    except Exception as exc:
        return response.redirect('/?error=' + str(exc))
    return response.redirect('/ask_data')


Conclusion



The vkf.cpython-38-x86_64-linux-gnu.so library contains many hidden classes and algorithms. They use such advanced mathematical concepts as the Close-by-One operation, lazy evaluation, coupled Markov chains, transient states of Markov chains, the total variation metric, and others.

In practice, experiments with datasets from the machine learning repository (University of California, Irvine) showed the real applicability of the C++ 'VKF Method' program to medium-sized data (the Adult dataset contains 32,560 training and 16,280 test examples).

The 'VKF Method' system surpassed, with respect to prediction accuracy, the CLIP3 system (Cover Learning using Integer Programming, v.3) on the SPECT dataset, where CLIP3 is a program created by the authors of the SPECT dataset itself. On the Mushroom dataset (randomly divided into training and test samples), the 'VKF Method' system showed 100% accuracy (both with respect to the mushroom toxicity criterion and the edibility criterion). The program was also applied to the Adult dataset to generate more than a million hypotheses (without glitches).

The source code of the vkf CPython library is under moderation at savannah.nongnu.org. The code of the auxiliary vkfencoder library can be obtained from Bitbucket.

The author would like to thank his colleagues and students (D.A. Anokhin, E.D. Baranova, I.V. Nikulin, E.Yu. Sidorova, L.A. Yakimova) for their support, useful discussions, and joint research. As usual, however, the author alone is responsible for any shortcomings.
