tl; dr: A simplified analysis of the article in which the author proposes two interesting theorems on the basis of which he found a way to extract hidden meaning vectors from the embedding matrix. A guide is given on how to reproduce the results. The laptop is available on the github .

## Introduction

In this article I want to talk about one amazing thing that researcher Sanjev Arora found in the article Linear Algebraic Structure of Word Senses, with Applications to Polysemy . It is one of a series of articles in which he tries to give theoretical justifications for the properties of word embeddings. In the same work, Arora makes the assumption that simple embeddings, such as word2vec or Glove, actually include several meanings for one word and offer a way to restore them. In the course of the article, I will try to stick to the original examples.

More formally, for we denote a certain embedding vector of the word tie , which may have the meaning of a knot or tie, or it may be a tie. Arora suggests that this vector can be written as the next linear combination

where this is one of the possible meanings of the word tie , and - coefficient. Let's try to figure out how it happens.

## Theory

Disclaimer

Written by a non-mathematician, please report all errors, especially to the mat. terminology.

### A short note on the theory of Arara

Since Arora’s starting work is far more complicated than this, I haven’t prepared the full review yet. However, we will briefly see what it is like.

So, Arora offers the idea that any text is generated by a generative model. In the process, at each time step the word is generated. The model consists of a context vector and embedding vectors . In separate dimensions (dimensions), or coordinates, of a vector, information and attributes of words are encoded. For example, we could say that in some abstract vector, the positive value of the fifth dimension is responsible for masculinity (man, king), and the negative - for femininity (woman, queen), and the positive value of the hundredth dimension indicates that the word will be verb, and negative - a noun.

The context vector makes a slow evenly distributed random walk, i.e. over time, the values ​​of its coordinates change to some vector semulated from the uniform distribution. Thus, the information what is being encoded is currently being discussed. Conditions for the slow speed and uniformity of the process are needed so that words close in meaning have a high probability of being generated together.In other words, this condition allows you to generate an interconnected text: the sequence of words “Girl combed her braid” is more likely to be generated than “Man combed his braid”. However, the model does not impose a strict ban on “jumping”: after all, some men do wear long hair.

Unlike the context vector, the embedding vectors are static. In fact, these are mappings that allow you to compare a word with a specific set of information that is encoded in the coordinates of the vectors.
The process of generating text is very simple: moving in the space of vectors, the context vector becomes closer to one word, then to another. The probability that at the step the word will be generated by the expression

where - vector context for the time , - embedding the word , and - this is partition function. It plays the role of a normalizer, which ensures that we get a valid probability.

Now, using the example of images, we will try to understand this using the simplest example. Suppose we have a two-dimensional space in which there are only embeddings for four words: queen, king, woman, man . We could encode status information along the y axis, and encode the gender along x. Then the picture could look as follows

Now we need a context vector. It will be just a separate point near some vector that happened to be there at some time step. Here she is

Now, let's imagine that he is moving slowly. Let us agree that this means that he cannot, at the countdown, pass a distance greater than one and without "jumps". Then the trajectory might look like this:

It turns out that such a detour along the contour of the square will generate the text "Queen, King, Man, Woman." In this case, for our simple model, just such a text is interconnected, since we can not change the distance by more than one, which means we can’t get the text “queen, man, woman, king” because they are “not so” strongly related statistically.

### Come back

Arora began by making a modification to his generative model from a previous article. In the new model, he suggests replacing the uniform distribution along which the context vector moves with Gausovsky. This gives rise to some changes in the calculations from the first part, but we will not consider them (at the moment, in any case). Further, based on this model, he puts forward and proves two theorems. I will give them here without proof.

### Theorem 1

Assuming the model described above, let this is a random variable for a window with a long words. Then there is a linear transformation such, what

This means that between a certain word and its window there is a linear relationship. The article considers the following thought experiment. Let's take some word and put it together for this word all the windows. Denote the set of all windows by . Then, take the average each windows , and then take the average of all the window averages and denote it by . Then, according to the theorem, we can display this vector at using a certain matrix (which still need to find). It should be said that if such a matrix were found, it would reduce the problems of out-of-vocabulary words, since it is clearly easier to calculate the average of all windows and multiply this by the matrix than to completely retrain all vectors from scratch.

Arora is conducting an experiment to test this. To understand it, the concept of SIF embeddings is necessary. Although there is a note about them in this article, in general, a separate article is devoted to them, which I will also discuss later. For now, it will be enough to know that SIF embeddings is the embedding for the text window k, which is considered as the average of the embeddings included in this window of the words times their TF-IDF.

By the way, it turns out that from the perspective of Theorem 1, these embeddings are the maximum likelihood estimate for the context vector . As you can see, this is the easiest way to embed some piece of text, for example, a sentence.

Going back to the experiment. Imagine that we have a matrix of embeddings and some word for which there is no embedding in the matrix, but we want to count it.The algorithm will look like this:

1. We randomly select many paragraphs from Wikipedia. Let us denote the dictionary of all such a sub-case .
2. For each word , for which embeddings are known, and of all their occurrences among the paragraphs, individually, we calculate the SIF embeddings for 20-word windows centered at using our well-known embedding matrix. As a result, for each word we get a lot of vectors , where n is the number of entries of the word in the sub-body.
3. Calculate the average of all SIF embeddings for each word using the formula .
4. We solve the regression
5. We take the average of all SIF embeddings for our unknown word and consider

Arora shows that these so-called induced embeddings are pretty close to their originals. He thus calculated the embeddings for 1/3 of the words from the known matrix, having trained the matrix A in the remaining 2 \ 3 parts. The table below shows the results from the original work. Pay attention to the number of paragraphs.

#paragraphs 250k 500k 750k 1 million
cos similarity 0.94 0.95 0.96 0.96

### Theorem 2

Suppose that there is a certain word with two different values ​​ and . We get the vector for this word on some case where both of these values ​​are present. Then, in the same case, manually put down labels for these values, i.e. artificially remake words into pseudowords, for example, tie_1 and tie_2, where the word tie_1 is for the value for the tie and tie 2 for the node.
For the new corpus with our pseudo-words, we count the embeddings again and denote them, respectively, \$ & lt;! - math > \$ inline \$ \ upsilon
{w {s1}} \$ inline \$ & lt;/math - > \$ and \$ & lt;! - math > \$ inline \$ \ upsilon {w_ {s2}} \$ inline \$ & lt;/math - > \$.Again, assuming our generative model, the theorem states that these embeddings will satisfy where is defined as

where and this is the number of occurrences of the values ​​ and in the enclosure. As you can see, this is exactly the same expression as at the beginning of the article.

Okay, all of this, of course, is good, but how can we reconstruct this linear combination of word meanings? There can be a gigantic amount of them, even if we know . The generative model will come to our aid again. Assume the context vector now It is located next to the topic of girls and, therefore, a model about them will generate text. Then it is logical to assume that if the word braid appears at the moment, then it will have the meaning hairstyles . Moreover, words such as eyes, outfit, beautiful, dress should be generated with high probability. Also, according to our model, this probability is defined as the inner product between the context vector and the word vector. Then let's assume that the context vector that has the largest internal product with all contextually close words ( eyes, outfit, beautiful, dress ) will be the vector for the value braids, like a female hairstyle! It remains only to heal these vectors.

Let's play maths again. What do we want? Having a set of word vectors in and two integers . such that , we want to find the set of vectors such that

where no more than the coefficients are not equal to zero and is the error vector.To solve this problem, we can write the following optimization equation

This is a typical problem of sparse coding, where k is called the sparsity parameter, and m is the number of so-called atoms, which in our case are the same vectors. This can be solved using the k-SVD algorithm. It is worth noting that this task is not a clean solution. We can only hope that among the set there is an axis corresponding to certain topics (we get that the set looks like a coordinate system). In order to be sure that any atom is participates in expressing its attribute or topic in words, expressing context, and is not a blank axis, we need to limit is much less than the number of words. The authors of the original article called it the atoms of discourse.

## Practice

Let's finally code this and look at these atoms with our own eyes.

``import numpy as np from gensim.test.utils import datapath, get_tmpfile from gensim.models import KeyedVectors from gensim.scripts.glove2word2vec import glove2word2vec from scipy.spatial.distance import cosine import warnings warnings.filterwarnings('ignore') ``

Please note that I and the author use vectors in 300-meter space.

``tmp_file=get_tmpfile("test_word2vec.txt") _=glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file) model=KeyedVectors.load_word2vec_format(tmp_file) ``

``embeddings=model.wv index2word=embeddings.index2word embedds=embeddings.vectors ``

``print(embedds.shape) ``

``(400000, 300) ``

We have 400,000 unique words.

2. Install and apply k-svd to the embedding matrix
Now we need to get the atoms of discourse using sparse recovery. We will use the ksvd package.

``!pip install ksvd from ksvd import ApproximateKSVD ``

``Requirement already satisfied: ksvd in/home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3) Requirement already satisfied: numpy in/home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5) Requirement already satisfied: scikit-learn in/home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1) ``

In the article, the author chose the number of atoms equal to 2000 and the rarefaction parameter equal to 5.
I trained two versions: the first for 10,000 embeddings and the second for everyone. Since this process takes a lot of time, especially for the full matrix, I saved the results so that you can just download them.

``%time aksvd=ApproximateKSVD(n_components=2000,transform_n_nonzero_coefs=5, ) embedding_trans=embeddings.vectors dictionary=aksvd.fit(embedding_trans).components_ gamma=aksvd.transform(embedding_trans) ``

``CPU times: user 4 µs, sys: 0 ns, total: 4 µs Wall time: 9.54 µs ``

``#gamma=np.load('./data/mats/.npz') # dictionary_glove6b_300d.np.npz - whole matrix file dictionary=np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz') dictionary=dictionary[dictionary.keys()[0]] ``

``#print(gamma.shape) print(dictionary.shape) ``

``(2000, 300) ``

``#np.savez_compressed('gamma_glove6b_300d.npz', gamma) #np.savez_compressed('dictionary_glove6b_300d.npz', dictionary) ``

3. Determining the Relationship Between Atoms and the Source Matrix

Let's play with our atoms and see what we got. Let's look at the nearest neighbors of several selected atoms.

``embeddings.similar_by_vector(dictionary[1354,:]) ``

``[('slave', 0.8417330980300903), ('slaves', 0.7482961416244507), ('plantation', 0.6208109259605408), ('slavery', 0.5356900095939636), ('enslaved', 0.4814416170120239), ('indentured', 0.46423888206481934), ('fugitive', 0.4226764440536499), ('laborers', 0.41914862394332886), ('servitude', 0.41276970505714417), ('plantations', 0.4113745093345642)] ``

``embeddings.similar_by_vector(dictionary[1350,:]) ``

``[('transplant', 0.7767853736877441), ('marrow', 0.699995219707489), ('transplants', 0.6998592615127563), ('kidney', 0.6526087522506714), ('transplantation', 0.6381147503852844), ('tissue', 0.6344675421714783), ('liver', 0.6085026860237122), ('blood', 0.5676015615463257), ('heart', 0.5653558969497681), ('cells', 0.5476219058036804)] ``

``embeddings.similar_by_vector(dictionary[1546,:]) ``

``[('commons', 0.7160810828208923), ('house', 0.6588335037231445), ('parliament', 0.5054076910018921), ('capitol', 0.5014163851737976), ('senate', 0.4895153343677521), ('hill', 0.48859673738479614), ('inn', 0.4566132128238678), ('congressional', 0.4341348707675934), ('congress', 0.42997264862060547), ('parliamentary', 0.4264637529850006)] ``

``embeddings.similar_by_vector(dictionary[1850,:]) ``

``[('okano', 0.2669774889945984), ('erythrocytes', 0.25755012035369873), ('windir', 0.25621023774147034), ('reapportionment', 0.2507009208202362), ('qurayza', 0.2459488958120346), ('taschen', 0.24417680501937866), ('pfaffenbach', 0.2437630295753479), ('boldt', 0.2394050508737564), ('frucht', 0.23922981321811676), ('rulebook', 0.23821482062339783)] ``

Amazing result! It looks like atoms are centers for similar words. Let’s now specifically take a couple of ambiguous words and find the nearest atom for them. For each atom, a list of nearest neighbors is displayed to help you understand what context this atom describes. For English, let's try to take "tie" and "spring" from the article.

``itie=index2word.index('tie') ispring=index2word.index('spring') tie_emb=embedds[itie] string_emb=embedds[ispring] ``

``simlist=[] for i, vector in enumerate(dictionary): simlist.append( (cosine(vector, tie_emb), i) ) simlist=sorted(simlist, key=lambda x: x[0]) six_atoms_ind=[ins[1] for ins in simlist[:15]] for atoms_idx in six_atoms_ind: nearest_words=embeddings.similar_by_vector(dictionary[atoms_idx,:]) nearest_words=[word[0] for word in nearest_words] print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words))) ``

``Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring Atom #58: game play match matches games played playing tournament players stadium Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2 Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans Atom #281: black wearing man pair white who girl young woman big Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled Atom #98: first team start final second next time before test after Atom #1455: after later before when then came last took again but Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore ``

``simlist=[] for i, vector in enumerate(dictionary): simlist.append( (cosine(vector, string_emb), i) ) simlist=sorted(simlist, key=lambda x: x[0]) six_atoms_ind=[ins[1] for ins in simlist[:15]] for atoms_idx in six_atoms_ind: nearest_words=embeddings.similar_by_vector(dictionary[atoms_idx,:]) nearest_words=[word[0] for word in nearest_words] print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words))) ``

``Atom #528: autumn spring summer winter season rainy seasons fall seasonal during Atom #1070: start begin beginning starting starts begins next coming day started Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities Atom #1455: after later before when then came last took again but Atom #754: but so not because even only that it this they Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre Atom #1335: last ago year months years since month weeks week has Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony Atom #98: first team start final second next time before test after Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic ``

Very good! Indeed, though with some errors, we see atoms that reflect different meanings of words.
Interestingly, if this is done on the entire matrix, the result will be much less accurate. I suspect this is due to the uncertainty of the problem that was mentioned earlier.

Now let's try to do the same for the Russian language. We will use fastText as an embedding provided by RusVectores . They were trained at NKRY with a dimension of 300.

``fasttext_model=KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model') ``

``embeddings=fasttext_model.wv index2word=embeddings.index2word embedds=embeddings.vectors ``

``embedds.shape ``

``(164996, 300) ``

``%time aksvd=ApproximateKSVD(n_components=2000,transform_n_nonzero_coefs=5, ) embedding_trans=embeddings.vectors[:10000] dictionary=aksvd.fit(embedding_trans).components_ gamma=aksvd.transform(embedding_trans) ``

``CPU times: user 1 µs, sys: 2 µs, total: 3 µs Wall time: 6.2 µs ``

``dictionary=np.load('./data/mats/dictionary_rus_fasttext_300d.npz') dictionary=dictionary[dictionary.keys()[0]] ``

``embeddings.similar_by_vector(dictionary[1024,:], 20) ``

``[('исчезать', 0.6854609251022339), ('бесследно', 0.6593252420425415), ('исчезавший', 0.6360634565353394), ('бесследный', 0.5998549461364746), ('исчезли', 0.5971367955207825), ('исчез', 0.5862340927124023), ('пропадать', 0.5788886547088623), ('исчезлотец', 0.5788123607635498), ('исчезнувший', 0.5623885989189148), ('исчезинать', 0.5610565543174744), ('ликвидироваться', 0.5551878809928894), ('исчезнуть', 0.551397442817688), ('исчезнет', 0.5356274247169495), ('исчезание', 0.531707227230072), ('устраняться', 0.5174376368522644), ('ликвидируть', 0.5131562948226929), ('ликвидировать', 0.5120065212249756), ('поглощаться', 0.5077806115150452), ('исчезаний', 0.5074601173400879), ('улетучиться', 0.5068254470825195)] ``

``embeddings.similar_by_vector(dictionary[1582,:], 20) ``

``[('простой', 0.45191124081611633), ('простоть', 0.4515378475189209), ('простота', 0.4478364586830139), ('наибол', 0.4280813932418823), ('простототь', 0.41220104694366455), ('простейший', 0.40772825479507446), ('простотое', 0.4047147035598755), ('наиболь', 0.4030646085739136), ('наилучше', 0.39368513226509094), ('формула', 0.39012178778648376), ('простое', 0.3866344690322876), ('просто', 0.37968817353248596), ('наглядна', 0.3728911876678467), ('простейшее', 0.3663109242916107), ('первооснова', 0.3640827238559723), ('наибольший', 0.3474290072917938), ('первородство', 0.3473641574382782), ('легко', 0.3468908369541168), ('наилучший', 0.34586742520332336), ('авось', 0.34555742144584656)] ``

``embeddings.similar_by_vector(dictionary[500,:], 20) ``

``[('фонд', 0.6874514222145081), ('-фонд', 0.5172050595283508), ('фондю', 0.46720415353775024), ('жилфонд', 0.44713956117630005), ('госфонд', 0.4144558310508728), ('фильмофонд', 0.40545403957366943), ('гвардия', 0.4030636250972748), ('хедж-фонд', 0.4016447067260742), ('генофонд', 0.38331469893455505), ('литфонд', 0.37292781472206116), ('госфильмофонд', 0.3625457286834717), ('фондан', 0.35121074318885803), ('стабфонд', 0.3504621088504791), ('фондъ', 0.34097471833229065), ('тонд', 0.33320850133895874), ('бонд', 0.3277249336242676), ('эхо', 0.3266661763191223), ('ржонд', 0.31865227222442627), ('юрий::левада', 0.30150306224823), ('ргва', 0.2975207567214966)] ``

``itie=index2word.index('коса') ispring=index2word.index('ключ') tie_emb=embedds[itie] string_emb=embedds[ispring] ``

``simlist=[] for i, vector in enumerate(dictionary): simlist.append( (cosine(vector, string_emb), i) ) simlist=sorted(simlist, key=lambda x: x[0]) six_atoms_ind=[ins[1] for ins in simlist[:10]] for atoms_idx in six_atoms_ind: nearest_words=embeddings.similar_by_vector(dictionary[atoms_idx,:]) nearest_words=[word[0] for word in nearest_words] print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words))) ``

``Atom #185: загадка загадк загадкай загад проблема вопрос разгадка загадать парадокс задача Atom #1217: дверь дверью двери дверка дверной калитка ставень запереть дверь-то настежь Atom #1213: папка бумажник сейф сундук портфель чемодан ящик сундучк пачка сундучок Atom #1978: кран плита крышка вентиль клапан электроплита котел плитка раковина посуда Atom #1796: карман пазуха кармашек бумажник карманута карманбыть пазух карманчик карманьол кармашка Atom #839: кнопка кнопф нажимать кноп клавиша нажать кнопа кнопочка рычажок нажатие Atom #989: отыскивать искать отыскиваться поискать разыскивать разыскиваться поиск поискивать отыскать отыскаться Atom #414: молоток молот топор пила колот молотобоец молотой кувалда молота умолот Atom #1140: капиталец капитал капиталовек капиталист капитально капитализм -капиталист капитальный капиталоемкий капиталовложение Atom #878: хранитель хранить храниться хранивший хранивать хранимый храниваться хранилище хранеть хранившийся ``

``simlist=[] for i, vector in enumerate(dictionary): simlist.append( (cosine(vector, tie_emb), i) ) simlist=sorted(simlist, key=lambda x: x[0]) six_atoms_ind=[ins[1] for ins in simlist[:10]] for atoms_idx in six_atoms_ind: nearest_words=embeddings.similar_by_vector(dictionary[atoms_idx,:]) nearest_words=[word[0] for word in nearest_words] print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words))) ``

``Atom #883: косой русый кудряшка косичка челка русой черноволосой кудрявый кудряш светло-русый Atom #40: кустарник заросль осока ивняк трава бурьян папоротник кустик полукустарник бурьяна Atom #215: ниточка паучок бусинка паутинка жердочка стебелька веточка стебелек травинка пупырышек Atom #688: волос валюта кудри валютный борода валютчик ус бивалютный коса усы Atom #386: плечотец грудь шея подбородок бедро грудью ляжка плечо затылок живот Atom #676: веревка канат бечевка веревочка бечевкий шест репшнур жердь веревочный ремень Atom #414: молоток молот топор пила колот молотобоец молотой кувалда молота умолот Atom #127: сюртучок сюртук галстучок фрак панталоны галстучек сюртуки галстук платье галстух Atom #592: салфетка скатерть салфеточка платок шаль полотенце кружевной кружевцо кисея шелка Atom #703: шлюпка катер баркас фок-мачта грот-мачта мачта фрегат судно корвет шхуна ``

``#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma) #np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary) ``

The result is also pretty impressive.

# Conclusion

Some things from the article were not included in the review of the article, such as the algorithm for the induction of the meaning of words (Word sense indection), which is based on all these calculations, and I did not conduct the experiment for Theorem 1. This article is one of the bricks in Understanding how embeddings work. I will say in advance that there are articles that show that not all the assumptions of this author are correct. However, surprisingly, experiments show that the approach works. We will see what happens next.

.

Source