Using graph analysis methods to search for anomalies
First Count of the Russian Empire Boris Petrovich Sheremetev. No anomalies found.
In the introduction, the authors answer the question: “What are the advantages of searching for anomalies using graph theory?”:
- most of the data that we encounter is interconnected, and you need to use entities that can take this into account;
- graphs are extremely convenient and understandable to humans;
- anomalies are usually related to each other, especially when considering fraud;
- fraudsters find it more difficult to adapt behavior to such methods, since they lack a vision of the graph as a whole and, accordingly, an understanding of how to adapt their strategy in accordance with the possible actions of antifraud systems.
I would add that, although for some types of data the transition to a graph representation requires carefully chosen rules and certain tricks, for many types of data (social networks, banking operations, computer networks) this representation is natural and calls for methods designed to work with it.
Thus, the prospect of using graph theory to search for anomalies is justified. Let's move on to the description of the methods themselves.
In the article, graph methods for anomaly detection are divided depending on whether they are applied to static or dynamic graphs. The authors divide static graphs into two groups: plain graphs and graphs whose nodes and edges carry properties (attributes). Within each subgroup, the authors divide the approaches into structural and community-based ones.
Search for anomalies in graphs without properties assigned to vertices and/or edges
For plain graphs, three directions are distinguished: anomaly detection based on features (of nodes, dyads, triads and egonets), based on communities, and based on proximity metrics (PageRank, personalized PageRank, SimRank). The idea is to use conventional algorithms (for example, Isolation Forest, or standard classifiers if labels are available) to find anomalies in the graph, but to feed them graph-based features.
Example of Egonet
A separate approach uses features of egonets: subgraphs that include the target node and its nearest neighbors. Referring to a 2010 paper (Akoglu et al., 2010), the authors suggest looking for patterns among egonets, that is, pronounced dependencies between their characteristics; egonets that do not fit these patterns are considered anomalous, and so are their central nodes. (There will be many such references in this article. I have given hyperlinks to some of them in the text; more detailed references, for example with page numbers, are in the list at the end.) Since many different egonet-based indicators are possible, the same paper justifies the choice of a subset that best reflects the properties of the graph (for example, the number of triangles or the total weight of the edges). The distance between an egonet and a pattern is proposed to be measured as the deviation of the egonet's characteristics from the dependence that defines the pattern.
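The egonet idea can be sketched in a few lines. The code below is only an illustration under my own assumptions: it uses two egonet features (number of nodes N and number of edges E), fits a straight line in log-log space, and uses the absolute residual as the deviation from the pattern. The function names and the toy graph are hypothetical, and the residual is a simplification, not the scoring function from the cited paper.

```python
import math
from itertools import combinations


def egonet_features(adj):
    """Return {node: (n_nodes, n_edges)} for each 1-step egonet.

    adj maps every node to the set of its neighbours (undirected graph,
    no isolated nodes, so every egonet has at least one edge).
    """
    feats = {}
    for ego, nbrs in adj.items():
        members = sorted(nbrs | {ego})
        # Count each undirected edge inside the egonet exactly once.
        n_edges = sum(1 for u, v in combinations(members, 2) if v in adj[u])
        feats[ego] = (len(members), n_edges)
    return feats


def pattern_deviation(feats):
    """Fit log(E) = a + b * log(N) by least squares over all egonets and
    return each egonet's absolute log-residual (larger = less typical)."""
    xs = [math.log(n) for n, _ in feats.values()]
    ys = [math.log(e) for _, e in feats.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes egonet sizes are not all equal
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return {node: abs(math.log(e) - (a + b * math.log(n)))
            for node, (n, e) in feats.items()}


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
features = egonet_features(toy)
deviations = pattern_deviation(features)
```

On a real graph, the nodes with the largest deviations would be the first candidates for manual inspection.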
Another branch of graph-based anomaly detection relies on finding densely connected groups, or communities. Nodes or edges that do not belong to any community, or that belong to several at once, are considered anomalous.
The authors also describe other approaches:
- clustering of nodes based on the similarity of their immediate environment; a reorganization of the adjacency matrix is proposed to produce denser and less dense blocks (Chakrabarti 2004; Rissanen, 1999);
- matrix factorization; an analogue of Non-negative matrix factorization (NMF) is proposed (Tong and Lin 2011).
Search for anomalies in graphs with nodes and/or edges with properties
The main idea of such approaches is to solve two problems in sequence: first find subgraphs that are anomalous in structure, and then search within this subset for subgraphs that are anomalous in the properties of their nodes and/or edges. The authors treat this as a search for rare elements in a set, which reduces it to the inverse of finding the elements that occur most often in the graph and are therefore most characteristic of it (the ones that "compress" the graph best).
The problem of nodes carrying many characteristics of different modalities is also considered. It is proposed to map all conditionally normal values to a single value of a categorical variable, and to map outliers to a single anomaly indicator obtained, for example, from distance-based anomaly detection methods such as kNN or LOF.
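As a rough illustration of this mapping, the sketch below scores each node's numeric attribute vector by the distance to its k-th nearest neighbor and collapses the result into a two-valued categorical label. The function names, the value of k and the threshold are my own assumptions, chosen only for the example.

```python
import math


def knn_scores(points, k=2):
    """Distance from each point to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def to_categorical(points, k=2, threshold=2.0):
    """Collapse numeric attribute vectors into one categorical variable."""
    return ["anomalous" if s > threshold else "normal"
            for s in knn_scores(points, k)]


# Hypothetical node attributes: four similar nodes and one distant one.
node_attributes = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = to_categorical(node_attributes)
```

The resulting labels can then be attached to nodes as an ordinary categorical attribute.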
Other discretization approaches are also listed: SAX (Symbolic Aggregate approXimation) (Lin et al., 2003), MDL binning (minimum description length) (Kontkanen and Myllymäki, 2007) and minimum entropy discretization (Fayyad and Irani, 1993). Eberle and Holder (2007) take a different approach to defining anomalies in graph data: they consider anomalous those subgraphs that are similar, within certain limits, to a conditionally normal subgraph. They justify this by noting that the most successful fraudsters will try to imitate normal behavior as closely as possible. They also propose taking into account the cost of a modification and defining anomaly scores with this cost in mind (the lower the cost, the more anomalous the instance).
Anomaly detection for graphs with attributed nodes is also considered in the community-based paradigm. It is proposed to divide the graph into communities and then, within each community, look for anomalies by attribute. An example is a smoker on a baseball team: a smoker is not an anomaly for society as a whole, but within this community he is. Another approach (Müller et al., 2013) is based on a set of nodes selected by the user (analyst), for which a subspace of attributes in which they are similar is then determined. The anomalies in this approach are nodes that structurally belong to the cluster of these nodes but are far from them in the selected attribute subspace.
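The "smoker on a baseball team" example can be sketched as follows, assuming the communities have already been found by some community detection method. The function name, the `max_share` threshold and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def community_attribute_outliers(community, attribute, max_share=0.2):
    """Flag nodes whose attribute value is rare inside their own community.

    community: {node: community id}; attribute: {node: categorical value}.
    """
    members = defaultdict(list)
    for node, comm in community.items():
        members[comm].append(node)
    flagged = []
    for nodes in members.values():
        counts = Counter(attribute[n] for n in nodes)
        for n in nodes:
            # Share of community members that have the same value as n.
            if counts[attribute[n]] / len(nodes) <= max_share:
                flagged.append(n)
    return sorted(flagged)


# Hypothetical data: a six-person team with one smoker, and a bar crowd.
community = {"p1": "team", "p2": "team", "p3": "team", "p4": "team",
             "p5": "team", "p6": "team",
             "b1": "bar", "b2": "bar", "b3": "bar", "b4": "bar"}
attribute = {"p1": "non-smoker", "p2": "non-smoker", "p3": "non-smoker",
             "p4": "non-smoker", "p5": "non-smoker", "p6": "smoker",
             "b1": "smoker", "b2": "smoker", "b3": "smoker", "b4": "non-smoker"}
outliers = community_attribute_outliers(community, attribute)
```

Here smoking is common overall (four of ten people), so a global check would not flag it; only the comparison inside the team does.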
Semi-supervised methods are considered separately, under the assumption that some of the nodes are labeled as normal or anomalous and the remaining nodes can be classified using suitable methods; in the simplest case, they are assigned the labels of neighboring nodes. The main approaches listed are the iterative classification algorithm, Gibbs sampling, loopy belief propagation, and the weighted-vote relational network classifier.
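The simplest case, taking labels from neighbors, can be sketched as an iterative weighted vote. This is a minimal illustration in the spirit of the weighted-vote classifier, not a reference implementation: the function name, the 0.5 starting value for unlabeled nodes and the fixed number of iterations are my assumptions.

```python
def weighted_vote(adj, labels, n_iter=20):
    """Estimate P(anomalous) for unlabeled nodes from their neighbours.

    adj: {node: {neighbour: edge weight}} for an undirected graph.
    labels: {node: 1.0 (anomalous) or 0.0 (normal)} for the known nodes.
    """
    scores = {n: labels.get(n, 0.5) for n in adj}
    for _ in range(n_iter):
        new = dict(scores)
        for n in adj:
            if n in labels:
                continue  # known labels stay fixed
            total = sum(adj[n].values())
            if total > 0:
                # Weighted average of the neighbours' current scores.
                new[n] = sum(w * scores[m] for m, w in adj[n].items()) / total
        scores = new
    return scores


# Hypothetical chain a - b - c - d, with a known anomalous and d known normal.
chain = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
         "c": {"b": 1.0, "d": 1.0}, "d": {"c": 1.0}}
estimates = weighted_vote(chain, {"a": 1.0, "d": 0.0})
```

In the chain, the node next to the known anomaly ends up with a higher score than the node next to the known normal one.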
Search for anomalies in a dynamic graph
For a dynamic graph, which is a sequence of static graphs ordered in time, the basic approach is as follows:
- a compressed representation or summary characteristic of each static graph is extracted;
- the distance between consecutive graphs is calculated;
- graphs for which the distance exceeds a threshold are considered anomalous.
The following distance measures are proposed:
- maximum common subgraph (MCS) distance;
- error-correcting graph matching distance, that is, a distance that measures how many edit steps are needed to turn one graph into the other;
- graph edit distance (GED), the same as the previous one, but only topological changes are possible;
- distances between adjacency matrices (for example, Hamming);
- various distances based on edge weights;
- distances between the spectral representation of graphs (eigenvector distributions);
- a more exotic measure is also described: the Euclidean distance between the Perron eigenvectors of the graph.
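The basic scheme for dynamic graphs can be sketched with the simplest of these measures, the Hamming distance between adjacency matrices, which for unweighted graphs equals the number of edges present in one snapshot but not in the other. The snapshots, the threshold and the function names below are hypothetical.

```python
def hamming_distance(edges_a, edges_b):
    """Number of edges that appear in exactly one of the two snapshots."""
    return len(edges_a ^ edges_b)


def anomalous_snapshots(snapshots, threshold):
    """Return consecutive distances and indices of snapshots that changed
    by more than `threshold` compared with the previous snapshot."""
    distances = [hamming_distance(a, b)
                 for a, b in zip(snapshots, snapshots[1:])]
    flagged = [t + 1 for t, d in enumerate(distances) if d > threshold]
    return distances, flagged


# Each snapshot is a set of undirected edges stored as sorted tuples.
g0 = {(1, 2), (2, 3)}
g1 = {(1, 2), (2, 3), (3, 4)}
g2 = {(1, 3), (1, 4), (2, 4)}
distances, flagged = anomalous_snapshots([g0, g1, g2], threshold=3)
```

Any other distance from the list could be substituted for `hamming_distance` without changing the rest of the scheme.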
Bunke et al. (2006) propose considering the distance not only between successive graphs but between all pairs of graphs in the sequence, and then applying multidimensional scaling to map the graphs into a two-dimensional space. Outliers are then sought in this two-dimensional space.
Another way of working with dynamic graphs is also described (Box and Jenkins, 1990): each graph is summarized by a single number (a computed indicator), and then standard time-series anomaly detection methods are applied, for example by looking at discrepancies with an ARIMA model.
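To show the shape of this approach without pulling in a time-series library, the sketch below replaces the ARIMA model with a much simpler rolling z-score; this substitution is mine, as are the function name, the window size and the threshold. The series stands for any per-graph indicator, for example the number of edges in each snapshot.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=5, z_threshold=3.0):
    """Flag time steps whose value is far from the preceding window."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            deviates = series[t] != mu
        else:
            deviates = abs(series[t] - mu) / sigma > z_threshold
        if deviates:
            flagged.append(t)
    return flagged


# Hypothetical edge counts of ten consecutive graph snapshots.
edge_counts = [10, 11, 10, 12, 11, 10, 40, 11, 10, 12]
spikes = rolling_zscore_anomalies(edge_counts)
```

With a fitted ARIMA model, the same thresholding would be applied to the forecast residuals instead of the raw z-score.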
In an article by Akoglu and Faloutsos (2010), the authors perform the following sequence of operations:
- extract F features for each node of the graph at each moment of time;
- for each feature, compute correlation matrices between nodes over a time window W;
- compute the eigenvectors and then consider only the first (principal) eigenvector;
- in parallel, extract the "typical" behavior of the eigenvectors of the correlation matrix (for this, one more SVD is performed over the matrix of past eigenvectors);
- compare the typical vector with the actual one (via the cosine between them), thus obtaining an anomaly score for the time window under consideration.
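The steps above can be sketched with NumPy for a single feature. This is my reading of the procedure, not the authors' code: the function names, the `history` parameter (how many past windows define the typical behavior) and the score `1 - |cos|` are assumptions. Every node's feature must vary inside each window, otherwise the correlation matrix is undefined.

```python
import numpy as np


def principal_eigenvector(window):
    """window: array of shape (W, n_nodes) with one feature per node.

    Returns the unit-length leading eigenvector of the node-by-node
    correlation matrix computed over the window."""
    corr = np.corrcoef(window.T)            # shape (n_nodes, n_nodes)
    _, eigvecs = np.linalg.eigh(corr)       # eigenvalues in ascending order
    return eigvecs[:, -1]


def window_scores(feature_series, W, history=3):
    """feature_series: array of shape (T, n_nodes). Returns one anomaly
    score per window, starting after `history` windows are available."""
    T = feature_series.shape[0]
    vectors = [principal_eigenvector(feature_series[t:t + W])
               for t in range(T - W + 1)]
    scores = []
    for i in range(history, len(vectors)):
        past = np.column_stack(vectors[i - history:i])
        # Typical behaviour: leading left singular vector of past eigenvectors.
        typical = np.linalg.svd(past, full_matrices=False)[0][:, 0]
        # Both vectors have unit length, so the dot product is the cosine.
        scores.append(1.0 - abs(float(vectors[i] @ typical)))
    return scores
```

A score near 0 means the current window behaves like the recent past; a score near 1 means the dominant correlation pattern between nodes has changed.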
Matrix decomposition is also used in Rossi (2013):
- as in the previous approach, F features are extracted per node for each time interval;
- for each time interval, an NMF decomposition is performed, which assigns a role to each node;
- the role changes of each node are then monitored.
Matrix decomposition for interpreting results
I would also like to note the matrix approximation methods presented by the authors as alternatives to the well-known SVD, PCA and NMF: CUR (Drineas et al., 2006), CMD (Sun et al., 2007b) and Colibri (Tong et al., 2008). The main advantage of these methods is interpretability: unlike SVD, which maps points to another space, these methods leave the space intact and only sample points from it. The simplest of them is CUR, in which the authors note two drawbacks. First, it samples points (rows and columns) from the matrix with repetition. CMD removes this drawback; however, like CUR, it suffers from linear redundancy, which the authors of the Colibri algorithm manage to avoid. Although these methods were developed specifically for anomaly detection in graphs through matrix approximation, they look promising for other problems as well.
In the problems discussed in this review, these approaches are applied as follows: the matrix is approximated, and then one estimates how much individual rows/columns of the approximated matrix differ from the original. The authors also note the NrMF method (Tong and Lin, 2011), a modification of NMF in which the non-negativity constraint is imposed on the residual matrix R, since it carries the main information about the difference between the approximation and the original matrix, and negative values there would be difficult to interpret. Nevertheless, it is not completely clear to me why SVD cannot be used in a similar way: decomposition, reconstruction, and then calculation of the difference from the original matrix.
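The "approximate, then compare with the original" principle can be sketched with a truncated SVD, which is the variant asked about in the previous paragraph rather than CUR, CMD, Colibri or NrMF. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def row_reconstruction_errors(A, rank):
    """Approximate A by its best rank-`rank` SVD truncation and return the
    Euclidean norm of each row of the residual A - A_hat."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(A - A_hat, axis=1)


# Hypothetical matrix: two groups of identical rows plus one row (the last)
# that straddles both groups.
A = np.array([[1, 1, 1, 0, 0, 0]] * 4
             + [[0, 0, 0, 1, 1, 1]] * 4
             + [[1, 0, 0, 0, 0, 1]], dtype=float)
errors = row_reconstruction_errors(A, rank=2)
most_suspicious_row = int(np.argmax(errors))
```

Note that the residual here can contain negative entries, which is exactly the interpretability issue that NrMF is designed to avoid.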
Identification of nodes connecting anomalous ones
When analyzing the results, one may need to find the nodes that are associated with the anomalous ones. For example, a node that has suffered a DDoS attack may be detected as anomalous while the attacking nodes are not. Or the members of some group may be identified as anomalous while the people who lead them are not. To solve this problem, the authors propose several approaches whose main idea is to extract from the full graph a subgraph that contains the anomalous nodes and the nodes that best connect them.
- Definition of a connection subgraph (Faloutsos et al., 2004). The problem is solved in terms of electrical engineering: a positive potential is assigned to one of the given nodes and zero potential to the other, each edge is given a certain resistance, and one watches how the current "flows" between them.
- Center-Piece Subgraphs (CePS) (Tong and Faloutsos, 2006). In contrast to the previous method, only k of the given anomalous nodes need to be connected, since not all of the given nodes are necessarily related. Here k must be specified.
- Dot2Dot (Akoglu et al., 2013b; Chau et al., 2012). In this approach, the authors solve the problem of grouping selected nodes and then further select the nodes that connect them.
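The electrical analogy from the first approach above can be sketched as a small linear system. This shows only the voltage and current computation under my own assumptions (one source at potential 1, one target at potential 0, edge weights treated as conductances, every other node connected to the source or target by some path); it does not include the subgraph extraction step of the cited method, and the function names are hypothetical.

```python
import numpy as np


def potentials(n, edges, source, target):
    """Node voltages with `source` fixed at 1 and `target` fixed at 0.

    edges: list of (u, v, conductance) for an undirected graph on nodes 0..n-1.
    """
    L = np.zeros((n, n))                     # weighted graph Laplacian
    for u, v, c in edges:
        L[u, u] += c
        L[v, v] += c
        L[u, v] -= c
        L[v, u] -= c
    volt = np.zeros(n)
    volt[source] = 1.0
    free = [i for i in range(n) if i not in (source, target)]
    if free:
        # Kirchhoff's current law at every free node: net current is zero.
        rhs = -L[free, source] * volt[source]
        volt[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return volt


def edge_currents(edges, volt):
    """Current along each edge, from u to v (Ohm's law)."""
    return {(u, v): c * (volt[u] - volt[v]) for u, v, c in edges}


# Hypothetical path 0 - 1 - 2 - 3 with unit conductances.
path = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
volt = potentials(4, path, source=0, target=3)
currents = edge_currents(path, volt)
```

Edges that carry the most current are the natural candidates for the connecting subgraph.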
Examples of searching for anomalies in various fields
The authors describe cases where methods for detecting anomalies in graphs were used.
Telecommunications. The targets are people who use services without paying. Cortes et al. (2002) searched for subgraphs closely related to a key node in terms of the number and duration of calls. The authors made two observations. First, fraudulent accounts were connected: the offenders either called each other or called the same phone numbers. Second, offenders can be detected by the similarity of their subgraphs constructed in this way.
Online auctions. Offenders create fake accounts and use them to inflate their ratings. They cannot be tracked by the usual aggregate indicators, but they can be seen in the graph. Offender accounts are linked more to fake accounts than to honest ones. Fake accounts are linked about equally to offender accounts and honest ones. Honest accounts are mostly linked to other honest accounts. Pandit et al. (2007) solve this problem by modeling the graph as a relational Markov network and then classifying nodes with Loopy Belief Propagation (class labels are propagated iteratively along the graph).
Transactions. McGlohon et al. (2009) solve this problem through relational classification under the assumption that intruders will be close to each other. That is, similar to the approach from the previous example.
Brokers who cheat on securities. Here Neville et al. (2005) analyze a multimodal graph, highlighting subgraphs that include a person under suspicion, his colleagues, the company with which he is associated, etc. They calculate aggregated attributes and attribute them to the central node. Next, relational probability trees (Neville et al. 2003) are used for relational classification.
Search for fake posts on forums that provide false information. The authors describe several approaches using feature-based descriptions, text analysis, and graph methods. The graph methods of Wang et al. (2011a) were applied to the case where the data consists of reviews of some product. Their algorithm assigns a “degree of trust” to reviewers, “reliability” to their reviews, and “reliability” to the goods. All these indicators are interconnected. How much a reviewer can be trusted depends on how reliable their reviews are. The reliability of the goods depends on the degree of trust in the reviewers who describe them, and the reliability of a review depends on the reliability of the goods it is written about and on the trust in its author. The proposed algorithm first initializes these values randomly and then iteratively improves the estimates.
Trading. Fraudsters first make a large number of transactions with each other in some stock, increasing its attractiveness, and then, when the stock goes up in price, sell it to other traders. Both stages can be tracked in graph data. In the first stage, a subgraph with no external transactions stands out (a “black hole”), and after some time the same subgraph turns into one dominated by transactions from the subgraph to the rest of the graph (a “volcano”). The authors cite Li et al. (2010).
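For a given candidate group of traders, the two stages can be told apart by counting the transactions that cross the group's boundary. The check below is a simplified sketch with hypothetical names and data: it classifies a group that is already given and does not search for such groups, which is the hard part addressed in the cited work.

```python
def boundary_flows(edges, group):
    """Count directed edges entering and leaving `group`."""
    inflow = sum(1 for u, v in edges if u not in group and v in group)
    outflow = sum(1 for u, v in edges if u in group and v not in group)
    return inflow, outflow


def classify_group(edges, group):
    """Simplified labels: no outgoing edges -> 'blackhole';
    outgoing but no incoming edges -> 'volcano'; otherwise 'neither'."""
    inflow, outflow = boundary_flows(edges, group)
    if outflow == 0:
        return "blackhole"
    if inflow == 0:
        return "volcano"
    return "neither"


# Hypothetical transactions (seller -> buyer) in two periods.
ring = {"t1", "t2", "t3"}
period_1 = [("t1", "t2"), ("t2", "t3"), ("t3", "t1")]
period_2 = [("t1", "x"), ("t2", "y"), ("t3", "x"), ("t1", "t2")]
stage_1 = classify_group(period_1, ring)
stage_2 = classify_group(period_2, ring)
```

A group that moves from the first label to the second between periods matches the scheme described above.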
Web resources. One of the first approaches to fighting “bad” websites was to propagate “reliability” and “unreliability” scores between resources. If one page links to another, the linked page's status as reliable increases. If a page links to a page known to be spam, this reduces the reliability of the linking page. The TrustRank algorithm (Gyöngyi et al., 2004), a modification of PageRank for combating web spam, is mentioned. It requires experts to label some of the sites as reliable in advance. These scores are then propagated through the graph, gradually fading. Anti-TrustRank (Krishnan and Raj, 2006) follows the same principle but propagates unreliability scores from sites deliberately labeled as untrustworthy. Both methods have the drawback that the score is divided by the number of child nodes. Benczúr et al. (2005) suggest a completely different approach: analyze the PageRank of a site and of its neighbors. The PageRank distribution in such subgraphs should follow a certain power law. Nodes whose neighbors' PageRank distribution deviates from this law are penalized. Castillo et al. (2007) propose first training a classifier on known reliable and unreliable pages and then “smoothing” the resulting scores of the remaining websites over the graph.
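The trust propagation idea can be sketched as a PageRank-style iteration in which the random jump always returns to the expert-labeled seed pages. This is a minimal illustration with hypothetical names and a toy link graph, not the published algorithm in full (for instance, it does not include the seed selection step).

```python
def trust_propagation(out_links, trusted, damping=0.85, n_iter=50):
    """out_links: {page: list of pages it links to}; every linked page must
    also be a key. trusted: set of pages labeled reliable by experts."""
    nodes = list(out_links)
    seed = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    score = dict(seed)
    for _ in range(n_iter):
        new = {n: (1 - damping) * seed[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if not targets:
                continue  # a page without links passes no trust on
            # Trust is split equally among the pages this page links to.
            share = damping * score[n] / len(targets)
            for m in targets:
                new[m] += share
        score = new
    return score


# Hypothetical web graph: good -> a -> b, and an unrelated spam pair.
web = {"good": ["a"], "a": ["b"], "b": [], "spam": ["spam2"], "spam2": []}
trust = trust_propagation(web, trusted={"good"})
```

The equal split in `share` is the division by the number of child nodes mentioned above as a drawback, and the damping factor is what makes trust fade with distance from the seed.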
Social networks. To detect fraudulent posts on social networks (posted to collect likes, or to redirect to a malicious page or a questionnaire), the described approaches use conventional classifiers with graph features: how far “bad” posts spread through the graph, the number of likes and comments the post receives from other users, the similarity of the messages with which users spread the post, and the degree of the node of the user who wrote the post.
Attacks on computer networks. Sun et al. (2008) successfully apply matrix decomposition (CMD) to solve this problem. Ding et al. (2012) take a community search approach by highlighting bridges between communities as suspicious.
Until recently, graph theory came into contact with machine learning only in social network analysis, where there is simply no other way to work. Now the application of graph theory to classical ML problems is developing, but slowly, mainly because there are still few problems where switching to a graph representation pays off. At international conferences one can sense a gradual growth of this field, though mostly in theory rather than practice. Graph libraries are quite limited and poorly developed. Anomaly detection is an even rarer task, since it is done without labels and the quality of detection can only be assessed by experts. In many problems, switching to a graph description is not advisable.
If you would like to read about standard anomaly detection methods, I wrote about them a year ago on Habr in this article.
If you are interested in the topic and want more information, you should join the ODS (OpenDataScience) Slack, the #network_analysis and #class_cs224w channels, and watch the Stanford cs224w course.
More recently, a course on Knowledge Graphs was taught. And of course, you should read the article "Graph based anomaly detection and description: a survey" by Leman Akoglu, Hanghang Tong and Danai Koutra (Akoglu et al., 2015), which this post discusses. I did not translate all of it, only the fragments that I considered important and understood at least to some extent. Most authors refer specifically to this article, because there are no other reviews of this level and breadth on the topic. At least I did not find any.
- Akoglu L, McGlohon M, Faloutsos C (2010) OddBall: spotting anomalies in weighted graphs. In: Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Hyderabad, India, pp 410-421 https://link.springer.com/chapter/10.1007/978-3-642-13672-6_40
- Akoglu L, Faloutsos C (2010) Event detection in time series of mobile communication graphs. In: Proceedings of army science conference https://www.andrew.cmu.edu/user/lakoglu/pubs/EVENTDETECTION_AkogluFaloutsos.pdf
- Akoglu L, Vreeken J, Tong H, Duen HC, Tatti N, Faloutsos C (2013b) Mining connection pathways for marked nodes in large graphs. In: Proceedings of the 13th SIAM international conference on data mining (SDM), Texas-Austin, TX https://eda.mmci.uni-saarland.de/pubs/2013/dot2dot-akoglu,vreeken,tong,chau,tatti,faloutsos.pdf
- Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Disc 29, 626–688. https://www.andrew.cmu.edu/user/lakoglu/pubs/14-dami-graphanomalysurvey.pdf
- Benczúr AA, Csalogány K, Sarlós T, Uher M (2005) Spamrank: fully automatic link spam detection. In: Proceedings of the first international workshop on adversarial information retrieval on the web
- Box GEP, Jenkins G (1990) Time series analysis. Forecasting and Control, Holden-Day, Incorporated https://dl.acm.org/doi/book/10.5555/574978
- Bunke H, Dickinson PJ, Humm A, Irniger C, Kraetzl M (2006a) Computer network monitoring and abnormal event detection using graph matching and multidimensional scaling. In Proceedings of 6th industrial conference on data mining (ICDM), pp 576-590 https://dl.acm.org/doi/10.1007/11790853_45
- Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: web spam detection using the web topology. In: Proceedings of the 30th international conference on research and development in information retrieval (SIGIR), Amsterdam. ACM, pp 423-430 https://chato.cl/papers/cdgms_2006_know_your_neighbors.pdf
- Chakrabarti D (2004) Autopart: parameter-free graph partitioning and outlier detection. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy. Springer, New York, pp 112–124 https://dl.acm.org/doi/10.5555/1053072.1053085
- Chau DH, Akoglu L, Vreeken J, Tong H, Faloutsos C (2012) Tourviz: interactive visualization of connection pathways in large graphs. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD), Beijing, China, pp 1516–1519 https://asu.pure.elsevier.com/en/publications/tourviz-interactive-visualization-of-connection-pathways-in-large
- Cortes C, Pregibon D, Volinsky C (2002) Communities of interest. Intell Data Anal 6(3):211–219 https://dl.acm.org/doi/10.5555/647967.741620
- Ding Q, Katenka N, Barford P, Kolaczyk ED, Crovella M (2012) Intrusion as (anti)social communication: characterization and detection. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD), Beijing, China. ACM, pp 886–894 https://dl.acm.org/doi/10.1145/2339530.2339670
- Drineas P, Kannan R, Mahoney MW (2006) Fast monte carlo algorithms for matrices iii: computing a compressed approximate matrix decomposition. SIAM J Comput 36(1):184–206 https://epubs.siam.org/doi/abs/10.1137/S0097539704442702?mobileUi=0
- Eberle W, Holder LB (2007) Discovering structural anomalies in graph-based data. In: Proceedings of the international workshop on mining graphs and complex structures at the 7th IEEE international conference on data mining (ICDM), Omaha, NE. IEEE Computer Society, pp 393–398 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.8477&rep=rep1&type=pdf
- Faloutsos C, McCurley KS, Tomkins A (2004) Fast discovery of connection subgraphs. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 118–127 https://dl.acm.org/doi/10.1145/1014052.1014068
- Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 5th international joint conference on artificial intelligence (IJCAI), Chambery, France. Morgan Kaufmann, pp 1022–1029 https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf
- Gao H, Chen Y, Lee K, Palsetia D, Choudhary A (2012) Towards online spam filtering in social networks. In: Proceedings of the 19th annual network & distributed system security symposium http://cucis.ece.northwestern.edu/publications/pdf/GaoChe12.pdf
- Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with trustrank. In: Proceedings of the 30th international conference on very large data bases (VLDB), Canada, Toronto, pp 576–587 https://www.vldb.org/conf/2004/RS15P3.PDF
- Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res Proc Track 2:219–226 http://proceedings.mlr.press/v2/kontkanen07a/kontkanen07a.pdf
- Krishnan V, Raj R (2006) Web spam detection with anti-trust rank. In: Proceedings of the 2nd international workshop on adversarial IR on the Web at the 29th international conference on research and development in information retrieval (SIGIR), Seattle, WA, pp 37–40
- Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD), San Diego, CA. ACM, pp 2–11 https://www.cs.ucr.edu/~eamonn/SAX.pdf
- Li Z, Xiong H, Liu Y, Zhou A (2010) Detecting blackhole and volcano patterns in directed networks. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), Sydney, Australia. IEEE Computer Society, pp 294–303 http://datamining.rutgers.edu/publication/blackhole.pdf
- Molloy, Ian & Chari, Suresh & Finkler, Ulrich & Wiggerman, Mark & Jonker, Coen & Habeck, Ted & Park, Youngja & Jordens, Frank & Schaik, Ron. (2016). Graph Analytics for Real-time Scoring of Cross-channel Transactional Fraud.
- Müller E, Sánchez PI, Mülle Y, Böhm K (2013) Ranking outlier nodes in subspaces of attributed graphs. In: Proceedings of the 4th international workshop on graph data management: techniques and applications https://www.ipd.kit.edu/mitarbeiter/muellere/publications/GDM2013.pdf
- Rahman MS, Huang T.-K., Madhyastha HV, Faloutsos M (2012) Efficient and scalable socware detection in online social networks. In: Proceedings of the 21st USENIX conference on Security symposium (Security). USENIX Association, pp 32–32 https://dl.acm.org/doi/10.5555/2362793.2362825
- Rissanen J (1999) Hypothesis selection and testing by the MDL principle. Comput J 42:260–269 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.9851&rep=rep1&type=pdf
- Rossi RA, Gallagher B, Neville J, Henderson K (2013) Modeling dynamic behavior in large evolving graphs. In: Proceeding of the 6th ACM international conference on Web search and data mining (WSDM), pp 667–676 https://www.cs.purdue.edu/homes/neville/papers/rossi-et-al-wsdm2013.pdf
- Sun J, Xie Y, Zhang H, Faloutsos C (2007b) Less is more: compact matrix decomposition for large sparse graphs. In: Proceedings of the 7th SIAM international conference on data mining (SDM), Minneapolis, MN http://www.cs.cmu.edu/~christos/PUBLICATIONS/sdm07-lsm.pdf
- Sun J, Xie Y, Zhang H, Faloutsos C (2008) Less is more: sparse graph mining with compact matrix decomposition. Stat Anal Data Min 1(1): 6–22. ISSN 1932–1864 https://onlinelibrary.wiley.com/doi/abs/10.1002/sam.102
- Tong H, Faloutsos C (2006) Center-piece subgraphs: problem definition and fast solutions. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 404–413 http://www.cs.cmu.edu/~christos/PUBLICATIONS/kdd06CePS.pdf
- Tong H, Papadimitriou S, Jimeng S, Yu PS, Faloutsos C (2008) Colibri: fast mining of large static and dynamic graphs. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD), Las Vegas, NV, pp 686–694 http://www.cs.cmu.edu/~htong/pdf/kdd08_tong_1.pdf
- Tong H, Lin C-Y (2011) Non-negative residual matrix factorization with application to graph anomaly detection. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ, pp 143–153 http://www.cs.cmu.edu/~htong/pdf/sdm11_tong.pdf
- Tong H, Lin C-Y (2012) Non-negative residual matrix factorization: problem definition, fast solutions, and applications. Stat Anal Data Min 5(1):3–15 https://asu.pure.elsevier.com/en/publications/non-negative-residual-matrix-factorization-problem-definition-fas
- Wang G, Xie S, Liu B, Yu PS (2011a) Review graph based online store review spammer detection. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada, pp 1242–1247 https://www.cs.uic.edu/~sxie/paper/ICDM-2011-final.pdf
Yes, our CleverDATA team does more than write and translate articles. We spend most of our time solving interesting and varied practical problems, for which we use not only graph theory but also many machine learning and deep learning methods.