Structuring unstructured data with GROK


If you are using the Elastic (ELK) stack and are interested in mapping your custom Logstash logs to Elasticsearch, then this post is for you.




The ELK stack is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Together they form a log management platform.


  • Elasticsearch is a search and analytics engine.
  • Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” such as Elasticsearch.
  • Kibana lets users visualize Elasticsearch data with charts and graphs.

Beats came along later as a lightweight data shipper. Its introduction turned the ELK Stack into the Elastic Stack, but that's beside the point.


This article is about Grok, a filter in Logstash that can transform your logs before they are stored. For our purposes, I will only talk about getting data from Logstash into Elasticsearch.




Grok is a filter inside Logstash that is used to parse unstructured data into something structured and queryable. It sits on top of regular expressions (regex) and uses text patterns to match lines in log files.
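
For example, the built-in NUMBER pattern is essentially a named, reusable regular expression. A rough sketch of the equivalence (the regex here is simplified; the real pattern is stricter):

%{NUMBER:response_status}
(?<response_status>[+-]?\d+(?:\.\d+)?)   # simplified regex that NUMBER roughly expands to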


As we will see in the following sections, using Grok is important when it comes to managing logs efficiently.


Without Grok, your log data is unstructured




Without Grok, when logs are sent from Logstash to Elasticsearch and rendered in Kibana, everything ends up in a single message field.


Querying for meaningful information in this situation is difficult, since all the log data is stored under a single key. It would be better if the log entries were organized into separate fields.
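
For example, the sample record we will parse in the rest of this post would show up in Kibana roughly as a single-field document like this:

{
  "message": "localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0"
}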


Unstructured data from logs


localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0 

If you look closely at the raw line, you will see that it actually consists of different parts, each separated by a space.


More experienced developers can probably guess what each part means and that this is a log message from an API call. Each element is broken out below.


A structured view of our data


  • localhost == environment
  • GET == method
  • /v2/applink/5c2f4bb3e9fda1234edc64d == url
  • 400 == response_status
  • 46ms == response_time
  • 5bc6e716b5d6cb35fc9687c0 == user_id

As the structured view shows, there is an underlying order to the unstructured log. The next step is to extract that structure from the raw data programmatically. That's where Grok shines.


Grok Patterns


Built-in Grok Patterns


Logstash ships with over 100 built-in patterns for structuring unstructured data. Whenever possible, you should take advantage of them for common log formats such as Apache, Linux syslog, HAProxy, AWS and so on.
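
For instance, an Apache access log can typically be parsed with the built-in COMBINEDAPACHELOG pattern alone, with no hand-written regex. A minimal sketch; the resulting field names are whatever that built-in pattern defines:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}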


However, what happens when you have custom logs, as in the example above? Then you have to build your own Grok pattern.


Custom Grok Patterns


To build my own Grok pattern, I used the Grok Debugger and the Grok Patterns reference.


Note that the syntax for Grok patterns is: %{SYNTAX:SEMANTIC}
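
SYNTAX is the name of the pattern that should match your text, and SEMANTIC is the field name you give to the matched piece. Applied to the sample log line, for example:

%{WORD:method}              # matches "GET" and stores it in the field "method"
%{NUMBER:response_status}   # matches "400" and stores it in the field "response_status"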


The first thing I tried was the Discover tab in the Grok Debugger. I thought it would be great if this tool could automatically generate a Grok pattern for me, but it was not very helpful: it found only two matches.




Starting from that, I began building my own pattern in the Grok Debugger, using the pattern syntax from the Elastic GitHub page.




After experimenting with different patterns, I was finally able to structure the log data the way I wanted.




Link to the Grok Debugger: https://grokdebug.herokuapp.com/


Source:


localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0 

Pattern:


%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id} 

The end result


{ "environment": [ [ "localhost" ] ], "method": [ [ "GET" ] ], "url": [ [ "/v2/applink/5c2f4bb3e9fda1234edc64d" ] ], "response_status": [ [ "400" ] ], "BASE10NUM": [ [ "400" ] ], "response_time": [ [ "46ms" ] ], "user_id": [ [ "5bc6e716b5d6cb35fc9687c0" ] ] } 

With a working Grok pattern and the matched data in hand, the last step is to add it to Logstash.


Updating the logstash.conf configuration file


On the server where you installed the ELK stack, go to the Logstash configuration:


sudo vi /etc/logstash/conf.d/logstash.conf 

Paste in the changes:


input {
  file {
    path => "/your_logs/*.log"
  }
}
filter {
  grok {
    match => { "message" => "%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}" }
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}
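
Before restarting, it can be worth checking that the configuration file at least parses. Assuming a standard package install where the Logstash binary lives under /usr/share/logstash (adjust the path to your setup), something like this will validate the config and exit:

sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit -f /etc/logstash/conf.d/logstash.conf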

After saving the changes, restart Logstash and check its status to make sure it is still working.


sudo service logstash restart
sudo service logstash status
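
If your distribution uses systemd, the equivalent commands are (assuming the service is registered under the usual logstash unit name):

sudo systemctl restart logstash
sudo systemctl status logstash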

Finally, to make sure the changes take effect, be sure to refresh the Logstash index pattern in Kibana!
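
You can also sanity-check from the command line that documents are being indexed with the new fields. Assuming the default logstash-* index naming, a query like this should return hits that contain response_status as a separate field:

curl -s 'localhost:9200/logstash-*/_search?q=response_status:400&pretty'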




With Grok, your log data is structured!




Once Grok is in place, log data is automatically mapped to individual fields in Elasticsearch. This makes logs easier to manage and faster to search. Instead of rummaging through log files to debug, you can simply filter on what you are looking for, such as the environment or url.
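
For example, in recent Kibana versions you could type a KQL query like the one below into the search bar (the field names come from the Grok pattern above):

environment : "localhost" and response_status : 400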


Try giving Grok expressions a chance! If you have another way to do this, or if you have any problems with the examples above, just write a comment below to let me know.


Thanks for reading - and please follow me here on Medium for more interesting articles on software engineering!




