Monitoring and profiling Spring Boot applications

Monitoring is essential for modern applications. Modern applications are inherently highly distributed and have many dependencies, such as databases, services, caching, and more. It is therefore increasingly important to track and monitor these services so that the application meets the conditions of its SLA (Service Level Agreement). An SLA is an agreement between a customer and a service provider that covers reliability, responsiveness, and other service-level metrics.

We always try not to violate the SLA, because a violation of any part of it can have many consequences. If the service does not meet the conditions defined in the SLA, it can damage the brand’s reputation and lead to loss of revenue. Worst of all, the company may lose a customer to a competitor because of its inability to meet the customer’s service level requirements.

What metrics should I track?

  • Availability of the service: the time during which the service is available for use. This can be measured in terms of response time, for example percentile X, abbreviated pX, such as p95, p99, or p99.999. Not all services require p99.999; systems with guaranteed high availability, such as e-commerce, search, and payments, should have a stricter SLA.
  • The level of defects: despite all the effort put into developing a system, no system is 100% perfect. We need to count the percentage of errors in the main flows. This category can include production failures such as server errors, database query errors, connection errors, network errors, and missed deadlines.
  • Security: in these heavily regulated times, application and network security breaches can be costly. This category can cover controlled security measures, such as system access, unauthorized database access, bulk loading of database records, or movement of large amounts of data.
  • Business results: IT customers increasingly want to include business process metrics in their SLAs so that they can make better business decisions. This category can collect various data, such as the number of users who visited a page, login activity, coupon effectiveness, and so on.

Initially, choosing a set of metrics may be a non-trivial task, and we may also face the dilemma of how many metrics we should track. We can start with the bare minimum and add more as we need them later.

Configuring monitoring

A typical monitoring setup consists of three components:

  1. A metric store (usually a time-series database), such as InfluxDB, TimescaleDB, Prometheus, and so on.
  2. A dashboard, used to visualize the data stored in the metric store.
  3. The applications themselves, which either push metrics to the metric store or expose local state that the metric store periodically scrapes.

We can also have other components, such as an alerting component, whose notification channels can be email, Slack, or anything else. The alerting component sends alerts to application owners or event subscribers. We are going to use Grafana as the dashboard and notification system, and Prometheus as the metric store.

We will need:

  1. Any IDE
  2. The Java Platform
  3. Gradle

Create a project using the Spring Boot Initializr and add the dependencies we need. We’re going to use the Micrometer library, which is a metrics facade that provides bindings for many metric stores, such as Prometheus, Datadog, and New Relic, to name just a few.

Out of the box, Micrometer provides:

  1. HTTP request metrics
  2. JVM metrics
  3. Database metrics
  4. Metrics for caching systems, and so on.

Some metrics are enabled by default, while others can be enabled, disabled, or configured. We will use the application.properties file to enable, disable, and configure metrics. We also need the Spring Boot Actuator, as it exposes the Prometheus endpoint.

Add these dependencies to the build.gradle file:

  1. io.micrometer:micrometer-registry-prometheus
  2. org.springframework.boot:spring-boot-starter-actuator
dependencies {
  implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
  implementation 'org.springframework.boot:spring-boot-starter-web'
  compileOnly 'org.projectlombok:lombok'
  annotationProcessor 'org.projectlombok:lombok'
  providedRuntime 'org.springframework.boot:spring-boot-starter-tomcat'
  implementation 'io.micrometer:micrometer-registry-prometheus'
  implementation 'org.springframework.boot:spring-boot-starter-actuator'
  // https://mvnrepository.com/artifact/com.h2database/h2
  implementation group: 'com.h2database', name: 'h2', version: '1.4.200'
  testImplementation('org.springframework.boot:spring-boot-starter-test') {
    exclude group: 'org.junit.vintage', module: 'junit-vintage-engine'
  }
}

We can enable Prometheus export by adding the following line to the properties file.

management.metrics.export.prometheus.enabled = true

After adding this line, Micrometer will start accumulating metrics about the application, and this data can be viewed at the actuator/prometheus endpoint. This endpoint is referenced in the Prometheus scrape configuration so that Prometheus can pull data from our application servers.
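
For reference, a minimal Prometheus scrape configuration that pulls from this endpoint could look like the following; the job name and target address are assumptions for a locally running application.

scrape_configs:
  - job_name: 'spring-boot-app'         # arbitrary job name
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']     # address of the Spring Boot application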

Even though we added this line to the properties, we can’t view the Prometheus endpoint yet, because it is disabled by default. We can expose it via the management endpoints by including prometheus in the list:

management.endpoints.web.exposure.include=prometheus

NOTE: Do not enable all of the actuator endpoints, as this may open a security loophole. Select them carefully, especially in a production system. Even then, avoid exposing the endpoints to the whole world, as they can reveal a large amount of data about the application; use some kind of proxy or rule to hide them from the outside world.
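
For example, a production configuration would typically expose only what is strictly needed; the exact set below is just an illustration.

management.endpoints.web.exposure.include=health,prometheus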

Various aspects of HTTP request metrics can be configured, for example the SLA buckets and whether a percentile histogram should be calculated, using the management.metrics.distribution properties.

The application.properties example may contain the following lines:

# Enable prometheus export
management.metrics.export.prometheus.enabled=true
# Expose the Prometheus endpoint
management.endpoints.web.exposure.include=prometheus
# Enable percentile-based histogram for http requests
management.metrics.distribution.percentiles-histogram.http.server.requests=true
# HTTP SLA histogram buckets
management.metrics.distribution.sla.http.server.requests=100ms,150ms,250ms,500ms,1s
# Enable JVM metrics
management.metrics.enable.jvm=true

Now, if we launch the app and go to http://localhost:8080/actuator/prometheus, a huge amount of data will be displayed.
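
A few representative lines of that output (the values here are illustrative) look like this:

http_server_requests_seconds_bucket{exception="None",method="GET",status="200",uri="/actuator/prometheus",le="0.1",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",status="200",uri="/actuator/prometheus",le="0.25",} 1.0
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 0.004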

The data above shows details of HTTP requests. exception=None means that no exception occurred (if one had, we could use this tag to filter the number of requests that failed with that exception), method=GET is the HTTP method, status=200 is the HTTP status, uri=/actuator/prometheus is the URL path, le=xyz is the processing-time bucket, and N.0 is the number of calls to this endpoint.

This data is a histogram that can be plotted in Grafana. For example, to plot p95 over 5 minutes, we can use the following query:

histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

You can also use Grafana to create other metric graphs, such as a pie chart, and so on.
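
The defect-level metric mentioned earlier can be derived from the same data. As a sketch, assuming the standard outcome tag that Spring Boot adds to http.server.requests, a 5-minute server-error rate could be plotted with:

sum(rate(http_server_requests_seconds_count{outcome="SERVER_ERROR"}[5m]))
  /
sum(rate(http_server_requests_seconds_count[5m]))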

Custom metrics

We often need custom metrics; some use cases are the number of logged-in users, information about currently available inventory, the number of orders in a queue, etc. Such business use cases can be covered using custom metrics. Micrometer supports various meter types, such as timers, gauges, counters, distribution summaries, long task timers, and so on. We will mainly focus on the gauge and the counter. A gauge gives us instantaneous data, such as queue length, whereas a counter is a monotonically increasing number.

To demonstrate this, we are going to create an example inventory manager that stores items in memory and provides two functions:

  1. Add items
  2. Get items

To do this, we will create one counter and one gauge in the init method. On each call to getItems, we increment the counter and also measure the stock size, whereas on a call to addItems, we only update the gauge.

@Component
public class StockManager {
  @Autowired private MeterRegistry meterRegistry;
  private List<String> orders = new Vector<>();
  private Counter counter;
  private Gauge gauge;

@PostConstruct
  public void init() {
    counter =
        Counter.builder("order_created")
            .description("number of orders created")
            .register(meterRegistry);
    gauge =
        Gauge.builder("stock.size", this, StockManager::getNumberOfItems)
            .description("Number of items in stocks")
            .register(meterRegistry);
  }

public int getNumberOfItems() {
    return orders.size();
  }

public void addItems(List<String> items) {
    orders.addAll(items);
    // measure gauge
    gauge.measure();
  }

public List<String> getItem(int count) {
    List<String> items = new ArrayList<>(count);
    while (count > 0) {
      try {
        items.add(orders.remove(0));
      } catch (ArrayIndexOutOfBoundsException e) {
        break;
      }
      count -= 1;
    }
    // increase counter
    counter.increment();
    //  measure gauge
    gauge.measure();
    return items;
  }
}

For demonstration purposes, we will add two endpoints: one for adding items and one for retrieving items.

@RestController
@RequestMapping(path = "stocks")
@RequiredArgsConstructor(onConstructor = @__(@Autowired))
public class StockController {
  @NonNull private StockManager stockManager;

  @GetMapping
  @ResponseBody
  public List<String> getItems(@RequestParam int size) {
    return stockManager.getItem(size);
  }

  @PostMapping
  @ResponseBody
  public int addItems(@RequestParam List<String> items) {
    stockManager.addItems(items);
    return stockManager.getNumberOfItems();
  }
}

Let’s first add ten items using two API calls:

  1. curl -X POST "http://localhost:8080/stocks?items=1,2,3,4"
  2. curl -X POST "http://localhost:8080/stocks?items=5,6,7,8,9,10"

Now, if we go to the Prometheus endpoint, we see the following data, which shows that we currently have 10 items in stock.

# HELP stock_size Number of items in stocks
# TYPE stock_size gauge
stock_size 10.0

Now we are going to place an order for 3 products:

http://localhost:8080/stocks?size=3

Now, if we look at the Prometheus endpoint, we get the following data, which shows that the stock size has been changed to seven.

# HELP stock_size Number of items in stocks
# TYPE stock_size gauge
stock_size 7.0

In addition, we can see that a value of 1 has been added to the counter, which means that one order has been placed.

# HELP order_created_total number of orders created
# TYPE order_created_total counter
order_created_total 1.0

Profiling

In software engineering, profiling (“program profiling”, “software profiling”) is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls.

Most often, profiling information helps you optimize your program. Profiling is achieved by collecting program performance characteristics, such as the execution time of individual fragments (usually subroutines), the number of correctly predicted conditional jumps, the number of cache misses, and so on. The tool used for this analysis is called a profiler. Profilers can use a number of different methods, such as event-based, statistical, instrumentation-based, and simulation methods. — Wikipedia

Profiling is very useful when diagnosing system problems: for example, how long an HTTP call takes, and if it takes N seconds, where all that time was spent and how those N seconds are distributed between different database requests, downstream service calls, and so on. We can use a histogram to plot the distribution on the dashboard, and we can also use a counter to measure the number of DB requests, etc.
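
For instance, a single database call can be timed by hand with Micrometer’s Timer before reaching for anything more elaborate. A minimal sketch, assuming an autowired meterRegistry and jdbcTemplate and an example orders table:

// build the timer once (for example in a @PostConstruct method)
Timer dbTimer = Timer.builder("db.order.query")              // metric name is an example
    .description("Time spent loading orders")
    .publishPercentiles(0.95, 0.99)                          // expose p95 and p99
    .register(meterRegistry);

// record() measures the execution time of the supplied call
List<String> orderNames = dbTimer.record(
    () -> jdbcTemplate.queryForList("SELECT name FROM orders", String.class));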

The important thing is that the embedded profiling code is the same for similar kinds of profiling, which means we would need to copy and paste similar code in thousands of places, and if we needed to change something, we would have to update the profiler code in every file, and probably in every function that requires profiling. This increases complexity and can become a complete mess, but we can avoid it by using aspect-oriented programming (AOP).

In short, AOP works on the proxy design pattern, although it can also be implemented using bytecode modification.

Normally, when a method is called, we expect the called method to be invoked directly without any intermediate steps. But when AOP is in place, the method call is intercepted by a proxy method, the proxy method calls the target method, and the proxy method returns the result to the caller, as shown in the figure below.

A system depends on other systems, so we may be interested in profiling different components in different ways, such as database calls, HTTP requests, downstream service calls, or some specific methods that are critical or that we would like to look into. We can use the same Micrometer library for profiling, but it may not provide exactly what we need out of the box, so we’ll extend the code.

Micrometer comes with a Timed annotation. This annotation can be placed on any public method and, as the name suggests, it measures the execution time of the annotated method. Instead of using this annotation directly, we will extend it to support other features such as logging, retry, and so on.
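
For completeness, the stock @Timed annotation only takes effect when a TimedAspect bean is registered; a minimal configuration (the class name MetricsConfig is arbitrary, and it assumes AspectJ proxy support, e.g. spring-boot-starter-aop, is on the classpath) looks roughly like this:

@Configuration
public class MetricsConfig {
  // registers Micrometer's built-in aspect so that @Timed methods are actually measured
  @Bean
  public TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
  }
}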

As mentioned, the Timed annotation is useless without a TimedAspect bean. Since we are redefining the Timed annotation, we will also define our own TimedAspect class according to our needs, such as logging, bulk profiling (profiling all methods in a package without adding any annotation to any method or class), retry, etc. In this article, we’ll look at three usage scenarios:

  1. Bulk profiling.
  2. Logging.
  3. Profiling a specific method.

Create a MonitoringTimed.java file, to which we will add a new field named loggingEnabled. This field will be used to check whether logging is enabled; if it is, the method arguments and return values will be logged.

@Target({ElementType.ANNOTATION_TYPE, ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
public @interface MonitoringTimed {
  /** All fields are same as in {@link io.micrometer.core.annotation.Timed} */
  String value() default "";

  String[] extraTags() default {};

  boolean longTask() default false;

  double[] percentiles() default {};

  boolean histogram() default false;

  String description() default "";

  // NEW fields starts here
  boolean loggingEnabled() default false;
}

This annotation is useless without a Timed aspect class, so we define a new MonitoringTimedAspect class with all the necessary details. This class has one method for profiling any method based on the proceeding join point and a MonitoringTimed object, and another for profiling methods annotated with MonitoringTimed.

@Around("execution (@com.gitbub.sonus21.monitoring.aop.MonitoringTimed * *.*(..))")
public Object timedMethod(ProceedingJoinPoint pjp) throws Throwable {
  Method method = ((MethodSignature) pjp.getSignature()).getMethod();
  MonitoringTimed timed = method.getAnnotation(MonitoringTimed.class);
  if (timed == null) {
    method = pjp.getTarget().getClass().getMethod(method.getName(), method.getParameterTypes());
    timed = method.getAnnotation(MonitoringTimed.class);
  }
  final String metricName = generateMetricName(pjp, timed);
  return timeThisMethod(pjp, timed, metricName);
}

public Object timeThisMethod(ProceedingJoinPoint pjp, MonitoringTimed timed) throws Throwable {
  final String metricName = generateMetricName(pjp, timed);
  return timeThisMethod(pjp, timed, metricName);
}
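
The three-argument timeThisMethod overload is not shown here; as a rough sketch (assuming the aspect has an injected meterRegistry field, with simplified error handling), it could record the timing like this:

public Object timeThisMethod(ProceedingJoinPoint pjp, MonitoringTimed timed, String metricName)
    throws Throwable {
  Timer.Sample sample = Timer.start(meterRegistry);  // start timing
  String exceptionClass = "none";
  try {
    return pjp.proceed();                            // invoke the intercepted method
  } catch (Throwable t) {
    exceptionClass = t.getClass().getSimpleName();
    throw t;
  } finally {
    // stop timing and record the sample against a timer tagged with class/method/exception
    sample.stop(
        Timer.builder(metricName)
            .description(timed.description())
            .tags(timed.extraTags())
            .tag("class", pjp.getTarget().getClass().getName())
            .tag("method", ((MethodSignature) pjp.getSignature()).getMethod().getName())
            .tag("exception", exceptionClass)
            .publishPercentiles(timed.percentiles())
            .publishPercentileHistogram(timed.histogram())
            .register(meterRegistry));
  }
}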

The timedMethod method with the @Around annotation intercepts all method calls annotated with MonitoringTimed.

For bulk profiling, we define a profiler class that works with a package-level filter. For example, to profile HTTP requests, we can add a ControllerProfiler that profiles all public methods in the controller package.

@Aspect
@Component
public class ControllerProfiler {
  private static Map<String, Object> timedAnnotationData = new HashMap<>();

  static {
    // use percentile data of p90, p95, p99.99
    double[] percentiles = {0.90, 0.95, 0.9999};
    // set histogram to true
    timedAnnotationData.put("histogram", true);
    // set percentile
    timedAnnotationData.put("percentiles", percentiles);
  }

  @Autowired private MonitoringTimedAspect timedAspect;

  private static final MonitoringTimed timed = Javanna.createAnnotation(MonitoringTimed.class, timedAnnotationData);

  private static final Logger logger = LoggerFactory.getLogger(ControllerProfiler.class);

  @Pointcut("execution(* com.gitbub.sonus21.monitoring.controller..*.*(..))")
  public void controller() {}

  @Around("controller()")
  public Object profile(ProceedingJoinPoint pjp) throws Throwable {
    // here add other logic like error happen then log parameters etc
    return timedAspect.timeThisMethod(pjp, timed);
  }
}

An interesting line in the above code is @Pointcut("execution(* com.gitbub.sonus21.monitoring.controller..*.*(..))"), which defines the pointcut. Pointcut expressions can be combined using logical operators such as not (!), or (||), and and (&&). Once a method call matches the pointcut expression, the corresponding method annotated with @Around is invoked. Besides the profile method defined here, we could also define other methods using the @After, @Before, and similar annotations.
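
As an illustration of those operators, a pointcut could exclude a particular class from bulk profiling; the HealthController named here is hypothetical:

@Pointcut(
    "execution(* com.gitbub.sonus21.monitoring.controller..*.*(..))"
        + " && !execution(* com.gitbub.sonus21.monitoring.controller.HealthController.*(..))")
public void controllerExceptHealth() {}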

After adding several items using the POST endpoint, we can see the following data at the Prometheus endpoint.

method_timed_seconds{class="com.gitbub.sonus21.monitoring.controller.StockController",exception="none",method="addItems",quantile="0.9",} 0.0
method_timed_seconds_bucket{class="com.gitbub.sonus21.monitoring.controller.StockController",exception="none",method="addItems",le="0.001",} 3.0
method_timed_seconds_bucket{class="com.gitbub.sonus21.monitoring.controller.StockController",exception="none",method="addItems",le="0.002446676",} 3.0

We can also use the MonitoringTimed annotation directly on any method to measure its execution time. For example, let‘s measure how much time StockManager’s addItems method spends adding items.

@MonitoringTimed
public void addItems(List<String> items) {
  orders.addAll(items);
  // measure gauge
  gauge.measure();
}

Once we launch the app and add a few items, we’ll see the following in the Prometheus endpoint.

method_timed_seconds_count{class="com.gitbub.sonus21.monitoring.service.StockManager",exception="none",method="addItems",} 4.0
method_timed_seconds_sum{class="com.gitbub.sonus21.monitoring.service.StockManager",exception="none",method="addItems",} 0.005457965
method_timed_seconds_max{class="com.gitbub.sonus21.monitoring.service.StockManager",exception="none",method="addItems",} 0.00615316

MonitoringTimed can be configured further: for example, you could add a retry count to support retries in the event of a failure, or record function arguments on failure so that you can analyze the cause later.
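
A rough sketch of such an extension (these members are hypothetical and not part of the repository code) could add the following to MonitoringTimed:

public @interface MonitoringTimed {
  // ... existing members from above ...

  // hypothetical: number of times the intercepted call should be retried on failure
  int retryCount() default 0;

  // hypothetical: whether method arguments should be logged when the call fails
  boolean logArgumentsOnFailure() default false;
}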

The full code is available on GitHub.

Valery Radokhleb
Web developer, designer