Chapter 15: Continuing the Voyage: Monitoring and Maintenance

The effective use of Azure AI services is not just about implementation; it's also about continuous monitoring and maintenance.

The importance of monitoring and maintenance in Azure AI services cannot be overstated. It ensures the smooth operation of AI applications, helps in identifying and rectifying issues promptly, and aids in optimizing performance over time. Moreover, it provides valuable insights into the system’s behavior, which can be used to make informed decisions and strategic improvements.

The Monitor Azure AI Services module teaches you the significance of monitoring Azure AI services, which is crucial for tracking utilization, identifying trends, and detecting and troubleshooting issues. Upon completing the module, you will be able to monitor Azure AI services costs, create alerts, view metrics, and manage diagnostic logging. A comprehensive list of metrics for the resource type Microsoft.CognitiveServices/accounts is available at the following link: Supported metrics for Microsoft.CognitiveServices/accounts.

In the sections that follow, we will highlight available resources that discuss best practices and provide practical guidance on how to effectively monitor and maintain your Azure AI services.


Azure OpenAI

Azure OpenAI, a part of the Azure AI services, is a robust and comprehensive tool that offers a wide range of features to optimize performance, manage resources, and ensure security. This section will guide you through various aspects of Azure OpenAI, providing a deeper understanding of its functionalities and how to leverage them effectively.

Understanding Performance and Latency

Performance and latency are crucial aspects of any AI service. They can significantly impact the efficiency and effectiveness of your AI applications. Azure OpenAI is designed to provide high performance with minimal latency. However, understanding how latency works with Azure OpenAI and how to optimize your environment can further enhance your application’s performance. This guide provides valuable insights into these aspects, offering practical tips on optimizing your environment for improved performance.
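
One practical way to see where latency comes from is to measure time-to-first-token separately from total round-trip time, which streaming makes easy. Below is a minimal sketch, assuming the openai Python package (v1+); the endpoint, API key, API version, and deployment name are placeholders you would substitute with your own.

```python
import time

from openai import AzureOpenAI  # pip install openai

# Placeholders: substitute your own endpoint, key, API version, and deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

start = time.perf_counter()
first_token_at = None

# Streaming lets us distinguish time-to-first-token from total latency.
stream = client.chat.completions.create(
    model="<your-deployment-name>",
    messages=[{"role": "user", "content": "Summarize Azure Monitor in one sentence."}],
    stream=True,
)
for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()

total = time.perf_counter() - start
print(f"time to first token: {first_token_at - start:.2f}s, total: {total:.2f}s")
```

Time-to-first-token is usually what users perceive; total time grows with the number of generated tokens, so capping max_tokens is one lever for reducing end-to-end latency.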

Managing Azure OpenAI Service Quota

Azure OpenAI provides a flexible quota system that allows you to manage rate limits across your deployments effectively. This system ensures that resources are allocated optimally, preventing overuse and underuse. This article walks you through the process of managing your Azure OpenAI quota, helping you to actively manage the allocation of rate limits across the deployments within your subscription.
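
Quota translates into tokens-per-minute and requests-per-minute rate limits on each deployment; calls that exceed them receive an HTTP 429 response, typically with a Retry-After header. As a minimal client-side sketch (assuming the openai Python package and a client like the one shown earlier), you might honor that header with a backoff loop:

```python
import time

from openai import RateLimitError

def complete_with_backoff(client, deployment, messages, max_retries=5):
    """Retry chat completions on 429s, honoring Retry-After when present."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError as err:
            # Fall back to exponential backoff if no Retry-After header is sent.
            retry_after = err.response.headers.get("retry-after")
            time.sleep(float(retry_after) if retry_after else 2 ** attempt)
    raise RuntimeError("rate limit retries exhausted")
```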

Monitoring Azure OpenAI Service

Monitoring is a critical aspect of managing any service. Azure OpenAI uses Azure Monitor, a powerful service that provides monitoring and analytics of Azure resources. This guide will help you understand how to monitor your Azure OpenAI resources effectively, ensuring optimal performance and availability.
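
The same platform metrics shown in the portal can be pulled programmatically with the azure-monitor-query package. A hedged sketch follows; the resource ID is a placeholder, and the metric name is an assumption to verify against the supported-metrics list for your resource.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential  # pip install azure-identity
from azure.monitor.query import MetricsQueryClient  # pip install azure-monitor-query

client = MetricsQueryClient(DefaultAzureCredential())

# Full resource ID of the Azure OpenAI (Cognitive Services) account (placeholder).
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.CognitiveServices/accounts/<account-name>"
)

response = client.query_resource(
    resource_id,
    metric_names=["AzureOpenAIRequests"],  # assumed name; check the metrics reference
    timespan=timedelta(days=1),
    granularity=timedelta(hours=1),
    aggregations=["Total"],
)
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```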

Additionally, the Azure OpenAI Insights article provides a downloadable Azure Monitor workbook. This workbook, combined with Azure OpenAI’s comprehensive metrics and diagnostic logs, can be a powerful tool for scaling and monitoring your AI initiatives.

Implementing Logging and Monitoring

This repository details the deployment of an Enterprise Azure OpenAI reference architecture. It provides advanced patterns for customers using models with larger token sizes, advanced analytics on prompts and responses, and the ability to send these events to another type of data store, while ensuring comprehensive logging of Azure OpenAI model execution and advanced usage and throttling controls.

The architecture provides advanced logging capabilities for tracking API usage and performance. It also offers robust security measures to protect sensitive data and prevent malicious activity, ensuring that your enterprise deployments are secure and efficient.

Abuse Monitoring

Azure OpenAI Service detects and mitigates instances of recurring content and/or behaviors that suggest use of the service in a manner that may violate the Code of Conduct or other applicable product terms. Details on abuse monitoring can be found here.

Azure AI Search

Understanding Performance Benchmarking

Azure AI Search's performance depends on a variety of factors, including the size of your search service and the types of queries you're sending. To help you estimate the size of the search service needed for your workload, this article presents several performance benchmarks for different search services and configurations.

Factors Influencing Azure AI Search Performance

Azure AI Search's performance is influenced by several factors, including the types of queries you're sending. This article describes the tools, behaviors, and approaches for analyzing query and indexing performance in Azure AI Search. In any large implementation, it's critical to run a performance benchmarking test of your Azure AI Search service before you roll it into production. You should test not only the search query load that you expect but also the expected data ingestion workloads. Having benchmark numbers helps to validate the proper search tier, service configuration, and expected query latency.
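
Even a simple timed query loop against a representative index gives you a baseline before a formal load test. A minimal sketch using the azure-search-documents package; the service endpoint, index name, key, and sample queries are placeholders.

```python
import statistics
import time

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<query-api-key>"),
)

sample_queries = ["hotel", "beach view", "free wifi"]  # stand-ins for a real workload
latencies = []
for query in sample_queries * 10:  # repeat to smooth out variance
    start = time.perf_counter()
    list(search_client.search(search_text=query, top=10))  # drain the results
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"median: {statistics.median(latencies) * 1000:.0f} ms, "
      f"p95: {latencies[int(len(latencies) * 0.95)] * 1000:.0f} ms")
```

Note that this loop only measures sequential latency; for production-grade testing you would drive concurrent clients with a dedicated load-testing tool.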

Tips for Boosting Performance

Understanding the factors that most likely impact search performance can help you avoid inefficiencies and maximize your search service's potential. Key factors include index composition (schema and size), query design, and service capacity (tier and the number of replicas and partitions). Discover more: Tips for better performance in Azure AI Search

Monitoring Azure AI Search with Azure Monitor

Azure Monitor provides monitoring capabilities over all Azure resources, including Azure AI Search. This article explains how monitoring works for Azure AI Search, including how to measure query performance and volume using built-in metrics and resource logging.

Monitoring Query Requests

This article explains how to monitor query requests using built-in metrics and resource logging. The Azure portal shows basic metrics about query latency, query load, and throttling.
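
These portal metrics can also be retrieved programmatically, using the same azure-monitor-query approach shown earlier for Azure OpenAI. The metric names below (SearchLatency, SearchQueriesPerSecond, ThrottledSearchQueriesPercentage) are the built-in query metrics for Microsoft.Search/searchServices; verify them against the current supported-metrics list.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

search_resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Search/searchServices/<service-name>"
)

response = client.query_resource(
    search_resource_id,
    metric_names=[
        "SearchLatency",
        "SearchQueriesPerSecond",
        "ThrottledSearchQueriesPercentage",
    ],
    timespan=timedelta(hours=6),
    granularity=timedelta(minutes=15),
    aggregations=["Average"],
)
for metric in response.metrics:
    points = [p for series in metric.timeseries for p in series.data
              if p.average is not None]
    if points:
        print(f"{metric.name}: latest average {points[-1].average:.2f}")
```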

Monitoring Indexer Status and Results

You can monitor indexer processing in the Azure portal, or programmatically through REST calls or an Azure SDK. In addition to status about the indexer itself, you can review start and end times, and detailed errors and warnings from a particular run.

Discover more: Monitor indexer status and results in Azure AI Search
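
Programmatically, the same information is available through SearchIndexerClient in the azure-search-documents package. A short sketch; the endpoint, admin key, and indexer name are placeholders.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

indexer_client = SearchIndexerClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-api-key>"),
)

# Returns the overall indexer status plus details of the most recent runs.
status = indexer_client.get_indexer_status("<your-indexer-name>")
print("overall status:", status.status)

last = status.last_result
if last:
    print(f"last run: {last.status}, {last.start_time} -> {last.end_time}")
    for error in last.errors or []:
        print("error:", error.error_message)
```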

Collect telemetry data for search traffic analytics

Search traffic analytics is a pattern for collecting telemetry about user interactions with your Azure AI Search application, such as user-initiated click events and keyboard inputs. Using this information, you can determine the effectiveness of your search solution, including popular search terms, clickthrough rate, and which query inputs yield zero results.

This pattern takes a dependency on Application Insights (a feature of Azure Monitor) to collect user data. It requires that you add instrumentation to your client code, as described in this article. Finally, you will need a reporting mechanism to analyze the data.

Discover more: Collect telemetry data for search traffic analytics
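
The heart of the pattern is correlating a search request with the user actions that follow it through a shared search ID. Below is a minimal sketch of the client-side instrumentation, assuming the applicationinsights Python package; the "Search" and "Click" event names and property keys follow the pattern described in the article but are illustrative here.

```python
import uuid

from applicationinsights import TelemetryClient  # pip install applicationinsights

tc = TelemetryClient("<instrumentation-key>")  # placeholder

# A shared searchId correlates the query with any later click events.
search_id = str(uuid.uuid4())
tc.track_event("Search", {
    "SearchId": search_id,
    "IndexName": "<your-index>",
    "QueryTerms": "beach view",
    "ResultCount": "42",
})

# Later, when the user clicks the third result:
tc.track_event("Click", {
    "SearchId": search_id,
    "ClickedDocId": "doc-123",
    "Rank": "3",
})
tc.flush()  # telemetry is batched; flush before the process exits
```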

Visualizing Azure AI Search Logs and Metrics with Power BI

Azure AI Search can send operation logs and service metrics to an Azure Storage account, which you can then visualize in Power BI. This article explains the steps and how to use a Power BI Template App to visualize the data. The template can help you gain detailed insights about your search service, including information about queries, indexing, operations, and service metrics.

Setting Up for Success with Azure AI Search Performance

The article provides an overview of performance in Azure AI Search. It guides you through the key factors that determine performance, shows you some performance benchmarks, and explains how you can run your own performance tests. It also provides tips on how to diagnose and fix performance issues you might be experiencing.

Azure Machine Learning

Data Collection from Models in Production

This article introduces the Azure Machine Learning Data collector, which provides real-time logging of input and output data from models deployed to managed online endpoints or Kubernetes online endpoints. The logged inference data is stored in Azure blob storage and can be used for model monitoring, debugging, or auditing, thereby providing observability into the performance of your deployed models.

Collecting Production Data from Deployed Models

Learn how to collect production inference data from a model deployed to an Azure Machine Learning managed online endpoint or Kubernetes online endpoint. Azure Machine Learning Data collector logs inference data in Azure blob storage. You can enable data collection for new or existing online endpoint deployments. The data collected with the provided Python SDK is automatically registered as a data asset in your Azure Machine Learning workspace and can be used for model monitoring. If you're interested in collecting production inference data for an MLflow model deployed to a real-time endpoint, you can do so with a single toggle. To learn how, see Data collection for MLflow models.
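
Inside a deployment's scoring script, the collector is exposed through the azureml-ai-monitoring package. Here is a hedged sketch of the documented pattern; the model itself is assumed to be loaded in init(), as usual for online-endpoint scoring scripts.

```python
import pandas as pd
from azure.ai.mlmonitoring import Collector  # pip install azureml-ai-monitoring

# Collector names map to the data assets registered in the workspace.
inputs_collector = Collector(name="model_inputs")
outputs_collector = Collector(name="model_outputs")

def run(data):
    """Scoring entry point; `model` is assumed to be loaded in init()."""
    input_df = pd.DataFrame(data)
    context = inputs_collector.collect(input_df)   # log inputs to blob storage

    output_df = pd.DataFrame({"prediction": model.predict(input_df)})
    outputs_collector.collect(output_df, context)  # correlate outputs with inputs
    return output_df.to_dict()
```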

Model Monitoring in Azure Machine Learning

Model monitoring is the last step in the machine learning end-to-end lifecycle. This article discusses the signals and metrics you can monitor, and the recommended practices for using model monitoring. It also explains how Azure Machine Learning acquires monitoring signals through data analysis on streamed production inference data and reference data.

Monitoring Azure Machine Learning with Azure Monitor

When you have critical applications and business processes relying on Azure resources, you want to monitor those resources for their availability, performance, and operation. This article describes the monitoring data generated by Azure Machine Learning and how to analyze and alert on this data with Azure Monitor.

Monitoring Performance of Models Deployed to Production

Once a machine learning model is in production, it’s important to critically evaluate the inherent risks associated with it and identify blind spots that could adversely affect your business. This article provides insights on how to perform out-of-box and advanced monitoring setup for models that are deployed to Azure Machine Learning online endpoints. It also explains how to set up model monitoring for models that are deployed outside Azure Machine Learning or deployed to Azure Machine Learning batch endpoints.
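
For the out-of-box setup, the Azure Machine Learning Python SDK v2 (azure-ai-ml) models a monitor as a schedule. A hedged sketch follows; the entity names mirror the SDK's model-monitoring classes, and the endpoint and deployment names are placeholders. With data collection enabled on the deployment, the default signals (such as data drift) are configured automatically.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient  # pip install azure-ai-ml
from azure.ai.ml.entities import (
    MonitorDefinition,
    MonitoringTarget,
    MonitorSchedule,
    RecurrenceTrigger,
)

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<sub-id>",
    resource_group_name="<rg>",
    workspace_name="<workspace>",
)

# Point the monitor at an online deployment (placeholders).
target = MonitoringTarget(
    ml_task="classification",
    endpoint_deployment_id="azureml:<endpoint-name>:<deployment-name>",
)
schedule = MonitorSchedule(
    name="model_monitor",
    trigger=RecurrenceTrigger(frequency="day", interval=1),
    create_monitor=MonitorDefinition(monitoring_target=target),
)
ml_client.schedules.begin_create_or_update(schedule)
```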

Azure Machine Learning Studio

Monitor and analyze jobs in studio

You can use Azure Machine Learning studio to monitor, organize, and track your jobs for training and experimentation. Your ML job history is an important part of an explainable and repeatable ML development process. Discover more here.

Organize & track training jobs

You can use the jobs list view in Azure Machine Learning studio to organize and track your jobs. By selecting a job, you can view and analyze its details, such as metrics, parameters, logs, and outputs. This way, you can keep track of your ML job history and ensure a transparent and reproducible ML development process. Discover more here.

Visualize training results in studio

Explore your experimentation results with a dashboard view. The dashboard view combines different tiles: chart visualizations, a comparison table, markdown, and more, in a dynamic, flexible, and customizable layout for exploring your results. The dashboard can help you save time, keep your results organized, and make informed decisions, such as whether to retrain or deploy your model. Discover more here.
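
The charts on those dashboards are driven by whatever your training code logs. Azure Machine Learning uses MLflow for tracking, so a minimal logging sketch inside a job looks like the following; the parameter and metric names are arbitrary examples.

```python
import mlflow  # pip install mlflow; tracking is preconfigured inside Azure ML jobs

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # stand-in for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)
```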

Key Takeaways and Actionable Strategies for AI Monitoring in Azure

It is important to monitor AI solutions to ensure their availability, performance, and operation. Monitoring can help detect and resolve issues quickly and ensure that the AI solution is performing optimally. Here are some recommendations to act on when building AI solutions on Azure:

  • Set up monitoring dashboards: Azure provides out-of-box dashboards for each of your Azure resources. To access the monitoring dashboards, sign in to the Azure portal and select the overview pane for one of your Azure resources.
  • Configure diagnostic settings: You can configure Azure Monitor to generate data in activity logs, resource logs, virtual machine logs, and platform metrics. Platform metrics and the Azure Monitor activity log are collected and stored automatically, and can be routed to other locations by using a diagnostic setting. Azure Monitor resource logs aren’t collected and stored until you create a diagnostic setting and route the logs to one or more locations. When you create a diagnostic setting, you specify which categories of logs to collect (see the sketch after this list).
  • Analyze metrics: You can analyze metrics for your Azure resources with Azure Monitor tools in the Azure portal. From the Overview page for your Azure resource, select Metrics under Monitoring in the left pane.
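
As referenced above, diagnostic settings can also be created in code. Here is a hedged sketch with the azure-mgmt-monitor package that routes logs and metrics from a Cognitive Services account to a Log Analytics workspace; the resource IDs are placeholders, and the category names should be verified for your resource type.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient  # pip install azure-mgmt-monitor

client = MonitorManagementClient(DefaultAzureCredential(), "<sub-id>")

resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.CognitiveServices/accounts/<account-name>"
)
workspace_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
)

client.diagnostic_settings.create_or_update(
    resource_uri=resource_id,
    name="route-to-log-analytics",
    parameters={
        "workspace_id": workspace_id,
        "logs": [{"category_group": "allLogs", "enabled": True}],
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
)
```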

By following these recommendations, teams can ensure that their AI solutions are performing optimally and that any issues are detected and resolved quickly.