PAPI

Introduction

PAPI (Performance API) is a programming interface that gives developers access to hardware performance counters, which provide low-level statistics for program performance analysis. This guide explains how to install and use the PAPI bindings for Python to monitor metrics such as instruction count and CPU cycles.

More information:

Performance Application Programming Interface

python-papi 6.0.0.2

Benefits of using PAPI

Using PAPI in Python offers several advantages for code development and optimization:

  1. Detailed performance measurement: It provides low-level performance metrics, such as the number of executed instructions, CPU cycles, and other hardware-specific events.

  2. Code optimization: It helps identify bottlenecks and areas of code that can be optimized to reduce execution time or resource usage.

  3. CPU efficiency tuning: It helps tune CPU cycle usage to improve performance and reduce energy consumption in compute-intensive environments.

Installing the pypapi library

To use PAPI with Python, install the pypapi bindings, which provide an interface to the PAPI counters. The package is published on PyPI as python_papi and imported as pypapi. Install it inside your virtual environment (venv).

  1. Install with pip:

    $ pip install python_papi

How to use PAPI in Python

After installing pypapi, you can start using it to monitor the performance of a code fragment.

The following example shows how to measure the number of executed instructions and consumed CPU cycles in a simple function.

from pypapi import events, papi_high as high

# Code to measure
def simple_sum():
    total = 0
    for i in range(1, 1000000):
        total += i
    return total

# Start the counters just before the code under test,
# so that defining the function is not part of the measurement
high.start_counters([events.PAPI_TOT_INS, events.PAPI_TOT_CYC])

# Run the function
simple_sum()

# Stop counters and get results
results = high.stop_counters()

# Print collected metrics
print("Total number of instructions (PAPI_TOT_INS):", results[0])
print("Total number of cycles (PAPI_TOT_CYC):", results[1])

# Compute CPI (Cycles Per Instruction)
if results[0] != 0:
    cpi = results[1] / results[0]
    print("Cycles per instruction (CPI):", cpi)
else:
    print("No instructions were counted.")

Code explanation:

  1. Counter initialization: We use high.start_counters() to enable PAPI counters. In this example, we count the total number of instructions (PAPI_TOT_INS) and total number of cycles (PAPI_TOT_CYC).

  2. Code to measure: We define a simple_sum() function that adds up the integers from 1 to 999,999 in a loop. This is the code we want to analyze.

  3. Counter stop: We use high.stop_counters() to stop counters and retrieve results. results[0] contains the number of executed instructions, and results[1] contains the number of cycles.

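  4. CPI calculation (Cycles Per Instruction): We divide the number of cycles by the number of instructions to obtain CPI. A lower CPI means the processor completes more instructions per cycle; a higher CPI usually points to stalls, for example from cache misses or branch mispredictions.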
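The CPI calculation in step 4 can be wrapped in a small helper that also reports the inverse metric, instructions per cycle (IPC). This is a minimal sketch: the function name derived_metrics and the sample counts are illustrative assumptions, not part of pypapi.

```python
def derived_metrics(instructions: int, cycles: int) -> dict:
    """Return CPI and IPC computed from raw counter values.

    Hypothetical helper for illustration; not part of pypapi.
    """
    if instructions <= 0 or cycles <= 0:
        raise ValueError("instructions and cycles must be positive")
    return {
        "cpi": cycles / instructions,  # cycles per instruction
        "ipc": instructions / cycles,  # instructions per cycle
    }


# Made-up counts, in the order [PAPI_TOT_INS, PAPI_TOT_CYC]
sample_results = [4_000_000, 6_000_000]
metrics = derived_metrics(sample_results[0], sample_results[1])
print(f"CPI: {metrics['cpi']:.2f}, IPC: {metrics['ipc']:.2f}")
```

With real measurements, the two values from results would replace sample_results.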

How to enable other PAPI counters

PAPI provides a wide range of performance counters to measure hardware-specific events. In addition to PAPI_TOT_INS and PAPI_TOT_CYC, other common counters include:

  • PAPI_FP_OPS: Number of floating-point operations.

  • PAPI_L1_DCM: Level-1 data cache misses.

  • PAPI_L2_TCM: Level-2 cache misses.

  • PAPI_BR_MSP: Branch mispredictions.

  • PAPI_L3_TCM: Level-3 cache misses.

  • PAPI_TLB_DM: Data Translation Lookaside Buffer (TLB) misses.

  • PAPI_VEC_INS: Number of executed vector instructions.

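Be aware that not all counters can be measured at the same time: a processor has a limited number of hardware counter registers, and some event combinations cannot be scheduled together.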

To add more counters, simply include the event in the list passed to start_counters:

high.start_counters([events.PAPI_TOT_INS, events.PAPI_TOT_CYC, events.PAPI_L1_DCM])

# Run the code to measure
...
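As in the first example, the results come back in the same order as the events passed to start_counters, so with several counters it can help to pair each value with its event name. The sketch below assumes that ordering; label_results is a hypothetical helper and the sample values are made up.

```python
def label_results(event_names, values):
    """Pair each counter value with its event name.

    Hypothetical helper for illustration; not part of pypapi.
    """
    if len(event_names) != len(values):
        raise ValueError("expected one value per event")
    return dict(zip(event_names, values))


# Same order as the list passed to start_counters
names = ["PAPI_TOT_INS", "PAPI_TOT_CYC", "PAPI_L1_DCM"]
sample_values = [4_000_000, 6_000_000, 12_500]  # made-up counts

labeled = label_results(names, sample_values)
for name, value in labeled.items():
    print(f"{name}: {value}")

# Derived metric: L1 data cache misses per thousand instructions
mpki = 1000 * labeled["PAPI_L1_DCM"] / labeled["PAPI_TOT_INS"]
print(f"L1 misses per 1000 instructions: {mpki:.3f}")
```

Looking values up by name avoids index mistakes when the event list changes.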

PAPI limitations in Python

Although PAPI is a powerful tool, it has some limitations:

  • Compatibility: PAPI may not be compatible with all processors and operating systems.

  • Counter availability: Some specific events may not be available depending on the hardware.

  • Counter incompatibility: Some counters or events may be incompatible with each other.
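One common way to work around counter incompatibility is to measure the events in several smaller groups, rerunning the code once per group. The sketch below illustrates the idea under stated assumptions: split_into_batches, measure_in_batches, and run_once are hypothetical names, and max_per_run is an assumed limit, since real compatibility depends on the hardware. In actual use, run_once would start the counters for one batch, run the code under test, and return the counts; here a fake stands in so the example runs without hardware counters.

```python
def split_into_batches(event_names, max_per_run):
    """Split a list of event names into groups of at most max_per_run."""
    if max_per_run < 1:
        raise ValueError("max_per_run must be at least 1")
    return [
        list(event_names[i:i + max_per_run])
        for i in range(0, len(event_names), max_per_run)
    ]


def measure_in_batches(event_names, max_per_run, run_once):
    """Call run_once(batch) for each batch and merge the counts by name.

    run_once is a caller-supplied function (an assumption of this sketch)
    that returns one count per event in the batch, in the same order.
    """
    totals = {}
    for batch in split_into_batches(event_names, max_per_run):
        counts = run_once(batch)
        totals.update(zip(batch, counts))
    return totals


def fake_run_once(batch):
    """Stand-in for a real measurement; returns placeholder values."""
    return [len(name) for name in batch]


wanted = ["PAPI_TOT_INS", "PAPI_TOT_CYC", "PAPI_L1_DCM",
          "PAPI_L2_TCM", "PAPI_BR_MSP"]
batches = split_into_batches(wanted, 2)
totals = measure_in_batches(wanted, 2, fake_run_once)
print(batches)
```

Because each group comes from a separate run, values from different groups are only comparable when the measured code behaves the same way on every run.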

Things to keep in mind

Measured performance depends on cache size: once a program's working set no longer fits in cache, cache misses and cycle counts rise. This matters when analyzing large workloads.

Cache misses do not change how many instructions a program executes. When a miss occurs, the CPU must fetch data from slower memory (such as main memory), so the affected instructions stall or are reordered while the data is loaded. The extra latency shows up as additional cycles, and as fewer instructions completed within a given time interval.

In practice, latency and cycles per instruction (CPI) are affected, but not the total number of instructions counted by PAPI. For a global instruction count, the value should remain constant regardless of cache misses.
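The point that cache misses raise the cycle count while leaving the instruction count unchanged can be illustrated with a simple stall-cycle model: total cycles are the instruction count times a base CPI, plus a fixed penalty per miss. This is a simplified sketch, not a description of how any particular CPU behaves; estimate_cycles and all numbers below are hypothetical.

```python
def estimate_cycles(instructions, base_cpi, cache_misses, miss_penalty_cycles):
    """Estimate total cycles as base execution cycles plus miss stalls.

    Simplified model for illustration; real processors overlap some
    of the miss latency with other work.
    """
    return instructions * base_cpi + cache_misses * miss_penalty_cycles


instructions = 1_000_000  # identical in both scenarios

# Scenario A: working set fits in cache, few misses
cycles_few = estimate_cycles(instructions, 1.0, 1_000, 100)
# Scenario B: working set exceeds the cache, many misses
cycles_many = estimate_cycles(instructions, 1.0, 20_000, 100)

cpi_few = cycles_few / instructions
cpi_many = cycles_many / instructions
print(f"few misses:  {cycles_few:.0f} cycles, CPI {cpi_few:.2f}")
print(f"many misses: {cycles_many:.0f} cycles, CPI {cpi_many:.2f}")
```

In this model the instruction count is the same in both scenarios, while CPI grows from 1.10 to 3.00 as the number of misses increases.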