Your Guide to Understanding System Performance

Meet Intel® VTune™ Amplifier’s Platform Profiler

Bhanu Shankar, Performance Tools Architect, and Munara Tolubaeva, Software Technical Consulting Engineer, Intel Corporation

Have you ever wondered
how well your system is being utilized throughout a long stretch of application
runs? Or whether your system was misconfigured, leading to a performance
degradation? Or, most importantly, how to reconfigure it to get the best
performance out of your code? State-of-the-art performance analysis tools,
which allow users to collect performance data for longer runs, don’t always
give detailed performance metrics. On the other hand, performance analysis
tools suitable for shorter application runs can overwhelm you with a huge
amount of data.

This article introduces you to Intel® VTune™ Amplifier’s
Platform Profiler, which provides data to learn whether there are problems with
your system configuration that can lead to low performance, or if there’s
pressure on specific system components that can cause performance bottlenecks.
It analyzes performance from either the system or hardware point of view, and
helps you identify under- or over-utilized resources. Platform Profiler uses a
progressive disclosure method, so you’re not overwhelmed with information. That
means it can run for multiple hours, giving you the freedom to monitor and
analyze long-running or always-running workloads in either development or
production environments.

You can use Platform
Profiler to:

  • Identify common system
    configuration problems
  • Analyze the performance of the
    underlying platform and find performance bottlenecks

First, the platform
configuration charts Platform Profiler provides can help you easily see how the
system is configured and identify potential problems with the configuration.
Second, you get system performance metrics including:

  • CPU and
    memory utilization
  • Memory
    and socket interconnect bandwidth
  • Cycles per
    instruction
  • Cache
    miss rates
  • Type of
    instructions executed
  • Storage
    device access metrics

These metrics provide
system-wide data to help you identify if the system―or a specific platform
component such as CPU, memory, storage, or network―is under- or over-utilized,
and whether you need to upgrade or reconfigure any of these components to
improve overall performance.

Platform Profiler in Action

To see it in action,
let’s look at some analysis results collected during a run of the
open-source HPC Challenge (HPCC) benchmark suite and see how it
uses our test system. HPCC consists of seven tests to measure performance of:

  • Floating-point (FP) execution
  • Memory access
  • Network communication operations

Figure 1 shows the system configuration view of the machine where we ran our
tests. The two-socket machine contained Intel® Xeon® Platinum 8168 processors,
with two memory controllers and six memory channels per socket, and two storage
devices connected to Socket 0.

Figure 2 shows CPU utilization metrics and the cycles per instruction (CPI)
metric, the average number of cycles spent per retired instruction, which
indicates how efficiently the CPUs are doing their work. Figure 3 shows memory,
socket interconnect, and I/O bandwidth metrics. Figure 4 shows the ratio of
load, store, branch, and FP instructions being used per core. Figures 5 and 6
show the memory bandwidth and latency charts for each memory channel. Figure 7
shows the rate of branch and FP instructions over all instructions. Figure 8
shows the L1 and L2 cache miss rates per instruction. Figure 9 shows the memory
consumption chart. On average, only 51% of memory was consumed throughout the
run. A larger test case can be run to increase memory consumption.
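
To make the per-instruction metrics concrete, here is a minimal Python sketch
of how CPI and cache misses per instruction can be derived from raw counter
deltas. The CounterSample type, its field names, and the counter values are
hypothetical and chosen only for illustration; they are not Platform Profiler
output or part of its interface.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical hardware-counter deltas over one sampling interval."""
    cycles: int
    instructions: int
    l1_misses: int
    l2_misses: int


def cpi(sample: CounterSample) -> float:
    """Cycles per instruction: lower means more work done per cycle."""
    return sample.cycles / sample.instructions


def misses_per_instruction(misses: int, instructions: int) -> float:
    """Cache misses normalized by the number of instructions executed."""
    return misses / instructions


# Illustrative values only, not measurements from the HPCC run.
sample = CounterSample(
    cycles=3_000_000, instructions=2_000_000, l1_misses=40_000, l2_misses=10_000
)

print(f"CPI: {cpi(sample):.2f}")
print(f"L1 misses per instruction: "
      f"{misses_per_instruction(sample.l1_misses, sample.instructions):.4f}")
print(f"L2 misses per instruction: "
      f"{misses_per_instruction(sample.l2_misses, sample.instructions):.4f}")
```

Normalizing misses by instruction count, rather than reporting raw miss
counts, makes intervals with very different amounts of work comparable.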

In Figures 5 and 6, we see that only two of the six memory channels are being
used. This clearly shows that there’s a problem with the memory DIMM
configuration on our test system that’s preventing us from making full use of
the memory channel capacity, which degrades HPCC performance.
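
As a rough illustration of why unused channels matter, the sketch below
estimates theoretical peak DRAM bandwidth from the channel count and flags
channels that carry almost no traffic. It assumes DDR4-2666 memory and a 64-bit
(8-byte) channel. The channel names, bandwidth values, and idle threshold are
hypothetical, not measurements from our test system.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for the given number of channels."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def idle_channels(avg_bandwidth_gbs, threshold_gbs=0.1):
    """Return names of channels whose average bandwidth is below the threshold."""
    return sorted(
        name for name, bw in avg_bandwidth_gbs.items() if bw < threshold_gbs
    )


# Hypothetical per-channel averages (GB/s) for one socket.
socket0 = {"ch0": 9.8, "ch1": 0.0, "ch2": 0.0, "ch3": 10.1, "ch4": 0.0, "ch5": 0.0}

# Assumed memory speed (DDR4-2666), expressed in mega-transfers per second.
ASSUMED_MTS = 2666

unused = idle_channels(socket0)
active = len(socket0) - len(unused)

print("Idle channels:", unused)
print(f"Peak with all channels:    "
      f"{peak_bandwidth_gbs(len(socket0), ASSUMED_MTS):.1f} GB/s")
print(f"Peak with active channels: "
      f"{peak_bandwidth_gbs(active, ASSUMED_MTS):.1f} GB/s")
```

Under these assumptions, populating two channels instead of six caps the
theoretical peak at one third of what the socket could deliver.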

The CPI (Figure 2), DDR memory bandwidth utilization, and instruction mix
metrics in the figures show which type of test, either compute-based (FP
operations) or memory-based, is being executed at a specific time during the
HPCC run. For example, we can see that during 80-130 and 200-260 seconds of the
run, both the memory bandwidth utilization and the CPI increase, confirming
that a memory-based test inside HPCC was executed during those periods.
Moreover, the Instruction Mix chart in Figure 7 shows that threads execute FP
instructions between 280 and 410 seconds, while Figure 3 shows some memory
access operations between 275 and 360 seconds. These observations suggest that
a test mixing compute and memory operations is executed during this period.
Another observation is that we may be able to improve the performance of the
compute part of this test by optimizing the execution of FP operations using
code vectorization.
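
The reasoning above, reading phases off the CPI, memory bandwidth, and
instruction mix charts, can be sketched in Python as a simple rule-based
classifier. This is only an illustration of the idea, not how Platform Profiler
works internally. The sample values, the dictionary keys, and all three
thresholds are hypothetical.

```python
from itertools import groupby


def classify(sample, bw_threshold, cpi_threshold, fp_threshold):
    """Label one sampling interval from its bandwidth, CPI, and FP ratio."""
    memory_active = sample["bw_gbs"] >= bw_threshold
    fp_heavy = sample["fp_ratio"] >= fp_threshold
    if fp_heavy and memory_active:
        return "mixed"
    if memory_active and sample["cpi"] >= cpi_threshold:
        return "memory"
    if fp_heavy:
        return "compute"
    return "other"


def phases(samples, **thresholds):
    """Merge consecutive intervals with the same label into (start, end, label)."""
    labelled = [(s["t"], classify(s, **thresholds)) for s in samples]
    result = []
    for label, group in groupby(labelled, key=lambda item: item[1]):
        times = [t for t, _ in group]
        result.append((times[0], times[-1], label))
    return result


# Hypothetical thresholds and samples (t in seconds), for illustration only.
THRESHOLDS = dict(bw_threshold=20.0, cpi_threshold=1.5, fp_threshold=0.2)
samples = [
    {"t": 0, "cpi": 0.6, "bw_gbs": 5.0, "fp_ratio": 0.05},
    {"t": 10, "cpi": 1.8, "bw_gbs": 40.0, "fp_ratio": 0.02},
    {"t": 20, "cpi": 1.9, "bw_gbs": 42.0, "fp_ratio": 0.02},
    {"t": 30, "cpi": 0.7, "bw_gbs": 4.0, "fp_ratio": 0.40},
    {"t": 40, "cpi": 1.0, "bw_gbs": 30.0, "fp_ratio": 0.35},
]

for start, end, label in phases(samples, **THRESHOLDS):
    print(f"{start:>3}-{end:<3} s: {label}")
```

In practice, the thresholds would have to be tuned per platform; fixed cutoffs
like these are a simplification.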

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig1-1024x261.png

Figure 1 – System
Configuration View

CPU Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig2-1024x680.jpg

Figure 2 – CPU
utilization metrics

Throughput Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig3-1024x544.jpg

Figure 3 – Throughput
metrics for memory, UPI and I/O

Operations Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig4-1024x441.jpg

Figure 4 – Types of instructions
used throughout program execution

Memory Throughput

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig5-1024x378.jpg

Figure 5 – Memory
bandwidth chart at a memory channel level

Memory Latency

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig6-1024x372.jpg

Figure 6 – Memory
latency chart at a memory channel level

Instruction Mix

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig7-1024x388.jpg

Figure 7 – Rate of
branch and floating point instructions over all instructions

L1 and L2 Miss per Instruction

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig8-1024x391.jpg

Figure 8 – L1 and L2
miss rate per instruction

Memory Utilization

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig9-1024x100.jpg

Figure 9 – Memory
consumption

HPCC doesn’t perform any tests that include I/O, so we’ll show Platform
Profiler results for disk access from a second test case, LS-Dyna*, a
proprietary multiphysics simulation package developed by LSTC. Figure 10 shows
disk I/O throughput for LS-Dyna. Figure 11 shows I/O operations per second
(IOPS) and latency metrics for the LS-Dyna application. The LS-Dyna implicit
model periodically flushes data to disk, so we see periodic spikes in the I/O
throughput chart (see read/write throughput in Figure 10). Since the amount of
data to be written isn’t large, the I/O latency remains consistent during the
whole run (see read/write latency in Figure 11).
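
For readers who want to see how the disk metrics relate to each other, here is
a minimal Python sketch that derives throughput, IOPS, and average latency
from per-interval counters, and flags throughput spikes relative to the median.
The function names, the counter values, and the spike factor are hypothetical.
They are not taken from the LS-Dyna run or from Platform Profiler.

```python
from statistics import median


def io_metrics(bytes_transferred, operations, busy_time_s, interval_s):
    """Return (throughput in MB/s, IOPS, average latency in ms) for one interval."""
    throughput_mbs = bytes_transferred / interval_s / 1e6
    iops = operations / interval_s
    # Average latency: total time spent servicing requests per operation.
    latency_ms = busy_time_s * 1e3 / operations if operations else 0.0
    return throughput_mbs, iops, latency_ms


def spike_indices(values, factor=3.0):
    """Indices of samples that exceed `factor` times the median of the series."""
    baseline = median(values)
    return [i for i, v in enumerate(values) if v > factor * baseline]


# One hypothetical 1-second interval: 50 MB written in 1,000 operations.
throughput, iops, latency = io_metrics(
    bytes_transferred=50_000_000, operations=1000, busy_time_s=0.5, interval_s=1.0
)
print(f"{throughput:.1f} MB/s, {iops:.0f} IOPS, {latency:.2f} ms average latency")

# Hypothetical write throughput series (MB/s) with periodic flushes.
write_mbs = [2.0, 2.0, 50.0, 2.0, 2.0, 48.0, 2.0]
print("Spike intervals:", spike_indices(write_mbs))
```

Using the median as the baseline keeps the occasional flush from inflating the
reference level, which a mean would do.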

Read/Write Throughput

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig10-1-1024x249.jpg

Read/Write Operation Mix

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig10-2-1024x220.jpg

Figure 10 – Disk I/O
throughput for LS-Dyna

Read/Write Latency

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig11-1-1024x233.jpg

IOPS

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig11-2-1024x215.jpg

Figure 11 – IOPS and
latency metrics for LS-Dyna

Understanding System Performance

In this article, we
presented Platform Profiler, a tool that analyzes performance from the system
or hardware point of view. It provides insights into where the system is
bottlenecked and identifies whether there are any over- or under-utilized
subsystems and platform-level imbalances. We also showed its usage and the
results collected from the HPCC benchmark suite and the LS-Dyna application.
Using the tool, we found that poor memory DIMM placement was limiting memory
bandwidth. We also found that part of the test had a high rate of FP execution,
which we could optimize for better performance using code vectorization.
Overall, we found that these specific test cases for HPCC and LS-Dyna don’t put
any pressure on our test system, and there’s still headroom in system
resources, meaning we can run an even larger test case next time.

Software and workloads
used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. For more complete information visit
http://www.intel.com/performance.

Intel’s compilers may
or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel
does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.

Source: https://educationecosystem.com/blog/your-guide-to-understanding-system-performance/

