04 Metrics and Measurement

Software Engineering: Principles and practices (technical and nontechnical) for confidently building high-quality software.

Measurements and Metrics

  • Measurement: the empirical, objective assignment of numbers, according to a rule derived from a model or theory, to attributes of objects or events, with the intent of describing them

Software Quality Metrics

  • Entities: software product, modules, software development process, people
  • Software qualities: Functionality (e.g., data integrity), Scalability, Security, Extensibility, Bugginess, Documentation, Performance, Installability, Availability, Consistency, Portability, Regulatory compliance
  • Process qualities: Development efficiency, Meeting efficiency, Conformance to processes, Reliability of predictions, Fairness in decision making, Regulatory compliance, On-time release
  • People qualities:
    • Developers: Maintainability • Performance • Employee satisfaction and well-being • Communication and collaboration • Efficiency and flow • Satisfaction with engineering system • Regulatory compliance
    • Customers: Satisfaction • Ease of use • Feature usage • Regulatory compliance
  • Non-trivial qualities: Software: Code elegance, Code maintainability; Process: Fairness in decision making; Team: Team collaboration, Creativity

McNamara fallacy

  • Measure whatever can be easily measured.
  • Disregard that which cannot be measured easily.
  • Presume that which cannot be measured easily is not important.
  • Presume that which cannot be measured easily does not exist.

Code Complexity

  • Lines of Code (LOC): A simple measure of how large a codebase is, but not always indicative of complexity or quality.
  • Halstead Volume: Measures the size and complexity of a program based on the number of operators and operands.
  • Cyclomatic Complexity: Measures the number of independent paths through a program and can indicate the number of test cases required for full coverage (see the sketch after this list).
  • Object-Oriented Metrics: Includes metrics like number of methods per class, depth of inheritance, and coupling between classes.
  • “Allowable mass” proxy: the more complex the software task, the larger the share of the mass budget (in grams) it consumes.
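
As a concrete illustration of cyclomatic complexity, here is a minimal Python sketch (not a production analyzer) that approximates McCabe's metric for a function by counting decision points in its AST. The example function and the chosen set of node types are illustrative assumptions:

```python
import ast

# Node types treated as decision points; a rough approximation of
# McCabe's definition (e.g., each `and`/`or` chain is counted once here).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                  ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

example = '''
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
'''
print(cyclomatic_complexity(example))  # 3: the if and the elif each add a path
```

A result of 3 matches the intuition above: three independent paths through the function, hence at least three test cases for full coverage.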

How to use measurements and metrics?

Goal-Question-Metric (GQM) Framework

  • Goal: Define what you want to achieve.
  • Questions: Determine what you need to answer to know if your goal is met.
  • Metrics: Identify what measurements are necessary to answer the questions.
  • Examples:
    • Goal: Evaluate the effectiveness of a coding standard from the team's perspective.
    • Questions: How comprehensible is the coding standard? What is its impact on team productivity?
    • Metrics: Number of revisions required to achieve compliance, team members' understanding, code size.
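
A GQM plan can be written down as plain data before any tooling exists. The following minimal sketch encodes the example above as a goal-questions-metrics hierarchy; all strings are illustrative:

```python
# A minimal sketch: the GQM plan above as plain data,
# structured goal -> questions -> metrics.
gqm_plan = {
    "goal": ("Evaluate the effectiveness of a coding standard "
             "from the team's perspective"),
    "questions": {
        "How comprehensible is the coding standard?": [
            "team members' understanding (survey score)",
        ],
        "What is its impact on team productivity?": [
            "number of revisions required to achieve compliance",
            "code size",
        ],
    },
}

for question, metrics in gqm_plan["questions"].items():
    print(f"Q: {question}")
    for metric in metrics:
        print(f"  M: {metric}")
```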

Defining Goals

  • Purpose: The reason or aim for defining the goal. It can be to improve, evaluate, or monitor something.
    • Example: Improve the reliability of the software.
  • Issue: The specific aspect or challenge you're addressing. Common issues include reliability, usability, or effectiveness.
    • Example: Address the issue of poor usability in the current software design.
  • Object: The target of the goal. This could be the final product, a specific component, a process, or an activity.
    • Example: Evaluate the usability of the user interface component.
  • Viewpoint: The perspective from which the goal is assessed. It can be from the perspective of any stakeholder (e.g., user, developer, manager).
    • Example: Measure the usability from the perspective of end users.

Measurement for Decision Making

  • Key Decisions Metrics Help With:
    • Fund project?: Should we allocate resources and continue funding this project?
    • More testing?: Is the current testing sufficient, or do we need additional tests to ensure quality?
    • Fast enough? Secure enough?: Is the system performing efficiently and securely?
    • Code quality sufficient?: Is the codebase meeting the desired standards of quality?
    • Which feature to focus on?: Which features require the most attention and improvement, based on their importance?
    • Developer bonus?: Should performance metrics influence developer bonuses?
    • Time and cost estimation? Predictions reliable?: Are our time and cost estimations accurate, and are they backed by reliable data?

Trend Analyses

  • Monitoring Test Result Trends: ![[Screen Shot 2024-10-06 at 21.00.37.png]]
    • The chart shows trends in test results over time, with the green areas representing passing tests and the red spikes indicating failures.
    • Regular trend analysis helps track improvements, regressions, or inconsistencies in the software over time.
    • Purpose: Trend analysis enables early detection of performance degradation or bugs and helps maintain a stable codebase.
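
One simple way to operationalize such trend analysis is to compare each new test run against a recent baseline. A minimal sketch, assuming hypothetical nightly pass rates and an arbitrary drop threshold:

```python
# A minimal sketch: flag a possible regression when the test pass rate
# drops more than a threshold below its recent moving average.
pass_rates = [0.98, 0.97, 0.99, 0.98, 0.90, 0.97]  # one value per nightly run

WINDOW, DROP_THRESHOLD = 3, 0.05

for i in range(WINDOW, len(pass_rates)):
    baseline = sum(pass_rates[i - WINDOW:i]) / WINDOW
    if pass_rates[i] < baseline - DROP_THRESHOLD:
        print(f"run {i}: pass rate {pass_rates[i]:.2f} "
              f"vs recent average {baseline:.2f} -> possible regression")
```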

Benchmarking Against Standards

  • Monitor and Compare: ![[Screen Shot 2024-10-06 at 21.01.19.png]]
    • Monitor multiple projects or modules: Keep track of different projects or code modules to establish typical values for metrics like test-to-code ratios.
    • Report deviations: Identify and report when a project deviates from established benchmarks or standards.
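
A minimal sketch of the same idea, using made-up test-to-code ratios for hypothetical projects and flagging deviations from the typical (median) value:

```python
# A minimal sketch: compare each project's test-to-code ratio against
# the fleet median and report outliers. All numbers are illustrative.
from statistics import median

loc = {  # (test LOC, production LOC) per hypothetical project
    "billing": (12_000, 30_000),
    "search":  (8_000, 40_000),
    "gateway": (1_000, 25_000),
}

ratios = {name: test / prod for name, (test, prod) in loc.items()}
typical = median(ratios.values())

for name, ratio in ratios.items():
    if abs(ratio - typical) > 0.5 * typical:  # deviates >50% from the norm
        print(f"{name}: test-to-code ratio {ratio:.2f} "
              f"deviates from typical {typical:.2f}")
```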

Case study: Autonomous Vehicle Software

By what metrics can we judge AV software (e.g., safety)?

  1. Code Coverage
    • Definition: Code coverage refers to the amount of code that gets executed during testing.
    • Types of coverage:
      • Statement coverage: Ensures individual statements are executed.
      • Line coverage: Measures whether specific lines of code are reached during testing.
      • Branch coverage: Ensures all possible branches in if-else conditions are covered.
    • Example: 75% branch coverage means that 3 out of 4 possible outcomes in an if-else statement have been tested.

  2. Model Accuracy
    • Definition: Evaluates how accurately the machine learning models in AV software recognize objects, make decisions, or navigate environments.
    • Training: The models are trained on labeled data, which includes sensor data (e.g., from cameras, radar) and the corresponding ground truth.
    • Testing: Accuracy is computed on a separate labeled test set.
    • Example: A model with 90% accuracy recognizes objects correctly for 90% of the test inputs.

  3. Failure Rate
    • Definition: Tracks the frequency of crashes or fatalities involving autonomous vehicles.
    • Measurement units: per 1,000 rides, per million miles, or per month (see the sketch after this list).

  4. Mileage
    • Definition: A measure of how many miles an autonomous vehicle has driven, which provides data on the software's real-world performance over time.
    • Importance: The more miles an AV drives, the more data it accumulates to improve its algorithms, reduce failure rates, and increase reliability.
    • Example: Waymo's vehicles have accumulated over 15 billion autonomously driven miles in simulation and over 20 million real-world miles on public roads. Such data is used to demonstrate the safety and reliability of AV software.
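
To make the failure-rate units concrete, here is a minimal sketch with illustrative numbers (the failure count and mileage are invented) that normalizes observed failures to a per-million-miles rate:

```python
# A minimal sketch: normalize failure counts to a common exposure unit
# (failures per million miles). All inputs are hypothetical.
failures = 3              # observed incidents in the reporting period
miles_driven = 4_200_000  # total autonomous miles in the same period

rate_per_million = failures / (miles_driven / 1_000_000)
print(f"{rate_per_million:.2f} failures per million miles")  # ~0.71
```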

Risks and challenges

  • The streetlight effect: A known observational bias. People tend to look for something only where it’s easiest to do so.
  • Making inferences: provide a theory (from domain knowledge, independent of the data); show correlation; demonstrate the ability to predict new cases (replicate/validate).
  • Spurious Correlations:
    • Confounding variables.
    • Berkson's paradox: a spurious (often negative) correlation that appears when a dataset is selected or filtered on a common outcome of the variables being studied.
    • Survivorship bias: occurs when only the successes (or survivors) are considered in analysis. WWII plane example: armor the areas without bullet holes, since planes hit there never returned.
  • Measurement Reliability: The goal is to reduce uncertainty and increase consistency in measurements, which often requires multiple observations to account for variability (see the sketch below).
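
As a small illustration of why repeated observations help, the sketch below (with hypothetical benchmark timings) reports the mean of several runs together with its standard error, which shrinks as the number of observations grows:

```python
# A minimal sketch: repeated measurements of the same quantity and the
# uncertainty of their mean. Timings are hypothetical benchmark runs.
from math import sqrt
from statistics import mean, stdev

timings_ms = [102.1, 98.7, 105.3, 99.8, 101.4]

m = mean(timings_ms)
se = stdev(timings_ms) / sqrt(len(timings_ms))  # standard error of the mean
print(f"mean {m:.1f} ms +/- {se:.1f} ms (n={len(timings_ms)})")
```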

Metrics and incentives

  • Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure (e.g., focusing solely on lines of code might encourage bad coding practices).
  • Incentivizing Productivity: Basing developer rewards on metrics (like number of bugs fixed or lines of code written) can lead to undesirable behaviors, such as writing unnecessary code or overlooking bugs.