Cache memory organization and operation¶
- Cache memories are small, fast SRAM-based memories managed automatically in hardware
- Hold frequently accessed blocks of main memory
- CPU looks first for data in cache ![[Screen Shot 2024-06-10 at 02.59.35.png]]
General Cache Organization¶
![[Pasted image 20240610030113.png]] $\text{Cache Size} = S \times E \times B \text{ data bytes}$
Cache Read¶
- Locate set (set index)
- Check if any line in set has matching tag (tag)
- Yes + line valid: hit
- Locate data starting at offset
No match or not valid (= miss): - One line in set is selected for eviction and replacement - Replacement policies: random, least recently used (LRU) - lowest locality, …
Cache Write¶
-
Write Hit
When we write a word w that is already cached (write hit), first the cache updates its copy of w. Then, cache can do 2 things about updating the copy of w in the next lower level cache :
- Write-through
- write immediately to memory
- Advantage : Being simple
- Disadvantage : causing bus traffic with every write.
- Write-back
- Write-back defers the update as long as possible by writing the updated block to the next lower level only when it is evicted from the cache by the replacement algorithm.
- defer write to memory until replacement of line
- Advantage : Reducing the amount of bus traffic.
- Disadvantage : Additional complexity - the cache must maintain an additional dirty bit for each cache line that indicates whether or not the cache block has been modified.
- Write-through
-
Write Miss
- Write-allocate
- load into cache, update line in cache
- Write-back caches are typically write-allocate.
- Advantage : Can exploit spatial locality of writes
- Disadvantage : Every miss results in a block transfer from the next lower level to the cache.
- No-write-allocate
- writes straight to memory, does not load into cache
- Write-through caches are typically no-write-allocate.
- Write-allocate
To write cache-friendly programs, assume the write-back, write-allocate caches. - Modern computers generally uses write-back caches at all levels, and it is symmetric to the way reads are handled, in that it tries to exploit locality.
Practical Write-back Write-allocate ![[Screen Shot 2024-06-10 at 03.58.09.png]] - A write to address X is issued - If it is a hit - Update the contents of block - Set dirty bit to 1 (bit is sticky and only cleared on eviction) - If it is a miss - Fetch block from memory (per a read miss) - The perform the write operations (per a write hit) - If a line is evicted and dirty bit is set to 1 - The entire block of $2^b$ bytes are written back to memory - Dirty bit is cleared (set to 0) - Line is replaced by new contents
Intel Core i7 Cache Hierarchy¶
![[Screen Shot 2024-06-10 at 04.09.06.png]]
Cache Performance Metrics¶
- Miss Rate
- Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate
- Typical numbers (in percentages): ▪ 3-10% for L1 ▪ can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
- Time to deliver a line in the cache to the processor
- includes time to determine whether the line is in the cache
- Typical numbers: ▪ 4 clock cycle for L1 ▪ 10 clock cycles for L2
- Time to deliver a line in the cache to the processor
-
Miss Penalty
- Additional time required because of a miss
- typically 50-200 cycles for main memory (Trend: increasing!)
- Additional time required because of a miss
-
Average access time = HT + MR * MP
- 99% hits is twice as good as 97%
- This is why “miss rate” is used instead of “hit rate”
- Cache friendly code: <5% miss rate
- Make the common case go fast
- Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Make the common case go fast
Performance impact of caches¶
The memory mountain¶
- Read throughput (read bandwidth): Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured read throughput as a function of spatial and temporal locality. ![[Screen Shot 2024-06-10 at 04.55.11.png]]
Rearranging loops to improve spatial locality¶
Matrix Multiplication ![[Screen Shot 2024-06-10 at 05.09.14.png]] Miss rate analysis - Assume: - Block size = 32B (big enough for four doubles) - Matrix dimension (N) is very large: Approximate 1/N as 0.0 - Cache is not even big enough to hold multiple rows
Layout of C Arrays in Memory
- C arrays allocated in row-major order
- each row in contiguous memory locations
- a[i][j] = a[i * N + j]where N is the number of columns
- Stepping through columns in one row:
- for (i = 0; i < N; i++) sum += a[0][i];
- accesses successive elements
- if block size (B) > sizeof($a_{ij}$) bytes, exploit spatial locality: miss rate = sizeof($a_{ij}$) / B
- Stepping through rows in one column:
- for (i = 0; i < n; i++) sum += a[i][0];
- accesses distant elements
- no spatial locality! miss rate = 1 (i.e. 100%)
![[Screen Shot 2024-06-11 at 16.29.31.png]]
Using blocking to improve temporal locality¶
Matrix Multiplication¶
![[Screen Shot 2024-06-11 at 16.49.04.png]]
Cache Miss Analysis - Assume: - Matrix elements are doubles - Cache block = 8 doubles - Cache size C << n (much smaller than n) - First iteration: n/8 + n = 9n/8 misses - Second iteration: n/8 + n = 9n/8 misses - Total misses: $9n/8 \times n^2 = (9/8) n^3$
Blocked Matrix Multiplication¶
![[Screen Shot 2024-06-11 at 16.51.30.png]]
Cache Miss Analysis
- Assume:
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: $3B^2 < C$
- First (block) iteration:
- $B\times B/8$ misses for each block
- $2n/B \times B^2 /8 = n\cdot B/4$ (omitting matrix c)
- Second (block) iteration: $2n/B x B^2 /8 = n\cdot B/4$
- Total misses: $n\cdot B/4 \times (n/B)^2 = n^3 /(4B)$
Blocking Summary¶
- No blocking: $(9/8) n^3$ misses
- Blocking: $(1/(4B)) n^3$ misses
- Use largest block size $B$, such that $B$ satisfies $3B^2 < C$
- Fit three blocks in cache! Two input, one output.
- Reason for dramatic difference:
- Matrix multiplication has inherent temporal locality:
- Input data: $3n^2$ , computation $2n^3$
- Every array elements used O(n) times!
- But program has to be written properly
- Matrix multiplication has inherent temporal locality: