# **CPSC 213**

## **Introduction to Computer Systems**

Unit 3

Course Review

### **Learning Goals 1**

· Endianness and memory-address alignment

· Instance variables of objects and structs

· Dynamic storage allocation and deallocation

· Pointers in C, & and \* operators, and pointer arithmetic

· Procedures, call, return, stacks, local variables and arguments

· Dynamic flow control, polymorphism, and switch statements

· Machine model for access to global variables; static and dynamic arrays and structs

Topics: Memory · Pointers · Instance Variables · Dynamic Storage · If statements and loops · Dynamic Flow Control · Procedures

## Common mistakes:

- forgetting to pad correctly when sign extending
  - normally, pad with 0s when extending to a larger size
    - 0x8b byte (139) becomes 0x0000008b int (139)
  - but that would change the value of a negative number in 2's complement:
    - 0xff byte (-1) should not be 0x000000ff int (255)
  - so: pad with Fs for negative numbers in 2's complement:
    - 0xff byte (-1) becomes 0xffffffff int (-1)
  - in binary: padding with 1, not 0
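The padding rule above can be sketched in C. This is a minimal illustration, not course-provided code; the function names `zero_extend` and `sign_extend_bits` are hypothetical.

```c
#include <stdint.h>

/* Zero extension: the byte is treated as unsigned, so the upper 24 bits are 0. */
uint32_t zero_extend(uint8_t b) {
    return (uint32_t)b;
}

/* Sign extension: if the byte's top bit is 1 (negative in two's complement),
   pad the upper 24 bits with 1s (Fs in hex); otherwise pad with 0s. */
uint32_t sign_extend_bits(uint8_t b) {
    uint32_t bits = b;
    if (b & 0x80) {
        bits |= 0xFFFFFF00u;
    }
    return bits;
}
```

For instance, `sign_extend_bits(0xff)` yields the bit pattern `0xffffffff` (still -1), while `zero_extend(0xff)` yields `0x000000ff` (255).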

reminder: why do all this?

- add/subtract works without checking if number positive or negative

**Learning Goals 2**

· Read assembly code

· Write assembly code

· Connection between ISA and high-level programming language

· Using and implementing threads

· Using and implementing spinlocks, monitors, condition variables and semaphores

· PIO, DMA, interrupts and asynchronous programming

· Virtual memory translation and implementation tradeoffs

Topics: Numbers · Threads · Read Assembly · Write Assembly · Synchronization · Virtual Memory · ISA-PL Connection

### Hex vs. decimal vs. binary

- in SM-213 assembly, 0x in front of a number means it's in hex
  - otherwise it's decimal
- converting from hex to decimal
  - convert each hex digit separately to decimal
  - $0x2a3 = 2 \times 16^2 + 10 \times 16^1 + 3 \times 16^0 = 675$
- converting from hex to binary
  - convert each hex digit separately to binary: 4 bits in one hex digit
- converting from binary to hex
  - convert each 4-bit block to hex digit
- reconstruct your own lookup table in the margin if you need to do this
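The digit-by-digit conversions can be checked with a few lines of C. This is a sketch for self-study; the helper names `hex_digit_value`, `hex_to_decimal`, and `hex_digit_to_binary` are hypothetical, and the input is assumed to be a hex string without the `0x` prefix.

```c
/* Value (0..15) of one hex digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hex to decimal: each digit is weighted by a power of 16.
   Returns -1 if the string contains a non-hex character. */
long hex_to_decimal(const char *hex) {
    long value = 0;
    for (; *hex != '\0'; hex++) {
        int d = hex_digit_value(*hex);
        if (d < 0) return -1;
        value = value * 16 + d;
    }
    return value;
}

/* Hex to binary: one hex digit is exactly 4 bits.
   Writes 4 characters plus a terminating NUL; invalid input yields "0000". */
void hex_digit_to_binary(char c, char out[5]) {
    int d = hex_digit_value(c);
    if (d < 0) d = 0;
    for (int bit = 0; bit < 4; bit++) {
        out[bit] = ((d >> (3 - bit)) & 1) ? '1' : '0';
    }
    out[4] = '\0';
}
```

Calling `hex_to_decimal("2a3")` gives 675, matching the worked sum above.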

## Big Ideas: First Half

- Static and dynamic
  - anything that can be determined before execution (by compiler) is called static
  - anything that can only be determined during execution (at runtime) is called dynamic
- SM-213 Instruction Set Architecture
  - hardware context is CPU and main memory with fetch/execute loop



## **Memory Access**

- Memory is
  - an array of bytes, indexed by byte address
- Memory access is
  - restricted to a transfer between registers and memory
  - the ALU is thus unchanged, it still takes operands from registers
  - this is the approach taken by Reduced Instruction Set Computers (RISC)
- Common mistakes
  - wrong: trying to have one instruction read from memory and do computation all at once
    - must always load from memory into a register as the first step, then do ALU computations from registers only
  - wrong: trying to have one instruction do computation and store into memory all at once
    - all ALU operations write to a register; the result can then be stored into memory on the next step

Two's Complement: Reminder

- unsigned int (32 bits): all possible values interpreted as positive numbers
- signed: two's complement
  - the first half of the numbers are positive, the second half are negative
  - start at 0x0, go to top positive value, "wrap around" to most negative value, end up at -1

# **Loading and Storing**

- load into register
  - immediate value: 32-bit number directly inside instruction
  - from memory: base in register, direct offset as 4-bit number
    - offset/4 stored in machine language
    - common mistake: forgetting the 0 offset when you just want to store a value from a register into memory
  - from memory: base in register, index in register
  - from register
- store into memory
  - base in register, direct offset as 4-bit number
  - base in register, index in register
  - common mistake: cannot directly store immediate value into memory

| Name              | Semantics                         | Assembly         | Machine     |
|-------------------|-----------------------------------|------------------|-------------|
| load immediate    | $r[d] \leftarrow v$               | ld \$v, rd       | 0d vvvvvvvv |
| load base+offset  | $r[d] \leftarrow m[r[s]+(o=p*4)]$ | ld o(rs), rd     | 1psd        |
| load indexed      | $r[d] \leftarrow m[r[s]+4*r[i]]$  | ld (rs,ri,4), rd | 2sid        |
| register move     | $r[d] \leftarrow r[s]$            | mov rs, rd       | 60sd        |
| store base+offset | $m[r[d]+(o=p*4)] \leftarrow r[s]$ | st rs, o(rd)     | 3spd        |
| store indexed     | $m[r[d]+4*r[i]] \leftarrow r[s]$  | st rs, (rd,ri,4) | 4sdi        |

Two's Complement and Sign Extension

exam advice

## **Numbers**

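| Decimal | Hex | Binary |
|---------|-----|--------|
| 0       | 0   | 0000   |
| 1       | 1   | 0001   |
| 2       | 2   | 0010   |
| 3       | 3   | 0011   |
| 4       | 4   | 0100   |
| 5       | 5   | 0101   |
| 6       | 6   | 0110   |
| 7       | 7   | 0111   |
| 8       | 8   | 1000   |
| 9       | 9   | 1001   |
| 10      | A   | 1010   |
| 11      | B   | 1011   |
| 12      | C   | 1100   |
| 13      | D   | 1101   |
| 14      | E   | 1110   |
| 15      | F   | 1111   |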

### Common mistakes

- treating hex number as decimal: interpreting 0x20 as 20, but it's actually decimal 32
- using decimal number instead of hex: writing 0x20 when you meant decimal 20
- wasting your time converting into format you don't particularly need
- wasting your time trying to do computations in unhelpful format
  - adding small numbers is easy in hex: B+2=D
  - unless multiplying/dividing by a power of 2: then hex or binary is fast with bit shifting

## **Endianness**

### Consider 4-byte memory word and 32-bit register

- it has memory addresses i, i+1, i+2, and i+3
- we'll just say it's "at address i and is 4 bytes long"
- e.g., the word at address 4 is in bytes 4, 5, 6 and 7.
- Big or Little Endian
  - we could start with the BIG END of the number
    - most computer makers except for Intel, also network protocols
  - or we could start with the LITTLE END
    - Intel

| i + 3                              | i + 2                              | i + 1                             | i                                |
|------------------------------------|------------------------------------|-----------------------------------|----------------------------------|
| 2 <sup>31</sup> to 2 <sup>24</sup> | 2 <sup>23</sup> to 2 <sup>16</sup> | 2 <sup>15</sup> to 2 <sup>8</sup> | 2 <sup>7</sup> to 2 <sup>0</sup> |

## **Alignment**

### Power-of-two aligned addresses simplify hardware

• required on many machines, faster on all machines



- computing alignment: for what size integers is address X aligned?
  - byte address to integer address is division by a power of two, which is just shifting bits
  - (j shifted k bits to right)  $j / 2^k == j >> k$
  - convert address to decimal; divide by 2, 4, 8, 16, ...; stop as soon as there's a remainder
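The alignment check can be sketched in C as follows. The function names `is_aligned` and `max_alignment` are hypothetical, and both assume that sizes are powers of two.

```c
#include <stdint.h>

/* An address is aligned to a power-of-two size when its low bits are all 0. */
int is_aligned(uint32_t addr, uint32_t size) {
    return (addr & (size - 1)) == 0;
}

/* Largest power-of-two size (up to max_size, itself a power of two) for which
   addr is aligned. Equivalent to sweeping the binary address from right to
   left and stopping at the first 1 bit. */
uint32_t max_alignment(uint32_t addr, uint32_t max_size) {
    uint32_t size = 1;
    while (size < max_size && (addr & size) == 0) {
        size <<= 1;
    }
    return size;
}
```

For example, address 0x2004 ends in binary `0100`, so it is aligned for 1-, 2-, and 4-byte integers but not for 8-byte ones.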

## Static Variable Access (static arrays)

Number line (figure): unsigned values run from 0x0 (0) to 0xffffffff (4,294,967,295); signed two's complement values run from 0x80000000 (-2,147,483,648) through 0xffffffff (-1) and 0x0 (0) up to 0x7fffffff (2,147,483,647).

Static Memory Layout

- 0x1000: value of a
- 0x2000: value of b[0]
- 0x2004: value of b[1]
- ...
- 0x2024: value of b[9]

Key observations

- address of b[a] cannot be computed statically by compiler
- address can be computed dynamically from base and index stored in registers
- element size can be known statically, from array type

Array access: use load/store indexed instruction

| Name          | Semantics                          | Assembly         | Machine |
|---------------|------------------------------------|------------------|---------|
| load indexed  | $r[d] \leftarrow m[r[s] + 4*r[i]]$ | ld (rs,ri,4), rd | 2sid    |
| store indexed | $m[r[d]+4*r[i]] \leftarrow r[s]$   | st rs, (rd,ri,4) | 4sdi    |

## Static vs Dynamic Arrays

- Same access, different declaration and allocation
  - for static arrays, the compiler allocates the whole array
  - for dynamic arrays, the compiler allocates a pointer



# **Dereferencing Registers**

### Common mistakes

- no dereference when you need it
- extra dereference when you don't need it
- example

```
ld $a_data, r0      # r0 = address of a
ld (r0), r1         # r1 = a
ld $b_data, r2      # r2 = address of b
ld (r2), r3         # r3 = b          (extra dereference)
st r1, (r3,r1,4)    # b[a] = a
```

- b dereferenced twice
  - once with offset load
  - once with indexed store
- no dereference: value in register
- one dereference: address in register
- two dereferences: address of pointer in register







convert address to binary; sweep from right to left, stop when find a 1

## **Basic ALU Operations**

## Arithmetic

| Name          | Semantics                      | Assembly   | Machine |
|---------------|--------------------------------|------------|---------|
| register move | $r[d] \leftarrow r[s]$         | mov rs, rd | 60sd    |
| add           | $r[d] \leftarrow r[d] + r[s]$  | add rs, rd | 61sd    |
| and           | $r[d] \leftarrow r[d] \& r[s]$ | and rs, rd | 62sd    |
| inc           | $r[d] \leftarrow r[d] + 1$     | inc rd     | 63-d    |
| inc address   | $r[d] \leftarrow r[d] + 4$     | inca rd    | 64-d    |
| dec           | $r[d] \leftarrow r[d] - 1$     | dec rd     | 65-d    |
| dec address   | $r[d] \leftarrow r[d] - 4$     | deca rd    | 66-d    |
| not           | $r[d] \leftarrow \sim r[d]$    | not rd     | 67-d    |

Shifting, NOP and Halt

| Name        | Semantics                        | Assembly  | Machine |
|-------------|----------------------------------|-----------|---------|
| shift left  | $r[d] \leftarrow r[d] << S = s$  | shl rd, s | 7dSS    |
| shift right | $r[d] \leftarrow r[d] >> S = -s$ | shr rd, s | 7dSS    |
| halt        | halt machine                     | halt      | f0      |
| nop         | do nothing                       | nop       | ff      |



```
Pointer Arithmetic in C
 Alternative to a[i] notation for dynamic array access

 a[x] equivalent to *(a+x)

 &a[x] equivalent to (a+x)
```

- Pointer arithmetic takes into account size of datatype
  - int a[4];
  - 0x2000: value of a[0]
  - 0x2004: value of a[1]
  - 0x2008: value of a[2]
  - 0x200c: value of a[3]
  - &a[1] = 0x2004; &a[2] = 0x2008
  - ((& a[2]) - (& a[1])) == 1 == (a+2) - (a+1)
- compiler treats pointer-to-int differently than int!
  - even though both can be stored with 32 bits on IA-32 machine

### Common mistake

- treating pointer arithmetic like direct calculations with addresses
  - off by a factor of 4 when doing pointer arithmetic with integers
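The difference between counting elements and counting bytes can be seen in a short C sketch. The helper names `element_distance` and `byte_distance` are hypothetical; both assume the two pointers refer to elements of the same array.

```c
#include <stddef.h>

/* Pointer subtraction on int* counts elements, not bytes. */
ptrdiff_t element_distance(const int *from, const int *to) {
    return to - from;
}

/* Casting to char* first makes the same subtraction count bytes. */
ptrdiff_t byte_distance(const int *from, const int *to) {
    return (const char *)to - (const char *)from;
}
```

With `int a[4]`, `element_distance(&a[1], &a[2])` is 1 while `byte_distance(&a[1], &a[2])` is `sizeof(int)`, which is 4 on the machines discussed here.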

## Memory Management in Java

- Garbage collection model
  - allocation with new
  - deallocation handled by Java system, not programmer
  - thus some kinds of programmer errors are impossible, including dangling pointers
- Advantages
  - much easier to program
- Disadvantages
  - some performance penalties
    - system knows less than programmer in best case
    - GC pass could occur at bad time (realtime/interactive situation)
  - programmers tempted to ignore memory management completely
    - GC is not perfect, memory leaks can still occur!

Polymorphic Dispatch

- Method address is determined dynamically
  - compiler can not hardcode target address in procedure call
  - instead, compiler generates code to lookup procedure address at runtime
  - address is stored in memory in the object's class jump table
- Class Jump table
  - every class is represented by class object
  - the class object stores the class's jump table
  - the jump table stores the address of every method implemented by the class
  - objects store a pointer to their class object
- Static and dynamic of method invocation
  - address of jump table is determined dynamically
  - method's offset into jump table is determined statically

Exam studying advice

- try writing simple test programs, use gdb and print to explore

Pointer Arithmetic Example Program

```
tmm% cat array2.c
#include <stdio.h>
int main (int argc, char** argv) {
  ...

tmm% gcc -g -o array2 array2.c
array2.c: In function 'main':
array2.c:6: warning: initialization makes integer from pointer without a cast
array2.c:7: warning: initialization makes integer from pointer without a cast

tmm% ./array2
k hex: bffff7d0, k dec: -1073743920, m hex: bffff7c4, m dec -1073743932, n: 12, o: 3

(gdb) p &a[4]
$1 = (int *) 0xbffff510
(gdb) p k
$2 = -1073744624
```

### Dynamic Jumps in C

Function pointer

- a variable that stores a pointer to a procedure
  - <return-type> (\*<variable-name>)(<formal-argument-list>);
- used to make dynamic call
  - <variable-name> (<actual-argument-list>);

Determining Endianness of a Computer

- how does this C code check for endianness?
  - create array of 4 bytes (char data type is 1 byte)
  - casting between arrays of bytes and integers
- things to understand:
  - concepts of endianness
  - masking bits, shifting bits

```
#include <stdio.h>

int main () {
  char a[4];
  *((int*)a) = 1;   /* write the integer 1 over the 4 bytes */
  printf("a[0]=%d a[1]=%d a[2]=%d a[3]=%d\n",a[0],a[1],a[2],a[3]);
}
```


# Memory Management in C

- Explicit allocation with malloc and deallocation with free
- Dangling pointer problem
  - pointer to object that has already been freed
  - happens when allocate and free happen in different parts of code
  - various strategies to avoid (reduce likelihood, but not a guaranteed cure)
    - use local variables (allocated on the stack) and pass in address of the local from caller, instead
    - coding conventions
    - explicit reference counting (heavyweight solution)
- Memory leak problem
  - allocated memory is not deallocated when no longer needed, so memory usage steadily grows (problem especially for long-running programs)
- Common mistake
  - not freeing any memory in order to avoid the dangling pointer problem
  - result is memory leak, leads to later problems even though no immediate crash
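Two of the avoidance strategies can be sketched in C. Everything here is illustrative: `struct point` and the `point_*` functions are hypothetical names, and clearing the pointer after `free` is only one possible coding convention, not one prescribed by the course.

```c
#include <stdlib.h>

struct point { int x; int y; };

/* Strategy: the caller owns the storage (for example a local on its stack)
   and passes its address, so nothing is allocated or freed in here. */
void point_init(struct point *p, int x, int y) {
    p->x = x;
    p->y = y;
}

/* Heap version: the caller becomes responsible for releasing the result.
   Returns NULL if malloc fails. */
struct point *point_create(int x, int y) {
    struct point *p = malloc(sizeof *p);
    if (p != NULL) {
        point_init(p, x, y);
    }
    return p;
}

/* Convention sketch: free the object and clear the caller's pointer, so this
   particular variable cannot be used as a dangling pointer afterwards.
   Other copies of the pointer are not protected. */
void point_destroy(struct point **pp) {
    free(*pp);
    *pp = NULL;
}
```

Pairing every `point_create` with exactly one `point_destroy` avoids the leak, and the stack-allocated variant avoids `malloc` and `free` altogether.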

### Indirect Jump: Base/Offset

- Key observation
  - base address stored in register (dynamic)
  - for polymorphism jump table, offset can be computed statically by compiler
- Function pointers: use indirect base/offset jump instruction

| Name           | Semantics                           | Assembly | Machine |
|----------------|-------------------------------------|----------|---------|
| indir jump b+o | $pc \leftarrow m[r[s] + (o==pp*2)]$ | j *o(rs) | dspp    |

### Switch Statement

```
int i;
int j;

void foo () {
  switch (i) {
    case 0:  j=10; break;
    case 1:  j=11; break;
    case 2:  j=12; break;
    case 3:  j=13; break;
    default: j=14; break;
  }
}

void bar () {
  if (i==0)      j = 10;
  else if (i==1) j = 11;
  else if (i==2) j = 12;
  else if (i==3) j = 13;
  else           j = 14;
}
```

- Semantics the same as simplified nested if statements
  - choosing one computation from a set
  - restricted syntax: static, cardinal values
- Potential benefit: more efficient computation (usually)
  - jump table to select correct case with single operation
  - if statement may have to execute each check
    - number of operations is number of cases (if unlucky)

**Switch Statement Strategy**

- Choose one of two strategies to implement
  - use jump table unless case labels are sparse or there are very few of them
  - use nested-if-statements otherwise
- Jump-table strategy
  - statically
    - build jump table for all label values between lowest and highest
  - generate code to
    - goto default if condition is less than minimum case label or greater than maximum
    - normalize condition to lowest case label
    - use jump table to go directly to code selected case arm

```
goto address of code_default if cond < min_label_value
goto address of code_default if cond > max_label_value
goto jumptable[cond-min_label_value]

statically: jumptable[i-min_label_value] = address of code_i
            forall i: min_label_value <= i <= max_label_value
```
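The jump-table strategy can be approximated in portable C with an array of function pointers. This is only an analogy: a compiler jumps to code addresses inside one procedure rather than calling separate functions, and all names below (`switch_via_jumptable`, `arm_fn`, the `case*` functions) are hypothetical. The label values 20 to 23 mirror the switch snippet in these notes.

```c
enum { MIN_LABEL = 20, MAX_LABEL = 23 };

typedef int (*arm_fn)(void);

/* One function per case arm; each returns the value assigned to j. */
static int case20(void) { return 10; }
static int case21(void) { return 11; }
static int case22(void) { return 12; }
static int case23(void) { return 13; }
static int case_default(void) { return 14; }

/* Built statically: one entry for every label between lowest and highest. */
static const arm_fn jumptable[MAX_LABEL - MIN_LABEL + 1] = {
    case20, case21, case22, case23
};

int switch_via_jumptable(int cond) {
    /* Range check first: anything outside the table goes to default. */
    if (cond < MIN_LABEL || cond > MAX_LABEL) {
        return case_default();
    }
    /* Normalize to the lowest label, then one indexed indirect call. */
    return jumptable[cond - MIN_LABEL]();
}
```

The cost of selecting an arm is the same for every in-range value, which is the benefit the strategy is after.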

# **Switch Snippet**

```
switch (i) {
  case 20: j=10; break;
  case 21: j=11; break;
  case 22: j=12; break;
  case 23: j=13; break;
  default: j=14; break;
}
```

```
foo:      ld   $i, r0            # r0 = &i
          ld   0x0(r0), r0       # r0 = i
          ld   $0xffffffed, r1   # r1 = -19
          add  r0, r1            # r1 = i-19
          bgt  r1, l0            # goto l0 if i>19
          br   default           # goto default if i<20
l0:       ld   $0xffffffe9, r1   # r1 = -23
          add  r0, r1            # r1 = i-23
          bgt  r1, default       # goto default if i>23
          ld   $0xffffffec, r1   # r1 = -20
          add  r1, r0            # r0 = i-20
          ld   $jmptable, r1     # r1 = &jmptable
          j    *(r1, r0, 4)      # goto jmptable[i-20]
case20:   ld   $0xa, r1          # r1 = 10
          br   done              # goto done
...
default:  ld   $0xe, r1          # r1 = 14
          br   done              # goto done
jmptable: .long 0x00000140       # & (case 20)
          .long 0x00000148       # & (case 21)
          .long 0x00000150       # & (case 22)
          .long 0x00000158       # & (case 23)
```

## **Dynamic Control Flow Summary**

- Static vs dynamic flow control
  - static if jump target is known by compiler
  - dynamic for polymorphic dispatch, function pointers, and switch statements
- Polymorphic dispatch in Java
  - invoking a method on an object in Java
  - method address depends on object's type, which is not known statically
  - object has pointer to class object; class object contains method jump table
  - procedure call is an indirect jump, i.e., target address in memory

### Function pointers in C

- a variable that stores the address of a procedure
- used to implement dynamic procedure call, similar to polymorphic dispatch
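A short C sketch shows both uses: a plain function pointer making a dynamic call, and a struct-based imitation of polymorphic dispatch. All names (`binary_op`, `apply`, `struct shape`, `shape_area`, and so on) are hypothetical, and the class-object layout is a simplification of what a Java runtime does.

```c
/* Function pointer type: <return-type> (*<name>)(<formal-argument-list>). */
typedef int (*binary_op)(int, int);

int add(int x, int y) { return x + y; }
int mul(int x, int y) { return x * y; }

/* Dynamic call: the target is whatever address op holds at run time. */
int apply(binary_op op, int x, int y) {
    return op(x, y);
}

/* Imitation of polymorphic dispatch: a "class object" holds a jump table
   of method addresses, and every object stores a pointer to its class. */
struct shape;
struct shape_class {
    int (*area)(const struct shape *self);
};
struct shape {
    const struct shape_class *cls;
    int w, h;
};

int rect_area(const struct shape *s) { return s->w * s->h; }
int tri_area(const struct shape *s)  { return s->w * s->h / 2; }

const struct shape_class rect_class = { rect_area };
const struct shape_class tri_class  = { tri_area };

/* The table is found dynamically through the object; the slot within the
   table (here the only one, area) is fixed statically. */
int shape_area(const struct shape *s) {
    return s->cls->area(s);
}
```

`shape_area` compiles to the same code for every shape; which procedure runs depends only on the class pointer stored in the object.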

### Switch statements

- syntax restricted so that they can be implemented with jump table
- jump-table implementation running time is independent of the number of case labels
- but, only works if case label values are reasonably dense

### Indirect Jump: Indexed

- Key observation
  - base address stored in register (dynamic)
  - for switch jump table, have index stored in register
- Switch: use indirect jump indexed instruction

| Name               | Semantics                        | Assembly     | Machine |
|--------------------|----------------------------------|--------------|---------|
| indir jump indexed | $pc \leftarrow m[r[s] + r[i]*4]$ | j *(rs,ri,4) | esi-    |

## Static and Dynamic Jumps

- Jump instructions
  - specify a target address and a jump-taken condition
  - target address can be static or dynamic
  - jump-taken condition can be static (unconditional) or dynamic (conditional)
- Static jumps
  - jump target address is static
  - compiler hard-codes this address into instruction

| Name              | Semantics                                 |
|-------------------|-------------------------------------------|
| branch            | $pc \leftarrow (a==pc+oo*2)$              |
| branch if equal   | $pc \leftarrow (a==pc+oo*2)$ if $r[c]==0$ |
| branch if greater | $pc \leftarrow (a==pc+oo*2)$ if $r[c]>0$  |
| jump              | $pc \leftarrow a$ (a specified as label)  |

- Dynamic jumps
  - jump target address is dynamic
## **Dynamic Jumps**

- Jump base+offset
  - Jump target address stored in a register
  - We already introduced this instruction, but used it for static procedure calls

| Name          | Semantics                         | Assembly | Machine |
|---------------|-----------------------------------|----------|---------|
| indirect jump | $pc \leftarrow r[s] + (o==pp*2)$  | j o(rs)  | cspp    |

### Indirect jumps

- Jump target address stored in memory
- Base-plus-offset (function pointers) and indexed (switch) modes for memory access

| Name               | Semantics                           | Assembly     | Machine |
|--------------------|-------------------------------------|--------------|---------|
| indir jump b+o     | $pc \leftarrow m[r[s] + (o==pp*2)]$ | j *o(rs)     | dspp    |
| indir jump indexed | $pc \leftarrow m[r[s] + r[i]*4]$    | j *(rs,ri,4) | esi-    |

# Big Ideas: Second Half

- Memory hierarchy
  - progression from small/fast to large/slow
    - registers (same speed as ALU instruction execution, roughly: 1 ns clock tick)
    - memory (over 100x slower: 100 ns)
    - disk (over 1,000,000x slower: 10 millisec)
    - network (even worse: 200+ millisec RT to other side of world just from speed of light in fiber)
  - implications
    - don't make ALU wait for memory
      - ALU input only from registers, not memory
    - don't make CPU wait for disk
      - interrupts, threads, asynchrony
- Clean abstraction for programmer
  - ignore asynchronous reality via threads and virtual memory (mostly)
  - explicit synchronization as needed



The Processors

- I/O devices have small processors: I/O controllers
  - processing power available outside CPU
- (figure: CPU, Memory, and I/O controllers connected by the I/O Bus)

I/O-Mapped Memory

- use familiar syntax for load/store for both memory and I/O
- memory addresses beyond the end of main memory handled by I/O controllers
- loads and stores are translated into I/O-bus messages to controller

PIO vs DMA: Phone Call Analogy

Programmed IO (PIO)

- CPU requests one word at a time and waits for I/O controller
- Example: to read/write to controller at address 0x80000000 (with that address in r0)

```
st r1, (r0)    # write the value of r1 to the device
ld (r0), r1    # read a word from device into r1
```

- CPU must wait until data is available
  - but I/O devices may be much slower than CPU (disks millions of times slower)
- large transfers slow since must be done one word at a time
- no way for I/O controller to initiate communication
  - CPU must check back with I/O controller (for instance by polling)
  - poll too seldom means high latency
  - for some devices CPU has no idea when to poll (network traffic, mouse click)

**Asynchronous Disk Reading**

- Cannot depend on synchronized execution where result is available before next statement executed
  - synchronous: read (buf, siz, blkNo); nowHaveBlock (buf, siz);
- Handling disk reads asynchronously
  - asyncRead (buf, siz, blkNo, nowHaveBlock);
  - each request has completion routine that should run after interrupt
  - need queue so can handle multiple pending requests
- either programmers must use explicitly asynchronous programming model
  - decoupled event triggering and handling as with event-driven GUI programming
  - imagine if not just on mouse clicks, but for every memory access
- or system can provide abstractions to hide asynchrony from programmers

## **CPU Interrupts**

- controller can signal the CPU by setting special-purpose registers
  - isDeviceInterrupting: set by I/O Controller to signal interrupt
  - interruptControllerID: set by I/O Controller to identify interrupting device
- CPU checks for interrupts on every fetch-execute cycle
  - polling, but very low overhead of register access: does not slow down computation
- CPU jumps to controller's Interrupt Service Routine to service interrupt
  - interruptVectorBase: interrupt-handler jump table, initialized at boot time

```
if (isDeviceInterrupting) {
  m[r[5]-4] ← r[6];
  ...
  pc ← interruptVectorBase [interruptControllerID];
}
fetch ();
execute ();
```

## **Threads**

### Abstraction for execution

- programmer's view
  - statements are executed one after another, appearance of sequential flow
- system reality
  - threads may be blocked (stopped)
  - often thread is not running because CPU is running a different thread
  - blocked threads can be restarted
- Using threads
  - create: starts new thread, immediately adds it to queue of threads waiting to run
  - join: blocks calling thread until target thread completes
- common mistakes
  - assume that order of joining is order of execution
  - assume that order of creating is order of execution
  - thread joins runnable queue with create call, not with join call
  - scheduler may choose what to run next in any order

Challenges of asynchrony

- threads, processes, virtual memory

## Thread Private Data

- TCB must have pointer to stack
  - otherwise no way to find thread's data
- Stack must have pointer to TCB
  - otherwise no way to add currently running thread to ready queue, which stores TCBs not stacks
  - top of stack points to TCB
- Common mistake
  - forgetting that stack must point back to TCB
- (figure: Ready Queue of Thread Control Blocks, with TCBa RUNNING and TCBb, TCBc RUNNABLE, each pointing to its own stack)

## **Thread Scheduling Policies**

- Priority
- choose highest priority runnable thread to run
- Round-Robin
- equal-priority threads get fair share of processor, in round-robin fashion
- Preemptive
- priority-based
  - lower priority thread preempted as soon as higher priority becomes runnable
- quantum-based (time slices)
- thread preempted when its time quantum expires
- timer device: I/O controller connected to clock, sends interrupts to CPU at regular intervals
- Can be combined

## **Direct Memory Access (DMA)**



- I/O controller transfers data to/from main memory independently of CPU
  - process initiated by CPU using PIO
    - send request to controller with addresses and sizes
  - data transferred to memory without CPU involvement
  - controller signals CPU with interrupt when transfer complete
- can transfer large amounts of data with one request
  - not limited to one word at a time

Thread Status DFA

- (figure: states Nascent, Runnable, Running, Blocked, Dead; transitions labelled Create, Schedule, Yield, Unblock, Join or Detach)

PIO/DMA/Interrupt combination: sequence of phone calls

- PIO: CPU calls controller to make request, then hangs up
- DMA: controller calls memory to deliver data
- Interrupt: controller calls CPU to inform that data is ready
  - leaves voicemail that CPU picks up on the next fetch/execute cycle

PIO: only CPU can make a phone call

- must stay on the line a looooong time waiting for controller to finish

## **Implementing Threads**

- Each thread has own copy of stack
- Thread-Control Block (TCB)
  - thread status: (NASCENT, RUNNING, RUNNABLE, BLOCKED, or DEAD)
  - pointers to base of thread's stack and top of thread's stack
  - scheduling parameters such as priority, quantum, pre-emptability, etc.
- ready: list of TCBs of all RUNNABLE threads
- blocked: list of TCBs of BLOCKED threads
- Thread switch (stops Ta and starts Tb)
  - save all registers to stack
  - save stack pointer to Ta's TCB
  - set stack pointer to stack pointer in Tb's TCB
  - restore registers from stack

## **Mutual Exclusion**

- Use mutual exclusion to guard critical sections where data shared between multiple threads is accessed
  - avoid race conditions where conflicting operations on shared data are interleaved arbitrarily, leading to nondeterministic behavior
  - example: stack corruption when push and pop interleaved without being guarded

### Mutual exclusion with locks

- spinlock
  - thread busy-waits until lock acquired
  - use when locks only needed for short time
- blocking locks
  - thread blocks if lock not available
  - thread returned to runnable state when lock becomes available
  - use when locks may be held for long periods

### **Mutual Exclusion Using Locks**

- lock semantics
  - a lock is either held by a thread or available
  - at most one thread can hold a lock at a time
  - a thread attempting to acquire a lock that is already held is forced to wait
- lock primitives
  - lock: acquire lock, wait if necessary
  - unlock: release lock, allowing another thread to acquire if waiting
- using locks for the shared stack

```
void push_cs (struct SE* e) {
  lock (&aLock);
    push_st (e);
  unlock (&aLock);
}

struct SE* pop_cs () {
  struct SE* e;
  lock (&aLock);
    e = pop_st ();
  unlock (&aLock);
  return e;
}
```

## Spinlocks Require Atomic Read/Write

- Impossible when read and write are separate operations
- Need atomic read and write that is single indivisible unit
  - with no intervening access to that memory location from any other thread allowed
- Atomic Memory Exchange
  - one type of atomic memory instruction (there are other types)
  - group a load and store together atomically
  - exchanging the value of a register and a memory location
  - much higher overhead than standard load or store

| Name            | Semantics                                                  | Assembly      |
|-----------------|------------------------------------------------------------|---------------|
| atomic exchange | $r[v] \leftarrow m[r[a]]$; $m[r[a]] \leftarrow r[v]$ | xchg (ra), rv |

### Implementing Spinlocks

- Spin first on fast normal read, then try slow atomic exchange
  - use normal read in loop until lock appears free
  - when lock appears free use exchange to try to grab it
  - if exchange fails then go back to normal read
- Common mistake
  - assume that atomic exchange always succeeds; could fail!
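The read-then-exchange loop can be sketched with C11 atomics in place of the SM-213 exchange instruction. This is an assumption-laden illustration: the type `demo_spinlock` and its functions are hypothetical names, and the memory orderings are a conventional choice for a lock rather than anything stated in these notes.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int held;   /* 0 = available, 1 = held */
} demo_spinlock;

void demo_spinlock_init(demo_spinlock *l) {
    atomic_init(&l->held, 0);
}

/* One attempt with the atomic exchange: write 1 and look at the old value.
   Old value 0 means we took the lock; 1 means someone else already held it. */
int demo_spinlock_trylock(demo_spinlock *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

void demo_spinlock_lock(demo_spinlock *l) {
    for (;;) {
        /* Spin on the cheap normal read until the lock appears free. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) {
            /* busy-wait */
        }
        /* Lock appears free: try to grab it with the expensive exchange. */
        if (demo_spinlock_trylock(l)) {
            return;
        }
        /* Exchange failed (another thread won the race): back to reading. */
    }
}

void demo_spinlock_unlock(demo_spinlock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Note that a failed exchange leaves the value at 1, so retrying is harmless; the loop simply returns to the cheap read.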

### **Busywaiting vs Blocking**

- Busywait Locks
  - using spinlocks to busywait for a long time wastes CPU cycles
  - use for short things
    - including within implementation of blocking locks
- Blocking Locks
  - if a thread may wait a long time
    - it should block so that other threads can run
    - it will then unblock when it becomes runnable (lock available or event notification)
  - using blocking locks has high overhead
  - use for long things
- Common mistake
  - assume that CPU is busywaiting during blocking lock
  - thread does not run again until after blocking lock is released

### **Blocking Locks**

- Blocking locks for mutual exclusion
  - if lock is held, locker puts itself on waiter queue and blocks
  - when lock is unlocked, unlocker restarts one thread on waiter queue
- Blocking locks for event notification (condition variables)
  - waiting thread puts itself on a waiter queue and blocks
  - notifying thread restarts one thread on waiter queue (or perhaps all)
- Implementing blocking locks using spinlocks
  - blocking lock data structure includes a waiter queue and a few other things
  - data structure is shared by multiple threads; lock operations are critical sections
  - thus we use spinlocks to guard these sections in blocking lock implementation

### **Implementing a Blocking Lock**

```
struct blocking_lock {
  spinlock_t      spinlock;
  int             held;
  uthread_queue_t waiter_queue;
};

void lock (struct blocking_lock* l) {
  spinlock_lock (&l->spinlock);
  while (l->held) {
    enqueue (&l->waiter_queue, uthread_self ());
    spinlock_unlock (&l->spinlock);
    uthread_switch (ready_queue_dequeue (), TS_BLOCKED);
    spinlock_lock (&l->spinlock);
  }
  l->held = 1;
  spinlock_unlock (&l->spinlock);
}

void unlock (struct blocking_lock* l) {
  uthread_t* waiter_thread;
  spinlock_lock (&l->spinlock);
  l->held = 0;
  waiter_thread = dequeue (&l->waiter_queue);
  spinlock_unlock (&l->spinlock);
  waiter_thread->state = TS_RUNNABLE;
  ready_queue_enqueue (waiter_thread);
}
```

- Spinlock guard
  - on for critical sections
  - off before thread blocks

### **Blocking Lock Example Scenario**

- (figure: two-column timeline for Thread A and Thread B; the step order is not recoverable, but the steps shown include: calls lock(), grabs spinlock, grabs blocking lock, releases spinlock, does work, tries to grab spinlock but spins, queues itself on waiter list, blocks, calls unlock(), restarts Thread B, returns from unlock())

## **Locks and Loops Common Mistakes**

- Confusion about spinlocks inside blocking locks
  - use spinlocks in the implementation of blocking locks
  - two separate levels of lock!
    - holding spinlock guarding variable read/write
    - holding actual blocking lock
- Confusion about when spinlocks needed
  - must turn on to guard access to shared variables
  - must turn off before finishing or blocking
- Confusion about loop function
  - busywait
    - only inside spinlock
  - thread blocked inside loop body, not busywaiting
    - yield for blocking lock
    - blocking wait for CV, blocking wait for semaphore P implementation

### **Condition Variables**

- Mechanism to transfer control back and forth between threads
  - uses monitors: CV can only be accessed when monitor lock is held
- Primitives
  - wait: blocks until a subsequent notify operation on the variable
  - notify: unblocks one waiter, continues to hold monitor
  - notify all: unblocks all waiters (broadcast), continues to hold monitor
- Each CV associated with a monitor
- Multiple CVs can be associated with same monitor
  - independent conditions, but guarded by same mutex lock

```
uthread_monitor_t* beer      = uthread_monitor_create ();
uthread_cv_t*      not_empty = uthread_cv_create (beer);
uthread_cv_t*      ...       = uthread_cv_create (beer);
```

## **Synchronization Abstractions**

- Monitors and condition variables
  - monitor provides blocking locks, guarantees mutual exclusion
  - condition variable provides blocking notify
    - control transfer among threads with wait/notify
  - abstraction supports explicit locking
- Semaphores
  - blocking atomic counter, stop thread if counter would go negative
  - introduced to coordinate asynchronous resource use
  - abstraction implicitly supports mutex, no need for explicit locking by user
  - could use to implement monitors, barriers (and CVs, sort of)

Wait and Notify Semantics

- Monitor automatically exited before block on wait
  - before waiter blocks, it exits monitor to allow other threads to enter
- Monitor stays locked after notify: does not block
- Monitor automatically re-entered before return from wait
  - when trying to return from wait after notify, thread may block again until monitor can be entered (if monitor lock held by another thread)
  - same idea as blocking lock implementation with spinlocks!
- Implication: cannot assume desired condition holds after return from wait
  - other threads may have been in monitor between wait call and return
  - must explicitly re-check; usually enclose wait in while loop with condition check

```
for (int i=0; i<n; i++) {
  glasses++;
  ...
```
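The re-check rule can be illustrated with POSIX threads standing in for the course's uthread monitors. This is a sketch under that substitution: a pthread mutex plays the role of the monitor and a pthread condition variable plays the CV. The names `beer`, `not_empty`, and `glasses` echo the fragments in these notes, while `refill`, `pour`, and `glasses_available` are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t beer      = PTHREAD_MUTEX_INITIALIZER;  /* the monitor */
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;   /* the CV */
static int glasses = 0;                                        /* shared state */

/* Producer: add n full glasses, notifying one waiter per glass.
   The mutex stays held across each notify. */
void refill(int n) {
    pthread_mutex_lock(&beer);
    for (int i = 0; i < n; i++) {
        glasses++;
        pthread_cond_signal(&not_empty);
    }
    pthread_mutex_unlock(&beer);
}

/* Consumer: take one glass, waiting while there are none.
   The while loop re-checks the condition after every return from wait,
   because another thread may have emptied the supply in between. */
void pour(void) {
    pthread_mutex_lock(&beer);
    while (glasses == 0) {
        pthread_cond_wait(&not_empty, &beer);  /* releases beer while blocked */
    }
    glasses--;
    pthread_mutex_unlock(&beer);
}

/* Read the counter while holding the mutex. */
int glasses_available(void) {
    pthread_mutex_lock(&beer);
    int n = glasses;
    pthread_mutex_unlock(&beer);
    return n;
}
```

Replacing the `while` with an `if` would compile and usually appear to work, which is what makes the mistake easy to miss.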

### Common mistake:

- confusing three things
  - how to use, how to implement, how one abstraction might be used to implement the other

### Condition Variables

### Common mistakes:

CVs do not have internal storage variables (boolean flags or int counters)

### Spin/Block, Lock/Notify: 3-Year-Old Analogy

- Common mistake: confusing spin and block
  - spin: actively use CPU resources while waiting
  - block: do not use any CPU resources while waiting, use scheduler blocking mechanism
- Common mistake: confusing lock and notify
  - lock: resource only available for single user at once
  - notify: event has occurred
- checking the lock: try washroom door handle to see if it opens
  - spinlock: keep rattling the door handle and knocking until the door opens
  - blocking lock: knock once, step away from the door to wait quietly, walk towards door after it opens (and somebody else might beat you there, so do check door again!)
- checking for notification: asking 'are we there yet' on a car trip
  - spinnotify: keep asking 'are we there yet' every 30 seconds, for 1000km
  - blocking notify: after first question, driver says 'no, go to sleep, I'll wake you up when we get there'

- CVs are variables: named so can tell them apart from each other
  - wait/notify tired vs. wait/notify hungry
- users of CVs do not have to explicitly block
  - wait/notify done within implementation of CVs
- users of CVs do have to hold monitor in order to access CV values

## **Monitors**

- Provides mutual exclusion with blocking lock
- enter locks the monitor, exit unlocks it

 void doSomething (uthread\_monitor\_t\* mon) {
  uthread\_monitor\_enter (mon);
  touchSharedMemory();
  uthread\_monitor\_exit (mon);
 }

- Standard case: assume all threads could overwrite shared memory
  - mutex: only allows access one at a time
- Special case: distinguish read-only access (readers) from threads that change shared memory values (writers)
  - mutex: allow multiple readers but only one writer

# **Semaphores**

- Atomic counter that can never be less than 0
  - attempting to make counter negative blocks calling thread
- P(s): acquire
  - try to decrement s
  - if s would be negative, atomically blocks until s positive, then decrements s
- V(s): release
  - increment s
  - atomically unblock any threads waiting in P
- Explicit locking not required when using semaphores since atomicity built in
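One way to picture P and V is a counter protected by a lock, with a queue of blocked waiters. The sketch below is an assumption-laden illustration, not the uthread implementation: it uses a POSIX mutex and condition variable, and the names `sem_sketch_t`, `sem_sketch_init`, `sem_P`, `sem_V` and `sem_value` are made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical counting semaphore built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t lock;    /* makes each operation atomic */
    pthread_cond_t  nonzero; /* waiters blocked in P sleep here */
    int count;               /* never allowed to drop below 0 */
} sem_sketch_t;

void sem_sketch_init(sem_sketch_t *s, int initial) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
    s->count = initial;
}

/* P: block while the counter is 0, then decrement. */
void sem_P(sem_sketch_t *s) {
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

/* V: increment and wake one thread waiting in P, if any. */
void sem_V(sem_sketch_t *s) {
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}

/* Inspection helper for the example only. */
int sem_value(sem_sketch_t *s) {
    pthread_mutex_lock(&s->lock);
    int v = s->count;
    pthread_mutex_unlock(&s->lock);
    return v;
}
```

Callers of `sem_P` and `sem_V` never touch the mutex themselves, which is the sense in which explicit locking is not required.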

uthread\_semaphore\_t\* glasses = uthread\_create\_semaphore (0);

void pour () {
 uthread\_P (glasses);
}

void refill (int n) {
 for (int i=0; i<n; i++)
  uthread\_V (glasses);
}

Virtual Address Translation

## Semaphores

- Using semaphores: good building block for implementing many other things
- monitors
- condition variables (almost)
- rendezvous: two threads wait for each other before continuing
- barriers: all threads must arrive at barrier before any can continue
- Implementing semaphores: similar spirit to blocking locks

struct blocking\_lock { uthread\_queue\_t waiter\_queue; }

struct uthread\_semaphore { uthread\_queue\_t waiter\_queue; }

(really should be boolean...)

### Deadlock and Starvation

void pour () {

while (glasses==0)

glasses--;

return from blocking wait

- Solved problem: race conditions
  - solved by synchronization abstractions: locks, monitors, semaphores
- Unsolved problems when using multiple locks
  - deadlock: nothing completes because multiple competing actions wait for each other
  - starvation: some actions never complete
  - no abstraction simply solves the problem; it is a major intrinsic concern
  - some ways to handle/avoid:
    - precedence hierarchy of locks
    - detect and destroy: notice deadlock and terminate threads
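A precedence hierarchy of locks can be illustrated with a small C sketch. It assumes POSIX mutexes rather than uthread monitors, and it picks memory address as the ordering rule, which is one common choice and not something the slides prescribe. The `account_t` type and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared object with its own lock. */
typedef struct {
    pthread_mutex_t lock;
    int balance;
} account_t;

void account_init(account_t *a, int balance) {
    pthread_mutex_init(&a->lock, NULL);
    a->balance = balance;
}

/* Needs both locks. Every caller acquires them in the same global order
   (lower address first), so two opposite transfers cannot each hold one
   lock while waiting for the other. */
void transfer(account_t *from, account_t *to, int amount) {
    if (from == to)
        return; /* same object: nothing to move, and avoids locking twice */
    account_t *first  = ((uintptr_t)from < (uintptr_t)to) ? from : to;
    account_t *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```

Without the ordering, `transfer(&a, &b, x)` and `transfer(&b, &a, y)` running concurrently could each take their first lock and then wait forever for the second.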

# Virtual Memory

- Virtual Address Space
- an abstraction of the physical address space of main (i.e., physical) memory
- programs access memory using virtual addresses
- memory management unit (MMU) translates virtual addresses to physical memory addresses
- MMU hardware performs translation on every memory access by program
- Process
- a program execution with a private virtual address space
- may have one or many threads
- private address space required for static address allocation and isolation

each program uses the same virtual address, but they map to different physical addresses



### **Address Space Translation Tradeoffs**

- Single, variable-size, non-expandable segment
  - internal fragmentation of segment due to sparse address use
- Multiple, variable-size, non-expandable segments
  - internal fragmentation of segments when size isn't known statically
  - external fragmentation of memory because segments are variable size
  - moving segments would resolve fragmentation, but moving is costly
- Expandable segments
  - expansion must be physically contiguous, but there may not be room
  - external fragmentation of memory requires moving segments to make room
- Multiple, fixed-size, non-expandable segments
  - called pages
  - need to be small to avoid internal fragmentation, so there are many of them
  - since there are many, need indexed lookup instead of search

## **Demand Paging**

- some application data is not in memory
- transfer from disk to memory, only when needed
- Page Table
- only stores entries for pages that are in memory
- pages that are only on disk are marked invalid
- access to non-resident page causes a page-fault interrupt

### Page Fault

- is an exception raised by the CPU
- when a virtual address is invalid
- an exception is just like an interrupt, but generated by CPU not IO device
- page fault handler runs each time a page fault occurs

### Memory Map

- a second data structure managed by the OS
- divides virtual address space into regions, each mapped to a file
- page-fault interrupt handler checks to see if faulted page is mapped
- if so, gets page from disk, updates Page Table and restarts faulted instruction

### Summary: Second Half

- Single System Image
  - hardware implements a set of instructions needed by compilers
  - compilers translate programs into these instructions
  - translation assumes private memory and processor
- Threads
  - an abstraction implemented by software to manage asynchrony
  - provides the illusion of single processor to applications
  - differs from processor in that it can be stopped and restarted
- Virtual Memory
  - an abstraction implemented by software and hardware
  - provides the illusion of a single, private memory to application
  - not all data need be in memory, paged in on demand

### **Paging**

### Key idea

- Virtual address space is divided into set of fixed-size segments called pages
- number pages in virtual address order
- virtual page number = virtual address / page size

### Page table

- indexed by virtual page number (vpn)
- stores base physical address (actually pfn = address / page size, to save space)
- stores valid flag



### **Demand Paging**

- Virtual vs Physical Memory Size
- VM can be even larger than available PM with demand paging!

### Page Replacement

- pages can now be removed from memory, transparent to program
- a replacement algorithm chooses which pages should be resident and swaps out the others
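Demand paging with replacement can be simulated in a few lines of C. This is only a toy model under stated assumptions: the slides do not name a replacement algorithm, so the sketch uses FIFO; there is no real disk, so "loading" a page just updates the table; and `Pager`, `pager_init`, `pager_access`, `NUM_PAGES` and `NUM_FRAMES` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PAGES  8 /* size of the virtual address space, in pages */
#define NUM_FRAMES 2 /* physical memory is smaller than virtual memory */

typedef struct {
    bool isValid; /* true only while the page is resident */
    int  pfn;
} PageTableEntry;

typedef struct {
    PageTableEntry pte[NUM_PAGES];
    int frame_owner[NUM_FRAMES]; /* vpn held by each frame, -1 if free */
    int next_victim;             /* FIFO pointer: next frame to reuse */
    int faults;                  /* number of page faults so far */
} Pager;

void pager_init(Pager *p) {
    for (int i = 0; i < NUM_PAGES; i++)
        p->pte[i].isValid = false;
    for (int f = 0; f < NUM_FRAMES; f++)
        p->frame_owner[f] = -1;
    p->next_victim = 0;
    p->faults = 0;
}

/* Touch a page; on a fault, evict the oldest resident page and map vpn in its frame. */
int pager_access(Pager *p, int vpn) {
    if (!p->pte[vpn].isValid) {
        p->faults++;
        int frame = p->next_victim;
        p->next_victim = (frame + 1) % NUM_FRAMES;
        int old = p->frame_owner[frame];
        if (old >= 0)
            p->pte[old].isValid = false; /* swapped out: later access faults */
        p->frame_owner[frame] = vpn;
        p->pte[vpn].pfn = frame;
        p->pte[vpn].isValid = true;
    }
    return p->pte[vpn].pfn;
}
```

The program touching the pages never sees the eviction; it only observes that some accesses take a fault first.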



### Translation: Search vs. Lookup Table

Translate by searching through all segments: too slow!

 int translate (int va) {
  for (int i=0; i<segments.length; i++) {
   int offset = va - segments[i].baseVA;
   if (offset >= 0 && offset < segments[i].bounds) {
    int pa = segments[i].basePA + offset;
    return pa;
   }
  }
  throw new IllegalAddressException (va);
 }

Translate with indexed lookup: Page Table

 class AddressSpace {
  PageTableEntry pte[];
  int translate (int va) {
   int vpn = va / PAGE\_SIZE;
   int offset = va % PAGE\_SIZE;
   if (pte[vpn].isValid)
    return pte[vpn].pfn \* PAGE\_SIZE + offset;
   throw new IllegalAddressException (va);
  }
 }

### A context switch is

- switching between threads from different processes

- each process has private virtual address space and thus its own page table

### Implementing a context switch

- change PTBR to point to new process's page table
- thread switch (save regs, switch stacks, restore regs)

### Context switch vs thread switch

- changing page tables can be considerably slower than just changing threads
  - mainly because of caching techniques used to make translation fast
  - many pages may need reloading from disk because of demand paging

## Address Translation

- The bit-shifty version
  - assume that page size is 4 KB = 4096 bytes = 2<sup>12</sup> bytes
  - assume addresses are 32 bits
  - then, vpn and pfn are 20 bits and offset is 12 bits
  - pte is pfn plus valid bit, so 21 bits or so, say 4 bytes

 int translate (int va) {
  int vpn = va >>> 12;
  int offset = va & 0xfff;
  if (pte[vpn].isValid)
   return pte[vpn].pfn << 12 | offset;
  throw new IllegalAddressException (va);
 }
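For reference, page-table translation with 4 KB pages and 32-bit addresses can be sketched in C as below. This is an illustrative version, not the course's simulator code: it returns `false` where the Java-style slide code throws `IllegalAddressException`, it takes the table and its length as parameters, and the names `PAGE_SHIFT` and `OFFSET_MASK` are introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                /* 4 KB pages: 2^12 bytes */
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1u)   /* low 12 bits, 0xfff */

typedef struct {
    bool     isValid; /* false if the page has no physical mapping */
    uint32_t pfn;     /* physical frame number (20 bits used) */
} PageTableEntry;

/* Translate va using a page table of num_pages entries.
   On success stores the physical address in *pa and returns true. */
bool translate(const PageTableEntry *pte, uint32_t num_pages,
               uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT; /* same as va / PAGE_SIZE */
    uint32_t offset = va & OFFSET_MASK; /* same as va % PAGE_SIZE */
    if (vpn >= num_pages || !pte[vpn].isValid)
        return false;                   /* would raise a page fault */
    *pa = (pte[vpn].pfn << PAGE_SHIFT) | offset;
    return true;
}
```

For example, with virtual page 1 mapped to frame 7, address `0x1abc` has vpn 1 and offset `0xabc`, so it translates to `0x7abc`.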

# **Paging Summary**

 class PageTableEntry {
  boolean isValid;
  int pfn;
 }

- a way to implement address space translation
- divide virtual address space into small, fixed-size virtual page frames
- page table stores base physical address of every virtual page frame
- page table is indexed by virtual page frame number
- some virtual page frames have no physical page mapping
- some of these get data on demand from disk