OS Note - File-System Implementation

As we saw in Chapter 13, the file system provides the mechanism for on- line storage and access to file contents, including data and programs. File systems usually reside permanently on secondary storage, which is designed to hold a large amount of data. This chapter is primarily concerned with issues surrounding file storage and access on the most common secondary-storage media, hard disk drives and nonvolatile memory devices. We explore ways to structure file use, to allocate storage space, to recover freed space, to track the locations of data, and to interface other parts of the operating system to secondary storage. Performance issues are considered throughout the chapter.

A given general-purpose operating system provides several file systems. Additionally, many operating systems allow administrators or users to add file systems. Why so many? File systems vary in many respects, including features, performance, reliability, and design goals, and different file systems may serve different purposes. For example, a temporary file system is used for fast storage and retrieval of nonpersistent files, while the default secondary storage file system (such as Linux ext4) sacrifices performance for reliability and features. As we’ve seen throughout this study of operating systems, there are plenty of choices and variations, making thorough coverage a challenge. In this chapter, we concentrate on the common denominators.

1. File-System Structure

File system 主要有兩個需要考量的因素：file system 呈現給 user 的樣貌是什麼？這個部分的設計包含了檔案有哪些特性、可以對檔案做哪些 operations、以及 directory structure；另一個則是要如何設計相應的資料結構與演算法將 logical file system 對應到 second storage 中
File system 通常會使用分層設計的方法，每一層都使用上一層提供的 feature 並提供 feature 給下一層使用：
- I/O control layer - 包含了 device driver 與 interrupt handler，負責將 I/O 指令轉換成 device 看得懂的 bit pattern 並進行 I/O
- basic file system (or block I/O subsystem) - 主要工作是將要讀寫的目標 logical block address 傳給 I/O control layer，另外也負責 memory buffer 的管理
- file-organization module 會知道檔案對應到的 logical block，這層也會包含 free-space manager，會去追蹤 unallocated blocks 並在 file-organization module 需要時提供
- logical file system - 存有所有的 metadata information，也存有 directory structure 可以將對應資訊傳給 file-organization module。通常會使用 file control block (FCB) (也就是 UNIX 中的 inode) 來存放 metadata
分層的 file system 設計與其他的 layered implementation 一樣，有方便更新的優點，並且不同的 file system 可以共用同一個 layer 的 code；缺點一樣是多層的傳遞會造成 overhead
一些常見的 file system：
- UNIX - UNIX file system (UFS)
- Windows - FAT, FAT32, NTFS
- Linux - ext3, ext4

2. File-System Operations

File system 包含了一些特殊的 structure 來記錄資訊，分成存在 storage 或存在 memory 兩種。存在 storage 的有：
- boot control block - 每一個 volume 有一個，紀錄如果 OS 要從這個 volume boot up 的話需要的資訊，如果 volume 中沒有 operating system (無法從這個 volume boot up)，那 boot control block 會是空的。這個 sturcture 在 UFS 中叫 boot block，在 NTFS 中叫 partition boot sector
- volume control block - 每一個 volume 有一個，紀錄 volume 的相關資訊，如：volume 中的 block 數量、一個 block 的大小、free-block 的數量及指標、free-FCB 的數量及指標。這個 structure 在 UFS 中叫做 superblock，NTFS 則把它存在 master file table 中
- directory structure - 每個 file system 會有一個，在 UFS 中也負責將 file name 對應到 inode number
- FCB 則是每一個檔案都會有一個
存在 memory 的有：
- mount table - 紀錄每一個 mounted volume 的資訊
- directory struecture cache - 紀錄最近 access 的 directory 資訊，以便增進效率
- system-wide open-file table - 存有目前有開啟的檔案的 FCB copy
- per-process open-file table - 存有目前 process 開啟的檔案 (只存指向對應 system-wide table entry 的 pointer)
- file-system block 的 buffer，在對 file system 讀寫時會用到
創建一個新檔案時，logical file system 會 allocate 一個新的 FCB (或從 free FCBs 中挑選一個)，然後將 file 所處的 directory 讀進 memory，更新 file name 與 FCB 後再寫回 storage

開啟一個檔案通常包含以下幾個步驟：
- open() 會接收一個 file name，並把他傳遞給 logical file system，並先在 system-wide open-file table 中尋找該檔案是否有被其他 process 開啟過，若有的話便在 per-process open-file table 創建一個新的 pointer
- 若沒有的話，會利用 file name 在 directory structure 中尋找對應的 FCB 並把他複製到 system-wide open-file table 中 (部分的 directory structure 會 cache 在 memory 裡，可以加速)，並一樣創建一個 per-process open-file table 的 entry
- per-process table 中的 entry 不只會指到 system-wide table 中的 entry，也會指到一些其他的資訊，如：current location in the file、access mode 等
- 都做完之後，open() 會回傳指向 per-process table entry 的指標，並使用它進行接下來的 operation。這個只俵在 UNIX 中被稱作 file descriptor，在 Windows 中被稱作 file handle

3. Directory Implementation

Linear List

將一個 directory 的 file entry 存在 linear list 中，每一個 entry 都有 file name 以及指向 data blocks 的 pointer
這種方式實作相當簡單，但是效率低落，每次想要讀寫一個檔案都需要耗費 linear search 的時間

Hash Table

一樣使用 linear list 存取 entry，但是使用一個 file name 對應到 index 的 hash table，可以將搜尋時間降到 constant time
缺點是 hash table 因為 hash function 的限制通常是 fixed-size 的，在檔案數量超過 hash table 的大小時，如果不想要 collision 就只能更換 hash function
解決方法是使用 chained-overflow hash table，在 collision 的時候就將 entry 接在 linked-list 的後面

4. Allocation Methods

Contiguous Allocation

同一個 file 會佔據連續的 block，因此用兩個數字：start block number & length 便可確定檔案在 disk 中的位置
這樣的設計對 HDD 很有利，要讀取下一個 data block 時，大部分直接不需要移動讀取頭 (只需要轉動磁碟)，少部分時間移到下一個 track 就行了
兩種 access 方式：Squential and Direct 在使用 contiguous allocation 都表現得很好 (Direct：假設開始的 block 是 $b_{th}$，想要 access 的是檔案的 $i_{th}$ block，那就可直接 access 硬碟中的 $(b+i)_{th}$ block)
Contiguous allocation 很大的一個問題是 external fragmentation，可以透過 compact 將所有 free space 結合成 contiguous space，但需要很大的 overhead
另一個問題是檔案的長度很難決定，如果是只 allocate 剛好的大小，那很可能會遇到檔案無法擴充的問題 (尤其在 best-fist allocation 的情況)，如果預留的空間太大，又可能造成 wasted space。一種解決方法是在原本空間不夠用時，額外在 allocate 一塊連續的 storage，稱作 extend，一個檔案便會分散在多個連續的區塊，可以用區塊的第一個 block address 與 block number 紀錄

Linked Allocation

File entry 存有檔案的第一個 block 的指摽，第一個 block 存有第二個 block 的指摽，以此類推
完全解決 contiguous allocation 的問題，不會有 external fragmentation 檔案也可以任意加長
但很大的一個缺點是，儘管 sequential access 還是很有效率，direct access (尋找檔案的 $i_{th}$ block) 便相當費時。另外，存放 pointer 的空間也是 pure overhead
為了解決 pointer overhead 的問題，可以使用 cluster，也就是將多個 block 合併成一個更大的單位，並且所有的 operation 都用這個單位進行
- 除了減少 pointer 佔的空間比例，還可以增進 HDD 的效率 (讀寫頭在通一個 cluster 中通常不需要移動)，以及減少 free-list 中需要紀錄的資訊
- 但同時也會造成 internal fragmentation (單位變大可能浪費的空間也會變大)，以及 I/O 效率變低 (就算是讀寫少量的 data 也須要 transfer 整個 cluster)
另一個可能的問題是可靠性，如果一個檔案中的某一個 pointer 損壞或者變成其他的值，那就有可能造成其他檔案被覆寫的危險。使用 doubly linked list 可以提供雙重保護，但也會造成更多的 pointer space overhead

一個著名的 linked allocation 變形是 file-allocation table (FAT)。這種設計會在每個 volume 的一開始預留一塊區域當成 table (也就是 FAT)，裡面會有這個 vloume 的 block number 那麼多的 entry，indexed by block number，每個 entry 如果是某個檔案的一部分，就會存有下一個 block 的 block number
這個設計的好處是 direct access 變得比較快速，因為所有「找下一個 block」的動作都可以在 FAT 中完成，不需要移動讀寫頭
FAT 很仰賴 cache 的機制，如果 FAT 不在 memory 中的話，每一次讀寫下一個 block 都必須先回到 volume 一開始確認 block number，再移動到對應的 block

Indexed Allocation

將一個檔案佔用的 block number 全部照順序紀錄在 index block 中，相對於一般的 linked allocation，這種方法可以很快的完成 direct access
但這種方法也會有通常比 linked allocation 還要嚴重的 pointer overhead (linked allocation 不會浪費多餘的空間，但 index block 可能會有 internal fragmentation)

為了能更彈性的 handle index block 的大小，有以下幾種方法：
- Linked scheme - 將每個 index block 的最後一個 entry 保留下來，大檔案時可以指向下一個 index block，小檔案時則是 null
- Multilevel index - 使用多個 level 的 index block 以指向更大的 file space，如：對於 4KB 的 block size，如果每一個 address 是 four-byte long，使用兩層的 index block 可以指向 $1024^2$ 個 block，也就是最大 4GB 的 file space
- Combined scheme - 結合 direct blocks、single/double/triple indirect blocks，如 UNIX 的 inode

5. Free-Space Management

Free-space list 在檔案系統中也是重要的一環，以下為幾種不同的實作方式

Bit Vector

使用單個 bit 來紀錄對應的 block 是否為 free
尋找第一個 free block (透過尋找第一個非零的 word) 及尋找連續 $n$ 個 free blocks 都很有效率
但如果不將 bit vector 存在 memory 的話便會很沒有效率，但 bit vector 的長度會隨著 storage 變大而增長，佔據很大的記憶體空間 (可以透過 clustering 稍微減輕負擔)

Linked List

將一連串的 free block 用 linked list 接起來
要 traverse 整個 list 會花相當多的時間，但這個 operation 很少發生，通常直接選擇第一個 free block 就好了

Grouping

將 $n$ 個 free block 的 addresses 存在第一個 block 裡，而在最後一個 block 會存有下 $n$ 個 free block 的 addresses
相比於 linked list，grouping 可以更快取得大範圍的 free blocks

Counting

利用連續的 block 通常會一起被 allocated 或 freed (尤其是在 contiguous-allocation 或是 clustering 時) 的性質，只存取每個連續 free blocks 區塊的起始點與長度便已足夠
整體需要紀錄的額外資訊通常會比其他方法來的少
可以把 entry 儲存於 balanced tree 中，已達到更快速的 lookup, insertion, and deletion

Space Maps (Ref)

在 Oracle’s ZFS file system 中，為了解決大量的 metadata I/O 造成的效率問題，將 device space 分成好幾個 metaslabs (一個 volume 中可能有好幾百個 metaslabs)
每個 metaslabs 都會有一個 space map，紀錄了照時間排序的 block activity (allocating and freeing)，並且以 counting format 簡化
當系統需要從 metaslabs allocate or free space 時，會先將對應的 source map 從 disk load 到 memory 裡的 balanced-tree structure (原本是全部 free 的狀態，逐一播放 source map 中的 activity log 並對 b-tree 做出相對應的改變後，就會變成目前的 free space 狀態)，這個步驟可以 condense free space (將相鄰但是分布在多個 operation 的 free space 結合) 也可以順便 condense space map (將可以結合的 allocate/free operation 合併成為一筆紀錄)

TRIMing Unused Blocks

對於 NVM 這類無法直接覆寫的記憶體，只存 free-space list 是不夠的，還要在 block 被 freed 掉之後去主動 erase block 原本的內容
TRIM 便是這樣的一種機制，他會在 page 被 freed 掉的時候通知 storage device 可以考慮進行 block erasure
這樣一來，garbage collection 與 erase 便有可能在 block 還未完全滿的時候進行

6. Efficiency and Performance

Efficiency

這裡舉出一些與時間/空間效率有關的實例：

在 UNIX 系統中，會把 inode 分散在 storage 的各處，並傾向將檔案存放在接近他的 inode block 的位置，以減少 seek time
在 BSD UNIX 系統中，為了減少 clustering 產生的 internal fragmentation，會依據檔案的大小決定 cluster size，大檔案用大 cluster，小檔案或檔尾用小 cluster
使用不同大小的 pointer 會造成不一樣的 space efficiency，32-bit pointer 會佔用比較小的空間，但會限制檔案大小 ($2^{32}=4GB$)，64-bit pointer 雖然可以指到很大的空間，但會造成更嚴重的 space overhead

Performance

有些系統會將一部分的 main memory 當成 buffer cache，將某些可能還會再用到的 file block 存在裡面，同時也當作 device I/O 的中繼站
另外也有 page cache 設計，將 file data 當成 virtual memory 中的 pages 來儲存，會比直接 cache physical disk blocks 來得有效率。有些系統會將 process page 與 file data 都用 page cache 的方式儲存，叫做 unified virtual memory
page cache 的出現也會造成 double buffering 的問題 (如下圖)，原始的 file I/O 透過 read() and write() 將資料讀進 buffer cache 再供 process 使用，但因為 virtual memory system 無法與一般的 buffer cache 對接，所以需要另一個 page cache 在中間接力，這樣兩次的 caching 容易造成效率低落

UNIX 跟 Linux 提出了 unified buffer cache 來解決這個問題，讓 memory-mapped I/O 跟一般的 read-write I/O 都可以直接讀取 buffer cache 的內容

Synchronous / Asychronous write 也是會引響 performance 的一個因素。通常大部分的 write 都是 asychrnous 的，只要寫到 buffer cache 裡就可以 return (因此通常 write file 的速度會快於 read file)，但例如說 file metadata write，便可以透過設定為 synchrnous write 來確保一致性
通常在管理 page cache 時，利用 least-recently uesd(LRU) 來決定要將哪一個 page 取代掉是一個不錯的方法，但對於 sequential access 的檔案，most-recently used 的 page 通常才是最不可能被用到的，而下一個要讀取的 page 也不可能會在 cache 中，這時就需要使用 free-behind 與 read-ahead 的機制，優先釋放剛讀取完成的 page 和預先讀取一部分未來可能會使用到的 page

7. Recovery

任何類型的 system crash 都可能導致 file content 損毀或者不同的紀錄發生不一致的情形 (如：FCB count 已經增加但真正的 FCB 還沒 allocate)，因此需要額外的一些檢查機制來防範

Consistency Checking

要發現 file system 中是否有問題，最直接的方法是把所有的 metadata 都看過一遍，但這樣相當的花時間
一個替代方案是在 metadata 中紀錄一個是否正在讀寫的 state，若發現某一個檔案沒有在進行 I/O 但正在讀寫的 state 卻是 on 的，那就表示有錯誤發生
系統中通常會存在 consistency checker，透過比較 directory structure 與其他的 metadata 來尋找錯誤並試圖修正

Log-Structured File System

使用 consistency checker 可能會有以下缺點：
- 如果沒有當時的確切紀錄，某些錯誤可能時無法修復的
- 在自動修復時可能會產生 conflict，需要人為的解除
- 會耗費額外的 system and clock time
因此在多數的系統中 (如：NTFS、ext4、ZFS 等) 目前都使用 log-based trasaction-oriented 或 journaling file system
在這個 file system 中，每一筆 request (稱作 trasaction) 都會被記錄在 log 中，system call 只負責記錄要進行的任務，不需要等待 I/O 完成，因此可以很快的 return。而 log 的 entry 會在真正的 file system 中「重播」，並進行相應的 I/O，已完成的 entry 便會被移除
當 system crash 時，只需要還在 log 中被紀錄的 action (也就是還沒有真的執行的 action) 執行一遍即可。注意這樣的設計不需要再系統沒有出錯的時候耗費額外的時間檢查，只需要在 system crash 的時候執行額外工作就好