# Fast directory listing

In order to find untracked files in a git repository, [gitstatusd](../README.md) needs to list the
contents of every directory. gitstatusd does it 27% faster than a reasonable implementation that a
seasoned C/C++ practitioner might write. This document explains the optimizations that went into it.
As directory listing is a common operation, many other projects can benefit from applying these
optimizations.

## v1

Given a path to a directory, `ListDir()` must produce the list of files in that directory. Moreover,
the list must be sorted lexicographically to enable fast comparison with the Git index.

The following C++ implementation gets the job done. For simplicity, it returns an empty list on
error.

```c++
vector<string> ListDir(const char* dirname) {
  vector<string> entries;
  if (DIR* dir = opendir(dirname)) {
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) entries.push_back(ent->d_name);
    }
    if (errno) entries.clear();
    sort(entries.begin(), entries.end());
    closedir(dir);
  }
  return entries;
}
```

Every directory has entries `"."` and `".."`, which we aren't interested in. We filter them out with
a helper function `Dots()`.

```c++
bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }
```

To check how fast `ListDir()` performs, we can run it many times on a typical directory. One million
runs on a directory containing 32 files with 16-character names take 12.7 seconds.

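A measurement of this kind can be reproduced with a small harness along the following lines. This is a minimal sketch, not the benchmark that produced the numbers in this document: `MakeTestDir()` and `SecondsPerRuns()` are hypothetical helpers, the file names are an assumption, and `Dots()` and `ListDir()` are repeated from v1 only so that the block compiles on its own.

```c++
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

using namespace std;

// v1 from above, repeated so that this sketch is self-contained.
bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }

vector<string> ListDir(const char* dirname) {
  vector<string> entries;
  if (DIR* dir = opendir(dirname)) {
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) entries.push_back(ent->d_name);
    }
    if (errno) entries.clear();
    sort(entries.begin(), entries.end());
    closedir(dir);
  }
  return entries;
}

// Hypothetical helper: creates a fresh temporary directory that holds `files`
// empty files with 16-character names. Returns an empty string on error.
string MakeTestDir(int files) {
  char tmpl[] = "/tmp/listdir-bench-XXXXXX";
  if (!mkdtemp(tmpl)) return "";
  for (int i = 0; i != files; ++i) {
    char name[17];
    snprintf(name, sizeof(name), "file-%011d", i);  // 5 + 11 = 16 characters
    string path = string(tmpl) + "/" + name;
    int fd = open(path.c_str(), O_CREAT | O_WRONLY | O_CLOEXEC, 0644);
    if (fd < 0) return "";
    close(fd);
  }
  return tmpl;
}

// Hypothetical helper: wall-clock seconds taken by `runs` calls of ListDir().
// Returns -1 if the directory is empty or cannot be listed.
double SecondsPerRuns(const char* dirname, int runs) {
  assert(runs > 0);
  size_t total = 0;  // using the results keeps the calls from being optimized away
  auto start = chrono::steady_clock::now();
  for (int i = 0; i != runs; ++i) total += ListDir(dirname).size();
  chrono::duration<double> elapsed = chrono::steady_clock::now() - start;
  return total ? elapsed.count() : -1;
}
```

Calling `SecondsPerRuns(MakeTestDir(32).c_str(), 1000000)` approximates the setup described above. The absolute time depends on the machine and the filesystem, so only ratios between versions are meaningful.
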
## v2

Experienced C++ practitioners will scoff at our implementation of `ListDir()`. If it's meant to be
efficient, returning `vector<string>` is an unaffordable convenience. To avoid heap allocations we
can use a simple arena that will allow us to reuse memory between different `ListDir()` calls.

(Changed and added lines are marked with comments.)

```c++
void ListDir(const char* dirname, string& arena, vector<char*>& entries) {  // +
  entries.clear();  // +
  if (DIR* dir = opendir(dirname)) {
    arena.clear();  // +
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) {
        entries.push_back(reinterpret_cast<char*>(arena.size()));  // +
        arena.append(ent->d_name, strlen(ent->d_name) + 1);  // +
      }
    }
    if (errno) entries.clear();
    for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];  // +
    sort(entries.begin(), entries.end(),  // +
         [](const char* a, const char* b) { return strcmp(a, b) < 0; });  // +
    closedir(dir);
  }
}
```

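The two `reinterpret_cast`s exist because `arena` may reallocate while it grows, which would invalidate any pointers stored earlier. v2 therefore records offsets first and converts them to pointers only after the last `append()`. The sketch below shows the same two-phase idea with an explicit vector of offsets. `PackNames()` is a hypothetical helper used only for illustration; it is not part of gitstatusd.

```c++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: copies names into `arena` as null-terminated strings
// and returns pointers to the copies.
vector<const char*> PackNames(const vector<string>& names, string& arena) {
  arena.clear();
  vector<size_t> offsets;
  // Phase 1: append names and remember where each one starts. Pointers into
  // `arena` aren't stable yet because append() may reallocate the buffer.
  for (const string& name : names) {
    offsets.push_back(arena.size());
    arena.append(name.c_str(), name.size() + 1);  // +1 copies the null terminator
  }
  // Phase 2: the arena no longer grows, so pointers into it stay valid.
  vector<const char*> res;
  for (size_t off : offsets) {
    assert(off < arena.size());
    res.push_back(arena.data() + off);
  }
  return res;
}
```

v2 avoids the extra `offsets` vector by smuggling each offset through a `char*`, which is what the casts are for.
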
To make performance comparison easier, we can normalize the results relative to the baseline. v1
gets a performance score of 100. A twice-as-fast alternative would score 200.

| version | optimization               |     score |
|---------|----------------------------|----------:|
| v1      | baseline                   |     100.0 |
| **v2**  | **avoid heap allocations** | **112.7** |

Avoiding heap allocations makes `ListDir()` 12.7% faster. Not bad. As an added bonus, those casts
will fend off the occasional frontend developer who accidentally wanders into the codebase.

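The score is simply a ratio of running times scaled to 100. A minimal sketch of the arithmetic follows. `Score()` and `SecondsFromScore()` are hypothetical names, and any running time obtained this way is derived from the table rather than measured separately.

```c++
#include <cassert>

// Score of a version relative to the baseline: 100 for equal time, 200 for half the time.
double Score(double baseline_seconds, double seconds) {
  assert(baseline_seconds > 0 && seconds > 0);
  return 100 * baseline_seconds / seconds;
}

// Inverse of Score(): the running time implied by a score.
double SecondsFromScore(double baseline_seconds, double score) {
  assert(baseline_seconds > 0 && score > 0);
  return 100 * baseline_seconds / score;
}
```

With the 12.7-second baseline, `SecondsFromScore(12.7, 112.7)` comes out at about 11.3 seconds per million runs.
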
## v3

`opendir()` is an expensive call whose cost is linear in the number of components in the path,
because it needs to perform a lookup for every one of them. We can replace it with `openat()`,
which takes a file descriptor to the parent directory and the name of the subdirectory. Just a
single lookup, less CPU time. This optimization assumes that callers already have a descriptor to
the parent directory, which is indeed the case for gitstatusd, and is often the case in other
applications that traverse the filesystem.

```c++
void ListDir(int parent_fd, const char* dirname, string& arena, vector<char*>& entries) {  // +
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);  // +
  if (dir_fd < 0) return;  // +
  if (DIR* dir = fdopendir(dir_fd)) {  // +
    arena.clear();
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) {
        entries.push_back(reinterpret_cast<char*>(arena.size()));
        arena.append(ent->d_name, strlen(ent->d_name) + 1);
      }
    }
    if (errno) entries.clear();
    for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];
    sort(entries.begin(), entries.end(),
         [](const char* a, const char* b) { return strcmp(a, b) < 0; });
    closedir(dir);  // also closes dir_fd
  } else {  // +
    close(dir_fd);  // +
  }  // +
}
```

This is worth about 3.5 points on our scale, or roughly 3% over v2.

| version | optimization                         |     score |
|---------|--------------------------------------|----------:|
| v1      | baseline                             |     100.0 |
| v2      | avoid heap allocations               |     112.7 |
| **v3**  | **open directories with `openat()`** | **116.2** |

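A caller that walks a directory tree naturally has the parent descriptor at hand. The sketch below is one assumed way such a traversal could look, not gitstatusd's actual code. `CountFiles()` is a hypothetical function that recurses with `openat()` and `fdopendir()`, so every open resolves a single path component. It omits `O_NOATIME`, which fails with `EPERM` on directories the caller doesn't own.

```c++
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>

bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }

// Hypothetical example: counts non-directory entries under `name`, which is
// resolved relative to `parent_fd`. Returns -1 on error.
long CountFiles(int parent_fd, const char* name) {
  assert(name && *name);
  int dir_fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dir_fd < 0) return -1;
  DIR* dir = fdopendir(dir_fd);
  if (!dir) {
    close(dir_fd);
    return -1;
  }
  long res = 0;
  while (struct dirent* ent = readdir(dir)) {
    if (Dots(ent->d_name)) continue;
    struct stat st;
    // fstatat() is also relative to dir_fd, so there are no full-path lookups.
    if (fstatat(dir_fd, ent->d_name, &st, AT_SYMLINK_NOFOLLOW)) {
      res = -1;
      break;
    }
    if (S_ISDIR(st.st_mode)) {
      long n = CountFiles(dir_fd, ent->d_name);  // dir_fd plays the role of parent_fd
      if (n < 0) {
        res = -1;
        break;
      }
      res += n;
    } else {
      ++res;
    }
  }
  closedir(dir);  // also closes dir_fd
  return res;
}
```

The traversal can be started with `CountFiles(AT_FDCWD, path)`, which makes the first `openat()` behave like a plain `open()`.
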
## v4

Copying file names to the arena isn't free but it doesn't seem like we can avoid it. Poking around,
we can see that the POSIX API we are using is implemented on Linux on top of the `getdents64`
system call. Its documentation isn't very encouraging:

```text
These are not the interfaces you are interested in. Look at
readdir(3) for the POSIX-conforming C library interface. This page
documents the bare kernel system call interfaces.

Note: There are no glibc wrappers for these system calls.
```

Hmm... The API looks like something we can take advantage of, so let's try it anyway.

First, we'll need a simple `Arena` class that can allocate 8KB blocks of memory.

```c++
class Arena {
 public:
  enum { kBlockSize = 8 << 10 };

  char* Alloc() {
    if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0);
    return blocks_[cur_++].data();
  }

  void Clear() { cur_ = 0; }

 private:
  size_t cur_ = 0;
  vector<string> blocks_;
};
```

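The point of `Clear()` is that it resets the cursor without releasing memory, so after the first few directories `Alloc()` stops touching the heap. The sketch below repeats the class and adds a hypothetical `ReusesBlocks()` check. It assumes that moving a `std::string` of this size keeps its heap buffer, which holds for mainstream standard libraries in practice but isn't spelled out by the standard.

```c++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

class Arena {
 public:
  enum { kBlockSize = 8 << 10 };

  char* Alloc() {
    if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0);
    return blocks_[cur_++].data();
  }

  void Clear() { cur_ = 0; }

 private:
  size_t cur_ = 0;
  vector<string> blocks_;
};

// Hypothetical check: after Clear(), Alloc() hands out the same blocks in the same order.
bool ReusesBlocks(Arena& arena, int blocks) {
  assert(blocks > 0);
  arena.Clear();
  vector<char*> first;
  for (int i = 0; i != blocks; ++i) first.push_back(arena.Alloc());
  arena.Clear();
  for (char* p : first) {
    if (arena.Alloc() != p) return false;
  }
  return true;
}
```

Block contents are not zeroed on reuse, which is fine here because `getdents64()` overwrites whatever it reports as filled.
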
Next, we need to define `struct dirent64_t` ourselves because there is no wrapper for the system
call we are about to use.

```c++
struct dirent64_t {
  ino64_t d_ino;
  off64_t d_off;
  unsigned short d_reclen;
  unsigned char d_type;
  char d_name[];
};
```

Finally we can get to the implementation of `ListDir()`.

```c++
void ListDir(int parent_fd, const char* dirname, Arena& arena, vector<char*>& entries) {  // +
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dir_fd < 0) return;
  arena.Clear();  // +
  while (true) {  // +
    char* buf = arena.Alloc();  // +
    int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize);  // +
    if (n <= 0) {  // +
      if (n) entries.clear();  // +
      break;  // +
    }  // +
    for (int pos = 0; pos < n;) {  // +
      auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);  // +
      if (!Dots(ent->d_name)) entries.push_back(ent->d_name);  // +
      pos += ent->d_reclen;  // +
    }  // +
  }  // +
  sort(entries.begin(), entries.end(),
       [](const char* a, const char* b) { return strcmp(a, b) < 0; });
  close(dir_fd);
}
```

How are we doing with this one?

| version | optimization                     |     score |
|---------|----------------------------------|----------:|
| v1      | baseline                         |     100.0 |
| v2      | avoid heap allocations           |     112.7 |
| v3      | open directories with `openat()` |     116.2 |
| **v4**  | **call `getdents64()` directly** | **137.8** |

A solid speedup of almost 19% over v3. Worth the trouble. Unfortunately, we now have just one
`reinterpret_cast` instead of two, and it's not nearly as scary-looking. Hopefully with the next
iteration we can get back some of that evil vibe of low-level code.

As a bonus, every element in `entries` has `d_type` at offset -1. This can be useful to callers
that need to distinguish between regular files and directories (gitstatusd, in fact, needs this).
Note how `ListDir()` implements this feature at zero cost, as a lucky accident of the `dirent64_t`
memory layout.

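The byte at offset -1 is `d_type` because nothing separates it from `d_name` in the struct. Here is a minimal sketch of how a caller might read it. `TypeOf()` and `IsDir()` are hypothetical helpers, and the struct uses fixed-width integer types as a stand-in for `ino64_t` and `off64_t`.

```c++
#include <dirent.h>  // DT_DIR

#include <cassert>
#include <cstddef>
#include <cstdint>

struct dirent64_t {
  uint64_t d_ino;
  int64_t d_off;
  unsigned short d_reclen;
  unsigned char d_type;
  char d_name[];
};

// The trick works only if d_type immediately precedes d_name.
static_assert(offsetof(dirent64_t, d_name) == offsetof(dirent64_t, d_type) + 1, "");

// Hypothetical helper: `name` must point at d_name of a record filled in by getdents64().
unsigned char TypeOf(const char* name) {
  assert(name);
  return static_cast<unsigned char>(name[-1]);
}

// Hypothetical helper. Some filesystems report DT_UNKNOWN, in which case the
// caller has to fall back to stat().
bool IsDir(const char* name) { return TypeOf(name) == DT_DIR; }
```

These helpers are valid only for pointers produced by the v4 `ListDir()`. Names copied elsewhere, as in v2 and v3, have no type byte in front of them.
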
## v5

The CPU profile of `ListDir()` reveals that almost all userspace CPU time is spent in `strcmp()`.
Digging into the source code of `std::sort()` we can see that it uses Insertion Sort for short
collections. Our 32-element vector falls under the threshold. Insertion Sort makes `O(N^2)`
comparisons, hence a lot of CPU time in `strcmp()`. Switching to `qsort()` or
[Timsort](https://en.wikipedia.org/wiki/Timsort) is of no use as all good sorting algorithms fall
back to Insertion Sort.

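How many comparisons `std::sort()` actually makes depends on the standard library, so it is worth measuring rather than guessing. The following is a small sketch that assumes nothing beyond the standard library. `CountComparisons()` is a hypothetical helper, and no particular count is implied.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

using namespace std;

// Hypothetical helper: sorts `entries` the way v4 does and returns the number
// of strcmp() calls the sort needed.
size_t CountComparisons(vector<const char*>& entries) {
  size_t calls = 0;
  sort(entries.begin(), entries.end(), [&](const char* a, const char* b) {
    assert(a && b);
    ++calls;
    return strcmp(a, b) < 0;
  });
  return calls;
}
```

Calling it on the 32 names from the benchmark directory shows how many `strcmp()` calls a single `ListDir()` pays for.
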
If we cannot make fewer comparisons, perhaps we can make each of them faster? `strcmp()` compares
characters one at a time. It cannot read ahead as it can be illegal to touch memory past the first
null byte. But _we_ know that it's safe to read a few extra bytes past the end of `d_name` for every
entry except the last in the buffer. And since we own the buffer, we can overallocate it so that
reading past the end of the last entry is also safe.

Combining these ideas with the fact that file names on Linux are at most 255 bytes long, we can
invoke `getdents64()` like this:

```c++
int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);
```

And then compare entries like this:

```c++
[](const char* a, const char* b) { return memcmp(a, b, 255) < 0; }
```

This version doesn't give any speedup compared to the previous one, but it opens an avenue for
another optimization. The pointers we pass to `memcmp()` aren't aligned. To be more specific, their
numerical values are `N * 8 + 3` for some `N`. When given such a pointer, `memcmp()` will check the
first 5 bytes one by one, and only then switch to comparing 8 bytes at a time. If we can handle the
first 5 bytes ourselves, we can pass aligned memory to `memcmp()` and take full advantage of its
vectorized loop.

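The `N * 8 + 3` figure follows from the layout of `dirent64_t`: 8 bytes of `d_ino`, 8 of `d_off`, 2 of `d_reclen` and 1 of `d_type` put `d_name` at offset 19. The sketch below checks this arithmetic at compile time. It assumes that records start at 8-byte boundaries, which is the case for buffers filled by `getdents64()` on Linux as long as the buffer itself is aligned, because record lengths are padded to a multiple of 8. `IsAligned8()` is a hypothetical helper, and fixed-width integers stand in for `ino64_t` and `off64_t`.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>

struct dirent64_t {
  uint64_t d_ino;
  int64_t d_off;
  unsigned short d_reclen;
  unsigned char d_type;
  char d_name[];
};

static_assert(offsetof(dirent64_t, d_name) == 19, "8 + 8 + 2 + 1");
static_assert(offsetof(dirent64_t, d_name) % 8 == 3, "d_name is at N * 8 + 3");
static_assert((offsetof(dirent64_t, d_name) + 5) % 8 == 0, "d_name + 5 is 8-byte aligned");

// Hypothetical helper: true if the address `p` is a multiple of 8.
bool IsAligned8(const void* p) {
  assert(p);
  return reinterpret_cast<uintptr_t>(p) % 8 == 0;
}
```

So skipping exactly 5 bytes of the name is what turns both `memcmp()` arguments into 8-byte-aligned pointers.
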
Here's the implementation:

```c++
uint64_t Read64(const void* p) {  // +
  uint64_t x;  // +
  memcpy(&x, p, sizeof(x));  // +
  return x;  // +
}  // +

void ByteSwap64(void* p) {  // +
  uint64_t x = __builtin_bswap64(Read64(p));  // +
  memcpy(p, &x, sizeof(x));  // +
}  // +

void ListDir(int parent_fd, const char* dirname, Arena& arena, vector<char*>& entries) {
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dir_fd < 0) return;
  arena.Clear();
  while (true) {
    char* buf = arena.Alloc();
    int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);  // +
    if (n <= 0) {
      if (n) entries.clear();
      break;
    }
    for (int pos = 0; pos < n;) {
      auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);
      if (!Dots(ent->d_name)) {
        ByteSwap64(ent->d_name);  // +
        entries.push_back(ent->d_name);
      }
      pos += ent->d_reclen;
    }
  }
  sort(entries.begin(), entries.end(), [](const char* a, const char* b) {
    uint64_t x = Read64(a);  // +
    uint64_t y = Read64(b);  // +
    return x < y || (x == y && a != b && memcmp(a + 5, b + 5, 256) < 0);  // +
  });
  for (char* p : entries) ByteSwap64(p);  // +
  close(dir_fd);
}
```

This version is for little-endian architectures. Big-endian ones don't need `ByteSwap64()`, so the
code will be a bit faster there.

| version | optimization                     |     score |
|---------|----------------------------------|----------:|
| v1      | baseline                         |     100.0 |
| v2      | avoid heap allocations           |     112.7 |
| v3      | open directories with `openat()` |     116.2 |
| v4      | call `getdents64()` directly     |     137.8 |
| **v5**  | **hand-optimize `strcmp()`**     | **143.3** |

Fast and respectably arcane.

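For code that has to run on both kinds of machines, the byte swap can be made conditional. The following is a minimal sketch that assumes GCC or Clang, since it relies on the predefined `__BYTE_ORDER__` macro and on `__builtin_bswap64`. `LoadKey()` is a hypothetical helper, not part of gitstatusd.

```c++
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the first 8 bytes at `p` as an integer whose
// numeric order matches the lexicographic (memcmp) order of those bytes.
uint64_t LoadKey(const void* p) {
  assert(p);
  uint64_t x;
  memcpy(&x, p, sizeof(x));
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  x = __builtin_bswap64(x);  // move the first byte into the most significant position
#endif
  return x;
}
```

Unlike v5, which swaps each name in place once and swaps it back after sorting, this helper swaps on every load, so it trades some speed for simplicity.
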
## Conclusion

Through a series of incremental improvements we've sped up directory listing by 43.3% compared to a
naive implementation (v1) and 27.2% compared to a reasonable implementation that a seasoned C/C++
practitioner might write (v2).

However, these numbers are based on an artificial benchmark, while the real judge is always the real
code. Our goal was to speed up gitstatusd. The benchmark was just a tool. Thankfully, the different
versions of `ListDir()` have the same comparative performance within gitstatusd as in the benchmark.
In truth, the directory chosen for the benchmark wasn't arbitrary. It was picked by sampling
gitstatusd when it runs on the [chromium](https://github.com/chromium/chromium) git repository.

The final version of `ListDir()` spends 97% of its CPU time in the kernel. If we assume that it
makes the minimum possible number of system calls and these calls are optimal (true to the best
of my knowledge), it puts the upper bound on possible future performance improvements at just 3%.
There is almost nothing left in `ListDir()` to optimize.

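The 3% bound is an application of Amdahl's law: if only a fraction of the running time can be optimized, eliminating that fraction completely gives the best case. A sketch of the arithmetic, with `MaxSpeedup()` as a hypothetical name:

```c++
#include <cassert>

// Best-case speedup factor if the given fraction of the running time is eliminated entirely.
double MaxSpeedup(double optimizable_fraction) {
  assert(optimizable_fraction >= 0 && optimizable_fraction < 1);
  return 1 / (1 - optimizable_fraction);
}
```

`MaxSpeedup(0.03)` is about 1.031, so even a zero-cost userspace part would make `ListDir()` only about 3% faster.
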
![ListDir() CPU profile](https://raw.githubusercontent.com/romkatv/gitstatus/1ac366952366d89980b3f3484f270b4fa5ae4293/cpu-profile-listdir.png)

(The CPU profile was created with [gperftools](https://github.com/gperftools/gperftools) and
rendered with [pprof](https://github.com/google/pprof).)