本文是阅读 Borg 的笔记，主要梳理其概要，并联系实践谈谈一些感受。

Guide

Guide 章节 High Level 概括其如何实现高利用率和高可靠的能力，熟悉 K8S 同学对此不会陌生。

It achieves high utilization by combining:

Admission control.
Efficient task-packing(scheduling).
Over-commitment.
Machine sharing with process-level performance isolation.

It supports high-availability applications with runtime features that minimize:

Fault-recovery time.
Scheduling policies that reduce the probability of correlated failures.

Introduction

本章节简要介绍其特点，它们在 K8S 中体现的淋漓尽致:

Hides the details of resource management and failure handling so its users can focus on application development instead
Operates with very high reliability and availability, and supports applications that do the same;
Lets us run workloads across tens of thousands of machines effectively

The User Perspective

业务类型

Borg 将业务分为两大类型，即 Prod(基本为在线) 和 Non-Prod(基本为离线)；从资源量上来说，不少互联网公司的在离线资源比例也大概在 2:1 左右。

Prod: 70% Alloc, 60% Usage
Non-Prod: 30% Alloc, 40% Usage

机房拓扑

从业内实践来看，很少见跨机房建设 K8S 集群；进一步而言，可以结合网络拓扑来更加的规范集群的建设，如此有利于容灾和弱化不同层次交换机网络亲和性的调度诉求。从规模上而言，规模化的 K8S 集群在数千台节点级别，少部分公司可达上万节点。

The machines in a cell belong to a single cluster, defined by the high-performance datacenter-scale network fabric that connects them.
A cluster lives inside a single datacenter building, hosts one large cell with 10k.
Borg isolates users from most of these differences by determining where in a cell to run tasks.

任务的属性

如下是 Borg 任务携带的属性：

processor architecture, OS version, or an external IP address, resource requirements((CPU cores, RAM, disk space, disk access rate, TCP ports). we don’t impose fixed-sized buckets or slots.
Owner, binary, config etc.

Alloc

A Borg alloc (short for allocation) is a reserved set of resources on a machine.

What’s alloc?

Priority, quota, and admission control

资源类型上，Borg 提供更丰富的层次，并且定义其抢占策略和准入逻辑。K8S 也继承了资源的等级思想，并且针对不同等级的资源在调度、单机管控上拥有不同的优先级。从业内实践来看，大部分玩家对此类特性并不关注，本质是对它们而言，在从虚拟机迈入容器化时代的过程中，需要保证业务的平滑体验和稳定性。只有少量成熟且拥有海量资源的玩家，对优先级、抢占等特性十分关注，因为这是提升资源利用率的重要手段。

Monitoring, production, batch, and best effort.
We disallow tasks in the production priority band to preempt one another.
Quota-checking is part of admission control, not scheduling: jobs with insufficient quota are immediately rejected upon submission.
A low-priority job may be admitted but remain pending (unscheduled) due to insufficient resources.

Naming and monitoring

BNS 命名的体系携带集群的标志，推测其服务治理体系大概率和集群信息挂钩，对于集群维度的容灾等有一定的收益，K8S 在服务治理体系基本是不太关注集群信息。关于状态检查和 Event 事件，K8S 也大同小异；从运维经验来看，如果可以详细记录 K8S 管控面和数据面的各类重要信息，并存入 ES 等引擎中，将大大提升运维效率。

Borg name service (BNS) name for each task that includes the cell name, job name, and task number.
Health-check.
If a job is not running Borg provides a “why pending?” annotation, together with guidance on how to modify the job’s resource requests to better fit the cell.
Borg records all job submissions and task events, as well as detailed per-task resource usage information in infrastore.

Borg architecture

Arch

Borg 的架构和 K8S 总体比较类似，其中 BorgMaster 计算和存储是融合的，而 K8S 将计算和存储分离，采用 Etcd 完成数据的存储。Borg 另一个重要的特色是构建 Fauxmaster 以模拟验证，其中 Simulator 是衡量调度能力的重要一环，也是值得投入优化之处，在验证 Cell Compaction 方面具有重要的作用。

Master
- BorgMaster / Store
- Scheduer
- Fauxmaster: simulator / Capacity Planning / Sanity Check.
BorgLet

Scheduler

Scheduler 模块和 K8S 也非常相近，调度的粒度是 task，调度流程如出一辙，具体如下：

queue pririoty -> feasible checking -> score

其中在 Score 上，其做了如下侧重，重点是可抢占任务的打散、镜像缓存和容灾考虑；在 MostRequest(减少碎片、增加毛刺) 和 LeastRequest(减少毛刺、增大碎片)上做了权衡:

Minimizing the number and priority of preempted tasks,
Picking machines that already have a copy of the task’s packages,
Spreading tasks across power and failure domains
Packing quality including putting a mix of high and low priority E-PVM for scoring

调度的性能指标：

10000 tasks / minute

Borg 在优化性能方面和 K8S 有较大的差异，K8S 为每个 Pod 打分，而 Borg 采取了精度换效率的做法：

Score Caching: Cache Scores Until the Property of machine/task change.
Equivalence classes: feasibility and scoring for one task per equivalence class.
Relax Randomization.

Availability

Borg 支持业务达到 99.99% 的可用性，和 K8S 思路非常类似：

By spreading tasks of a job across failure domains such as machines, racks, and power domains.
PDB.
Reschedule.

Utilization

Evaluation

和资源利用率指标相比，Borg 提出 Cell Compaction 作为衡量标准：

given a workload, we found out how small a cell it could be fitted into by removing machines until the workload no longer fitted

此衡量标准非常新颖，从客观的视角回答了如果要支持所需要的 workload，最佳条件下需要多少机器(成本)。结合 Faux 等模拟能力，将很好的回答如下问题：

当前的调度状态和理论最佳的状态差异性如何？
机器的选型是否有优化空间，是否最佳的适配了 workload？
如何衡量 workload 的必须性？

本节突出在 prod 和 non-prod 融合并池的重要性，二者的割裂将而外增加 20-30% 的资源。

Segregating prod and non-prod work would need 20–30% more machines in the median cell to run our workload. That’s because prod jobs usually reserve resources to handle rare workload spikes, but don’t use these resources most of the time. Borg reclaims the unused resources to run much of the non-prod work。

Large Cell

本节突出大集群大池子对资源成本的优化效果，采用割裂的小集群将额外增加 20-150% 的资源成本。事实上，不少云原生团队开展业务时，为了避免业务之间的互相影响，通常倾向采用独占节点或者集群这类简单而粗暴的方式解决隔离的问题，但是不利于长期的维护效率和优化资源利用率。因此在建设 K8S 集群时，长期需要避免将业务和某个集群/某些节点进行绑定，应当完善单机层面的隔离能力，最终实现统一而共享的资源池。

Fine-Grained Resource

应该采用精细粒度的资源分配，如果取整到 0.5Core 和 1GB 的粒度将带来 30% 的额外开销。X Core Y GB 是 IaaS 时代浓厚的经验习惯，对用户体验友好。在建设精细化的资源配置之前，需要完善资源诊断和建议的能力，即根据应用的历史数据，为客户推荐合适的精细化的资源配置。

Resource Reclaim

如果禁止 Resource Reclaim，将增加 20% 的资源开销。

Isolation

关于隔离方面，熟悉 Docker 对此应不陌生；如果具备对内核、网络具备更强的掌控能力，还可以定制更多维度和更强的计算、存储和网络的隔离能力。

Security Isolation: Chroot.
Performance Isolation: Post-hoc, Cgroup.
CFS + CPUSetd.

Lessons and future work

The bad

在总结经验的过程中，Borg 提到三个教训:

Jobs are restrictive as the only grouping mechanism for tasks(label selector).
One IP address per machine complicates things(hostNetwork).
Optimizing for power users at the expense of casual ones.

针对 1，K8S 抽象 label 以解决；针对 2，K8S 通过支持 CNI，为每个 Pod 分配独立的 IP。关于问题 3 中的 2/8 效应，不仅仅是 Borg，也是 Mesos/Yarn/K8S 在各自大规模场景中均会碰到的问题，即少量的头部在线服务/离线服务将占据主体的资源。以在线应用为例，副本数 10000 和 10 的服务在发布必然在效率、稳定性和成本方便不不同的考量；以离线业务为例，点击率预估的深度学习训练和短平快的 Spark SQL Batch 任务必然也大相径庭。

The good

Borg 亦归纳了 4 个心得：

Allocs are useful.
Cluster management is more than task management.
The master is the kernel of a distributed system.
Introspection is vital: An important design decision in Borg was to surface debugging information to all users rather than hiding it: Borg has thousands of users, so “self-help”has to be the first step in debugging.

其中第 4 点，关于 Infra 对业务呈现形式上，两种不同的思路：

面向产品化的建设，致力抽象底层架构和细节，给用户以最小的感知和使用成本，公有云通常采用此方案。
面向自助式的思路，提供丰富的文档介绍、技术工具基于自助式的排查，Borg 采用此思路。