Stanford CS144 Lab 1

Lab Checkpoint 1: stitching substrings into a byte stream

这个 Checkpoint 为实现一个可靠的数据流做准备,要求我们实现一个 Reassembler,将乱序到来的 substrings 拼接在一起,一旦有连续的部分则写入上一个 Checkpoint 中实现的 ByteStream 中。

2 Hands-on component: a private network for the class #

不是斯坦福学生做不了。

大概就是 Wireshark 抓包,暑假正好在网安交叉实践的时候尝试过。

3 Implementation: putting substrings in sequence #

3.1 What should the Reassembler store internally? #

我增加了几个 private 成员:

1std::map<uint64_t, std::string> buffer_ {};
2uint64_t index_ { 0 };
3uint64_t eof_ = { static_cast<uint64_t>( -1 ) };

具体地,由于插入 substrings 的接口是这样的:

 1/*
 2* Insert a new substring to be reassembled into a ByteStream.
 3*   `first_index`: the index of the first byte of the substring
 4*   `data`: the substring itself
 5*   `is_last_substring`: this substring represents the end of the stream
 6*   `output`: a mutable reference to the Writer
 7*
 8* The Reassembler's job is to reassemble the indexed substrings (possibly out-of-order
 9* and possibly overlapping) back into the original ByteStream. As soon as the Reassembler
10* learns the next byte in the stream, it should write it to the output.
11*
12* If the Reassembler learns about bytes that fit within the stream's available capacity
13* but can't yet be written (because earlier bytes remain unknown), it should store them
14* internally until the gaps are filled in.
15*
16* The Reassembler should discard any bytes that lie beyond the stream's available capacity
17* (i.e., bytes that couldn't be written even if earlier gaps get filled in).
18*
19* The Reassembler should close the stream after writing the last byte.
20*/
21void insert( uint64_t first_index, std::string data, bool is_last_substring );

因此我们考虑,把任何不连续的部分都存储在 std::map 里,后面需要索引的时候,也可以很方便地用 upper_bound lower_bound 进行定位。

另有一个实现细节需要考虑:FAQ 里提到了,

Is it okay for our re-assembly data structure to store overlapping substrings? No. It is possible to implement an “interface-correct” reassembler that stores overlapping substrings. But allowing the re-assembler to do this undermines the notion of “capacity” as a memory limit. If the caller provides redundant knowledge about the same index, the Reassembler should only store one copy of this information.

任何重叠的部分都应该只存储一次。这相当于要求我们,在某次 insert 操作造成子串重叠时,删除掉重叠的部分。问题在于,删除先插入的,还是删除后插入的呢?作业要求中并没有指定。

我询问 LLM 真实世界的 TCP 规范是如何规定的,得到如下回复:

关于重叠分段(Overlapping Segments),TCP的原始规范(RFC 793)并没有严格规定必须保留哪一份数据。它提到这是一个实现决策问题。

然而,后来的规范,特别是 RFC 1122 “Requirements for Internet Hosts”,对此给出了更明确的建议。其中第 4.2.2.11 节指出:

在重组过程中,如果一个新的分段到达,其数据与已在队列中的旧分段的数据有重叠,实现必须保留旧分段中的数据,并丢弃新分段中的重叠数据。 (In reassembly, if a new segment arrives which seems to overlap an old segment, the old data MUST be kept and the new data discarded.)

这个规定的主要原因是出于安全考虑。保留旧数据可以帮助抵御某些类型的攻击,例如攻击者试图通过发送恶意构造的重叠分段来破坏TCP连接。

然而我查询 RFC,发现后面这个引用是纯纯胡扯的。由于懒得进行深入调查,我在写作业的时候实践了不同的删除策略,发现无论怎样(甚至包括混用不同的删除策略),作业框架的 cases 都可以顺利通过。

最后附上我的实现。我选择始终保留旧数据。

 1void Reassembler::insert( uint64_t first_index, string data, bool is_last_substring )
 2{
 3  auto& writer_inst = output_.writer();
 4
 5  if ( is_last_substring ) {
 6    eof_ = first_index + data.size();
 7    if ( eof_ == index_ ) {
 8      writer_inst.close();
 9    }
10  }
11
12  if ( data.empty() ) {
13    return;
14  }
15
16  // ----- Check for out-of-bounds data -----
17  if ( first_index + data.size() <= index_ ) { // ...(...)..[...]...
18    return;
19  }
20  if ( first_index < index_ ) { // ...(..[.)..]...
21    uint64_t overlap = index_ - first_index;
22    data = data.substr( overlap );
23    first_index = index_;
24  }
25  if ( first_index >= index_ + writer_inst.available_capacity() ) { // ...[...]..(...)...
26    return;
27  }
28  if ( first_index + data.size() > index_ + writer_inst.available_capacity() ) { // ...[..(.]..)...
29    uint64_t left = index_ + writer_inst.available_capacity() - first_index;
30    data = data.substr( 0, left );
31  }
32
33  // ----- Check for internal overlaps -----
34  auto it = buffer_.lower_bound( first_index );
35  if ( it == buffer_.end() || it->first > first_index ) {
36    it = buffer_.insert( { first_index, std::move( data ) } ).first;
37    if ( it != buffer_.begin() ) {
38      it--; // now `it` is the previous one
39    }
40  } else {
41    if ( data.size() > it->second.size() ) {
42      // it->second = std::move( data );
43      it->second += data.substr( it->second.size() );
44    } else {
45      // it->second = std::move( data ) + it->second.substr( data.size() );
46    }
47  }
48  auto nxt = std::next( it );
49  while ( nxt != buffer_.end() ) {
50    if ( it->first + it->second.size() >= nxt->first ) {
51      if ( it->first + it->second.size() < nxt->first + nxt->second.size() ) {
52        // overlapping but not fully included, merge them
53        it->second += nxt->second.substr( it->first + it->second.size() - nxt->first );
54      }
55      nxt = buffer_.erase( nxt );
56    } else {
57      break;
58    }
59  }
60
61  // ----- Now we can write out -----
62  while ( !buffer_.empty() && buffer_.begin()->first == index_ ) {
63    it = buffer_.begin();
64    uint64_t len = it->second.size();
65    writer_inst.push( std::move( it->second ) );
66    index_ += len;
67    buffer_.erase( buffer_.begin() );
68  }
69
70  if ( eof_ == index_ ) {
71    writer_inst.close();
72  }
73}
74
75// How many bytes are stored in the Reassembler itself?
76// This function is for testing only; don't add extra state to support it.
77uint64_t Reassembler::count_bytes_pending() const
78{
79  uint64_t count = 0;
80  for ( const auto& [index, str] : buffer_ ) {
81    count += str.size();
82  }
83  return count;
84}

性能数据如下:

      Start 40: reassembler_speed_test
        Reassembler throughput (no overlap):  178.15 Gbit/s
        Reassembler throughput (10x overlap): 22.91 Gbit/s
18/18 Test #40: reassembler_speed_test ...........   Passed    0.26 sec

作业文档里说 A top-of-the-line Reassembler will achieve 10 Gbit/s,可能是我的电脑太新了。