Stanford CS144 Lab 1

Lab Checkpoint 1: stitching substrings into a byte stream

这个 Checkpoint 为实现一个可靠的数据流做准备,要求我们实现一个 Reassembler,将乱序到来的 substrings 拼接在一起,一旦有连续的部分则写入上一个 Checkpoint 中实现的 ByteStream 中。

2 Hands-on component: a private network for the class #

不是斯坦福学生做不了。

大概就是 Wireshark 抓包,暑假正好在网安交叉实践的时候尝试过。

3 Implementation: putting substrings in sequence #

3.1 What should the Reassembler store internally? #

我增加了几个 private 成员:

std::map<uint64_t, std::string> buffer_ {};
uint64_t index_ { 0 };
uint64_t eof_ = { static_cast<uint64_t>( -1 ) };

具体地,由于插入 substrings 的接口是这样的:

/*
* Insert a new substring to be reassembled into a ByteStream.
*   `first_index`: the index of the first byte of the substring
*   `data`: the substring itself
*   `is_last_substring`: this substring represents the end of the stream
*   `output`: a mutable reference to the Writer
*
* The Reassembler's job is to reassemble the indexed substrings (possibly out-of-order
* and possibly overlapping) back into the original ByteStream. As soon as the Reassembler
* learns the next byte in the stream, it should write it to the output.
*
* If the Reassembler learns about bytes that fit within the stream's available capacity
* but can't yet be written (because earlier bytes remain unknown), it should store them
* internally until the gaps are filled in.
*
* The Reassembler should discard any bytes that lie beyond the stream's available capacity
* (i.e., bytes that couldn't be written even if earlier gaps get filled in).
*
* The Reassembler should close the stream after writing the last byte.
*/
void insert( uint64_t first_index, std::string data, bool is_last_substring );

因此我们考虑,把任何不连续的部分都存储在 std::map 里,后面需要索引的时候,也可以很方便地用 upper_bound lower_bound 进行定位。

另有一个实现细节需要考虑:FAQ 里提到了,

Is it okay for our re-assembly data structure to store overlapping substrings? No. It is possible to implement an “interface-correct” reassembler that stores overlapping substrings. But allowing the re-assembler to do this undermines the notion of “capacity” as a memory limit. If the caller provides redundant knowledge about the same index, the Reassembler should only store one copy of this information.

任何重叠的部分都应该只存储一次。这相当于要求我们,在某次 insert 操作造成子串重叠时,删除掉重叠的部分。问题在于,删除先插入的,还是删除后插入的呢?作业要求中并没有指定。

我询问 LLM 真实世界的 TCP 规范是如何规定的,得到如下回复:

关于重叠分段(Overlapping Segments),TCP的原始规范(RFC 793)并没有严格规定必须保留哪一份数据。它提到这是一个实现决策问题。

然而,后来的规范,特别是 RFC 1122 “Requirements for Internet Hosts”,对此给出了更明确的建议。其中第 4.2.2.11 节指出:

在重组过程中,如果一个新的分段到达,其数据与已在队列中的旧分段的数据有重叠,实现必须保留旧分段中的数据,并丢弃新分段中的重叠数据。 (In reassembly, if a new segment arrives which seems to overlap an old segment, the old data MUST be kept and the new data discarded.)

这个规定的主要原因是出于安全考虑。保留旧数据可以帮助抵御某些类型的攻击,例如攻击者试图通过发送恶意构造的重叠分段来破坏TCP连接。

然而我查询 RFC,发现后面这个引用是纯纯胡扯的。由于懒得进行深入调查,我在写作业的时候实践了不同的删除策略,发现无论怎样(甚至包括混用不同的删除策略),作业框架的 cases 都可以顺利通过。

最后附上我的实现。我选择始终保留旧数据。

void Reassembler::insert( uint64_t first_index, string data, bool is_last_substring )
{
  auto& writer_inst = output_.writer();

  if ( is_last_substring ) {
    eof_ = first_index + data.size();
    if ( eof_ == index_ ) {
      writer_inst.close();
    }
  }

  if ( data.empty() ) {
    return;
  }

  // ----- Check for out-of-bounds data -----
  if ( first_index + data.size() <= index_ ) { // ...(...)..[...]...
    return;
  }
  if ( first_index < index_ ) { // ...(..[.)..]...
    uint64_t overlap = index_ - first_index;
    data = data.substr( overlap );
    first_index = index_;
  }
  if ( first_index >= index_ + writer_inst.available_capacity() ) { // ...[...]..(...)...
    return;
  }
  if ( first_index + data.size() > index_ + writer_inst.available_capacity() ) { // ...[..(.]..)...
    uint64_t left = index_ + writer_inst.available_capacity() - first_index;
    data = data.substr( 0, left );
  }

  // ----- Check for internal overlaps -----
  auto it = buffer_.lower_bound( first_index );
  if ( it == buffer_.end() || it->first > first_index ) {
    it = buffer_.insert( { first_index, std::move( data ) } ).first;
    if ( it != buffer_.begin() ) {
      it--; // now `it` is the previous one
    }
  } else {
    if ( data.size() > it->second.size() ) {
      // it->second = std::move( data );
      it->second += data.substr( it->second.size() );
    } else {
      // it->second = std::move( data ) + it->second.substr( data.size() );
    }
  }
  auto nxt = std::next( it );
  while ( nxt != buffer_.end() ) {
    if ( it->first + it->second.size() >= nxt->first ) {
      if ( it->first + it->second.size() < nxt->first + nxt->second.size() ) {
        // overlapping but not fully included, merge them
        it->second += nxt->second.substr( it->first + it->second.size() - nxt->first );
      }
      nxt = buffer_.erase( nxt );
    } else {
      break;
    }
  }

  // ----- Now we can write out -----
  while ( !buffer_.empty() && buffer_.begin()->first == index_ ) {
    it = buffer_.begin();
    uint64_t len = it->second.size();
    writer_inst.push( std::move( it->second ) );
    index_ += len;
    buffer_.erase( buffer_.begin() );
  }

  if ( eof_ == index_ ) {
    writer_inst.close();
  }
}

// How many bytes are stored in the Reassembler itself?
// This function is for testing only; don't add extra state to support it.
uint64_t Reassembler::count_bytes_pending() const
{
  uint64_t count = 0;
  for ( const auto& [index, str] : buffer_ ) {
    count += str.size();
  }
  return count;
}

性能数据如下:

      Start 40: reassembler_speed_test
        Reassembler throughput (no overlap):  178.15 Gbit/s
        Reassembler throughput (10x overlap): 22.91 Gbit/s
18/18 Test #40: reassembler_speed_test ...........   Passed    0.26 sec

作业文档里说 A top-of-the-line Reassembler will achieve 10 Gbit/s,可能是我的电脑太新了。