Stanford CS144 Lab 1
Lab Checkpoint 1: stitching substrings into a byte stream
这个 Checkpoint 为实现一个可靠的数据流做准备,要求我们实现一个 Reassembler,将乱序到来的 substrings 拼接在一起,一旦有连续的部分则写入上一个 Checkpoint 中实现的 ByteStream 中。
2 Hands-on component: a private network for the class #
不是斯坦福学生做不了。
大概就是 Wireshark 抓包,暑假正好在网安交叉实践的时候尝试过。
3 Implementation: putting substrings in sequence #
3.1 What should the Reassembler store internally? #
我增加了几个 private 成员:
std::map<uint64_t, std::string> buffer_ {};
uint64_t index_ { 0 };
uint64_t eof_ = { static_cast<uint64_t>( -1 ) };
具体地,由于插入 substrings 的接口是这样的:
/*
* Insert a new substring to be reassembled into a ByteStream.
* `first_index`: the index of the first byte of the substring
* `data`: the substring itself
* `is_last_substring`: this substring represents the end of the stream
* `output`: a mutable reference to the Writer
*
* The Reassembler's job is to reassemble the indexed substrings (possibly out-of-order
* and possibly overlapping) back into the original ByteStream. As soon as the Reassembler
* learns the next byte in the stream, it should write it to the output.
*
* If the Reassembler learns about bytes that fit within the stream's available capacity
* but can't yet be written (because earlier bytes remain unknown), it should store them
* internally until the gaps are filled in.
*
* The Reassembler should discard any bytes that lie beyond the stream's available capacity
* (i.e., bytes that couldn't be written even if earlier gaps get filled in).
*
* The Reassembler should close the stream after writing the last byte.
*/
void insert( uint64_t first_index, std::string data, bool is_last_substring );
因此我们考虑,把任何不连续的部分都存储在 std::map 里,后面需要索引的时候,也可以很方便地用 upper_bound lower_bound 进行定位。
另有一个实现细节需要考虑:FAQ 里提到了,
Is it okay for our re-assembly data structure to store overlapping substrings? No. It is possible to implement an “interface-correct” reassembler that stores overlapping substrings. But allowing the re-assembler to do this undermines the notion of “capacity” as a memory limit. If the caller provides redundant knowledge about the same index, the Reassembler should only store one copy of this information.
任何重叠的部分都应该只存储一次。这相当于要求我们,在某次 insert 操作造成子串重叠时,删除掉重叠的部分。问题在于,删除先插入的,还是删除后插入的呢?作业要求中并没有指定。
我询问 LLM 真实世界的 TCP 规范是如何规定的,得到如下回复:
关于重叠分段(Overlapping Segments),TCP的原始规范(RFC 793)并没有严格规定必须保留哪一份数据。它提到这是一个实现决策问题。
然而,后来的规范,特别是 RFC 1122 “Requirements for Internet Hosts”,对此给出了更明确的建议。其中第 4.2.2.11 节指出:
在重组过程中,如果一个新的分段到达,其数据与已在队列中的旧分段的数据有重叠,实现必须保留旧分段中的数据,并丢弃新分段中的重叠数据。 (In reassembly, if a new segment arrives which seems to overlap an old segment, the old data MUST be kept and the new data discarded.)
这个规定的主要原因是出于安全考虑。保留旧数据可以帮助抵御某些类型的攻击,例如攻击者试图通过发送恶意构造的重叠分段来破坏TCP连接。
然而我查询 RFC,发现后面这个引用是纯纯胡扯的。由于懒得进行深入调查,我在写作业的时候实践了不同的删除策略,发现无论怎样(甚至包括混用不同的删除策略),作业框架的 cases 都可以顺利通过。
最后附上我的实现。我选择始终保留旧数据。
void Reassembler::insert( uint64_t first_index, string data, bool is_last_substring )
{
auto& writer_inst = output_.writer();
if ( is_last_substring ) {
eof_ = first_index + data.size();
if ( eof_ == index_ ) {
writer_inst.close();
}
}
if ( data.empty() ) {
return;
}
// ----- Check for out-of-bounds data -----
if ( first_index + data.size() <= index_ ) { // ...(...)..[...]...
return;
}
if ( first_index < index_ ) { // ...(..[.)..]...
uint64_t overlap = index_ - first_index;
data = data.substr( overlap );
first_index = index_;
}
if ( first_index >= index_ + writer_inst.available_capacity() ) { // ...[...]..(...)...
return;
}
if ( first_index + data.size() > index_ + writer_inst.available_capacity() ) { // ...[..(.]..)...
uint64_t left = index_ + writer_inst.available_capacity() - first_index;
data = data.substr( 0, left );
}
// ----- Check for internal overlaps -----
auto it = buffer_.lower_bound( first_index );
if ( it == buffer_.end() || it->first > first_index ) {
it = buffer_.insert( { first_index, std::move( data ) } ).first;
if ( it != buffer_.begin() ) {
it--; // now `it` is the previous one
}
} else {
if ( data.size() > it->second.size() ) {
// it->second = std::move( data );
it->second += data.substr( it->second.size() );
} else {
// it->second = std::move( data ) + it->second.substr( data.size() );
}
}
auto nxt = std::next( it );
while ( nxt != buffer_.end() ) {
if ( it->first + it->second.size() >= nxt->first ) {
if ( it->first + it->second.size() < nxt->first + nxt->second.size() ) {
// overlapping but not fully included, merge them
it->second += nxt->second.substr( it->first + it->second.size() - nxt->first );
}
nxt = buffer_.erase( nxt );
} else {
break;
}
}
// ----- Now we can write out -----
while ( !buffer_.empty() && buffer_.begin()->first == index_ ) {
it = buffer_.begin();
uint64_t len = it->second.size();
writer_inst.push( std::move( it->second ) );
index_ += len;
buffer_.erase( buffer_.begin() );
}
if ( eof_ == index_ ) {
writer_inst.close();
}
}
// How many bytes are stored in the Reassembler itself?
// This function is for testing only; don't add extra state to support it.
uint64_t Reassembler::count_bytes_pending() const
{
uint64_t count = 0;
for ( const auto& [index, str] : buffer_ ) {
count += str.size();
}
return count;
}
性能数据如下:
Start 40: reassembler_speed_test
Reassembler throughput (no overlap): 178.15 Gbit/s
Reassembler throughput (10x overlap): 22.91 Gbit/s
18/18 Test #40: reassembler_speed_test ........... Passed 0.26 sec
作业文档里说 A top-of-the-line Reassembler will achieve 10 Gbit/s,可能是我的电脑太新了。