Motivation:
A small number of socket options have values that do not fit into a C
int type. Our current ChannelOption based approach for setting these
simply does not work, and cannot be extended to support the truly arbitrary
types that the setsockopt/getsockopt functions allow here.
This makes it impossible to use some socket options, which is hardly a
good place to be.
There were a number of ways we could have addressed this: we could have
special-cased all socket options with non-integer types in ChannelOption,
but I believe that would be too manual, and risk limiting users that need
to set other socket options. We could also have added a ChannelOption
that simply allows users to pass a buffer to write into or read from,
but that is a very un-Swift-like API that feels pretty gross to hold.
Ultimately, the nicest approach seemed to be a new protocol users could check
for, and that would provide APIs that let users hold the correct concrete
type. As with setsockopt/getsockopt, while this API is typed it is
not type-safe: ultimately, the struct we have here is treated just as a
buffer by setsockopt/getsockopt. We do not attempt to prevent users from
shooting themselves in the foot here.
This PR does not include an example use case in any server, as I will
provide such an example in a subsequent multicast PR.
Modifications:
- Added a SocketOptionChannel protocol.
- Conformed BaseSocketChannel to SocketOptionChannel.
- Wrote some tests for this.
Result:
Users can set and get sockopts with data that is larger than a C int.
Motivation:
Previously we asserted that a Channel cannot go inactive after calling
out for a read that was triggered by (draining the receive buffer after)
a write error.
This assertion was put in place to guard against `readComplete` events
sent on inactive channels. It did that job just fine, but crashes aren't
great, so we now fire the `readComplete` event only if the
Channel stays active.
Modifications:
make the readComplete event firing conditional
Result:
fewer crashes, more happy faces
Motivation:
We have one case in the code where we expect a socket error to be set
when we hit a connection that got the `.reset` event. On Linux that's
expected and therefore we just map this to `ECONNRESET`.
On other platforms, however, that is unexpected. I just found an
[Envoy commit](62f8e51bdd#diff-3315c7b291f78a482ebd4cbfc0545362R345)
which suggests this might not work on FreeBSD. This message should prompt
our users to let us know if they ever see this error on non-Linux systems.
Modifications:
Add a message that makes it clear this is unexpected and link our
ticket.
Result:
we should learn if this actually happens on macOS.
Motivation:
Swift code that doesn't use explicit `self.` is hard to read, even
harder to search for, and quite bug prone. During my analysis I found
quite a number of missing ones; this fixes the ones I came across.
Modifications:
Add a bunch of missing explicit `self.`s
Result:
code easier to read and less bug prone
Motivation:
On Darwin it is possible for readEOF to be received before writable,
if we race connection teardown with registering for writable. See
https://github.com/apple/swift-nio/issues/452 for a more in-depth
explanation.
Modifications:
Widened the assertion.
Result:
People's code won't go bang.
Motivation:
If registration of a Channel fails we need to close that Channel.
Previously we'd just drop it on the floor which triggers an assertion
that an open channel got leaked.
Modifications:
In cases where the Channel registration fails, we need to close that
channel.
Result:
- No crashes if registration fails.
- fixes #417
Motivation:
Previously we would eagerly register any fd with the selector
(epoll/kqueue) which works well with kqueue. With epoll however, you'll
get `EPOLLHUP` immediately if you supplied a socket that hasn't had
connect/bind called on it.
We got away with this because we took quite a bit of care to always make
sure we call connect/bind pretty much immediately after we registered
it. And crucially before we ever asked the selector to tell us about
new events.
Modifications:
made selector registration lazy (when we try activating the channel by
calling connect/bind)
Result:
the user can now register and connect whenever they feel like it.
Motivation:
Once again, we had an extra event loop hop between a channel
registration and its activation. Usually this shows up as `EPOLLHUP` but
not so for accepted channels. What happened instead is that we had a
small race window after we accepted a channel. It was in a state where
it was not marked active _yet_ and therefore we'd not read out its data
in case we received a `.readEOF`. That usually leads to a stale
connection. Fortunately it doesn't happen very often that the client
connects, immediately sends some data and then shuts the write end of
the socket.
Modifications:
prevent the event loop hop between registration and activation
Result:
will always read out the read buffer on .readEOF
Motivation:
@vlm hit an interesting situation which is very likely the sign of
another bug that we have yet to find:
During a close, @vlm hit an error in close which led to a user callout
which then closed again.
The immediate fix (this PR) is, as always, simply not to call out to the
user before reconciling our own state. But hitting this bug also means
that either a `socket.close` or a `selectableEventLoop.deregister`
failed, as those were the only situations in which we called out. That
points at a condition that is hard to imagine without another bug that we
still need to find.
Modifications:
in `BaseSocketChannel.close0` follow rule 0 and don't call out before
reconciling state.
Result:
fewer crashes, less potential to hide other bugs.
Motivation:
We can't nil out the connectPromise before we fail it. This totally
slipped through because, for the reasons below, we never fully waited
for it.
It's quite easy to misuse `eventLoop.submit`: if you do some
asynchronous operation (which returns an `EventLoopFuture<T>`), the
result of the submit will be `EventLoopFuture<EventLoopFuture<T>>`. So
if you just do

    try eventLoop.submit {
        return somethingAsynchronous
    }.wait()
then you will have only waited for the asynchronous operation to have
started (it might still be running). Fortunately, the compiler usually
warns you about that unless it returns `Void`. So it naturally tells you
to write
    try eventLoop.submit {
        return somethingAsynchronous
    }.wait().wait()
which will wait for both. Unfortunately, `XCTAssertNoThrow` will just
eat away any type so there it's advisable to write
    XCTAssertNoThrow(try eventLoop.submit { ... }.wait().wait() as Void)
to be sure it's actually a void. The Swift compiler however has an
annoying bug where it won't always realise that that's a problem
(https://bugs.swift.org/browse/SR-7478). Fortunately the 4.1 release
behaves okay so CI will catch it.
Modifications:
- replace the bad pattern `_ = try el.submit { ... }`
- added `as Void`s inside `XCTAssertNoThrow`s that do submit
- added missing `.wait()`s
Result:
tests should be more correct.
Motivation:
Often it's useful to still be able to access the local / remote address during the channelInactive / handlerRemoved callbacks, for example to log it. We should ensure it's still accessible during them.
Modifications:
- Fall back to the slow path in ChannelHandlerContext.localAddress0 / remoteAddress0 if the fast path fails, trying to access the address via the cache.
- Clear cached addresses after all callbacks are run.
- Add unit test.
Result:
Be able to access addresses while handlers are notified.
Motivation:
Since forever we had a major bug in the Selector: In this condition:
- kqueue/epoll had many events
- in one of the earlier events we unregister a Channel whose fd is in one
of the later events
- we subsequently (still in the same event loop tick) register a new
channel which gets the same fd as the previously closed one
then we would deliver an event that was meant for a previous channel to
a newly opened one.
Thanks to @mcdappdev for hitting this bug, helping us debug it and also
providing a repeatedly working repro.
Modifications:
if during event delivery any fd gets unregistered, we stop delivering
the remaining events and rely on the selector to redeliver them
again next time.
Result:
we don't deliver events for previously closed channels to new ones.
Motivation:
When a pending connect is still in progress and the Channel is closed we need to ensure we use the correct ordering when notifying the promises and ChannelHandlers in the pipeline.
The ordering should be:
- Notify pending connect promise
- Notify close promise
- Call channelInactive(...)
Modifications:
- Correct ordering of notification
- Add unit test
Result:
Ensure correct ordering when connect is still pending
Motivation:
In some contexts it is important to be able to pass file descriptors
around instead of specifying the particular host and port. This allows removing
certain windows for race conditions, and has positive security implications.
Modifications:
Added `withConnectedSocket(_:)` call to the `ClientBootstrap` that can
be used to create a new `Channel` out of existing file descriptor
representing the connected socket.
Result:
Allows initializing `Channel`s off the existing file descriptors.
Motivation:
Since #286, we eagerly detect connection resets, which is good. To
resolve the situation we first try to read everything off the socket. If
something fails there, everything is good and the client gets the
correct error. If that however doesn't lead to any errors, we would
close the connection with `error: ChannelError.eof` which the client
would then see for example on a connect that happened to get
`ECONNRESET` or `ECONNREFUSED`. That's not right so these situations are
fixed here.
Modifications:
- on reset, try to get the socket error if any
- if no socket error was detected, we send an `ECONNRESET` because we
know we have been reset but not why, and didn't encounter any errors
- write tests for these situations
Result:
connection resets that don't hit any other errors will now see the
correct error
Motivation:
When a connect() call returns an error other than EINPROGRESS, we
currently leave the FD registered and don't close the channel. This is
wrong, we should take this as a signal to close the socket immediately,
rather than letting the selector tell us this FD is dead: after all,
we know it's dead.
Modifications:
Call `close0()` in synchronous connect failures.
Result:
Faster failures!
Motivation:
We had a number of problems:
1. We wanted to lazily process input EOFs and connection resets only
when the user actually calls `read()`. On Linux however you cannot
unsubscribe from `EPOLLHUP` so that's not possible.
2. Lazily processing input EOFs/connection resets wastes kernel
resources and that could potentially lead to a DoS
3. The very low-level `Selector` interpreted the eventing mechanism's
events quite a lot so the `EventLoop`/`Channel` only ever saw
`readable` or `writable` without further information what exactly
happened.
4. We completely ignored `EPOLLHUP` until now which on Unix Domain
Socket close leads to a 100% CPU spin (issue #277)
Modifications:
- made the `Selector` interface richer, it now sends the following
events: `readable`, `writable`, `readEOF` (input EOF), `reset`
(connection reset or some error)
- process input EOFs and connection resets/errors eagerly
- change all tests which relied on using unconnected and unbound sockets
to use connected/bound ones as `epoll_wait` otherwise would keep
sending us a stream of `EPOLLHUP`s which would now lead to an eager
close
Result:
- most importantly: fix issue #277
- waste less kernel resources (by dealing with input EOFs/connection
resets eagerly)
- bring kqueue/epoll more in line
Motivation:
Callouts to user code allow the user to make calls that re-enter
channel code. In the case of channel closure, we could call out to the
user before the channel knew it was completely closed, which would
trigger a safety assertion (in debug mode) or hit a fatalError
(in release mode).
Modifications:
Reconciled channel state before we call out to user code.
Added a test for this case.
Result:
Fewer crashes, better channel state management.
Motivation:
When an asynchronous connect failed we would crash because we didn't
correctly close the connection to start tearing everything down.
That correctly made the `SocketChannelLifecycleManager` trip.
This was filed as #302.
Modifications:
Correctly deregister and close the connection if an async connect fails.
Result:
Fewer crashes, fixes #302.
Motivation:
NIO 1.3.1 was TSan clean but `master` regressed. In total I could see
two breakages, both proudly presented by `SocketChannelLifecycleManager`.
Sorry :|
1. accessing the `isActiveAtomic` from arbitrary threads:
this was clearly a bug because `SocketChannelLifecycleManager` is
usually held mutable. So `isActiveAtomic` needs to be injected into
`SocketChannelLifecycleManager` rather than constructed there
2. accessing the `ChannelPipeline` through `SocketChannelLifecycleManager`:
same problem as above really.
Modifications:
inject both the ChannelPipeline and the isActiveAtomic in the
appropriate places.
Result:
NIO much more correct, TSan happy
Motivation:
We had a lot of problems with the Channel lifecycle state machine as it
wasn't explicit; this fixes that. Additionally, it asserts a lot more.
Modifications:
- made the Channel lifecycle state machine explicit
- lots of asserts
Result:
- hopefully the state machine works better
- the asserts should guide our way when we work on corner cases as well
Motivation:
The Socket channels and PendingDatagramWritesManager had a bunch of
overhead left over from when they were handling flush promises. This was
only adding to code complexity and perf cost.
Modifications:
Removed the promise label in markPendingWritePoint.
Reworked the code to stop expecting write promises in
PendingDatagramWritesManager.
Result:
Smaller, faster code.
Motivation:
{,Server,Datagram,Base}SocketChannel hold quite a bit of complexity and
also many lines of code, yet they all shared the very same file,
which isn't great.
Modifications:
- moved BaseSocketChannel into its own file
- made the methods that need overriding internal (from `fileprivate`)
Result:
SocketChannel.swift has a lot fewer lines of code.