We are designing a system that offers customers the chance to buy a Playstation 5 from the official Sony webshop. Since the demand is high due to legitimate customers wanting to buy a console as well as people using bots to make bulk purchases in order to sell the items later for a profit on Ebay, we want to offer everyone the chance to buy a Playstation 5 for a limited time on the basis of a first-come-first-served order.
Estimates
- Daily active users: 1,000,000
- Maximum queue capacity: 200,000
- Maximum simultaneous connections per server: 50,000
- Maximum number of servers to deploy: 20
Functional requirements
- Users are allowed to request access to the webshop
- Once the waiting time is up, the user will be automatically redirected to the webshop.
- Once the allocated shopping time is up or the user has disconnected, the user should be removed from the list of active users and another user in the waiting queue will be given access.
High-level design
API
1. Requesting access to the webshop
POST: /users/{userId}/webshop?access=true
If the request is successful (status-code is 200), the websocket connection will be established and the server will response with the approximate waiting duration. If the list of active users and the waiting queue are both full, the status-code 429 (Too Many Requests) will be returned.
2. Monitoring user connections
GET: /users/{userId}/actuator/health
This endpoints notifies the server that the user is still active.
When the user ceases to send this request for a period longer than a pre-configured default (30 seconds), the server assumes that user has left the webshop, therefore the user will be replaced with another one from the waiting queue.
This endpoint might seem redundant since we can rely on the state of the websocket connection to determine if the user is still active or not, however this choice was made to avoid storing the user ID in the list of active sessions.
With the API in place, let us discuss the possible use-case scenarios.
3. Scenarios
3.1. Nominal scenario: there are available spots in the webshop
The user requests access to the webshop, the server sees that there are spots available by checking the size of the list of active users in the Redis cluster. The server then establishes the websocket connection, forwards the user's request to the webshop and returns the response HTML page.
The user shops for a certain duration. When the maximum shopping duration is elapsed, the connection is closed and the user is redirected back to the home page.
3.2. The user is added to the waiting queue
The user tries to access the webshop, the server establishes the websocket connection with a response indicating the approximate waiting duration (let us assume that the waiting queue is empty). The server also adds the user to the waiting queue. When a slot becomes available, the waiting user will be added to the list of active users. On the next health check call, the user will be redirected to the webshop.
3.3. The user is added to the waiting queue but later on he disconnects
The disconnection happened due to connectivity issues (ex. slow rate of health check calls) or simply the user has left the webpage. Either way, the server removes the user from the waiting queue.
3.4. The waiting queue is full
In this scenario, a response code 429 is returned.
Calculating the EWT (Estimated Waiting Time)
For each user, we store the entrance timestamp and the eviction timestamp, where the user will be evicted from the webshop. When the list of active users is full and a new user comes in, the EWT will be calculated based on the user's position in the waiting queue and the eviction timestamps of the active users. The gist of the calculation logic is as follows:
void calculateMaximumEWT() {
// activeUsers is a list sorted by evictionTimestamp in a descending order
int maximumEWT = 0; // in seconds
if (activeUsers.size() == MAX_ACTIVE_USERS) {
maximumEWT = activeUsers[Math.min(MAX_ACTIVE_USERS - 1, position)].evictionTimestamp()
+ (position / (float) waitingQueue.size()) * MAX_SHOPPING_TIME_SECONDS - now();
}
return maximumEWT;
}
The EWT is an upper-bound, it is not exact but precise enough for our requirements. In any case, a waiting user will be polled into the webshop as soon as a slot becomes available.
Scenario 1: an exact number of users started shopping at the same time and an additional user was added to the waiting queue (at position = 0)
The returned value of the calculateMaximumEWT function will be activeUsers[0].evictionTimestamp() - now() = MAX_SHOPPING_TIME_SECONDS
Scenario 2: the list of active users is not full and a user has requested access to the webshop The return value of the calculateMaximumEWT function will be 0.
Scenario 3: the list of active users is empty and a user has requested access to the webshop The return value of the calculateMaximumEWT function will be 0.
Handling failures
Since the user sessions are stored in memory, there is a risk of data loss when one or more Redis nodes go down. To avoid this situation, we have to trade off speed (a small percentage) for reliability.
Our system is special in the sense that our persisted data is ephemeral, in other words, when the user session expires, we can completely discard the data. Thus in theory if we increase the size of our Redis cluster and enable data replication, then our system will work as expected in the majority of cases. However that comes at the cost of extra compute time, storage capacity and network bandwidth.
Luckily Redis offer a few approaches for persisting data. Since we want to lower the probability of data loss to the bare minimum, we can use AOF (Append-Only File). AOF tells Redis to persist every write operation in an append-only file. When a node is restarted, it can replay the sequence of write operations and restore the latest state. To manage the ever-increasing size of the append-only file, we will run the BGREWRITEAOF
in our cluster which starts a background process that optimizes the append-only file size by splitting it into small chunks.
Conclusion
We have a presented a possible solution to the problem of creating a queue-based online webshop. As always, there are lots of other solutions to consider and each one could be considered "the best" based on the supplied requirements. The chosen solution in this article favors simplicity and ease of understanding. Simplicity is important in order to develop a clear conceptual representation of the system and to reason about its behaviour later when it is running on production and serving millions of users. Designing a system iteratively, while it may not always be possible, is highly beneficial as long as the team has time and velocity to iterate fast and improve things.
[References]