Due to an unexpected influx of new traffic, several servers in the MQTT notification cluster became overwhelmed and started timing out, causing intermittent connection errors. The incident lasted approximately 2.5 hours. We have resolved the issue and are taking steps to prevent it from recurring.
We sincerely apologize for any inconvenience caused by this incident.
Thanks for your patience and understanding as we worked to investigate and resolve this issue ASAP.
Thanks for your patience and understanding as we worked to investigate and resolve this issue ASAP.
The primary database server has fully completed its resynchronization following an AWS EC2 / EBS outage in the us-east-1b region. The impact of this outage was 10 minutes of Pushy API downtime, followed by 14 hours of performance degradation while the primary was resynchronizing. During this time, 99% of API requests were served in 10-50ms, and the remaining 1% timed out.
Thanks for your patience and understanding as we worked to investigate and resolve this issue ASAP.
Five hours later, when AWS restored connectivity to EBS volumes within the us-east-1b region, the primary database went out of sync and is currently undergoing a full resync. This is causing slightly degraded performance due to increased resource contention between serving API requests and resyncing the primary database server.
The Pushy API is currently operational, although there may be temporary periods of degraded performance as latency spikes. The resync is expected to complete within the next 2 hours, after which API latency should return to its expected average of ~15ms.
We deeply apologize for the impact and inconvenience this has had on your reliable notification delivery experience. Thanks for your patience and understanding as we continue working to resolve this issue ASAP. We will post an update when it is fully resolved.
After 10 minutes of API downtime, the Pushy API is now back in working order. We apologize for any impact this has had on your reliable notification delivery experience. Thanks for your patience and understanding as we worked to investigate and resolve this issue ASAP.
The API has returned to full operational capacity.
We apologize for any impact on your reliable notification delivery experience as a result of this incident. Thanks for your patience and understanding as we worked to resolve this issue.
Intermittent connection errors occurred over a period of 30 minutes. We apologize for any impact on your reliable notification delivery experience as a result of this incident. All queued notifications are now being delivered as affected devices automatically reconnect to our service. Thanks for your patience and understanding as we worked to resolve this issue ASAP.
We apologize for any impact on your reliable notification delivery experience as a result of this incident. Thanks for your patience and understanding as we worked to resolve this issue.
To remediate the incident, we immediately deployed an update that increased the notification collapse key limit from 32 to 100 characters. API error rates are back to normal.
An update: we have also identified an issue with APNs rejecting notifications with long collapse keys, impacting iOS notification delivery with the error "InvalidCollapseId". We have rolled out a fix, and iOS notification delivery is now working as expected.
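For illustration only, the sketch below (TypeScript, with hypothetical names; not our production code) shows one way such a safeguard can work: collapse keys are accepted up to the new 100-character limit, then shortened further before being forwarded to APNs, whose apns-collapse-id header allows at most 64 bytes. Whether the actual fix truncates or handles long keys differently is an implementation detail; this sketch assumes truncation.

    // Minimal sketch, not our production code: cap collapse keys at the limit each
    // downstream service accepts before forwarding the notification.
    const PUSHY_COLLAPSE_KEY_LIMIT = 100; // our API limit after this incident
    const APNS_COLLAPSE_ID_LIMIT = 64;    // apns-collapse-id limit per Apple's documentation

    // Truncate rather than reject, so delivery is never blocked by an overlong key.
    // (Simplified: counts characters, whereas APNs measures bytes.)
    function normalizeCollapseKey(key: string, limit: number): string {
      return key.length > limit ? key.slice(0, limit) : key;
    }

    // Hypothetical usage: enforce our API limit, then shorten again for APNs.
    const incomingKey = "chat_message_" + "x".repeat(120);
    const collapseKey = normalizeCollapseKey(incomingKey, PUSHY_COLLAPSE_KEY_LIMIT);
    const apnsCollapseId = normalizeCollapseKey(collapseKey, APNS_COLLAPSE_ID_LIMIT);
    console.log(collapseKey.length, apnsCollapseId.length); // 100 64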
The API has returned to normal operational capacity.
Thanks for your patience and apologies for the e-mail spam.
The Pushy Pro API & notification service experienced high latency and partial downtime over the past 8 hours due to a large-scale, sophisticated Denial of Service attack on our API. After further investigation and multiple mitigation attempts, we have successfully blocked the attack and put safeguards in place to prevent similar incidents in the future.
The incident is now resolved and service is restored. Once again, we truly apologize for any inconvenience and negative impact caused by this incident.
We thank you for choosing Pushy. Please contact us if you have any questions or concerns.
Update: Maintenance completed. Avg latency is now ~15ms.
We have stabilized the service and are attempting various latency optimization strategies. Further updates will be posted if there are any new developments. We apologize for any inconvenience caused.
Service has now been restored. At this time we are continuing to monitor the attack, as it is still ongoing. We are also investigating ways to detect and block Denial of Service attacks automatically to prevent similar incidents in the future; an illustrative sketch of one such approach follows below.
Further updates will be posted if there are any new developments. We truly apologize for any inconvenience caused by this Denial of Service attack.
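For context, the following is a purely illustrative sketch (TypeScript, hypothetical names and limits) of one common building block for automatic mitigation, a per-IP token bucket; it does not describe our actual defenses.

    // Illustrative per-IP token bucket (hypothetical names and limits): each IP gets
    // a refillable budget of requests, and anything beyond it is rejected.
    const RATE = 50;   // tokens refilled per second, per IP
    const BURST = 200; // maximum bucket size

    interface Bucket { tokens: number; last: number; }
    const buckets = new Map<string, Bucket>();

    function allowRequest(ip: string, now = Date.now()): boolean {
      const bucket = buckets.get(ip) ?? { tokens: BURST, last: now };
      // Refill in proportion to elapsed time, capped at the burst size.
      bucket.tokens = Math.min(BURST, bucket.tokens + ((now - bucket.last) / 1000) * RATE);
      bucket.last = now;
      const allowed = bucket.tokens >= 1;
      if (allowed) bucket.tokens -= 1; // spend a token; otherwise reject the request
      buckets.set(ip, bucket);
      return allowed;
    }

    // Hypothetical usage: drop the request (or return 429) when the bucket is empty.
    console.log(allowRequest("203.0.113.7")); // true until the burst is exhausted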
After attempting several remediation strategies, we have been able to apply a solution that appears to be working quite well. We will be closely monitoring latency over the next few days to confirm that the issue is indeed resolved.
After attempting several remediation strategies, we have been able to find a solution that appears to be working quite well, by utilizing a Redis caching mechanism to drastically reduce load on the database cluster. We will be closely monitoring latency over the next few days to confirm that the issue is indeed resolved.
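To give a sense of that approach, here is a minimal cache-aside sketch (TypeScript, assuming the ioredis client and a hypothetical loadDeviceFromDb lookup); our actual caching layer differs in its details.

    // Cache-aside sketch: serve reads from Redis when possible, fall back to the
    // database and populate the cache on a miss. Names are hypothetical.
    import Redis from "ioredis";

    const redis = new Redis();  // connects to localhost:6379 by default
    const TTL_SECONDS = 60;     // hypothetical cache lifetime

    // Hypothetical database lookup, stubbed out for this sketch.
    async function loadDeviceFromDb(deviceId: string): Promise<string> {
      return JSON.stringify({ id: deviceId });
    }

    async function getDevice(deviceId: string): Promise<string> {
      const cached = await redis.get(`device:${deviceId}`);
      if (cached !== null) return cached;               // cache hit: no database query
      const fromDb = await loadDeviceFromDb(deviceId);  // cache miss: query the database
      await redis.set(`device:${deviceId}`, fromDb, "EX", TTL_SECONDS);
      return fromDb;
    }

Reads are served from Redis whenever possible, so the database cluster only sees cache misses.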
After attempting several remediation strategies, we have been able to find a solution that appears to be working quite well. We will be closely monitoring latency over the next few days to confirm that the issue is indeed resolved.
No API request was dropped during this time. API latency is now back to normal.
API latency and notification delivery are now back to normal.
API latency is back to normal. No API requests were dropped or rejected during this period.
API latency is back to normal. No API requests were dropped or rejected during this period.
The issue was caused by a single faulty server in our notification service stack, which affected latency. We have replaced the server, and latency has returned to normal.
API latency is back to normal. No API requests were dropped or rejected during this period.
API latency is back to normal. No API requests were dropped or rejected during this period.
API latency is back to normal.
API latency is back to normal.
API latency is back to normal. No API requests were dropped or rejected during this period.
We will continue to monitor the API servers' memory utilization to verify that the leak has been mitigated.
The API servers' memory has been increased substantially to prevent this from happening again in the future.
Within a few moments of the incident, a secondary database server took over as the primary and started accepting read/write queries from the API. It took the secondary database a few minutes to finish processing the queued requests and bring latency back down to normal ranges.
Once we made sure the secondary server was stable and latency had become tolerable again, we increased the primary server's available memory and began a full data resync, which took around 40 minutes to complete.
We will be constantly monitoring the database servers' resource utilization to prevent a similar incident from happening in the future.
No downtime was incurred, but notification delivery speed may have been affected. All push notifications sent during this time have been persisted successfully.
Notifications to these devices have been persisted and will be delivered as soon as these devices reconnect to our network.
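For context on how this queued delivery behaves in general, here is a simplified store-and-forward sketch (TypeScript, all names hypothetical); it is not our actual delivery pipeline.

    // Simplified store-and-forward (hypothetical names): notifications for offline
    // devices are queued, then flushed once the device reconnects.
    const pending = new Map<string, string[]>(); // deviceId -> queued payloads

    function deliver(deviceId: string, payload: string): void {
      console.log(`deliver to ${deviceId}:`, payload); // placeholder transport
    }

    function sendOrQueue(deviceId: string, payload: string, online: boolean): void {
      if (online) {
        deliver(deviceId, payload);
        return;
      }
      const queue = pending.get(deviceId) ?? [];
      queue.push(payload); // persist until the device comes back online
      pending.set(deviceId, queue);
    }

    function onDeviceReconnect(deviceId: string): void {
      for (const payload of pending.get(deviceId) ?? []) deliver(deviceId, payload);
      pending.delete(deviceId);
    }

    // Hypothetical usage: queue while offline, flush on reconnect.
    sendOrQueue("device-123", "hello", false);
    onDeviceReconnect("device-123"); // prints: deliver to device-123: hello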
The latency was caused by a sudden burst of traffic over a short period of time.
The underlying issue has been rectified, and we have taken steps to prevent this kind of situation in the future.
Update: The maintenance is complete.