Migrating an AWS S3 File Gateway Without Breaking User Drive Mappings
A practical migration pattern for replacing a legacy S3 File Gateway, rebuilding cache safely, and moving Windows users with SSM State Manager.
Replacing an AWS S3 File Gateway looks simple until the gateway also provides a business-critical SMB drive to a fleet of Windows desktops. The storage backend may be S3, but users still depend on a stable drive letter, directory-based authentication, local cache behavior, and a predictable cutover.
This article describes a migration pattern that keeps the S3 data intact, replaces the gateway appliance cleanly, and automates the Windows drive remapping with AWS Systems Manager.
The migration goal
The environment had four requirements:
- Replace a legacy gateway before its support deadline
- Keep the existing S3 objects and SMB share name
- Preserve the user-facing drive letter
- Avoid manually reconfiguring every Windows desktop
The final architecture was:
Windows desktops
|
| SMB with directory-based authentication
v
New S3 File Gateway
|
| Private VPC endpoints
v
Existing S3 bucket
The gateway instance and cache were replaced. The S3 bucket remained the source of truth.
Protect the rollback path first
Before changing the old gateway, collect enough evidence to recover or explain the state later:
- Export the gateway, SMB, and file-share configuration
- Disable delete-on-termination for the attached volumes
- Take snapshots of the system and cache volumes
- Wait for every snapshot to complete
- Confirm that CachePercentDirty has reached zero
The dirty-cache check matters more than a simple “running” status. A clean cache means there are no pending local writes that still need to reach S3.
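The dirty-cache check can be scripted rather than read off a console graph. The following is a minimal sketch, assuming the metric is read from CloudWatch in the AWS/StorageGateway namespace with the GatewayId and GatewayName dimensions. The function name, the 30-minute lookback, and the choice to treat missing datapoints as "not clean" are illustrative assumptions, not part of the original runbook. The client is passed in so the logic can be exercised without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def cache_is_clean(cloudwatch, gateway_id, gateway_name, lookback_minutes=30):
    """Return True only if every recent CachePercentDirty datapoint is zero.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/StorageGateway",
        MetricName="CachePercentDirty",
        Dimensions=[
            {"Name": "GatewayId", "Value": gateway_id},
            {"Name": "GatewayName", "Value": gateway_name},
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    # Assumption: no datapoints means the state is unknown, so it is
    # reported as not clean instead of being silently accepted.
    if not datapoints:
        return False
    return all(point["Maximum"] == 0 for point in datapoints)
```

A call might look like `cache_is_clean(boto3.client("cloudwatch"), gateway_id, gateway_name)`, run repeatedly until it returns True before the old gateway is touched.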
Why reusing the old disks was abandoned
An early attempt attached the legacy system and cache disks to a replacement EC2 instance. The network path looked healthy: the expected TCP ports accepted connections. The appliance was not healthy, however:
- The activation endpoint returned an empty HTTP response
- The browser showed ERR_EMPTY_RESPONSE
- SSH exposed a server banner but stalled during key exchange
This is an important diagnostic boundary:
Open TCP port != healthy appliance service
Continuing to debug a legacy boot disk would have increased the outage window without improving the data recovery position. Because the durable files were already in S3 and the dirty cache was zero, a clean gateway deployment was the lower-risk path.
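The boundary between an open port and a healthy service can be made concrete with two separate probes. This sketch uses only the Python standard library; the function names and timeouts are illustrative assumptions. The first probe succeeds as soon as the TCP handshake completes, while the second requires the service to return a parseable HTTP response.

```python
import http.client
import socket


def tcp_port_open(host, port, timeout=3.0):
    """True if a TCP connection can be established (network reachability)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_service_responds(host, port, path="/", timeout=3.0):
    """True only if the service returns a parseable HTTP response."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        # Covers timeouts, resets, and empty or malformed responses.
        return False
    finally:
        conn.close()
```

A host for which `tcp_port_open` is True and `http_service_responds` is False matches the symptom described above: the port accepts connections, but the appliance service behind it is not answering.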
Build a clean gateway against the same S3 data
The successful approach used a current Storage Gateway image and new local disks:
- Deploy a new S3 File Gateway.
- Connect it through the required public or VPC-hosted service endpoint.
- Allocate new cache storage.
- Join the existing directory service.
- Create an SMB share backed by the existing S3 location.
- Reapply logging, alarms, IAM, security groups, and instance protections.
The local cache does not need to be copied to preserve S3 objects. The new gateway rebuilds its cache as files are accessed.
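The cache allocation and share creation steps can be sketched against the Storage Gateway API. This is a minimal illustration, assuming the gateway is already activated and joined to the directory, and that the caller supplies the disk IDs, bucket ARN, IAM role ARN, and share name. The function name and parameter names are hypothetical. `sgw` is expected to behave like `boto3.client("storagegateway")`.

```python
import uuid


def create_share_on_new_gateway(sgw, gateway_arn, cache_disk_ids,
                                bucket_arn, role_arn, share_name):
    """Allocate new cache disks, then create an SMB share on existing S3 data.

    Assumes the gateway has already been joined to the directory service,
    because the share uses directory-based authentication.
    """
    # New, empty disks become the cache; nothing is copied from the old one.
    sgw.add_cache(GatewayARN=gateway_arn, DiskIds=list(cache_disk_ids))

    response = sgw.create_smb_file_share(
        ClientToken=str(uuid.uuid4()),   # idempotency token for this request
        GatewayARN=gateway_arn,
        Role=role_arn,                   # IAM role the gateway uses for S3
        LocationARN=bucket_arn,          # the existing S3 location
        FileShareName=share_name,        # keep the user-facing share name
        Authentication="ActiveDirectory",
    )
    return response["FileShareARN"]
```

Logging, alarms, security groups, and instance protections are deliberately left out of the sketch; they still need to be reapplied as listed above.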
Automate the Windows drive migration with SSM
AWS Systems Manager commands run as SYSTEM, while a mapped network drive belongs to a
user session. A command that runs net use only as SYSTEM may succeed without making
the drive visible to the signed-in user.
The reliable pattern is:
- Use SSM State Manager to install a PowerShell script on each managed Windows node.
- Register the script in an HKLM Run entry.
- Execute the actual mapping when the user signs in.
- Write logs to the user’s %LOCALAPPDATA%, not to a protected system directory.
The login script should be conservative:
- If the drive letter is unused, map the new share.
- If it points to the legacy share, replace it.
- If it is a local disk or an unrelated network share, do nothing.
This prevents a migration script from overwriting an unrelated user configuration.
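The three rules reduce to a small decision table. The deployed login script would be PowerShell; the sketch below expresses only the decision logic in Python so it can be checked in isolation. The names `Action` and `decide_drive_action`, the case-insensitive comparison, and the UNC paths in the usage example are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    MAP = "map"          # drive letter is free: map the new share
    REPLACE = "replace"  # letter points at the legacy share: remap it
    SKIP = "skip"        # anything else: leave the user's setup alone


def _norm(unc: str) -> str:
    # Assumption: UNC paths compare case-insensitively, ignoring a
    # trailing backslash.
    return unc.rstrip("\\").lower()


def decide_drive_action(letter_in_use: bool, remote_path: Optional[str],
                        legacy_unc: str) -> Action:
    """remote_path is None for a local disk, or the UNC target of a mapping."""
    if not letter_in_use:
        return Action.MAP
    if remote_path is None:
        return Action.SKIP  # local disk occupies the letter
    if _norm(remote_path) == _norm(legacy_unc):
        return Action.REPLACE
    return Action.SKIP  # unrelated share, or the new share is already mapped
```

For example, `decide_drive_action(True, r"\\old-gw\share", r"\\OLD-GW\share")` yields `Action.REPLACE`, while a letter bound to a local disk yields `Action.SKIP`.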
Treat offline desktops as a normal state
Many desktop nodes will be stopped when a State Manager association runs. SSM reports
those targets as Undeliverable, which is expected for offline instances.
Use a scheduled association so that configuration is retried after a desktop starts. The user communication should include:
- A realistic propagation window based on the association interval
- The expected drive label and path
- A sign-out and sign-in fallback
- A request for the execution time, screenshot, and local log if the mapping still fails
This turns an infrastructure rollout into an operationally supportable user change.
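A scheduled association can be described as a request to the SSM `CreateAssociation` API. The sketch below only builds the request, assuming desktops are selected by a tag and that the installer is delivered through the AWS-RunPowerShellScript document. The association name, tag key and value, and the helper name are hypothetical. The 30-minute floor reflects the minimum rate interval State Manager accepts.

```python
def build_association_request(script_commands, tag_key, tag_value,
                              interval_minutes=30):
    """Build keyword arguments for ssm.create_association (not sent here)."""
    if interval_minutes < 30:
        # State Manager rate expressions have a 30-minute minimum.
        raise ValueError("interval_minutes must be at least 30")
    return {
        "Name": "AWS-RunPowerShellScript",
        "AssociationName": "install-drive-mapping-script",  # hypothetical
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        # Re-applied on a schedule, so desktops that were stopped during
        # one run are picked up by a later one.
        "ScheduleExpression": f"rate({interval_minutes} minutes)",
        "Parameters": {"commands": list(script_commands)},
    }
```

The result could be passed as `boto3.client("ssm").create_association(**request)`. The chosen interval is also the number to quote to users as the propagation window.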
Validate the complete path
Do not stop after the gateway reports RUNNING. Validate from a normal user desktop:
echo gateway-test > E:\gateway-test.txt
type E:\gateway-test.txt
del E:\gateway-test.txt
Also verify:
- The expected drive appears in File Explorer
- Existing folders are visible
- Create, read, update, and delete operations work
- The SMB session points to the new gateway
- Gateway and file-share logs are receiving events
- Availability and dirty-cache alarms are healthy
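The create, read, update, and delete check can also be run as one repeatable probe from a user session. This is a minimal sketch using only the standard library; the function name is an assumption, and the probe file name mirrors the manual test above. It verifies file operations only and does not replace the other checks in the list.

```python
from pathlib import Path


def crud_round_trip(root):
    """Create, read, update, and delete a probe file under `root`."""
    probe = Path(root) / "gateway-test.txt"
    probe.write_text("gateway-test\n")                        # create
    created = probe.read_text() == "gateway-test\n"           # read
    probe.write_text("gateway-test-updated\n")                # update
    updated = probe.read_text() == "gateway-test-updated\n"   # read again
    probe.unlink()                                            # delete
    return created and updated and not probe.exists()
```

Calling `crud_round_trip("E:\\")` as a normal user exercises the mapped drive rather than the gateway's own view of its health.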
Retire the old stack in dependency order
After the new path is verified, remove the old resources in a controlled sequence:
- Delete the old SMB file share without forcing deletion.
- Delete the old gateway.
- Terminate the old EC2 instance.
- Inventory unattached EBS volumes.
- Delete only volumes that are confirmed unused and have the required snapshots.
Keep the new gateway’s active system and cache volumes explicitly excluded from cleanup.
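The volume inventory and the snapshot precondition can be combined into one filter. The sketch below is illustrative: it lists unattached volumes, skips an explicit exclusion set, and keeps only volumes that have at least one completed snapshot. The function name is hypothetical, `ec2` is expected to behave like `boto3.client("ec2")`, and pagination is ignored for brevity. It returns candidates only and deletes nothing.

```python
def deletable_volumes(ec2, protected_volume_ids):
    """Return IDs of unattached volumes that have a completed snapshot."""
    protected = set(protected_volume_ids)
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]  # unattached
    )["Volumes"]

    candidates = []
    for volume in volumes:
        volume_id = volume["VolumeId"]
        if volume_id in protected:
            continue  # explicit exclusion, even if it looks unattached
        snapshots = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[
                {"Name": "volume-id", "Values": [volume_id]},
                {"Name": "status", "Values": ["completed"]},
            ],
        )["Snapshots"]
        if snapshots:
            candidates.append(volume_id)
    return candidates
```

Reviewing the returned list by hand before any deletion keeps the "confirmed unused" requirement a human decision.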
Reusable lessons
- S3 is the durable data layer; gateway cache is replaceable when dirty writes are zero.
- Network reachability is not application health.
- A clean appliance replacement can be safer than reviving an old boot disk.
- User-visible drive mappings must be created in the user session.
- SSM scheduled associations handle intermittent desktop availability better than a one-time command.
- Monitoring, user communication, and cleanup order are part of the migration design, not post-migration chores.