[In depth] Tencent operations: hands-on lessons from rescuing 8 million users

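I. 8 million users about to lose service
First, some background. Last year the hardware in one of our IDCs had aged to the point that the whole facility had to be decommissioned, which forced a passive migration of the business running there. The migration involved more than 2,000 machines and more than 150 business modules. The part owned by the Mobile QQ operations team ranked first in both machine count and business volume, and the schedule was tight: after a final period of operation, the IDC would have its power and network cut.
Anyone who has run a large-scale business migration knows that it costs time, effort, and attention, generates heavy communication, assessment, and implementation overhead, and carries the risk of degrading service quality along the way.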
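The gradual end of the PC internet era and the full shift to the mobile internet era also brought the operations team many new challenges and pressures, such as:
Early app versions were highly varied and had to adapt to a huge range of phone models, and maintainability differed widely between architectures.
Mobile network environments are more complex, so nearest-access routing needs finer granularity.
Business architectures face higher demands for fault tolerance, disaster recovery, service quality, and user dispatch strategy.
……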
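The Mobile QQ operations team ran into these same problems during the migration, because the list of businesses to migrate included Mobile QQ for feature phones (non-smartphones).
From 2004 to 2009, smartphones were not yet widespread and platforms had not yet consolidated. Most feature phones in use ran on one of three platforms, MTK, Symbian, or Kjava, across dozens of handset models, and the QQ versions running on them were just as varied.
The feature-phone QQ versions were built by different development teams at different sites. If they had one thing in common, it was that they could not be dispatched, could not be dispatched, could not be dispatched (important things get said three times).
Had they supported dispatch, the team could have moved these 8 million users to the service cluster in the new data center within minutes, with a flick of the wrist.
Readers who want to know how Mobile QQ handles dispatch and login can refer to the article "[Wow] Tencent: a live drill with 300 million user sessions, how was it made silky smooth?" (《【惊】腾讯:3亿人次的实战演习,如何做到丝般顺滑?》); the details are not repeated here.
In other words, these 8 million non-dispatchable users could not go through a normal business migration and would lose service once the IDC was powered off.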
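II. Challenges and choices
In outline, Mobile QQ serves a user as follows: the client obtains a VIP (virtual IP), uses it to connect to our backend services, and the backend returns the result of each request to the user.
Given this flow, the first thing to guarantee is that users can connect at all; only then can they reach the rest of our backend services.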
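1. What key challenges did we face?
Hardcoded VIPs: Among the many QQ versions of the 2004 to 2009 feature-phone era, some had to hardcode the VIPs used for user access into the client, because of limits in the platform framework and in the 2G network environment of the time. The hardcoded VIPs also differed by handset model, carrier, and manufacturer. The backend access service behind these VIPs was tightly coupled to the network environment of the IDC where it was deployed. Feature-phone QQ supports neither dynamically delivered access lists nor speed-test-based updates of the access VIP list, so we could not use the dispatch system to change where users connect.
Client versions no longer iterated: The MTK, Symbian, and Kjava clients were developed between 2004 and 2009 by different teams at different sites, and had gone more than seven years without a release. With staff turnover, the main developers of this long-tail business could no longer be found.
Version coverage: Even if we ignored the cost in people and time and rebuilt the many QQ versions for all three feature-phone platforms, app upgrades still depend on the user choosing to upgrade and cannot be forced through manufacturer channels. Reaching all 8 million users with a new version would take an extremely long time.
Urgency: The final deadline for the migration was fixed and could not be changed, because once it passed the IDC would cut power and network and stop providing basic services.
Faced with these challenges we seemed to be stuck: we could neither dispatch users nor push a new version, yet the 20+ VIPs that handled user access had to keep working reliably, or 8 million users would lose service.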
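To make the difference between a hardcoded and a dispatchable client concrete, here is a minimal Python sketch. It is an illustration only: the class names, the `fetch` callback, and the addresses (taken from the RFC 5737 documentation ranges) are assumptions, and nothing here reflects QQ's actual client or dispatch protocol.

```python
from typing import Callable, List, Optional, Set

# Hypothetical addresses taken from the RFC 5737 documentation ranges.
OLD_IDC_VIPS = ["192.0.2.10", "192.0.2.11"]
NEW_IDC_VIPS = ["198.51.100.20"]


class HardcodedClient:
    """Legacy-style client: the access list is baked in at build time."""

    def access_list(self) -> List[str]:
        return list(OLD_IDC_VIPS)


class DispatchableClient:
    """Client that asks a dispatch service for its access list and only
    falls back to a built-in list when the service returns nothing."""

    def __init__(self, fetch: Callable[[], Optional[List[str]]],
                 fallback: List[str]) -> None:
        self._fetch = fetch
        self._fallback = fallback

    def access_list(self) -> List[str]:
        fetched = self._fetch()
        return list(fetched) if fetched else list(self._fallback)


def usable(vips: List[str], live: Set[str]) -> List[str]:
    """Keep only the addresses that are still being served."""
    return [vip for vip in vips if vip in live]


# After the old IDC is powered off, only the new addresses are live.
live_after_shutdown = set(NEW_IDC_VIPS)

legacy = HardcodedClient()
modern = DispatchableClient(fetch=lambda: NEW_IDC_VIPS, fallback=OLD_IDC_VIPS)

print(usable(legacy.access_list(), live_after_shutdown))  # []
print(usable(modern.access_list(), live_after_shutdown))  # ['198.51.100.20']
```

The legacy client is left with no usable address once the old addresses stop being served, which is the situation described above: without a new client release, the only remaining lever is to keep the old addresses alive.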
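2. 8 million users vs. overall figures
How large a share of QQ's massive user base are these 8 million users? Two operating metrics give a measure. By DAU (daily active users), QQ currently has 830 million DAU, so 8 million is about 1%. By peak concurrent users, QQ overall peaks at 230 million concurrent users per day, while the 8 million feature-phone users peak at 1.75 million, which is less than 1%.
Judged by these two figures, a share averaging under 1% has essentially no effect on the overall numbers.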
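The two shares can be checked with a few lines of Python. The input figures are the ones quoted above (8 million of 830 million DAU, 1.75 million of 230 million peak concurrent users); the helper name `share` is just for illustration.

```python
def share(part: float, total: float) -> float:
    """Return part/total expressed as a percentage."""
    return 100.0 * part / total


dau_share = share(8_000_000, 830_000_000)
online_share = share(1_750_000, 230_000_000)

print(f"DAU share: {dau_share:.2f}%")                 # DAU share: 0.96%
print(f"Peak concurrent share: {online_share:.2f}%")  # Peak concurrent share: 0.76%
```

Both values come out below 1%, consistent with the conclusion that these users barely register in the overall figures.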
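3. Our choice
Taken together, the challenges and the data left the operations team with a hard question: how could we solve this problem, or should we give up on these 8 million users?
To be honest, we did consider giving them up. It would have been the cheapest option for the team, arguably a once-and-for-all fix, and the industry has precedents for simply posting a notice and ending service: "Dear user, due to XX, service for users of version XX will end in month XX of year XX."
Besides, switching to a smartphone costs the user little, and they would get better service quality afterwards!
Note: because of handset hardware and 2G network limits, the feature-phone clients generally offer only basic messaging services.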
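The option of abandoning the 8 million users was rejected in the very first discussion. The reason is simple: it conflicts with the values the team has always held, namely that the value of operations lies in using technical solutions to safeguard service quality so that users get a good experience.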


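Every problem and every failure encountered in operations pushes us to think a little more and do a little more, steadily honing our operations skills in service of users. Giving up on these 8 million users would not only have thrown away a chance for the operations team to grow, it would also have hurt the feelings of users who have always trusted QQ.
Having decided to rescue these 8 million users, how should we go about it?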
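III. The Great Shift of Heaven and Earth
1. Surfacing the plan
As described above, the core problems were:
The VIPs are tightly coupled to the network environment of the IDC being decommissioned;
The clients hardcode the VIPs and do not support updating the access VIPs from the cloud;
The crux was that the VIPs used for user access had to keep serving and could not go down when the IDC closed, even though they were tightly coupled to that IDC's network environment. At this point a bold plan was brought back for re-evaluation. We say "re-evaluation" because it had received an outline assessment early on, and the team had judged it very hard to carry out because it depended on the three major external carriers and on too many factors outside our control.
What was the plan? IP (network) relocation. In outline, we would move the network environment that the VIPs depend on, as a whole and 1:1, from the IDC being decommissioned into the new IDC. Only by moving the entire network environment could we ensure that the access VIPs would not be interrupted. To exaggerate a little, the overall difficulty of executing this plan was summed up by a picture that accompanied the original post.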

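What were the foreseeable key difficulties of this plan?
The new IDC and the one being closed were built many years apart, so their infrastructure and network planning differed enormously, and the two network segments containing the VIPs had to be moved as a whole into the new IDC.
We had to negotiate separately with each of the three major carriers and win agreement from all of them to cooperate on testing and migration.
Network security and load-balancing policies also had to be moved to the new IDC.
Internally, it required joint work across many departments in different business groups.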
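One planning check implied by the first difficulty is confirming that every hardcoded VIP really falls inside the network segments being moved. A minimal sketch with Python's standard `ipaddress` module follows; the two prefixes and the VIP list are hypothetical values from the RFC 5737 documentation ranges, not the real segments.

```python
import ipaddress
from typing import Iterable, List

# Hypothetical stand-ins for the two network segments to be relocated.
MIGRATED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def uncovered_vips(vips: Iterable[str], prefixes) -> List[str]:
    """Return the VIPs that fall outside every relocated prefix."""
    missing = []
    for vip in vips:
        addr = ipaddress.ip_address(vip)
        if not any(addr in net for net in prefixes):
            missing.append(vip)
    return missing


# Hypothetical inventory of VIPs collected from old client builds.
inventory = ["192.0.2.10", "198.51.100.7", "203.0.113.5"]
print(uncovered_vips(inventory, MIGRATED_PREFIXES))  # ['203.0.113.5']
```

Any address reported here would stop working after the old IDC closes, so in a real inventory an empty result is what you would want before scheduling the cutover.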
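2. Pushing forward
Once the outline plan was settled, we brought in colleagues from different departments to work out the details and the rollout, and worked with our business-relations colleagues to seek the carriers' support. After many rounds of discussion, the carriers fortunately agreed to support and cooperate with the IP (network) relocation. The project involved a great deal of communication and coordination.
With the carriers' support secured, the project went fairly smoothly. We then settled the key milestones of the detailed plan in turn:
Deploy the full set of feature-phone VIP backend access services in the new IDC, ready for the switchover;
Agree the carrier switchover plan and switchover time;
Draw up a contingency plan;
Push a notice to users to inform them;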
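At 2 a.m. on a dark and windy night, we began the network address switchover. Our mood beforehand was captured by an image in the original post.
The carrier-side switchover could not be done gradually (no gray release); it had to be a single all-at-once cutover. It also involved a large number of network-level configuration changes. If it failed, switching back to the previous environment would be time-consuming and laborious, and might cause other problems.
We were not 100% confident in the overall process: the critical operations were carried out on the carrier side and were a black box to the Tencent team.
Fortunately, the switchover went smoothly that night. After the VIPs were relocated, users automatically logged in again and messages were sent and received successfully. Service was fully normal, three months of effort had paid off, and the 8 million users no longer faced the risk of losing service!
The successful execution of this project kept Mobile QQ working for 8 million users. The operations team got through this challenge safely, but the difficulty of migrating long-tail businesses remains a problem operations has to face.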
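Because an all-at-once cutover leaves no gradual rollout to lean on, a quick reachability check of every access endpoint right after the switch is a natural safeguard. The sketch below is an assumption about how such a check could look, not the team's actual tooling: the endpoints and port are hypothetical, and the probe is injectable so the logic can be exercised without touching a network.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_vips(endpoints: Iterable[Endpoint],
               probe: Callable[[str, int], bool] = tcp_probe) -> Dict[Endpoint, bool]:
    """Probe every endpoint and record whether it answered."""
    return {endpoint: probe(*endpoint) for endpoint in endpoints}


def all_healthy(results: Dict[Endpoint, bool]) -> bool:
    """True only if at least one endpoint was checked and none failed."""
    return bool(results) and all(results.values())


# Hypothetical endpoints; the fake probe simulates one VIP failing after cutover.
endpoints = [("192.0.2.10", 8080), ("192.0.2.11", 8080)]
results = check_vips(endpoints, probe=lambda host, port: host != "192.0.2.11")
print(all_healthy(results))  # False
```

A failing result would be the trigger for the contingency plan; a TCP connect alone does not prove that login and messaging work, so a real check would also exercise those paths.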
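For this reason, our operations platform planning drew on DevOps thinking and put into practice an operations approach for this class of problem:
Manage operations objects effectively, including standardized definitions of the objects and of the operations on them. Record the resource objects involved in day-to-day operations in a configuration system and manage them with effective scenario-based tools, so that every operation is quantifiable, manageable, and traceable, which keeps work efficient and passes experience on.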


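Take non-functional standards seriously. Historical issues are what made this long-tail migration painful. If operations defines non-functional management standards at the start of a business and sets enforceable standardization requirements (such as no hardcoded IPs), the occurrence of such legacy problems can be expected to drop sharply.
Plan standard operating procedures, and promptly turn repetitive, low-value, painful work into tools or automation.
This is the operations philosophy promoted by the 织云 (Zhiyun) platform. Zhiyun provides standardized operation workflows combined with management of operator roles and business permissions, so that a change yields the same result no matter who initiates it, laying a solid foundation for automation.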
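A requirement such as "no hardcoded IPs" only helps if it can be checked automatically. As a minimal sketch of one way to enforce it, the Python function below flags IPv4 literals in source text. The function name and sample are hypothetical, this is not a Zhiyun feature, and a dotted four-part version string such as "1.2.3.4" would be flagged too, so a real check would need an allow-list.

```python
import ipaddress
import re
from typing import List, Tuple

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def find_hardcoded_ips(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, literal) pairs for valid IPv4 literals in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            literal = match.group(1)
            try:
                ipaddress.IPv4Address(literal)
            except ValueError:
                continue  # out-of-range octets, e.g. 999.1.1.1
            hits.append((lineno, literal))
    return hits


sample = (
    'ACCESS_HOST = "192.0.2.10"\n'
    'VERSION = "1.2.3"\n'
    'BROKEN = "999.1.1.1"\n'
    'ACCESS_NAME = "access.example.com"\n'
)
print(find_hardcoded_ips(sample))  # [(1, '192.0.2.10')]
```

Run as part of code review or a build step, a check like this catches the pattern at the start of a product's life, when fixing it is cheap.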
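IV. Summary
The project to rescue 8 million users is inseparable from the nature of Tencent's business, and not every operations team will get the chance to face a problem like this. But the author believes one thing is common to all operations people: the attitude they bring when facing difficulty.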
Material sourced from the web.
