Media Center
直播数据分析,不要等到大并发时才发现NGINX日志是坨翔
客服接到客户投诉,为毛直播看不了?为毛一直在卡?比较常见的答案是:农民工又挖断光纤了,黑客又攻击系统了。有毛线办法?看过FMS日志的都知道,里面没有任何有用的信息。 俺们公司的大码农当年在某CDN时,就下决心再也不给农民工朋友摸黑,也不动用黑客攻击的大名。光纤挖断是有的,黑客攻击也是有的,问题是光纤没有断,黑客没有光顾时,卡和流不能看的概率也很大,服务器日志能给予原因吗?不能,这就是为何FMS日志,或其他不能告知原因的日志,球一样的原因了。 CDN会用到多台服务器,或者说是服务器集群,再加上调度那就是多个集群之间的调用和依赖运行。SRS(Simple Rtmp Sever)是流媒体服务器,因此我只考虑流媒体系统的日志应该怎么设计,才能发现问题所在? 先考虑常见系统的日志,FMS的日志是这样的: 2015-03-2412:23:58 3409 (s)2641173 Accepted a connection from IP:192.168.1.141, referrer:http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23,pageurl: http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935- 702111234525315439 3130 3448 normal livestream - - rtmp://192.168.1.185:1935/live/livestream rtmp://192.168.1.185:1935/live/livestream - flv - - 0 - 0 0 - - http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935 -1 -1.000000 - crtmpd的日志确实很详细,可惜啥也看不粗来,特别是几千人在线的服务器: /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/iohandlermanager.cpp:120Handlers count changed: 15->16 IOHT_TCP_CARRIER /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/tcpacceptor.cpp:185Client connected: 192.168.1.141:54823 -> 192.168.1.173:1935 /home/winlin/tools/crtmpserver.20130514.794/sources/applications/appselector/src/rtmpappprotocolhandler.cpp:83Selected application: flvplayback (live) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:246Protocol CTCP(17) <-> TCP(18) <-> [IR(19)] unregistered fromapplication: appselector /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257Stream NR(5) with name `` registered to application `flvplayback` from protocolIR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268Stream NR(5) with name `` unregistered from application `flvplayback` fromprotocol IR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257Stream NR(6) with name `` registered to application `flvplayback` from protocolIR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/protocols/rtmp/basertmpappprotocolhandler.cpp:1043Play request for stream name `livestream`. Start: -2000; length: -1000 /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268Stream NR(6) with name `` unregistered from application `flvplayback` fromprotocol IR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257Stream ONR4R(7) with name `livestream` registered to application `flvplayback`from protocol IR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/common/src/utils/buffering/iobuffer.cpp:110Unable to read data. Size advertised by network layer was 262144. Permanenterror: 2 /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/tcpcarrier.cpp:78Unable to read data. 192.168.1.141:54823 -> 192.168.1.173:1935 CTCP(17)<-> [TCP(18)] <-> IR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/iohandlermanager.cpp:130Handlers count changed: 16->15 IOHT_TCP_CARRIER /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/protocols/protocolmanager.cpp:45Enqueue for delete for protocol [IR(19)] /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268Stream ONR4R(7) with name `livestream` unregistered from application`flvplayback` from protocol IR(19) /home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:246Protocol [IR(19)] unregistered from application: flvplayback nginx-rtmp的日志已经有进步了,能按照连接区分: 2015/03/2411:42:01 [info] 7992#0: *3 client connected '192.168.1.141' 2015/03/2411:42:01 [info] 7992#0: *3 connect: app='live' args='' flashver='MAC17,0,0,134'swf_url='http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23'tc_url='rtmp://192.168.1.173:1935/live'page_url='http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935'acodecs=3575 vcodecs=252 object_encoding=3, client: 192.168.1.141, server:0.0.0.0:1935 2015/03/2411:42:01 [info] 7992#0: *3 createStream, client: 192.168.1.141, server:0.0.0.0:1935 2015/03/2411:42:01 [info] 7992#0: *3 play: name='livestream' args='' start=0 duration=0reset=0 silent=0, client: 192.168.1.141, server: 0.0.0.0:1935 2015/03/2411:42:20 [info] 7992#0: *3 disconnect, client: 192.168.1.141, server:0.0.0.0:1935 2015/03/2411:42:20 [info] 7992#0: *3 deleteStream, client: 192.168.1.141, server:0.0.0.0:1935 可惜nginx日志只能知道这个连接的信息,而上下文都不清楚,连接也就这些信息。而SRS(simple rtmp sever)的日志,能知道播放连接对应的推流连接,会在后续打印出连接的摘要信息,并且可以在集群内任意追溯。 假设我知道流名称,或者不知道流名称,反正任意信息,譬如我知道播放的链接会打一个"type=Play"的标记出来,就从这一点开始。假设从源站开始: [winlin@dev6 srs]$ grep -ina"type=Play" objs/srs.origin.log 31:[2014-08-0610:09:34.671][trace][22288][107] client identified, type=Play,stream_name=livestream, duration=-1.00 发现有个107的链接播放了源站信息,查看它的日志: [winlin@dev6 srs]$ grep -ina"\[107\]" objs/srs.origin.log 27:[2014-08-0610:09:34.589][trace][22288][107] RTMP client ip=127.0.0.1 28:[2014-08-0610:09:34.591][trace][22288][107] complex handshake success 29:[2014-08-0610:09:34.631][trace][22288][107] connect app, tcUrl=rtmp://dev:1935/live,pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935,swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23,schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=(obj) 30:[2014-08-06 10:09:34.631][trace][22288][107]edge-srs ip=192.168.1.159, version=0.9.190, pid=22314, id=108 31:[2014-08-0610:09:34.671][trace][22288][107] client identified, type=Play,stream_name=livestream, duration=-1.00 32:[2014-08-0610:09:34.671][trace][22288][107] out chunk size to 60000 33:[2014-08-0610:09:34.671][trace][22288][107] source url=__defaultVhost__/live/livestream,ip=127.0.0.1, cache=1, is_edge=0, source_id=105[105] 34:[2014-08-0610:09:34.672][trace][22288][107] dispatch cached gop success. count=307,duration=4515 35:[2014-08-0610:09:34.672][trace][22288][107] create consumer, queue_size=30.00, tba=44100,tbv=25 36:[2014-08-0610:09:34.672][trace][22288][107] ignored. set buffer length to 1000 37:[2014-08-0610:09:34.673][trace][22288][107] input chunk size to 60000 40:[2014-08-0610:09:44.748][trace][22288][107] -> PLA time=10007, msgs=0, okbps=464,0,0,ikbps=3,0,0 41:[2014-08-0610:09:47.805][warn][22288][107][104] client disconnect peer. ret=1004 可以看到源id是105,关键字是source_id=105,查这个源: [winlin@dev6 srs]$ grep --color -ina"\[105\]" objs/srs.origin.log 16:[2014-08-0610:09:30.331][trace][22288][105] RTMP client ip=127.0.0.1 17:[2014-08-0610:09:30.331][trace][22288][105] srand initialized the random. 18:[2014-08-0610:09:30.332][trace][22288][105] simple handshake success. 19:[2014-08-0610:09:30.373][trace][22288][105] connect app,tcUrl=rtmp://127.0.0.1:1936/live?vhost=__defaultVhost__, pageUrl=, swfUrl=,schema=rtmp, vhost=__defaultVhost__, port=1936, app=live, args=null 21:[2014-08-0610:09:30.417][trace][22288][105] client identified, type=publish(FMLEPublish),stream_name=livestream, duration=-1.00 22:[2014-08-0610:09:30.417][trace][22288][105] out chunk size to 60000 23:[2014-08-06 10:09:30.418][trace][22288][105]source url=__defaultVhost__/live/livestream, ip=127.0.0.1, cache=1, is_edge=0,source_id=-1[-1] 24:[2014-08-0610:09:30.466][trace][22288][105] got metadata, width=768, height=320, vcodec=7,acodec=10 25:[2014-08-06 10:09:30.466][trace][22288][105]46B video sh, codec(7, profile=100, level=32, 0x0, 0kbps, 0fps, 0s) 26:[2014-08-0610:09:30.466][trace][22288][105] 4B audio sh, codec(10, profile=1, 2channels,0kbps, 44100HZ), flv(16bits, 2channels, 44100HZ) 33:[2014-08-06 10:09:34.671][trace][22288][107]source url=__defaultVhost__/live/livestream, ip=127.0.0.1, cache=1, is_edge=0,source_id=105[105] 38:[2014-08-0610:09:40.732][trace][22288][105] <- CPB time=10100, okbps=3,0,0,ikbps=332,0,0 可见这个就是ingest的连接,即编码器推流连接。已经查到了源头。同时可以看到107这个其实是srs(simple rtmp sever)的回源链接,关键字是edge-srs: 30:[2014-08-0610:09:34.631][trace][22288][107] edge-srs ip=192.168.1.159, version=0.9.190,pid=22314, id=108 可以去边缘服务器上查它的信息,id是108: [winlin@dev6 srs]$ grep --color -ina"\[108\]" objs/srs.log 29:[2014-08-06 10:09:34.579][trace][22314][108]edge pull connected, can_publish=1, url=rtmp://dev:1935/live/livestream,server=127.0.0.1:1936 30:[2014-08-0610:09:34.591][trace][22314][108] complex handshake success. 31:[2014-08-0610:09:34.671][trace][22314][108] connected, version=0.9.190, ip=127.0.0.1,pid=22288, id=107 32:[2014-08-0610:09:34.672][trace][22314][108] out chunk size to 60000 33:[2014-08-0610:09:34.672][trace][22314][108] ignore the disabled transcode: 34:[2014-08-0610:09:34.672][trace][22314][108] edge change from 100 to state 101 (pull). 35:[2014-08-0610:09:34.672][trace][22314][108] input chunk size to 60000 36:[2014-08-0610:09:34.672][trace][22314][108] got metadata, width=768, height=320, vcodec=7,acodec=10 37:[2014-08-0610:09:34.672][trace][22314][108] 46B video sh, codec(7, profile=100, level=32,0x0, 0kbps, 0fps, 0s) 38:[2014-08-0610:09:34.672][trace][22314][108] 4B audio sh, codec(10, profile=1, 2channels,0kbps, 44100HZ), flv(16bits, 2channels, 44100HZ) 39:[2014-08-0610:09:34.779][trace][22314][107] update source_id=108[108] 46:[2014-08-0610:09:36.853][trace][22314][110] source url=__defaultVhost__/live/livestream,ip=192.168.1.179, cache=1, is_edge=1, source_id=108[108] 50:[2014-08-06 10:09:44.949][trace][22314][108]<- EIG time=10293, okbps=3,0,0, ikbps=441,0,0 53:[2014-08-0610:09:47.805][warn][22314][108][4] origin disconnected, retry. ret=1007 这个边缘服务器上这个回源链接有两个客户端连接上,107和110,关键字是source_id=108: [winlin@dev6 srs]$ grep --color -ina"\[107\]" objs/srs.log 18:[2014-08-0610:09:34.281][trace][22314][107] RTMP client ip=192.168.1.179 19:[2014-08-0610:09:34.282][trace][22314][107] srand initialized the random. 20:[2014-08-0610:09:34.291][trace][22314][107] complex handshake success 21:[2014-08-06 10:09:34.291][trace][22314][107]connect app, tcUrl=rtmp://dev:1935/live,pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935,swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23,schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=null 22:[2014-08-0610:09:34.532][trace][22314][107] ignored. set buffer length to 800 23:[2014-08-0610:09:34.568][trace][22314][107] client identified, type=Play,stream_name=livestream, duration=-1.00 24:[2014-08-0610:09:34.568][trace][22314][107] out chunk size to 60000 25:[2014-08-0610:09:34.568][trace][22314][107] source url=__defaultVhost__/live/livestream,ip=192.168.1.179, cache=1, is_edge=1, source_id=-1[-1] 26:[2014-08-06 10:09:34.579][trace][22314][107]dispatch cached gop success. count=0, duration=0 27:[2014-08-0610:09:34.579][trace][22314][107] create consumer, queue_size=30.00, tba=0,tbv=0 28:[2014-08-0610:09:34.579][trace][22314][107] ignored. set buffer length to 800 39:[2014-08-0610:09:34.779][trace][22314][107] update source_id=108[108] 54:[2014-08-0610:09:47.805][trace][22314][107] cleanup when unpublish 55:[2014-08-0610:09:47.805][trace][22314][107] edge change from 101 to state 0 (init). 56:[2014-08-06 10:09:47.805][warn][22314][107][9]client disconnect peer. ret=1004 107是触发回源的连接。查看110这个链接: [winlin@dev6 srs]$ grep --color -ina"\[110\]" objs/srs.log 40:[2014-08-0610:09:36.609][trace][22314][110] RTMP client ip=192.168.1.179 41:[2014-08-0610:09:36.613][trace][22314][110] complex handshake success 42:[2014-08-0610:09:36.613][trace][22314][110] connect app, tcUrl=rtmp://dev:1935/live,pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935,swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23,schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=null 43:[2014-08-0610:09:36.835][trace][22314][110] ignored. set buffer length to 800 44:[2014-08-0610:09:36.853][trace][22314][110] client identified, type=Play,stream_name=livestream, duration=-1.00 45:[2014-08-0610:09:36.853][trace][22314][110] out chunk size to 60000 46:[2014-08-0610:09:36.853][trace][22314][110] source url=__defaultVhost__/live/livestream,ip=192.168.1.179, cache=1, is_edge=1, source_id=108[108] 47:[2014-08-0610:09:36.853][trace][22314][110] dispatch cached gop success. count=95,duration=1573 48:[2014-08-0610:09:36.853][trace][22314][110] create consumer, queue_size=30.00, tba=44100,tbv=25 49:[2014-08-0610:09:36.853][trace][22314][110] ignored. set buffer length to 800 51:[2014-08-0610:09:45.919][trace][22314][110] -> PLA time=8759, msgs=21, okbps=461,0,0,ikbps=3,0,0 52:[2014-08-0610:09:46.247][warn][22314][110][104] client disconnect peer. ret=1004 可见110也是个flash播放连接。 解决了排错的问题,如何解决卡的问题?用测速了,SRS(Simple Rtmp Sever)可以支持linux客户端和flash客户端直接对SRS(Simple Rtmp Server)测速,测试到SRS(Simple Rtmp Sever)服务器的上行和下行速度。也就是说,页面推流前,就可以调用js接口来获取到服务器的速度,如果知道服务器地址,还可以自己选择服务器。 配合测速和日志系统,就能知道卡或者故障的用户分发的路径是什么,运维或者高级运维系统可以自己排插这个路径的问题所在。从这以后,挖断光纤的概率小很多了,而黑客攻击也只有真正被攻击时(这时候要用热备)才是公鸡了。 集成了测速客户端的客户,自己就可以排除推流端的带宽过低的故障,这种问题大约占了90%的故障。而且更加致命的是,没有办法证明,除非能测速。泥马其他测速是测试到其他服务器的速度,譬如360测速,那顶多证明用户到其他服务器的下行没有问题,和流媒体服务商的服务器有球关系?而且还是上行! 关于测速,参考:https://github.com/winlinvip/simple-rtmp-server/wiki/v1_CN_BandwidthTestTool
关于SRS可追溯日志,参考:https://github.com/winlinvip/simple-rtmp-server/wiki/v1_CN_SrsLog#tracable-log
欢迎小伙伴们使用我们的开源流媒体软件SRS(Simple Rtmp Server)。开源不够用,需要定制怎么办?亲,您可尝试关注观止创想(www.chnvideo.com)商业版流媒体平台产品哦,商业版产品会领先开源版本12个月左右,包含编转码器、播控、Web P2P等更多功能,啥都有~
北京观止创想科技有限公司
TEL: 010-8233 1879 FAX: 010-8233 1989
EMAIL: chnvideo@chnvideo.com ADD: 北京市海淀区知春路1号学院国际大厦801
Copyright © 2014 BRAVO观止创想,保留所有权利 京ICP备12047633号