今天在测试机安装yum源,发现一直报错
从错误信息来看是解析一个域名失败,查看了这个域名,对应是公司内部域名,通过dig发现确实没有任何响应,难道是dns服务器有问题?不太可能,只有这一台机器有问题,因此初步可以断定是这台机器有问题导致的。
我们strace看下到底是哪里有问题 从strace结果可以看出在访问172...:53时timeout了,172...:53是dns服务端地址。
也就是说这台测试机访问不了dns服务器,我们手工telnet一下发现确实超时了。 接下来需要排查到底是什么原因导致这台测试机访问不了dns服务器。基本上可以确定是本机的问题,所以我们在这台测试机上使用dropwatch看下丢包发生在哪里了。
我们通过dropwatch看下丢包
开一个终端,并执行telnet 172...* 53
另开一个终端执行dropwatch –l kas命令
[root@tst4-dev-1] /home/ansible$ dropwatch-l kas
Initalizing kallsyms db
dropwatch> start
dropwatch>
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at skb_queue_purge+18(0xffffffffa9224658)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
4 drops at unix_dgram_sendmsg+4f8(0xffffffffa92f8b48)
2 drops at nf_hook_slow+f3(0xffffffffa9279663)
4 drops at unix_stream_connect+2da(0xffffffffa92f933a)
5 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at icmp_rcv+125(0xffffffffa92bb945)
2 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at icmp_rcv+125(0xffffffffa92bb945)
5 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
16 drops at unix_dgram_sendmsg+4f8(0xffffffffa92f8b48)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
从上面的显示可以看出丢包基本上发生在nf_hook_slow上,我们查看nf_hook_slow函数在内核的位置。
看下nf_hook_slow代码:
int nf_hook_slow(u_int8_t pf, unsigned inthook, struct sk_buff *skb,
struct net_device *indev,
struct net_device *outdev,
int (*okfn)(struct sk_buff *),
int hook_thresh)
{
struct nf_hook_ops *elem;
unsigned int verdict;
int ret = 0;
/* We may already have this, but read-locks nest anyway */
rcu_read_lock();
elem = list_entry_rcu(&nf_hooks[pf][hook], struct nf_hook_ops,list);
next_hook:
verdict = nf_iterate(&nf_hooks[pf][hook], skb, hook, indev,
outdev, &elem, okfn,hook_thresh);
if (verdict == NF_ACCEPT || verdict == NF_STOP) {
ret = 1;
}else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
kfree_skb(skb);
ret = NF_DROP_GETERR(verdict);
if (ret == 0)
ret = -EPERM;
}else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
int err = nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
verdict >>NF_VERDICT_QBITS);
if (err < 0) {
if (err == -ECANCELED)
goto next_hook;
if (err == -ESRCH &&
(verdict &NF_VERDICT_FLAG_QUEUE_BYPASS))
goto next_hook;
kfree_skb(skb);
}
}
rcu_read_unlock();
return ret;
}
nf_hook_slow基本上可以确定是iptables导致的丢包。
我们看下iptables –nvL
2574 503K ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp spt:22
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:2285
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:4588
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:4580
15381 431K ACCEPT icmp -- * * 0.0.0.0/0 0.0.0.0/0
103K6902K DROP all -- * * 0.0.0.0/0 0.0.0.0/0
把iptables规则清理之后再试下,已经OK
[root@tst4-dev-1] /data0/src/linux-3.10$telnet 172.*.*.* 53
Trying 172.*.*.*...
Connected to 172.*.*.*.
Escape character is '^]'.