记一次dns网络不通排查过程

记一次dns网络不通排查过程

Posted by lwk on July 27, 2021

今天在测试机安装yum源,发现一直报错

image

从错误信息来看是解析一个域名失败,查看了这个域名,对应是公司内部域名,通过dig发现确实没有任何响应,难道是dns服务器有问题?不太可能,只有这一台机器有问题,因此初步可以断定是这台机器有问题导致的。

我们strace看下到底是哪里有问题 image 从strace结果可以看出在访问172...:53时timeout了,172...:53是dns服务端地址。

也就是说这台测试机访问不了dns服务器,我们手工telnet一下发现确实超时了。 image 接下来需要排查到底是什么原因导致这台测试机访问不了dns服务器。基本上可以确定是本机的问题,所以我们在这台测试机上使用dropwatch看下丢包发生在哪里了。

我们通过dropwatch看下丢包

开一个终端,并执行telnet 172...* 53

另开一个终端执行dropwatch –l kas命令

[root@tst4-dev-1] /home/ansible$ dropwatch-l kas
Initalizing kallsyms db
dropwatch> start
dropwatch>
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at skb_queue_purge+18(0xffffffffa9224658)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
4 drops at unix_dgram_sendmsg+4f8(0xffffffffa92f8b48)
2 drops at nf_hook_slow+f3(0xffffffffa9279663)
4 drops at unix_stream_connect+2da(0xffffffffa92f933a)
5 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at icmp_rcv+125(0xffffffffa92bb945)
2 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at icmp_rcv+125(0xffffffffa92bb945)
5 drops at nf_hook_slow+f3(0xffffffffa9279663)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)
16 drops at unix_dgram_sendmsg+4f8(0xffffffffa92f8b48)
1 drops at nf_hook_slow+f3(0xffffffffa9279663)

从上面的显示可以看出丢包基本上发生在nf_hook_slow上,我们查看nf_hook_slow函数在内核的位置。

看下nf_hook_slow代码:


int nf_hook_slow(u_int8_t pf, unsigned inthook, struct sk_buff *skb,
        struct net_device *indev,
        struct net_device *outdev,
        int (*okfn)(struct sk_buff *),
        int hook_thresh)
{
   struct nf_hook_ops *elem;
   unsigned int verdict;
   int ret = 0;
 
   /* We may already have this, but read-locks nest anyway */
   rcu_read_lock();
 
   elem = list_entry_rcu(&nf_hooks[pf][hook], struct nf_hook_ops,list);
next_hook:
   verdict = nf_iterate(&nf_hooks[pf][hook], skb, hook, indev,
                 outdev, &elem, okfn,hook_thresh);
   if (verdict == NF_ACCEPT || verdict == NF_STOP) {
       ret = 1;
    }else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
       kfree_skb(skb);
       ret = NF_DROP_GETERR(verdict);
       if (ret == 0)
           ret = -EPERM;
    }else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
       int err = nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
                        verdict >>NF_VERDICT_QBITS);
       if (err < 0) {
           if (err == -ECANCELED)
                goto next_hook;
           if (err == -ESRCH &&
               (verdict &NF_VERDICT_FLAG_QUEUE_BYPASS))
                goto next_hook;
           kfree_skb(skb);
       }  
   }  
   rcu_read_unlock();
   return ret;
}

nf_hook_slow基本上可以确定是iptables导致的丢包。

我们看下iptables –nvL

2574 503K ACCEPT     tcp  -- *      *       0.0.0.0/0            0.0.0.0/0            tcp spt:22
   0     0 ACCEPT     tcp --  *      *      0.0.0.0/0           0.0.0.0/0            tcp dpt:22
   0     0 ACCEPT     tcp --  *      *      0.0.0.0/0           0.0.0.0/0            tcp dpt:2285
   0     0 ACCEPT     tcp --  *      *      0.0.0.0/0           0.0.0.0/0            tcp dpt:4588
   0     0 ACCEPT     tcp --  *      *      0.0.0.0/0           0.0.0.0/0            tcp dpt:4580
15381  431K ACCEPT    icmp --  *      *      0.0.0.0/0           0.0.0.0/0          
 103K6902K DROP       all  -- *      *       0.0.0.0/0            0.0.0.0/0 

把iptables规则清理之后再试下,已经OK

[root@tst4-dev-1] /data0/src/linux-3.10$telnet 172.*.*.* 53
Trying 172.*.*.*...
Connected to 172.*.*.*.
Escape character is '^]'.