A growing number of studies have addressed the heterogeneity problem on cluster systems built from multi-core computing nodes. We previously proposed a hybrid MPI and OpenMP loop self-scheduling approach for this kind of system, in which the allocation functions of several well-known schemes were modified to improve performance. Although that approach improves system performance significantly, in this paper we show how to increase the speedup further. First, we exploit thread-level parallelism on the multi-core master node. Second, we design a loop self-scheduling scheme that assigns each node a chunk size suited to its performance. At the start of dispatching, the scheme prevents slow slaves from being assigned too many iterations; toward the end, it prevents the master from issuing too many small chunks to the slaves. Experimental results show that our approach achieves a speedup of up to 1.35.
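The abstract does not give the allocation function itself, so the following is only a rough illustration of the two goals it states, not the authors' scheme. It assumes a performance-weighted variant of guided self-scheduling (GSS): each chunk is the requesting slave's share of the total performance times the remaining iterations, so slow slaves receive smaller chunks early on, and a lower bound (`min_chunk`) keeps the tail from splintering into many tiny chunks. The names `Slave`, `perf`, `weighted_self_schedule`, and `min_chunk`, as well as the round-robin request order, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    perf: float  # hypothetical relative performance weight (e.g. measured CPU speed)


def weighted_self_schedule(total_iters, slaves, min_chunk=4):
    """Simulate master-side dispatch, assuming slaves request work round-robin."""
    total_perf = sum(s.perf for s in slaves)
    schedule = []  # list of (slave name, first iteration, chunk size)
    next_iter = 0
    turn = 0
    while next_iter < total_iters:
        slave = slaves[turn % len(slaves)]
        remaining = total_iters - next_iter
        # GSS-style chunk scaled by this slave's share of total performance:
        # with equal weights this reduces to ceil(remaining / P), as in plain GSS.
        chunk = math.ceil(remaining * slave.perf / total_perf)
        # Lower bound keeps the end of the loop from producing many tiny chunks.
        chunk = max(chunk, min_chunk)
        # Never hand out more iterations than are left.
        chunk = min(chunk, remaining)
        schedule.append((slave.name, next_iter, chunk))
        next_iter += chunk
        turn += 1
    return schedule


if __name__ == "__main__":
    nodes = [Slave("fast", 3.0), Slave("slow", 1.0)]
    for entry in weighted_self_schedule(100, nodes, min_chunk=4):
        print(entry)
```

For 100 iterations with one slave three times faster than the other, this sketch hands out chunks of 75, 7, 14, and 4 iterations; the final chunk is raised from 1 to the `min_chunk` floor. A real master would serve requests in arrival order rather than strictly round-robin.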