Performance Measurement of an Integrated NIC Architecture with 10GbE
2009 17th IEEE Symposium on High Performance Interconnects

Guangdeng Liao, Department of Computer Science and Engineering, University of California, Riverside, Riverside, California, USA
Laxmi Bhuyan, Department of Computer Science and Engineering, University of California, Riverside, Riverside, California, USA

Abstract: The deployment of 10 Gigabit Ethernet (10GbE) connections to servers has been hampered by the fast-network-slow-host phenomenon. Recently, the integration of network interfaces onto CPUs (INICs) has been proposed to tackle this performance mismatch. While significant advantages over PCI-based discrete NICs (DNICs) were shown in prior work using simulation methodologies, it is still unclear how INICs perform on real machines with 10GbE. This paper is the first to study the impact of INICs through extensive micro-benchmark evaluations on a highly threaded Sun Niagara 2 processor. The processor is the industry's first system on a chip, integrating two 10GbE NICs. We observe that the INIC only shows its advantage over the DNIC with large I/O sizes, improving network bandwidth by 7.5% while saving 20% relative CPU utilization. We characterize the system behavior to fully understand the performance benefits with respect to the number of connections, OS overhead, instruction counts, cache misses, etc. All of our studies reveal that there is a benefit to integrating NICs onto CPUs, but the gain is somewhat marginal. More aggressive integrated NIC designs should be adopted for higher speed networks such as the upcoming 40GbE and 100GbE.

Keywords: 10GbE, Integrated NIC, Discrete NIC, Performance Evaluation, Characterization, Sun Niagara 2.

I. INTRODUCTION

Ethernet continues to be the most widely used network architecture today because of its low component cost and backward compatibility with the existing Ethernet infrastructure. As of 2006, Gigabit Ethernet-based clusters make up 176 (or 35.2%) of the top-500 supercomputers [17]. Unfortunately, even as nearly all server platforms completed the transition to Gigabit Ethernet, the adoption of 10 Gigabit Ethernet (10GbE) has been limited to a few niche applications [18]. The use of 10GbE has been constrained by the processing capability of general purpose platforms [7, 11]. Prior work [1, 2, 3, 5, 14] on improving this processing capability broadly falls into two categories: 1) dedicating an embedded CPU to a NIC, and 2) integrating a NIC onto a CPU. The TCP Offload Engine (TOE), a popular approach for high speed networks, belongs to the first category. It offloads the whole network stack of the operating system (OS) into the NIC in the form of firmware. However, TOE lacks flexibility and cannot take advantage of technology-driven performance improvements as easily as host CPUs [5]. Recently, the alternative approach of integrating NICs onto CPUs has been shown to be more promising and is gaining popularity in both academia and industry [3, 14, 15]. Compared to TOE, the integration of NICs does not require modifying the legacy network stack, and it provides high flexibility and good compatibility with the OS.

Existing work on the integration of NICs (INICs) was evaluated by simulation [1, 2, 3]. Although simulation is flexible, it is hard to fully simulate the bandwidth and latency of memory and system bus protocols in real machines. It is also difficult for simulators to capture whole-OS behavior. Hence, evaluations on real machines become critically important and are complementary to simulators.
Papers [1, 2] claimed that INICs can significantly improve network processing efficiency in comparison with discrete NICs (DNICs) connected via a PCI-E bus, owing to the smaller latency of accessing I/O registers. However, how integrated NICs perform on real machines remains unclear; a detailed performance evaluation and characterization are required to answer this question. Sun released the UltraSPARC T2 processor (a.k.a. Niagara 2) [15] envisioning the benefits of NIC integration. Niagara 2 is a highly threaded processor consisting of many small cores and is the industry's first system on a chip integrating two 10GbE NICs. The integration reduces the latency of accessing I/O registers, though the overhead of accessing network packets is not eliminated because packets are still sourced from and destined to memory rather than caches [15].

In this paper, we present extensive evaluations with 10GbE and compare the INIC with the DNIC in detail. To make a fair comparison, the INIC in our experiments has the same design as the DNIC except for its proximity to the CPUs. Our experiments reveal that the INIC improves network efficiency only with large I/O sizes, with 7.5% higher bandwidth and 20% less CPU utilization. Our CPU breakdown confirms the previous observation from the simulator that the driver overhead is largely reduced (a 50% reduction in our experiments, and up to an 80% reduction in [2]). The reduction is attributed to the smaller latency of accessing I/O registers.

Besides confirming the simulator-based finding above, through a detailed performance characterization we also unexpectedly observe that the INIC significantly affects the behavior of the OS scheduler and CPU caches. We notice that the INIC reduces context switches by 40% in comparison with the DNIC. Our in-depth analysis shows that more crosscalls (or inter-processor interrupts) are incurred by the OS scheduler in the DNIC, and they correspondingly result in more frequent context switches. Additionally, the longer packet processing latency in the DNIC directly translates into longer residence times of packets in the caches, which can pollute the caches and thus incur higher cache miss rates. Combined with the influence of the more frequent context switches on the caches, our evaluation shows that the INIC has a 25% lower L1 data cache miss rate and a 7.6% lower L2 cache miss rate. In our experiments, we observe that the smaller latency of accessing I/O registers by itself does not speed up processing to a large extent; rather, the different behaviors of the OS scheduler and CPU caches induced by that smaller latency contribute most of the performance gain. This is contrary to the previous observation that the reduced driver overhead can lead to a performance improvement of up to 58% [2]. To satisfy the processing requirements introduced by higher network traffic rates, more aggressive designs such as new CPU/NIC interaction mechanisms should be considered.

The remainder of this paper is organized as follows. Section 2 describes background on the integration of NICs. Section 3 presents the experimental methodology. Sections 4 and 5 show our performance evaluation and detailed characterization. Finally, we conclude the paper in Section 6.

II. BACKGROUND

A. Integration of NICs

It is well known that TCP/IP over Ethernet is a dominant overhead for commercial web and data servers [19, 20]. Researchers gradually realized that a comprehensive solution across hardware platforms and software stacks is necessary to eliminate this overhead [1, 2].
Existing work on improving TCP/IP performance falls into two categories: TOE [5] and the integration of NICs [1, 2]. Although TOE reduces the communication overhead between processors and NICs, it lacks scalability due to its limited processing and memory capacity, and it requires extensive modification of the OS and the development of firmware in NICs. Recently, a counter-TOE approach is to integrate NICs onto CPUs, which is envisioned as the next-generation network infrastructure. It not only reduces the latency of accessing I/O registers, but also leverages the extensive resources of multi-core CPUs. Binkert [1, 2] made a first attempt to couple a simple NIC with a CPU for high bandwidth networks. They claimed that the device driver is one of the dominant overheads in processing high speed networks and that an integrated NIC can eliminate this overhead. They also went further to redesign the integrated NIC to eliminate the overheads of DMA descriptor management and data copy. Evaluation on their full system simulator M5 [2, 3] showed that the driver overhead is reduced by up to 80% even without any redesign, improving performance by up to 58%. The Joint Network Interface Controller (JNIC) [14], a collaborative research project between HP and Intel, also attempted to explore high performance in-data-center communications over Ethernet by integrating a NIC; they built a system prototype by attaching a 10GbE NIC to the front side bus to mimic the integration. Clearly, integration is drawing more and more attention as a way to eliminate the disparity between host computation capacity and high speed networks.

B. Sun Niagara 2

The Niagara 2 processor is the industry's first system on a chip, packing many small cores and threads and integrating all the key functions of a server on a single chip: computing, networking, security and I/O [15]. As shown in Figure 1, it has two 10GbE NICs (NIU in the figure) with a few notable features. All packet data is sourced from and destined to memory (DMA, in the parlance), meaning a core sets up the transfer and gets out of the way. The path to memory goes from the Ethernet unit (NIU) to the system interface unit (SIU), and directly into the L2 or the crossbar. The CPU sets up DMA for packet transfers from the NIC to memory. Niagara 2, known for its massive amount of parallelism, contains eight small SPARC physical processor cores, and each core has full hardware support for eight hardware threads, for a total of 64 hardware threads, or CPUs from the OS perspective. Additionally, each core has a 64-entry fully associative ITLB, a 128-entry fully associative DTLB, a 16KB L1 instruction cache (eight-way associative) and an 8KB L1 data cache (four-way associative, write-through). All of the cores share a 4MB L2 cache, divided into 8 banks, each of which is 16-way associative.

Figure 1. Niagara 2 Architecture

III. EXPERIMENT METHODOLOGY

A. Testbed Setup

Our experimental testbed consists of a Sun T5120 server connected to an Intel quad-core DP Xeon server, which function as the System Under Test (SUT) and the stressor, respectively. The Sun server has a Niagara 2 processor with 64 hardware threads, each operating at 1.2GHz. The Intel server is a two-processor platform based on the quad-core Intel Xeon processor 5300 series with 8 MB of L2 cache per processor [8]. Both machines are equipped with 16GB of DRAM.
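The comparison that follows repeatedly refers to the driver posting DMA descriptors and touching NIC registers. As a point of reference, below is a generic, much-simplified C sketch of a descriptor-based receive path; the ring layout, field names and the simulated "NIC" are illustrative assumptions only and do not correspond to the actual NIU or Neptune (nxge) programming interface.

    /* Generic sketch of a descriptor-based receive path (illustrative only). */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define RING_SIZE 8
    #define BUF_SIZE  2048            /* one MTU-sized packet per buffer */

    struct rx_desc {                  /* one receive descriptor, shared with the NIC */
        uint8_t *buf;                 /* host memory the NIC DMAs the packet into */
        uint32_t len;                 /* filled in on completion */
        int      done;                /* completion flag the device would set */
    };

    static struct rx_desc ring[RING_SIZE];
    static uint8_t bufs[RING_SIZE][BUF_SIZE];

    /* Driver: post empty buffers so packets can be DMAed straight to memory.
     * A real driver would then write the ring tail pointer to a NIC register;
     * with an integrated NIC that register write is a fast on-chip access. */
    static void post_rx_buffers(void) {
        for (int i = 0; i < RING_SIZE; i++)
            ring[i] = (struct rx_desc){ .buf = bufs[i], .len = 0, .done = 0 };
    }

    /* Stand-in for the NIC: pretend two packets arrived and were DMAed to memory. */
    static void nic_simulate_dma(void) {
        for (int i = 0; i < 2; i++) {
            memset(ring[i].buf, 0xAB, 1460);
            ring[i].len  = 1460;
            ring[i].done = 1;
        }
    }

    /* Interrupt handler: harvest completed descriptors and hand packets to the stack. */
    static int rx_interrupt(void) {
        int harvested = 0;
        for (int i = 0; i < RING_SIZE && ring[i].done; i++, harvested++)
            printf("packet %d: %u bytes in host memory\n", i, (unsigned)ring[i].len);
        return harvested;
    }

    int main(void) {
        post_rx_buffers();
        nic_simulate_dma();
        printf("interrupt harvested %d packets\n", rx_interrupt());
        return 0;
    }

The only device-visible operations in such a path are the register accesses that publish new descriptors and the interrupt that signals completions; those accesses are exactly what the integration shortens, while packet payloads still travel through memory in both configurations.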
TABLE I. INIC VS DNIC

In order to compare the integrated NIC with the discrete NIC, we used two 10GbE network adapters in the Sun server: a discrete Sun 10GbE PCI-E NIC (a.k.a. Neptune) [16] and the on-chip 10GbE Network Interface Unit (a.k.a. NIU) [15]. The on-chip NIU has the same physical design as Neptune except that it has fewer DMA transmit channels; more information is shown in Table 1. Both use the same device driver, and both trigger an interrupt after the number of received packets reaches 32 or after 8 NIC hardware clocks have elapsed since the last packet was received. We also installed two Intel 10GbE Server Adapters (a.k.a. Oplin) [9] in the stressor system to connect to the two network adapters in the Sun server. All of the discrete NICs connect to their hosts through PCI-E x8, a full-duplex I/O fabric fast enough to keep up with a 10+10 Gigabit/s full-duplex network port.

B. Server Software

The SUT runs the Solaris 10 operating system, while the stressor runs a vanilla Linux kernel. In Solaris 10, the STREAMS-based network stack was replaced by a new architecture named FireEngine [6], which provides better connection affinity to CPUs and greatly reduces the connection setup cost and the cost of per-packet processing. It merges all protocol layers into one fully multithreaded STREAMS module. In order to optimize network processing with the 10GbE network, we use 16 soft rings per 10GbE NIC by setting the driver parameter ip_soft_rings_cnt. Soft rings are kernel threads that offload the processing of received packets from the interrupt CPU, thus preventing the interrupt CPU from becoming the bottleneck. We also set ddi_msix_alloc_limit to 8 so that receive interrupts can target 8 different CPUs. Otherwise, we retain the default settings in the device driver without specific performance tuning of interrupt coalescing, write combining, etc. All protocol- and system-relevant settings are at their defaults.

Micro-benchmarks were used in our experiments to easily identify the performance benefits and to avoid system noise from commercial applications [7, 11]. We selected Iperf [10] and NetPIPE [13] as micro-benchmarks for measuring bandwidth and ping-pong latency, respectively. Because peak bandwidth can be achieved with more than 16 connections, Iperf is run with 32 parallel connections on 64 CPUs for 60 seconds in all our experiments, unless otherwise stated. The utility vmstat is used to capture the corresponding CPU utilization. We ran the profiling tools er_kernel and er_print to collect and analyze the overhead of system functions, and the tools busstat and cpustat were used to obtain memory traffic and hardware performance counter statistics while running the benchmarks.

IV. PERFORMANCE EVALUATION

Figure 2 shows how the INIC and the DNIC perform with various I/O sizes while receiving packets. The bars in the figure represent achievable network bandwidth, and the lines stand for the corresponding CPU utilization. The INIC achieves 8.97 Gbps of bandwidth while consuming 27% CPU utilization with large I/O sizes; correspondingly, 8.31 Gbps is obtained by the DNIC at 35% CPU utilization. On average, the INIC obtains 7.5% higher bandwidth and saves 20% relative CPU utilization for large I/O sizes, while its efficiency is close to that of the DNIC with small packets. These results reveal that, on the receive side, the integration improves network efficiency only with large I/O sizes.

Figure 2. Bandwidth & CPU Utilization (RX)
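The relative figures quoted above can be sanity-checked with simple arithmetic. The short C program below reproduces the bandwidth and CPU-utilization comparison from Figure 2 and derives a rough per-packet CPU cost; the assumptions that the reported utilization is averaged over all 64 hardware threads and that large I/Os arrive as roughly 1460-byte TCP segments are ours, not the paper's.

    #include <stdio.h>

    int main(void) {
        /* Numbers reported in Figure 2 for large I/O sizes. */
        double inic_bw = 8.97e9, inic_util = 0.27;   /* INIC bandwidth (bit/s), CPU util */
        double dnic_bw = 8.31e9, dnic_util = 0.35;   /* DNIC bandwidth (bit/s), CPU util */
        int    threads  = 64;                        /* hardware threads in Niagara 2 */
        double seg_bits = 1460.0 * 8;                /* assumed payload bits per segment */

        printf("bandwidth gain:      %.1f%%\n", (inic_bw / dnic_bw - 1.0) * 100.0);
        printf("relative CPU saving: %.1f%%\n", (1.0 - inic_util / dnic_util) * 100.0);

        /* Rough per-packet CPU cost: busy thread-seconds per second / packets per second. */
        printf("INIC: ~%.1f us per packet\n", inic_util * threads / (inic_bw / seg_bits) * 1e6);
        printf("DNIC: ~%.1f us per packet\n", dnic_util * threads / (dnic_bw / seg_bits) * 1e6);
        return 0;
    }

Under those assumptions the program prints a bandwidth gain of about 8%, a relative CPU saving of about 23%, and per-packet costs of roughly 22 µs (INIC) and 31 µs (DNIC), which is in the same ballpark as the per-packet breakdown reported in Section V.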
Figure 3 compares the INIC and the DNIC while transmitting packets. Because less time is required in the driver for the INIC to transmit packets, one might expect the INIC to obtain higher transmit bandwidth than the DNIC. However, the INIC does not show noticeable benefits to the application in terms of network efficiency. This is possibly because, first, the NIU has fewer transmit DMA channels than the Neptune 10GbE card (8 TX DMA channels in the INIC versus 12 TX DMA channels in the DNIC), and fewer channels can reduce packet transmission capacity; and second, the transmit side is much less latency-sensitive than the receive side [12, 19, 20].

Figure 3. Bandwidth & CPU Utilization (TX)

To ease and expedite our analysis of the receive-side observation above, we conducted experiments comparing the INIC with the DNIC by running Iperf with a varying number of connections rather than a fixed 32 connections. Figure 4 illustrates the comparison from a single connection up to 64 connections with 64KB messages. The following observations can be made from the figure: 1) more than 16 connections are required for both the INIC and the DNIC to achieve peak bandwidth, owing to the low performance of a single hardware thread in Niagara 2; 2) unlike the INIC, the DNIC with 64 connections loses 10% of its bandwidth compared to 32 connections; and 3) the INIC improves network efficiency only with 32 or more connections.

Figure 4. Performance with Various Connections

Similarly, we also studied the performance comparison by running 32 connections with a varying number of CPUs (hardware threads), shown in Figure 5. We observe from the figure that the benefits only appear when more than 16 CPUs are used. Combined with Figure 4, we can draw two conclusions: 1) the integration affects system behavior with a large number of connections, and these different system behaviors are the main cause of the performance difference; and 2) the benefits can only be achieved with a large number of CPUs, and thus are tied to the highly threaded Sun system.

Figure 5. Performance with Various CPUs

High bandwidth and low latency are the two main metrics for modern networking servers. We therefore also conducted experiments comparing ping-pong latency by configuring the SUT with either the INIC or the DNIC while retaining the same configuration in the stressor. The micro-benchmark NetPIPE was used to measure the latency. Since large I/Os are segmented into packets smaller than the MTU (Maximum Transfer Unit, 1.5KB by default), we focus on packets smaller than the MTU for the ping-pong latency test. Our results in Figure 6 show that the INIC achieves a lower latency, saving about 6 µs, due to the smaller latency of accessing I/O registers and the elimination of PCI-E bus latency.

Figure 6. Ping-Pong Latency
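For readers unfamiliar with this kind of measurement, the following is a minimal ping-pong sketch in the spirit of NetPIPE rather than NetPIPE itself: one side echoes messages and the other halves the averaged round-trip time to estimate one-way latency. The port number, iteration count and TCP_NODELAY tuning are our assumptions, not settings from the paper.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define PORT  5001                /* arbitrary port for this sketch */
    #define ITERS 1000                /* round trips to average over */

    static double now_us(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    /* TCP may deliver a message in pieces; loop until n bytes have arrived. */
    static int read_full(int fd, char *buf, int n) {
        int got = 0;
        while (got < n) {
            int r = (int)read(fd, buf + got, n - got);
            if (r <= 0) return -1;
            got += r;
        }
        return got;
    }

    int main(int argc, char **argv) {
        static char buf[65536];
        int one = 1;
        struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(PORT) };
        if (argc >= 2 && strcmp(argv[1], "server") == 0) {        /* echo side */
            int ls = socket(AF_INET, SOCK_STREAM, 0);
            a.sin_addr.s_addr = INADDR_ANY;
            setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
            bind(ls, (struct sockaddr *)&a, sizeof a);
            listen(ls, 1);
            int c = accept(ls, NULL, NULL);
            setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
            for (;;) {                                            /* echo whatever arrives */
                int n = (int)read(c, buf, sizeof buf);
                if (n <= 0) break;
                write(c, buf, n);
            }
        } else if (argc >= 3) {                                   /* timing side */
            int msg = atoi(argv[2]);                              /* message size in bytes */
            int s = socket(AF_INET, SOCK_STREAM, 0);
            inet_pton(AF_INET, argv[1], &a.sin_addr);
            connect(s, (struct sockaddr *)&a, sizeof a);
            setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
            double t0 = now_us();
            for (int i = 0; i < ITERS; i++) {
                write(s, buf, msg);
                read_full(s, buf, msg);
            }
            printf("%d-byte messages: ~%.1f us one-way\n", msg, (now_us() - t0) / ITERS / 2.0);
        } else {
            fprintf(stderr, "usage: pingpong server | pingpong <server-ip> <bytes>\n");
        }
        return 0;
    }

Disabling Nagle's algorithm (TCP_NODELAY) matters in such a test: small messages would otherwise wait to be coalesced, and that queuing delay would swamp the few-microsecond difference between the two NIC configurations.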
V. DETAILED PERFORMANCE CHARACTERIZATION

To further understand the benefits of the INIC, we profiled the system for both kernel and application function calls as well as the assembly code. We used the test case with a 64KB I/O size and 32 concurrent connections from Figure 2. The gathered data were grouped into the following components to determine their impact on performance: device driver, socket, buffer management, network stack, kernel, data copy and the user application Iperf. The CPU overhead breakdown per packet is calculated and presented in Figure 7. We observe that about 28 µs and 20 µs are required to process one received packet in the DNIC and the INIC, respectively.

Figure 7. CPU Overhead Breakdown

The comparison in the figure reveals that the CPU overhead in the driver is reduced from 4.7 µs to 2.6 µs by the integration. Our profiling shows that the overhead of the interrupt handler nxge_rx_intr, which frequently operates on NIC registers, is reduced by 10X. The copy component remains the same when we switch from the DNIC to the INIC configuration. This is because all packets in the INIC are still sourced from and destined to memory rather than caches; the data copy from kernel to user buffers in both configurations incurs compulsory cache misses to fetch payloads from memory into the caches. The overhead of the copy component would be eliminated only if packets were delivered into the caches. Our findings so far confirm the observations in prior work [1, 2], even though they differ in absolute benefits. We also observe that the INIC additionally reduces the overheads of the network stack, buffer management, socket and kernel components. These unexpected improvements comprise up to 75% of the total overhead reduction and thus contribute most of the performance benefit. We found that the different behaviors of the OS scheduler and the CPU caches lead to these benefits.

A. Impacts on the OS Scheduler

Since the benefit of the INIC over the DNIC changes as the number of connections increases, we carefully characterize the system behavior with a varying number of connections to understand the benefits of the INIC.

1) Instruction Breakdown

First, we performed an architectural characterization by instruction count for packet processing across various numbers of connections. Instructions are broken down into five types: loads, stores, atomics, software count instructions and all other instructions, as shown in Figure 8. (Note that each received packet is smaller than 1.5KB because large messages on the sender side are segmented into packets smaller than the MTU.) As shown in Figure 8, about 3,500 instructions are required to process a packet with fewer than 32 connections, but this increases to about 4,500 inst