Network Processors: A Generation of Multi-Core Processors
INF5063: Programming Heterogeneous Multi-Core Processors
Transcript
  • Network Processors: A generation of multi-core processors (INF5063: Programming Heterogeneous Multi-Core Processors)


    Agere Payload Plus APP550 (block diagram: classifier memory and buffer, scheduler memory and buffer, stream editor memory, statistics memory; data path from ingress and co-processor to egress and co-processor; PCI bus)


    Agere Payload Plus APP550 (same block diagram as above)
    Pattern Processing Engine: patterns specified by the programmer; programmable using a special high-level language; only pattern-matching instructions; parallelism by hardware, using multiple copies and several sets of variables; access to different memories (see the sketch below).
    State Engine: gathers information (statistics) for scheduling; verifies that flows stay within bounds; provides an interface to the host; configures and controls the other functional units.
    Packet (protocol data unit) assembler: collects all blocks of a frame; not programmable.
    Stream Editor (SED): two parallel engines; modifies outgoing packets (e.g., checksum, TTL, ...); configurable, but not programmable.
    Reorder Buffer Manager: transfers data between classifier and traffic manager; ensures packet order despite parallelism and variable processing times in the pattern processing.
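    Logically, the pattern-processing step reduces to matching packet header fields against programmer-specified rules. Agere's actual pattern language is not shown here; the following plain-C sketch (hypothetical rule table and field names) only illustrates the classification logic that the engine performs in parallel in hardware.

    /* Illustrative sketch of what a classifier's pattern match boils down to.
     * The APP550 expresses this in its own high-level pattern language and
     * runs many matches in parallel in hardware; this plain C only shows the
     * logic, with hypothetical types and rule fields. */
    #include <stdint.h>
    #include <stddef.h>

    struct rule {                  /* one programmer-specified pattern */
        uint32_t src_ip, src_mask;
        uint32_t dst_ip, dst_mask;
        uint8_t  proto;            /* 0 = wildcard */
        int      action;           /* e.g., a queue id for the traffic manager */
    };

    /* Return the action of the first matching rule, or -1 if none matches. */
    static int classify(const struct rule *rules, size_t n,
                        uint32_t src, uint32_t dst, uint8_t proto)
    {
        for (size_t i = 0; i < n; i++) {
            if ((src & rules[i].src_mask) == rules[i].src_ip &&
                (dst & rules[i].dst_mask) == rules[i].dst_ip &&
                (rules[i].proto == 0 || rules[i].proto == proto))
                return rules[i].action;
        }
        return -1;
    }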


    PowerNP (block diagram: ingress queue and ingress data store, egress queue and egress data store; 4 interfaces in from the net, 4 interfaces out to the net; 2 interfaces in from the host, 2 interfaces out to the host; instruction memory, PowerPC core, hardware classifier, dispatch unit)


    PowerNP (same block diagram as above)
    Embedded general-purpose PowerPC core; no OS on the NP.
    Coprocessors: 8 embedded processors with 4 KB local memory each, 2 cores per processor, 2 threads per core.
    Link-layer framing is done outside the processor.


    IXP1200 Architecture
    RISC processor: StrongARM running Linux; handles control, higher-layer protocols and exceptions; 232 MHz.
    Microengines: low-level devices with a limited instruction set; transfers between memory devices; packet processing; 232 MHz.
    Access units: coordinate access to external units.
    Scratchpad: on-chip memory, used for IPC and synchronization (see the sketch below).
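    The scratchpad's IPC role can be pictured as a small ring of packet-buffer handles shared between the StrongARM and a microengine. This is a minimal single-producer/single-consumer sketch in plain C, not Intel's microengine API; the layout and names are hypothetical, and real code would use the scratchpad's atomic operations or memory barriers where noted.

    #include <stdint.h>

    #define RING_SLOTS 64                     /* power of two */

    struct scratch_ring {
        volatile uint32_t head;               /* written only by the producer */
        volatile uint32_t tail;               /* written only by the consumer */
        volatile uint32_t slot[RING_SLOTS];   /* packet-buffer handles */
    };

    /* Producer side (e.g., StrongARM): returns 0 if the ring is full. */
    static int ring_put(struct scratch_ring *r, uint32_t handle)
    {
        uint32_t h = r->head;
        if (h - r->tail == RING_SLOTS)
            return 0;                         /* full */
        r->slot[h % RING_SLOTS] = handle;
        r->head = h + 1;                      /* needs a barrier on real HW */
        return 1;
    }

    /* Consumer side (e.g., a microengine thread): returns 0 if empty. */
    static int ring_get(struct scratch_ring *r, uint32_t *handle)
    {
        uint32_t t = r->tail;
        if (t == r->head)
            return 0;                         /* empty */
        *handle = r->slot[t % RING_SLOTS];
        r->tail = t + 1;
        return 1;
    }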


    IXP2400 Architecture (block diagram: embedded RISC CPU (XScale), SRAM coprocessor, FLASH, DRAM; SRAM access, SDRAM access, scratch memory, PCI access, MSF access; PCI bus, receive bus, transmit bus, DRAM bus, SRAM bus, slowport access)
    RISC processor (vs. IXP1200): StrongARM → XScale; 233 MHz → 600 MHz.
    Microengines (vs. IXP1200): 6 → 8; 233 MHz → 600 MHz.
    Media Switch Fabric: forms the fast path for transfers; interconnect for several IXP2xxx.
    Receive/transmit buses (vs. IXP1200): shared bus → separate busses.
    Slowport: shared interface to external units; used for FlashROM during bootstrap.
    Coprocessors: hash unit; 4 timers; general-purpose I/O pins; external JTAG connections (in-circuit tests); several bulk ciphers (IXP2850 only); checksum (IXP2850 only).

  • Example: SpliceTCP (INF5063: Programming Heterogeneous Multi-Core Processors)

    TCP Splicing (figure, repeated over several animation steps: some client)

    TCP Splicing (figure: protocol stacks with physical, data link, network, transport and application layers; accept/connect; while(1) { read; write; })
    Linux Netfilter: establish upstream connection; receive the entire packet; rewrite headers; forward the packet.
    IXP 2400: establish upstream connection; parse packet headers; rewrite headers; forward the packet.


    TCP splicing at kernel level, e.g. with Linux netfilter: PREROUTING rewrites the destination IP, POSTROUTING rewrites the source IP; the data is still copied to main memory (see the sketch below).
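    As a concrete illustration of the PREROUTING step, here is a minimal kernel-module sketch that rewrites the destination IP of TCP packets and patches the IP checksum incrementally. The address is hypothetical, and a real splicer would also rewrite ports, the TCP checksum and sequence numbers; this only shows where the rewrite hooks in.

    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/ip.h>
    #include <net/checksum.h>
    #include <net/net_namespace.h>

    static unsigned int splice_prerouting(void *priv, struct sk_buff *skb,
                                          const struct nf_hook_state *state)
    {
        struct iphdr *iph = ip_hdr(skb);
        __be32 new_dst = htonl(0xC0A84306);   /* 192.168.67.6, hypothetical */

        if (iph->protocol == IPPROTO_TCP && iph->daddr != new_dst) {
            csum_replace4(&iph->check, iph->daddr, new_dst); /* RFC 1624 */
            iph->daddr = new_dst;
        }
        return NF_ACCEPT;
    }

    static struct nf_hook_ops splice_ops = {
        .hook     = splice_prerouting,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_NAT_DST,
    };

    static int __init splice_init(void)  { return nf_register_net_hook(&init_net, &splice_ops); }
    static void __exit splice_exit(void) { nf_unregister_net_hook(&init_net, &splice_ops); }
    module_init(splice_init);
    module_exit(splice_exit);
    MODULE_LICENSE("GPL");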


    TCP splicing on the IXP 2400: can start rewriting the header as soon as enough bytes have arrived; all data goes through the card, but only once; no scheduling of any user-space process; on an IXP2400, processing can start after the first 64 bytes (see the checksum sketch below).
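    The reason header rewriting can begin that early is that the IP header checksum can be patched incrementally (RFC 1624: HC' = ~(~HC + ~m + m')) without ever seeing the payload. A user-space sketch for one 32-bit field:

    #include <stdint.h>

    /* Incrementally update a 16-bit one's-complement checksum after a
     * 32-bit header field changes from old_val to new_val (RFC 1624). */
    static uint16_t csum_update32(uint16_t check, uint32_t old_val,
                                  uint32_t new_val)
    {
        uint32_t sum = (uint16_t)~check;                 /* ~HC           */
        sum += (uint16_t)~(old_val >> 16) + (uint16_t)~(old_val & 0xFFFF);
        sum += (new_val >> 16) + (new_val & 0xFFFF);     /* + ~m + m'     */
        while (sum >> 16)                                /* fold carries  */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }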

    Comment on the paper: flow control, end-to-end processing, etc.


    Throughput vs. request file size. Graph from the presentation of the paper "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005, Princeton, NJ, Oct 27-28, 2005, by Li Zhao, Yan Luo, Laxmi Bhuyan (Univ. of California, Riverside) and Ravi Iyer (Intel). Major performance gain at all request sizes.

    Another graph from the same SpliceNP presentation (Zhao, Luo, Bhuyan, Iyer).

  • Example: Transparent protocol translation and load balancing in a media streaming scenario. Slides from an ACM MM 2007 presentation by Espeland, Lunde, Stensland, Griwodz and Halvorsen. (INF5063: Programming Heterogeneous Multi-Core Processors)


    Load Balancer (figure: RTSP/RTP video servers behind the balancer's ingress/egress; mplayer clients across the network; an RTSP/RTP parser splits RTSP from RTP/UDP traffic)


    Transport Protocol Translator (figure: RTSP/RTP video servers at the ingress; mplayer clients speaking HTTP across the network). HTTP streaming is frequently used today!


    Transport Protocol Translator (figure: RTSP/RTP and RTP/UDP between the servers and the translator; HTTP between the translator and the mplayer clients)


    Machine setup (IXP lab) (figure: switches and a local network 192.168.67.xxx; a transparent media server at 192.168.67.6; the load balancer and translator in front of it; traffic to 192.168.67.5 and 192.168.67.6)


    Results: The prototype works; it both load-balances and translates between HTTP/TCP and RTP/UDP. The protocol translation gives a much more stable bandwidth than using HTTP/TCP all the way from the server (graph: protocol translation vs. HTTP). A sketch of the translation data path follows below.
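    The translator's data path can be pictured with a small sketch: RTP/UDP packets arrive from the server, the 12-byte fixed RTP header is stripped, and the payload is written into the client's already-established HTTP/TCP connection. This is a simplification under stated assumptions: hypothetical port, no CSRC lists or header extensions, no handling of RTCP, loss or reordering, and none of the RTSP signaling that sets the session up.

    #include <stdint.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define RTP_HDR_LEN 12   /* fixed RTP header; no CSRC/extensions assumed */

    static void translate_rtp_to_http(int http_fd, uint16_t rtp_port)
    {
        int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_addr.s_addr = htonl(INADDR_ANY),
                                    .sin_port = htons(rtp_port) };
        uint8_t pkt[2048];

        bind(udp_fd, (struct sockaddr *)&addr, sizeof(addr));

        /* The HTTP response header has already been sent on http_fd. */
        for (;;) {
            ssize_t n = recv(udp_fd, pkt, sizeof(pkt), 0);
            if (n <= RTP_HDR_LEN)
                continue;                     /* runt packet or error: skip */
            /* Strip the RTP header; payload becomes part of the HTTP body. */
            if (write(http_fd, pkt + RTP_HDR_LEN, (size_t)(n - RTP_HDR_LEN)) < 0)
                break;                        /* client went away */
        }
        close(udp_fd);
    }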

  • Example: Booster Boxes. Slide content and structure mainly from the NetGames 2002 presentation by Bauer, Rooney and Scotton. (INF5063: Programming Heterogeneous Multi-Core Processors)


    Client-Server (figure: a backbone network connecting local distribution networks)


    Peer-to-peer (figure: a backbone network connecting local distribution networks)


    IETF's Middleboxes. A middlebox is a network intermediate device that implements middlebox services; a middlebox function requires application-specific intelligence.

    Examples: policy-based packet filtering (a.k.a. firewall); network address translation (NAT); intrusion detection; load balancing; policy-based tunneling; IPsec security.

    RFC 3303 and RFC 3304: from traditional middleboxes, which embed application intelligence within the device, to middleboxes supporting the MIDCOM protocol, which externalize application intelligence into MIDCOM agents.


    Booster boxes: middleboxes attached directly to ISPs' access routers; less generic than, e.g., firewalls or NAT.

    They assist distributed event-driven applications and improve the scalability of client-server and peer-to-peer applications.

    Application-specific code ("boosters"): caching on behalf of a server; aggregation of events; intelligent filtering; application-level routing.


    Overlay networks (figure: backbone networks and LANs at the IP layer; overlay nodes and overlay links forming an overlay network layer above IP paths; application layer on top)


    Booster boxes (figure: a backbone network connecting local distribution networks, with booster boxes at the access routers)


    Booster boxes (same figure: load redistribution by delegating server functions)


    Booster boxes: application-specific code.
    Caching on behalf of a server: non-real-time information is cached; booster boxes answer on behalf of servers.
    Aggregation of events: information from two or more clients within a time window is aggregated into one packet (see the sketch below).
    Intelligent filtering: outdated or redundant information is dropped.
    Application-level routing: packets are forwarded based on packet content, application state and destination address.
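    The aggregation-of-events booster is easy to sketch: updates arriving within one time window are packed into a single packet toward the server. The window length, buffer sizes and the send function below are hypothetical.

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define WINDOW_MS 50
    #define MAX_BATCH 1400            /* stay under a typical Ethernet MTU */

    struct aggregator {
        uint8_t  buf[MAX_BATCH];
        size_t   used;
        uint64_t window_start_ms;
    };

    void send_to_server(const uint8_t *buf, size_t len);   /* hypothetical */

    static uint64_t now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
    }

    static void flush(struct aggregator *a)
    {
        if (a->used)
            send_to_server(a->buf, a->used);
        a->used = 0;
        a->window_start_ms = now_ms();
    }

    /* Called for every client event; at most ~one packet leaves per window. */
    static void on_client_event(struct aggregator *a,
                                const uint8_t *ev, size_t len)
    {
        if (now_ms() - a->window_start_ms >= WINDOW_MS ||
            a->used + len > MAX_BATCH)
            flush(a);
        memcpy(a->buf + a->used, ev, len);
        a->used += len;
    }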


    Architecture: data layer. Behaves like a layer-2 switch for the bulk of the traffic; copies or diverts selected traffic. IBM's booster boxes use the packet capture library (pcap) filter specification to select traffic (see the example below).
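    libpcap's filter interface, which the slide refers to, looks like this; the device name and the filter expression are hypothetical, but the calls are the standard pcap API.

    #include <stdio.h>
    #include <pcap/pcap.h>

    /* Packets matching the filter are diverted to the booster; everything
     * else would stay on the data layer's layer-2 fast path. */
    static void booster_cb(u_char *user, const struct pcap_pkthdr *h,
                           const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("diverted %u bytes to the booster\n", h->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 100, errbuf);
        struct bpf_program prog;

        if (!p) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

        /* e.g., divert one game's traffic (port is hypothetical) */
        if (pcap_compile(p, &prog, "udp dst port 7777", 1,
                         PCAP_NETMASK_UNKNOWN) == -1 ||
            pcap_setfilter(p, &prog) == -1) {
            fprintf(stderr, "filter: %s\n", pcap_geterr(p));
            return 1;
        }
        return pcap_loop(p, -1, booster_cb, NULL) == 0 ? 0 : 1;
    }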


    Architecture: booster layer. Booster: application-specific code, executed either on the host CPU or on the network processor. Library: boosters can call the data-layer operations. Generates a QoS-aware overlay network (Booster Overlay Network, BON).


    Data aggregation example: floating car data.
    Main booster task: complex message aggregation, statistical computations, context information; very low real-time requirements.
    Applications: traffic monitoring/predictions, pay-as-you-drive insurance, car maintenance, car taxes.
    Cars transmit position, speed and driven distance; the booster does statistics gathering, compression and filtering.


    Interactive TV game show.
    Main booster task: simple message aggregation; limited real-time requirements.
    Steps: 1. packet generation; 2. packet interception; 3. packet aggregation; 4. packet forwarding.


    Game with a large virtual space.
    Main booster task: dynamic server selection based on the current in-game location; requires application-specific processing; high real-time requirements.
    (Figure: a virtual space with one region handled by server 1 and another handled by server 2.)


    Summary: scalability through application-specific knowledge and through network awareness.

    Main mechanisms: caching on behalf of a server; aggregation of events; attenuation; intelligent filtering; application-level routing.

    Which mechanism applies depends on the workload and the real-time requirements.


    Auto-configuration and dynamic link metrics

  • Multimedia Examples (INF5063: Programming Heterogeneous Multi-Core Processors)

    Multicast Video-Quality Adjustment (figure, repeated over several animation steps)

    Multicast Video-Quality Adjustment. There are several ways to do video-quality adjustment: frame dropping, re-quantization, scalable video codecs.
    Yamada et al. 2002: use a low-pass filter to eliminate the high-frequency components of the MPEG-2 video signal and thus reduce the data rate: determine a low-pass parameter for each GOP; use the low-pass parameter to calculate how many DCT coefficients to remove from each macroblock in a picture; eliminating the specified number of DCT coefficients reduces the video data rate (see the sketch below). They implemented the low-pass filter on an IXP1200.
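    The core of the coefficient-dropping step is simple to sketch. Assuming the 64 coefficients of an 8x8 block are already in ascending-frequency (zigzag) order, as the slides describe, the filter zeroes the n highest-frequency ones; how the per-GOP low-pass parameter is derived is not shown here.

    #include <stdint.h>

    #define BLOCK_COEFFS 64

    /* Zero the n_drop highest-frequency DCT coefficients of one 8x8 block.
     * Runs of zeroed coefficients cost almost nothing after entropy coding,
     * which is what reduces the data rate. */
    static void lowpass_block(int16_t coeff[BLOCK_COEFFS], int n_drop)
    {
        if (n_drop < 0)
            n_drop = 0;
        if (n_drop > BLOCK_COEFFS - 1)
            n_drop = BLOCK_COEFFS - 1;        /* always keep the DC term */
        for (int i = BLOCK_COEFFS - n_drop; i < BLOCK_COEFFS; i++)
            coeff[i] = 0;
    }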


    Low-pass filter on the IXP1200: parallel execution on the 200 MHz StrongARM and the microengines; 24 MB DRAM devoted to the StrongARM only; 8 MB DRAM and 8 MB SRAM shared. A test-filtering program on a regular PC determined the work distribution: 75% of the data comes from the block layer, and 56% of the processing overhead is due to DCT. Five-step algorithm: 1. the StrongARM receives a packet and copies it to the shared memory area; 2. the StrongARM processes headers and generates macroblocks (in shared memory); 3. the microengines read data and information from shared memory and perform the quality adjustment on each block; 4. the StrongARM checks whether the last macroblock has been processed (if not, go to 2); 5. the StrongARM rebuilds the packet. (Yamada et al. 2002)


    Segmentation of MPEG-2 data: a slice is a 16-pixel-high stripe; a macroblock is a 16 x 16 pixel square: four 8 x 8 luminance blocks and two 8 x 8 chrominance blocks, DCT-transformed with coefficients sorted in ascending frequency order.

    Data packetization for video filtering: 720 x 576 pixel frames at 30 fps give 36 slices of 45 macroblocks per frame. Each slice = one packet; for an 8 Mbps stream this is ~7 kbit per packet (8 Mbps / 30 fps / 36 slices ≈ 7.4 kbit). (Yamada et al. 2002)

    Multicast Video-Quality Adjustment (result graphs, repeated over several slides; Yamada et al. 2002)

    Multicast Video-Quality Adjustment: evaluation. Three scenarios were tested: StrongARM only: 550 kbps; StrongARM + 1 microengine: 350 kbps; StrongARM + all microengines: 1350 kbps. Real-time transcoding was achieved, but the rates are not enough for practical purposes; still, the distribution of workload is nice. (Yamada et al. 2002)

  • Parallelism, Pipelining & Workload Partitioning (INF5063: Programming Heterogeneous Multi-Core Processors)


    Divide and conquer: divide a problem into parts, but how? Pipelining; parallelism; hybrid (figures).


    Key considerations: system topology.
    Processor capacities: different processors have different capabilities.
    Memory attachments: different memory types have different rates and access times; different memory banks have different access times.
    Interconnections: different interconnects/busses have different capabilities.

    Requirements of the workload? Dependencies.
    Parameters? Width of the pipeline (level of parallelism); depth of the pipeline (number of stages); number of jobs sharing busses.


    Network processor example: "Pipelining vs. Multiprocessor" by Ning Weng & Tilman Wolf. On a network processor, pipelining, parallelism and hybrids are all possible. Packet-processing scenario: what is the performance of the different schemes, taking into account processing dependencies, processing demands, contention on memory interfaces, and pipelining and parallelism effects (experimenting with the width and the depth of the pipeline)?


    Simulations: the paper contains several application examples giving different DAGs, e.g., flow classification: classify flows according to IP addresses and transport protocols.
    System throughput is measured while varying all the parameters: number of processors in parallel (width); number of stages in the pipeline (depth); number of memory interfaces (busses) between each stage in the pipeline; memory access times.


    Results (number of memory interfaces per stage M = 1; memory service time S = 10):
    Throughput increases with the pipeline depth D: good scalability, proportional to the number of processors.
    Throughput increases with the width W initially, but tails off for large W: poor scalability due to contention on the memory channel.
    What is the efficiency per processing engine? (A toy model below reproduces the trend.)
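    A toy model (my own simplification, not Weng & Wolf's actual analysis) reproduces this trend: total work T per packet is split over D stages; each stage has W parallel engines sharing M memory interfaces of service time S, and each engine makes A memory accesses per packet. The slower of compute time and memory time sets the stage time, and W packets complete per stage time, so throughput saturates at M/(A*S) once the memory channel dominates.

    #include <stdio.h>

    int main(void)
    {
        const double T = 100.0; /* processing cycles per packet (assumed)   */
        const double S = 10.0;  /* memory service time, as on the slide     */
        const double A = 4.0;   /* memory accesses per engine (assumed)     */
        const double M = 1.0;   /* memory interfaces per stage, per slide   */

        for (int D = 1; D <= 8; D *= 2) {
            for (int W = 1; W <= 16; W *= 2) {
                double compute = T / D;         /* per-stage compute time   */
                double memory  = W * A * S / M; /* contended channel time   */
                double stage   = compute > memory ? compute : memory;
                printf("D=%d W=%2d throughput=%.3f packets/cycle\n",
                       D, W, W / stage);
            }
        }
        return 0;
    }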


    Lessons learned:
    Memory contention can become a severe system bottleneck: the memory interface saturates with about two processing elements per interface; off-chip memory accesses cause a significant reduction in throughput and a drastic increase in queuing delay; performance increases with more memory channels and lower access times.
    Most NP applications are of a sequential nature, which leads to highly pipelined NP topologies.
    Processing tasks must be balanced to avoid slow pipeline stages.
    Communication and synchronization are the main contributors to the pipeline stage time, next to the memory access delay.
    Topology has a significant impact on performance.


    Some References
    Tatsuya Yamada, Naoki Wakamiya, Masayuki Murata, Hideo Miyahara: "Implementation and Evaluation of Video-Quality Adjustment for Heterogeneous Video Multicast", 8th Asia-Pacific Conference on Communications, Bandung, September 2002, pp. 454-457.
    Daniel Bauer, Sean Rooney, Paolo Scotton: "Network Infrastructure for Massively Distributed Games", NetGames, Braunschweig, Germany, April 2002.
    J. R. Allen, Jr., et al.: "IBM PowerNP network processor: hardware, software, and applications", IBM Journal of Research and Development, 47(2/3), pp. 177-193, March/May 2003.

    Ning Weng, Tilman Wolf: "Profiling and mapping of parallel workloads on network processors", ACM Symposium on Applied Computing (SAC 2005), pp. 890-896.
    Ning Weng, Tilman Wolf: "Analytic modeling of network processors for parallel workload mapping", ACM Transactions on Embedded Computing Systems, 8(3), 2009.

    Li Zhao, Yan Luo, Laxmi Bhuyan, Ravi Iyer: "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005.

    Håvard Espeland, Carl Henrik Lunde, Håkon Stensland, Carsten Griwodz, Pål Halvorsen: "Transparent Protocol Translation for Streaming", ACM Multimedia 2007.


    Summary: TODO

    P2P is an old idea, but it has now been popularized by systems like Napster and Gnutella: machines act as both server and client at the same time; they share resources (data, CPU cycles, storage, bandwidth, ...); edge services move data closer to clients (e.g., stored at other nearby clients).
