DataSimilarity-AwareComputation
InfrastructurefortheCloud
YuHua,SeniorMember,IEEE,XueLiu,Member,IEEE,andDanFeng,Member,IEEE
Abstract—Thecloudisemergingforscalableandefficientcloudservices.Tomeettheneedsofhandlingmassivedataand
decreasingdatamigration,thecomputationinfrastructurerequiresefficientdataplacementandpropermanagementforcacheddata.Inthispaper,weproposeanefficientandcost-effectivemultilevelcachingscheme,calledMERCURY,ascomputationinfrastructureofthecloud.TheideabehindMERCURYistoexploreandexploitdatasimilarityandsupportefficientdataplacement.Toaccuratelyandefficientlycapturethedatasimilarity,weleveragealow-complexitylocality-sensitivehashing(LSH).Inourdesign,inadditiontotheproblemofspaceinefficiency,weidentifythataconventionalLSHschemealsosuffersfromtheproblemofhomogeneousdataplacement.Toaddressthesetwoproblems,wedesignanovelmulticore-enabledlocality-sensitivehashing(MC-LSH)thataccuratelycapturesthedifferentiatedsimilarityacrossdata.Thesimilarity-awareMERCURY,hence,partitionsdataintotheL1cache,L2cache,andmainmemorybasedontheirdistinctlocalities,whichhelpoptimizecacheutilizationandminimizethepollutioninthelast-levelcache.Besidesextensiveevaluationthroughsimulations,wealsoimplementedMERCURYinasystem.Experimentalresultsbasedonreal-worldapplicationsanddatasetsdemonstratetheefficiencyandefficacyofourproposedschemes.IndexTerms—Cloudcomputing,multicoreprocessor,cachemanagement,datasimilarity
Ç
1
INTRODUCTION
E
W
areenteringtheeraofthecloudthatcontainsmassiveandheterogeneousdata.ThedatasetshavethesalientfeatureofavolumeofPetabytesorExabytesanddatastreamswithaspeedofGigabitspersecond.Thesedatasetsoftenhavetobeprocessedandanalyzedinatimelyfashion.AccordingtoarecentInternationalDataCorpora-tion(IDC)study,theamountofinformationcreatedandreplicatedismorethan1.8Zettabytes(1.8trillionGigabytes)in2011[1].From1700responsestoSciencepoll[2],about20percentrespondentsoftenusemorethan100GBdatasets,over63percenthaveeveraskedcolleaguesfordatasharing.Unfortunately,abouthalfofthosepolledstorethedataonlyintheirownlabsduetonofundingtosupportarchiving.Moreover,fromsmallhand-helddevicestohugedatacenters,wearecollectingandanalyzingever-greateramountsofinformation.Somecommercialcompanies,likeGoogle,Microsoft,Yahoo!,andFacebook,generallyhandleTerabytesandevenPetabytesofdataeveryday[3],[4],[5],.Forthecomputationinfrastructureofthecloud,itisimportantandchallengingtoperformefficientprocessingandanalysisuponthesedata.
Thecomputationinfrastructuretypicallyconsistsofmulticoreprocessors.TheincreasingnumberofcoresonachipandthedifferentdegreesofdatasimilarityexhibitedwithintheworkloadspresentthechallengestothedesignofcachehierarchiesinChipmultiprocessors(CMPs).These
.Y.HuaandD.FengarewiththeWuhanNationalLaboratoryforOptoelectronics,SchoolofComputerScienceandTechnology,HuazhongUniversityofScienceandTechnology,Wuhan430074,P.R.China.E-mail:{csyhua,dfeng}@hust.edu.cn.
.X.LiuiswiththeSchoolofComputerScience,McGillUniversity,MontrealH3A0E9,Canada.E-mail:xueliu@cs.mcgill.ca.Manuscriptreceived7Sept.2012;revised10Feb.2013;accepted24Apr.2013;publishedonline3May2013.
Forinformationonobtainingreprintsofthisarticle,pleasesende-mailto:tc@computer.org,andreferenceIEEECSLogNumberTCSI-2012-09-0642.
0018-9340/14/$31.00ß2014IEEE
includetheorganizationandthepoliciesassociatedwiththecachehierarchytomeettheneedsofsystemperfor-manceimprovementsandscalability.Cacheorganizationpresentsmultiplelevelsinthecachehierarchyaswellasthesize,associativity,latencyandbandwidthateachlevel.Suitablepolicieshelpminimizethelatencytofrequentlyaccesseddata[6],[7],[8],[9].Moreover,CMPsareprevalentthesedays.VendorsalreadyshipCMPswithfourtotwelvecoresandhavetheroadmapstoreleasehundredsofcorestothemarket.Forexample,inthecommercialmarkets,thereareTileraTILE64,AmbricAm2045andNvidiaGeForceGT200.Theyarewidelyusedincloudapplications.Despitethepopularity,itisstilladauntingtasktoaccuratelyandefficientlyperformthemulticorecachingforhighperformancecloudsystems.Thefocusofthispaperistooptimizethedataplacementofthemultilevelcachehierarchy(e.g.,L1,L2cachesandmainmemory)toimprovetheoverallcloudsystemperformance.
Efficientcachehierarchyinthecloudneedstoanswerthequestions,suchas“howtosignificantlyimprovethecacheutilizationandhowtoefficientlysupportthedataplacement?”Theseproblemsaremoredifficultandchallengingtoaddress,especiallyinthecaseoflargecorecount.Specifi-cally,weneedtoaddressthefollowingchallenges.
Challenge1:inconsistencygapbetweenCPUandoperatingsystemcaches.TobridgethespeedgapbetweenCPUandmemory,CPUcaches(e.g.,L1andL2)andoperatingsystem(OS)buffercachearewidelyusedinamultilevelcachehierarchy.SincetheCPUcachesareatthehardwarelevel,whilethebuffercacheisapartoftheOS,thesetwolayersareconventionallydesignedindependentlywithouttheawarenessofeachother.Thispossiblyworksforsmall-scalesystems.However,withtheincrementsofmulticoreamountsandincreasinglylargecapacityofmainmemory,severeperformancedegradationmayoccuroncetheinconsistencygapexists.Thesetwolayershence
PublishedbytheIEEEComputerSociety
4IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
becomeinefficienttoworkcooperatively.Moreover,byleveragingasharedcache,athread,whichcooperativelyworkswithmultiplecorunningthreads,caninfluenceeachother.Thisgenerallyleadstosevereperformancedegra-dation.Inthenearfuture,acachewillbesharedbymanycores,andthegapmaydegradetheperformanceevenmoreseriously[10],[11].
Challenge2:performancebottleneckshiftinhighperformancecloudsystems.Multicore-basedhardwareadvancementsbringnewchallengestothedesignandtheimplementa-tionofhigh-performancecloudsystems[12].ThisisbecausetheperformancebottleneckhasbeenshiftedfromslowI/Oaccessspeedstohighmemoryaccesslatency.Theperformancebottleneckofacceleratingtheexecutioniscorrelatedwiththeplacementproblemofcacheddata.Theoptimizationofcacheddataplacementhencebecomesimportanttoimprovetheoverallcloudsystemperfor-mance.Unfortunately,theexistingpoliciesinthemulticoreprocessorsbecomeneitherefficientnorscalabletoaddressthedataplacementproblem.Toefficientlyaddressthisproblem,weneedtocarefullyexploreandexploitthedatasimilaritythatgenerallyhidesbehindaccessbehaviors.Wealsoneedtooptimizethecapacityutilizationofaprivatecache,whilealleviatinguncontrolledinterferenceinasharedcache.
Challenge3:exacerbationoftheLLCpollution.Thelast-levelcache(LLC)[6]isdynamicallysharedamongthecoreswhileeachcorehasitslowestlevelofthecachehierarchy.Cachepollutionreferstothereplacementofacacheelementbyalessusefulone.Itoccurswhenanon-reusablecachelineisinstalledintoacacheset.Theinstalledlinedisplacesareusablecacheline.ToalleviatetheLLCpollution,conventionalapproacheshavethepremisethatrecentorderingservesasthegoodpredictionforsubsequentbehaviorsofcacheaccesses[10],[13].Inpractice,althoughleveragingtheaccesspatternshelpspredictfutureaccesses,thecacheshavetoinstallallcachelinesthatareaccessed.Sinceperformingtheidentificationontheaccesspatternsincursheavytemporalandspatialoverheads,theexistingapproachesgenerallydemonstrateunsatisfactoryperfor-mance.LonglatencyandinformationstalenessfurtherexacerbatetheLLCpollution.Whatweneedisanewschemethatsimplifiestheidentificationofaccesslocalitywithoutthelossofaccuracy.
OurproposedMERCURYalleviatesthelimitationsinthehardwaresolutionsandtheOS-basedschemes.Therationalecomesfromtheobservationthatperformingthestatemaintenanceandreferencepatternanalysisatpagegranularitygenerallyincurslessoverheadthanatblock[6],[10],[13].Moreover,learningdynamicreferencepatternsatpagegranularityrequireslessstateandstoragespacecomparedwiththealreadystudiedblock-grainpolicies.Ourresearchworkhenceisrelatedwithtwoareas:systemarchitectureanddata-intensivecloud.Thetwoareasaretraditionallydistinct,butthegaponcommonsystemconcernsbetweenthemhasbeennarrowedrecently.Thesimilarity-awareMERCURYmeetstheneedsofsuitabledataplacementinthemultilevelcachehierarchy.WeimplementMERCURYandmanagethesimilarityatagranularityofpagesbyleveragingtheoperatingsystemmechanisms.MERCURYiscompatiblewiththeexistingcloudcomputingsystemsandcanfurtherimproveuponthembyprovidingascalableandefficientcachingscheme.
MERCURYplaysasignificantandfruitfulroleinmanagingthemultilevelcachehierarchy.Specifically,wemakethefollowingcontributions.
First,(forChallenge1),tonarrowtheinconsistencygapandquantifythedatacorrelation,MERCURYemploysmultitype,ratherthanconventionalhomogeneous,mem-bershipmanagement.Here,themembershiprefersthatanitembelongstoagivendataset.Thedatainthesimilarity-awaremulticorecachesarejudiciouslyclassifiedintothreetypes,i.e.,Family,Friend,andForeigner,torespectivelyrepresentfrequentlyaccessedandcorrelated,frequentlyaccessedbutnotcorrelated,andinfrequentlyaccessedmemberships.Toguaranteethedataconsistencyandintegrity,wefurtherquantifythesemembershipsusinganewcodingtechnique.Second,(forChallenges2and3),toaddresstheperformancebottleneckandalleviatetheLLCpollution,MERCURYexploresandexploitstheaccesslocalitybyimprovingamulticore-enabledlocality-sensitivehashing(MC-LSH).TheMC-LSHusesaself-containedandspace-efficientsignaturevector,ratherthanmanyhashtablesinastandardlocality-sensitivehashing(LSH),toaccomplishthesignificantspacesavingsandmeanwhileaccuratelymea-surethedatasimilarity.SinceMERCURYminimizescacheconflictsandreducestheamountsofthemigrateddata,itsignificantlyreducesthelow-speedmemoryaccesses.MERCURYcanaccuratelyidentifythedatasimilarityandmitigatethestalenessofcacheddatatomeettheneedsofhigh-performancecloudsystems.
Third,wehaveimplementedthecomponentsandthefunctionalitiesofMERCURYinasoftwarelayer,whichiscompliantwiththeexistinghardwaredevices.Tofurtherexamineandevaluatetheefficacyandefficiencyoftheproposedscheme,wenotonlyexamineMERCURYinamulticoresimulation[7],[14],[15],butalsoimplementitinaproductionsystembypatchingPostgreSQL[16].Theextensiveexperimentsusereal-worldtracesanddatasets,andexaminetheperformanceinmultipleevaluationmetrics.Theremainderofthispaperisorganizedasfollows:Section2presentsthedatasetsanalysisandproblemstatement.Section3describestheproposedMERCURYarchitectureandcachingschemes.Section4showsthecacheddatamanagementschemesinthemultilevelhierarchy.Sections5and6,respectively,demonstratetheperformanceevaluationresultsinsimulationsandimplementations.WepresenttherelatedworkinSection7.Finally,weconcludeourpaperinSection8withsummariesoffindings.
2PROBLEMSTATEMENT
Inthissection,wefirststudyworkloadcharacteristicstoshowtheexistenceofdatasimilarityanddemonstrateitsperformanceimpactsoncachingschemes.Wealsopresenttheproblemstatementandbasicideasofourwork.
2.1AnalysisofReal-WorldTraces
Itiswellrecognizedthatthepropertyofdatasimilarityishelpfultoperformanefficientandscalablecaching[7],[11],[17],[18],[19],[20],[21].MainbenefitsincludethroughputimprovementsandthereductionoftheLLCcachemissrates,querylatency,anddatamigrationoverheads.Hence,themotivationofMERCURYdesigncomesfromthe
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD5
Fig.1.Averagelocalityratiosfromreal-worlddatasets.
observationsofdatasimilaritywidelyexistinginthereal-worldapplications.Furthermore,wepresentthedefinitionofdatasimilarity.Fortwodatawithpointrepresentationsasaandb,weassumethattheyhaved-dimensional
attributesthatarerepresentedasvectorsa~dand~bd.Ifthe
geometricdistancebetweenvectorsa~dand~bdissmallerthan
apredefinedthreshold,theyaresimilar.Thedatasimilarityoftenhidesbehindthelocalityofaccesspatterns[8].
Westudytypicallarge-scaleapplications[22],[23],[24]andthemainbenchmarksfromtheSPEC2000evaluation[25],i.e.,175.vprand300.twolf.Thepropertiesofusedtracesanddatasetsarelisted:
1.
TheCoverTypedataset[22]contains581,012datapoints,eachofwhichhas54Dattributes.
2.TheEECSNFSserveratHarvard[23]collectedI/O
accesses.Thisdatasetcontainsconcurrentrequestswithatotalof4.4millionoperations.
3.TheHPfilesystemprovidesa10-day500-GBtrace
[24]thatrecordstheaccessesfrom236users.
4.The175.vprand300.twolfbenchmarksshowthe
CPUperformanceintheSPEC2000evaluation[25].175.vprleveragescombinatorialoptimizationtech-niquetoautomaticallysynthesizethemappedcircuits.The300.twolfmakesuseofTimberWolfSCplacementandglobalroutingpackage.
Toobtainexplicitdemonstration,weintensifytheabovetracesandbenchmarksintolargerscalesbyacombinationofspatialscale-upandtemporalscale-up.Specifically,thescale-upmethodneedstofirstdecomposeatraceintosubtraces,wherethetimingrelationshipsarepreservedtofaithfullymaintainthesemanticdependencesamongtracerecords.Thesesubtracesarereplayedconcurrentlybysettingthesamestarttime.Notethat,thecombinedtracemaintainsthesamehistogramofsystemcallsastheoriginaltrace,butpresentsaheavierworkload(higherintensity).Asaresult,datacanbebothspatiallyandtemporallyscaledupbydifferentfactors,dependinguponthenumberofsubtracesreplayedsimultaneously.Weintensifyexperi-mentaldatatobescaledupto2,000millionaccesses.Fig.2.Problemdescription.
Weuselocalityratioasameasuretorepresentthelocalityintheaccesspattern.Fig.1showstheresultsoflocalityratiothatisthepercentageofthetimesaccessingdatawithindefinedtimeintervaltothoseintheentiredataset.Thetimeintervalcomesfromtheusedtraces,inwhichallaccessesarelistedintheorderoftime.Forinstance,adatasetcontainsa20-hourtracerecordandweselecta25percentinterval,i.e.,5-houraccessrecord.Forafile,ifithas8accesseswithinarandomlyselected25percenttimeinterval,andtheaccessedtimesduringtheentirerunningtraceisalso8(i.e.,allaccessestothisfileoccurwithinthis25percentinterval),thelocalityratiobecomes8=8¼100%.Weobservethatthereexistsastrongdataaccesslocalitywithincertainnumberofinstances.Theobservationsalsoconformtotheconclusionsin[7].Accordingtoourexperimentalresultsandobservations,similardatagenerallydemonstratethelocalityofaccesspatterns.Iftheyareplacedtogether,wecanimprovecacheutilizationanddecreasethecomplexityandexecutioncostsofdataaccessoperations.
2.2BasicIdea
Thehardwaredesignofcloudcomputationinfrastructurestillworksforscenariostheyaredesignedfor,butthelackofflexibilitycanbeanunavoidableissueandinherentweakness,particularlyformulticoreormany-coreproces-sorswithanincreasinglylargenumberofcores.Incontrastcacheoptimizationandcacheresourcemanagementatdifferentlevelsofsoftware,suchasoperatingsystems,compilers,andapplicationprogramshaveshowntheireffectivenesstoaddressthelimitationsofhardwaresolu-tions.Withasoftwareapproach,long-termmemoryaccesspatternsofmostcloudapplicationscanbeanalyzedorpredicted,cachemanagementandoptimizationdecisionscanbemademoreeffectively.TherehavebeenseveralsuccessfulexamplesonuniprocessorswithsimpleLRUcachereplacementpolicies[18],[26],[27],[28].However,usingasoftwareschemetomanagecacheresourcesinmulticoreprocessorsismuchmorechallengingthaninuniprocessors,sincehardwareresourcesarealmostnotsharedandcoordinatedformultithreads.ResearchershaveevaluatedOS-basedcachepartitioningmethodsinmulti-coreprocessorswithoutanyadditionalhardwaresupport[29].TheOS-basedmethodofferstheflexibilityinimple-mentingvariousresourceallocationpolicies.However,sincehardwarecachesarenotinthescopeofOSmanage-ment,theOS-basedmethodscaninevitablycausenontrivialsoftwareoverhead,andarenotreadytobeusedascomputationinfrastructureofthecloud.
Toofferefficientcomputationinfrastructureforthecloud,weinvestigatetheproblemofthecacheddataplacementinthemultilevelcachehierarchywhenexecutingmultiple
6IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
parallelinstancesasshowninFig.2.Wetermthisproblemas“cache-member,”whichdeterminesthedatamembershipsineachcachebasedonthegivenconstraints.Theconstraintsincludemigrationcostsanddataaccesslatency.
Fig.2ashowsthecache-memberproblem.Specifically,weneedtofirstidentifyandaggregatesimilardataintothesameoradjacentprivateL1caches,andthenallocatethedataaccessedbymorethanonecoreintoasharedL2cache.We,hence,canmanagethecacheddatainbothL1andL2caches.Moreover,anidealmulticorearchitectureisscalableandflexibletoallowdynamicandadaptivemanagementonthecacheddata.Thepremiseistoaccuratelycapturethesimilardata[7],[17],whichunfortu-natelyisnontrivialduetoexpensiveoperationcostsofcomparingarrivingdatawithalltheexistingcachelines.Weidentifytheproblemofhomogeneousdataplacementthatoverlooksthedistinctpropertiesandmultitypemem-bershipsofcacheddata.Toalleviatethehomogeneousdatamanagement,weleverageadifferentiatedplacementpolicy,inwhichthecachemembershipsareclassifiedintothreetypesasshowninFig.2b.WeplacefrequentlyaccessedandcorrelateddataintoL1cache,called“in-cacheFamily,”frequentlyaccessedbutlooselycorrelateddataintoL2cache,called“shared-cacheFriend”andinfrequentlyaccesseddataintomainmemory,called“in-memoryForeigner.”Inthisway,wecandifferentiatethestrengthofaccesslocalitytofacilitatetheefficientplacementofcacheddata.
Inpractice,capturingdatasimilarityistimeconsumingandcomputation-intensiveworkduetohighdimensionsandheterogeneoustypes.Hence,toaccomplishasuitabletradeoffbetweensimilarityaccuracyandoperationcom-plexity,weproposetouseahash-basedapproach,forexample,theLSH[30],duetoitslocality-awarepropertyandeaseofuse.TheLSHcanidentifyandplacesimilardatatogetherwithlowcomplexity.Therationaleisthatsimilardatacontainstronglocalitytomatchaccesspatternsofmultiplethreads.TheLSH-basedscheme,thuscanimprovesystemperformance.Unfortunately,itiswellrecognizedthatastandardLSHsuffersfromheavyspaceoverheadduetotheuseoftoomanyhashtables[31],[32],[33].Moreover,dataplacementpolicydependsuponbothaccessfrequencyandcorrelation,whichiscurrentlydifficulttoberepre-sentedquantitativelyandmeasuredaccurately.
ThebasicideabehindMERCURYistoleveragetheMC-LSHtoidentifysimilardataandcarryoutdifferentiateddataplacement.MERCURYrepresentsthestrengthofdatasimilarityasFamily,Friend,andForeigner,respectively,asshowninFig.2b.Specifically,theprivateL1cachescontainFamilymembers,whicharetightlycorrelatedandfrequentlyuseddatatofacilitatethefastaccessandmaintaintheaccesslocalityineachcache.Furthermore,asharedL2cachecontainsFriendmembers,whichinfactconsistoftwoparts.OneisthedatafrequentlyaccessedbymultiplecoresandtheotheristhedataevictedfromcorrelatedL1cachesduetospacelimitationorstaleness.Finally,themainmemorycontainsForeignermembersthatarenotincludedintheL1orL2caches.Differentiateddataplacementcomprehensivelyconsidersboththestrengthofdatasimilarityandaccessfrequency,whileallowingtheflexibleadjustmentstosupportdynamicoperations(e.g.,insertion/deletion).The
Fig.3.MERCURYmulticorecachingarchitecture.
similardatathatareplacedcloselycanalsosignificantlyreducethemigrationcosts.MERCURY,hence,offersscalable,flexible,andload-balancedcachingschemesinamultilevelcachehierarchy.
MERCURYisimplementedinahybridschemetoaddressthelimitationsofbothhardwaresolutionsandOS-basedmethods.Specifically,ourmulticoreshared-cache-managementframeworkconsistsoftwolowcostandeffectivecomponents:alightweightmechanismforallocatingcacheresourcesandprovidingcacheusageinformation;andOS-basedresourceallocationpoliciesfordynamiccacheallocation.Withasimpleandlow-overheadcomponent,weenabledirectOScontroloversharedcaches,andthesoftwaresystemoverheadisminimized.WithanOS-basedmanagement,weareabletodesignandimple-mentmultiplepoliciestodealwithcomplicated,difficultcachingscenariosinmulticoresystems.
3MERCURYARCHITECTURE
MERCURYusestheMC-LSHtoidentifysimilardataandleveragesanLRUreplacementineachcachetoupdatestaledata.Fig.3showstheMERCURYarchitectureinthemultilevelhierarchy.WeassumethateachcorehasoneprivateL1cacheandallprocessorcoresshareanL2cache.TheMERCURYschemeistightlyassociatedwithtwoparts.Oneistheprocessorarchitectureandtheotheristheoperatingsystem.Furthermore,toexplicitlyrepresentthedifferentiatedmembershipsidentifiedbytheMC-LSH,weusedifferentflagstolabeleachcachelineandobtainholisticoptimizationinthemultilevelcachehierarchy.
3.1CachesinaMulticoreProcessor
ThecachingschemesinamulticoreprocessorincludeL1andL2cachemanagement,andvirtual-physicaladdresstranslation.
L1cachemanagement.Eachcorehasoneassociatedcachethatcontainsfrequentlyvisiteddatatoincreasetheaccessspeedanddecreasetherequiredbandwidth.Weneedtoupdatethestaleandinfrequentlyaccesseddata.
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD7
L2cachemanagement.TopartitionthesharedL2cache,weleveragethewell-knownpagecolor[34]duetoitssimplicityandflexibility.PagecoloringisanextensivelyusedOStechniqueforimprovingcacheandmemoryperformance.Aphysicaladdresscontainsseveralcommonbitsbetweenthecacheindexandthephysicalpagenumber,whichisindicatedasapagecolor.Onecandivideaphysicallyaddressedcacheintononintersectingregions(cachecolor)bypagecolor,andthepageswiththesamepagecoloraremappedtothesamecachecolor.AsharedcacheisdividedintoNcolors,whereNcomesfromthearchitecturalsettings.ThecachelinesarerepresentedbyusingoneofNcachecolors.Weassignthecachecolorsofthevirtualpagesbyusingthevirtual-to-physicalpagemapping.
Addresstranslation.Theaddresstranslationcantranslatethevirtualaddressintothephysicaladdressbyreadingpagetable.ThecachecoloristightlyassociatedwiththenumberofpagecolorsintheL2cache.AvirtualTag(v-Tag)helpstoidentifysimilardatabyusingtheresultsfromtheMC-LSHcomputation.
3.2OperatingSystem
TheoperatingsystemfunctionalitiessupporttheMC-LSHcomputationandupdatethelocality-awaredata.
MC-LSH.AstandardLSHhelpsidentifysimilardataandunfortunatelyincursheavyspaceoverhead,i.e.,consumingtoomanyhashtables,toidentifythelocality-awaredata.Thespaceinefficiencyoftenresultsintheoverflowingfromalimited-sizecache.MERCURYproposestouseanMC-LSHtoofferefficiencyandscalabilitytothemulticorecaching.Specifically,theMC-LSHusesaspace-efficientsignaturevectortomaintainthecacheddataandutilizesacodingtechniquetosupportadifferentiatedplacementpolicyforthemultitypedata.WewilldescribethedesigndetailsoftheMC-LSHinSection4.
Updatinglocality-awaredata.Toexecutefastandaccurateupdates,akeyfunctioninMERCURYistoidentifysimilardatawithlowoperationcomplexity.Inpractice,manyhigh-performancecomputingapplicationsdemonstratetheiden-ticaldataatthesamevirtualaddress,butdifferentphysicaladdresses[7].Allrelevantvirtualaddressesthusneedtobemappedtothesamecacheset.WemakeuseoftheMC-LSHtoidentifysimilardataandavoidbrute-forcecheckingbetweenarrivingdataandallvalidcachelines.Thesimilardataarethenplacedinthesameorclose-bycachestofacilitatemulticorecomputationandefficientlyupdatedata.Sincethecacheddataarelocality-aware,MERCURY,hence,decreasesmigrationcostsandminimizescacheconflicts.Tosatisfyqueryrequestsandprovideflexibleuse,wedesignaninterfacebetweenhigh-performanceapplicationsandoperatingsystemasshowninFig.3.Itsmainfunctionistowraphigh-leveloperationrequeststolow-levelsystemcallswiththeaidofthepagecoloringtechnique[34].Pagecolormanagesthebitsbetweenthecacheindexandthephysicalpagenumberinthephysicalmemoryaddress.Specifically,theapplicationsneedtospecifytherequiredspaceintheirrequests.Therequestshelpdecidehowtopartitionavailablecachespaceamongqueryrequests.Queryexecutionprocessesindicatepartitioningresultsbyupdatingapagecolortable.Theoperatingsystemthenreadsthepagecolortabletoknowthecachepartitionsamongthequeryrequests.
Althoughtheoperatingsystemcannotdirectlyallocateon-chipcachespace,itcanmakeuseofvirtual-physicaladdressmappingtocontrolhowtoallocatepagesinthemainmemory.Thememorypagesofthesamecolorcanbemappedtothesamecacheregion.Toefficientlypartitionthecachespace,weallocatedifferentpagecolorstomemorythreads.MERCURYcanhenceleveragethepagecoloringtechniquetocompletecachepartitioningamongdifferentprocessesandsupportthequeries.
4CACHEDDATAMANAGEMENTINMERCURY
Tocapturethedatasimilarity,weproposeanMC-LSHdesigninMERCURY.Aspace-efficientsignaturevectorandasimplecodingtechniquehelpmaintainandrepresentthemultitypememberships.WefinallydescribetheschemeofupdatingdatainMERCURY.
4.1TheMC-LSHScheme
TheMC-LSHisamulticore-enabledschemethatconsistsoftheLSH-basedcomputation,asignaturevectorstructureandthemultitypemembershipcodingtechnique.Itoffersadeterministicmembershipforeachdataitem.Comparedwiththeconventionalclassificationschemesforexactresults,theMC-LSHprovidesanapproximateandfastschemetoobtainsignificanttimeandspace-savings.TheMC-LSHemploystheLSHfunctionstoidentifysimilardatabasedontheaccesspatterns.Toaddresstheproblemofspaceinefficiency(i.e.,toomanyhashtables)inthestandardLSH,weemployasignaturevectorstructure.Furthermore,toofferdifferentiateddataplacement,weuseamultitypemembershipcodingtechnique.
LimitationsofthestandardLSH.AnLSH[30]capturessimilardatabyallowingthemtobeplacedintothesamehashbucketswithahighprobability.
Definition1.GivenadistancefunctionkÃk,adatadomainS,andsomeuniverseU,anLSHfunctionfamily,i.e.,IH¼fh:S!UgiscalledðR;cR;P1;P2Þ-sensitive,iffor8p;q2S:.Ifkp;qk RthenPrIH½hðpÞ¼hðqÞ!P1,.Ifkp;qk>cRthenPrIH½hðpÞ¼hðqÞ P2,wherec>1andP1>P2.
InIH,ha;bðvÞ¼baÁvþb
!c.aisad-dimensionalrandomvectorwithchosenentriesfollowingans-stabledistributionandbisarealnumberchosenuniformlyfromtherange½0;!Þ,where!isaconstant.ByusingtheLSHfunctions,similardatahaveahigherprobabilityofcollidingthanthedatathatarefarapart[35].
AlthoughtheLSHhasbeenrecentlyusedinmanyapplications,itisdifficulttobeusedinthemulticoresystemsduetoheavyspaceoverheadandhomogeneousdataplacement.Theselimitationshaveseverelyhamperedtheuseofthemulticorebenefitsforhigh-performancesystems.Unliketheexistingwork,MERCURYenablestheLSHtobespace-efficientbyleveragingsignaturevectors.Space-efficientsignaturevector.TheMC-LSHleveragesspace-efficientsignaturevectorstostoreandmaintainthelocalityofaccesspatterns.Specifically,asignaturevectorisanm-bitarraywhereeachbitisinitiallysetto0.TherearetotallyLLSHfunctions,gið1 i LÞ,tohashadatapoint
8IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
Fig.4.Signaturevectorformaintainingpage-leveldatasimilarity.
intobits,ratherthanitsoriginalbucketsinhashtables,tosignificantlydecreasespaceoverhead.Adatapointasaninputofeachhashfunctiongiismappedintoabitthatisthussetto1possiblymorethanonceandonlythefirstsettingtakeseffect.
AsignaturevectorisabletomaintainthedatasimilarityasshowninFig.4.Acentralizedbitisthebitthatreceivesmorehitsthanitsleftandrightneighbors.Thehitnumbersasshowninthisfigurearealsomuchlargerthanapredefinedthresholdvalue.ThecentralizedbitsbecomethecentersofcorrelateddataandarefurtherselectedtobemappedandstoredintheL1caches.Whenhashingdataintothesignaturevector,wecountthehitnumbersofbitsandcarefullyselectthecentralizedbits.Moreover,thethresholddemonstratestheclusteringdegreeofdatadistribution,thusdependingupontheaccesspatternsofthereal-worldapplications.Afterselectingthecentralizedbits,wecanconstructamappingbetweenthecentralizedbitsandL1cachestofacilitatethedataplacement.ItisworthnotingthatthenumberofcentralizedbitsisunnecessarilyequaltothatoftheL1caches.IfthenumberofcentralizedbitsislargerthanthatofL1caches,anL1cachemaycontainthedatafrommorethanoneadjacentcentralizedbits.
TheMC-LSHcomputationcanguaranteesimilardatatobehashedintoonebitwithveryhighprobabilitythathoweverisnot100percent,meaningthatsimilardataarestillpossibletobeplacedintoadjacentbits.Falsenegative,hence,occurswhenthehitbitis0andoneofitsneighborsis1.Toavoidpotentialfalsenegatives,asimplesolutionistocheckextraneighboringbitsbesidesthehitone.Althoughextracheckingonneighboringbitspossiblyincursfalsepositives,inpractice,amissfromthefalsenegativegenerallyincursthelargerpenaltythanthefalsepositive.Areasonablesizeofcheckingextrabitsisacceptabletoobtainasuitabletradeoffbetweenfalsenegativesandfalsepositives.MERCURYprobesmorethanonehitbit,i.e.,checkingleftandrightneighbors,besidesthehashedbit.Notethat,theextracheckingoccursonlywhenthehitbitis“0”.OurresultconformstotheconclusionofsamplingdatainthemultiprobeLSH[31].
Toefficientlyupdatethesignaturevectors,MERCURYoffersscalableandflexibleschemesbasedonthechar-acteristicsofthereal-worldworkloads.Specifically,iftheworkloadsexhibitanoperation-intensive(e.g.,write-intensive)characteristic,wecancarryouttheoperationsonthesignaturevectorsandallowthe(re)-initializationin
Fig.5.Differentiatedmembershipcodingtechnique.
theidletime.Moreover,iftheworkloadsbecomeuniform,MERCURYmakesuseof4-bitcounters,ratherthanbits,asthesummaryofBloomfilters[36].Eachindexedcounterincreaseswhenaddinganitemanddecreaseswhenremovinganitem.Inpractice,a4-bitcountercansatisfytherequirementsformostapplications.
Multitypemembershipcoding.ThemembershipsintheMC-LSHincludeFamily,Friend,andForeigner,whichrespectivelyrepresentdifferentsimilaritiesamongcacheddata.TheMC-LSHidentifiesdatamembershipsandplacesdataintoL1cache,L2cache,ormainmemory,respectively.OnekeyissueinthedataplacementishowtodeterminewhetherthehitsinmultipleLSHvectorsindicateasinglecache.Toaddressthisproblem,weuseacodingtechniquetoguaranteemembershipconsistencyandintegrity.
WeuseanexampletoillustratethedifferentiatedmembershipcodingasshowninFig.5.Givenanitem,wefirstcomputeitshashedvaluesbyusinghashfunctionsinthesignaturevectortodeterminewhetheritiscorrelatedtooneoftheexistingL1caches.Basedontheconclusionin[31],ifthehitbitisanyofthecentralizedbit,itsleftandrightneighbors,theitemisconsideredtobecorrelatedwiththecorrespondingcacheandfurtherobtainsanM¼1indicator(i.e.,inthememory),togetherwithFF(Family/Friend)code(e.g.,thelocation)ofthatcentralizedbitinanLSHarray.
Weconstructamappingtablebetweenarrivingdataandmulticorecachestofacilitatedifferentiateddataplacement.IfallMindicatorsfromLLSHarraysshow1foranitembyusingabit-basedANDoperation,wedeterminethatthisitemiscorrelatedwithmulticorecachestoexecutefurthercheckingonthedatamappingtable.Otherwise,theitemisnotconsideredtobecorrelatedanddirectlyinsertedintothemainmemory.ThecheckingonthetableallowstodeterminewhethertheitemisaFamilyorFriend.Sinceperformingdirectsearchingontheentiretableconsumestoomuchtime,wefirsthashtheconcatenatedcodeofthatitemintoastandardBloomfilter[37]thathasalreadystoredthecodeindicators.Ifahitoccurs,wecontinuetoperformthecheckingonthemappingtable.Otherwise,theitemisconsideredasaFriend,andtheninsertedintothesharedL2cache.Furthermore,sincethetablecontainstoomanycodeindicators,thelinearlybrute-forcesearchingwillleadtounacceptablecosts,possiblybecomingtheperformancebottleneck.Toaddressthisissue,wemakeuseofahashtabletomaintainthesecodeindicatorsanddecreasethesearchinglatency.Whenahitoccursinthemappinghashtable,weinsertthisitemintothecorrespondingL1cache.
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD9
4.2UpdatingData
InthemultilevelhierarchyofMERCURY,weneedtoupdatecacheddataandtheirmembershipsinthesignaturevector.
Forupdatingcacheddata.Toupdateactualdata,wemakeuseofalabel-basedtechniquetoupdatestaledatainmultilevelcaches.Thereasoncomesfromthefactthatsimilardataarepotentiallyreusedbycorrespondingcachesinthenearfuture.Todecreaserecachingcosts,wetemporarilylabelstaledataforcertaintime.Whenthetimeexpires,weupdatethecachesandreplacetheselabeledstaledata.Moreover,theL1cachesbelongingtomultiplecorespossiblycontaindifferentamountsofsimilardata.PerformingtheloadbalancewithinmultipleL1cachesishenceimportanttoobtainperformanceimprovements.Duetothelimited-sizecapacityineachL1cache,MERCURYtemporarilyplacesexcess,butcorrelateddataintothesharedL2cache.ThesecorrelateddatahavebeeninsertedintocorrespondingcountingBloomfilters[37].InthesharedL2cache,welabelthedatabyusingpagecolorsofthecorrelatedcorestoupdatecaches.OncefreespaceisavailableinanL1cache,MERCURYreloadstheselabeleddataintothecorrespondingL1cache.
TheoperationsofupdatingdataareactuallyamultilevelmigrationprocessfromtheL1cache,thentheL2cache,finallytothemainmemory.Theworkflowstepsaredescribedbelow.
1.
UpdatingcacheinMERCURYneedstoreplacestaledatainbothL1andL2caches,whileguaranteeinghighhitratesandlowmaintenancecosts.MERCURYmakesuseoftheMC-LSHtoidentifysimilardatathatarethenplacedintotheL1caches.
2.TheL1cachesemploythesimpleLRUreplacement
toupdatestaledata.
3.WhenthedataintheL1cachesbecomestale,they
aretransferredintothesharedL2cacheamongmultiplecores.
4.WhenthedataintheL2cachebecomestale,they
movetothemainmemory.
Forupdatingmemberships.Toupdatethedatamember-shipinthesignaturevectors,weleveragecountingBloomfilterstofacilitatethedatadeletionandmaintainthemembershipofthedatathathavebeenidentifiedtobecorrelatedandplacedintothecorrespondingL1caches.ThecountingBloomfiltershelpmaintainthemembershipofcacheddatainaspace-efficientway,carryouttheinitializationoftheL1cachesandkeeptheloadbalanceamongmultipleL1caches.EachcountingBloomfilterisassociatedwithoneL1cache.
WhenanitemisinsertedintotheL1cache,itismeanwhileinsertedintothecountingBloomfilter,inwhichthehitcountersareincreasedby1.SinceeachcountingBloomfilteronlyneedstomaintaintheitemsexistinginthecorrespondingL1cacheandthenumberofstoreddataisrelativelysmall,thusnotrequiringtoomuchstoragespace.Moreover,whendeletinganitem,thehitcountersaredecreasedby1.Ifallcountersbecome0,meaningthattherearenocacheddata,weinitializetheassociatedcachesbysamplingdatatodeterminethelocality-awarerepresenta-tioninthesignaturevector.Notethat,thesizeofasignatureTABLE1
SimulationParameters
vectordependsonnotonlytheamountsofdatatobeinserted,butalsotheirdistribution.We,hence,leveragewell-recognizedsamplingmethods[31],[35],[38],[39]toobtainthesuitablesize.
5PERFORMANCEEVALUATION
ThissectionpresentstheperformanceevaluationofourproposedschemebydescribingsimulationframeworkandexaminingthescalabilityofMERCURYcomparedwiththestate-of-the-artwork.
5.1ExperimentConfiguration
WeusesimulationstudyprimarilyfortheevaluationofMERCURY’sscalability.OursimulationisbasedonPolyScalarthatiswidelyusedinthemulticoresimulation[7],[14],[15].WeaddpagetablesintoPolyScalarforeachprocesstoenhanceitsvirtual-to-physicaladdresstransla-tionfunctionality.WefurtherimprovePolyScalarbyaddingthesimilarity-awarefunctionalitiesthatarede-scribedinSections3and4.ThesizeofeachOSpageis8KB.Sinceourstudyfocusesonthelast-levelcache(L2cache)thathasstronginteractionwiththemainmemory,weextendPolyScalartosimulateDDR2DRAMsystems.Thesimulatedmemorytransactionsarepipelined.MERCURYleveragestheMC-LSHtoidentifysimilardatathatareplacedintoL1andL2caches,respectively,withanLRUreplacementpolicy.Specifically,eachprocessorhasitsownprivateL1cache.AnL2cacheissharedbymultiplecores.WeevaluatethescalabilityofMERCURYbyincreasingthenumberofcores.InthepagecolorpolicyoftheL2cache,eachcorehaseightcolorsandeachcolorhas128cachesets.We,hence,allocate1-MBcachefor4-coresystem,2-MBcachefor8-coresystem,and4-MBcachefor16-coresystem.Table1showstheparametersettingsinthesimulations.
TheusedtracesanddatasetsincludeForestCoverTypedataset[22],EECSNFSserveratHarvard[23],HPfilesystemtrace[24],and175.vprand300.twolfinSPEC2000[25].Moreover,byusingtheproposedsamplingapproach[31],[35],[38],[39]describedinSection4.2,thesuitablesizesofsignaturevectorsare7.6KBin175.vpr,7.9KBin300.twolf,8.3KBinCoverType,8.7KBinEECS,and9.2KBinHP.
10IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
TABLE2
PerformanceEvaluationMetrics
Weusethemultiplemetricstoevaluatetheperfor-mance,includingThroughput,Weightedspeedup,andFairspeedupasshowninTable2.Specifically,theThroughputreferstotheabsoluteIPCnumberstoevaluatethesystemutilization.TheWeightedspeedupisthesumofspeedupsofallprogramsoverabaselineschemetoindicatethedecreaseofexecutiontime.TheFairspeedupistheharmonicmeanofthespeedupsoverabaselineschemetoobtainthebalancebetweenfairnessandperformance.Wealsoexaminetheperformanceintermsofcacheupdatelatency,migrationcost,hitrateandtime,andspaceoverheads.
5.2Results
WecompareMERCURYwithbaselineapproaches,i.e.,privateandsharedcaches,andthestate-of-the-work,PCM[12]andMergeable[7]schemes,whichwereimplementedfortheexperiments.
5.2.1Throughput
Fig.6showsthethroughputresultsfromexecutingthereal-worldapplicationswiththeincreaseofmulticorenumberfrom4to16.Theaveragethroughputson4-coresystemswithprivatecache,sharedcache,PCM,Mergeable,andMERCURYare1.352,1.563,1.815,1.925,and2.162,respectively.For8-coresystems,theaveragethroughputsare2.481,2.572,2.953,3.104,and3.305.For16-coresystems,theyare3.281,3.469,3.957,4.152,and4.452.
WeobservethattwotypicalSPEC2000benchmarksobtainthelargerthroughputsonaverageby15.7percentincreasethanotherapplications.ThemainreasonisthattheSPEC2000benchmarkshavebettersimilarityintheaccesspattern,thusallowingtheLSHtoaccuratelyandefficientlycapturecorrelateddata.Inaddition,MERCURYexecutesconstant-scalehashingcomputationtoquicklyandaccu-ratelyidentifycorrelateddata,thusobtainingthelargerthroughputthanthePCMandMergeableschemes.5.2.2WeightedSpeedup
WetakeintoaccountthechangesoftherelativeIPCthatistheratioofabsoluteIPCtothebaselineasthemetricoftheweightedspeedup.Theweightedspeedupsarenormalized
Fig.6.Throughput(sumofIPCs).Fig.7.Normalizedweightedspeedup.
tothosewiththeprivatecachesasshowninFig.7.Thesharedcacheobtainsbetterperformancethantheprivatecacheduetotheabilitytoadapttothedemandsofcompetingprocesses.Comparedwiththeprivatecache,theimprove-mentsofthesharedcacheare9.87,17.52,and23.67percent,respectively,on4-core,8-core,and16-coresystems.
ThePCM,Mergeable,andMERCURYhavemuchbetterperformancethanthesharedcache.Theaveragenormal-izedweightedspeedupsofthePCMschemeare1.263,1.376,and1.482,respectively,on4-core,8-core,and16-coresystems.Mergeableobtains1.372,1.493,and1.718weightedspeedups.MERCURYobtains1.527,1.634,and1.928weightedspeedups,demonstratingbetterperformance.Withtheincreaseofcores,MERCURYfurtherexhibitsitseffectivenessandscalabilitysinceitleveragessimplehashingtoadapttotheworkloadchanges.
5.2.3FairSpeedup
Fairspeedupcomputestheharmonicmeanofthenormal-izedIPCs,whiletakingintoaccountbothfairnessandperformance.Fig.8showstheresultsofcomparingMERCURYwithbaselineschemesandPCMintermsoffairspeedups.Thefairspeedupsarenormalizedtothosewiththeprivatecache.
ComparedwiththePCMscheme,Mergeable,andMERCURYimprovetheperformanceonthismetric,respectively,by7.16and8.35percent(4-core),8.31and9.52percent(8-core),and8.67and9.96percent(16-core).ThemainreasonisthatMERCURYleveragesthediffer-entiatedplacementpolicythatefficientlyallocatesthedataintothecorrelatedcachesandimprovestheutilizationofthemulticoreprocessorbasedonthemultitypememberships.5.2.4CacheUpdateLatency
Cachemanagementneedstoupdatestaleandinfrequentlyaccesseddatatoguaranteehighhitrates.Weevaluatetheupdateefficiencyofprivateandshared,PCM,Mergeable,andMERCURYintermsofoperationlatency.Fig.9shows
Fig.8.Normalizedfairspeedup.
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD11
Fig.9.Normalizedupdatelatency.theupdatelatenciesthatarenormalizedtothoseinthePCMscheme.Weobservethattheaveragenormalizedlatenciesofprivateandsharedcachesare1.22and1.14,respectively.MergeablerequiresalittlelargerlatencythanPCMmainlyduetomergedwritebacks.ComparedwiththePCMandMergeableschemes,MERCURYusingthesimplehashcomputationrequiresthesmallesttimethanothersanddecreasestheupdatelatencyonaverageby48.26,46.57,and43.82percent,respectively,on4-core,8-core,and16-coresystems.
5.2.5MigrationCost
Hitmissesorupdatesincachesoftenleadtodatamigrationamongmultiplecaches,whichincursrelativelyhighcostsintermsofdatatransmissionandreplacementinthecachesofothercores.Fig.10showsthepercentageofmigrateddatainPCM,Mergeable,andMERCURY.Mergeableisabletodetectandmergesimilardatatoguaranteethatmanycorrelateddataarestoredinasinglecache,thusproducingthesmallernumberofthemigrateddatathanPCM.
Weobservethattheaveragepercentagesofmigrateddataare13.2and11.9percent,respectively,inprivateandsharedcaches.ComparedwithMergeable,MERCURYcanobtainbetterperformanceinthismetricanddecreasethenumberofmigrateddataonaverageby35.26,32.57,and31.73percenton4-core,8-core,and16-coresystems.Themainreasonsaretwofold.OneisthattheMC-LSHprovideshighaccuracyofidentifyingcorrelateddata,thusreducingthenumberofmigrateddata.TheotheristhatthefastidentificationofsimilardatainMERCURYproduceslowcomputationcomplexity.
5.2.6HitRate
Oneofthekeymetricstoevaluatecacheefficiencyisthehitratethatdefinestheprobabilityofobtainingquerieddatawithinlimitedcachespaceforrequests.Fig.11showsthecachehitrateoftheMERCURYschemecomparedwithprivate,shared,PCMandMergeable.Theaveragehitratesinprivate,shared,MERCURY,Mergeable,andPCMare
Fig.10.Percentageofmigrateddata.Fig.11.Hitratesinmulticorecaches.
respectively65.22,67.15,86.92,77.48,and69.27percentonthe4-coresystem,61.68,63.73,83.87,74.16,and65.12percentonthe8-coresystems,and59.51,61.34,81.73,70.52,and62.35percentonthe16-coresystems.MERCURYhasthebetterperformanceinthismetricthanMergeableandPCMsincetheMC-LSHaccuratelyidentifiescorrelateddatawithinconstant-scaleexecutioncomplexity.Theimprovedaccuracysignificantlydecreasespotentialmigrationcoststhatpossiblyoccurduetohitmisses.Thequickidentifica-tionalsoalleviatestheeffectsofstalenessinthecaches.
5.2.7TimeandSpaceOverheads
TheexecutiontimeinourperformanceevaluationincludestheidentificationandplacementofthecorrelateddataintheL1caches.WeevaluatetheMERCURY,PCM,Mergeable,andstandardLSHschemesintermsoftimeoverheadasshowninFig.12a.ThetimeoverheadisnormalizedtothoseinMergeablescheme.MERCURYmakesuseofhashingcomputationtoidentifycorrelateddata,thusrequiringsmallerexecutiontimethanMergeablethatneedstocarryouttheextraoperationsofmergingcacheblocks.ComparedwithMergeable,MERCURYdecreasestheexecutiontimeby28.73,21.84,and19.56percenton4-core,8-core,and16-coresystems.
MERCURYneedstousethesignaturevectortorevealthesimilarityofcorrelateddataandleveragecountingBloomfilterstomaintainthemembershipsofcacheddata,whileMergeablerequirestemporarystoragespace,forexample,
Fig.12.Normalizedtimeandspaceoverheads.
12IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
Fig.13.L2missratesofTPC-Hworkloadunderthreedatasetsizes.
Fig.14.L2missratesofTPC-Cworkloadunderthreedatasetsizes.
content-addressablememorytofindsimilardata.Fig.12bshowsthecomparisonsamongtheMERCURY,PCM,Mergeable,andstandardLSHschemesintermsofspaceoverhead.WeobservethatcomparedwiththeMergeablescheme,MERCURYobtainssignificantspacesavingsanddecreasesthespaceoverheadby47.35,45.26,and40.82per-cent,respectively,on4-core,8-core,and16-coresystems.ThemainreasonisthatMERCURYleveragesthesimplebit-awaresignatureandspace-efficientBloomfilterstodemon-strateandmaintainthemembershipsofcorrelateddata,thusobtainingspacesavings.
6SYSTEMIMPLEMENTATIONSTUDY
WepresenttheexperimentalresultsofrunningstandardworkloadsprovidedbybothTPC-HandTPC-Cbench-marks[40].Specifically,wesetup100clients(i.e.,themaximumnumberofclientsallowedinPostgreSQL)tosendoutqueriesconcurrently.Foreachclient,queriesarerandomlydrawnfromthepoolofallTPC-Hqueries.Werepeatthesameexperimentsunderthreedifferentdatasetsizes:500MB,1GB,and10GBforTPC-H;and1GB,5GB,and10GBforTPC-C.Duetospacelimitation,theevaluationmainlyshowstheresultsoftheTPC-Hwork-load.Werunallexperimentsinthecloud.EachcloudserverhastwoIntelCore2Quad2.66-GHzCPUs,8-GBmemory,andfour250-GBdisks.Eachprocessorhasfourcores,andeverytwocoressharea4-MBL2cache.TheDBMSusedinourexperimentsisthePostgreSQL8.3.1runningonLinuxkernel2.6.14.1.Wemeasuredtheperformancebythreemetrics,L2cachemissrate,queryexecutiontime(cyclesperinstruction),andpatchingcosts.
Foragivendataset,thehashfunctionscomefromtherandomselectionoftheLSHfunctionfamily.Morehash
functionsprovidehigheraccuracyofidentifyingsimilardata,whichhoweverincurshighercomputationcomplexityandspaceoverhead.Inourexperiments,weselectL¼7hashfunctionsbasedonthepresampleestimation,inwhichwerandomlyextractasubsetfromtheusedtraceandmaketheestimation,whichhasbeensuccessfullyusedinthereal-worldapplications[31],[35],[39],[41].Moreover,forthedatawithdifferenttypesordimensionalities,weusethenormal-izationmethodtocomputedatasimilarityinthesamemetricmeasure.Normalizedvalueisequaltothemeasure:(ActualValue-Minimum)/(Maximum-Minimum).Forinstance,fortheattributeoffilesize,weassumethatthesizerangeisfrom10KBto200KB,(i.e.,range:10-200).Forafilewith120KB,itsnormalizedvalueisð120-10Þ=ð200-10Þ¼0:58.WeevaluatetherealimplementationperformancebycomparingwithMCC-DB[17].MCC-DBexploitsaccessandcachingpatternsfromqueryanalysis.Thereasonformakingthiscomparisonisthreefold.First,bothMCC-DBandMERCURYworkwellaspatchestoPostgreSQL[16]forconcurrentqueries.Second,theessentialpropertybehindtwotechniquesistousecachepartitioninginmulti-coreprocessorstoenhancesystemperformance.Third,MCC-DBhasprovidedstandardexperimentalresultsbyusingTPC-H[40]tofacilitatefaircomparisonswithothermethods.TPC-H[40]benchmarkshavelargevolumesofdatafordecisionsupportsystemswhenexecuting22differenttypesofqueries.WeperformextensiveexperimentsonaphysicaltestbedbasedonthePostgreSQLsystemusingtheworkloadsgeneratedfromtheTPCbenchmarks.
6.1L2CacheMissRate
Figs.13and14,respectively,showtheL2missrateswhenusingTPC-HandTPC-Cworkloadswithdifferentdatasetsizes.Wefirstexaminetheratesofthreetypicalqueries,i.e.,
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD13
Fig.15.L2missratesin10-GBdatasetwith4-MBL2cache.
Fig.16.Executiontimeofqueries.
Q5,Q8,andQ20,inbothMCC-DBandMERCURYschemes.Q5andQ8aredominatedbymultiwayhashjoinsandQ20isdominatedbynestedsubqueryexecutions.WeobservethatMERCURYobtainsonaverage42.6,48.1,and51.2per-centmissdecreasecomparedwithMCC-DBintheTPC-Hworkloadwith500MB,1GMB,and10GBsizes.Thesebenefitscomefromthefastandaccuratehashing-basedcomputationtoidentifysimilardatatoefficientlysupportconcurrentqueries.WefurtherexaminetheL2missratesbyexecutingall22queriesinTPC-H.ComparedwithMCC-DB,MERCURYhasthedecreaseofmissratesfrom16.1to26.7percent,onaverage21.8percentforall22queriesasshowninFig.15.Inaddition,itisalsoobservedthatQ1,Q4,Q14,andQ21showthecomparablevalueswithMCC-DBduetotheirrelativelyweaklocalitycharacteristic.
6.2QueryExecutionTime
SharedL2cacheplaysakeyroleindeterminingqueryexecutionperformance.Whenconcurrentqueryexecutionprocessesaccesscommontuplesandindexdata,MERCURYenablesmultiplecoresfordatasharingtoreduceunneces-sarymemoryaccesses.InMERCURY,aqueryexecutionprocessreusescachelines.Weevaluatequeryexecutiontimebyusingcyclesperinstruction,whichisexaminedbyusingtheperformancetooltocheckhardwarecounters.Figs.16aand16b,respectively,showtheexecutingtimeof
TABLE3
NormalizedTimeandSpaceCostsasaPatch
queriesin1GBand10GBdatasetsizes.Duetodifferentfunctionalitiesofall22queriesinTPC-H,theexecutiontimeshowssignificantfluctuation.MERCURYalsoobtainsonaverage21.7and23.1percentimprovementsuponMCC-DBin1-GBand10-GBsizes.
6.3PatchingCost
BothMCC-DBandMERCURYareimplementedasapatchinPostgreSQL.Thenewcomponentmayintroduceextrapatchingcosts.WeexaminethesecostsintermsoftimeandspaceasshowninTable3bybeingnormalizedtoMCC-DBwhentakingintoaccountbothTPC-CandTPC-Hdatasetsunderdifferentsizes.MERCURYrequireslesstimeandspacecoststhanMCC-DBbecauseMERCURYmakesuseofthefastMC-LSHhashingcomputationtoplacesimilardatatogether.Wealsoobservethatwiththeincreaseofdatasetsizes,MERCURYobtainsmorebenefitsintermsoftimeandspaceoverheadsoverMCC-DB.Inthemeantime,thisobservationdemonstratesthescalabilityoftheMERCURYscheme.
7RELATEDWORK
Multilevelcachehierarchyhasbeenstudiedinthehigh-performancecloudarchitectureandsoftwarecommunities.Thereexistsawiderangeofproposalstoimprovecachingperformance(e.g.,hitrate,accesslatency,andspaceover-head)[11],[19],[20],[42],[43],[44].Wearguethatsuitablemanagementofthemultilevelcachehierarchyisbecomingmoreimportanttodeliverhighperformanceinthecloud.Locality-basedoptimization.Thestate-of-the-artwork,R-NUCA[45],obtainsnear-optimalcacheblockplacementbyclassifyingblocksonlineandplacingdataclosetothecore.Tomitigatethelossofreusingcachedstateswhenreschedul-ingaprocess,affinityscheduling[46]helpsreducecachemissesbyjudiciouslyschedulingaprocessonarecentlyusedCPU.Toimprovetheperformancein“multiexecution”applications,Mergeable[7]capturesdatasimilaritiesandmergesduplicatecachelinesownedbydifferentprocessestoobtainsubstantialcapacitysavings.Nevertheless,perform-ingtheexplicitlymergingoperationsoncacheblocksdemandsrelativelylongerexecutiontimeandincreasescomputationcomplexity.Theprocess-levelcachemanage-mentpolicy(PCM)[12]hastheassumptionthatallmemoryregionsbelongingtoarunningprocessexhibitthesameaccesspattern.MCC-DB[17]makesuseofdifferentlocalitystrengthsandqueryexecutionpatternstominimizecacheconflicts.Thisimprovementworksundertheassumptionwhentherearemultiplecandidateplansthatareaccuratelyestimatedinadvance.However,thisassumptiondoesnotalwaysholdinmanypracticalapplicationsbecauseperform-ingaccurateestimategenerallyrequireshigh-computation
14IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
overheads.Unlikethem,MERCURYexploresandexploitsthelocalitypropertybylightweighthashingapproach,thusobtainingsignificantperformanceimprovements.
Hardwareacceleration.ToreducethecachepollutioncausedbyLRUthatinsertsnon-reusableitemsintothecachewhileevictingreusableones,ROCS[6]employshardwarecounterstocharacterizecachebehaviorsandintroducesapollutebuffertohostnon-reusedcachelinesofpagesbeforeeviction.Moreover,toaddresstheproblemsofincreasedcapacityinterferenceandlongerL2accesslatency,CloudCache[47]leveragesfine-grainedhardwaremonitor-ingandcontroltodynamicallyexpandandshrinkL2cachesforworkingthreadsbyusingdynamicglobalpartitioning,distance-awaredataplacement,andlimitedtargetbroad-cast.Hardware-assistedexecutionthrottling[48]helpsregulatefairnessinmodernmulticoreprocessors,whiledemonstratingtherelativebenefitsofthevariousresourcecontrolmechanisms.Moreover,toreducethelargeamountsofmissesintheLLCbetweentheevictionofablockanditsreuse,Scavenger[49]dividesthetotalcachestorageintoaconventionalcacheandavictimfilearchitecturetoidentifyandretainhigh-prioritycacheblocksthataremorelikelytobereused.MERCURYbridgesthegapbetweenamulticorearchitectureandanoperatingsystems.Theexistinghard-wareaccelerationapproachescanuseMERCURYtosimplifyoperationsandoptimizesystemcalls.
Operationsenhancements.MergeSort[50]performedanefficientmultiwaymergewithoutbeingconstrainedbythememorybandwidthforhigh-throughputdatabaseapplica-tions.Theparallelskylinecomputationcouldbenefitfrommulticorearchitectures,suchasparallelversionofthebranch-and-boundalgorithm.Parketal.[51]presentedaparallelalgorithmbasedonparallelprogrammingthatwasevaluatedasacasestudyofparallelizingdatabaseopera-tions.Acooperation-basedlockingparadigm[52]wasproposedforefficientparallelizationoffrequencycountingandtop-kovermultiplestreamsinthecontextofmulticoreprocessors.Inaddition,adaptiveaggregation[53]demon-stratedthatachipmultiprocessorwithnewdimensionscouldenhanceconcurrentsharingofaggregationdatastructuresandcontentiousaccessestofrequentlyusedvalues.Qiaoetal.[54]introducedaschedulingtechniquetocooperatemultiplememoryscanstoreducetheoverheadonmemorybandwidth.Theseresearchprojectsaimtomakeasinglequerybenefitfromthecache,whichisorthogonaltoourwork.MininglocalitycanimproveparallelqueriesinmulticoreCPU[55]andtree-structureddata[56].Twopopularjoinalgorithms,suchashashjoinandsort-mergejoin,wasre-examinedin[57]tousemulticorecacheblockingtominimizeaccesslatency,increasecomputedensityandbalanceloadamongstcores,evenforheavilyskewedinputdatasets.CATCH[58]canstoreuniquecontentsininstructioncachebymeansofhashing,buttheirproposedsystemdoesnotsupportmodificationsincacheddata.Inaddition,thecachecompressiontechnique[59]compressestheL2dataresultstoreducethecachespaceandtheoff-chipaccesses,thusobtainingbandwidthsavings.Thecoopera-tivecachingtechnique[60]inamultiprocessorcanreduceoff-chipaccessthroughusingacooperativeprivatecacheeitherbystoringasinglecopyofcleanblocksorprovidingacache-like,spill-overmemoryforstoringevictedcachelines.MERCURYcanimprovethequeryperformancebyusinglocality-awaredataplacementstrategy.
Workloadsawareness.AnOS-basedcachepartitioningmechanism[29]presentsexecutionandmeasure-basedstrategiesformulticorecachepartitioninguponmultiplerepresentativeworkloads.Anonuniformcachearchitecture(NUCA)[61]takesadvantageofproximityofdatafromtheaccessingprocessor.TofurtheraddresstheproblemofonchipdatalocalityinlargesharedNUCA,PageNUCA[8]proposedafullyhardwiredcoarse-graindatamigrationmechanismthatdynamicallymonitoredtheaccesspatternsofthecoresatthegranularityofapage.Subsequently,theNuRAPIDproposal[62]decoupledthetaganddataplacementinaNUCAbyaugmentingeachtaganddatablockwithaforwardandreversepointertothecorrespond-ingdatablockandtag,respectively.NUcache[63]makesuseoftheDelinquentPC-Next-Usecharacteristictoimprovetheperformanceofsharedcachesinmulti-cores.TheNUcacheorganizationlogicallypartitionstheassociativewaysofacachesetintoMainWaysandDeliWays.MERCURYisorthogonaltotheexistingschemes.Itleveragesthelight-weightLSH-basedcomputationandobtainssignificantperformanceimprovementsontheLLCbyaccuratelycapturingthedifferentiatedlocalityacrossdata.
Scheduling.Age-basedschedulingforheterogeneousmultiprocessor[64]allowsathreadwiththelargerremain-ingexecutiontimetoruninafastercoregiventhepredictionofremainingexecutiontime.Athread-basedpreloadingtechniqueforsimultaneousmultithreadingprocessorswasproposedin[65]tousethehelperthreadtoperformaggressivedatapreloading.Toimprovetheutilizationofon-chipmemoryandreducetheimpactofexpensiveDRAMandremotecacheaccesses,O2scheduling[66]schedulesobjectsandoperationstocachesandcores.Todecreasetheunnecessarysharingofnetworkcontrolstateatallstacklayers,theIsoStackarchitecture[67]offloadsnetworkstackprocessingtoadedicatedprocessorcore.Moreover,inte-gratedprocessor-cachepartitioning[9]dividesboththeavailableprocessorsandthesharedcacheinachipmulti-processoramongdifferentmultithreadedapplications.TheexistingschedulingstrategiescanfurtherhelpoptimizeMERCURYperformance.
8CONCLUSION
MERCURY,asaninfrastructureofthecloud,playsasignificantroleinmanagingthemultilevelcachehierarchy.Byexploringandexploitingdatasimilaritythatisderivedfromlocality-awareaccesspatterns,MERCURYalleviateshomogeneousdataplacementandimprovessystemper-formancebythelow-complexityMC-LSHcomputation.Thecost-effectiveMERCURYisabletoprovidehybridfunctionalities.Oneistoprovidealightweightmechanismforallocatingcacheresources.TheotheristosupporttheOS-baseddynamiccacheallocationandcapturedatasimilaritywiththeaidofspace-efficientstructures.MER-CURY,hence,allowstheOScontroloverthesharedLLCs,whileminimizingsoftwareoverheads.Experimentsusingthereal-worlddatasetsdemonstratetheMERCURY’sefficiency.
Sincemodernmicroprocessorsincreasinglyincorporatemultiplememorycontrollers[68],[69],ourfutureworkwillconsiderthelocality-awareschemesforimplementingeffi-cientdataplacementinmultiplememorycontrollers.
HUAETAL.:DATASIMILARITY-AWARECOMPUTATIONINFRASTRUCTUREFORTHECLOUD15
ACKNOWLEDGMENTS
ThisworkwassupportedinpartbyNationalNaturalScienceFoundationofChina(NSFC)underGrant61173043;NationalBasicResearch973ProgramofChinaunderGrant2011CB302301;NSFCunderGrant61025008,61232004;FundamentalResearchFundsforthecentraluniversities,HUST,undergrant2012QN098;theNSERCDiscoveryGrant341823;USNationalScienceFoundationAward1116606.ThisisanextendedversionofourpaperpublishedintheProceedingsofthe20thIEEEInternationalSympo-siumonModeling,Analysis,andSimulationofComputerandTelecommunicationSystems(MASCOTS),2012,pages:371-378.
REFERENCES
[1]J.GantzandD.Reinsel,“DigitalUniverseStudy:ExtractingValuefromChaos,”Proc.Int’lDataCorporation(IDC),June2011.
[2]ScienceStaff,“DealingwithData—ChallengesandOpportu-nities,”Science,vol.331,no.6018,pp.692-693,2011.
[3]M.Armbrustetal.,“AViewofCloudComputing,”Comm.ACM,vol.53,no.4,pp.50-58,2010.
[4]S.Bykov,A.Geller,G.Kliot,J.Larus,R.Pandya,andJ.Thelin,“Orleans:CloudComputingforEveryone,”Proc.ACMSymp.CloudComputing(SOCC),2011.
[5]S.Wu,F.Li,S.Mehrotra,andB.Ooi,“QueryOptimizationforMassivelyParallelDataProcessing,”Proc.ACMSymp.CloudComputing(SOCC),2011.
[6]
L.Soares,D.Tam,andM.Stumm,“ReducingtheHarmfulEffectsofLast-LevelCachePolluterswithanOS-Level,Software-OnlyPolluteBuffer,”Proc.IEEE/ACM41stAnn.Int’lSymp.Microarch-itecture(MICRO),pp.258-269,2009.
[7]
S.Biswas,D.Franklin,A.Savage,R.Dixon,T.Sherwood,andF.Chong,“Multi-Execution:MulticoreCachingforData-SimilarExecutions,”Proc.36thAnn.Int’lSymp.ComputerArchitecture(ISCA),2009.
[8]
M.Chaudhuri,“PageNUCA:SelectedPoliciesforPage-GrainLocalityManagementinLargeSharedChip-MultiprocessorCaches,”Proc.Int’lSymp.High-PerformanceComputerArchitecture(HPCA),pp.227-238,2009.
[9]
S.Srikantaiah,R.Das,A.K.Mishra,C.R.Das,andM.Kandemir,“ACaseforIntegratedProcessor-CachePartitioninginChipMultiprocessors,”Proc.Conf.HighPerformanceComputingNetwork-ing,Storage,andAnalysis(SC),2009.
[10]
X.Ding,K.Wang,andX.Zhang,“SRM-Buffer:AnOSBufferManagementTechniquetoPreventLastLevelCachefromThrashinginMulticores,”Proc.SixthConf.ComputerSystems(EuroSys),2011.
[11]Y.Chen,S.Byna,andX.Sun,“DataAccessHistoryCacheandAssociatedDataPrefetchingMechanisms,”Proc.IEEE/ACMConf.Supercomputing(SC),2007.
[12]
J.Lin,Q.Lu,X.Ding,Z.Zhang,X.Zhang,andP.Sadayappan,“EnablingSoftwareManagementforMulticoreCacheswithaLightweightHardwareSupport,”Proc.Conf.HighPerformanceComputingNetworking,Storage,andAnalysis(SC),2009.
[13]
J.Stuecheli,D.Kaseridis,D.Daly,H.Hunter,andL.John,“TheVirtualWriteQueue:CoordinatingDRAMandLast-LevelCachePolicies,”Proc.37thAnn.Int’lSymp.ComputerArchitecture(ISCA),2010.
[14]A.Forin,B.Neekzad,andN.Lynch,“Giano:TheTwo-HeadedSystemSimulator,”TechnicalReportMSR-TR-2006-130,MicrosoftResearch,2006.
[15]
S.Biswas,D.Franklin,T.Sherwood,andF.Chong,“Conflict-AvoidanceinMulticoreCachingforData-SimilarExecutions,”Proc.10thInt’lSymp.PervasiveSystems,Algorithms,andNetworks(ISPAN),2009.
[16]“PostgreSQL,”http://www.postgresql.org/,2013.
[17]R.Lee,X.Ding,F.Chen,Q.Lu,andX.Zhang,“MCC-DB:MinimizingCacheConflictsinMulti-CoreProcessorsforData-bases,”Proc.VLDBEndowment,vol.2,no.1,pp.373-384,2009.[18]
T.R.B.Bershad,D.Lee,andB.Chen,“AvoidingConflictMissesDynamicallyinLargeDirect-MappedCaches,”Proc.SixthInt’lConf.ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),1994.
[19]Y.Yan,X.Zhang,andZ.Zhang,“Cacheminer:ARuntime
ApproachtoExploitCacheLocalityonSMP,”IEEETrans.ParallelandDistributedSystems,vol.11,no.4,pp.357-374,Apr.2000.[20]K.Zhang,Z.Wang,Y.Chen,H.Zhu,andX.Sun,“PAC-PLRU:A
CacheReplacementPolicytoSalvageDiscardedPredictionsfromHardwarePrefetchers,”Proc.IEEE/ACM11thInt’lSymp.Cluster,Cloud,andGridComputing(CCGrid),pp.265-274,2011.
[21]G.Suh,S.Devadas,andL.Rudolph,“AnalyticalCacheModels
withApplicationstoCachePartitioning,”Proc.ACM15thInt’lConf.Supercomputing(ICS),2001.
[22]“UCIMachineLearningRepository,”TheForestCoverTypeData
Set,http://archive.ics.uci.edu/ml/datasets/Covertype,2013.[23]D.Ellard,J.Ledlie,P.Malkani,andM.Seltzer,“PassiveNFS
TracingofEmailandResearchWorkloads,”Proc.SecondUSENIXConf.FileandStorageTechnologies(FAST),2003.
[24]E.Riedel,M.Kallahalla,andR.Swaminathan,“AFrameworkfor
EvaluatingStorageSystemSecurity,”Proc.Conf.FileandStorageTechnologies(FAST),2002.
[25]SPEC2000,http://www.spec.org/cpu2000/,2013.
[26]S.CarrandK.Kennedy,“CompilerBlockabilityofNumerical
Algorithms,”Proc.SupercomputingConf.,1992.
[27]E.E.R.M.S.LamandM.E.Wolf,“TheCachePerformanceand
OptimizationsofBlockedAlgorithms,”Proc.FourthInt’lConf.ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),1991.
[28]M.S.L.T.C.MowryandA.Gupta,“DesignandEvaluationofa
CompilerAlgorithmforPrefetching,”Proc.FifthInt’lConf.ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),1992.
[29]J.Lin,Q.Lu,X.Ding,Z.Zhang,X.Zhang,andP.
Sadayappan,“GainingInsightsintoMulticoreCachePartition-ing:BridgingtheGapbetweenSimulationandRealSystems,”Proc.IEEE14thInt’lSymp.HighPerformanceComputerArchitecture(HPCA),2008.
[30]P.IndykandR.Motwani,“ApproximateNearestNeighbors:
TowardRemovingtheCurseofDimensionality,”Proc.13thAnn.ACMSymp.TheoryofComputing(STOC),1998.
[31]Q.Lv,W.Josephson,Z.Wang,M.Charikar,andK.Li,“Multi-ProbeLSH:EfficientIndexingforHigh-DimensionalSimilaritySearch,”Proc.33rdInt’lConf.VeryLargeDataBases(VLDB),pp.950-961,2007.
[32]R.Shinde,A.Goel,P.Gupta,andD.Dutta,“SimilaritySearchand
LocalitySensitiveHashingUsingTernaryContentAddressableMemories,”Proc.ACMSIGMODInt’lConf.ManagementofData(SIGMOD),pp.375-386,2010.
[33]A.JolyandO.Buisson,“APosterioriMulti-ProbeLocality
SensitiveHashing,”Proc.ACMInt’lConf.Multimedia,2008.
[34]G.Taylor,P.Davies,andM.Farmwald,“TheTLBSlice-aLow-CostHigh-SpeedAddressTranslationMechanism,”Proc.17thAnn.Int’lSymp.ComputerArchitecture(ISCA),1990.
[35]A.AndoniandP.Indyk,“Near-OptimalHashingAlgorithmsfor
ApproximateNearestNeighborinHighDimensions,”Comm.ACM,vol.51,no.1,pp.117-122,2008.
[36]L.Fan,P.Cao,J.Almeida,andA.Broder,“SummaryCache:A
ScalableWide-AreaWebCacheSharingProtocol,”IEEE/ACMTrans.Networking,vol.8,no.3,pp.281-293,June2000.
[37]B.Bloom,“Space/TimeTrade-OffsinHashCodingwithAllow-ableErrors,”Comm.ACM,vol.13,no.7,pp.422-426,1970.
[38]Y.Tao,K.Yi,C.Sheng,andP.Kalnis,“QualityandEfficiency
inHigh-DimensionalNearestNeighborSearch,”Proc.ACMSIGMODInt’lConf.ManagementofData(SIGMOD),2009.
[39]Y.Hua,B.Xiao,D.Feng,andB.Yu,“BoundedLSHforSimilarity
SearchinPeer-to-PeerFileSystems,”Proc.37thInt’lConf.ParallelProcessing(ICPP),pp.644-651,2008.[40]TPC,http://www.tpc.org/,2013.
[41]Y.Hua,B.Xiao,B.Veeravalli,andD.Feng,“Locality-Sensitive
BloomFilterforApproximateMembershipQuery,”IEEETrans.Computers,vol.61,no.6,pp.817-830,June2012.
[42]Z.Zhang,Z.Zhu,andX.Zhang,“CachedDramforILPProcessor
MemoryAccessLatencyReduction,”IEEEMicro,vol.21,no.4,pp.22-32,July2001.
[43]S.Byna,Y.Chen,X.Sun,R.Thakur,andW.Gropp,“ParallelI/O
PrefetchingUsingMPIFileCachingandI/OSignatures,”Proc.IEEE/ACMConf.Supercomputing(SC),2008.
[44]Z.Zhang,Z.Zhu,andX.Zhang,“DesignandOptimizationof
LargeSizeandLowOverheadOff-ChipCaches,”IEEETrans.Computers,vol.53,no.7,pp.843-855,July2004.
16IEEETRANSACTIONSONCOMPUTERS,VOL.63,NO.1,JANUARY2014
[45]N.Hardavellas,M.Ferdman,B.Falsafi,andA.Ailamaki,“Near-OptimalCacheBlockPlacementwithReactiveNonuniformCacheArchitectures,”IEEEMicro,vol.30,no.1,pp.20-28,Jan./Feb.2010.[46]J.Torrellas,A.Tucker,andA.Gupta,“BenefitsofCache-Affinity
SchedulinginShared-MemoryMultiprocessors:ASummary,”Proc.ACMSIGMETRICSConf.MeasurementandModelingofComputerSystems(SIGMETRICS),1993.
[47]H.Lee,S.Cho,andB.Childers,“Cloudcache:Expandingand
ShrinkingPrivateCaches,”Proc.IEEE17thInt’lSymp.HighPerformanceComputerArchitecture(HPCA),pp.219-230,2011.
[48]X.Zhang,S.Dwarkadas,andK.Shen,“HardwareExecution
ThrottlingforMulti-coreResourceManagement,”Proc.USENIXAnn.TechnicalConf.,2009.
[49]A.Basu,N.Kirman,M.Kirman,M.Chaudhuri,andJ.Martinez,
“Scavenger:ANewLastLevelCacheArchitecturewithGlobalBlockPriority,”Proc.40thAnn.IEEE/ACMInt’lSymp.Micro-architecture(MICRO),pp.421-432,2007.
[50]J.Chhugani,A.Nguyen,V.Lee,W.Macy,M.Hagog,Y.Chen,A.
Baransi,S.Kumar,andP.Dubey,“EfficientImplementationofSortingonMulti-CoreSIMDCPUArchitecture,”Proc.VLDBEndowment,vol.1,pp.1313-1324,2008.
[51]S.Park,T.Kim,J.Park,J.Kim,andH.Im,“ParallelSkyline
ComputationonMulticoreArchitectures,”Proc.IEEE25thInt’lConf.DataEng.(ICDE),2009.
[52]S.Das,S.Antony,D.Agrawal,andA.E.Abbadi,“Thread
CooperationinMulticoreArchitecturesforFrequencyCountingoverMultipleDataStreams,”Proc.VLDBEndowment,vol.2,pp.217-228,2009.
[53]J.CieslewiczandK.Ross,“AdaptiveAggregationonChip
Multiprocessors,”Proc.33rdInt’lConf.VeryLargeDataBases(VLDB),2007.
[54]L.Qiao,V.Raman,F.Reiss,P.Haas,andG.Lohman,“Main-MemoryScanSharingforMulti-CoreCPUs,”Proc.VLDBEndowment,vol.1,pp.610-621,2008.
[55]W.HanandJ.Lee,“Dependency-AwareReorderingfor
ParallelizingQueryOptimizationinMulti-CoreCPUs,”Proc.ACMSIGMODInt’lConf.ManagementofData(SIGMOD),2009.
[56]S.TatikondaandS.Parthasarathy,“MiningTree-StructuredData
onMulticoreSystems,”Proc.VLDBEndowment,vol.2,pp.694-705,2009.
[57]C.Kim,T.Kaldewey,V.Lee,E.Sedlar,A.Nguyen,N.Satish,J.
Chhugani,A.D.Blas,andP.Dubey,“Sortvs.HashRevisited:FastJoinImplementationonModernMulti-CoreCPUs,”Proc.VLDBEndowment,vol.2,pp.1378-1389,2009.
[58]M.KleanthousandY.Sazeides,“CATCH:AMechanismfor
DynamicallyDetectingCache-Content-DuplicationandItsAppli-cationtoInstructionCaches,”Proc.Conf.Design,Automation,andTestinEurope(DATE),2008.
[59]A.AlameldeenandD.Wood,“AdaptiveCacheCompressionfor
High-PerformanceProcessors,”Proc.31stAnn.Int’lSymp.Com-puterArchitecture(ISCA),2004.
[60]J.ChangandG.Sohi,“CooperativeCachingforChipMulti-processors,”Proc.33rdAnn.Int’lSymp.ComputerArchitecture(ISCA),2006.
[61]C.Kim,D.Burger,andS.Keckler,“AnAdaptive,Non-Uniform
CacheStructureforWire-DelayDominatedon-ChipCaches,”Proc.10thInt’lConf.ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),2002.
[62]Z.Chishti,M.Powell,andT.Vijaykumar,“DistanceAssociativity
forHigh-PerformanceEnergy-EfficientNon-UniformCacheAr-chitectures,”Proc.IEEE/ACM36thAnn.Int’lSymp.Microarchitec-ture(MICRO),2003.
[63]R.Manikantan,K.Rajan,andR.Govindarajan,“NUcache:An
EfficientMulticoreCacheOrganizationBasedonNext-UseDistance,”Proc.IEEE17thInt’lSymp.HighPerformanceComputerArchitecture(HPCA),pp.243-253,2011.
[64]N.Lakshminarayana,J.Lee,andH.Kim,“AgeBasedScheduling
forAsymmetricMultiprocessors,”Proc.IEEE/ACMSupercomput-ingConf.,2009.
[65]J.Zhou,J.Cieslewicz,K.Ross,andM.Shah,“ImprovingDatabase
PerformanceonSimultaneousMultithreadingProcessors,”Proc.31stInt’lConf.VeryLargeDataBases(VLDB),2005.
[66]S.Boyd-Wickizer,R.Morris,andM.F.Kaashoek,“Reinventing
SchedulingforMulticoreSystems,”Proc.12thConf.HotTopicsinOperatingSystems(HotOS),2009.
[67]L.Shalev,J.Satran,E.Borovik,andM.Ben-Yehuda,“IsoStack:
HighlyEfficientNetworkProcessingonDedicatedCores,”Proc.USENIXAnn.TechnicalConf.,2010.
[68]M.Awasthi,D.Nellans,K.Sudan,R.Balasubramonian,andA.
Davis,“ManagingDataPlacementinMemorySystemswithMultipleMemoryControllers,”Int’lJ.ParallelProgramming,vol.40,no.1,pp.57-83,2012.
[69]M.Awasthi,D.Nellans,K.Sudan,R.Balasubramonian,andA.
Davis,“HandlingtheProblemsandOpportunitiesPosedbyMultipleOn-ChipMemoryControllers,”Proc.19thInt’lConf.ParallelArchitecturesandCompilationTechniques(PACT),2010.YuHuareceivedtheBEandPhDdegreesincomputersciencefromtheWuhanUniversity,China,in2001and2005,respectively.HeisanassociateprofessorattheHuazhongUniversityofScienceandTechnology,China.Hisresearchinterestsincludecomputerarchitecture,cloudcomputing,andnetworkstorage.Hehasmorethan40paperstohiscreditinmajorjournalsandinternationalconferencesincludingIEEETrans-actionsonComputers,IEEETransactionson
ParallelandDistributedSystems,USENIXATC,INFOCOM,SC,ICDCS,ICPP,andMASCOTS.Hehasbeenontheprogramcommitteesofmultipleinternationalconferences,includingINFOCOMandICPP.HeisaseniormemberoftheIEEE,andaMemberofACMandUSENIX.XueLiureceivedtheBSandMSdegreesinmathematicsandautomaticcontrol,respec-tively,fromTsinghuaUniversity,China,andthePhDdegreeincomputersciencefromtheUniversityofIllinoisatUrbana-Champaignin2006.HeisanassociateprofessorintheSchoolofComputerScienceatMcGillUniversity.Hisresearchinterestsincludecomputernetworksandcommunications,smartgrid,real-timeandembeddedsystems,cyber-physicalsystems,
datacenters,andsoftwarereliability.HeservesasaneditorforIEEETransactionsonVehicularTechnology,anassociateeditorfortheIEEETransactionsonParallelandDistributedSystems,andaneditorfortheIEEECommunicationsSurveysandTutorials.HisworkhasreceivedtheYear2008BestPaperAwardfromIEEETransactionsonIndustrialInformatics,andtheFirstPlaceBestPaperAwardoftheACMConferenceonWirelessNetworkSecurity(WiSec2011).HeisamemberoftheIEEEandtheACM.
DanFengreceivedtheBE,ME,andPhDdegreesincomputerscienceandtechnologyfromtheHuazhongUniversityofScienceandTechnology(HUST),China,in1991,1994,and1997,respectively.SheisaprofessorandvicedeanoftheSchoolofComputerScienceandTechnology,HUST.Herresearchinterestsincludecomputerarchitecture,massivestoragesystems,andparallelfilesystems.Shehasmorethan100publicationstohercreditin
journalsandinternationalconferences,includingIEEETransactionsParallelandDistributedSystems,JCST,USENIXATC,FAST,ICDCS,HPDC,SC,ICS,andICPP.Sheservesastheprogramcommitteememberofmultipleinternationalconferences,includingSCandMSST.SheisamemberoftheIEEE.
.Formoreinformationonthisoranyothercomputingtopic,pleasevisitourDigitalLibraryatwww.computer.org/publications/dlib.
因篇幅问题不能全部显示,请点此查看更多更全内容