IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 1, JANUARY 2014

Data Similarity-Aware Computation Infrastructure for the Cloud

Yu Hua, Senior Member, IEEE, Xue Liu, Member, IEEE, and Dan Feng, Member, IEEE

Abstract—The cloud is emerging for scalable and efficient cloud services. To meet the needs of handling massive data and decreasing data migration, the computation infrastructure requires efficient data placement and proper management for cached data. In this paper, we propose an efficient and cost-effective multilevel caching scheme, called MERCURY, as the computation infrastructure of the cloud. The idea behind MERCURY is to explore and exploit data similarity and support efficient data placement. To accurately and efficiently capture the data similarity, we leverage a low-complexity locality-sensitive hashing (LSH). In our design, in addition to the problem of space inefficiency, we identify that a conventional LSH scheme also suffers from the problem of homogeneous data placement. To address these two problems, we design a novel multicore-enabled locality-sensitive hashing (MC-LSH) that accurately captures the differentiated similarity across data. The similarity-aware MERCURY, hence, partitions data into the L1 cache, L2 cache, and main memory based on their distinct localities, which helps optimize cache utilization and minimize the pollution in the last-level cache. Besides extensive evaluation through simulations, we also implemented MERCURY in a system. Experimental results based on real-world applications and datasets demonstrate the efficiency and efficacy of our proposed schemes.

Index Terms—Cloud computing, multicore processor, cache management, data similarity


1 INTRODUCTION

We are entering the era of the cloud that contains massive and heterogeneous data. The datasets have the salient feature of a volume of Petabytes or Exabytes and data streams with a speed of Gigabits per second. These datasets often have to be processed and analyzed in a timely fashion. According to a recent International Data Corporation (IDC) study, the amount of information created and replicated is more than 1.8 Zettabytes (1.8 trillion Gigabytes) in 2011 [1]. From 1700 responses to a Science poll [2], about 20 percent of respondents often use datasets larger than 100 GB, and over 63 percent have asked colleagues for data sharing. Unfortunately, about half of those polled store the data only in their own labs because there is no funding to support archiving. Moreover, from small hand-held devices to huge data centers, we are collecting and analyzing ever-greater amounts of information. Some commercial companies, like Google, Microsoft, Yahoo!, and Facebook, generally handle Terabytes and even Petabytes of data every day [3], [4], [5]. For the computation infrastructure of the cloud, it is important and challenging to perform efficient processing and analysis upon these data.

The computation infrastructure typically consists of multicore processors. The increasing number of cores on a chip and the different degrees of data similarity exhibited within the workloads present challenges to the design of cache hierarchies in chip multiprocessors (CMPs). These include the organization of, and the policies associated with, the cache hierarchy to meet the needs of system performance improvements and scalability. Cache organization presents multiple levels in the cache hierarchy as well as the size, associativity, latency, and bandwidth at each level. Suitable policies help minimize the latency to frequently accessed data [6], [7], [8], [9]. Moreover, CMPs are prevalent these days. Vendors already ship CMPs with four to twelve cores and have roadmaps to release hundreds of cores to the market. For example, in the commercial markets, there are the Tilera TILE64, Ambric Am2045, and Nvidia GeForce GT200. They are widely used in cloud applications. Despite this popularity, it is still a daunting task to accurately and efficiently perform multicore caching for high-performance cloud systems. The focus of this paper is to optimize the data placement of the multilevel cache hierarchy (e.g., L1, L2 caches and main memory) to improve the overall cloud system performance.

Y. Hua and D. Feng are with the Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P.R. China. E-mail: {csyhua, dfeng}@hust.edu.cn.

X. Liu is with the School of Computer Science, McGill University, Montreal H3A 0E9, Canada. E-mail: xueliu@cs.mcgill.ca.

Manuscript received 7 Sept. 2012; revised 10 Feb. 2013; accepted 24 Apr. 2013; published online 3 May 2013.

An efficient cache hierarchy in the cloud needs to answer questions such as "how to significantly improve the cache utilization and how to efficiently support the data placement?" These problems are more difficult and challenging to address, especially in the case of a large core count. Specifically, we need to address the following challenges.

Challenge 1: inconsistency gap between CPU and operating system caches. To bridge the speed gap between CPU and memory, CPU caches (e.g., L1 and L2) and the operating system (OS) buffer cache are widely used in a multilevel cache hierarchy. Since the CPU caches are at the hardware level, while the buffer cache is a part of the OS, these two layers are conventionally designed independently without awareness of each other. This possibly works for small-scale systems. However, with the increasing number of cores and the increasingly large capacity of main memory, severe performance degradation may occur once the inconsistency gap exists. These two layers hence become inefficient at working cooperatively. Moreover, by leveraging a shared cache, a thread that cooperatively works with multiple co-running threads can influence the others, which generally leads to severe performance degradation. In the near future, a cache will be shared by many cores, and the gap may degrade the performance even more seriously [10], [11].

Challenge 2: performance bottleneck shift in high-performance cloud systems. Multicore-based hardware advancements bring new challenges to the design and the implementation of high-performance cloud systems [12]. This is because the performance bottleneck has been shifted from slow I/O access speeds to high memory access latency. The performance bottleneck of accelerating the execution is correlated with the placement problem of cached data. The optimization of cached data placement hence becomes important to improve the overall cloud system performance. Unfortunately, the existing policies in multicore processors are neither efficient nor scalable for addressing the data placement problem. To efficiently address this problem, we need to carefully explore and exploit the data similarity that generally hides behind access behaviors. We also need to optimize the capacity utilization of a private cache, while alleviating uncontrolled interference in a shared cache.

Challenge 3: exacerbation of the LLC pollution. The last-level cache (LLC) [6] is dynamically shared among the cores while each core has its lowest level of the cache hierarchy. Cache pollution refers to the replacement of a cache element by a less useful one. It occurs when a non-reusable cache line is installed into a cache set. The installed line displaces a reusable cache line. To alleviate the LLC pollution, conventional approaches have the premise that recent ordering serves as a good prediction for subsequent behaviors of cache accesses [10], [13]. In practice, although leveraging the access patterns helps predict future accesses, the caches have to install all cache lines that are accessed. Since performing the identification on the access patterns incurs heavy temporal and spatial overheads, the existing approaches generally demonstrate unsatisfactory performance. Long latency and information staleness further exacerbate the LLC pollution. What we need is a new scheme that simplifies the identification of access locality without the loss of accuracy.

Our proposed MERCURY alleviates the limitations of the hardware solutions and the OS-based schemes. The rationale comes from the observation that performing state maintenance and reference pattern analysis at page granularity generally incurs less overhead than at block granularity [6], [10], [13]. Moreover, learning dynamic reference patterns at page granularity requires less state and storage space compared with the already studied block-grain policies. Our research work hence is related to two areas: system architecture and data-intensive cloud. The two areas are traditionally distinct, but the gap on common system concerns between them has been narrowed recently. The similarity-aware MERCURY meets the needs of suitable data placement in the multilevel cache hierarchy. We implement MERCURY and manage the similarity at the granularity of pages by leveraging operating system mechanisms. MERCURY is compatible with the existing cloud computing systems and can further improve upon them by providing a scalable and efficient caching scheme.

MERCURY plays a significant and fruitful role in managing the multilevel cache hierarchy. Specifically, we make the following contributions.

First (for Challenge 1), to narrow the inconsistency gap and quantify the data correlation, MERCURY employs multitype, rather than conventional homogeneous, membership management. Here, membership refers to whether an item belongs to a given dataset. The data in the similarity-aware multicore caches are judiciously classified into three types, i.e., Family, Friend, and Foreigner, to respectively represent frequently accessed and correlated, frequently accessed but not correlated, and infrequently accessed memberships. To guarantee data consistency and integrity, we further quantify these memberships using a new coding technique.

Second (for Challenges 2 and 3), to address the performance bottleneck and alleviate the LLC pollution, MERCURY explores and exploits the access locality by improving a multicore-enabled locality-sensitive hashing (MC-LSH). The MC-LSH uses a self-contained and space-efficient signature vector, rather than many hash tables in a standard locality-sensitive hashing (LSH), to accomplish significant space savings and meanwhile accurately measure the data similarity. Since MERCURY minimizes cache conflicts and reduces the amount of migrated data, it significantly reduces the low-speed memory accesses. MERCURY can accurately identify the data similarity and mitigate the staleness of cached data to meet the needs of high-performance cloud systems.

Third, we have implemented the components and the functionalities of MERCURY in a software layer, which is compliant with existing hardware devices. To further examine and evaluate the efficacy and efficiency of the proposed scheme, we not only examine MERCURY in a multicore simulation [7], [14], [15], but also implement it in a production system by patching PostgreSQL [16]. The extensive experiments use real-world traces and datasets, and examine the performance in multiple evaluation metrics.

The remainder of this paper is organized as follows: Section 2 presents the dataset analysis and problem statement. Section 3 describes the proposed MERCURY architecture and caching schemes. Section 4 shows the cached data management schemes in the multilevel hierarchy. Sections 5 and 6, respectively, demonstrate the performance evaluation results in simulations and implementations. We present the related work in Section 7. Finally, we conclude our paper in Section 8 with summaries of findings.

2 PROBLEM STATEMENT

In this section, we first study workload characteristics to show the existence of data similarity and demonstrate its performance impacts on caching schemes. We also present the problem statement and basic ideas of our work.

2.1 Analysis of Real-World Traces

Fig. 1. Average locality ratios from real-world datasets.

It is well recognized that the property of data similarity is helpful for performing efficient and scalable caching [7], [11], [17], [18], [19], [20], [21]. Main benefits include throughput improvements and the reduction of the LLC cache miss rates, query latency, and data migration overheads. Hence, the motivation of the MERCURY design comes from the observation that data similarity widely exists in real-world applications. Furthermore, we present the definition of data similarity. For two data items with point representations $a$ and $b$, we assume that they have $d$-dimensional attributes that are represented as vectors $\vec{a}_d$ and $\vec{b}_d$. If the geometric distance between vectors $\vec{a}_d$ and $\vec{b}_d$ is smaller than a predefined threshold, they are similar. The data similarity often hides behind the locality of access patterns [8].
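To make the definition concrete, the following short Python sketch tests whether two attribute vectors are similar. The Euclidean distance metric and the threshold value are illustrative assumptions, not values taken from the paper.

```python
import math

def is_similar(a, b, threshold=10.0):
    """Return True if the (assumed Euclidean) distance between the
    d-dimensional vectors a and b is below the predefined threshold,
    i.e., the two data items are treated as similar."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return dist <= threshold

# Two 54-dimensional CoverType-like points would be compared the same way.
print(is_similar([1.0, 2.0, 3.0], [1.5, 2.5, 2.0]))  # True for small distances
```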

We study typical large-scale applications [22], [23], [24] and the main benchmarks from the SPEC2000 evaluation [25], i.e., 175.vpr and 300.twolf. The properties of the used traces and datasets are listed:

1. The CoverType dataset [22] contains 581,012 data points, each of which has 54-dimensional attributes.

2. The EECS NFS server at Harvard [23] collected I/O accesses. This dataset contains concurrent requests with a total of 4.4 million operations.

3. The HP file system provides a 10-day, 500-GB trace [24] that records the accesses from 236 users.

4. The 175.vpr and 300.twolf benchmarks show the CPU performance in the SPEC2000 evaluation [25]. 175.vpr leverages a combinatorial optimization technique to automatically synthesize the mapped circuits. 300.twolf makes use of the TimberWolfSC placement and global routing package.

To obtain an explicit demonstration, we intensify the above traces and benchmarks into larger scales by a combination of spatial scale-up and temporal scale-up. Specifically, the scale-up method needs to first decompose a trace into subtraces, where the timing relationships are preserved to faithfully maintain the semantic dependences among trace records. These subtraces are replayed concurrently by setting the same start time. Note that the combined trace maintains the same histogram of system calls as the original trace, but presents a heavier workload (higher intensity). As a result, data can be both spatially and temporally scaled up by different factors, depending upon the number of subtraces replayed simultaneously. We intensify the experimental data to be scaled up to 2,000 million accesses.

We use the locality ratio as a measure to represent the locality in the access pattern. Fig. 1 shows the results of the locality ratio, which is the percentage of the times accessing data within a defined time interval to those in the entire dataset. The time interval comes from the used traces, in which all accesses are listed in the order of time. For instance, a dataset contains a 20-hour trace record and we select a 25 percent interval, i.e., a 5-hour access record. For a file, if it has 8 accesses within a randomly selected 25 percent time interval, and the number of accesses during the entire running trace is also 8 (i.e., all accesses to this file occur within this 25 percent interval), the locality ratio becomes 8/8 = 100%. We observe that there exists a strong data access locality within a certain number of instances. The observations also conform to the conclusions in [7]. According to our experimental results and observations, similar data generally demonstrate locality of access patterns. If they are placed together, we can improve cache utilization and decrease the complexity and execution costs of data access operations.
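The locality ratio described above can be computed directly from a timestamped trace. The sketch below assumes a hypothetical trace format of (timestamp, file_id) pairs and mirrors the worked example: when all 8 of a file's accesses fall inside the chosen interval, the ratio is 8/8 = 1.0.

```python
def locality_ratio(trace, file_id, interval_start, interval_end):
    """trace: iterable of (timestamp, file_id) access records.
    Returns the accesses of file_id inside the interval divided by its
    accesses in the entire trace (1.0 means perfect locality)."""
    total = sum(1 for t, f in trace if f == file_id)
    inside = sum(1 for t, f in trace if f == file_id
                 and interval_start <= t < interval_end)
    return inside / total if total else 0.0

# 8 accesses to "fileA", all within the first 5 hours of a 20-hour trace.
trace = [(h % 5, "fileA") for h in range(8)]
print(locality_ratio(trace, "fileA", 0, 5))   # 1.0, i.e., 100 percent
```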

2.2 Basic Idea

The hardware design of the cloud computation infrastructure still works for the scenarios it was designed for, but the lack of flexibility can be an unavoidable issue and inherent weakness, particularly for multicore or many-core processors with an increasingly large number of cores. In contrast, cache optimization and cache resource management at different levels of software, such as operating systems, compilers, and application programs, have shown their effectiveness in addressing the limitations of hardware solutions. With a software approach, the long-term memory access patterns of most cloud applications can be analyzed or predicted, and cache management and optimization decisions can be made more effectively. There have been several successful examples on uniprocessors with simple LRU cache replacement policies [18], [26], [27], [28]. However, using a software scheme to manage cache resources in multicore processors is much more challenging than in uniprocessors, since hardware resources are rarely shared and coordinated for multiple threads. Researchers have evaluated OS-based cache partitioning methods in multicore processors without any additional hardware support [29]. The OS-based method offers flexibility in implementing various resource allocation policies. However, since hardware caches are not in the scope of OS management, the OS-based methods can inevitably cause nontrivial software overhead, and are not ready to be used as the computation infrastructure of the cloud.

To offer an efficient computation infrastructure for the cloud, we investigate the problem of cached data placement in the multilevel cache hierarchy when executing multiple parallel instances, as shown in Fig. 2. We term this problem "cache-member": it determines the data memberships in each cache based on the given constraints. The constraints include migration costs and data access latency.

Fig. 2. Problem description.

Fig. 2a shows the cache-member problem. Specifically, we need to first identify and aggregate similar data into the same or adjacent private L1 caches, and then allocate the data accessed by more than one core into a shared L2 cache. We, hence, can manage the cached data in both L1 and L2 caches. Moreover, an ideal multicore architecture is scalable and flexible to allow dynamic and adaptive management of the cached data. The premise is to accurately capture the similar data [7], [17], which unfortunately is nontrivial due to the expensive operation costs of comparing arriving data with all the existing cache lines. We identify the problem of homogeneous data placement that overlooks the distinct properties and multitype memberships of cached data. To alleviate the homogeneous data management, we leverage a differentiated placement policy, in which the cache memberships are classified into three types as shown in Fig. 2b. We place frequently accessed and correlated data into the L1 cache, called "in-cache Family," frequently accessed but loosely correlated data into the L2 cache, called "shared-cache Friend," and infrequently accessed data into main memory, called "in-memory Foreigner." In this way, we can differentiate the strength of access locality to facilitate the efficient placement of cached data.
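As a purely illustrative data structure (not code from the paper), the three membership types and the tier each one maps to can be captured as follows; the access-frequency threshold and the boolean correlation flag are hypothetical stand-ins for the similarity measurement described later.

```python
from enum import Enum

class Membership(Enum):
    FAMILY = "L1"         # frequently accessed and correlated
    FRIEND = "L2"         # frequently accessed but loosely correlated
    FOREIGNER = "memory"  # infrequently accessed

def classify(access_count, correlated, hot_threshold=16):
    """Map an item's observed behavior to a cache tier (illustrative only)."""
    if access_count >= hot_threshold and correlated:
        return Membership.FAMILY
    if access_count >= hot_threshold:
        return Membership.FRIEND
    return Membership.FOREIGNER

print(classify(32, True))    # Membership.FAMILY   -> a private L1 cache
print(classify(32, False))   # Membership.FRIEND   -> the shared L2 cache
print(classify(2, False))    # Membership.FOREIGNER -> main memory
```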

In practice, capturing data similarity is time-consuming and computation-intensive work due to high dimensions and heterogeneous types. Hence, to accomplish a suitable tradeoff between similarity accuracy and operation complexity, we propose to use a hash-based approach, for example, the LSH [30], due to its locality-aware property and ease of use. The LSH can identify and place similar data together with low complexity. The rationale is that similar data contain strong locality to match the access patterns of multiple threads. The LSH-based scheme thus can improve system performance. Unfortunately, it is well recognized that a standard LSH suffers from heavy space overhead due to the use of too many hash tables [31], [32], [33]. Moreover, the data placement policy depends upon both access frequency and correlation, which is currently difficult to represent quantitatively and measure accurately.

The basic idea behind MERCURY is to leverage the MC-LSH to identify similar data and carry out differentiated data placement. MERCURY represents the strength of data similarity as Family, Friend, and Foreigner, respectively, as shown in Fig. 2b. Specifically, the private L1 caches contain Family members, which are tightly correlated and frequently used data, to facilitate fast access and maintain the access locality in each cache. Furthermore, a shared L2 cache contains Friend members, which in fact consist of two parts. One is the data frequently accessed by multiple cores and the other is the data evicted from correlated L1 caches due to space limitation or staleness. Finally, the main memory contains Foreigner members that are not included in the L1 or L2 caches. Differentiated data placement comprehensively considers both the strength of data similarity and access frequency, while allowing flexible adjustments to support dynamic operations (e.g., insertion/deletion). The similar data that are placed closely can also significantly reduce the migration costs. MERCURY, hence, offers scalable, flexible, and load-balanced caching schemes in a multilevel cache hierarchy.

MERCURY is implemented in a hybrid scheme to address the limitations of both hardware solutions and OS-based methods. Specifically, our multicore shared-cache-management framework consists of two low-cost and effective components: a lightweight mechanism for allocating cache resources and providing cache usage information; and OS-based resource allocation policies for dynamic cache allocation. With a simple and low-overhead component, we enable direct OS control over shared caches, and the software system overhead is minimized. With OS-based management, we are able to design and implement multiple policies to deal with complicated, difficult caching scenarios in multicore systems.

3 MERCURY ARCHITECTURE

MERCURY uses the MC-LSH to identify similar data and leverages an LRU replacement in each cache to update stale data. Fig. 3 shows the MERCURY architecture in the multilevel hierarchy. We assume that each core has one private L1 cache and all processor cores share an L2 cache. The MERCURY scheme is tightly associated with two parts. One is the processor architecture and the other is the operating system. Furthermore, to explicitly represent the differentiated memberships identified by the MC-LSH, we use different flags to label each cache line and obtain holistic optimization in the multilevel cache hierarchy.

Fig. 3. MERCURY multicore caching architecture.

3.1 Caches in a Multicore Processor

The caching schemes in a multicore processor include L1 and L2 cache management, and virtual-physical address translation.

L1 cache management. Each core has one associated cache that contains frequently visited data to increase the access speed and decrease the required bandwidth. We need to update the stale and infrequently accessed data.


L2 cache management. To partition the shared L2 cache, we leverage the well-known page coloring technique [34] due to its simplicity and flexibility. Page coloring is an extensively used OS technique for improving cache and memory performance. A physical address contains several common bits between the cache index and the physical page number, which are indicated as a page color. One can divide a physically addressed cache into nonintersecting regions (cache colors) by page color, and the pages with the same page color are mapped to the same cache color. A shared cache is divided into N colors, where N comes from the architectural settings. The cache lines are represented by using one of the N cache colors. We assign the cache colors of the virtual pages by using the virtual-to-physical page mapping.
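A minimal sketch of how a page color can be derived from a physical address, assuming 8-KB pages and a cache whose set-index bits extend above the page offset. The bit positions and the number of colors are assumptions (eight colors per core is the setting mentioned later in Section 5.1); the paper only states that N comes from the architectural settings.

```python
PAGE_SHIFT = 13   # assumed 8-KB pages (2**13 bytes)
NUM_COLORS = 8    # assumed number of page colors, as in Section 5.1

def page_color(phys_addr):
    """Bits shared by the cache set index and the physical page number
    select one of NUM_COLORS regions; pages with the same color map to
    the same region (cache color) of the shared L2 cache."""
    return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1)

# Two physical pages with the same color land in the same cache region.
print(page_color(0x0000_2000), page_color(0x0001_2000))  # both color 1
```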

Address translation. The address translation translates the virtual address into the physical address by reading the page table. The cache color is tightly associated with the number of page colors in the L2 cache. A virtual tag (v-Tag) helps to identify similar data by using the results from the MC-LSH computation.

3.2 Operating System

The operating system functionalities support the MC-LSH computation and update the locality-aware data.

MC-LSH. A standard LSH helps identify similar data but unfortunately incurs heavy space overhead, i.e., consuming too many hash tables, to identify the locality-aware data. The space inefficiency often results in overflowing from a limited-size cache. MERCURY proposes to use an MC-LSH to offer efficiency and scalability for the multicore caching. Specifically, the MC-LSH uses a space-efficient signature vector to maintain the cached data and utilizes a coding technique to support a differentiated placement policy for the multitype data. We will describe the design details of the MC-LSH in Section 4.

Updating locality-aware data. To execute fast and accurate updates, a key function in MERCURY is to identify similar data with low operation complexity. In practice, many high-performance computing applications demonstrate identical data at the same virtual address, but different physical addresses [7]. All relevant virtual addresses thus need to be mapped to the same cache set. We make use of the MC-LSH to identify similar data and avoid brute-force checking between arriving data and all valid cache lines. The similar data are then placed in the same or close-by caches to facilitate multicore computation and efficiently update data. Since the cached data are locality-aware, MERCURY, hence, decreases migration costs and minimizes cache conflicts. To satisfy query requests and provide flexible use, we design an interface between high-performance applications and the operating system, as shown in Fig. 3. Its main function is to wrap high-level operation requests into low-level system calls with the aid of the page coloring technique [34]. The page color manages the bits between the cache index and the physical page number in the physical memory address. Specifically, the applications need to specify the required space in their requests. The requests help decide how to partition the available cache space among query requests. Query execution processes indicate partitioning results by updating a page color table. The operating system then reads the page color table to know the cache partitions among the query requests.

Although the operating system cannot directly allocate on-chip cache space, it can make use of the virtual-physical address mapping to control how to allocate pages in the main memory. The memory pages of the same color can be mapped to the same cache region. To efficiently partition the cache space, we allocate different page colors to memory threads. MERCURY can hence leverage the page coloring technique to complete cache partitioning among different processes and support the queries.

4 CACHED DATA MANAGEMENT IN MERCURY

To capture the data similarity, we propose an MC-LSH design in MERCURY. A space-efficient signature vector and a simple coding technique help maintain and represent the multitype memberships. We finally describe the scheme of updating data in MERCURY.

4.1 The MC-LSH Scheme

The MC-LSH is a multicore-enabled scheme that consists of the LSH-based computation, a signature vector structure, and the multitype membership coding technique. It offers a deterministic membership for each data item. Compared with the conventional classification schemes for exact results, the MC-LSH provides an approximate and fast scheme to obtain significant time and space savings. The MC-LSH employs the LSH functions to identify similar data based on the access patterns. To address the problem of space inefficiency (i.e., too many hash tables) in the standard LSH, we employ a signature vector structure. Furthermore, to offer differentiated data placement, we use a multitype membership coding technique.

Limitations of the standard LSH. An LSH [30] captures similar data by allowing them to be placed into the same hash buckets with a high probability.

Definition 1. Given a distance function $\|\cdot\|$, a data domain $S$, and some universe $U$, an LSH function family $\mathcal{H} = \{h : S \rightarrow U\}$ is called $(R, cR, P_1, P_2)$-sensitive if, for all $p, q \in S$: if $\|p, q\| \le R$ then $\Pr_{\mathcal{H}}[h(p) = h(q)] \ge P_1$; if $\|p, q\| > cR$ then $\Pr_{\mathcal{H}}[h(p) = h(q)] \le P_2$, where $c > 1$ and $P_1 > P_2$.

In $\mathcal{H}$, $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{\omega} \right\rfloor$, where $a$ is a $d$-dimensional random vector whose entries are chosen from an $s$-stable distribution, and $b$ is a real number chosen uniformly from the range $[0, \omega)$, where $\omega$ is a constant. By using the LSH functions, similar data have a higher probability of colliding than data that are far apart [35].
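A minimal sketch of one such hash function $h_{a,b}(v)$ in Python. The entries of $a$ are drawn from the standard normal distribution (a 2-stable choice), and the value of $\omega$ is an assumed constant; this illustrates the generic stable-distribution construction described above rather than MERCURY-specific code.

```python
import random, math

def make_lsh_function(d, omega=4.0, seed=None):
    """Build h_{a,b}(v) = floor((a . v + b) / omega), with a ~ N(0,1)^d
    (a 2-stable distribution) and b uniform in [0, omega)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, omega)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / omega)
    return h

h = make_lsh_function(d=3, seed=1)
print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 3.0]))  # nearby points often collide
```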

Although the LSH has been recently used in many applications, it is difficult to use in multicore systems due to heavy space overhead and homogeneous data placement. These limitations have severely hampered the use of the multicore benefits for high-performance systems. Unlike the existing work, MERCURY enables the LSH to be space-efficient by leveraging signature vectors.

Space-efficient signature vector. The MC-LSH leverages space-efficient signature vectors to store and maintain the locality of access patterns. Specifically, a signature vector is an $m$-bit array where each bit is initially set to 0. There are in total $L$ LSH functions, $g_i$ ($1 \le i \le L$), to hash a data point into bits, rather than into its original buckets in hash tables, to significantly decrease space overhead. A data point, as an input of each hash function $g_i$, is mapped into a bit that is thus set to 1, possibly more than once, and only the first setting takes effect.

Fig. 4. Signature vector for maintaining page-level data similarity.

A signature vector is able to maintain the data similarity as shown in Fig. 4. A centralized bit is a bit that receives more hits than its left and right neighbors. The hit numbers shown in this figure are also much larger than a predefined threshold value. The centralized bits become the centers of correlated data and are further selected to be mapped and stored into the L1 caches. When hashing data into the signature vector, we count the hit numbers of the bits and carefully select the centralized bits. Moreover, the threshold demonstrates the clustering degree of the data distribution, thus depending upon the access patterns of the real-world applications. After selecting the centralized bits, we can construct a mapping between the centralized bits and the L1 caches to facilitate the data placement. It is worth noting that the number of centralized bits is not necessarily equal to that of the L1 caches. If the number of centralized bits is larger than that of the L1 caches, an L1 cache may contain the data from more than one adjacent centralized bit.

The MC-LSH computation can guarantee that similar data are hashed into one bit with very high probability, which however is not 100 percent, meaning that similar data may still be placed into adjacent bits. A false negative, hence, occurs when the hit bit is 0 and one of its neighbors is 1. To avoid potential false negatives, a simple solution is to check extra neighboring bits besides the hit one. Although extra checking on neighboring bits possibly incurs false positives, in practice, a miss from a false negative generally incurs a larger penalty than a false positive. A reasonable number of extra checked bits is acceptable to obtain a suitable tradeoff between false negatives and false positives. MERCURY probes more than one bit, i.e., checking the left and right neighbors besides the hashed bit. Note that the extra checking occurs only when the hit bit is "0". Our result conforms to the conclusion of sampling data in the multiprobe LSH [31].
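The following sketch illustrates the signature-vector idea under simplifying assumptions: L hash functions set bits in an m-bit array, per-bit hit counters identify centralized bits (local maxima above a threshold), and a lookup probes the left and right neighbors of a 0 bit to reduce false negatives. The array size, threshold, and hash construction are hypothetical choices, not the paper's parameters.

```python
class SignatureVector:
    def __init__(self, m, hash_funcs, threshold=4):
        self.m = m
        self.bits = [0] * m            # the m-bit signature vector
        self.hits = [0] * m            # per-bit hit counters
        self.hash_funcs = hash_funcs   # L LSH-style functions g_i -> int
        self.threshold = threshold

    def insert(self, point):
        for g in self.hash_funcs:
            idx = g(point) % self.m
            self.bits[idx] = 1         # only the first setting takes effect
            self.hits[idx] += 1

    def centralized_bits(self):
        """Bits with more hits than both neighbors and above the threshold;
        these become the centers mapped to the L1 caches."""
        return [i for i in range(1, self.m - 1)
                if self.hits[i] > self.hits[i - 1]
                and self.hits[i] > self.hits[i + 1]
                and self.hits[i] >= self.threshold]

    def probe(self, point):
        """Check the hashed bit and, when it is 0, its left/right neighbors
        to avoid false negatives."""
        for g in self.hash_funcs:
            idx = g(point) % self.m
            if not (self.bits[idx] or self.bits[idx - 1]
                    or self.bits[(idx + 1) % self.m]):
                return False
        return True

sv = SignatureVector(m=64, hash_funcs=[lambda p: hash(("g1", p)),
                                       lambda p: hash(("g2", p))])
sv.insert((1, 2))
print(sv.probe((1, 2)))   # True: the same point hits the same bits
```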

To efficiently update the signature vectors, MERCURY offers scalable and flexible schemes based on the characteristics of the real-world workloads. Specifically, if the workloads exhibit an operation-intensive (e.g., write-intensive) characteristic, we can carry out the operations on the signature vectors and allow the (re)-initialization in the idle time. Moreover, if the workloads become uniform, MERCURY makes use of 4-bit counters, rather than bits, as in the summary of Bloom filters [36]. Each indexed counter increases when adding an item and decreases when removing an item. In practice, a 4-bit counter can satisfy the requirements of most applications.

Fig. 5. Differentiated membership coding technique.

Multitype membership coding. The memberships in the MC-LSH include Family, Friend, and Foreigner, which respectively represent different similarities among cached data. The MC-LSH identifies data memberships and places data into the L1 cache, L2 cache, or main memory, respectively. One key issue in the data placement is how to determine whether the hits in multiple LSH vectors indicate a single cache. To address this problem, we use a coding technique to guarantee membership consistency and integrity.

We use an example to illustrate the differentiated membership coding, as shown in Fig. 5. Given an item, we first compute its hashed values by using the hash functions in the signature vector to determine whether it is correlated to one of the existing L1 caches. Based on the conclusion in [31], if the hit bit is the centralized bit or one of its left and right neighbors, the item is considered to be correlated with the corresponding cache and further obtains an M = 1 indicator (i.e., in the memory), together with the FF (Family/Friend) code (e.g., the location) of that centralized bit in an LSH array.

We construct a mapping table between arriving data and multicore caches to facilitate differentiated data placement. If all M indicators from the L LSH arrays show 1 for an item under a bit-based AND operation, we determine that this item is correlated with the multicore caches and execute further checking on the data mapping table. Otherwise, the item is not considered to be correlated and is directly inserted into the main memory. The checking on the table allows us to determine whether the item is a Family or a Friend. Since performing a direct search on the entire table consumes too much time, we first hash the concatenated code of that item into a standard Bloom filter [37] that has already stored the code indicators. If a hit occurs, we continue to perform the checking on the mapping table. Otherwise, the item is considered as a Friend, and then inserted into the shared L2 cache. Furthermore, since the table contains too many code indicators, a linear brute-force search would lead to unacceptable costs, possibly becoming the performance bottleneck. To address this issue, we make use of a hash table to maintain these code indicators and decrease the search latency. When a hit occurs in the mapping hash table, we insert this item into the corresponding L1 cache.
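Putting the pieces together, the sketch below is one hypothetical way to express the decision flow just described: the L signature vectors each yield an M indicator and an FF code, the indicators are AND-ed, a standard Bloom filter screens the concatenated code, and a mapping hash table resolves Family placements. The interface names and structures are illustrative, not the paper's implementation.

```python
def classify_item(item, lsh_arrays, bloom_filter, mapping_table):
    """lsh_arrays: list of L objects exposing probe(item) -> (M, ff_code),
    where M is 1 if the hit bit (or a neighbor) of a centralized bit is set.
    bloom_filter: set-like structure holding known concatenated codes.
    mapping_table: dict from concatenated code -> L1 cache id."""
    results = [arr.probe(item) for arr in lsh_arrays]
    if not all(m == 1 for m, _ in results):       # bit-wise AND over M indicators
        return ("Foreigner", "memory")            # uncorrelated: main memory
    code = "|".join(str(ff) for _, ff in results)
    if code not in bloom_filter:                  # Bloom miss: cannot be a Family
        return ("Friend", "L2")                   # insert into the shared L2 cache
    cache_id = mapping_table.get(code)
    if cache_id is None:                          # Bloom filter false positive
        return ("Friend", "L2")
    return ("Family", f"L1[{cache_id}]")          # private L1 cache of that core
```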


4.2 Updating Data

In the multilevel hierarchy of MERCURY, we need to update cached data and their memberships in the signature vector.

For updating cached data. To update actual data, we make use of a label-based technique to update stale data in the multilevel caches. The reason comes from the fact that similar data are potentially reused by the corresponding caches in the near future. To decrease recaching costs, we temporarily label stale data for a certain time. When the time expires, we update the caches and replace these labeled stale data. Moreover, the L1 caches belonging to multiple cores possibly contain different amounts of similar data. Performing load balance within multiple L1 caches is hence important to obtain performance improvements. Due to the limited capacity of each L1 cache, MERCURY temporarily places excess, but correlated, data into the shared L2 cache. These correlated data have been inserted into the corresponding counting Bloom filters [37]. In the shared L2 cache, we label the data by using the page colors of the correlated cores to update caches. Once free space is available in an L1 cache, MERCURY reloads these labeled data into the corresponding L1 cache.

The operations of updating data are actually a multilevel migration process from the L1 cache, then the L2 cache, and finally to the main memory. The workflow steps are described below.

1. Updating a cache in MERCURY needs to replace stale data in both L1 and L2 caches, while guaranteeing high hit rates and low maintenance costs. MERCURY makes use of the MC-LSH to identify similar data that are then placed into the L1 caches.

2. The L1 caches employ the simple LRU replacement to update stale data.

3. When the data in the L1 caches become stale, they are transferred into the shared L2 cache among multiple cores.

4. When the data in the L2 cache become stale, they move to the main memory.
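A compact sketch of the multilevel migration path (L1 to L2 to memory) using LRU order per level; the capacities and the single-dictionary modeling of the private L1 caches are assumptions made only to illustrate steps 2-4 above.

```python
from collections import OrderedDict

class MultilevelCache:
    def __init__(self, l1_size=4, l2_size=8):
        self.l1 = OrderedDict()   # private L1 (modeled as one dict here)
        self.l2 = OrderedDict()   # shared L2
        self.memory = {}          # backing store
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, key, value):
        """Place/refresh a line in L1; stale LRU victims demote to L2,
        and L2 victims demote to main memory."""
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:
            old_key, old_val = self.l1.popitem(last=False)   # LRU victim
            self.l2[old_key] = old_val
            self.l2.move_to_end(old_key)
            if len(self.l2) > self.l2_size:
                mem_key, mem_val = self.l2.popitem(last=False)
                self.memory[mem_key] = mem_val

c = MultilevelCache()
for i in range(20):
    c.access(i, i)
print(len(c.l1), len(c.l2), len(c.memory))   # 4 8 8
```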

For updating memberships. To update the data membership in the signature vectors, we leverage counting Bloom filters to facilitate data deletion and maintain the membership of the data that have been identified to be correlated and placed into the corresponding L1 caches. The counting Bloom filters help maintain the membership of cached data in a space-efficient way, carry out the initialization of the L1 caches, and keep the load balance among multiple L1 caches. Each counting Bloom filter is associated with one L1 cache.

When an item is inserted into the L1 cache, it is meanwhile inserted into the counting Bloom filter, in which the hit counters are increased by 1. Since each counting Bloom filter only needs to maintain the items existing in the corresponding L1 cache and the number of stored data is relatively small, it does not require too much storage space. Moreover, when deleting an item, the hit counters are decreased by 1. If all counters become 0, meaning that there are no cached data, we initialize the associated caches by sampling data to determine the locality-aware representation in the signature vector. Note that the size of a signature vector depends not only on the amount of data to be inserted, but also on their distribution. We, hence, leverage well-recognized sampling methods [31], [35], [38], [39] to obtain a suitable size.
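A small counting Bloom filter sketch in the spirit of [36], [37]: each L1 cache would own one such filter, incrementing counters on insertion and decrementing on deletion. The counter array size, the number of hashes, and the double-hashing scheme are assumptions; the paper uses 4-bit counters, modeled here as plain integers.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters=1024, num_hashes=4):
        self.counters = [0] * num_counters   # 4-bit counters in the paper
        self.k = num_hashes

    def _indexes(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % len(self.counters) for i in range(self.k)]

    def insert(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def delete(self, item):
        for i in self._indexes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def contains(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))

    def empty(self):
        """All counters zero: the associated L1 cache holds no tracked data."""
        return not any(self.counters)

cbf = CountingBloomFilter()
cbf.insert("page_42"); print(cbf.contains("page_42"))   # True
cbf.delete("page_42"); print(cbf.empty())                # True
```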

5 PERFORMANCE EVALUATION

This section presents the performance evaluation of our proposed scheme by describing the simulation framework and examining the scalability of MERCURY compared with the state-of-the-art work.

5.1 Experiment Configuration

We use a simulation study primarily for the evaluation of MERCURY's scalability. Our simulation is based on PolyScalar, which is widely used in multicore simulation [7], [14], [15]. We add page tables into PolyScalar for each process to enhance its virtual-to-physical address translation functionality. We further improve PolyScalar by adding the similarity-aware functionalities that are described in Sections 3 and 4. The size of each OS page is 8 KB. Since our study focuses on the last-level cache (L2 cache), which has strong interaction with the main memory, we extend PolyScalar to simulate DDR2 DRAM systems. The simulated memory transactions are pipelined. MERCURY leverages the MC-LSH to identify similar data that are placed into L1 and L2 caches, respectively, with an LRU replacement policy. Specifically, each processor has its own private L1 cache. An L2 cache is shared by multiple cores. We evaluate the scalability of MERCURY by increasing the number of cores. In the page color policy of the L2 cache, each core has eight colors and each color has 128 cache sets. We, hence, allocate a 1-MB cache for the 4-core system, a 2-MB cache for the 8-core system, and a 4-MB cache for the 16-core system. Table 1 shows the parameter settings in the simulations.

TABLE 1. Simulation Parameters

The used traces and datasets include the Forest CoverType dataset [22], the EECS NFS server at Harvard [23], the HP file system trace [24], and 175.vpr and 300.twolf in SPEC2000 [25]. Moreover, by using the proposed sampling approach [31], [35], [38], [39] described in Section 4.2, the suitable sizes of the signature vectors are 7.6 KB in 175.vpr, 7.9 KB in 300.twolf, 8.3 KB in CoverType, 8.7 KB in EECS, and 9.2 KB in HP.


TABLE 2. Performance Evaluation Metrics

We use multiple metrics to evaluate the performance, including Throughput, Weighted speedup, and Fair speedup, as shown in Table 2. Specifically, the Throughput refers to the absolute IPC numbers to evaluate the system utilization. The Weighted speedup is the sum of speedups of all programs over a baseline scheme to indicate the decrease of execution time. The Fair speedup is the harmonic mean of the speedups over a baseline scheme to obtain the balance between fairness and performance. We also examine the performance in terms of cache update latency, migration cost, hit rate, and time and space overheads.
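Since Table 2 itself is not reproduced here, the sketch below uses the commonly used definitions that match the prose: throughput as the sum of IPCs, weighted speedup as the sum of per-program speedups over a baseline, and fair speedup as their harmonic mean. Treat it as an interpretation of the metric descriptions rather than the paper's exact formulas.

```python
def throughput(ipcs):
    """Sum of absolute IPCs over all co-running programs."""
    return sum(ipcs)

def weighted_speedup(ipcs, baseline_ipcs):
    """Sum of per-program speedups relative to the baseline scheme."""
    return sum(ipc / base for ipc, base in zip(ipcs, baseline_ipcs))

def fair_speedup(ipcs, baseline_ipcs):
    """Harmonic mean of the per-program speedups."""
    n = len(ipcs)
    return n / sum(base / ipc for ipc, base in zip(ipcs, baseline_ipcs))

ipcs, base = [1.2, 0.9, 1.1, 1.0], [0.8, 0.8, 0.9, 0.7]
print(throughput(ipcs), weighted_speedup(ipcs, base), fair_speedup(ipcs, base))
```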

5.2 Results

We compare MERCURY with the baseline approaches, i.e., private and shared caches, and the state-of-the-art PCM [12] and Mergeable [7] schemes, which were implemented for the experiments.

5.2.1 Throughput

Fig. 6 shows the throughput results from executing the real-world applications as the number of cores increases from 4 to 16. The average throughputs on 4-core systems with private cache, shared cache, PCM, Mergeable, and MERCURY are 1.352, 1.563, 1.815, 1.925, and 2.162, respectively. For 8-core systems, the average throughputs are 2.481, 2.572, 2.953, 3.104, and 3.305. For 16-core systems, they are 3.281, 3.469, 3.957, 4.152, and 4.452.

We observe that the two typical SPEC2000 benchmarks obtain larger throughputs, on average by a 15.7 percent increase over the other applications. The main reason is that the SPEC2000 benchmarks have better similarity in the access pattern, thus allowing the LSH to accurately and efficiently capture correlated data. In addition, MERCURY executes constant-scale hashing computation to quickly and accurately identify correlated data, thus obtaining larger throughput than the PCM and Mergeable schemes.

Fig. 6. Throughput (sum of IPCs).

5.2.2 Weighted Speedup

We take into account the changes of the relative IPC, which is the ratio of the absolute IPC to the baseline, as the metric of the weighted speedup. The weighted speedups are normalized to those with the private caches, as shown in Fig. 7. The shared cache obtains better performance than the private cache due to its ability to adapt to the demands of competing processes. Compared with the private cache, the improvements of the shared cache are 9.87, 17.52, and 23.67 percent, respectively, on 4-core, 8-core, and 16-core systems.

Fig. 7. Normalized weighted speedup.

The PCM, Mergeable, and MERCURY schemes have much better performance than the shared cache. The average normalized weighted speedups of the PCM scheme are 1.263, 1.376, and 1.482, respectively, on 4-core, 8-core, and 16-core systems. Mergeable obtains 1.372, 1.493, and 1.718 weighted speedups. MERCURY obtains 1.527, 1.634, and 1.928 weighted speedups, demonstrating better performance. With the increase of cores, MERCURY further exhibits its effectiveness and scalability since it leverages simple hashing to adapt to the workload changes.

5.2.3 Fair Speedup

Fair speedup computes the harmonic mean of the normalized IPCs, while taking into account both fairness and performance. Fig. 8 shows the results of comparing MERCURY with the baseline schemes and PCM in terms of fair speedups. The fair speedups are normalized to those with the private cache.

Compared with the PCM scheme, Mergeable and MERCURY improve the performance on this metric, respectively, by 7.16 and 8.35 percent (4-core), 8.31 and 9.52 percent (8-core), and 8.67 and 9.96 percent (16-core). The main reason is that MERCURY leverages the differentiated placement policy that efficiently allocates the data into the correlated caches and improves the utilization of the multicore processor based on the multitype memberships.

Fig. 8. Normalized fair speedup.

5.2.4 Cache Update Latency

Cache management needs to update stale and infrequently accessed data to guarantee high hit rates. We evaluate the update efficiency of the private, shared, PCM, Mergeable, and MERCURY schemes in terms of operation latency. Fig. 9 shows the update latencies, which are normalized to those in the PCM scheme. We observe that the average normalized latencies of the private and shared caches are 1.22 and 1.14, respectively. Mergeable requires a slightly larger latency than PCM, mainly due to merged write-backs. Compared with the PCM and Mergeable schemes, MERCURY, using the simple hash computation, requires the smallest time and decreases the update latency on average by 48.26, 46.57, and 43.82 percent, respectively, on 4-core, 8-core, and 16-core systems.

Fig. 9. Normalized update latency.

5.2.5 Migration Cost

Hit misses or updates in caches often lead to data migration among multiple caches, which incurs relatively high costs in terms of data transmission and replacement in the caches of other cores. Fig. 10 shows the percentage of migrated data in PCM, Mergeable, and MERCURY. Mergeable is able to detect and merge similar data to guarantee that many correlated data are stored in a single cache, thus producing a smaller amount of migrated data than PCM.

We observe that the average percentages of migrated data are 13.2 and 11.9 percent, respectively, in the private and shared caches. Compared with Mergeable, MERCURY obtains better performance on this metric and decreases the amount of migrated data on average by 35.26, 32.57, and 31.73 percent on 4-core, 8-core, and 16-core systems. The main reasons are twofold. One is that the MC-LSH provides high accuracy in identifying correlated data, thus reducing the amount of migrated data. The other is that the fast identification of similar data in MERCURY produces low computation complexity.

Fig. 10. Percentage of migrated data.

5.2.6 Hit Rate

One of the key metrics to evaluate cache efficiency is the hit rate, which defines the probability of obtaining queried data within the limited cache space. Fig. 11 shows the cache hit rate of the MERCURY scheme compared with the private, shared, PCM, and Mergeable schemes. The average hit rates of private, shared, MERCURY, Mergeable, and PCM are, respectively, 65.22, 67.15, 86.92, 77.48, and 69.27 percent on the 4-core system, 61.68, 63.73, 83.87, 74.16, and 65.12 percent on the 8-core system, and 59.51, 61.34, 81.73, 70.52, and 62.35 percent on the 16-core system. MERCURY has better performance on this metric than Mergeable and PCM since the MC-LSH accurately identifies correlated data within constant-scale execution complexity. The improved accuracy significantly decreases the potential migration costs that possibly occur due to hit misses. The quick identification also alleviates the effects of staleness in the caches.

Fig. 11. Hit rates in multicore caches.

5.2.7 Time and Space Overheads

The execution time in our performance evaluation includes the identification and placement of the correlated data in the L1 caches. We evaluate the MERCURY, PCM, Mergeable, and standard LSH schemes in terms of time overhead, as shown in Fig. 12a. The time overhead is normalized to that of the Mergeable scheme. MERCURY makes use of hashing computation to identify correlated data, thus requiring smaller execution time than Mergeable, which needs to carry out the extra operations of merging cache blocks. Compared with Mergeable, MERCURY decreases the execution time by 28.73, 21.84, and 19.56 percent on 4-core, 8-core, and 16-core systems.

MERCURY needs to use the signature vector to reveal the similarity of correlated data and leverages counting Bloom filters to maintain the memberships of cached data, while Mergeable requires temporary storage space, for example, content-addressable memory, to find similar data. Fig. 12b shows the comparisons among the MERCURY, PCM, Mergeable, and standard LSH schemes in terms of space overhead. We observe that, compared with the Mergeable scheme, MERCURY obtains significant space savings and decreases the space overhead by 47.35, 45.26, and 40.82 percent, respectively, on 4-core, 8-core, and 16-core systems. The main reason is that MERCURY leverages the simple bit-aware signature and space-efficient Bloom filters to demonstrate and maintain the memberships of correlated data, thus obtaining space savings.

Fig. 12. Normalized time and space overheads.

6 SYSTEM IMPLEMENTATION STUDY

We present the experimental results of running standard workloads provided by both the TPC-H and TPC-C benchmarks [40]. Specifically, we set up 100 clients (i.e., the maximum number of clients allowed in PostgreSQL) to send out queries concurrently. For each client, queries are randomly drawn from the pool of all TPC-H queries. We repeat the same experiments under three different dataset sizes: 500 MB, 1 GB, and 10 GB for TPC-H; and 1 GB, 5 GB, and 10 GB for TPC-C. Due to space limitation, the evaluation mainly shows the results of the TPC-H workload. We run all experiments in the cloud. Each cloud server has two Intel Core 2 Quad 2.66-GHz CPUs, 8-GB memory, and four 250-GB disks. Each processor has four cores, and every two cores share a 4-MB L2 cache. The DBMS used in our experiments is PostgreSQL 8.3.1 running on Linux kernel 2.6.14.1. We measure the performance by three metrics: L2 cache miss rate, query execution time (cycles per instruction), and patching costs.

For a given dataset, the hash functions come from the random selection of the LSH function family. More hash functions provide higher accuracy in identifying similar data, which however incurs higher computation complexity and space overhead. In our experiments, we select L = 7 hash functions based on the presample estimation, in which we randomly extract a subset from the used trace and make the estimation; this method has been successfully used in real-world applications [31], [35], [39], [41]. Moreover, for the data with different types or dimensionalities, we use a normalization method to compute data similarity in the same metric measure. The normalized value is equal to (Actual Value - Minimum)/(Maximum - Minimum). For instance, for the attribute of file size, we assume that the size range is from 10 KB to 200 KB (i.e., range: 10-200). For a file of 120 KB, its normalized value is (120 - 10)/(200 - 10) = 0.58.

We evaluate the real implementation performance by comparing with MCC-DB [17]. MCC-DB exploits access and caching patterns from query analysis. The reason for making this comparison is threefold. First, both MCC-DB and MERCURY work well as patches to PostgreSQL [16] for concurrent queries. Second, the essential property behind the two techniques is to use cache partitioning in multicore processors to enhance system performance. Third, MCC-DB has provided standard experimental results by using TPC-H [40] to facilitate fair comparisons with other methods. The TPC-H [40] benchmark has large volumes of data for decision support systems when executing 22 different types of queries. We perform extensive experiments on a physical testbed based on the PostgreSQL system using the workloads generated from the TPC benchmarks.
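The min-max normalization used to bring heterogeneous attributes onto one scale before the similarity computation can be written directly from the formula above; the file-size range in the example reproduces the paper's 10-200 KB illustration.

```python
def min_max_normalize(value, minimum, maximum):
    """(Actual Value - Minimum) / (Maximum - Minimum), clamped to [0, 1]."""
    norm = (value - minimum) / (maximum - minimum)
    return max(0.0, min(1.0, norm))

# The file-size example from the text: 120 KB within a 10-200 KB range.
print(round(min_max_normalize(120, 10, 200), 2))   # 0.58
```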

6.1 L2 Cache Miss Rate

Fig. 13. L2 miss rates of TPC-H workload under three dataset sizes.

Fig. 14. L2 miss rates of TPC-C workload under three dataset sizes.

Figs. 13 and 14, respectively, show the L2 miss rates when using the TPC-H and TPC-C workloads with different dataset sizes. We first examine the rates of three typical queries, i.e., Q5, Q8, and Q20, in both the MCC-DB and MERCURY schemes. Q5 and Q8 are dominated by multiway hash joins and Q20 is dominated by nested subquery executions. We observe that MERCURY obtains on average 42.6, 48.1, and 51.2 percent miss decreases compared with MCC-DB in the TPC-H workload with 500-MB, 1-GB, and 10-GB sizes. These benefits come from the fast and accurate hashing-based computation that identifies similar data to efficiently support concurrent queries. We further examine the L2 miss rates by executing all 22 queries in TPC-H. Compared with MCC-DB, MERCURY decreases the miss rates by 16.1 to 26.7 percent, on average 21.8 percent for all 22 queries, as shown in Fig. 15. In addition, it is also observed that Q1, Q4, Q14, and Q21 show comparable values with MCC-DB due to their relatively weak locality characteristic.

Fig. 15. L2 miss rates in 10-GB dataset with 4-MB L2 cache.

6.2 Query Execution Time

The shared L2 cache plays a key role in determining query execution performance. When concurrent query execution processes access common tuples and index data, MERCURY enables multiple cores to share data and thus reduces unnecessary memory accesses. In MERCURY, a query execution process reuses cache lines. We evaluate the query execution time by using cycles per instruction, which is examined by using the performance tool to check hardware counters. Figs. 16a and 16b, respectively, show the execution time of queries for the 1-GB and 10-GB dataset sizes. Due to the different functionalities of all 22 queries in TPC-H, the execution time shows significant fluctuation. MERCURY also obtains on average 21.7 and 23.1 percent improvements upon MCC-DB for the 1-GB and 10-GB sizes.

Fig. 16. Execution time of queries.

TABLE 3. Normalized Time and Space Costs as a Patch

6.3 Patching Cost

Both MCC-DB and MERCURY are implemented as patches in PostgreSQL. The new component may introduce extra patching costs. We examine these costs in terms of time and space, as shown in Table 3, normalized to MCC-DB, when taking into account both the TPC-C and TPC-H datasets under different sizes. MERCURY requires less time and space costs than MCC-DB because MERCURY makes use of the fast MC-LSH hashing computation to place similar data together. We also observe that with the increase of dataset sizes, MERCURY obtains more benefits in terms of time and space overheads over MCC-DB. In the meantime, this observation demonstrates the scalability of the MERCURY scheme.

7 RELATED WORK

Multilevel cache hierarchy has been studied in the high-performance cloud architecture and software communities. There exists a wide range of proposals to improve caching performance (e.g., hit rate, access latency, and space overhead) [11], [19], [20], [42], [43], [44]. We argue that suitable management of the multilevel cache hierarchy is becoming more important to deliver high performance in the cloud.

Locality-based optimization. The state-of-the-art work, R-NUCA [45], obtains near-optimal cache block placement by classifying blocks online and placing data close to the core. To mitigate the loss of reusing cached states when rescheduling a process, affinity scheduling [46] helps reduce cache misses by judiciously scheduling a process on a recently used CPU. To improve the performance in "multiexecution" applications, Mergeable [7] captures data similarities and merges duplicate cache lines owned by different processes to obtain substantial capacity savings. Nevertheless, performing the explicit merging operations on cache blocks demands relatively longer execution time and increases computation complexity. The process-level cache management policy (PCM) [12] has the assumption that all memory regions belonging to a running process exhibit the same access pattern. MCC-DB [17] makes use of different locality strengths and query execution patterns to minimize cache conflicts. This improvement works under the assumption that there are multiple candidate plans that are accurately estimated in advance. However, this assumption does not always hold in many practical applications because performing accurate estimates generally requires high computation overheads. Unlike them, MERCURY explores and exploits the locality property with a lightweight hashing approach, thus obtaining significant performance improvements.

Hardware acceleration. To reduce the cache pollution caused by LRU, which inserts non-reusable items into the cache while evicting reusable ones, ROCS [6] employs hardware counters to characterize cache behaviors and introduces a pollute buffer to host non-reused cache lines of pages before eviction. Moreover, to address the problems of increased capacity interference and longer L2 access latency, CloudCache [47] leverages fine-grained hardware monitoring and control to dynamically expand and shrink L2 caches for working threads by using dynamic global partitioning, distance-aware data placement, and limited target broadcast. Hardware-assisted execution throttling [48] helps regulate fairness in modern multicore processors, while demonstrating the relative benefits of the various resource control mechanisms. Moreover, to reduce the large number of misses in the LLC between the eviction of a block and its reuse, Scavenger [49] divides the total cache storage into a conventional cache and a victim file architecture to identify and retain high-priority cache blocks that are more likely to be reused. MERCURY bridges the gap between a multicore architecture and an operating system. The existing hardware acceleration approaches can use MERCURY to simplify operations and optimize system calls.

Operations enhancements. MergeSort [50] performed an efficient multiway merge without being constrained by the memory bandwidth for high-throughput database applications. Parallel skyline computation can benefit from multicore architectures, such as a parallel version of the branch-and-bound algorithm. Park et al. [51] presented a parallel algorithm based on parallel programming that was evaluated as a case study of parallelizing database operations. A cooperation-based locking paradigm [52] was proposed for efficient parallelization of frequency counting and top-k over multiple streams in the context of multicore processors. In addition, adaptive aggregation [53] demonstrated that a chip multiprocessor with new dimensions could enhance concurrent sharing of aggregation data structures and contentious accesses to frequently used values. Qiao et al. [54] introduced a scheduling technique that cooperates multiple memory scans to reduce the overhead on memory bandwidth. These research projects aim to make a single query benefit from the cache, which is orthogonal to our work. Mining locality can improve parallel queries on multicore CPUs [55] and tree-structured data [56]. Two popular join algorithms, hash join and sort-merge join, were re-examined in [57] to use multicore cache blocking to minimize access latency, increase compute density, and balance load among cores, even for heavily skewed input datasets. CATCH [58] can store unique contents in the instruction cache by means of hashing, but the proposed system does not support modifications to cached data. In addition, the cache compression technique [59] compresses the L2 data results to reduce the cache space and the off-chip accesses, thus obtaining bandwidth savings. The cooperative caching technique [60] in a multiprocessor can reduce off-chip accesses through using a cooperative private cache, either by storing a single copy of clean blocks or by providing a cache-like, spill-over memory for storing evicted cache lines. MERCURY can improve query performance by using a locality-aware data placement strategy.

Workloads awareness. An OS-based cache partitioning mechanism [29] presents execution- and measurement-based strategies for multicore cache partitioning upon multiple representative workloads. A nonuniform cache architecture (NUCA) [61] takes advantage of the proximity of data to the accessing processor. To further address the problem of on-chip data locality in large shared NUCAs, PageNUCA [8] proposed a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page. Subsequently, the NuRAPID proposal [62] decoupled the tag and data placement in a NUCA by augmenting each tag and data block with a forward and reverse pointer to the corresponding data block and tag, respectively. NUcache [63] makes use of the Delinquent PC-Next-Use characteristic to improve the performance of shared caches in multicores. The NUcache organization logically partitions the associative ways of a cache set into MainWays and DeliWays. MERCURY is orthogonal to the existing schemes. It leverages lightweight LSH-based computation and obtains significant performance improvements on the LLC by accurately capturing the differentiated locality across data.

Scheduling. Age-based scheduling for heterogeneous multiprocessors [64] allows a thread with the larger remaining execution time to run in a faster core, given the prediction of the remaining execution time. A thread-based preloading technique for simultaneous multithreading processors was proposed in [65] to use a helper thread to perform aggressive data preloading. To improve the utilization of on-chip memory and reduce the impact of expensive DRAM and remote cache accesses, O2 scheduling [66] schedules objects and operations to caches and cores. To decrease the unnecessary sharing of network control state at all stack layers, the IsoStack architecture [67] offloads network stack processing to a dedicated processor core. Moreover, integrated processor-cache partitioning [9] divides both the available processors and the shared cache in a chip multiprocessor among different multithreaded applications. The existing scheduling strategies can further help optimize MERCURY's performance.

8 CONCLUSION

MERCURY, as an infrastructure of the cloud, plays a significant role in managing the multilevel cache hierarchy. By exploring and exploiting data similarity that is derived from locality-aware access patterns, MERCURY alleviates homogeneous data placement and improves system performance with the low-complexity MC-LSH computation. The cost-effective MERCURY is able to provide hybrid functionalities. One is to provide a lightweight mechanism for allocating cache resources. The other is to support OS-based dynamic cache allocation and capture data similarity with the aid of space-efficient structures. MERCURY, hence, allows OS control over the shared LLCs, while minimizing software overheads. Experiments using real-world datasets demonstrate MERCURY's efficiency.

Since modern microprocessors increasingly incorporate multiple memory controllers [68], [69], our future work will consider locality-aware schemes for implementing efficient data placement across multiple memory controllers.


ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61173043; the National Basic Research 973 Program of China under Grant 2011CB302301; NSFC under Grants 61025008 and 61232004; the Fundamental Research Funds for the Central Universities, HUST, under Grant 2012QN098; NSERC Discovery Grant 341823; and US National Science Foundation Award 1116606. This is an extended version of our paper published in the Proceedings of the 20th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2012, pages 371-378.

REFERENCES

[1] J. Gantz and D. Reinsel, "Digital Universe Study: Extracting Value from Chaos," Proc. Int'l Data Corporation (IDC), June 2011.
[2] Science Staff, "Dealing with Data—Challenges and Opportunities," Science, vol. 331, no. 6018, pp. 692-693, 2011.
[3] M. Armbrust et al., "A View of Cloud Computing," Comm. ACM, vol. 53, no. 4, pp. 50-58, 2010.
[4] S. Bykov, A. Geller, G. Kliot, J. Larus, R. Pandya, and J. Thelin, "Orleans: Cloud Computing for Everyone," Proc. ACM Symp. Cloud Computing (SOCC), 2011.
[5] S. Wu, F. Li, S. Mehrotra, and B. Ooi, "Query Optimization for Massively Parallel Data Processing," Proc. ACM Symp. Cloud Computing (SOCC), 2011.
[6] L. Soares, D. Tam, and M. Stumm, "Reducing the Harmful Effects of Last-Level Cache Polluters with an OS-Level, Software-Only Pollute Buffer," Proc. IEEE/ACM 41st Ann. Int'l Symp. Microarchitecture (MICRO), pp. 258-269, 2009.
[7] S. Biswas, D. Franklin, A. Savage, R. Dixon, T. Sherwood, and F. Chong, "Multi-Execution: Multicore Caching for Data-Similar Executions," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA), 2009.
[8] M. Chaudhuri, "PageNUCA: Selected Policies for Page-Grain Locality Management in Large Shared Chip-Multiprocessor Caches," Proc. Int'l Symp. High-Performance Computer Architecture (HPCA), pp. 227-238, 2009.
[9] S. Srikantaiah, R. Das, A.K. Mishra, C.R. Das, and M. Kandemir, "A Case for Integrated Processor-Cache Partitioning in Chip Multiprocessors," Proc. Conf. High Performance Computing Networking, Storage, and Analysis (SC), 2009.
[10] X. Ding, K. Wang, and X. Zhang, "SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores," Proc. Sixth Conf. Computer Systems (EuroSys), 2011.
[11] Y. Chen, S. Byna, and X. Sun, "Data Access History Cache and Associated Data Prefetching Mechanisms," Proc. IEEE/ACM Conf. Supercomputing (SC), 2007.
[12] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, "Enabling Software Management for Multicore Caches with a Lightweight Hardware Support," Proc. Conf. High Performance Computing Networking, Storage, and Analysis (SC), 2009.
[13] J. Stuecheli, D. Kaseridis, D. Daly, H. Hunter, and L. John, "The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA), 2010.
[14] A. Forin, B. Neekzad, and N. Lynch, "Giano: The Two-Headed System Simulator," Technical Report MSR-TR-2006-130, Microsoft Research, 2006.
[15] S. Biswas, D. Franklin, T. Sherwood, and F. Chong, "Conflict-Avoidance in Multicore Caching for Data-Similar Executions," Proc. 10th Int'l Symp. Pervasive Systems, Algorithms, and Networks (ISPAN), 2009.
[16] "PostgreSQL," http://www.postgresql.org/, 2013.
[17] R. Lee, X. Ding, F. Chen, Q. Lu, and X. Zhang, "MCC-DB: Minimizing Cache Conflicts in Multi-Core Processors for Databases," Proc. VLDB Endowment, vol. 2, no. 1, pp. 373-384, 2009.
[18] B.N. Bershad, D. Lee, T.H. Romer, and J.B. Chen, "Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches," Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994.
[19] Y. Yan, X. Zhang, and Z. Zhang, "Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 4, pp. 357-374, Apr. 2000.
[20] K. Zhang, Z. Wang, Y. Chen, H. Zhu, and X. Sun, "PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions from Hardware Prefetchers," Proc. IEEE/ACM 11th Int'l Symp. Cluster, Cloud, and Grid Computing (CCGrid), pp. 265-274, 2011.
[21] G. Suh, S. Devadas, and L. Rudolph, "Analytical Cache Models with Applications to Cache Partitioning," Proc. ACM 15th Int'l Conf. Supercomputing (ICS), 2001.
[22] "UCI Machine Learning Repository," The Forest Cover Type Data Set, http://archive.ics.uci.edu/ml/datasets/Covertype, 2013.
[23] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer, "Passive NFS Tracing of Email and Research Workloads," Proc. Second USENIX Conf. File and Storage Technologies (FAST), 2003.
[24] E. Riedel, M. Kallahalla, and R. Swaminathan, "A Framework for Evaluating Storage System Security," Proc. Conf. File and Storage Technologies (FAST), 2002.
[25] SPEC2000, http://www.spec.org/cpu2000/, 2013.
[26] S. Carr and K. Kennedy, "Compiler Blockability of Numerical Algorithms," Proc. Supercomputing Conf., 1992.
[27] M.S. Lam, E.E. Rothberg, and M.E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1991.
[28] T.C. Mowry, M.S. Lam, and A. Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1992.
[29] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, "Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems," Proc. IEEE 14th Int'l Symp. High Performance Computer Architecture (HPCA), 2008.
[30] P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Toward Removing the Curse of Dimensionality," Proc. 30th Ann. ACM Symp. Theory of Computing (STOC), 1998.
[31] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB), pp. 950-961, 2007.
[32] R. Shinde, A. Goel, P. Gupta, and D. Dutta, "Similarity Search and Locality Sensitive Hashing Using Ternary Content Addressable Memories," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), pp. 375-386, 2010.
[33] A. Joly and O. Buisson, "A Posteriori Multi-Probe Locality Sensitive Hashing," Proc. ACM Int'l Conf. Multimedia, 2008.
[34] G. Taylor, P. Davies, and M. Farmwald, "The TLB Slice - A Low-Cost High-Speed Address Translation Mechanism," Proc. 17th Ann. Int'l Symp. Computer Architecture (ISCA), 1990.
[35] A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," Comm. ACM, vol. 51, no. 1, pp. 117-122, 2008.
[36] L. Fan, P. Cao, J. Almeida, and A. Broder, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol," IEEE/ACM Trans. Networking, vol. 8, no. 3, pp. 281-293, June 2000.
[37] B. Bloom, "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Comm. ACM, vol. 13, no. 7, pp. 422-426, 1970.
[38] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, "Quality and Efficiency in High-Dimensional Nearest Neighbor Search," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2009.
[39] Y. Hua, B. Xiao, D. Feng, and B. Yu, "Bounded LSH for Similarity Search in Peer-to-Peer File Systems," Proc. 37th Int'l Conf. Parallel Processing (ICPP), pp. 644-651, 2008.
[40] TPC, http://www.tpc.org/, 2013.
[41] Y. Hua, B. Xiao, B. Veeravalli, and D. Feng, "Locality-Sensitive Bloom Filter for Approximate Membership Query," IEEE Trans. Computers, vol. 61, no. 6, pp. 817-830, June 2012.
[42] Z. Zhang, Z. Zhu, and X. Zhang, "Cached DRAM for ILP Processor Memory Access Latency Reduction," IEEE Micro, vol. 21, no. 4, pp. 22-32, July 2001.
[43] S. Byna, Y. Chen, X. Sun, R. Thakur, and W. Gropp, "Parallel I/O Prefetching Using MPI File Caching and I/O Signatures," Proc. IEEE/ACM Conf. Supercomputing (SC), 2008.
[44] Z. Zhang, Z. Zhu, and X. Zhang, "Design and Optimization of Large Size and Low Overhead Off-Chip Caches," IEEE Trans. Computers, vol. 53, no. 7, pp. 843-855, July 2004.

[45] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures," IEEE Micro, vol. 30, no. 1, pp. 20-28, Jan./Feb. 2010.
[46] J. Torrellas, A. Tucker, and A. Gupta, "Benefits of Cache-Affinity Scheduling in Shared-Memory Multiprocessors: A Summary," Proc. ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems (SIGMETRICS), 1993.
[47] H. Lee, S. Cho, and B. Childers, "CloudCache: Expanding and Shrinking Private Caches," Proc. IEEE 17th Int'l Symp. High Performance Computer Architecture (HPCA), pp. 219-230, 2011.
[48] X. Zhang, S. Dwarkadas, and K. Shen, "Hardware Execution Throttling for Multi-Core Resource Management," Proc. USENIX Ann. Technical Conf., 2009.
[49] A. Basu, N. Kirman, M. Kirman, M. Chaudhuri, and J. Martinez, "Scavenger: A New Last Level Cache Architecture with Global Block Priority," Proc. 40th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO), pp. 421-432, 2007.
[50] J. Chhugani, A. Nguyen, V. Lee, W. Macy, M. Hagog, Y. Chen, A. Baransi, S. Kumar, and P. Dubey, "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture," Proc. VLDB Endowment, vol. 1, pp. 1313-1324, 2008.
[51] S. Park, T. Kim, J. Park, J. Kim, and H. Im, "Parallel Skyline Computation on Multicore Architectures," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), 2009.
[52] S. Das, S. Antony, D. Agrawal, and A.E. Abbadi, "Thread Cooperation in Multicore Architectures for Frequency Counting over Multiple Data Streams," Proc. VLDB Endowment, vol. 2, pp. 217-228, 2009.
[53] J. Cieslewicz and K. Ross, "Adaptive Aggregation on Chip Multiprocessors," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB), 2007.
[54] L. Qiao, V. Raman, F. Reiss, P. Haas, and G. Lohman, "Main-Memory Scan Sharing for Multi-Core CPUs," Proc. VLDB Endowment, vol. 1, pp. 610-621, 2008.
[55] W. Han and J. Lee, "Dependency-Aware Reordering for Parallelizing Query Optimization in Multi-Core CPUs," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2009.
[56] S. Tatikonda and S. Parthasarathy, "Mining Tree-Structured Data on Multicore Systems," Proc. VLDB Endowment, vol. 2, pp. 694-705, 2009.
[57] C. Kim, T. Kaldewey, V. Lee, E. Sedlar, A. Nguyen, N. Satish, J. Chhugani, A.D. Blas, and P. Dubey, "Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs," Proc. VLDB Endowment, vol. 2, pp. 1378-1389, 2009.
[58] M. Kleanthous and Y. Sazeides, "CATCH: A Mechanism for Dynamically Detecting Cache-Content-Duplication and Its Application to Instruction Caches," Proc. Conf. Design, Automation, and Test in Europe (DATE), 2008.
[59] A. Alameldeen and D. Wood, "Adaptive Cache Compression for High-Performance Processors," Proc. 31st Ann. Int'l Symp. Computer Architecture (ISCA), 2004.
[60] J. Chang and G. Sohi, "Cooperative Caching for Chip Multiprocessors," Proc. 33rd Ann. Int'l Symp. Computer Architecture (ISCA), 2006.
[61] C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.
[62] Z. Chishti, M. Powell, and T. Vijaykumar, "Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures," Proc. IEEE/ACM 36th Ann. Int'l Symp. Microarchitecture (MICRO), 2003.
[63] R. Manikantan, K. Rajan, and R. Govindarajan, "NUcache: An Efficient Multicore Cache Organization Based on Next-Use Distance," Proc. IEEE 17th Int'l Symp. High Performance Computer Architecture (HPCA), pp. 243-253, 2011.
[64] N. Lakshminarayana, J. Lee, and H. Kim, "Age Based Scheduling for Asymmetric Multiprocessors," Proc. IEEE/ACM Supercomputing Conf., 2009.
[65] J. Zhou, J. Cieslewicz, K. Ross, and M. Shah, "Improving Database Performance on Simultaneous Multithreading Processors," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), 2005.
[66] S. Boyd-Wickizer, R. Morris, and M.F. Kaashoek, "Reinventing Scheduling for Multicore Systems," Proc. 12th Conf. Hot Topics in Operating Systems (HotOS), 2009.
[67] L. Shalev, J. Satran, E. Borovik, and M. Ben-Yehuda, "IsoStack: Highly Efficient Network Processing on Dedicated Cores," Proc. USENIX Ann. Technical Conf., 2010.
[68] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis, "Managing Data Placement in Memory Systems with Multiple Memory Controllers," Int'l J. Parallel Programming, vol. 40, no. 1, pp. 57-83, 2012.
[69] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis, "Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers," Proc. 19th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT), 2010.

Yu Hua received the BE and PhD degrees in computer science from Wuhan University, China, in 2001 and 2005, respectively. He is an associate professor at the Huazhong University of Science and Technology, China. His research interests include computer architecture, cloud computing, and network storage. He has more than 40 papers to his credit in major journals and international conferences including IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, USENIX ATC, INFOCOM, SC, ICDCS, ICPP, and MASCOTS. He has been on the program committees of multiple international conferences, including INFOCOM and ICPP. He is a senior member of the IEEE, and a member of ACM and USENIX.

Xue Liu received the BS and MS degrees in mathematics and automatic control, respectively, from Tsinghua University, China, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2006. He is an associate professor in the School of Computer Science at McGill University. His research interests include computer networks and communications, smart grid, real-time and embedded systems, cyber-physical systems, data centers, and software reliability. He serves as an editor for IEEE Transactions on Vehicular Technology, an associate editor for IEEE Transactions on Parallel and Distributed Systems, and an editor for IEEE Communications Surveys and Tutorials. His work has received the Year 2008 Best Paper Award from IEEE Transactions on Industrial Informatics, and the First Place Best Paper Award of the ACM Conference on Wireless Network Security (WiSec 2011). He is a member of the IEEE and the ACM.

Dan Feng received the BE, ME, and PhD degrees in computer science and technology from the Huazhong University of Science and Technology (HUST), China, in 1991, 1994, and 1997, respectively. She is a professor and vice dean of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, massive storage systems, and parallel file systems. She has more than 100 publications to her credit in journals and international conferences, including IEEE Transactions on Parallel and Distributed Systems, JCST, USENIX ATC, FAST, ICDCS, HPDC, SC, ICS, and ICPP. She serves as a program committee member of multiple international conferences, including SC and MSST. She is a member of the IEEE.

