
An experimental and theoretical comparison of model selection methods

Michael Kearns, AT&T Bell Laboratories, Murray Hill, New Jersey
Yishay Mansour, Tel Aviv University, Tel Aviv, Israel
Andrew Y. Ng, Carnegie Mellon University, Pittsburgh, Pennsylvania
Dana Ron, Hebrew University, Jerusalem, Israel

1 Introduction

In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. Instances of the model selection problem include choosing the best number of hidden nodes in a neural network, determining the right amount of pruning to be performed on a decision tree, and choosing the degree of a polynomial fit to a set of points. In each of these cases, the goal is not to minimize the error on the training data, but to minimize the resulting generalization error.

Many model selection algorithms have been proposed in the literature of several different research communities, too many to productively survey here. (A more detailed history of the problem will be given in the full paper.) Perhaps surprisingly, despite the many proposed solutions for model selection and the diverse methods of analysis, direct comparisons between the different proposals (either experimental or theoretical) are rare.

The goal of this paper is to provide such a comparison, and more importantly, to describe the general conclusions to which it has led. Relying on evidence that is divided between controlled experimental results and related formal analysis, we compare three well-known model selection algorithms. We attempt to identify their relative and absolute strengths and weaknesses, and we provide some general methods for analyzing the behavior and performance of model selection algorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model selection algorithm.

The paper proceeds as follows: after giving definitions (Section 2) and describing the three algorithms (Section 3), we present a controlled experimental comparison (Section 4), a general error bound for penalty-based algorithms (Section 5), and a bound on the additional error of cross validation (Section 6). In Section 7, we prove a handicap suffered by the entire class of penalty-based algorithms that does not afflict cross validation. In Section 8, we weigh the evidence and find that it provides concrete arguments favoring the use of cross validation (or at least cause for caution in using any penalty-based algorithm).

2 Definitions

Throughout the paper we assume that a fixed boolean target function f is used to label inputs drawn randomly according to a fixed distribution D. For any boolean function h, we define the generalization error ε(h) = Pr_{x~D}[h(x) ≠ f(x)]. We use S to denote the random sample S = ⟨(x_1, b_1), ..., (x_m, b_m)⟩, where m is the sample size, each x_i is drawn randomly and independently according to D, and b_i = f(x_i) ⊕ c_i, where the noise bit c_i ∈ {0, 1} is 1 with probability η; we call η ∈ [0, 1/2) the noise rate. In the case that η ≠ 0, we will sometimes wish to discuss the generalization error of h with respect to the noisy examples, so we define ε_η(h) = Pr_{x~D, c}[h(x) ≠ f(x) ⊕ c], where c is the noise bit. Note that ε(h) and ε_η(h) are related by the equality ε_η(h) = (1 − 2η)ε(h) + η. For simplicity, we will use the expression "with high probability" to mean with probability 1 − δ over the draw of S, at a cost of a factor of log(1/δ) in the bounds; thus, our bounds all contain "hidden" logarithmic factors, but our handling of confidence is entirely standard and will be spelled out in the full paper.
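For completeness, the stated equality can be verified in a few lines from the definitions (a derivation we supply here; it is implicit in the text above). Conditioning on whether the noise bit fires:

```latex
\begin{align*}
\epsilon_\eta(h) &= (1-\eta)\Pr[h(x)\neq f(x)] + \eta\,\Pr[h(x)=f(x)] \\
                 &= (1-\eta)\,\epsilon(h) + \eta\,(1-\epsilon(h)) \\
                 &= (1-2\eta)\,\epsilon(h) + \eta .
\end{align*}
```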

We assume a nested sequence of hypothesis classes (or models) F_1 ⊆ ··· ⊆ F_d ⊆ ···. The target function f may or may not be contained in any of these classes, so we define h_d = argmin_{h ∈ F_d}{ε(h)} and ε_opt(d) = ε(h_d) (similarly, ε_opt^η(d) = ε_η(h_d)). Thus, h_d is the best approximation to f (with respect to D) in the class F_d, and ε_opt(d) measures the quality of this approximation. Note that ε_opt(d) is a non-increasing function of d, since the hypothesis classes are nested. Thus, larger values of d can only improve the potential approximative power of the hypothesis class. Of course, the difficulty is to realize this potential on the basis of a small sample.

With this notation, the model selection problem can be stated informally: on the basis of a random sample S of a fixed size m, the goal is to choose a hypothesis complexity d̃ and a hypothesis h̃ ∈ F_d̃ such that the resulting generalization error ε(h̃) is minimized. In many treatments of model selection, including ours, it is explicitly or implicitly assumed that the model selection algorithm has control only over the choice of the complexity d̃, but not over the choice of the final hypothesis h̃ ∈ F_d̃. It is assumed that there is a fixed algorithm that chooses a set of candidate hypotheses, one from each hypothesis class. Given this set of candidate hypotheses, the model selection algorithm then chooses one of the candidates as the final hypothesis.

To make these ideas more precise, we define the training error ε̂(h) = |{i : h(x_i) ≠ b_i}|/m, and the version space VS(d) = {h ∈ F_d : ε̂(h) = min_{h' ∈ F_d}{ε̂(h')}}. Note that VS(d) may contain more than one function: several functions may minimize the training error. If we are lucky, we have in our possession a (possibly randomized) learning algorithm L that takes as input any sample S and any complexity value d, and outputs a member h̃_d of VS(d) (using some unspecified criterion to break ties if |VS(d)| > 1). More generally, it may be the case that finding any function in VS(d) is intractable, and that L is simply a heuristic (such as backpropagation or ID3) that does the best job it can at finding h̃_d ∈ F_d with small training error on input S and d. In either case, we define ε̂(d) = ε̂(h̃_d). Note that we expect ε̂(d), like ε_opt(d), to be a non-increasing function of d: by going to a larger complexity, we can only reduce our training error.

We can now give a precise statement of the model selection problem. First of all, an instance of the model selection problem consists of a tuple (f, D, {F_d}, L), where f is the target function, D is the input distribution, {F_d} is the hypothesis class sequence, and L is the underlying learning algorithm. The model selection problem is then: Given the sample S, and the sequence of functions h̃_1 = L(S, 1), ..., h̃_d = L(S, d), ... determined by the learning algorithm L, select a complexity value d̃ such that h̃_d̃ minimizes the resulting generalization error. Thus, a model selection algorithm is given both the sample S and the sequence of (increasingly complex) hypotheses derived by L from S, and must choose one of these hypotheses.

The current formalization suffices to motivate a key definition and a discussion of the fundamental issues in model selection.

We define ε(d) = ε(h̃_d) (a random variable determined by the random variable S). Thus, ε(d) gives the true generalization error of the function h̃_d chosen by L from the class F_d. Of course, ε(d) is not directly accessible to a model selection algorithm; it can only be estimated or guessed in various ways from the sample. A simple but important observation is that no model selection algorithm can achieve generalization error less than min_d{ε(d)}. Thus the behavior of the function ε(d) (especially the location and value of its minimum) is in some sense the essential quantity of interest in model selection.

The prevailing folk wisdom in several research communities posits that ε(d) will typically have a global minimum that is nontrivial, that is, at an "intermediate" value of d away from the extremes d = 0 and d = m. As a demonstration of the validity of this view, and as an introduction to a particular model selection problem that we will examine in our experiments, we call the reader's attention to Figure 1. In this model selection problem (which we shall refer to as the intervals model selection problem), the input domain is simply the real line segment [0, 1], and the hypothesis class F_d is simply the class of all boolean functions over [0, 1] in which we allow at most d alternations of label; thus F_d is the class of all binary step functions with at most d/2 steps. For the experiments, the underlying learning algorithm L that we have implemented performs training error minimization. This is a rare case where efficient minimization is possible; we have developed an algorithm based on dynamic programming that runs in linear time, thus making experiments on large samples feasible. The sample was generated using the target function in F_100 that divides [0, 1] into 100 segments of equal width 1/100 and alternating label. In Figure 1 we plot ε(d) (which we can calculate exactly, since we have chosen the target function) when S consists of m = 2000 random examples (drawn from the uniform input distribution) corrupted by noise at the rate η = 0.2. For our current discussion it suffices to note that ε(d) does indeed experience a nontrivial minimum. Not surprisingly, this minimum occurs near (but not exactly at) the target complexity of 100.
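To illustrate how the curve ε̂(d) can be computed for this problem, here is a sketch of training error minimization for the intervals problem. It is our own simple O(m·d_max) dynamic program over the state (current output label, alternations used so far); it is not the authors' linear-time algorithm, whose details are deferred to their full paper:

```python
import numpy as np

def training_errors_by_complexity(labels, d_max):
    """labels: 0/1 sample labels sorted by input position x_i.
    Returns err[d] = minimum number of training mistakes achievable by a
    binary function with at most d label alternations, for d = 0..d_max."""
    INF = len(labels) + 1
    # cost[l][a]: fewest mistakes on the prefix processed so far, if the
    # hypothesis currently outputs label l and has alternated exactly a times.
    cost = np.full((2, d_max + 1), INF, dtype=np.int64)
    cost[0, 0] = cost[1, 0] = 0
    for y in labels:
        new = np.empty_like(cost)
        for l in (0, 1):
            stay = cost[l]                                       # keep label l
            switch = np.concatenate(([INF], cost[1 - l, :-1]))   # alternate once
            new[l] = np.minimum(stay, switch) + (l != y)
        cost = new
    # "at most d" alternations = running minimum over "exactly a <= d"
    return np.minimum.accumulate(cost.min(axis=0))
```

Dividing the returned counts by m gives the curve ε̂(d) consumed by the selection rules of the next section.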

3 Three Algorithms for Model Selection

The first two model selection algorithms we consider are members of a general class that we shall informally refer to as penalty-based algorithms (and shall formally define shortly). The common theme behind these algorithms is their attempt to construct an approximation to ε(d) solely on the basis of the training error ε̂(d) and the complexity d, often by trying to "correct" ε̂(d) by the amount that it underestimates ε(d) through the addition of a "complexity penalty" term.

In Vapnik's Guaranteed Risk Minimization (GRM) [11], d̃ is chosen according to the rule

    d̃ = argmin_d { ε̂(d) + (d/m)(1 + √(1 + ε̂(d)·m/d)) }    (1)

where for convenience we assume that d is the Vapnik-Chervonenkis dimension of the class F_d. The complexity penalty term in Equation (1) is, with high probability, an upper bound on |ε̂(d) − ε(d)|, and hence the minimized sum upper bounds ε(d); if we are optimistic, we might further hope that the sum is in fact a close approximation to ε(d), and that its minimization is therefore tantamount to the minimization of ε(d). The actual rule given in Equation (1) is slightly more complex than a naive additive correction, and reflects a refined bound on |ε̂(d) − ε(d)| that varies from roughly d/m for ε̂(d) close to 0 to roughly √(d/m) for larger ε̂(d). In Vapnik's original rule, the quantity d/m is further multiplied by a logarithmic factor intended to guard against worst-case choices from the version space. Since this factor renders GRM uncompetitive on the ensuing experiments, we consider this modified and quite competitive rule whose spirit is the same.
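In code, the rule is a single argmin over the precomputed training error curve. The following transcription is ours (adopting the natural convention of a zero penalty at d = 0):

```python
import numpy as np

def grm_select(train_errs, m):
    """GRM (Equation 1): train_errs[d] = hat-eps(d), for d = 0..d_max."""
    eps = np.asarray(train_errs, dtype=float)
    d = np.arange(len(eps), dtype=float)
    safe_d = np.where(d > 0, d, 1.0)  # avoid division by zero at d = 0
    penalty = (d / m) * (1.0 + np.sqrt(1.0 + eps * m / safe_d))
    return int(np.argmin(eps + penalty))
```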

The second penalty-based algorithm we consider is an instantiation of the Minimum Description Length (MDL) principle [1, 4, 5, 6, 7]. Our goal here is simply to give one reasonable instantiation of MDL. Other coding schemes are obviously possible; however, several of our formal results will hold for essentially all MDL instantiations.

For the intervals problem, a hypothesis h ∈ F_d is specified on the sample by the d positions (among the m sample points) at which its label alternates. This takes log(m choose d) bits; dividing by m to normalize, we obtain (1/m)·log(m choose d) ≈ H(d/m) [2], where H is the binary entropy function. Now given h, the labels in S can be described simply by coding the mistakes of h (that is, those indices i where h(x_i) ≠ b_i), at a normalized cost of H(ε̂(d)). Technically, in the coding scheme just described we also need to specify the values of d and ε̂(d)·m, but the cost of these is negligible. Thus, the version of MDL that we shall examine for the intervals model selection problem dictates the following choice of d̃:

    d̃ = argmin_d { H(ε̂(d)) + H(d/m) }    (2)
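Again the rule is directly computable from the ε̂(d) curve; this sketch is ours:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mid = (p > 0) & (p < 1)
    q = p[mid]
    out[mid] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

def mdl_select(train_errs, m):
    """MDL (Equation 2): minimize H(hat-eps(d)) + H(d/m)."""
    eps = np.asarray(train_errs, dtype=float)
    d = np.arange(len(eps))
    return int(np.argmin(binary_entropy(eps) + binary_entropy(d / m)))
```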

In the context of model selection, GRM and MDL can both be interpreted as attempts to model ε(d) by transforming ε̂(d) and d. More formally, a model selection algorithm of the form

    d̃ = argmin_d { G(ε̂(d), d/m) }    (3)

shall be called a penalty-based algorithm. Notice that an ideal penalty-based algorithm would obey G(ε̂(d), d/m) ≈ ε(d) (or at least G(ε̂(d), d/m) and ε(d) would be minimized by the same value of d).
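Equation (3) amounts to a two-line function with GRM and MDL as particular choices of G; the lambdas below are our restatements of Equations (1) and (2) in this common form:

```python
from math import log2, sqrt

def penalty_select(G, train_errs, m):
    """Generic penalty-based rule (Equation 3)."""
    return min(range(len(train_errs)),
               key=lambda d: G(train_errs[d], d / m))

def H(p):  # binary entropy, with H(0) = H(1) = 0
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

G_grm = lambda e, r: e + r * (1 + sqrt(1 + e / r)) if r > 0 else e
G_mdl = lambda e, r: H(e) + H(r)
```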

The third model selection algorithm that we examine has a different spirit than the penalty-based algorithms. In cross validation (CV) [9, 10], we use only a fraction 1 − γ of the examples in S to obtain the hypothesis sequence h̃_1 ∈ F_1, h̃_2 ∈ F_2, ...; that is, h̃_d is now L(S', d), where S' consists of the first (1 − γ)m examples in S. Here γ ∈ [0, 1] is a parameter of the CV algorithm whose tuning we discuss briefly later. CV chooses d̃ according to the rule

    d̃ = argmin_d { ε̂''(h̃_d) }    (4)

where ε̂''(h̃_d) is the error of h̃_d on S'', the last γm examples of S that were withheld in selecting h̃_d. Notice that for CV, we expect the quantity ε(d) = ε(h̃_d) to be (perhaps considerably) larger than in the case of GRM and MDL, because now h̃_d was chosen on the basis of only (1 − γ)m examples rather than all m examples. For this reason we wish to introduce the more general notation ε_γ(d) = ε(h̃_d) to indicate the fraction γ of the sample withheld from training. CV settles for estimating ε_γ(d) instead of ε_0(d) in order to have an independent test set with which to directly estimate ε_γ(d).

4 A Controlled Experimental Comparison

Our results begin with a comparison of the performance and properties of the three model selection algorithms in a carefully controlled experimental setting, namely, the intervals model selection problem. Among the advantages of such controlled experiments, at least in comparison to empirical results on data of unknown origin, are our ability to exactly measure generalization error (since we know the target function and the distribution generating the data), and our ability to precisely study the effects of varying parameters of the data (such as noise rate, target function complexity, and sample size) on the performance of model selection algorithms. The experimental behavior we observe foreshadows a number of important themes that we shall revisit in our formal results.

We begin with Figure 2. To obtain this figure, a training sample was generated from the uniform input distribution and labeled according to an intervals function over [0, 1] consisting of 100 intervals of alternating label and equal width; the sample was corrupted with noise at rate η = 0.2. In Figure 2, we have plotted the true generalization errors (measured with respect to the noise-free source of examples) ε_GRM, ε_MDL and ε_CV (using test fraction γ = 0.1 for CV) of the hypotheses selected from the sequence h̃_1, h̃_2, ... by each of the three algorithms, as a function of the sample size m, which ranged from 1 to 3000 examples. As described in Section 2, the hypotheses h̃_d were obtained by minimizing the training error within each class F_d. Details of the code used to perform these experiments will be provided in the full paper.

Figure 2 demonstrates the subtlety involved in comparing the three algorithms: in particular, we see that none of the three algorithms outperforms the others for all sample sizes. Thus we can immediately dismiss the notion that one of the algorithms examined can be said to be optimal for this problem in any standard sense. Getting into the details, we see that there is an initial regime (for m from 1 to slightly less than 1000) in which ε_MDL is the lowest of the three errors, sometimes outperforming ε_GRM by a considerable margin. Then there is a second regime (for m about 1000 to about 2500) where an interesting reversal of relative performance occurs, since now ε_GRM is the lowest error, considerably outperforming ε_MDL, which has temporarily leveled off. In both of these first two regimes, ε_CV remains the intermediate performer. In the third and final regime, ε_MDL decreases rapidly to match ε_GRM and the slightly larger ε_CV, and the performance of all three algorithms remains quite similar for all larger sample sizes.

Insight into the causes of Figure 2 is given by Figure 3, where for the same runs used to obtain Figure 2, we instead plot the quantities d̃_GRM, d̃_MDL and d̃_CV, the value of d̃ chosen by GRM, MDL and CV respectively (thus, the "correct" value, in the sense of simply having the same number of intervals as the target function, is 100). Here we see that for small sample sizes, corresponding to the first regime discussed for Figure 2 above, d̃_GRM is slowly approaching 100 from below, reaching and remaining at the target value for about m ≥ 1500. Although we have not shown it explicitly, GRM is incurring nonzero training error throughout the entire range of m. In comparison, for a long initial period (corresponding to the first two regimes of m), MDL is simply choosing the shortest hypothesis that incurs no training error (and thus encodes both "legitimate" intervals and noise), and consequently d̃_MDL grows in an uncontrolled fashion. More precisely, it can be shown that during this period d̃_MDL is obeying d̃_MDL ≈ 2η(1 − η)m + (1 − 2η)²·s, where s is the number of (equally spaced) intervals in the target function and η is the noise rate (so for the current experiment, with s = 100 and η = 0.2, this gives d̃_MDL ≈ 0.32m + 36).

One natural reaction to these biases is to multiply the GRM complexity penalty term by a constant less than 1 (or analogously, to multiply the MDL complexity penalty term by a constant greater than 1 to reduce overcoding). The results of an experiment on such a modified version of GRM are shown in Figures 7 and 8, where the original GRM performance is compared to a modified version in which the complexity penalty is multiplied by 0.5. Interestingly and perhaps unfortunately, we see that there is no free lunch: while the modified version does indeed code more rapidly and thus reduce the small-m generalization error, this comes at the cost of a subsequent overcoding regime with a corresponding degradation in generalization error (and in fact a considerably slower return to d = 100 than MDL under the same conditions). The reverse phenomenon (reluctance to code) is experienced for MDL with an increased complexity penalty multiplier (details in the full paper).

Let us summarize the key points demonstrated by these experiments. First, none of the three algorithms dominates the others for all sample sizes. Second, the two penalty-based algorithms seem to have a bias either towards or against coding that is overcome by the inherent properties of the data asymptotically, but that can have a large effect on generalization error for small to moderate sample sizes. Third, this bias cannot be overcome simply by adjusting the relative weight of error and complexity penalties, without reversing the bias of the resulting rule and suffering increased generalization error for some range of m. Fourth, while CV is not the best of the algorithms for any value of m, it does manage to fairly closely track the best penalty-based algorithm for each value of m, and considerably beats both GRM and MDL in their regimes of weakness. We now turn our attention to our formal results, where each of these key points will be developed further.

5 A Bound on Generalization Error for Penalty-Based Algorithms

We begin our formal results with a bound on the generalization error for penalty-based algorithms that enjoys three features. First, it is general: it applies to practically any penalty-based algorithm, and holds for any model selection problem (of course, there is a price to pay for such generality, as discussed below). Second, for certain algorithms and certain problems the bound can give rapid rates of convergence to small error. Third, the form of the bound is suggestive of some of the behavior seen in the experimental results. We state the bound for the special but natural case in which the underlying learning algorithm L is training error minimization; in the full paper, we will present a straightforward analogue for more general L. Both this theorem and Theorem 2 in the following section are stated for the noise-free case; but again, straightforward generalizations to the noisy case will be included in the full paper.

Theorem 1 Let (f, D, {F_d}, L) be an instance of the model selection problem in which L performs training error minimization, and assume for convenience that d is the VC dimension of F_d. Let A be a penalty-based model selection algorithm; for notational simplicity we state the bound for penalties of the additive form d̃ = argmin_d{ε̂(d) + 𝒢(d/m)}. Then with high probability,

    ε(d̃) ≤ min_d { ε_opt(d) + 𝒢(d/m) + Δ(d) } + Δ(d̃)    (5)

where Δ(d) = O(√((d/m)·log(m/d))) is the standard uniform convergence bound for F_d [12]. The first term of the bound approaches min_d{ε_opt(d)}, which is the best generalization error achievable in any of the classes, as m grows; the rate of this approach will depend on properties of 𝒢.

Proof: For any value of d, we have the inequality

    ε̂(d̃) + 𝒢(d̃/m) ≤ ε̂(d) + 𝒢(d/m)    (6)

because d̃ is chosen to minimize ε̂(d) + 𝒢(d/m). By uniform convergence, with high probability |ε(h) − ε̂(h)| ≤ Δ(d) simultaneously for all h ∈ F_d and all d. We can therefore replace the occurrence of ε̂(d̃) on the left-hand side of Equation (6) by ε(d̃) − Δ(d̃) to obtain a smaller quantity, and we can replace the occurrence of ε̂(d) on the right-hand side by ε_opt(d) + Δ(d) to obtain a larger quantity (since L performs training error minimization, and the best function in F_d has training error at most ε_opt(d) + Δ(d)). This gives

    ε(d̃) − Δ(d̃) + 𝒢(d̃/m) ≤ ε_opt(d) + Δ(d) + 𝒢(d/m)    (7)

and since the inequality holds for every d, rearranging and discarding the nonpositive term −𝒢(d̃/m) yields

    ε(d̃) ≤ min_d { ε_opt(d) + 𝒢(d/m) + Δ(d) } + Δ(d̃)    (8)

which is the statement of the theorem.

Let us now discuss the form of the bound given in Theorem 1. The first term approaches the optimal generalization error achievable within any of the classes in the limit of large m, and the second term directly penalizes large complexity. These terms may be thought of as competing. In order for the first term to approach min_d{ε_opt(d)} rapidly, and not just asymptotically (that is, in order to have a fast rate of convergence), 𝒢 should not penalize complexity too strongly. This is obviously at odds with the optimization of the second term Δ(d̃): there we would like the penalty to be large for large complexities, to prevent the choice of a large d̃.

For SGRM, a variant of GRM whose complexity penalty dominates the uniform convergence bound (that is, 𝒢(d/m) ≥ Δ(d) for all d), we observe that we can avoid weakening Equation (7) to Equation (8), because here Δ(d̃) − 𝒢(d̃/m) ≤ 0, yielding

    ε(d̃) ≤ min_d { ε_opt(d) + 2𝒢(d/m) }    (11)

This is not so mysterious, since SGRM penalizes strongly for complexity (even more so than GRM). This bound expresses the generalization error as the minimum of the sum of the best possible error within each class and a penalty for complexity. Such a bound seems entirely reasonable, given that it is essentially the expected value of the empirical quantity we minimized to choose d̃ in the first place. Furthermore, if ε_opt(d) is some decreasing function of d (say, of the form c/d^α for some α ∈ (0, 1]), then the bound on ε(d̃) given by Theorem 1 decreases at a reasonable rate.
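As an illustrative instantiation of this last remark (ours, with an assumed form for ε_opt and with the logarithmic factor in Δ suppressed): if ε_opt(d) ≈ c/d and 𝒢(d/m) ≈ √(d/m), then the bound of Equation (11) balances its two terms at

```latex
\min_d \left\{ \frac{c}{d} + 2\sqrt{\frac{d}{m}} \right\}
  \quad\text{attained near}\quad d^\ast = c^{2/3} m^{1/3},
  \quad\text{giving}\quad \epsilon(\tilde d) = O\!\left(c^{1/3} m^{-1/3}\right),
```

so in this hypothetical case the generalization error of SGRM would decay toward min_d{ε_opt(d)} at rate m^{-1/3}.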

6 A Bound on the Additional Error of CV

In this section we state a general theorem bounding the additional generalization error suffered by cross validation compared to any polynomial complexity model selection algorithm M. By this we mean that given a sample of size m, M will never choose a value of d̃ larger than m^k for some fixed exponent k ≥ 1. We emphasize that this is a mild condition that is met in practically every realistic model selection problem: although there are many documented circumstances in which we may wish to choose a model whose complexity is on the order of the sample size, we do not imagine wanting to choose, for instance, a neural network with a number of nodes exponential in the sample size. In any case, more general but more complicated assumptions may be substituted for the notion of polynomial complexity, and we discuss these in the full paper.

Theorem 2 Let M be any polynomial complexity model selection algorithm, and let (f, D, {F_d}, L) be any instance of model selection. Let ε_M(m) and ε_CV(m) denote the expected generalization error of the hypotheses chosen by M and CV respectively on a sample of size m. Then

    ε_CV(m) ≤ ε_M((1 − γ)m) + O(√(log(m)/(γm))).
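For a rough sense of scale (our illustrative numbers; the constants hidden in the O(·) are suppressed, so only the dependence on γ is meaningful), at m = 3000:

```latex
\gamma = 0.1:\;\; \sqrt{\ln(3000)/(0.1 \cdot 3000)} \approx 0.16,
\qquad
\gamma = 0.5:\;\; \sqrt{\ln(3000)/(0.5 \cdot 3000)} \approx 0.07 .
```

Enlarging the test fraction shrinks the test penalty term, but hands M the smaller training sample (1 − γ)m in the first term; this tension is taken up below and in Section 8.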

Proof Sketch: Let S be a random sample of m examples, where S' consists of the first (1 − γ)m examples of S and S'' of the last γm examples. Let m^k be the polynomial bound on the complexity selected by M, and let h̃_1, ..., h̃_{m^k} be the hypothesis sequence determined by L on input S'. By definition of CV, d̃ is chosen according to d̃ = argmin_d{ε̂''(h̃_d)}, where ε̂'' denotes error on S''. By standard uniform convergence arguments (the h̃_d form a fixed, finite set of at most m^k functions, and S'' is independent of S'), with high probability we have

    |ε(h̃_d) − ε̂''(h̃_d)| ≤ O(√(log(m^k)/(γm))) = O(√(k·log(m)/(γm)))

simultaneously for all d ≤ m^k. Thus the generalization error of the hypothesis chosen by CV exceeds min_d{ε_γ(d)} by at most twice this deviation. But as we have previously observed, the generalization error of any model selection algorithm (including M) on input S' is lower bounded by min_d{ε_γ(d)}, and since S' contains (1 − γ)m examples, our claim directly follows.

Note that the bound of Theorem 2 does not claim ε_CV(m) ≤ ε_M(m) for all M (which would mean that cross validation is an optimal model selection algorithm). The bound given is weaker than this ideal in two important ways. First, and perhaps most importantly, ε_M((1 − γ)m) may be considerably larger than ε_M(m). This could either be due to properties of the underlying learning algorithm, or due to inherent phase transitions (sudden decreases) in the optimal information-theoretic learning curve [8, 3]; thus, in an extreme case, it could be that the generalization error that can be achieved within some class F_d by training on m examples is close to 0, but that the optimal generalization error that can be achieved in F_d by training on a slightly smaller sample is near 1/2. This is intuitively the worst case for cross validation (when the small fraction of the sample saved for testing was critically needed for training in order to achieve nontrivial performance), and is reflected in the first term of our bound. Obviously the risk of "missing" phase transitions can be minimized by decreasing the test fraction γ, but only at the expense of increasing the test penalty term, which is the second way in which our bound falls short of the ideal. However, unlike the potentially unbounded difference ε_M((1 − γ)m) − ε_M(m), our bound on the test penalty can be decreased without any problem-specific knowledge by simply increasing the test fraction.

Despite these two competing sources of additional CV error, the bound has some strengths that are worth discussing. First of all, the bound holds for any model selection problem instance (f, D, {F_d}, L). We believe that giving similarly general bounds for any penalty-based algorithm would be extremely difficult, if not impossible. The reason for this belief arises from the diversity of learning curve behavior documented by the statistical mechanics approach [8, 3], among other sources. In the same way that there is no universal learning curve behavior, there is no universal behavior for the relationship between the functions ε̂(d) and ε(d); the relationship between these quantities may depend critically on the target function and the input distribution (this point is made more formally in Section 7). CV is sensitive to this dependence by virtue of its target function-dependent and distribution-dependent estimate of ε_γ(d). In contrast, by their very nature, penalty-based algorithms propose a universal penalty to be assigned to the observation of error ε̂(d) for a hypothesis of complexity d.

A more technical feature of Theorem 2 is that it can be combined with bounds derived for penalty-based algorithms using Theorem 1 to suggest how the parameter γ should be tuned. For example, letting M be the SGRM algorithm described in Section 5, and combining Equation (11) with Theorem 2 yields

    ε_CV(m) ≤ ε_SGRM((1 − γ)m) + O(√(log(m)/(γm)))
            ≤ min_d { ε_opt(d) + 2𝒢(d/((1 − γ)m)) } + O(√(log(m)/(γm)))    (15)

an expression that can then be optimized over the choice of γ.

7 A Lower Bound for Penalty-Based Algorithms

We conclude our formal results with the handicap of penalty-based algorithms promised in the Introduction.

Theorem 3 There are model selection problem instances 1 and 2, and a constant c > 0 independent of m, such that for any penalty-based model selection algorithm A, either ε_A(1) ≥ min_1 + c or ε_A(2) ≥ min_2 + c. Here ε_A(i) is the expected generalization error of the hypothesis chosen by A for instance i, and min_i is the expected value of min_d{ε(d)} for instance i. Thus, on at least one of the two model selection problems, the generalization error of A is lower bounded away from the optimal value by a constant independent of m.

Proof Sketch: For notational convenience, in the proof we use ε̂_1(d) and ε̂_2(d) (and similarly ε_1(d) and ε_2(d)) to refer to the expected values of the training and generalization error functions on the two problems. We start with a rough description of the properties of the two problems (see Figure 9): in Problem 1, the "right" choice is d = 0, and any additional coding directly results in larger generalization error; the training error ε̂_1(d) decays gradually with d. In Problem 2, a large amount of coding is required to achieve nontrivial generalization error: the training error ε̂_2(d) remains large as d increases, until a complexity d_0 is reached at which the training error drops rapidly.

More precisely, we will arrange things so that the first model selection problem (Problem 1) has the following properties: (1) the function ε̂_1(d) lies between two linear functions with nearly identical y-intercepts and a common x-intercept at d_0; and (2) ε_1(d) is minimized at d = 0, and furthermore, for any constant c_1 > 0 there is a constant c_2 > 0 such that choosing d̃ ≥ c_1·m results in generalization error at least c_2 above the minimum. We will next arrange that the second model selection problem (Problem 2) will obey: (1) the function ε̂_2(d) decreases only linearly, lying just below ε̂_1(d), for 0 ≤ d < d_0; and (2) the function ε_2(d) is lower bounded by a constant for all d < d_0, but drops to a small value at d_0. In Figure 9 we illustrate the conditions on ε̂_1 and ε̂_2 for the two problems, and also include hypothetical instances of ε̂_1 and ε̂_2 that are consistent with these conditions (and are furthermore representative of the "true" behavior of the ε̂ functions actually obtained for the two problems we define momentarily).

We can now give the underlying logic of the proof using the hypothetical ε̂_1 and ε̂_2. Let d̃_1 denote the complexity chosen by A for Problem 1, and let d̃_2 be defined similarly. First consider the behavior of A on Problem 2. In this problem we know by our assumptions on ε_2(d) that if A fails to choose d̃_2 ≥ d_0, then ε_A(2) ≥ min_2 + c, already giving a constant lower bound on the error of A for this problem. This is the easier case; thus let us assume that d̃_2 ≥ d_0, and consider the behavior of A on Problem 1. Referring to Figure 9, we see that the curves have been arranged so that ε̂_1(d) ≥ ε̂_2(d) for 0 ≤ d < d_0, and thus

    For 0 ≤ d < d_0:  G(ε̂_1(d), d/m) ≥ G(ε̂_2(d), d/m)    (18)

(because penalty-based algorithms assign greater penalties for greater training error or greater complexity). Since d̃_2 minimizes the penalty G(ε̂_2(d), d/m) on Problem 2, we know that

    For 0 ≤ d < d_0:  G(ε̂_2(d), d/m) ≥ G(ε̂_2(d̃_2), d̃_2/m)    (19)

and in particular, this inequality holds for d = 0. On the other hand, by our choice of ε̂_1 and ε̂_2 we have ε̂_1(d̃_2) ≤ ε̂_2(d̃_2) (both are essentially 0 beyond d_0). Therefore,

    G(ε̂_1(d̃_2), d̃_2/m) ≤ G(ε̂_2(d̃_2), d̃_2/m)    (20)

Combining the two inequalities above (Equation 18 and Equation 19) with Equation 20, we have that

    For 0 ≤ d < d_0:  G(ε̂_1(d), d/m) ≥ G(ε̂_1(d̃_2), d̃_2/m)    (21)

from which it directly follows that in Problem 1, A cannot choose any d̃_1 < d_0. By the second condition on Problem 1 above, this implies that ε_A(1) is bounded away from ε_1(0) = min_1; if we arrange that d_0 ≥ c_1·m for some constant c_1, then we have a constant lower bound on the error of A for Problem 1.

Due to space limitations, we defer the precise descriptions of Problems 1 and 2 for the full paper. However, in Problem 1 the classes F_d are essentially those for the intervals model selection problem, and in Problem 2 the F_d are based on parity functions.

We note that although Theorem 3 was designed to create two model selection problems with the most disparate behavior possible, the proof technique can be used to give lower bounds on the generalization error of penalty-based algorithms under more general settings. In the full paper we will also argue that for the two problems considered, the generalization error of CV is in fact close to min_d{ε_γ(d)} for both problems (that is, within a small additive term that decreases rapidly with m). Finally, we remark that Theorem 3 can be strengthened to hold for a single model selection problem (that is, a single function class sequence and distribution), with only the target function changing to obtain the two different behaviors. This rules out the salvation of the penalty-based algorithms via problem-specific parameters to be tuned, such as "effective dimension".

8 Conclusions

Based on both our experimental and theoretical results, we offer the following conclusions:

Model selection algorithms that attempt to reconstruct the curve ε(d) solely by examining the curve ε̂(d) often have a tendency to overcode or undercode in their hypothesis for small sample sizes, which is exactly the sample size regime in which model selection is an issue. Such tendencies are not easily eliminated without suffering the reverse tendency.

There exist model selection problems in which a hypothesis whose complexity is close to the sample size should be chosen, and in which a hypothesis whose complexity is close to 0 should be chosen, but that generate ε̂(d) curves with insufficient information to distinguish which is the case. The penalty-based algorithms cannot succeed in both cases, whereas CV can.

The error of CV can be bounded in terms of the error of any other algorithm. The only cases in which the CV error may be dramatically worse are those in which phase transitions occur in the underlying learning curves at a sample size larger than that held out for training by CV.

Thus we see that both types of algorithms considered have their own Achilles' heel. For penalty-based algorithms, it is an inability to distinguish two types of problems that call for drastically different hypothesis complexities. For CV, it is phase transitions that unluckily fall between (1 − γ)m examples and m examples. On balance, we feel that the evidence we have gathered favors the use of CV in most common circumstances. Perhaps the best way of stating our position is as follows: given the general upper bound on CV error we have obtained, and the limited applicability of any fixed penalty-based rule demonstrated by Theorem 3 and the experimental results, the burden of proof lies with the practitioner who favors a penalty-based algorithm over CV. In other words, such a practitioner should have concrete evidence (experimental or theoretical) that their algorithm will outperform CV on the problem of interest. Such evidence must arise from detailed problem-specific knowledge, since we have demonstrated here the diversity of behavior that is possible in natural model selection problems.

Acknowledgements

We give warm thanks to Yoav Freund and Ronitt Rubinfeld for their collaboration on various portions of the work presented here, and for their insightful comments. Thanks to Sebastian Seung and Vladimir Vapnik for interesting and helpful conversations.

References

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034-1054, 1991.

[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.

[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 76-87, 1994.

[4] J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80(3):227-248, 1989.

[5] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

[6] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14(3):1080-1100, 1986.

[7] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.

[8] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992.

[9] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111-147, 1974.

[10] M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.

[11] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[12] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

Figure 1: Experimental plots of the functions ε(d) (upper curve with local minimum), ε_η(d) (lower curve with local minimum), and ε̂(d) (monotonically decreasing curve) versus complexity d for a target function of 100 alternating intervals, sample size m = 2000 and noise rate η = 0.2. Each data point represents an average over 10 trials. The flattening of ε and ε_η occurs at the point where the noisy sample can be realized with no training error.

Figure 2: Experimental plots of generalization errors ε_MDL (most rapid initial decrease), ε_CV (intermediate initial decrease) and ε_GRM (least rapid initial decrease) versus sample size m for a target function of 100 alternating intervals and noise rate η = 0.2. Each data point represents an average over 10 trials.

Figure 3: Experimental plots of hypothesis lengths d̃_MDL (most rapid initial increase), d̃_CV (intermediate initial increase) and d̃_GRM (least rapid initial increase) versus sample size m for a target function of 100 alternating intervals and noise rate η = 0.2. Each data point represents an average over 10 trials.

Figure 4: MDL total penalty H(ε̂(d)) + H(d/m) versus complexity d for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate η = 0.2. There is a local minimum at approximately d = 100, and the global minimum at the point of consistency with the noisy sample.

Figure 5: MDL total penalty H(ε̂(d)) + H(d/m) versus complexity d for a single run on 4000 examples of a target function of 100 alternating intervals and noise rate η = 0.2. The global minimum has now switched from the point of consistency to the target value of d = 100.

Figure 6: GRM total penalty ε̂(d) + (d/m)(1 + √(1 + ε̂(d)·m/d)) versus sample size m on a target of 100 alternating intervals and noise rate η = 0.2. Each data point represents an average over 10 trials.

Figure 8: Experimental plots of hypothesis length d̃_GRM using complexity penalty multipliers 1.0 (slow initial increase) and 0.5 (rapid initial increase) on the complexity penalty term (d/m)(1 + √(1 + ε̂(d)·m/d)), versus sample size m.
