Andrew Y. Ng
Carnegie Mellon University, Pittsburgh, Pennsylvania

Dana Ron
Hebrew University, Jerusalem, Israel
1 Introduction
In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. Instances of the model selection problem include choosing the best number of hidden nodes in a neural network, determining the right amount of pruning to be performed on a decision tree, and choosing the degree of a polynomial fit to a set of points. In each of these cases, the goal is not to minimize the error on the training data, but to minimize the resulting generalization error.
Many model selection algorithms have been proposed in the literature of several different research communities, too many to productively survey here. (A more detailed history of the problem will be given in the full paper.) Perhaps surprisingly, despite the many proposed solutions for model selection and the diverse methods of analysis, direct comparisons between the different proposals (either experimental or theoretical) are rare.
The goal of this paper is to provide such a comparison, and more importantly, to describe the general conclusions to which it has led. Relying on evidence that is divided between controlled experimental results and related formal analysis, we compare three well-known model selection algorithms. We attempt to identify their relative and absolute strengths and weaknesses, and we provide some general methods for analyzing the behavior and performance of model selection algorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model selection algorithm. In Section 7, we prove a fundamental handicap suffered by the entire class of penalty-based algorithms that does not afflict cross validation. In Section 8, we weigh the evidence and find that it provides concrete arguments favoring the use of cross validation (or at least cause for caution in using any penalty-based algorithm).
2 Definitions
Throughout the paper we assume that a fixed boolean target function $f$ is used to label inputs drawn randomly according to a fixed distribution $D$. For any boolean function $h$, we define the generalization error $\epsilon(h) = \Pr_{x \in D}[f(x) \ne h(x)]$. We use $S$ to denote the random variable $S = \langle x_1, b_1 \rangle, \ldots, \langle x_m, b_m \rangle$, where $m$ is the sample size, each $x_i$ is drawn randomly and independently according to $D$, and $b_i = f(x_i) \oplus c_i$, where the noise bit $c_i \in \{0, 1\}$ is 1 with probability $\eta$; we call $\eta \in [0, 1/2)$ the noise rate. In the case that $\eta \ne 0$, we will sometimes wish to discuss the generalization error of $h$ with respect to the noisy examples, so we define $\epsilon^\eta(h) = \Pr_{x \in D, c}[h(x) \ne f(x) \oplus c]$, where $c$ is the noise bit. Note that $\epsilon(h)$ and $\epsilon^\eta(h)$ are related by the equality $\epsilon^\eta(h) = (1 - 2\eta)\epsilon(h) + \eta$. For simplicity, we will use the expression "with high probability" to mean with probability $1 - \delta$ over the draw of $S$, at a cost of a factor of $\log(1/\delta)$ in the bounds; thus, our bounds all contain "hidden" logarithmic factors, but our handling of confidence is entirely standard and will be spelled out in the full paper.
We assume a nested sequence of hypothesis classes (or models) $F_1 \subseteq \cdots \subseteq F_d \subseteq \cdots$. The target function $f$ may or may not be contained in any of these classes, so we define $h_d = \mathrm{argmin}_{h \in F_d}\{\epsilon(h)\}$ and $\epsilon_{opt}(d) = \epsilon(h_d)$ (similarly, $h_d^\eta = \mathrm{argmin}_{h \in F_d}\{\epsilon^\eta(h)\}$ and $\epsilon_{opt}^\eta(d) = \epsilon^\eta(h_d^\eta)$). Thus, $h_d$ is the best approximation to $f$ (with respect to $D$) in the class $F_d$, and $\epsilon_{opt}(d)$ measures the quality of this approximation. Note that $\epsilon_{opt}(d)$ is a non-increasing function of $d$ since the hypothesis classes are nested. Thus, larger values of $d$ can only improve the potential approximative power of the hypothesis class. Of course, the difficulty is to realize this potential on the basis of a small sample.
With this notation, the model selection problem can be stated informally: on the basis of a random sample $S$ of a fixed size $m$, the goal is to choose a hypothesis complexity $\tilde d$ and a hypothesis $\tilde h \in F_{\tilde d}$ such that the resulting generalization error $\epsilon(\tilde h)$ is minimized. In many treatments of model selection, including ours, it is explicitly or implicitly assumed that the model selection algorithm has control only over the choice of the complexity $\tilde d$, but not over the choice of the final hypothesis $\tilde h \in F_{\tilde d}$. It is assumed that there is a fixed algorithm that chooses a set of candidate hypotheses, one from each hypothesis class. Given this set of candidate hypotheses, the model selection algorithm then chooses one of the candidates as the final hypothesis.
To make these ideas more precise, we define the training error $\hat\epsilon(h) = |\{\langle x_i, b_i \rangle \in S : h(x_i) \ne b_i\}|/m$, and the version space $F_d(S) = \{h \in F_d : \hat\epsilon(h) = \min_{h' \in F_d}\{\hat\epsilon(h')\}\}$. Note that $F_d(S)$ may contain more than one function in $F_d$; several functions may minimize the training error. If we are lucky, we have in our possession a (possibly randomized) learning algorithm $L$ that takes as input any sample $S$ and any complexity value $d$, and outputs a member $\tilde h_d$ of $F_d(S)$ (using some unspecified criterion to break ties if $|F_d(S)| > 1$). More generally, it may be the case that finding any function in $F_d(S)$ is intractable, and that $L$ is simply a heuristic (such as backpropagation or ID3) that does the best job it can at finding $\tilde h_d \in F_d$ with small training error on input $S$ and $d$. In either case, we define $\tilde h_d = L(S, d)$ and $\hat\epsilon(d) = \hat\epsilon(\tilde h_d)$. Note that we expect $\hat\epsilon(d)$, like $\epsilon_{opt}(d)$, to be a non-increasing function of $d$: by going to a larger complexity, we can only reduce our training error.

We can now give a precise statement of the model selection problem. First of all, an instance of the model selection problem consists of a tuple $(\{F_d\}, f, D, L)$, where $\{F_d\}$ is the hypothesis class sequence, $f$ is the target function, $D$ is the input distribution, and $L$ is the underlying learning algorithm. The model selection problem is then: Given the sample $S$, and the sequence of functions $\tilde h_1 = L(S, 1), \ldots, \tilde h_d = L(S, d), \ldots$ determined by the learning algorithm $L$, select a complexity value $\tilde d$ such that $\tilde h_{\tilde d}$ minimizes the resulting generalization error. Thus, a model selection algorithm is given both the sample $S$ and the sequence of (increasingly complex) hypotheses derived by $L$ from $S$, and must choose one of these hypotheses.
The current formalization suffices to motivate a key definition and a discussion of the fundamental issues in model selection.
We define $\epsilon(d) = \epsilon(\tilde h_d)$. Thus, $\epsilon(d)$ is a random variable (determined by the random variable $S$) that gives the true generalization error of the function $\tilde h_d$ chosen by $L$ from the class $F_d$. Of course, $\epsilon(d)$ is not directly accessible to a model selection algorithm; it can only be estimated or guessed in various ways from the sample. A simple but important observation is that no model selection algorithm can achieve generalization error less than $\min_d\{\epsilon(d)\}$. Thus the behavior of the function $\epsilon(d)$, especially the location and value of its minimum, is in some sense the essential quantity of interest in model selection.
The prevailing folk wisdom in several research communities posits that $\epsilon(d)$ will typically have a global minimum that is nontrivial; that is, at an "intermediate" value of $d$ away from the extremes $d = 0$ and $d = m$. As a demonstration of the validity of this view, and as an introduction to a particular model selection problem that we will examine in our experiments, we call the reader's attention to Figure 1. In this model selection problem (which we shall refer to as the intervals model selection problem), the input domain is simply the real line segment $[0, 1]$, and the hypothesis class $F_d$ is simply the class of all boolean functions over $[0, 1]$ in which we allow at most $d$ alternations of label; thus $F_d$ is the class of all binary step functions with at most $d/2$ steps. For the experiments, the underlying learning algorithm $L$ that we have implemented performs training error minimization. This is a rare case where efficient minimization is possible; we have developed an algorithm based on dynamic programming that runs in linear time, thus making experiments on large samples feasible. The sample was generated using the target function $f_{100}$ that divides $[0, 1]$ into 100 segments of equal width $1/100$ and alternating label. In Figure 1 we plot $\epsilon(d)$ (which we can calculate exactly, since we have chosen the target function) when $S$ consists of 2000 random examples (drawn from the uniform input distribution) corrupted by noise at the rate $\eta = 0.2$. For our current discussion it suffices to note that $\epsilon(d)$ does indeed experience a nontrivial minimum. Not surprisingly, this minimum occurs near (but not exactly at) the target complexity of 100.
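The training error minimization step can be sketched as follows. This is a straightforward $O(md)$ dynamic program given purely for illustration (it is not the linear-time algorithm used in the experiments, whose details are deferred to the full paper), and all identifiers are our own; the sample labels are assumed given in order of increasing position on $[0, 1]$:

```python
def min_training_error(labels, d):
    """Minimum number of disagreements on the sample achievable by a
    boolean function on [0,1] with at most d alternations of label.
    O(m*d) dynamic program over (alternations used, current label)."""
    INF = len(labels) + 1
    # best[a][b]: fewest mistakes so far using a alternations, with
    # current predicted label b; both starting labels are free.
    best = [[0, 0]] + [[INF, INF] for _ in range(d)]
    for y in labels:
        new = [[INF, INF] for _ in range(d + 1)]
        for a in range(d + 1):
            for b in (0, 1):
                if best[a][b] >= INF:
                    continue
                # keep predicting label b on this point
                new[a][b] = min(new[a][b], best[a][b] + (1 if b != y else 0))
                # or alternate to label 1-b just before this point
                if a < d:
                    cost = best[a][b] + (1 if (1 - b) != y else 0)
                    new[a + 1][1 - b] = min(new[a + 1][1 - b], cost)
        best = new
    return min(min(row) for row in best)
```

Running this for each $d = 1, \ldots, d_{max}$ yields the entire empirical curve $\hat\epsilon(d)$ that the model selection algorithms below consume.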
3 Three Algorithms for Model Selection
The first two model selection algorithms we consider are members of a general class that we shall informally refer to as penalty-based algorithms (and shall formally define shortly). The common theme behind these algorithms is their attempt to construct an approximation to $\epsilon(d)$ solely on the basis of the training error $\hat\epsilon(d)$ and the complexity $d$, often by trying to "correct" $\hat\epsilon(d)$ by the amount that it underestimates $\epsilon(d)$ through the addition of a "complexity penalty" term.
In Vapnik's Guaranteed Risk Minimization (GRM) [11], $\tilde d$ is chosen according to the rule
$$\tilde d = \mathrm{argmin}_d\left\{\hat\epsilon(d) + \frac{d}{m}\left(1 + \sqrt{1 + \frac{\hat\epsilon(d)\,m}{d}}\right)\right\} \qquad (1)$$
where for convenience we assume that $d$ is the VC dimension of the class $F_d$. The origin of this rule is the fact [11] that (ignoring logarithmic factors) $\sqrt{d/m}$ is an upper bound on $|\hat\epsilon(d) - \epsilon(d)|$, and hence $\hat\epsilon(d) + \sqrt{d/m}$ upper bounds $\epsilon(d)$. By adding a complexity penalty to $\hat\epsilon(d)$, we ensure that the resulting sum upper bounds $\epsilon(d)$, and if we are optimistic we might further hope that the sum is in fact a close approximation to $\epsilon(d)$, and that its minimization is therefore tantamount to the minimization of $\epsilon(d)$. The actual rule given in Equation (1) is slightly more complex than this, and reflects a refined bound on $|\hat\epsilon(d) - \epsilon(d)|$ that varies from $d/m$ for $\hat\epsilon(d)$ close to 0 to $\sqrt{d/m}$ for large $\hat\epsilon(d)$. Vapnik's original GRM multiplies the complexity penalty term above by a logarithmic factor intended to guard against worst-case choices from the version space; since this factor renders GRM uncompetitive on the ensuing experiments, we consider this modified and quite competitive rule, whose spirit is the same.
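As an illustration, Equation (1) can be transcribed directly into code; the layout of `train_errs` (one training error per complexity value $d = 1, \ldots, d_{max}$) is an assumption of this sketch:

```python
import math

def grm_choice(train_errs, m):
    """GRM rule of Equation (1): choose d minimizing
    eps_hat(d) + (d/m) * (1 + sqrt(1 + eps_hat(d) * m / d)).
    train_errs[d-1] is assumed to hold eps_hat(d) for d = 1..d_max."""
    def penalized(d):
        eps = train_errs[d - 1]
        return eps + (d / m) * (1.0 + math.sqrt(1.0 + eps * m / d))
    return min(range(1, len(train_errs) + 1), key=penalized)
```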
The second penalty-based algorithm we consider is an instantiation of the Minimum Description Length (MDL) principle [5, 6, 7, 1, 4], which dictates choosing the complexity permitting the shortest coding of the labeled sample. (Our goal here is simply to give one reasonable instantiation of MDL. Other coding schemes are obviously possible; however, several of our formal results will hold for essentially all MDL instantiations.) For the intervals model selection problem, a hypothesis $h \in F_d$ can be described by specifying the $d$ sample points at which its label alternates. This takes $\log\binom{m}{d}$ bits; dividing by $m$ to normalize, we obtain $(1/m)\log\binom{m}{d} \approx \mathcal{H}(d/m)$ [2], where $\mathcal{H}(\cdot)$ is the binary entropy function. Now given $h$, the labels in $S$ can be described simply by coding the mistakes of $h$ (that is, those indices $i$ where $h(x_i) \ne b_i$), at a normalized cost of $\mathcal{H}(\hat\epsilon(d))$. Technically, in the coding scheme just described we also need to specify the values of $d$ and $m\hat\epsilon(d)$, but the cost of these is negligible. Thus, the version of MDL that we shall examine for the intervals model selection problem dictates the following choice of $\tilde d$:
$$\tilde d = \mathrm{argmin}_d\left\{\mathcal{H}(\hat\epsilon(d)) + \mathcal{H}(d/m)\right\}. \qquad (2)$$
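A sketch of this rule, with `binary_entropy` implementing $\mathcal{H}$ and the same assumed `train_errs` layout as above:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mdl_choice(train_errs, m):
    """MDL rule of Equation (2): choose d minimizing the normalized
    total description length H(eps_hat(d)) + H(d/m)."""
    return min(range(1, len(train_errs) + 1),
               key=lambda d: binary_entropy(train_errs[d - 1])
                             + binary_entropy(d / m))
```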
In the context of model selection, GRM and MDL can both be interpreted as attempts to model $\epsilon(d)$ by transforming $\hat\epsilon(d)$ and $d$. More formally, a model selection algorithm of the form
$$\tilde d = \mathrm{argmin}_d\left\{G(\hat\epsilon(d), d/m)\right\} \qquad (3)$$
shall be called a penalty-based algorithm. Notice that an ideal penalty-based algorithm would obey $G(\hat\epsilon(d), d/m) \approx \epsilon(d)$ (or at least $G(\hat\epsilon(d), d/m)$ and $\epsilon(d)$ would be minimized by the same value of $d$).
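In code, the whole class of penalty-based rules is a one-liner parameterized by the penalty function $G$; the sketch below is illustrative, with GRM and MDL recovered by the particular choices of $G$ noted in the comment:

```python
def penalty_based_choice(train_errs, m, G):
    """Generic penalty-based rule of Equation (3): choose d minimizing
    G(eps_hat(d), d/m) for a user-supplied penalty function G.
    GRM corresponds to G(x, y) = x + y * (1 + sqrt(1 + x / y)),
    and MDL to G(x, y) = H(x) + H(y)."""
    return min(range(1, len(train_errs) + 1),
               key=lambda d: G(train_errs[d - 1], d / m))
```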
The third model selection algorithm that we examine has a different spirit than the penalty-based algorithms. In cross validation (CV) [9, 10], we use only a fraction $1 - \gamma$ of the examples in $S$ to obtain the hypothesis sequence $\tilde h_1 \in F_1, \tilde h_2 \in F_2, \ldots$; that is, $\tilde h_d$ is now $L(S', d)$, where $S'$ consists of the first $(1 - \gamma)m$ examples in $S$. Here $\gamma \in [0, 1]$ is a parameter of the CV algorithm whose tuning we discuss briefly later. CV chooses $\tilde d$ according to the rule
$$\tilde d = \mathrm{argmin}_d\left\{\hat\epsilon_{S''}(\tilde h_d)\right\} \qquad (4)$$
where $\hat\epsilon_{S''}(\tilde h_d)$ is the error of $\tilde h_d$ on $S''$, the last $\gamma m$ examples of $S$ that were withheld in selecting $\tilde h_d$.
Notice that for CV, we expect the quantity $\epsilon(\tilde h_d)$ to be (perhaps considerably) larger than in the case of GRM and MDL, because now $\tilde h_d$ was chosen on the basis of only $(1 - \gamma)m$ examples rather than all $m$ examples. For this reason we wish to introduce the more general notation $\epsilon_\gamma(d)$ to indicate the fraction $\gamma$ of the sample withheld from training. CV settles for $\epsilon_\gamma(d)$ instead of $\epsilon_0(d)$ in order to have an independent test set with which to directly estimate $\epsilon(\tilde h_d)$.
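A sketch of the CV rule; here `learn(train, d)` stands in for the underlying learning algorithm $L$ and is an assumption of this illustration (with the intervals learner above, it would wrap training error minimization):

```python
def cv_choice(sample, learn, d_max, gamma=0.1):
    """CV rule of Equation (4): train each h_d on the first
    (1-gamma)*m examples, then choose the d whose hypothesis errs
    least on the held-out gamma fraction. Assumes gamma*m >= 1 and
    that learn(train, d) returns a predictor h with h(x) in {0, 1}."""
    m = len(sample)
    split = int(round((1.0 - gamma) * m))
    train, test = sample[:split], sample[split:]
    def holdout_error(d):
        h = learn(train, d)
        return sum(1 for x, b in test if h(x) != b) / len(test)
    return min(range(1, d_max + 1), key=holdout_error)
```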
4 A Controlled Experimental Comparison
Our results begin with a comparison of the performance and properties of the three model selection algorithms in a carefully controlled experimental setting, namely, the intervals model selection problem. Among the advantages of such controlled experiments, at least in comparison to empirical results on data of unknown origin, are our ability to exactly measure generalization error (since we know the target function and the distribution generating the data), and our ability to precisely study the effects of varying parameters of the data (such as noise rate, target function complexity, and sample size) on the performance of model selection algorithms. The experimental behavior we observe foreshadows a number of important themes that we shall revisit in our formal results.

We begin with Figure 2. To obtain this figure, a training sample was generated from the uniform input distribution and labeled according to an intervals function over $[0, 1]$ consisting of 100 intervals of alternating label and equal width; the sample was corrupted with noise at rate $\eta = 0.2$. In Figure 2, we have plotted the true generalization errors (measured with respect to the noise-free source of examples) $\epsilon_{GRM}$, $\epsilon_{MDL}$ and $\epsilon_{CV}$ (using test fraction $\gamma = 0.1$ for CV) of the hypotheses selected from the sequence $\tilde h_1, \tilde h_2, \ldots$ by each of the three algorithms as a function of sample size $m$, which ranged from 1 to 3000 examples. As described in Section 2, the hypotheses $\tilde h_d$ were obtained by minimizing the training error within each class $F_d$. Details of the code used to perform these experiments will be provided in the full paper.
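Because both the target and every hypothesis are piecewise constant on $[0, 1]$, the exact generalization error under the uniform distribution is just the total length of the region where the two functions disagree. The following sketch assumes our own representation of each function as a sorted list of switch points with label 0 at the left endpoint:

```python
import bisect

def exact_generalization_error(switches_h, switches_f):
    """Exact Pr_{x ~ U[0,1]}[h(x) != f(x)] for two boolean step
    functions on [0,1], each given by its sorted switch points and
    assumed to take label 0 at the left endpoint."""
    cuts = sorted(set(switches_h) | set(switches_f) | {0.0, 1.0})
    err = 0.0
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        # a function's label at mid is the parity of the number of
        # switch points lying to the left of mid
        if bisect.bisect_left(switches_h, mid) % 2 != \
           bisect.bisect_left(switches_f, mid) % 2:
            err += right - left
    return err

# The target of 100 equal-width alternating intervals has switch points:
f_100 = [i / 100 for i in range(1, 100)]
```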
Figure 2 demonstrates the subtlety involved in comparing the three algorithms: in particular, we see that none of the three algorithms outperforms the others for all sample sizes. Thus we can immediately dismiss the notion that one of the algorithms examined can be said to be optimal for this problem in any standard sense. Getting into the details, we see that there is an initial regime (for $m$ from 1 to slightly less than 1000) in which $\epsilon_{MDL}$ is the lowest of the three errors, sometimes outperforming $\epsilon_{GRM}$ by a considerable margin. Then there is a second regime (for $m$ about 1000 to about 2500) where an interesting reversal of relative performance occurs, since now $\epsilon_{GRM}$ is the lowest error, considerably outperforming $\epsilon_{MDL}$, which has temporarily leveled off. In both of these first two regimes, $\epsilon_{CV}$ remains the intermediate performer. In the third and final regime, $\epsilon_{MDL}$ decreases rapidly to match $\epsilon_{GRM}$ and the slightly larger $\epsilon_{CV}$, and the performance of all three algorithms remains quite similar for all larger sample sizes.
Insight into the causes of Figure 2 is given by Figure 3, where for the same runs used to obtain Figure 2, we instead plot the quantities $\tilde d_{GRM}$, $\tilde d_{MDL}$ and $\tilde d_{CV}$, the value of $\tilde d$ chosen by GRM, MDL and CV respectively (thus, the "correct" value, in the sense of simply having the same number of intervals as the target function, is 100). Here we see that for small sample sizes, corresponding to the first regime discussed for Figure 2 above, $\tilde d_{GRM}$ is slowly approaching 100 from below, reaching and remaining at the target value for about $m \ge 1500$. Although we have not shown it explicitly, GRM is incurring nonzero training error throughout the entire range of $m$. In comparison, for a long initial period (corresponding to the first two regimes of $m$), MDL is simply choosing the shortest hypothesis that incurs no training error (and thus encodes both "legitimate" intervals and noise), and consequently $\tilde d_{MDL}$ grows in an uncontrolled fashion. More precisely, it can be shown that during this period $\tilde d_{MDL}$ is obeying $\tilde d_{MDL} \approx 2\eta(1 - \eta)m + (1 - 2\eta)^2 s$, where $s$ is the number of (equally spaced) intervals in the target function and $\eta$ is the noise rate (so for the current experiment, with $s = 100$ and $\eta = 0.2$, this is approximately $0.32m + 36$).

A natural suggestion is to repair such coding biases by adjusting the penalties: multiply the GRM complexity penalty term by a constant less than 1 (or analogously, multiply the MDL complexity penalty term by a constant greater than 1 to reduce overcoding). The results of an experiment on such a modified version of GRM are shown in Figures 7 and 8, where the original GRM performance is compared to a modified version in which the complexity penalty is multiplied by 0.5. Interestingly and perhaps unfortunately, we see that there is no free lunch: while the modified version does indeed code more rapidly and thus reduce the small-$m$ generalization error, this comes at the cost of a subsequent overcoding regime with a corresponding degradation in generalization error (and in fact a considerably slower return to $d = 100$ than $\tilde d_{MDL}$ under the same conditions). The reverse phenomenon (reluctance to code) is experienced for MDL with an increased complexity penalty multiplier (details in the full paper).
Let us summarize the key points demonstrated by these experiments. First, none of the three algorithms dominates the others for all sample sizes. Second, the two penalty-based algorithms seem to have a bias either towards or against coding that is overcome by the inherent properties of the data asymptotically, but that can have a large effect on generalization error for small to moderate sample sizes. Third, this bias cannot be overcome simply by adjusting the relative weight of error and complexity penalties, without reversing the bias of the resulting rule and suffering increased generalization error for some range of $m$. Fourth, while CV is not the best of the algorithms for any value of $m$, it does manage to fairly closely track the best penalty-based algorithm for each value of $m$, and considerably beats both GRM and MDL in their regimes of weakness. We now turn our attention to our formal results, where each of these key points will be developed further.
5 A Bound on Generalization Error for Penalty-Based Algorithms
We begin our formal results with a bound on the generalization error for penalty-based algorithms that enjoys three features. First, it is general: it applies to practically any penalty-based algorithm, and holds for any model selection problem (of course, there is a price to pay for such generality, as discussed below). Second, for certain algorithms and certain problems the bound can give rapid rates of convergence to small error. Third, the form of the bound is suggestive of some of the behavior seen in the experimental results. We state the bound for the special but natural case in which the underlying learning algorithm $L$ is training error minimization; in the full paper, we will present a straightforward analogue for more general $L$. Both this theorem and Theorem 2 in the following section are stated for the noise-free case; but again, straightforward generalizations to the noisy case will be included in the full paper.
Theorem 1 Let $(\{F_d\}, f, D, L)$ be an instance of the model selection problem in which $L$ performs training error minimization, and assume for convenience that $d$ is the VC dimension of $F_d$. Let $G$ be any penalty-based model selection algorithm whose penalty function $G(\cdot, \cdot)$ is non-decreasing in its first argument and satisfies $G(x, y) \ge x$. Then with high probability,
$$\epsilon(\tilde d) \le \min_d\left\{G\!\left(\epsilon_{opt}(d) + \sqrt{\frac{d}{m}\log\frac{m}{d}},\ \frac{d}{m}\right)\right\} + \sqrt{\frac{\tilde d}{m}\log\frac{m}{\tilde d}} \qquad (5)$$
where the first term approaches $\min_d\{\epsilon_{opt}(d)\}$ (which is the best generalization error achievable in any of the classes $F_d$) as $m \to \infty$. The rate of this approach will depend on properties of $G$.

Proof: For any value of $d$, we have the inequality
$$G(\hat\epsilon(\tilde d), \tilde d/m) \le G(\hat\epsilon(d), d/m) \qquad (6)$$
because $\tilde d$ is chosen to minimize $G(\hat\epsilon(d), d/m)$. Using the uniform convergence bound $|\hat\epsilon(h) - \epsilon(h)| \le \sqrt{(d/m)\log(m/d)}$ for all $h \in F_d$ [12], and the fact that $G$ is non-decreasing in its first argument, we can replace the occurrence of $\hat\epsilon(\tilde d)$ on the left-hand side of Equation (6) by $\epsilon(\tilde d) - \sqrt{(\tilde d/m)\log(m/\tilde d)}$ to obtain a smaller quantity, and we can replace the occurrence of $\hat\epsilon(d)$ on the right-hand side by $\epsilon_{opt}(d) + \sqrt{(d/m)\log(m/d)}$ to obtain a larger quantity (here we also use the fact that $L$ performs training error minimization, so $\hat\epsilon(d) \le \hat\epsilon(h_d)$). This gives
$$G\!\left(\epsilon(\tilde d) - \sqrt{(\tilde d/m)\log(m/\tilde d)},\ \tilde d/m\right) \le G\!\left(\epsilon_{opt}(d) + \sqrt{(d/m)\log(m/d)},\ d/m\right) \qquad (7)$$
and since $G(x, y) \ge x$, the left-hand side of Equation (7) is at least $\epsilon(\tilde d) - \sqrt{(\tilde d/m)\log(m/\tilde d)}$, so
$$\epsilon(\tilde d) - \sqrt{(\tilde d/m)\log(m/\tilde d)} \le G\!\left(\epsilon_{opt}(d) + \sqrt{(d/m)\log(m/d)},\ d/m\right). \qquad (8)$$
Since Equation (8) holds for every $d$, we obtain the statement of the theorem.
Let us now discuss the form of the bound given in Theorem 1. The first term approaches the optimal generalization error within $F_d$ in the limit of large $m$, and the second term directly penalizes large complexity. These terms may be thought of as competing. In order for $\epsilon(\tilde d)$ to approach $\min_d\{\epsilon_{opt}(d)\}$ rapidly and not just asymptotically (that is, in order to have a fast rate of convergence), $G$ should not penalize complexity too strongly, which is obviously at odds with the optimization of the second term: we would like $\sqrt{(\tilde d/m)\log(m/\tilde d)}$ to be small, to prevent the choice of large $\tilde d$. Note that as $m \to \infty$ with $d$ fixed, $\sqrt{(d/m)\log(m/d)} \to 0$ and $d/m \to 0$, so that the arguments of $G$ in the first term approach the point $(\epsilon_{opt}(d), 0)$; provided $G$ is continuous and satisfies $G(x, 0) = x$, the first term then indeed approaches $\min_d\{\epsilon_{opt}(d)\}$.

For some algorithms, however, the second term can be absorbed. Consider the variant of GRM that we shall call SGRM, defined by the penalty function $G(\hat\epsilon(d), d/m) = \hat\epsilon(d) + \sqrt{(d/m)(1 + \log(m/d))}$. For this algorithm, we observe that we can avoid weakening Equation (7) to Equation (8), because here the complexity penalty assigned to $\tilde d$ is at least as large as the uniform convergence deviation $\sqrt{(\tilde d/m)\log(m/\tilde d)}$, and we obtain
$$\epsilon(\tilde d) \le \min_d\left\{\epsilon_{opt}(d) + 2\sqrt{(d/m)(1 + \log(m/d))}\right\}. \qquad (11)$$
This is not so mysterious, since SGRM penalizes strongly for complexity (even more so than GRM). This bound expresses the generalization error as the minimum of the sum of the best possible error within each class and a penalty for complexity. Such a bound seems entirely reasonable, given that it is essentially the expected value of the empirical quantity we minimized to choose $\tilde d$ in the first place. Furthermore, if $\epsilon_{opt}(d)$ is some decreasing function of $d$ (say, $\epsilon_{opt}(d) = d^{-\alpha}$ for some $0 < \alpha < 1$), then the bound on $\epsilon(\tilde d)$ given by Theorem 1 decreases at a reasonable rate.
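To get a feel for the rate at which the right-hand side of Equation (11) decreases, one can evaluate it numerically; the approximation curve $\epsilon_{opt}(d) = d^{-1/2}$ used below is purely a hypothetical stand-in, and the whole snippet is an illustrative sketch rather than part of the formal development:

```python
import math

def sgrm_bound(m, eps_opt, d_max):
    """Right-hand side of Equation (11): the minimum over d of
    eps_opt(d) + 2*sqrt((d/m)*(1 + log(m/d)))."""
    return min(eps_opt(d) + 2.0 * math.sqrt((d / m) * (1.0 + math.log(m / d)))
               for d in range(1, d_max + 1))

# Hypothetical approximation curve eps_opt(d) = d**-0.5:
for m in (100, 1000, 10000):
    print(m, round(sgrm_bound(m, lambda d: d ** -0.5, m), 3))
```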
6 A Bound on the Additional Error of CV

In this section we state a general theorem bounding the additional generalization error suffered by cross validation compared to any polynomial complexity model selection algorithm. By this we mean that given a sample of size $m$, the algorithm will never choose a value of $\tilde d$ larger than $m^k$ for some fixed exponent $k \ge 1$. We emphasize that this is a mild condition that is met in practically every realistic model selection problem: although there are many documented circumstances in which we may wish to choose a model whose complexity is on the order of the sample size, we do not imagine wanting to choose, for instance, a neural network with a number of nodes exponential in the sample size. In any case, more general but more complicated assumptions may be substituted for the notion of polynomial complexity, and we discuss these in the full paper.

Theorem 2 Let $M$ be any polynomial complexity model selection algorithm, and let $(\{F_d\}, f, D, L)$ be any instance of model selection. Let $\epsilon_M(m)$ and $\epsilon_{CV}(m)$ denote the expected generalization error of the hypotheses chosen by $M$ and CV respectively on a sample of size $m$. Then
$$\epsilon_{CV}(m) \le \epsilon_M((1 - \gamma)m) + O\!\left(\sqrt{\frac{\log m}{\gamma m}}\right).$$
Proof Sketch: Let $S$ be a random sample of $m$ examples, where $S'$ consists of the first $(1 - \gamma)m$ examples and $S''$ of the last $\gamma m$ examples. Let $m^k$ be the polynomial bound on the complexity selected by $M$, and let $\tilde h_1 = L(S', 1), \ldots, \tilde h_{m^k} = L(S', m^k)$ be determined by $L$. By definition of CV, $\tilde d$ is chosen according to $\tilde d = \mathrm{argmin}_d\{\hat\epsilon_{S''}(\tilde h_d)\}$. By standard uniform convergence arguments [12] we have that, with high probability, $|\epsilon(\tilde h_d) - \hat\epsilon_{S''}(\tilde h_d)| = O(\sqrt{\log(m^k)/\gamma m}) = O(\sqrt{\log m/\gamma m})$ simultaneously for all $1 \le d \le m^k$. Thus the generalization error of the hypothesis chosen by CV is within $O(\sqrt{\log m/\gamma m})$ of $\min_d\{\epsilon(\tilde h_d)\}$. But as we have previously observed, the generalization error of any model selection algorithm (including $M$) on input $S'$ is lower bounded by $\min_d\{\epsilon(\tilde h_d)\}$, and our claim directly follows.
Note that the bound of Theorem 2 does not claim $\epsilon_{CV}(m) \le \epsilon_M(m)$ for all $M$ (which would mean that cross validation is an optimal model selection algorithm). The bound given is weaker than this ideal in two important ways. First, and perhaps most importantly, $\epsilon_M((1 - \gamma)m)$ may be considerably larger than $\epsilon_M(m)$. This could either be due to properties of the underlying learning algorithm $L$, or due to inherent phase transitions (sudden decreases) in the optimal information-theoretic learning curve [8, 3]; thus, in an extreme case, it could be that the generalization error that can be achieved within some class $F_d$ by training on $m$ examples is close to 0, but that the optimal generalization error that can be achieved in $F_d$ by training on a slightly smaller sample is near 1/2. This is intuitively the worst case for cross validation (when the small fraction of the sample saved for testing was critically needed for training in order to achieve nontrivial performance), and is reflected in the first term of our bound. Obviously the risk of "missing" phase transitions can be minimized by decreasing the test fraction $\gamma$, but only at the expense of increasing the test penalty term, which is the second way in which our bound falls short of the ideal. However, unlike the potentially unbounded difference $\epsilon_M((1 - \gamma)m) - \epsilon_M(m)$, our bound on the test penalty can be decreased without any problem-specific knowledge by simply increasing the test fraction $\gamma$.
Despite these two competing sources of additional CV error, the bound has some strengths that are worth discussing. First of all, the bound holds for any model selection problem instance $(\{F_d\}, f, D, L)$. We believe that giving similarly general bounds for any penalty-based algorithm would be extremely difficult, if not impossible. The reason for this belief arises from the diversity of learning curve behavior documented by the statistical mechanics approach [8, 3], among other sources. In the same way that there is no universal learning curve behavior, there is no universal behavior for the relationship between the functions $\hat\epsilon(d)$ and $\epsilon(d)$; the relationship between these quantities may depend critically on the target function and the input distribution (this point is made more formally in Section 7). CV is sensitive to this dependence by virtue of its target function-dependent and distribution-dependent estimate of $\epsilon(d)$. In contrast, by their very nature, penalty-based algorithms propose a universal penalty to be assigned to the observation of error $\hat\epsilon(d)$ for a hypothesis of complexity $d$.
A more technical feature of Theorem 2 is that it can be combined with bounds derived for penalty-based algorithms using Theorem 1 to suggest how the parameter $\gamma$ should be tuned. For example, letting $M$ be the SGRM algorithm described in Section 5, and combining Equation (11) with Theorem 2 yields
$$\epsilon_{CV}(m) \le \epsilon_{SGRM}((1 - \gamma)m) + O\!\left(\sqrt{\frac{\log m}{\gamma m}}\right) \le \min_d\left\{\epsilon_{opt}(d) + 2\sqrt{\frac{d}{(1 - \gamma)m}\Big(1 + \log\frac{(1 - \gamma)m}{d}\Big)}\right\} + O\!\left(\sqrt{\frac{\log m}{\gamma m}}\right). \qquad (15)$$
In principle, $\gamma$ can then be chosen to minimize the resulting bound.
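A numeric illustration of such tuning, minimizing the right-hand side of Equation (15) over a grid of $\gamma$ values; the constant inside the $O(\cdot)$ is taken to be 1 and the $\epsilon_{opt}$ curve is a hypothetical stand-in, both assumptions of this sketch:

```python
import math

def cv_bound(gamma, m, eps_opt, d_max):
    """Right-hand side of Equation (15): the SGRM bound evaluated on
    (1-gamma)*m training examples plus the test penalty
    sqrt(log(m)/(gamma*m)) (O-constant taken as 1 for illustration)."""
    m_train = (1.0 - gamma) * m
    best = min(eps_opt(d)
               + 2.0 * math.sqrt((d / m_train) * (1.0 + math.log(m_train / d)))
               for d in range(1, d_max + 1))
    return best + math.sqrt(math.log(m) / (gamma * m))

m = 2000
gammas = [i / 100 for i in range(1, 50)]
best_gamma = min(gammas, key=lambda g: cv_bound(g, m, lambda d: d ** -0.5, 500))
print(best_gamma)
```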
7 Limitations on Penalty-Based Algorithms

In this section we prove the fundamental handicap of penalty-based algorithms promised in the Introduction: no single penalty-based rule can perform well on all model selection problems.

Theorem 3 For any penalty-based model selection algorithm $G$, there exist model selection problem instances 1 and 2 and a constant $c > 0$ (independent of $m$) such that for any sufficiently large $m$, either $\epsilon_G^1(m) \ge \min_d\{\epsilon^1(d)\} + c$ or $\epsilon_G^2(m) \ge \min_d\{\epsilon^2(d)\} + c$. Here $\epsilon_G^i(m)$ is the expected generalization error of algorithm $G$ for instance $i \in \{1, 2\}$, and $\epsilon^i(d)$ is the function $\epsilon(d)$ for instance $i$. Thus, on at least one of the two model selection problems, the generalization error of $G$ is lower bounded away from the optimal value $\min_d\{\epsilon(d)\}$ by a constant independent of $m$.
Proof Sketch: For notational convenience, in the proof we use $\hat\epsilon_1(d)$ and $\hat\epsilon_2(d)$ to refer to the expected values of the training error functions of the two problem instances, and $\epsilon_1(d)$ and $\epsilon_2(d)$ for the corresponding generalization error functions. We start with a rough description of the properties of the two problems (see Figure 9): in Problem 1, the "right" choice is $d = 0$, any additional coding directly results in larger generalization error, and the training error $\hat\epsilon_1(d)$ decays gradually with $d$; in Problem 2, a large amount of coding is required to achieve nontrivial generalization error, and the training error remains large as $d$ increases until $d$ is a constant fraction of $m$, where the training error drops rapidly.

More precisely, we will arrange things so that the first model selection problem (Problem 1) has the following properties: (1) the function $\hat\epsilon_1(d)$ lies between two linear functions of $d$ with nearby $y$-intercepts and a common $x$-intercept, so that it decays essentially linearly to 0; and (2) $\epsilon_1(d)$ is minimized at $d = 0$, and any $d$ exceeding a certain constant fraction $d_0$ of $m$ results in generalization error exceeding $\min_d\{\epsilon_1(d)\}$ by a constant. We will next arrange that the second model selection problem (Problem 2) obeys: (1) $\hat\epsilon_2(d) \le \hat\epsilon_1(d)$ for $0 \le d \le d_0$, with both training error functions vanishing beyond the point where $\hat\epsilon_2$ drops; and (2) $\epsilon_2(d)$ is lower bounded by a constant unless $d$ is at least a constant fraction of $m$. In Figure 9 we illustrate the conditions on $\hat\epsilon_1$ and $\hat\epsilon_2$ for the two problems, and also include hypothetical instances of $\hat\epsilon_1$ and $\hat\epsilon_2$ that are consistent with these conditions (and furthermore are representative of the "true" behavior of the $\hat\epsilon$ functions actually obtained for the two problems we define momentarily).
We can now give the underlying logic of the proof using the hypothetical $\hat\epsilon_1$ and $\hat\epsilon_2$. Let $\tilde d_1$ denote the complexity chosen by $G$ for Problem 1, and let $\tilde d_2$ be defined similarly. First consider the behavior of $G$ on Problem 2. In this problem we know by our assumptions on $\epsilon_2$ that if $G$ fails to choose $\tilde d_2$ at least a constant fraction of $m$, then $\epsilon_2(\tilde d_2)$ is already lower bounded by a constant, giving a constant lower bound on the error of $G$ for this problem. This is the easier case; thus let us assume that $\tilde d_2$ is at least a constant fraction of $m$, and consider the behavior of $G$ on Problem 1. Referring to Figure 9, we see that $\hat\epsilon_1(d) \ge \hat\epsilon_2(d)$ for $0 \le d \le d_0$, and thus

For $0 \le d \le d_0$: $G(\hat\epsilon_1(d), d/m) \ge G(\hat\epsilon_2(d), d/m)$  (18)

(because penalty-based algorithms assign greater penalties for greater training error or greater complexity). Since $\tilde d_2$ is the complexity chosen by $G$ on Problem 2, we know that

For all $d$: $G(\hat\epsilon_2(d), d/m) \ge G(\hat\epsilon_2(\tilde d_2), \tilde d_2/m)$  (19)

and in particular, this inequality holds for $0 \le d \le d_0$. On the other hand, by our choice of $\hat\epsilon_1$ and $\hat\epsilon_2$, $\hat\epsilon_1(\tilde d_2) = \hat\epsilon_2(\tilde d_2) = 0$. Therefore

$G(\hat\epsilon_1(\tilde d_2), \tilde d_2/m) = G(\hat\epsilon_2(\tilde d_2), \tilde d_2/m)$.  (20)

Combining the two inequalities above (Equation (18) and Equation (19)) with Equation (20), we have that

For $0 \le d \le d_0$: $G(\hat\epsilon_1(d), d/m) \ge G(\hat\epsilon_1(\tilde d_2), \tilde d_2/m)$  (21)

from which it directly follows that in Problem 1, $G$ cannot choose $0 \le \tilde d_1 \le d_0$ (modulo ties, which are easily handled). By the second condition on Problem 1 above, this implies that $\epsilon_1(\tilde d_1)$ exceeds $\min_d\{\epsilon_1(d)\}$; if we arrange that it does so by some constant for every $\tilde d_1 > d_0$, then we have a constant lower bound on the error of $G$ for Problem 1 as well.

Due to space limitations, we defer the precise descriptions of Problems 1 and 2 to the full paper. However, in Problem 1 the classes $F_d$ are essentially those for the intervals model selection problem, and in Problem 2 the $F_d$ are based on parity functions.
We note that although Theorem 3 was designed to create two model selection problems with the most disparate behavior possible, the proof technique can be used to give lower bounds on the generalization error of penalty-based algorithms under more general settings. In the full paper we will also argue that for the two problems considered, the generalization error of CV is in fact close to $\min_d\{\epsilon(d)\}$ for both problems (that is, within an additive term that decreases rapidly with $m$). Finally, we remark that Theorem 3 can be strengthened to hold for a single model selection problem (that is, a single function class sequence and distribution), with only the target function changing to obtain the two different behaviors. This rules out the salvation of the penalty-based algorithms via problem-specific parameters to be tuned, such as "effective dimension".
8 Conclusions

Based on both our experimental and theoretical results, we offer the following conclusions:
- Model selection algorithms that attempt to reconstruct the curve $\epsilon(d)$ solely by examining the curve $\hat\epsilon(d)$ often have a tendency to overcode or undercode in their hypothesis for small sample sizes, which is exactly the sample size regime in which model selection is an issue. Such tendencies are not easily eliminated without suffering the reverse tendency.

- There exist model selection problems in which a hypothesis whose complexity is close to the sample size should be chosen, and in which a hypothesis whose complexity is close to 0 should be chosen, but that generate $\hat\epsilon$ curves with insufficient information to distinguish which is the case. The penalty-based algorithms cannot succeed in both cases, whereas CV can.

- The error of CV can be bounded in terms of the error of any other algorithm. The only cases in which the CV error may be dramatically worse are those in which phase transitions occur in the underlying learning curves at a sample size larger than that held out for training by CV.

Thus we see that both types of algorithms considered have their own Achilles' heel. For penalty-based algorithms, it is an inability to distinguish two types of problems that call for drastically different hypothesis complexities. For CV, it is phase transitions that unluckily fall between $(1 - \gamma)m$ examples and $m$ examples. On balance, we feel that the evidence we have gathered favors use of CV in most common circumstances. Perhaps the best way of stating our position is as follows: given the general upper bound on CV error we have obtained, and the limited applicability of any fixed penalty-based rule demonstrated by Theorem 3 and the experimental results, the burden of proof lies with the practitioner who favors a penalty-based algorithm over CV. In other words, such a practitioner should have concrete evidence (experimental or theoretical) that their algorithm will outperform CV on the problem of interest. Such evidence must arise from detailed problem-specific knowledge, since we have demonstrated here the diversity of behavior that is possible in natural model selection problems.

Acknowledgements
We give warm thanks to Yoav Freund and Ronitt Rubinfeld for their collaboration on various portions of the work presented here, and for their insightful comments. Thanks to Sebastian Seung and Vladimir Vapnik for interesting and helpful conversations.
References
[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034–1054, 1991.
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 76–87, 1994.
[4] J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80(3):227–248, 1989.
[5] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[6] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14(3):1080–1100, 1986.
[7] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[8] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056–6091, 1992.
[9] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111–147, 1974.
[10] M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29–35, 1977.
[11] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[12] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
Figure 1: Experimental plots of the functions $\epsilon(d)$ (lower curve with local minimum), $\epsilon^\eta(d)$ (upper curve with local minimum), and $\hat\epsilon(d)$ (monotonically decreasing curve) versus complexity $d$ for a target function of 100 alternating intervals, sample size 2000 and noise rate $\eta = 0.2$. Each data point represents an average over 10 trials. The flattening of $\epsilon$ and $\epsilon^\eta$ occurs at the point where the noisy sample can be realized with no training error.

Figure 2: Experimental plots of generalization errors $\epsilon_{MDL}$ (most rapid initial decrease), $\epsilon_{CV}$ (intermediate initial decrease) and $\epsilon_{GRM}$ (least rapid initial decrease) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 3: Experimental plots of hypothesis lengths $\tilde d_{MDL}$ (most rapid initial increase), $\tilde d_{CV}$ (intermediate initial increase) and $\tilde d_{GRM}$ (least rapid initial increase) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 4: MDL total penalty $\mathcal{H}(\hat\epsilon(d)) + \mathcal{H}(d/m)$ versus complexity $d$ for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. There is a local minimum at approximately $d = 100$, and the global minimum at the point of consistency with the noisy sample.

Figure 5: MDL total penalty $\mathcal{H}(\hat\epsilon(d)) + \mathcal{H}(d/m)$ versus complexity $d$ for a single run on 4000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. The global minimum has now switched from the point of consistency to the target value of 100.

Figure 6: GRM total penalty $\hat\epsilon(d) + (d/m)(1 + \sqrt{1 + \hat\epsilon(d)m/d})$ versus sample size on a target of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 8: Experimental plots of hypothesis length $\tilde d_{GRM}$ using complexity penalty multipliers 1.0 (slow initial increase) and 0.5 (rapid initial increase) on the complexity penalty term $(d/m)(1 + \sqrt{1 + \hat\epsilon(d)m/d})$, versus sample size $m$ on a target function of 100 alternating intervals and noise rate $\eta = 0.20$.