From: KDD-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Knowledge Discovery in Textual Databases (KDT)
Ronen Feldman and Ido Dagan
Math and Computer Science Dept., Bar-Ilan University, Ramat-Gan, ISRAEL 52900
{feldman,dagan}@bimacs.cs.biu.ac.il
Abstract

The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this information flood. Knowledge Discovery in Databases (KDD) is a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply traditional KDD on texts it is necessary to impose some structure on the data that would be rich enough to allow for interesting KDD operations. On the other hand, we have to consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. We propose using a text categorization paradigm to annotate text articles with meaningful concepts that are organized in a hierarchical structure. We suggest that this relatively simple annotation is rich enough to provide the basis for a KDD framework, enabling data summarization, exploration of interesting patterns, and trend analysis. This research combines the KDD and text categorization paradigms and suggests advances to the state of the art in both areas.

Introduction

Knowledge discovery is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from given data [Piatetsky-Shapiro and Frawley 1991]. Algorithms for knowledge discovery are required to be efficient and to discover only interesting knowledge. In order to be regarded as efficient, the complexity of the algorithm must be polynomial (with low degree) both in space and time. Algorithms that cannot meet this criterion won't be able to cope with very large databases. Knowledge is regarded as interesting if it provides some nontrivial and useful insight about objects in the database. There are two major bodies of work in knowledge discovery. The first is concentrated around applying machine learning and statistical analysis techniques towards automatic discovery of patterns in knowledge bases, while the other body of work is concentrated around providing a
user-guided environment for exploration of data. Among the systems that belong to the first group we can mention EXPLORA (Klosgen, 1992), KDW (Piatetsky-Shapiro and Matheus, 1992), and Spotlight (Anand and Kahn, 1991). Among the systems that belong to the second group we can mention IMACS (Brachman et al, 1992) and Nielsen Opportunity Explorer (Anand and Kahn, 1993).

Most previous work in knowledge discovery was concerned with structured databases. However, a huge portion of the available information does not appear in structured databases but rather in collections of text articles drawn from various sources. Before we can perform any kind of knowledge discovery in texts we must extract some structured information from them. Here we show how the Knowledge Discovery in Texts (KDT) system uses the simplest form of information extraction, namely the categorization of the topics of a text by meaningful concepts. While more complex types of information have been extracted from texts, most notably in the work presented at the series of Message Understanding Conferences (MUC), text categorization methods were shown to be simple, robust and easy to reproduce. Therefore text categorization can be considered as an acceptable prerequisite for initial KDT efforts, which can later be followed by the incorporation of more complex data types.

The Data Structure: the Concept Hierarchy

In order to perform KDD tasks it is traditionally required that the data be structured in some way. Furthermore, this structure should reflect the way in which the user conceptualizes the domain that is described by the data. Most work on KDD is concerned with structured databases, and simply utilizes the given database structure for the KDD purposes. In the case of unstructured texts, we have to decide which structure to impose on the data. In doing so, we have to consider very carefully the following tradeoff. Given the severe limitations of current technology in robust processing of text, we need to define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. On
the other hand, the structure should be rich enough to allow for interesting KDD operations. In this paper, we propose a rather simple data structure, which is relatively easy to extract from texts. As described below, this data structure enables interesting KDD operations. Our main goal is to study text collections by viewing and analyzing various concept distributions. Using concept distributions enables us to identify distributions that deviate greatly from the average distribution (of some class of objects) or that are highly skewed (when expecting a uniform distribution). After identifying the limits of using this data structure it will be possible to extract further types of data from the text, enhance the KDD algorithms to exploit the new types of data, and examine their overall contribution to the KDD goals.
The Concept Hierarchy

The concept hierarchy is the central data structure in our architecture. The concept hierarchy is a directed acyclic graph (DAG) of concepts where each of the concepts is identified by a unique name. An arc from concept A to B denotes that A is a more general concept than B (e.g., communication → wireless communication → cellular phone; company → IBM; activity → product announcement). A portion of the "technology" subtree in the concept hierarchy is shown in Figure 1 (the edges point downward).
The hierarchy contains only concepts that are of interest to the user. Its structure defines the generalizations and partitioning that the user wants to make when summarizing and analyzing the data. For example, the arc wireless communication → cellular phone indicates that the user wants to aggregate the data about cellular phones with the data about all other daughters of the concept "wireless communication". Also, when analyzing the distribution of data within the concept "wireless communication", one of the categories by which the data will be partitioned is "cellular phones". Currently, the concept hierarchy is constructed manually by the user. As future research, we plan to investigate the use of document clustering and term clustering methods (Cutting et al, 1993; Pereira et al, 1993) to support the user in constructing a concept hierarchy that is suitable for texts of a given domain.
Figure 1 - Concept Hierarchy for technological concepts (the figure itself is not recoverable from this extraction)
Tagging the text with concepts

Each article is tagged with a set of concepts that correspond to its content (e.g. {IBM, product announcement, Power PC}, {Motorola, patent, cellular phone}). Tagging an article with a concept entails implicitly its tagging with all the ancestors of the concept in the hierarchy. It is therefore desired that an article be tagged with the lowest concepts possible. In the current version of the system these concept sets provide the only information extracted from an article, each set denoting the joint occurrence of its members in the article.
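To make the data structure concrete, here is a minimal Python sketch of a concept hierarchy and of the ancestor-expansion rule for tags. This is our own illustration, not the authors' implementation (the system described below is written in Prolog); the class and function names and the example arcs are assumptions made for the example:

```python
# Sketch only: a concept hierarchy as a DAG (parent map) and tag expansion.
# Not the authors' implementation; names and example arcs are illustrative.

class ConceptHierarchy:
    def __init__(self):
        self.parents = {}  # concept -> set of immediately more general concepts

    def add_arc(self, general, specific):
        # An arc from A to B denotes that A is more general than B.
        self.parents.setdefault(specific, set()).add(general)
        self.parents.setdefault(general, set())

    def ancestors(self, concept):
        # Collect all (transitively) more general concepts in the DAG.
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

def expand_tags(hierarchy, tags):
    # Tagging an article with a concept implicitly tags it with all ancestors.
    expanded = set(tags)
    for tag in tags:
        expanded |= hierarchy.ancestors(tag)
    return expanded

h = ConceptHierarchy()
h.add_arc("communication", "wireless communication")
h.add_arc("wireless communication", "cellular phone")
h.add_arc("company", "Motorola")
print(expand_tags(h, {"Motorola", "cellular phone"}))
# includes "wireless communication", "communication" and "company" as well
```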
For the KDD purposes, it does not matter which method is used for tagging. As was explained earlier, it is very realistic to assume automatic tagging by some text categorization method. On the other hand, tagging may be semi-automatic or manual, as is common for many text collections for which keywords or category labels are assigned by hand (like Reuters, ClariNet and Individual).

KDD over concept distributions
Concept Distributions
The KDD mechanism summarizes and analyzes the content of the concept sets that annotate the articles of the database. The basic notion for describing this content is the distribution of daughter concepts relative to their siblings (or more generally, the distribution of descendants of a node relative to other descendants of that node). Formally, we set a concept node C in the hierarchy to specify a discrete random variable whose possible values are denoted by its daughters (from now on we relate to daughters for simplicity, but the definitions can be applied for any combination of levels of descendants). We denote the distribution of the random variable by P(C=c), where c ranges over the daughters of C. The event C=c corresponds to the annotation of a document with the concept c. P(C=c_i) is the proportion of documents annotated with c_i among all documents annotated with any daughter of C.
For example, the occurrences of the daughters of the concept C="computers" in the text corpus may be distributed as follows: P(C="mainframes") = 0.1; P(C="work-stations") = 0.4; P(C="PCs") = 0.5.

We may also be interested in the joint distribution of several concept nodes. For example, the joint distribution of C1=company and C2="computers" may be as follows (figures are consistent with those of the previous example): P(C1=IBM, C2=mainframes) = 0.07; P(C1=Digital, C2=mainframes) = 0.03; P(C1=IBM, C2=work-stations) = 0.2; P(C1=Digital, C2=work-stations) = 0.2; P(C1=IBM, C2=PCs) = 0.4; P(C1=Digital, C2=PCs) = 0.1. A data point of this distribution is a joint occurrence of daughters of the two concepts company and "computers".
The daughter distribution of a concept may be conditioned on some other concept(s), which is regarded as a conditioning event. For example, we may be interested in the daughter distribution of C="computers" in articles which discuss announcements of new products. This distribution is denoted as P(C=c | announcement), where announcement is the conditioning concept.¹ P(C=mainframes | announcement), for example, denotes the proportion of documents annotated with both mainframes and announcement among all documents annotated with both announcement and any daughter of "computers".

¹ A similar use of conditional distributions appears in the EXPLORA system (Klosgen, 1992). Our conditioned variables and conditioning events are analogous to Klosgen's dependent and independent variables.
Concept distributions provide the user with a powerful way of browsing the data and summarizing it. One form of queries in the system simply presents distributions and data points in the hierarchy. As is common in data analysis and summarization, a distribution can be presented either as a table or as a graphical chart (bar, pie or radar). In addition, the concept distributions serve to identify interesting patterns in the data. Browsing and identification of interesting patterns would typically be combined in the same session, as the user specifies which portions of the concept hierarchy she wishes to explore.
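As a concrete illustration of how the distributions presented in such a session can be computed from the concept sets, here is a short sketch of our own (not the system's Prolog implementation; the counting convention for articles tagged with several daughters is our assumption, which the paper does not spell out):

```python
from collections import Counter

def daughter_distribution(articles, daughters, conditioning=None):
    # articles: iterable of tag sets (after ancestor expansion).
    # Returns the proportion of annotations with each daughter of C among
    # all annotations with any daughter of C, optionally restricted to
    # articles also tagged with the conditioning concept.
    counts = Counter()
    for tags in articles:
        if conditioning is not None and conditioning not in tags:
            continue
        for daughter in daughters & tags:
            counts[daughter] += 1
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()} if total else {}

articles = [
    {"IBM", "PCs", "announcement"},
    {"Digital", "mainframes"},
    {"IBM", "work-stations", "announcement"},
]
computers = {"mainframes", "work-stations", "PCs"}
print(daughter_distribution(articles, computers))
print(daughter_distribution(articles, computers, conditioning="announcement"))
```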
Comparing Distributions
The purpose of KDD is to present "interesting" information to the user. We suggest quantifying the degree of "interest" of some data by comparing it to a given, or an "expected", model. Usually, interesting data would be data that deviates significantly from the expected model. In some cases, the user may be interested in data that highly agrees with the model.

In our case, we use concept distributions to describe the data. We therefore need a measure for comparing the distribution defined by the data to a model distribution. We chose to use the relative entropy measure (or Kullback-Leibler (KL) distance), defined in information theory, though we plan to investigate other measures as well. The KL-distance seems to be an appropriate measure for our purpose since it measures the amount of information we lose if we model a given distribution p by another distribution q. Denoting the distribution of the data by p and the model distribution by q, the distance from p(x) to q(x) measures the amount of "surprise" in seeing p while expecting q. Formally, the relative entropy between two probability distributions p(x) and q(x) is defined as:

D(p || q) = Σ_x p(x) log( p(x) / q(x) )
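As a concrete sketch (ours, not the paper's code), the KL-distance and the per-component contributions that flag interesting data points can be computed as follows; the epsilon floor for zero model probabilities is our own assumption, since the paper does not discuss zeros:

```python
import math

def kl_distance(p, q, eps=1e-12):
    # D(p||q) = sum_x p(x) log(p(x)/q(x)); eps guards against q(x)=0.
    return sum(px * math.log(px / max(q.get(x, 0.0), eps))
               for x, px in p.items() if px > 0)

def top_contributors(p, q, k=3, eps=1e-12):
    # Data points contributing most to the distance are the most "surprising".
    contrib = {x: px * math.log(px / max(q.get(x, 0.0), eps))
               for x, px in p.items() if px > 0}
    return sorted(contrib.items(), key=lambda item: -item[1])[:k]

p = {"mainframes": 0.1, "work-stations": 0.4, "PCs": 0.5}
q = {"mainframes": 1/3, "work-stations": 1/3, "PCs": 1/3}
print(kl_distance(p, q))       # distance from the data to the model
print(top_contributors(p, q))  # components driving the distance
```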
The relative entropy is always non-negative and is 0 if and only if p=q. According to this view, interesting distributions will be those with a large distance to the model distribution. Interesting data points will be those that make a big contribution to this distance, in one or several distributions. Below we identify three types of model distributions with which it is interesting to compare a given distribution of the data.

Model Distributions
The Uniform Distribution

Comparing with the uniform distribution tells us how much a given distribution is "sharp", or heavily concentrated on only a few of the values it can take. For example, regard a distribution of the form P(C=c | x_i), where C=company and x_i is a specific product (a daughter of the concept product). Distributions of this form will have a large distance from the uniform distribution for products x_i that are mentioned in the texts only in connection with very few companies (e.g., products that are manufactured by only a few companies).

Using the uniform distribution as a model means that we establish our expectation only on the structure of the concept hierarchy, without relying on any findings in the data. In this case, there is no reason to expect different probabilities for different siblings (an uninformative prior). Notice that measuring the KL-distance to the uniform distribution is equivalent to measuring the entropy of the given distribution, since D(p||u) = log(N) - H(p), where u is the uniform distribution, N is the number of possible values in the (discrete) distribution, and H is the entropy function. Looking at D(p||u) makes it clear why using entropy to measure the "interestingness", or the "informativeness", of the given distribution is a special case of the general framework, where the expected model is the uniform distribution.
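The identity D(p||u) = log(N) - H(p) is easy to verify numerically; a small sketch of our own, under the natural-log convention:

```python
import math

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), natural log.
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl_to_uniform(p):
    # D(p||u) with u uniform over the same N values: sum_x p(x) log(N p(x)).
    n = len(p)
    return sum(px * math.log(n * px) for px in p.values() if px > 0)

p = {"mainframes": 0.1, "work-stations": 0.4, "PCs": 0.5}
# Checks the identity from the text: D(p||u) = log(N) - H(p).
assert abs(kl_to_uniform(p) - (math.log(len(p)) - entropy(p))) < 1e-9
print(kl_to_uniform(p))  # large value = "sharp" (low-entropy) distribution
```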
Sibling Distribution

Consider a conditional distribution of the form P(C=c | x_i), where x_i is a conditioning concept. In many cases, it is natural to expect that this distribution would be similar to other distributions of this form, in which the conditioning event is a sibling of x_i. For example, for C=activity and x_i=Ford, we could expect a distribution that is quite similar to such distributions where the conditioning concept is another car manufacturer.

To capture this reasoning, we use Avg P(C=c | x), the average sibling distribution, as a model for P(C=c | x_i), where x ranges over all siblings of x_i (including x_i itself). In the above example, we would measure the distance from the distribution P(C=activity | Ford) to the average distribution Avg P(C=activity | x), where x ranges over all car manufacturers. The distance between these two distributions would be large if the activity profile of Ford differs a lot from the average profile of other car manufacturers.

In some cases, the user may be interested in comparing two distributions which are conditioned by two specific siblings (e.g. Ford and General Motors). In this case, the distance between the distributions indicates how much these two siblings have similar profiles with regard to the conditioned class C (e.g. companies that are similar in their activity profile). Such distances can also be used to cluster siblings, forming subsets of siblings that are similar to each other.²

² Notice that the KL-distance is an asymmetric measure. If desired, a symmetric measure can be obtained by summing the two distances in both directions, that is, D(p||q) + D(q||p).

Past Distributions (trend analysis)

One of the most important tools for an analyst is the ability to follow trends in the activities of companies in the various domains. For example, such a trend analysis tool should be able to compare the activities that a company performed in a certain domain in the past with the activities it is performing in those domains currently. An example conclusion from such an analysis can be that a company is shifting interests: rather than concentrating on one domain it is moving to another. Finding trends is achieved by using a distribution which is constructed from old data as the expected model for the same distribution when constructed from new data. Then, trends can be discovered by searching for significant deviations from the expected model.
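A sketch of the sibling-distribution model (ours; the unweighted average and the example profiles are assumptions): build Avg P(C=c | x) over all siblings and measure each sibling's distance from it. The same comparison drives trend analysis, with the distribution built from old data playing the role of the model q:

```python
import math

def kl(p, q, eps=1e-12):
    # Relative entropy, as in the earlier sketch. For the symmetric variant
    # mentioned in the footnote, use kl(p, q) + kl(q, p).
    return sum(px * math.log(px / max(q.get(x, 0.0), eps))
               for x, px in p.items() if px > 0)

def average_sibling_distribution(profiles):
    # Unweighted average of the sibling distributions (including x_i itself).
    values = {v for dist in profiles.values() for v in dist}
    n = len(profiles)
    return {v: sum(d.get(v, 0.0) for d in profiles.values()) / n
            for v in values}

# Hypothetical activity profiles of car manufacturers:
by_company = {
    "Ford":           {"product announcement": 0.7, "joint venture": 0.3},
    "General Motors": {"product announcement": 0.4, "joint venture": 0.6},
    "Toyota":         {"product announcement": 0.5, "joint venture": 0.5},
}
model = average_sibling_distribution(by_company)
for company, profile in by_company.items():
    print(company, kl(profile, model))  # large = atypical activity profile
```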
Finding Interesting Patterns

Interesting patterns can be identified at two levels. First, we can identify interesting patterns by finding distributions that have a high KL-distance to the expected model, as defined by one of the three methods above. Second, when focusing on a specific distribution, we can identify interesting patterns by focusing on those components that most affect the KL-distance to the expected model. For example, when focusing on the distribution P(C=activity | Ford), we can discover which activities are mentioned most frequently with Ford (deviation from the uniform distribution), in which activities Ford is most different from an "average" car manufacturer (deviation from the average sibling distribution), and which activities have mostly changed their proportion over time within the overall activity profile of Ford (deviation from the past distribution).

A major issue for future research is to develop efficient algorithms that would search the concept hierarchy for interesting patterns of the two types above. In our current implementation we use exhaustive search, which is made feasible by letting the user specify each time which nodes in the hierarchy are of interest (see examples below). It is our impression that this mode of operation is useful and feasible, since in many cases the user can, and would actually like to, provide guidance on areas of current interest. Naturally, better search capabilities would further improve the system.

Implementation and Results

In order to test our framework we have implemented a prototype of KDT in LPA Prolog for Windows. The prototype provides the user a convenient way of finding interesting patterns in text corpora. The corpus we used for this paper is the Reuters-22173 text categorization test collection. The documents in the Reuters-22173 collection appeared on the Reuters newswire in 1987. The 22173 documents were assembled and indexed with categories by personnel from Reuters Ltd. and Carnegie Group, Inc. in 1987. Further formatting and data file production was done in 1991 and 1992 by David D. Lewis and Peter Shoemaker.

The documents were tagged by the Reuters personnel with 135 categories from the Economics domain. Our prototype system converted the document tag files into a set of Prolog facts. Each document is represented as a Prolog fact which includes all the tags related to the document. There are 5 types of tags: countries, topics, people, organizations and stock exchanges. The user can investigate the Prolog database using this framework. The examples in this paper are related to the country and topic tags of the articles (which are the largest tag groups), although we have found interesting patterns in the other tag groups as well.

Typically the user would start a session with the prototype by either loading a class hierarchy from a file or by
building a new hierarchy based on the collection of tags of all articles. The following classes are a sample of classes that were built out of the collection of countries mentioned in the articles: South America, Western Europe and Eastern Europe. In the next phase we compared the average topic distribution of countries in South America to the average topic distribution of countries in Western Europe. In the terms of the previous section, we compared for all topics t the expression Avg P(Topic = t | c) where c ranges over all countries in South America to the same expression where c ranges over all countries in Western Europe. In the next tables we see the topics for which we got the largest KL-distance between the suitable averages over the 2 classes. In Table 1 we see topics which have a much larger share in South America than in Western Europe. In Table 2 we see the topics which have a much larger share in Western Europe than in South America.

Table 1 - Comparing South America to Western Europe
[The body of Table 1 is not recoverable from this extraction.]
Table 2 - Comparing Western Europe to South America

Topic       | KL contribution | Western Europe (avg % / #articles) | South America (avg % / #articles)
acq         | 0.119           | 9.5 / 373                          | 0.5 / 9
dxmd [?]    | 0.067           | 5.8 / 230                          | 0.4 / 16
earn        | 0.052           | 5.2 / 204                          | 0.5 / 22
capgews [?] | 0.035           | 1.8 / 71                           | 0.05 / 1
money-fx    | 0.031           | 4.9 / 191                          | 1.1 / 13
interest    | 0.029           | 2.6 / 101                          | 0.2 / 4

(Two topic labels, marked [?], and the exact column semantics could not be read reliably from this extraction.)
We can see that (according to this text collection) countries in South America have a much larger portion of agriculture and rare-metals topics, while Western Europe countries have a much larger portion of financial topics.

In the next phase, we went into a deeper analysis of comparing the individual topic distribution of the countries in South America to the average topic distribution of all countries in South America. In Table 3 we see the topics in which the country topic distribution deviated considerably from the average distribution (i.e., the topics that mostly affected the KL-distance to the average distribution). From the table we can infer the following information:

• Colombia puts a much larger emphasis on coffee than any other country in South America. (It is interesting to note that Brazil, which has 47 articles about coffee, more than any other country, is below the class average for coffee.)

• Both Brazil and Mexico (not shown) have a large proportion of articles that talk about loans.

Table 3 - Comparing Topic Distributions of Brazil and Colombia to Avg P(Topic = t | South America)
Topic | KL contribution | Country % (#articles) | Class average % (#articles)
ship  | 0.065           | 7.4 (27)              | 1.0 (32)
loan  | 0.063           | 29.6 (108)            | 18.2 (223)

(The remaining rows, and the exact column semantics, are not recoverable from this extraction.)
In Table 4 we see the results of a similar analysis that was done from the opposite point of view. In this case we built a class of all agriculture-related topics, computed the distribution of each individual topic, and compared it to the average distribution of topics in the class. We picked 2 of the topics that got the highest relative entropy and listed the countries that mostly affected the KL-distance to the average country distribution.

Table 4 - Comparing Country Distributions of cocoa and coffee to Avg P(Country = c | agriculture)
[The body of Table 4 is not recoverable from this extraction.]
Finding Elements with Small Entropy

Another KDD tool is aimed at finding elements in the database that have relatively low entropy, i.e., elements that have "sharp" distributions (a "sharp" distribution is a distribution that is heavily concentrated on a small fraction of the values it can take). When the system computed the entropy of the topic distribution of all countries in the database we found that Iran (according to the text collection used), which appears in 141 articles, has an entropy of 0.508, where 69 of the articles are about crude, 59 are about ship, and the other 13 articles belong to 13 different topics. Another country that has a relatively low entropy is
Colombia. In this case 75.5% of the topics in which Colombia is mentioned are crude (59.2%) and coffee (16.3%). When the system computed the entropy of the country distribution of all topics, we noticed that the topic "earn" has a very high concentration in 6 countries. More than 95% of the articles that talk about earnings involve the countries USA, Canada, UK, West Germany, Japan and Australia. The other 5% are distributed among another 31 countries.

Summary

We have presented a new framework for knowledge discovery in texts. This framework is based on three components: the definition of a concept hierarchy, the categorization of texts by concepts from the hierarchy, and the comparison of concept distributions to find "unexpected" patterns. We conjecture that our uniform and compact model can become useful for KDD in structured databases as well. Currently, we are performing research in text categorization, which has some similarity to that of (Hebrail and Marsais, 1992), geared to make the KDT system more feasible and accurate. In addition, we are building another layer to the system that will provide the user with textual conclusions based on the distribution analysis it is performing. We plan to use the KDT system for filtering and summarizing new articles. We conjecture that the concept distributions of articles marked as interesting by the user can be used for updating the user's personal news profile and for suggesting subscriptions to news groups of similar characteristics.

Acknowledgments

The authors would like to thank Haym Hirsh and the anonymous reviewers for helpful comments. Ronen Feldman is supported by an Eshkol Fellowship.

References

Anand T. and Kahn G., 1993. Opportunity Explorer: Navigating Large Databases Using Knowledge Discovery Templates. In Proceedings of the 1993 Workshop on Knowledge Discovery in Databases.

Apte C., Damerau F. and Weiss S., 1994. Towards Language Independent Automated Learning of Text Categorization Models. In Proceedings of ACM-SIGIR Conference on Information Retrieval.

Brachman R., Selfridge P., Terveen L., Altman B., Borgida A., Halper F., Kirk T., Lazar A., McGuinness D., and Resnick L., 1993. Integrated Support for Data
Archaeology. International Journal of Intelligent and Cooperative Information Systems.
Cutting D., Karger D. and Pedersen J., 1993. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In Proceedings of ACM-SIGIR Conference on Information Retrieval.
Lewis D., 1992. An Evaluation of Phrasal and Clustered Representations on a Text Categorization Problem. In Proceedings of ACM-SIGIR Conference on Information Retrieval.
Feldman R., 1994. Knowledge Discovery in Textual Databases. Technical Report, Bar-Ilan University, Ramat-Gan, Israel.
Frawley W.J., Piatetsky-Shapiro G., and Matheus C.J., 1991. Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley, 1-27. Cambridge, MA: AAAI/MIT Press.
Hebrail G., and Marsais J., 1992. Experiments of Textual Data Analysis at Electricite de France. In Proceedings of IFCS-92 of the International Federation of Classification Societies.
Jacobs P., 1992. Joining Statistics with NLP for Text Categorization. In Proceedings of the 3rd Conference on Applied Natural Language Processing.
Klosgen W., 1992. Problems for Knowledge Discovery in Databases and Their Treatment in the Statistics Interpreter EXPLORA. International Journal for Intelligent Systems, vol. 7(7), 649-673.
Lewis D. and Gale W., 1994. Training Text Classifiers by Uncertainty Sampling. In Proceedings of ACM-SIGIR Conference on Information Retrieval.
Mertzbacher M. and Chu W., 1993. Pattern-Based Clustering for Database Attribute Values. In Proceedings of the 1993 Workshop on Knowledge Discovery in Databases.
