Categorization of Web Sites in Turkey With Svm

dc.contributor.advisor Püskülcü, Halis
dc.contributor.author Şimşek, Kadir
dc.date.accessioned 2014-07-22T13:51:33Z
dc.date.available 2014-07-22T13:51:33Z
dc.date.issued 2004
dc.description Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2004 en_US
dc.description Includes bibliographical references (leaves: 61-63) en_US
dc.description Text in English; Abstract: Turkish and English en_US
dc.description ix, 70 leaves en_US
dc.description.abstract This study, "Categorization of Web Sites in Turkey with SVM", gives a brief introduction to the World Wide Web and a more detailed description of text categorization and web site categorization, and then presents the categorization of web sites together with all prerequisites for the classification task. As an information resource, the web has undeniable importance in human life. However, its huge structure and uncontrolled growth have given rise to new information retrieval research areas in recent years. Web mining, the general name for these studies, investigates activities and structures on the web to automatically discover and gather meaningful information from web documents. It consists of three subfields: "Web Structure Mining", "Web Content Mining" and "Web Usage Mining". In this project, web content mining was applied to web sites in Turkey during the categorization process. Support Vector Machine, a supervised learning method based on statistics and the principle of structural risk minimization, is used as the machine learning technique for web site categorization. This thesis is intended to draw a conclusion about the distribution of web sites with respect to text-based thematic categorization. The 12 top-level categories of the popular web directory Yahoo were used in this project. Besides this main purpose, we gathered descriptive statistics about the web sites and the content of their HTML pages, such as metatag usage percentages, HTML design structures and plug-in usage. The solution starts with a web downloader that downloads web page contents and other information, such as frame content, from each web site. Next, the downloaded documents are manipulated, parsed and simplified. At this point, preparations for the categorization task are complete.
Then, web sites are classified under the given categories by applying SVMLight, the Support Vector Machine (SVM) package developed by Thorsten Joachims. The classification results obtained in the last section show that some categories overlap and that accuracy and precision values are between 60% and 80%. In addition to the categorization results, we saw that almost 17% of web sites use HTML frames and that 9367 web sites include meta keywords. en_US
dc.identifier.uri https://hdl.handle.net/11147/3447
dc.language.iso en en_US
dc.publisher Izmir Institute of Technology en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject.lcc QA76.9.D343 .S58 2004 en
dc.subject.lcsh Data mining en
dc.subject.lcsh Web sites--Turkey en
dc.subject.lcsh Web sites--Classification en
dc.title Categorization of Web Sites in Turkey With Svm en_US
dc.type Master Thesis en_US
dspace.entity.type Publication
gdc.author.institutional Şimşek, Kadir
gdc.author.institutional Püskülcü, Halis
gdc.coar.access open access
gdc.coar.type text::thesis::master thesis
gdc.description.department Thesis (Master)--İzmir Institute of Technology, Computer Engineering en_US
gdc.description.publicationcategory Tez en_US
gdc.description.scopusquality N/A
gdc.description.wosquality N/A
relation.isAuthorOfPublication f3844554-c555-4f40-8a31-c2b1f5f2d3e6
relation.isAuthorOfPublication.latestForDiscovery f3844554-c555-4f40-8a31-c2b1f5f2d3e6
relation.isOrgUnitOfPublication 9af2b05f-28ac-4014-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication 9af2b05f-28ac-4004-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication 9af2b05f-28ac-4003-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication.latestForDiscovery 9af2b05f-28ac-4014-8abe-a4dfe192da5e
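
The abstract describes a pipeline of downloading pages, parsing and simplifying them, and classifying the result with SVMLight. The sketch below illustrates only the middle step: turning one HTML page into a line of SVMLight's sparse input format (`<target> <index>:<value> ...`, indices ascending). It is not the thesis's code. The function names, the use of raw term counts as feature values, and the simple letter-based tokenizer are all assumptions made for illustration, and since SVMLight trains binary classifiers, the +1/-1 label assumes one "category versus rest" problem per category.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def tokenize(html):
    """Hypothetical simplification step: strip markup, lowercase, keep letter runs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return re.findall(r"[^\W\d_]+", text)


def build_vocabulary(token_lists):
    """Assign each distinct term a feature index, starting at 1."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_svmlight_line(label, tokens, vocab):
    """Format one document as '<target> <index>:<count> ...' with ascending indices.

    Raw counts are an assumption; terms missing from the vocabulary are dropped.
    """
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    features = " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))
    return f"{label:+d} {features}".rstrip()


page = ("<html><head><style>p{}</style></head><body><p>Spor haber spor</p>"
        "<script>var x=1;</script></body></html>")
tokens = tokenize(page)
vocab = build_vocabulary([tokens])
line = to_svmlight_line(+1, tokens, vocab)
print(line)
```

Lines produced this way could be written to a training file, one document per line, for the package's learning program. Plain `lower()` does not handle Turkish dotted and dotless "i" correctly, so a real preprocessing step would need language-aware case folding.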

Files

Original bundle

Name:
T000450.pdf
Size:
960.12 KB
Format:
Adobe Portable Document Format
Description:
MasterThesis

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission