Title: L’utilisation du Deep Learning pour l’extraction du contenu des pages web (The Use of Deep Learning for Extracting Web Page Content)
Authors: Madoui, Soumia
Issue Date: 20-Jun-2019
Abstract: The problem of content extraction has been a subject of study since the development of the World Wide Web. Its goal is to separate the main content of a web page, such as the text of an article, from the noisy content, such as advertisements and navigation links. Most content extraction approaches operate at the block level: the web page is segmented into blocks, and each block is then classified as part of either the main content or the noisy content of the page. In this project, we apply content extraction at a deeper level, namely to individual HTML elements. In the thesis, we examine the notion of main content more closely, build a dataset of web pages whose elements have been manually labeled, via web scraping, as part of either the main content or the noisy content, and then apply deep learning (a convolutional neural network) to this dataset to induce a model that separates the main content from the noisy content. Finally, the induced model is evaluated on a separate dataset of web pages, also manually labeled via web scraping.
Keywords: content extraction, deep learning, convolutional neural network (CNN), web scraping, main content, noisy content.
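The element-level approach described in the abstract requires turning each HTML element into a feature vector before a classifier (such as the thesis's CNN) can label it. As a rough illustration only, not the thesis's actual code, the sketch below uses Python's standard-library HTML parser to collect simple per-element features (tag name, accumulated text length, link count); the toy page, feature choices, and class names are assumptions for the example.

```python
# Minimal sketch of element-level feature extraction for
# content-vs-noise classification. Feature choices and the toy
# page are illustrative assumptions, not the thesis's pipeline.
from html.parser import HTMLParser

class ElementFeatureExtractor(HTMLParser):
    """Collects simple per-element features (tag, text length,
    link count) that a downstream classifier could consume."""
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.features = []   # one feature dict per closed element

    def handle_starttag(self, tag, attrs):
        self.stack.append({"tag": tag,
                           "text_len": 0,
                           "links": 1 if tag == "a" else 0})

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text_len"] += len(data.strip())

    def handle_endtag(self, tag):
        if self.stack:
            elem = self.stack.pop()
            # propagate text length and link counts up to the parent
            if self.stack:
                self.stack[-1]["text_len"] += elem["text_len"]
                self.stack[-1]["links"] += elem["links"]
            self.features.append(elem)

page = "<div><p>Main article text goes here.</p><a href='/nav'>Home</a></div>"
ex = ElementFeatureExtractor()
ex.feed(page)
for f in ex.features:
    print(f["tag"], f["text_len"], f["links"])
# → p 28 0
# → a 4 1
# → div 32 1
```

In a setup like the one the abstract outlines, vectors such as these (a high text length with few links suggesting main content, the reverse suggesting navigation noise) would be fed, with their manual labels, into a convolutional neural network for training.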
Appears in Collections: Faculté des Sciences Exactes et des Sciences de la Nature et de la Vie (FSESNV)

Files in This Item:
File               Size     Format
madoui_soumia.pdf  4,25 MB  Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.