5/19/2023
Html regex data extractor

A large training dataset is needed to increase the performance of a machine learning model. In this study, we propose an extraction approach for determining the relevant images from textual image elements. The proposed approach examines the textual data of images as an expert user would and extracts the most appropriate regular expressions (regexes) from positive and negative image samples.

Most studies focus on automatic data extraction techniques. These studies use a classification model to remove boilerplate elements from a web page. Boilerplate elements contain irrelevant content, including menu items, icons, advertisements, widgets, and interactive materials. Removing boilerplate through a classification model requires complex feature extraction, where the features are constructed from local and global properties of HTML elements. Local features are page-specific, such as the ratio of the number of words to the text length inside the HTML element, the parent tag names, and the CSS style class names. Global features are constructed from web site properties, including the average text length of the HTML element for the given web site, the average number of links inside the relevant HTML elements, and the hierarchical depth of the relevant content. Increasing the number of local and global features increases the complexity of the extraction task, but it does not guarantee a performance improvement.

This study aims to automatically discover the relevant image in a web document by eliminating the irrelevant images as boilerplate. Automatic relevant image extraction is a web data extraction task that has been proposed for determining the representative images of a web page. The proposed classification approach uses only text data in the image elements rather than extracting local and global features such as pixels or color of the image.
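To make the idea of classifying an image element from its text alone more concrete, here is a minimal Java sketch. It is not the inference procedure from the study: the class name `ImageTextClassifier` and both patterns are hand-written assumptions standing in for regexes that would normally be inferred from positive and negative samples.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: both patterns are hand-written for illustration only,
// not expressions inferred by the approach described in the text.
class ImageTextClassifier {

    // Assumed positive cue: a product-style path in the src attribute.
    private static final Pattern RELEVANT = Pattern.compile(
            "src=\"[^\"]*/(product|catalog)/[^\"]*\\.(jpg|jpeg|png)\"",
            Pattern.CASE_INSENSITIVE);

    // Assumed negative cue: words that often mark boilerplate images.
    private static final Pattern IRRELEVANT = Pattern.compile(
            "(logo|icon|banner|sprite)",
            Pattern.CASE_INSENSITIVE);

    // An element is kept only if the positive pattern matches
    // and the negative pattern does not.
    static boolean isRelevant(String imgElement) {
        return RELEVANT.matcher(imgElement).find()
                && !IRRELEVANT.matcher(imgElement).find();
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "<img src=\"/media/product/1234_large.jpg\" alt=\"Red shoe\">",
                "<img src=\"/static/img/logo.png\" class=\"site-logo\">");
        for (String sample : samples) {
            System.out.println(isRelevant(sample) + "  " + sample);
        }
    }
}
```

Only the raw text of the element is inspected here; no pixels are decoded and no page-level statistics are computed, which is the property the approach relies on.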
To improve relevant image extraction, web data extraction approaches typically prepare a larger dataset or search for new features. However, these operations are difficult and laborious. In this study, we propose a fully automated approach based on the alignment of regular expressions to extract the relevant images from web pages. Automatically constructed regular expressions have been applied to a classification task for the first time. In this respect, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of aligning two regular expressions by applying a constraint on a version of the Levenshtein distance algorithm. The classification accuracy of the regular expression approaches is compared with the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples from 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved a 0.98 f-measure with only 5 frequent n-grams, and it outperformed the other classifiers on the same set of features. The classification efficiency of the proposed approach is measured at 0.108 ms, which is very competitive with other classifiers.

Web pages contain valuable information located inside titles, main content, comments, names, keywords, images, videos, etc. An expert user or automatic extraction techniques are needed to extract this content. The expert user examines the HTML elements in the source code of the web page and works out an appropriate pattern for the task. The main drawback of this extraction technique is that it depends on an expert user. Automatic extraction techniques, on the other hand, find features from these elements and construct a machine learning model on these features.
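The text does not say which constraint is placed on the Levenshtein distance. One common way to bound the cost of an alignment is to fill only a diagonal band of the dynamic programming table, and the sketch below shows that variant as an assumption, not as the constraint used in the study. The class name `BandedLevenshtein` and the `band` parameter are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a band-constrained Levenshtein distance.
// The band is an assumed stand-in for the unspecified constraint in the text.
class BandedLevenshtein {

    // Returns the edit distance between a and b if it is at most 'band',
    // otherwise -1. Only cells with |i - j| <= band are evaluated.
    static int distance(String a, String b, int band) {
        int n = a.length();
        int m = b.length();
        if (Math.abs(n - m) > band) {
            return -1; // the length difference alone already exceeds the band
        }
        final int INF = Integer.MAX_VALUE / 2; // large, but safe to add 1 to
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        Arrays.fill(prev, INF);
        for (int j = 0; j <= Math.min(m, band); j++) {
            prev[j] = j; // first row: j insertions
        }
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, INF);
            int lo = Math.max(0, i - band);
            int hi = Math.min(m, i + band);
            if (lo == 0) {
                curr[0] = i; // first column: i deletions
            }
            for (int j = Math.max(1, lo); j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;          // match or substitute
                best = Math.min(best, prev[j] + 1);     // delete from a
                best = Math.min(best, curr[j - 1] + 1); // insert into a
                curr[j] = best;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m] <= band ? prev[m] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 3));
        System.out.println(distance("kitten", "sitting", 2));
    }
}
```

With a band of width k, each row touches at most 2k + 1 cells, so the work drops from the full n × m table to roughly n × (2k + 1) cells. Any pair of strings whose distance exceeds k is rejected instead of being aligned.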
public void setLinkAddress(String linkElement)
ArrayList linkElements = htmlTagExtraction.extractHTMLLinks(this.HTML_DOCUMENT);
for (int i = 0; i < linkElements.

Traditional approaches for extracting relevant images automatically from web pages are error-prone and time-consuming.

You should take a look at the Pattern class documentation to learn how to construct your own regular expressions according to your policy. So here are the two regular expressions we are going to use:
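The two expressions themselves are not reproduced in this post. As a stand-in, here is a minimal sketch of how two patterns could work together with `java.util.regex`: one finds opening anchor tags, the other pulls the quoted `href` value out of each tag. The class name `LinkRegexSketch`, the method `extractLinks`, and both patterns are assumptions made for illustration; they are not the original `extractHTMLLinks` implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and patterns are illustrative assumptions.
class LinkRegexSketch {

    // Assumed pattern 1: an opening <a ...> tag, capturing its attribute text.
    private static final Pattern A_TAG = Pattern.compile(
            "<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Assumed pattern 2: a double- or single-quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*(\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    // Collects the href value of every anchor tag found in the HTML string.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher tag = A_TAG.matcher(html);
        while (tag.find()) {
            Matcher href = HREF.matcher(tag.group(1));
            if (href.find()) {
                // group(2) is set for double quotes, group(3) for single quotes
                links.add(href.group(2) != null ? href.group(2) : href.group(3));
            }
        }
        return links;
    }
}
```

Calling `LinkRegexSketch.extractLinks(html)` on a page string returns the link targets in document order. Regular expressions only approximate HTML parsing, so unquoted attribute values and a `>` inside an attribute are not handled by this sketch.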