Abstract—This paper addresses the problem of dataset
extraction from research articles. With the growth of digital data
repositories and the demand for data-centric research in the data
mining community, finding an appropriate dataset for a research
problem has become an essential step in scientific research. However,
given the wide variety of data used in scientific research, it is
very difficult to determine which datasets are most useful for a
particular research topic. An automated dataset search engine
would be a powerful tool for alleviating this problem. In this work
we propose a novel approach to extracting dataset names from
research articles. The approach uses “web
intelligence” from academic search engines and online
dictionaries to mine dataset names from the articles. We
also compare different sources of this “web
knowledge” by evaluating different academic search engines,
such as Google Scholar and Microsoft Academic Search. The
performance of this approach is evaluated using standard
information retrieval metrics: precision, recall, and
F-measure. We obtain an F-measure of 80%, which is
notable for an unsupervised approach.
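The evaluation metrics named above can be illustrated with a minimal sketch. It assumes that extracted and ground-truth dataset names are compared as case-insensitive exact matches, which is an assumption made here for illustration and not necessarily the matching rule used in the paper. The function name `precision_recall_f_measure` and the example names are hypothetical.

```python
def precision_recall_f_measure(extracted, gold):
    """Compute precision, recall, and F-measure for extracted dataset names.

    Assumption: names match when they are equal after lowercasing and
    stripping whitespace. The paper may use a different matching rule.
    """
    extracted_set = {name.strip().lower() for name in extracted}
    gold_set = {name.strip().lower() for name in gold}

    # True positives: extracted names that are also in the ground truth.
    true_positives = len(extracted_set & gold_set)

    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical example: five candidate names, three of which are correct.
    extracted_names = ["MNIST", "Reuters-21578", "DBLP", "Table 2", "Figure 1"]
    gold_names = ["MNIST", "Reuters-21578", "DBLP", "Enron"]
    p, r, f = precision_recall_f_measure(extracted_names, gold_names)
    print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

Because the F-measure is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it is commonly reported as a single summary figure.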
Index Terms—Dataset, information retrieval, web mining,
search engines.
The authors are with the CS department at the University of Minnesota,
Minneapolis, MN 55414, USA (e-mail: ayush@cs.umn.edu,
srivas@cs.umn.edu).
Cite: Ayush Singhal and Jaideep Srivastava, "Data Extract: Mining Context from the Web for Dataset Extraction," International Journal of Machine Learning and Computing, vol. 3, no. 2, pp. 219-223, 2013.