Information Sciences and Technology

IST academic search engine awarded ‘Best Open Source Project’ by BCS

Co-created by IST Professor C. Lee Giles, CiteSeerX served as a pioneer platform to help scholars locate scientific literature and citations online

C. Lee Giles, David Reese Professor of Information Sciences and Technology. Credit: Penn State. Creative Commons

UNIVERSITY PARK, Pa. — CiteSeerX, one of the world’s earliest open source academic search engines and based in the Penn State College of Information Sciences and Technology (IST), has been recognized by the Information Retrieval Specialist Group of the British Computer Society (BCS) as the Best Open Source Project as part of its 2021 Search Industry Awards.  

“It’s quite an honor for Penn State and IST to have this recognition from such a prominent society,” said C. Lee Giles, David Reese Professor of Information Sciences and Technology and co-creator of the search engine. 

Originally launched as CiteSeer in 1998 and renamed CiteSeerX in 2008, the search engine was one of the pioneering platforms to implement automated citation indexing, a technique that connects papers and researchers as a network. It actively crawls and harvests academic and scientific documents online and uses autonomous citation indexing, making it possible for users to find related papers using citation graphs. To perform this indexing and information extraction at scale, CiteSeerX uses several machine learning methods. It is often considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search.
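
For readers curious about what a citation graph makes possible, the following is a minimal Python sketch, not CiteSeerX's actual code. It assumes the reference lists have already been extracted from each paper; the paper IDs, the toy data and the function names are all hypothetical. It counts how often each paper is cited and ranks related papers by the number of references they share, which is one simple relatedness measure among many.

```python
from collections import defaultdict

# Hypothetical toy data: each paper ID maps to the IDs of the papers it cites.
references = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def build_cited_by(references):
    """Invert the reference lists into a 'cited by' index."""
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].add(paper)
    return cited_by


def citation_counts(references):
    """Return the number of inbound citations for every known paper."""
    cited_by = build_cited_by(references)
    return {paper: len(cited_by.get(paper, ())) for paper in references}


def related_papers(paper, references):
    """Rank other papers by how many references they share with `paper`."""
    own = set(references[paper])
    scores = {}
    for other, cited in references.items():
        if other == paper:
            continue
        shared = len(own & set(cited))
        if shared:
            scores[other] = shared
    # Most shared references first; ties broken alphabetically.
    return sorted(scores, key=lambda p: (-scores[p], p))


if __name__ == "__main__":
    print(citation_counts(references))
    print(related_papers("A", references))
```

In this toy graph, papers A and B cite two of the same papers, so B is returned as related to A. A production system would also have to extract and match citations from raw documents, which is where the machine learning methods mentioned above would come in.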

“Automatically, we were able to bring up how many citations a paper had gotten,” said Giles. “Indexing based on importance was revolutionary at the time.”

“Lee’s innovation and machine learning expertise, along with his proficiency in developing novel specialized search engines including CiteSeerX, have elevated him as a world-renowned leader in his field,” added Andrew Sears, dean of the College of IST. “We are proud to join BCS in celebrating Lee and recognizing CiteSeerX as a cutting-edge platform more than a decade after its launch.” 

CiteSeerX has grown to host more than 10 million full-text English documents and metadata, including 32 million authors and 240 million citation mentions. It has three million individual users worldwide and receives one billion hits and 180 million downloads annually. The code and data supporting CiteSeerX have been open access since its inception, meaning they can be adapted as needed, by anyone, to fit users’ requirements.

 "We don’t keep it to ourselves," Giles said. "We’ve shared it with others so they can build similar systems. Because it's modular, it can be changed to meet their needs." 

CiteSeerX was funded by the National Science Foundation, Microsoft, NASA and the Penn State College of Information Sciences and Technology. The initial search engine, CiteSeer, was created by Giles and his colleagues Kurt Bollacker and Steve Lawrence when they were at the NEC Research Institute (now NEC Labs). Its second generation, CiteSeerX, was developed by Giles and Isaac G. Councill, who earned a doctorate from the College of IST in 2006 and continued with the college as a postdoctoral scholar until 2008. The next-generation CiteSeerX is being developed at Penn State in collaboration with Jian Wu, assistant professor of computer science at Old Dominion University. According to Wu, the team is “refactoring CiteSeerX from Solr Lucene and mySQL to Elasticsearch, all of which is open source.”

The BCS Search Industry Awards recognize people, projects and organizations that have excelled in the design of search and information retrieval products and services. A charity with a royal charter, BCS aims to lead the information technology industry through its ethical challenges, support the people who work in the industry and make IT good for society. BCS currently has more than 60,000 members in 150 countries. 

CiteSeerX can be found at citeseerx.ist.psu.edu

Last Updated April 19, 2022
