zakur
19th August 2003, 04:52 PM
Online document search reveals secrets (http://www.newscientist.com/news/news.jsp?id=ns99994057)
Many documents published online may unintentionally reveal sensitive corporate or personal information, according to a US computer researcher.
Simon Byers, at AT&T's research laboratory in the US, was able to unearth hidden information from many thousands of Microsoft Word documents posted online using a few freely available software tools and some basic programming techniques.
Sophisticated editing programs will often store information in a document file that the end user will not see. Storing recently deleted text can, for example, make editing a more efficient process. But Byers says it could also expose unaware users to significant risks.
In his report, Byers suggests that a crook could analyse electronic documents to gather information that could help them carry out corporate espionage or steal someone else's identity to commit fraud.
"It is feasible that an individual may include their social security number on copies of a resume sent to prospective employers, but delete it from the version put online to guard against identity theft," Byers writes.
Using an ordinary online search engine and a random selection of keywords, Byers was able to find more than 100,000 Word documents including business documents and individual resumes. He chose to examine Word files because they are so common and stresses that other document formats can contain similar hidden information.
For example, in 2002 the Washington Post published a version of a letter sent by the Washington sniper in Adobe PDF format. Names and telephone numbers were visibly blacked out, but still found embedded in the file. However, Byers's new research reveals how widespread such problems could be.
After downloading the Word files, Byers used the free software tools "antiword" and "catdoc" to convert them to plain text. He then wrote a simple script to locate text that was not displayed in the original Word format. Byers discovered a wealth of deleted text and potentially sensitive information including people's names, email headers, network paths and text from related documents.

Be careful what you're putting on the Web...
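
For anyone curious what that "simple script" step might look like, here is a rough Python sketch. It is not Byers's actual script (the article doesn't show it); it just assumes one plausible approach: pull printable strings out of the raw .doc bytes and flag the ones that don't show up in the text antiword renders. The function names and the min_len threshold are my own inventions, and it assumes antiword is installed and on your PATH.

```python
import re
import subprocess


def visible_text(path):
    """Return the text antiword renders for a .doc file (assumes antiword is installed)."""
    result = subprocess.run(["antiword", path], capture_output=True, text=True, check=True)
    return result.stdout


def raw_strings(data, min_len=6):
    """Yield printable runs in the raw bytes, read as 8-bit text and as UTF-16LE text."""
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    for match in ascii_pat.finditer(data):
        yield match.group().decode("ascii")
    for match in utf16_pat.finditer(data):
        yield match.group().decode("utf-16-le")


def normalize(text):
    """Collapse whitespace and lowercase so line wrapping does not cause mismatches."""
    return " ".join(text.split()).lower()


def hidden_candidates(data, shown, min_len=6):
    """List strings present in the file bytes but absent from the displayed text."""
    shown_norm = normalize(shown)
    seen = set()
    found = []
    for s in raw_strings(data, min_len):
        key = normalize(s)
        if len(key) >= min_len and key not in shown_norm and key not in seen:
            seen.add(key)
            found.append(s.strip())
    return found


if __name__ == "__main__":
    import sys

    doc_path = sys.argv[1]  # hypothetical usage: python hidden.py resume.doc
    with open(doc_path, "rb") as fh:
        file_bytes = fh.read()
    for candidate in hidden_candidates(file_bytes, visible_text(doc_path)):
        print(candidate)
```

Expect plenty of noise from something this crude: font names, template paths and other housekeeping strings will be flagged alongside any genuinely deleted text, so the output is a list of things to look at by eye, not proof of a leak.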