View Full Version : need to remove rubbish from text files
The Fool
9th October 2009, 03:03 AM
I have a million text files with large amounts of rubbish ASCII and control characters all through them because the person who generated them from RTF files used a process that left them in that state.
Anyone know of a text file cleaner that I can run over them that will preserve the file name and clean all the crap out?
server 2008...
Wudang
9th October 2009, 03:17 AM
There's a few grep-based tools on t'interwebs that look useful. E.g. http://www.snapfiles.com/get/vgrep.html
Beanbag
9th October 2009, 03:33 AM
Well, if it was me, I'd write a program that reads and re-writes the files, discarding any characters that don't fall into the range of ASCII codes for numbers, letters, and punctuation.
But, then again, I'm a DIY-type person.
Beanbag
The Fool
9th October 2009, 03:36 AM
Well, if it was me, I'd write a program that reads and re-writes the files, discarding any characters that don't fall into the range of ASCII codes for numbers, letters, and punctuation.
But, then again, I'm a DIY-type person.
Beanbag
That's exactly what I need, but I'm not a do-it-yourself sort of guy... more of a download-it-free-from-a-geeky-site guy... :)
Paul C. Anagnostopoulos
9th October 2009, 05:27 AM
Fool, I have a table-driven file converter that can do what you need. I use it to translate text files from one form to another (e.g., Word text to TeX). If you want to give me the files and a little spec, I'd be happy to clean them up for you.
~~ Paul
The Fool
9th October 2009, 07:15 PM
Fool, I have a table-driven file converter that can do what you need. I use it to translate text files from one form to another (e.g., Word text to TeX). If you want to give me the files and a little spec, I'd be happy to clean them up for you.
~~ Paul
Hi, thanks for the generous offer but......when I said a million, I wasn't exaggerating. Around 1,200,000 files total of just under 5Gb. Can't really send them to you.
Ducky
9th October 2009, 08:38 PM
Hi, thanks for the generous offer but......when I said a million, I wasn't exaggerating. Around 1,200,000 files total of just under 5Gb. Can't really send them to you.
Good grief man! 5Gb each? How much disk is that?
Wowbagger
9th October 2009, 09:06 PM
This sounds like a job for...... Regular Expressions!!
That reminds me. I need to brush up on my regular expressions skills. :o
The Fool
9th October 2009, 09:28 PM
Good grief man! 5Gb each? How much disk is that?
No, that's the total size of all the files...
They are chock full of lumps of this sort of crapola.
oL0a)p")$'..._I$'JRI$''''I$$'I)DoU...o_(R)oY(R)OUmO[aLc槟``<F]E
F>UQO,8ssY<-XAuy_5UUAA-s__A`^)y.e_mIkXEy=U,* "9Iu-a_--E
sYIH(R);FfPu$I=d}=OEYgT'p:k0O>cO_v5Yzh^s-7B"*E
,fBaAYYku>GKEkr!W(c)vHxO%YV1o=;E-{_y"_<K~m-O@Z]E
-}{nUYM}-,ldjDU2(-Oktm&i3koO(c)_*Yz>J:8stm...UlݙZ<6_E
AQz"UYgEe"nF>PloOE
Ua+z n{s_'Os~I:*C}?_]~-U'"4odEN4Y*kd1I-^51,_O_y+xt=E
A?k-_ XmE
from when they were converted from RTF... I have to reduce their file size because the database search indexing is indexing all this crap.
Something I can run over the files to strip out characters above ASCII 127 would be the thing. At least it would be a start...
Ducky
9th October 2009, 09:50 PM
This sounds like a job for...... Regular Expressions!!
That reminds me. I need to brush up on my regular expressions skills. :o
http://imgs.xkcd.com/comics/regular_expressions.png
Oliver
10th October 2009, 02:58 AM
I have a million text files with large amounts of rubbish ascii and control characters
Why don't you just delete them? I mean, 5 Gigs of corrupt text files - think about it...
The Fool
10th October 2009, 03:22 AM
Why don't you just delete them? I mean, 5 Gigs of corrupt text files - think about it...
I would love to delete them....believe me.
Hey fool....how did you rescue those files? You know, the ones that clown created from those RTFs and then deleted the source RTFs and all backups of them?
Oh, simple.... Seeing as recreating good text files is not possible I just deleted the only remaining rubbish filled text versions of them.
The ones full of rubbish still index but the indexing engine takes a loooong time, it can't get done overnight as it needs to be
I don't know why I'm doing it. I'm supposed to be retired.
someone please help me, I need to get back to my rocking chair on the front porch.
moopet
10th October 2009, 03:25 AM
Here's a sane question: When whoever it was was taking the drugs which caused them to do this conversion, do you know what they were trying to convert it to? For instance, if they were trying to convert it to MingeSoft WordProcBonzai 3.2 format, we might have a handle on what could be done to convert it back. As it is, if there are no obvious delimiters between "valid" text and control codes it could be a pita.
The Fool
10th October 2009, 03:51 AM
Here's a sane question: When whoever it was was taking the drugs which caused them to do this conversion, do you know what they were trying to convert it to? For instance, if they were trying to convert it to MingeSoft WordProcBonzai 3.2 format, we might have a handle on what could be done to convert it back. As it is, if there are no obvious delimiters between "valid" text and control codes it could be a pita.
OK. They are CVs from a recruiting company that went belly up. My recruitment company purchased the carcass of this company. The information in these files is a valuable resource if you are looking for candidates to fill jobs. On their way out the door, the IT people of this dying company were apparently requested to dump all the CVs from their db as "plain text files". This is apparently what they believe "plain text files" to be. They were probably RTF files at one point, and all the crap is the formatting and embedded graphics, pics etc... I'm pretty sure the cake can't be unbaked, so all I want to achieve is getting the file size down a bit without losing the English.
I thought losing everything but upper and lower case letters, numbers and any newlines and carriage returns would leave a somewhat smaller file that still retains the search-indexable English content. The index engine goes crazy on the extended ASCII characters and takes ages to produce huge index files that slow the user when running keyword searches to the point that they fall asleep at the keyboard.
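The "letters, numbers and line breaks only" idea can be sketched in a few lines of Python. This is only an illustration, not a tool anyone in the thread supplied: the name keep_alnum is made up, and keeping spaces and tabs is an added assumption (the post lists only letters, numbers, newlines and carriage returns, but without spaces the words would run together and be useless to an indexer).

```python
import string

# Bytes to keep: ASCII letters, digits, CR and LF, as described in the post.
# Space and tab are kept too; that is an assumption, not something the post states.
KEEP = (string.ascii_letters + string.digits + " \t\r\n").encode("ascii")

# Every other byte value (control codes, punctuation, anything above 127) is deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)


def keep_alnum(data: bytes) -> bytes:
    """Return data with every byte outside KEEP removed."""
    return data.translate(None, DELETE)


if __name__ == "__main__":
    sample = b"Senior \xc9\xff{Engineer}\r\n10 years\x00 experience"
    print(keep_alnum(sample))
```

Working on raw bytes rather than decoded text avoids having to guess what encoding the damaged files are in. The obvious cost is that punctuation goes too, so e-mail addresses and phone numbers would be mangled.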
a_unique_person
10th October 2009, 03:57 AM
I have a million text files with large amounts of rubbish ASCII and control characters all through them because the person who generated them from RTF files used a process that left them in that state.
Anyone know of a text file cleaner that I can run over them that will preserve the file name and clean all the crap out?
server 2008...
That could be all the RTF stuff. If you rename one to .rtf, is it readable in Word?
The Fool
10th October 2009, 04:02 AM
That could be all the RTF stuff. If you rename one to .rtf, is it readable in Word?
Yes, it's readable, but it looks just the same... it's not jumping back into an RTF with all the formatting and graphics etc. Same crapola in a nicer font.
soylent
10th October 2009, 05:06 AM
Here's what I would do if I was a masochist and didn't value my spare time:
The converter reads word for word, using spaces, line endings and tabs as delimiters. If a word contains more than one upper-case character and one or more verboten characters, mark it as rubbish; if it only contains more than one upper-case character or only contains one or more verboten characters, mark it as potential rubbish. If a potentially rubbish word is surrounded by rubbish words, it is downgraded to rubbish. If a rubbish word is surrounded by non-rubbish words, upgrade it to potential rubbish. Remove all words marked rubbish and remove all verboten characters from words marked potentially rubbish.
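The rules above translate fairly directly into code. The following Python sketch is one possible reading of them, with several assumptions that are not in the post: the ALLOWED set defines which characters are not "verboten", "surrounded" is taken to mean both immediate neighbours, the neighbour rules are applied once against the first-pass labels, and words at the very start or end of the text are never re-graded.

```python
import string

# Assumption: anything outside this set counts as a "verboten" character.
ALLOWED = set(string.ascii_letters + string.digits + ".,;:!?'\"()-/@&")

OK, MAYBE, RUBBISH = 0, 1, 2


def classify(word: str) -> int:
    """First-pass label for a single word, following the two tests in the post."""
    many_upper = sum(c.isupper() for c in word) > 1
    has_bad = any(c not in ALLOWED for c in word)
    if many_upper and has_bad:
        return RUBBISH
    if many_upper or has_bad:
        return MAYBE
    return OK


def clean_text(text: str) -> str:
    """Drop rubbish words and strip verboten characters from potential rubbish."""
    words = text.split()  # splits on spaces, tabs and line endings
    first = [classify(w) for w in words]
    final = list(first)
    # Re-grade each word from its neighbours' first-pass labels.
    for i in range(1, len(words) - 1):
        left, right = first[i - 1], first[i + 1]
        if first[i] == MAYBE and left == RUBBISH and right == RUBBISH:
            final[i] = RUBBISH
        elif first[i] == RUBBISH and left == OK and right == OK:
            final[i] = MAYBE
    kept = []
    for word, label in zip(words, final):
        if label == RUBBISH:
            continue
        if label == MAYBE:
            word = "".join(c for c in word if c in ALLOWED)
        if word:
            kept.append(word)
    return " ".join(kept)
```

Note that this version joins the surviving words with single spaces, so the original line breaks are lost. That is probably harmless for search indexing, but it would need changing if layout mattered. It also shows the weakness of the heuristic: a legitimate word such as "C++" is flagged as potential rubbish and loses its plus signs.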
Beanbag
10th October 2009, 09:39 AM
The odds that you are going to find a text filter that will make one pass over the file and make it perfect are between very slim and none. A lot of the garbage characters in that sample block you posted fall into the range of valid letter and numeral characters.
The best you can reasonably hope for is a filtering program that will strip most of the garbage out. The remaining file will still have to be looked at with human eyes and tweaked a bit, though not nearly as much as the raw file.
If you're interested, I can take a stab at making a converter. It's pretty trivial programming -- open source file, read it one character at a time, see if character is in the acceptable range, if so then write to output file, drop if not in range, repeat until end of source file. I'd need at least one source file to experiment with, preferably more.
If the file format is consistent from individual data set to the next, it's possible to do a better cleaning job, as particular blocks can be tagged as garbage and just dumped, characters contained be damned. Right now, I'm just assuming everything's just mixed fruits and nuts, with twigs thrown in for good measure.
PM me if you want.
Beanbag
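For illustration, the read-test-write loop described in this post might look like the Python sketch below. The acceptable range used here (printable ASCII plus tab, LF and CR) and the name filter_file are assumptions, and the file is processed in chunks rather than literally one character at a time, which gives the same result with far fewer reads.

```python
import sys

# Assumed "acceptable range": printable ASCII (0x20-0x7E) plus tab, LF and CR.
ACCEPT = bytes(range(0x20, 0x7F)) + b"\t\n\r"
REJECT = bytes(b for b in range(256) if b not in ACCEPT)


def filter_file(src_path: str, dst_path: str, chunk_size: int = 1 << 16) -> None:
    """Copy src_path to dst_path, dropping every byte outside ACCEPT."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            dst.write(chunk.translate(None, REJECT))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python filter_file.py <input file> <output file>
    filter_file(sys.argv[1], sys.argv[2])
```

As the post points out, a filter like this only removes bytes that are outside the range. Garbage that happens to consist of ordinary letters and digits passes straight through.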
Paul C. Anagnostopoulos
10th October 2009, 10:22 AM
Fool, can you run a PC executable? I could create the translator for you and then give you an executable that you could run over all the files. I can easily implement something like Soylent's rules.
However, I gotta agree with Beanbag that it's not clear this is entirely automatable.
~~ Paul
Leif Roar
10th October 2009, 12:11 PM
If you have a unix or linux box available, the standard command strings can do this, although you need to specify a minimum length of 1.
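For anyone without a Unix box to hand, the core of what strings does (pull out runs of printable characters of at least a given length) can be approximated in Python. This is a rough sketch, not a reimplementation of the real command: the function name printable_runs is made up, and treating tab as printable is an assumption.

```python
import re


def printable_runs(data: bytes, min_len: int = 1) -> list:
    """Return every run of at least min_len printable ASCII bytes found in data."""
    # 0x20-0x7E is the printable ASCII range; tab is included by assumption.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return pattern.findall(data)


if __name__ == "__main__":
    sample = b"ab\x00\xffcd e\x01f"
    for run in printable_runs(sample, min_len=2):
        print(run.decode("ascii"))
```

With a minimum length of 1, nothing printable is thrown away. Raising the minimum discards short fragments, which may also remove stray printable bytes left stranded inside binary junk.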
Oliver
10th October 2009, 12:34 PM
I would start by auto-deleting every "word" using more than 30 characters - and then auto-delete all characters outside the English alphabet - numbers excluded.
The Fool
10th October 2009, 04:25 PM
I would start by auto-deleting every "word" using more than 30 characters - and then auto-delete all characters outside the English alphabet - numbers excluded.
what would you use to "auto-delete" them?
I actually like the idea of deleting any long strings, as the blocks of crap are continuous strings whereas the blocks of English language are (of course) full of spaces.
I'm not interested in getting these things back to presentation standard... just removing any significant portion of the garbage would do to ease the strain on the search indexing. Believe it or not, this db doesn't just re-index changed docs... it re-indexes the whole lot... every night.
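The "delete every word longer than 30 characters" step could be sketched like this in Python. It covers only the length rule, not the follow-up step of deleting non-alphabetic characters. The name drop_long_words is made up, and the threshold of 30 is simply the figure suggested above.

```python
def drop_long_words(text: str, max_len: int = 30) -> str:
    """Remove every whitespace-delimited word longer than max_len characters."""
    cleaned_lines = []
    for line in text.splitlines():
        # Keep only words that are short enough to be plausible English.
        words = [w for w in line.split() if len(w) <= max_len]
        cleaned_lines.append(" ".join(words))
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    sample = "Experienced accountant\n" + "x" * 45 + " based in Sydney"
    print(drop_long_words(sample))
```

Line breaks are kept, but runs of spaces within a line are collapsed to one. Garbage that contains spaces of its own would be split into shorter pieces and could slip under the threshold, so this is best treated as a first pass.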
Wowbagger
10th October 2009, 04:30 PM
Human readability is probably not an issue, then.
If you could send maybe just one or two sample files, some of us could build an application to filter them, which could, presumably, be used on all the others. We can have it filter files from one specified folder, and spit the results out to another.
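A folder-to-folder driver of the kind described here might look like the following Python sketch. Everything in it is an assumption made for illustration: the names clean_folder and strip_high_bytes are invented, and the filter itself is just a stand-in that drops bytes above 127. Any other per-file filter could be swapped in.

```python
from pathlib import Path


def strip_high_bytes(data: bytes) -> bytes:
    """Stand-in filter: drop every byte above 127."""
    return bytes(b for b in data if b < 128)


def clean_folder(src_dir: str, dst_dir: str) -> int:
    """Filter every file under src_dir into dst_dir, keeping names and subfolders.

    Returns the number of files written.
    """
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    count = 0
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        # Same relative path and file name, different root folder.
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(strip_high_bytes(src.read_bytes()))
        count += 1
    return count
```

Writing to a separate output folder leaves the originals untouched, so a bad filter can be fixed and rerun. The output folder should not sit inside the input folder, or the walk may pick up files it has just written.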
The Fool
10th October 2009, 07:32 PM
Human readability is probably not an issue, then.
If you could send maybe just one or two sample files, some of us could build an application to filter them, which could, presumably, be used on all the others. We can have it filter files from one specified folder, and spit the results out to another.
Unfortunately they are people's CVs and I'm covered by privacy legislation, so I can't send out examples without the person's permission....
Wowbagger
10th October 2009, 08:28 PM
Unfortunately they are people's CVs and I'm covered by privacy legislation, so I can't send out examples without the person's permission....
Well, I suppose I could whip up a generic program to do it. But, I'm sure something like that should exist, somewhere. Sounds relatively trivial.
Did you try googling for "strip text out of files" and related terms?
The Fool
10th October 2009, 10:20 PM
Well, I suppose I could whip up a generic program to do it. But, I'm sure something like that should exist, somewhere. Sounds relatively trivial.
Did you try googling for "strip text out of files" and related terms?
Yes... you would think something exists. I have googled till my eyes bled. Could find nothing that was not just a single find-and-replace on a single file.
Leif Roar
10th October 2009, 10:25 PM
yes...you would think something exists.
*points to my earlier post* Something does exist: the strings command.
The Fool
11th October 2009, 01:37 AM
*points to my earlier post* Something does exist: the strings command.
know of a way I can run it on a windows box?
Blue Bubble
11th October 2009, 01:46 AM
know of a way I can run it on a windows box?
Download and install Cygwin (http://www.cygwin.com/).
Akhenaten
11th October 2009, 04:23 AM
need to remove rubbish from text files
We should come to some kind of an arrangement.
I have a need to insert rubbish in text files.
Reference: Any of my posts.
Dave
Paul C. Anagnostopoulos
11th October 2009, 05:25 AM
Is strings really going to do the trick? It's meant for listing the printable strings in object files and executables. Does it have a way to ignore special characters?
Fool, my offer still stands.
Unfortunately they are people's CVs and I'm covered by privacy legislation, so I can't send out examples without the person's permission....
Would a nondisclosure agreement do the trick?
~~ Paul
shuttlt
11th October 2009, 06:20 AM
Assuming that you can strip out the worst of the non-ascii text, you'll still have 1.2 million garbled CV's. They won't be in a form that you'd want anybody external to your company looking at.
Edit---->
1.2 million CV's. That sounds like rather a lot. Is it one file to one CV?
shuttlt
11th October 2009, 06:32 AM
I take it asking someone from the staff of the old defunct company really, really nicely is out?
shuttlt
11th October 2009, 07:47 AM
You could try the following vbscript, it's not elegant, but it should do the job. Copy it into a text file and name it something like "replace.vbs". Then run it like this:
cscript replace.vbs <input file> <output file>
You'll still have a lot of crap in the file, but I think I've cleaned out the worst of the non-ASCII stuff:
***********************************************
Const ForReading = 1
Const ForWriting = 2

strOldFile = WScript.Arguments(0)
strNewFile = WScript.Arguments(1)

' Build the regular expression once, outside the loop.
' It matches every character that is NOT a letter, digit, space or common punctuation.
' The hyphen is escaped so it is a literal "-" rather than a range operator.
Set regEx = New RegExp
regEx.Pattern = "[^A-Za-z0-9 \.,\?'""!@#\$%\^&\*\(\)\-_=\+;:<>\/\\\|\}\{\[\]`~]"
regEx.Global = True

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objInFile = objFSO.OpenTextFile(strOldFile, ForReading)
' The third argument (True) creates the output file if it does not exist.
Set objOutFile = objFSO.OpenTextFile(strNewFile, ForWriting, True)

' Clean and write one line at a time instead of building the whole file in memory.
Do Until objInFile.AtEndOfStream
    objOutFile.WriteLine regEx.Replace(objInFile.ReadLine, "")
Loop

objInFile.Close
objOutFile.Close
**********************************************
Paul C. Anagnostopoulos
11th October 2009, 08:10 AM
Fool, I have an idea: Why don't you take four or five representative files, edit them, change the names to "John Smith", and send them to me? I can use them to create the conversion program and give it to you. That way I'll never see a real CV.
~~ Paul
Edited to add: I see ShuttIt has submitted a VB script.
The Fool
11th October 2009, 04:51 PM
Assuming that you can strip out the worst of the non-ascii text, you'll still have 1.2 million garbled CV's. They won't be in a form that you'd want anybody external to your company looking at.
Edit---->
1.2 million CV's. That sounds like rather a lot. Is it one file to one CV?
nobody is going to look at them, they are just being indexed by the db's search engine.
The Fool
11th October 2009, 05:19 PM
yeeeeeha. It has just been reported to me that a tape copy of these documents has been discovered in the original RTF format among boxes of crap recovered from their server room. I can just convert them to plain text (properly this time).
Thanks for all the assistance and generous offers folks. Looks like I can go back to the rocking chair on the front porch.
Paul C. Anagnostopoulos
11th October 2009, 05:36 PM
Excellent.
Mag tape, you say? How quaint.
~~ Paul
The Fool
11th October 2009, 06:06 PM
Excellent.
Mag tape, you say? How quaint.
~~ Paul
yes, for offsite backups apparently....found in a cardboard box along with old keyboards and ball mice...
Beanbag
11th October 2009, 06:29 PM
Could be worse. They could have been on punch cards.
Beanbag (yes, I've submitted jobs as a punched deck -- the original non-volatile storage, where the program was maybe a four inch deck, while the data ran two boxes roughly 18" long each)
The Fool
11th October 2009, 08:55 PM
Could be worse. They could have been on punch cards.
Beanbag (yes, I've submitted jobs as a punched deck -- the original non-volatile storage, where the program was maybe a four inch deck, while the data ran two boxes roughly 18" long each)
fortran on punchcards on a univac mainframe. teletype terminals......ahhhh, the memories.
a_unique_person
12th October 2009, 12:27 AM
yes, for offsite backups apparently....found in a cardboard box along with old keyboards and ball mice...
That would explain the need for confidential information like that being so hard to find, it was obviously part of the security plan.
jmontecillo01
12th October 2009, 06:32 AM
Could be worse. They could have been on punch cards.
Facom 230/45 - 64K back to back. circa 1976
Programmer carrying 2 boxes of punch cards to operator:
I need to have this cobol program compiled and run. High priority, direct from the big boss.
Operator takes the two boxes. Trrrrrriped, 4 thousand cards (cobol program with no sequence number) on the floor.
Paul C. Anagnostopoulos
12th October 2009, 08:34 AM
Beanbag (yes, I've submitted jobs as a punched deck -- the original non-volatile storage, where the program was maybe a four inch deck, while the data ran two boxes roughly 18" long each)
And don't forget Basic programs on paper tape.
My favorite part was when the guy put his box of cards on top of the IBM 1403 printer. Then the printer ran out of paper and the top opened automatically. And the poor bastard hadn't punched sequence numbers on the cards.
Operator takes the two boxes. Trrrrrriped, 4 thousand cards (cobol program with no sequence number) on the floor.
There you go. Serves the guy right for not punching sequence numbers. Well, not really. Time for a partitioned dataset.
~~ Paul
grmcdorman
13th October 2009, 07:35 AM
Download and install Cygwin (http://www.cygwin.com/).
That is very much overkill for just the one program.
Fortunately, there are two native Windows alternatives: GNUWin32 (http://www.gnuwin32.org), which provides native Windows ports of many GNU/Linux utilities, and Systems Internals' (http://www.sysinternals.com) (now owned by Microsoft) strings (http://technet.microsoft.com/en-us/sysinternals/bb897439.aspx), which supports Unicode as well as ASCII.
The Fool
14th October 2009, 04:07 AM
That would explain the need for confidential information like that being so hard to find, it was obviously part of the security plan.
Yes, why pay a fortune for Security consultants.....simply hide it under a pile of dead keyboards in a cardboard box....
Fun and games of picking over the corpse of a dead company. Sometimes you find edible stuff in the kitchen.
2001-2009, James Randi Educational Foundation. All Rights Reserved.
vBulletin® v3.7.7, Copyright ©2000-2013, Jelsoft Enterprises Ltd.