What represents an "å" in an XML file?


Fredrik
16th October 2008, 08:50 AM
I'm using an application that lets me make notes about stuff. Those notes are stored in an XML file. Suppose I type an "å" that gets saved in the file. How do I figure out what data represents that "å" in the file?

If I open the file in a text editor, the å appears as Ã¥ (Wordpad) or A ("vi"). If I open it in a browser, it looks like an å. If I display the contents of the file using "cat -A" in a cygwin bash shell, the å is displayed as M-CM-%.

The first line in the file is <?xml version="1.0" encoding="UTF-8"?>.

(I also need to figure out what represents ä, Å, Ä, ö, and Ö).

CFLarsen
16th October 2008, 09:16 AM
Å is the last letter in the Danish alphabet.

Fredrik
16th October 2008, 09:21 AM
It's also the third from last in the Swedish alphabet. What's your point?

Wudang
16th October 2008, 09:24 AM
Wordpad and vi don't know about UTF-8, just the standard ASCII character sets, AFAIK. Only open it with tools that do recognise UTF-8 (i.e. XML-aware tools) and you'll be okay.

Fredrik
16th October 2008, 09:28 AM
Thanks, but I don't want to display it "correctly". I just want to know what this application puts into the file to represent those characters.

theMark
16th October 2008, 09:28 AM
The encoding of the file is UTF-8, which means (in a quite rough approximation), that the "low" characters 32-127 will be represented by their original characters, and the "special" characters above 127 are represented by two or more characters, starting with the "escape character" that you see as "A". That's why it works in the browser (which knows how to properly handle UTF-8 escapes) and not in Wordpad (which isn't multi-byte aware).

Since I'm on a Mac here, and using TextWrangler for bare-bone text file editing (TW is freeware and has a slew of encodings it understands), I'm not quite sure what would be the right software under Windows. I'd start looking into HTML editors, since they usually know how to handle/switch between various encodings.

Good luck...

This here has more info about the rules of unicode and utf than any sane person should ever want to know...
http://www.cl.cam.ac.uk/~mgk25/unicode.html
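
A minimal Python sketch of the effect described above, assuming Wordpad falls back to the Windows-1252 code page (that choice of code page is an assumption, not something stated in the thread). It shows that a plain ASCII letter stays one byte in UTF-8 while å becomes two, and what those two bytes look like when read one byte per character:

```python
# 'a' is ASCII; '\u00e5' is å (U+00E5), written as an escape to keep this file ASCII-only.
text = "a\u00e5"

# Encode the string the way a UTF-8 XML file would store it.
raw = text.encode("utf-8")
print(raw.hex(" "))  # one byte for 'a', two bytes for å

# Assumption: a non-UTF-8-aware editor reads each byte as one Windows-1252 character.
misread = raw.decode("cp1252")
print(misread)
```

The second print gives the two-character sequence that a single-byte editor shows in place of å.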

CrikeyBobs
16th October 2008, 09:45 AM
To see how the application stores the character, first look it up here (http://en.wikipedia.org/wiki/Latin_characters_in_Unicode). This gives the Unicode value for the character.

The application you used stores data in UTF-8 format. Using the information here (http://en.wikipedia.org/wiki/UTF-8#Description), you can convert the Unicode value into UTF-8.

With regard to 'å', it has a Unicode value of E5 (U+00E5). Converting it into UTF-8 results in the two bytes C3 A5. This is what is stored in the file. If this value is treated as 2 characters by applications that do not understand UTF-8, then it will appear as 'Ã¥'.

I hope this helps.
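
The conversion from E5 to C3 A5 can be worked through by hand. Here is a rough sketch that covers only the two-byte case of UTF-8 (code points 0x80 to 0x7FF); the function name `utf8_two_byte` is made up for illustration:

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    # Lead byte: bit pattern 110xxxxx carrying the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: bit pattern 10xxxxxx carrying the low 6 bits.
    cont = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, cont])


encoded = utf8_two_byte(0xE5)
print(encoded.hex())  # c3a5

# Cross-check against Python's built-in encoder.
assert encoded == "\u00e5".encode("utf-8")
```

For 0xE5 (binary 11100101) the top bits are 11 and the low six bits are 100101, which gives 11000011 (C3) and 10100101 (A5).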

jeremyp
16th October 2008, 01:02 PM
å is c3 a5
ä is c3 a4
Å is c3 85
Ä is c3 84
ö is c3 b6
Ö is c3 96
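
A short Python sketch for reproducing a list like this for any set of characters. The variable names are illustrative only:

```python
# The six Swedish letters asked about, upper and lower case.
letters = "åäÅÄöÖ"

# Map each letter to the hex form of its UTF-8 bytes.
table = {ch: ch.encode("utf-8").hex(" ") for ch in letters}

for ch, hex_bytes in table.items():
    print(ch, "is", hex_bytes)
```

Note that `bytes.hex` with a separator argument needs Python 3.8 or later.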

ddt
16th October 2008, 02:58 PM
If you use vim (instead of vi), it is able to recognize the UTF-8 encoding and display these characters correctly. Instead of entering the character, you could also enter the numeric character reference &#xE5;. That will get stored as such in the XML file, but an application that just has to display the text will display it as å.
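
To illustrate that an XML parser treats the literal character and the numeric character reference the same way, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element name `note` is invented for the example; the real application's file layout is not known:

```python
import xml.etree.ElementTree as ET

# Document 1: the character stored directly as UTF-8 bytes (c3 a5).
literal_doc = '<?xml version="1.0" encoding="UTF-8"?><note>\u00e5</note>'.encode("utf-8")

# Document 2: the same character written as a numeric character reference.
reference_doc = b'<?xml version="1.0" encoding="UTF-8"?><note>&#xE5;</note>'

literal_text = ET.fromstring(literal_doc).text
reference_text = ET.fromstring(reference_doc).text

# Both parse to the same one-character string.
print(literal_text == reference_text)
```

The bytes on disk differ (two bytes versus the six ASCII characters of the reference), but the parsed text is identical.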

Gagglegnash
16th October 2008, 03:22 PM
Hi

Get Microsoft's XML Notepad (http://msdn.microsoft.com/en-us/library/aa905339.aspx). It's a free download (http://www.microsoft.com/downloads/details.aspx?familyid=72d6aa49-787d-4118-ba5f-4f30fe913628&displaylang=en).

I'm pretty sure that you will find that it's represented by, "."

jeremyp
17th October 2008, 02:20 PM
å is c3 a5
ä is c3 a4
Å is c3 85
Ä is c3 84
ö is c3 b6
Ö is c3 96
Just to clarify: what I've written here is the sequences of hex numbers you'll see if you open the UTF-8 XML file with a hex editor.

According to the meta tags of this HTML page, the encoding used is ISO-8859-1, so you'd see a different set of hex numbers by putting this page through a hex editor.
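
The difference between the two encodings is easy to check. A minimal sketch comparing the bytes for the same character:

```python
ch = "\u00e5"  # å

# ISO-8859-1 stores every character in its range as a single byte.
latin1_bytes = ch.encode("iso-8859-1")
# UTF-8 needs two bytes for the same character.
utf8_bytes = ch.encode("utf-8")

print(latin1_bytes.hex())     # e5
print(utf8_bytes.hex(" "))    # c3 a5
```

In ISO-8859-1 the byte value is simply the Unicode code point, which is why E5 shows up there.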

Fredrik
17th October 2008, 04:28 PM
Thanks for all the information, guys. My problem was actually solved by the next software update of that application before I had time to try the most interesting suggestions, but at least I learned something.