James Randi Educational Foundation JREF Forum
Old 16th October 2008, 08:50 AM   #1
Fredrik
Graduate Poster
 
Join Date: Jun 2004
Posts: 1,912
What represents an "å" in an XML file?

I'm using an application that lets me make notes about stuff. Those notes are stored in an XML file. Suppose I type an "å" that gets saved in the file. How do I figure out what data represents that "å" in the file?

If I open the file in a text editor, the å appears as Ã¥ (Wordpad) or A¥ ("vi"). If I open it in a browser, it looks like an å. If I display the contents of the file using "cat -A" in a cygwin bash shell, the å is displayed as M-CM-%.

The first line in the file is <?xml version="1.0" encoding="UTF-8"?>.

(I also need to figure out what represents Å, ä, Ä, ö and Ö).
Old 16th October 2008, 09:16 AM   #2
CFLarsen
Penultimate Amazing
 
Join Date: Aug 2001
Posts: 42,804
Å is the last letter in the Danish alphabet.
Old 16th October 2008, 09:21 AM   #3
Fredrik
Graduate Poster
 
Join Date: Jun 2004
Posts: 1,912
It's also the third from last in the Swedish alphabet. What's your point?
Old 16th October 2008, 09:24 AM   #4
Wudang
BOFH
 
Join Date: Jun 2003
Location: Sheffield
Posts: 8,243
Wordpad and vi don't know about UTF-8, just the standard ASCII character sets, AFAIK. Only open it with tools that do recognise UTF-8 (i.e., XML-aware tools) and you'll be okay.
Old 16th October 2008, 09:28 AM   #5
Fredrik
Graduate Poster
 
Join Date: Jun 2004
Posts: 1,912
Thanks, but I don't want to display it "correctly". I just want to know what this application puts into the file to represent those characters.
Old 16th October 2008, 09:28 AM   #6
theMark
Critical Thinker
 
Join Date: Jul 2007
Location: Stuck in Old Europe and the 80s, where the music is better than today
Posts: 310
You need to find a better text editor ;)

The encoding of the file is UTF-8, which means (as a rough approximation) that the "low" characters 0-127 (plain ASCII) are each stored as a single byte with their usual value, and the "special" characters above 127 are stored as a sequence of two or more bytes, starting with a lead byte that you see as "Ã" or "A". That's why it works in the browser (which knows how to decode UTF-8 sequences properly) and not in Wordpad (which apparently isn't decoding the multi-byte sequences and shows each byte as a separate character).
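
A quick way to see the one-byte versus multi-byte split for yourself is a few lines of Python. This is only a sketch, assuming Python 3.8 or later (nobody in the thread named a language), and the helper name `utf8_length` is made up for illustration:

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character uses."""
    return len(ch.encode("utf-8"))


# Characters are written as escapes so the example does not depend on how
# this snippet itself is saved: "\u00e5" is å, "\u20ac" is the euro sign.
for ch in ["a", "~", "\u00e5", "\u20ac"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```

The ASCII characters come out as one byte each, å as two bytes starting with c3, and the euro sign as three.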

Since I'm on a Mac here, and using TextWrangler for bare-bone text file editing (TW is freeware and has a slew of encodings it understands), I'm not quite sure what would be the right software under Windows. I'd start looking into HTML editors, since they usually know how to handle/switch between various encodings.

Good luck...

This page has more info about the rules of Unicode and UTF-8 than any sane person should ever want to know...
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Last edited by theMark; 16th October 2008 at 09:31 AM. Reason: Added link to unicode faq
Old 16th October 2008, 09:45 AM   #7
CrikeyBobs
Critical Thinker
 
Join Date: Oct 2005
Location: London
Posts: 421
To see how the application stores the character, first look it up here. This gives the Unicode value for the character.

The application you used stores data in UTF-8 format. Using the information here, you can convert the Unicode value into UTF-8.

With regard to 'å', it has a Unicode code point of U+00E5. Converting it into UTF-8 results in the two bytes C3 A5. This is what is stored in the file. If these two bytes are treated as two separate characters by applications that do not understand UTF-8, they will appear as 'Ã¥'.
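
The conversion can also be done by hand. The sketch below assumes Python 3.8+ and covers only the two-byte UTF-8 form (code points U+0080 to U+07FF, bit pattern 110xxxxx 10xxxxxx), which is all the Swedish letters need. The function names are invented for this example, and `cat_v_notation` is a simplified imitation of the "M-" notation that `cat -A` prints, not the real tool:

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled here")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: the upper five bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the lower six bits
    return bytes([lead, continuation])


def cat_v_notation(data: bytes) -> str:
    """Simplified imitation of cat -A output: a byte >= 0x80 becomes
    'M-' followed by the character for (byte - 0x80).
    Control characters are not handled in this sketch."""
    parts = []
    for b in data:
        parts.append("M-" + chr(b - 0x80) if b >= 0x80 else chr(b))
    return "".join(parts)


raw = encode_two_byte_utf8(0xE5)
print(raw.hex(" "))          # c3 a5
print(raw.decode("cp1252"))  # Ã¥  (what a non-UTF-8 editor shows)
print(cat_v_notation(raw))   # M-CM-%
```

So the Ã¥ in Wordpad and the M-CM-% from cat -A are both just different views of the same two bytes.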

I hope this helps.
Old 16th October 2008, 01:02 PM   #8
jeremyp
Thinker
 
Join Date: Aug 2003
Location: Reading, UK
Posts: 223
å is c3 a5
ä is c3 a4
Å is c3 85
Ä is c3 84
ö is c3 b6
Ö is c3 96
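
If you would rather check a list like this than take it on trust, a short Python sketch (assuming Python 3.8+; the helper name `utf8_hex` is made up here) prints the same byte pairs:

```python
# The six letters from the question, written as escapes so the result does
# not depend on the encoding this snippet is saved in.
letters = ["\u00e5", "\u00e4", "\u00c5", "\u00c4", "\u00f6", "\u00d6"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return ch.encode("utf-8").hex(" ")


for ch in letters:
    print(f"{ch} is {utf8_hex(ch)}")
```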
Old 16th October 2008, 02:58 PM   #9
ddt
Mafia Penguin
 
Join Date: Dec 2007
Location: Netherlands
Posts: 10,323
If you use vim (instead of vi), it is able to recognize the UTF-8 encoding and display these characters correctly. Instead of entering the å character, you could also enter the numeric character reference &#xE5; . That will get stored as such in the XML file, but an application that just has to display the text will display it as å.
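
To illustrate the difference between the two forms, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. The `<note>` element is hypothetical, since nobody here knows what elements the real application writes:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element documents; only the way å is written differs.
declaration = '<?xml version="1.0" encoding="UTF-8"?>'
with_reference = (declaration + "<note>&#xE5;</note>").encode("utf-8")
with_raw_bytes = (declaration + "<note>\u00e5</note>").encode("utf-8")

text_from_reference = ET.fromstring(with_reference).text
text_from_raw_bytes = ET.fromstring(with_raw_bytes).text

# Both parse to the same single character...
print(text_from_reference == text_from_raw_bytes)  # True
# ...but only the second file actually contains the bytes c3 a5.
print(b"\xc3\xa5" in with_reference)               # False
print(b"\xc3\xa5" in with_raw_bytes)               # True
```

In other words, the character reference keeps the file pure ASCII, while a parser still hands the application an å.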
Old 16th October 2008, 03:22 PM   #10
Gagglegnash
Graduate Poster
 
Join Date: Jan 2008
Posts: 1,447
Hi

Get Microsoft's XML Notepad. It's a free download.

I'm pretty sure that you will find that it's represented by, "å."
Last edited by Gagglegnash; 16th October 2008 at 03:27 PM.
Old 17th October 2008, 02:20 PM   #11
jeremyp
Thinker
 
Join Date: Aug 2003
Location: Reading, UK
Posts: 223
Originally Posted by jeremyp View Post
å is c3 a5
ä is c3 a4
Å is c3 85
Ä is c3 84
ö is c3 b6
Ö is c3 96
Just to clarify: what I've written here is the sequences of hex numbers you'll see if you open the UTF-8 XML file with a hex editor.

According to the meta tags of this HTML page, the encoding used is ISO-8859-1, so you'd see a different set of hex numbers by putting this page through a hex editor.
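
If no hex editor is at hand, a few lines of Python give the same view. This sketch (assuming Python 3.8+; the file names and the `hex_dump` helper are invented for the example) writes å once as UTF-8 and once as ISO-8859-1 into a temporary directory and dumps the bytes of each file:

```python
import tempfile
from pathlib import Path


def hex_dump(path: Path) -> str:
    """Return a file's bytes as space-separated hex, like a tiny hex editor."""
    return path.read_bytes().hex(" ")


with tempfile.TemporaryDirectory() as tmp:
    utf8_file = Path(tmp) / "note_utf8.txt"
    latin1_file = Path(tmp) / "note_latin1.txt"
    # "\u00e5" is å; the same character is saved in two different encodings.
    utf8_file.write_text("\u00e5", encoding="utf-8")
    latin1_file.write_text("\u00e5", encoding="iso-8859-1")
    utf8_dump = hex_dump(utf8_file)
    latin1_dump = hex_dump(latin1_file)

print(utf8_dump)    # c3 a5
print(latin1_dump)  # e5
```

The same letter is two bytes in the UTF-8 file and a single byte, e5, in the ISO-8859-1 file.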
Old 17th October 2008, 04:28 PM   #12
Fredrik
Graduate Poster
 
Join Date: Jun 2004
Posts: 1,912
Thanks for all the information, guys. My problem was actually solved by the next software update of that application before I had time to try the most interesting suggestions, but at least I learned something.
All times are GMT -7.