turniphat t1_je22fws wrote
You need to know the type of data you are dealing with. For example, if you want to open a .wav file, you find the specification (https://ccrma.stanford.edu/courses/422-winter-2014/projects/WaveFormat/) and then you write your program to the specification.
It says first 4 bytes are the ID, then next 4 bytes are the size, then next 4 are the format... etc. etc. etc.
If somebody just hands you a blob of data and tells you to interpret it, then you are correct to be confused. You'd have no idea what the bytes mean.
Also, if you open a file in the wrong program, it interprets the bytes in the wrong way and you just get nonsense. Open a .exe file in notepad and it's just crazy characters all over the screen.
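To illustrate the point about the same bytes meaning different things, here is a minimal sketch in Python. The four bytes and the variable names are made up for the example; nothing in the bytes themselves says which reading is the "right" one.

```python
import struct

# Four arbitrary bytes. Nothing in them says what they "mean".
blob = bytes([0x52, 0x49, 0x46, 0x46])

as_text = blob.decode("ascii")             # read as four ASCII characters
as_le_int = struct.unpack("<I", blob)[0]   # read as one little-endian 32-bit integer
as_be_int = struct.unpack(">I", blob)[0]   # read as one big-endian 32-bit integer
as_small_ints = list(blob)                 # read as four separate 8-bit numbers

print(as_text, hex(as_le_int), hex(as_be_int), as_small_ints)
```

Which of these is correct depends entirely on what the format says those four bytes are supposed to be.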
bulbaquil t1_je2yzza wrote
To summarize the .wav specification u/turniphat mentioned:
- The first 4 bytes tell the computer "Hi, I'm a multimedia file. Please treat me accordingly."
- Bytes 5 through 8 tell the computer "Here's how long I am." This is the answer to your question - one of the first things files of any kind will do is tell the computer how big they are, precisely because this is something the computer needs to know.
- Bytes 9 through 12 tell the computer "Specifically, I'm a .wav file."
- Bytes 13 through 36 tell the computer "Because I'm a .wav file, here are some things you need to know about me. Like, what's my bitrate, am I stereo or mono, how many channels do I have, etc."
- Bytes 37 through 44 tell the computer: "Okay, the actual data's coming now. Just a reminder: this is how big it is."
- Bytes 45 through whatever number the previous 44 bytes told us are the actual sound itself.
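The byte ranges above can be sketched in Python with the standard `struct` module. This is only an illustration: it assumes the simplest "canonical" PCM layout from the linked page (exactly 44 header bytes, no extra chunks), which many real files do not follow. The field names are borrowed from that page, and the sample header is made up rather than read from a real file.

```python
import struct

# Layout of the 44-byte canonical PCM header; "<" means little-endian, no padding.
WAV_HEADER = struct.Struct("<4sI4s4sIHHIIHH4sI")

FIELDS = (
    "ChunkID", "ChunkSize", "Format",                     # bytes 1-12
    "Subchunk1ID", "Subchunk1Size", "AudioFormat",        # bytes 13-36 ...
    "NumChannels", "SampleRate", "ByteRate",
    "BlockAlign", "BitsPerSample",
    "Subchunk2ID", "Subchunk2Size",                       # bytes 37-44
)

def parse_wav_header(raw: bytes) -> dict:
    """Split the first 44 bytes into named fields (assumes the canonical layout)."""
    if len(raw) < WAV_HEADER.size:
        raise ValueError("need at least 44 bytes")
    fields = dict(zip(FIELDS, WAV_HEADER.unpack(raw[:WAV_HEADER.size])))
    if fields["ChunkID"] != b"RIFF" or fields["Format"] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    return fields

# A made-up header: 16-bit stereo at 44100 Hz with 8 bytes of sound data.
sample = WAV_HEADER.pack(
    b"RIFF", 36 + 8, b"WAVE",
    b"fmt ", 16, 1, 2, 44100, 176400, 4, 16,
    b"data", 8,
)
info = parse_wav_header(sample)
print(info["NumChannels"], info["SampleRate"], info["Subchunk2Size"])  # prints: 2 44100 8
```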
As for why the computer treats 1001 as 9 instead of as 2-1: at a very fundamental level the computer isn't reading the data bit by bit; it's reading it in chunks (sort of like taking steps two at a time). By default, the chunk size is the "X" that they're talking about whenever they refer to an "X-bit system" or "X-bit architecture", but if a file is encountered, its directives on How to Read This Kind of File take over. So it isn't seeing it as a sequence "1-0-0-1" and trying to figure out where to break it; it's seeing it as a gestalt "1001" (really, "00001001") and treating it as a single unit. If you wanted a 2 and then a 1, you'd need two different units: 00000010 00000001.
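Here is the same 9 versus 2-then-1 idea as a tiny Python sketch (the variable names are just for illustration). The last two lines are an extra illustration of the same point: if the format says "these two bytes are one 16-bit number", they become yet another value, and the byte order matters too.

```python
one_unit = bytes([0b00001001])               # a single byte holding 1001
two_units = bytes([0b00000010, 0b00000001])  # two bytes: a 2, then a 1

print(list(one_unit))    # prints: [9]
print(list(two_units))   # prints: [2, 1]

# Read as ONE 16-bit number instead of two 8-bit numbers:
as_big_endian = int.from_bytes(two_units, "big")        # 0x0201
as_little_endian = int.from_bytes(two_units, "little")  # 0x0102
```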
Tl;dr: Files share information about themselves to the computer when they're loaded. One of the things they share is how big they are, and another is how many bits of data the computer should read from them at a time.
fiatfighter t1_je35lzm wrote
This really made sense to me and I am NOT that technologically literate. And I definitely do not understand coding or this byte structure thing. But when you said, OK, this piece is the program or file saying this, and this one is telling it this, that helped me wrap my brain around it. Thank you! Off to submit my resume to Twitter! Oh wait…
RelativeApricot1782 t1_je4t9td wrote
>Bytes 37 through 44 tell the computer: “Okay the actual data is coming now. Just a reminder: this is how big it is.”
Why does the computer need to be reminded?
mrpenchant t1_je4vys2 wrote
They misstated it a little bit.
The way the format is set up, the first length it gives is for the whole thing, but the file is defined to have 2 subchunks. The first subchunk will always have the same size for a wave file, but it still states its own length, and then the last length is just for the data in the 2nd subchunk.
This is all to say, it's not a reminder but a slightly different length, which would be the length of the entire thing minus 36.
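A quick sketch of where the 36 comes from, assuming the canonical PCM layout from the linked page (a 16-byte first subchunk). The function name is made up for the example.

```python
def riff_chunk_size(data_size: int, subchunk1_size: int = 16) -> int:
    """Length stored in bytes 5-8, given the length of the sound data."""
    format_tag = 4                          # the "WAVE" marker
    first_subchunk = 8 + subchunk1_size     # 4-byte id + 4-byte length + its contents
    second_subchunk = 8 + data_size         # 4-byte id + 4-byte length + the sound data
    return format_tag + first_subchunk + second_subchunk

data_size = 1000
whole = riff_chunk_size(data_size)
print(whole, whole - data_size)  # prints: 1036 36
```

So with a 16-byte first subchunk, the first length is always the data length plus 36. Note that this first length does not count the first 8 bytes themselves, so the file on disk would be 8 bytes longer again.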
RelativeApricot1782 t1_je7w88z wrote
That makes more sense thanks
aiusepsi t1_je4u9v5 wrote
A computer doesn't, but software is (at least for now) written by human beings. You could have the size of the actual payload be implicit, and calculated from the information you've already seen, but there's more opportunity for the person writing the code which is reading the file to get the calculation wrong in some subtle way.
If the size is written explicitly just before the data, you can make the code which reads it much simpler and therefore more reliable. Simple and reliable is really good for this kind of code; mistakes can lead to software containing security vulnerabilities. Nobody wants to get a virus because they played a .wav file!
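As a rough sketch of why an explicit length keeps the reading code simple: read the 8-byte header, read exactly that many bytes, and refuse anything that doesn't add up. The function name is hypothetical, and this skips details a real RIFF reader needs (for example, chunks are padded to an even length).

```python
import io
import struct

def read_chunk(stream) -> tuple:
    """Read one id + length + payload chunk, checking the stated length."""
    header = stream.read(8)
    if len(header) != 8:
        raise ValueError("truncated chunk header")
    chunk_id, size = struct.unpack("<4sI", header)  # 4-byte id, 32-bit little-endian length
    payload = stream.read(size)
    if len(payload) != size:
        # The file claims more data than it contains: stop instead of guessing.
        raise ValueError("chunk is shorter than its stated length")
    return chunk_id, payload

good = io.BytesIO(b"data" + struct.pack("<I", 3) + b"abc")
print(read_chunk(good))  # prints: (b'data', b'abc')
```

The check on the stated length is the important part: code that trusts a length without verifying it is where many file-parsing bugs come from.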
nerdguy1138 t1_je5ekl5 wrote
The gnu file utility can read the first few bytes of a file as a magic number to determine what kind of file it is.
There is a hacker magazine called POC or GTFO, meaning proof of concept.
The PDFs of that magazine can also be interpreted in various other ways. Files that you can do this with are called polyglots.
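For a feel of how magic-number detection works, here is a toy sketch. It is not how the real utility is implemented (that uses a much larger database of rules); the table and the function name `sniff` are made up for the example, though the byte signatures listed are the well-known ones for these formats.

```python
# (signature at the start of the file, human-readable guess)
MAGIC_NUMBERS = [
    (b"RIFF", "RIFF container (e.g. .wav)"),
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"MZ", "Windows executable"),
]

def sniff(data: bytes) -> str:
    """Guess a file type from its first few bytes."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"

print(sniff(b"%PDF-1.7 ..."))   # prints: PDF document
print(sniff(b"hello world"))    # prints: unknown
```

A check like this only looks at the first few bytes, which is part of what makes polyglots possible: bytes elsewhere in the file can also satisfy a different format's rules.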
pseudopad t1_je248it wrote
This is probably the best explanation so far. There are a few posts talking about CPUs and how many bits they are, but the question was about storage, and this reply describes how a computer (program) figures out what's inside a file.
Y34rZer0 t1_je2zatz wrote
The difference between data and information
ColdDesert77 t1_je3tg6b wrote
> the specification
What's that?
aiusepsi t1_je4oway wrote
A document written by a human being which describes the format of the file.
It's basically an agreement between the person writing software which writes that kind of file, and the person writing software which reads that kind of file.
psycotica0 t1_je4ovsd wrote
Did you click on it? It's the document that describes what makes a wav file a wav file so that programs that can read wav files can read it. It's essentially a description and some instructions for the programmers making the reading program and the writing program so they know they're making the same file that contains the same information.