Where does Baidu Cloud's 2TB of net disk capacity come from?
Some time ago, while using Baidu network disk, I suddenly found that it was giving away 2TB of space for free!

Most of us have had some contact with network drives. In an era where everything moves to the cloud, they are genuinely useful tools, but for us dirt-poor free users, storage space has always been the sore spot. When I first started, the real space on offer was tiny, and even after all kinds of hoop-jumping (doing the so-called tasks they set) I only managed to expand it to about 5GB. Now, suddenly, there is 2TB of space sitting in my drive.

Where did this sudden 2TB of space come from?

The truth is this!

Say I want to give each user 1GB of network storage.

If the server has a 1000GB hard drive entirely dedicated to user data, how many users can it serve at 1GB of storage each?

You'd probably say 1000 / 1 = 1000 users.

But if you actually allocate it that way, you will find that hardly any user fills up their 1GB. In fact the average user uploads only about 50MB. That means if you hand the 1000GB disk to 1,000 people, you only effectively use 50MB * 1,000 = 50GB of it, and the remaining 950GB is completely wasted.

So what's the solution?

You can work around it by allocating this 1000GB of space to 20,000 users. Each still has an upload limit of 1GB, but each uploads only about 50MB on average, so 20,000 * 50MB = 1000GB, and the server's precious storage is fully used. But you also worry: with 20,000 people sharing it, what if at some moment they suddenly upload more than usual and discover that the 1GB you "gave" them is fake? So instead of going all the way to 20,000, you allocate it to 19,000 people and keep some space in reserve for emergencies.
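To make the arithmetic concrete, here is a minimal sketch of the oversubscription calculation, using only the article's illustrative figures (a 1000GB server, a 1GB quota, ~50MB average usage, and a roughly 5% reserve); none of these are real Baidu Pan numbers.

```python
# Oversubscription arithmetic, using the article's illustrative figures.

TOTAL_GB = 1000          # raw capacity of one server
QUOTA_GB = 1             # space promised to each user
AVG_USE_GB = 0.05        # ~50 MB actually uploaded by a typical user

# Naive allocation: one real gigabyte for every promised gigabyte.
naive_users = TOTAL_GB // QUOTA_GB                     # 1000 users
naive_used = naive_users * AVG_USE_GB                  # ~50 GB actually used
print(f"naive: {naive_users} users, {naive_used:.0f} GB used, "
      f"{TOTAL_GB - naive_used:.0f} GB wasted")

# Oversubscribed allocation: promise more than physically exists,
# but keep a safety margin for sudden upload spikes.
SAFETY_MARGIN = 0.05                                   # keep ~5% of space free
overs_users = int(TOTAL_GB * (1 - SAFETY_MARGIN) / AVG_USE_GB)  # 19000 users
print(f"oversubscribed: {overs_users} users on the same {TOTAL_GB} GB")
```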

Suddenly you realize that the number of users you can serve has grown 19-fold, which is remarkable. Is there any way to use the space even more efficiently?

If I have 1,000 servers, each with 1000GB of space, and I have to keep 50GB free on every one of them in case some user suddenly uploads a pile of data, then across those 1,000 servers I am leaving 1,000 * 50GB = 50,000GB idle. What a waste. So the engineers invented storage clusters: a single user's data can be spread across multiple servers, while to the user it still looks like one continuous 1GB of space. Then there is no need to keep an emergency reserve on every machine; you can even fill one server completely and let the overflow spill onto the next. This squeezes the maximum out of every server, and if the administrator one day notices users uploading like crazy (rare across a large user base) and space running low, it is enough to simply add a few hard drives or servers on the fly.
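A toy sketch of that clustering idea follows: a file is split across whichever servers still have room, so no individual machine needs its own emergency reserve. The class, names, and sizes below are assumptions for illustration, not Baidu's actual design.

```python
# Illustrative storage cluster: one user's file may span several servers.

class Cluster:
    def __init__(self, num_servers, capacity_mb):
        self.free_mb = [capacity_mb] * num_servers
        self.placement = {}                        # file_id -> [(server_idx, mb), ...]

    def store(self, file_id, size_mb):
        """Fill servers one after another; a single file may cross machines."""
        chunks, remaining = [], size_mb
        for idx in range(len(self.free_mb)):
            if remaining == 0:
                break
            take = min(remaining, self.free_mb[idx])
            if take:
                self.free_mb[idx] -= take
                chunks.append((idx, take))
                remaining -= take
        if remaining:                               # whole cluster full: add servers/disks
            raise RuntimeError("out of space, add servers or disks")
        self.placement[file_id] = chunks
        return chunks

cluster = Cluster(num_servers=3, capacity_mb=1000)
print(cluster.store("zhangsan/movie.avi", 1800))    # [(0, 1000), (1, 800)] -> spans two servers
```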

Well, now server space utilization is much better, and a given amount of space can serve the maximum number of users. But can we improve further?

One day the administrator realizes that even though each user stores only about 50MB on average, that 50MB is not reached overnight; it builds up slowly over one or two years of use. A newly registered user uploads nothing at first, or only a few tiny files. So if I allocate 50MB to every account up front, most of that space sits wasted for a long time even if it does eventually fill up after two years. The clever engineers then say: since we already have distributed, clustered storage and a user's data can live on many servers, let's give a newly registered user 0MB of real space to start with, and grant physical storage only as he actually uses it. That keeps disk utilization as high as possible. But the user's front end still shows 1GB.
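A minimal sketch of this thin-provisioning idea, assuming a fixed 1GB displayed quota; the class and numbers are purely illustrative.

```python
# Thin provisioning: every account *displays* a 1 GB quota,
# but physical space is only consumed as bytes actually arrive.

DISPLAY_QUOTA_MB = 1024

class Account:
    def __init__(self):
        self.used_mb = 0                      # physical usage starts at zero

    def upload(self, size_mb):
        if self.used_mb + size_mb > DISPLAY_QUOTA_MB:
            raise ValueError("quota exceeded")
        self.used_mb += size_mb               # only now is real disk consumed

    def frontend_view(self):
        return f"{self.used_mb} MB used of {DISPLAY_QUOTA_MB} MB"

acct = Account()
acct.upload(50)
print(acct.frontend_view())                   # "50 MB used of 1024 MB"
```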

Thanks to this idea, a single 1000GB server could support on the order of 1,000,000 sign-ups in the early days of the network disk. As more people register, money comes in, and I can keep adding servers to provide the storage they will need later. And since some of those servers are bought a year or more after launch, hardware prices have fallen in the meantime and my purchase costs come down too.

So...is that the end of it?

For an email provider, this level of utilization would already be high enough. Not so for an online disk.

The smart engineers realized: in a mailbox, the vast majority of the content and attachments are written by the users themselves and therefore differ from person to person. But a large share of what people upload to a net drive is duplicated.

For example: Zhang San downloads a TOKYO HOT video today and uploads it to his network disk; three days later Li Si downloads exactly the same TOKYO HOT and uploads it too. As the user base grows, you find that altogether 1,000 people have uploaded 1,000 copies of exactly the same file onto your precious server space. So the engineers came up with a trick: since it is the same file, store just one physical copy, and simply show each of those users a copy of it on the front end. When one of them wants to delete the file, don't really delete it; just make it disappear from his front end while keeping it on the back end for the other users who still hold it. Only when every last user has deleted it do I actually remove the file.

As more data is stored and more users register, the proportion of duplicate uploads keeps rising, and this deduplication becomes ever more effective. In the end the non-duplicate data each person contributes averages out to something like 1MB per user, so the same limited space can now serve more than 50 times as many users.
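One way to picture the bookkeeping is reference counting: one physical copy per unique file plus a per-user pointer, with the copy freed only when the last owner deletes it. This is an illustrative sketch, not Baidu's actual implementation; the keys and names are made up.

```python
# Deduplicated storage with reference counting.

storage = {}          # content_key -> (data, reference_count)
user_files = {}       # (user, filename) -> content_key

def upload(user, filename, data):
    key = hash(data)                                   # stand-in for a real content hash
    blob, refs = storage.get(key, (data, 0))
    storage[key] = (blob, refs + 1)                    # bump the reference count
    user_files[(user, filename)] = key

def delete(user, filename):
    key = user_files.pop((user, filename))
    blob, refs = storage[key]
    if refs == 1:
        del storage[key]                               # last owner: really delete
    else:
        storage[key] = (blob, refs - 1)                # others still need it

upload("zhangsan", "movie.avi", b"same bytes")
upload("lisi", "film.avi", b"same bytes")
print(len(storage))        # 1 -> only one physical copy for both users
delete("zhangsan", "movie.avi")
print(len(storage))        # 1 -> Li Si still references it
```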

But as you use it, you find a pattern:

The "TOKYO HOT N0124" uploaded by Zhangsan and the "TH n124" uploaded by Lisan are the same file, but with different filenames, so couldn't I just recognize that they're the same file, and then just save them as different filenames for different users? Indeed, it is possible, but it is necessary to use some algorithms to identify the same file, such as the MD5 value, etc. As long as the two files have the same MD5 value. As long as the MD5 values of the two files are the same and the file sizes are the same, I would consider them to be the same file, and just save a copy of the file with different filenames for different users.

One day you realize this puts a huge load on the server's CPU, because it has to compute the MD5 of every file, and it also wastes bandwidth, because identical files still have to be uploaded before they can be checked. Can this be improved?

The smart engineers wrote a small piece of software, or a small plug-in, called the "upload client": the work of computing the MD5 value is pushed onto the uploading user's own computer. If the client calculates that the data the user wants to upload is identical to data already stored on the server, the bytes are simply not sent at all; the server just marks that this user has successfully uploaded a file named XX with that MD5. The whole process finishes almost instantly, and it was given the handsome name "second transfer" (instant upload)!
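Roughly, the handshake looks like the sketch below: the client hashes locally and only ships the bytes if the server has never seen that digest. The in-memory "server" and the function names are assumptions for illustration, not Baidu's protocol.

```python
# "Second transfer" (instant upload): hash on the client, skip the transfer
# whenever the server already holds identical content.
import hashlib

server_index = set()                          # digests the server already stores

def client_upload(data: bytes) -> str:
    digest = hashlib.md5(data).hexdigest()    # computed on the *user's* machine
    if digest in server_index:
        return "instant success, nothing sent"   # server just records ownership
    server_index.add(digest)                  # first copy: actually transfer the bytes
    return "uploaded full file"

movie = b"a very large file" * 1000
print(client_upload(movie))                   # "uploaded full file"
print(client_upload(movie))                   # "instant success, nothing sent"
```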

Adding all these steps together, you realize that where you could originally offer space to only 1,000 users, the same hardware can now serve close to 1,000,000 users, each still seeing 1GB of space on his side.

So if one day you are in a good mood and announce that every user's storage limit is being raised to 1TB, each user will still upload only about 50MB on average, and only a handful will ever push past the original 1GB. The extra cost to you turns out to be negligible.
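A back-of-the-envelope check of that claim; the share and size of "heavy" users below are pure assumptions, not measured figures.

```python
# Raising the advertised quota from 1 GB to 1 TB barely changes real consumption
# if usage behaviour stays roughly the same.

users = 1_000_000
avg_use_gb = 0.05            # a typical user still stores ~50 MB
heavy_fraction = 0.001       # assume 0.1% of users eventually store ~10 GB each
heavy_use_gb = 10

real_need_gb = users * ((1 - heavy_fraction) * avg_use_gb
                        + heavy_fraction * heavy_use_gb)
advertised_gb = users * 1024
print(f"advertised: {advertised_gb:,} GB, actually needed: {real_need_gb:,.0f} GB")
```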

And the hard-working "siege lions" (the engineers) are still at it, digging for ways to use the servers' disk space even more efficiently...