.NET Forum / Languages / C# / July 2007
Compression size
VBA - 23 Jun 2007 03:48 GMT I compressed a file with the GZipStream class and the result is larger than the original file. How can this be? The original file is 737 KB and the "compressed" file is 1.1 MB. Did I miss something, or is this normal with that compression class?
 Signature VBA
Andrew Robinson - 23 Jun 2007 06:41 GMT What type of file are you compressing? Some highly compressed files, such as images, may grow in size when compressed a second time. Also, the Microsoft algorithms are not ideal but rather make an attempt to steer clear of patent issues.
> I compressed a file with GZipStream class and is larger than the original
> file.... how can this be?, the original file is 737 KB and the "compressed"
> file is 1.1 MB. Did i miss something or is normal with that compression
> class?
VBA - 23 Jun 2007 06:52 GMT First I compressed a txt file. I read that if a file is very small, compression can make it larger, so then I tried an MP3 file (not sure if the file type matters) of 3.4 MB, but it grew to 5.3 MB. So what's wrong?
 Signature VBA
> What type of file are you compressing? Some highly compressed files such as
> images may grow in size when compressed a second time. Also, the Microsoft
[quoted text clipped - 6 lines]
> > file is 1.1 MB. Did i miss something or is normal with that compression
> > class?
Scott C - 23 Jun 2007 08:10 GMT
> First I compressed a txt file, i read that if a file is very small , the
> compression can turn it larger in size, so then i tried with a mp3 file (not
> sure if the file type matters) of 3.4 Mb, but turned it to 5.3
> MB....so.....what's wrong??
MP3 is a compressed file... I bet you'd get better behavior with a 3.5 MB text file.
Scott
VBA - 23 Jun 2007 08:26 GMT But can I only compress text files? I tried a while ago with a PDF and got the same result, a bigger file, but I don't know if a PDF file is somehow compressed already.
By the way, when compressing a file, should the resulting compressed file keep the same file extension, or must I use something like *.Z?
 Signature VBA
> > First I compressed a txt file, i read that if a file is very small , the
> > compression can turn it larger in size, so then i tried with a mp3 file (not
[quoted text clipped - 5 lines]
>
> Scott
Marc Gravell - 23 Jun 2007 09:20 GMT PDF can contain compressed graphics (and, IIRC, sometimes text), and if it is encrypted the data can appear relatively random. Both of these make it a poor choice for compression.
Put simply: some files compress very well indeed, and some don't. In particular, those that are already compressed (or highly random) don't tend to compress (and can get bigger).
Marc
Peter Duniho - 23 Jun 2007 11:26 GMT
> [...]
> by the way, when compressing a file, the resulting compressed file should be
> with the same file extension? or i must use something like *.Z ????
You can name the compressed file whatever you like. Of course, when using the GZipStream class, it's common to use the ".gz" extension for the output. But there's no requirement that you do so.
Pete
Peter Duniho - 23 Jun 2007 11:28 GMT
> [...] Also, the Microsoft
> algorithms are not idea but rather make an attempt to steer clear of patent
> issues.
GZipStream may not implement an ideal algorithm, but since gzip itself is an open format, I doubt that patent issues are part of the question.
Tom Spink - 23 Jun 2007 10:35 GMT
> I compressed a file with GZipStream class and is larger than the original
> file.... how can this be?, the original file is 737 KB and the
> "compressed" file is 1.1 MB. Did i miss something or is normal with that
> compression class?
Hi VBA,
Random data is hard to compress, as compression techniques often work on probabilities (e.g. Huffman encoding). So encrypted files and already compressed files, such as MP3s, JPEGs, GIFs, etc., will not compress at all.
Text documents written in English, or files containing sparse data (such as BMPs and certain executables) will compress fairly well. It all depends on the compression algorithm.
You should choose an algorithm that's appropriate to the type of data you're trying to compress... a bad algorithm will almost certainly result in larger files.
But like I said at the start random data is hard if not damn near impossible to compress.
 Signature Tom Spink University of Edinburgh
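Tom's point can be checked with a few lines of code. What follows is a minimal sketch, written in Python rather than C# purely for brevity; Python's gzip module writes the same gzip/deflate format that GZipStream produces, but this says nothing about how well the .NET implementation itself performs. The helper name gzip_ratio and the sample data are made up for illustration.

```python
import gzip
import os


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (below 1.0 means it shrank)."""
    return len(gzip.compress(data)) / len(data)


# Highly repetitive English text: plenty of redundancy for the compressor to exploit.
text = b"the quick brown fox jumps over the lazy dog. " * 2000

# Bytes from the OS random source stand in for encrypted or already-compressed data.
noise = os.urandom(len(text))

print(f"text : {gzip_ratio(text):.3f}")
print(f"noise: {gzip_ratio(noise):.3f}")
```

The repetitive text should shrink to a small fraction of its size, while the random bytes should come out at roughly the original size plus the gzip header and trailer. An MP3 or an encrypted PDF behaves much more like the second case than the first.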
VBA - 23 Jun 2007 17:16 GMT Everything you are telling me looks very interesting :) I just thought of a new, related question: how does WinZip work? I mean, you can put any file in a WinZip file and compress it, and I read in a book that it uses a similar compression algorithm. Is that another type of compression, or could you write similar software in .NET using GZipStream?
 Signature VBA
> > I compressed a file with GZipStream class and is larger than the original
> > file.... how can this be?, the original file is 737 KB and the
[quoted text clipped - 17 lines]
> But like I said at the start random data is hard if not damn near impossible
> to compress.
Peter Duniho - 23 Jun 2007 18:24 GMT
> Looks very interesenting all that you are telling me :)
> I just now thought in a new question related it.... how does Winzip
> work??
Two standard compression algorithms on which much (nearly all, actually, as far as I know) of our lossless compression tools are built are Huffman encoding and the Lempel-Ziv-Welch algorithm. I don't have specifics on the exact implementation of WinZip, but I gather that like all "zip" variations, it uses some forms of these algorithms.
If you want to have a better idea of how various compression schemes work, the place to start is reading about these basic algorithms.
> i mean you can put any file in a Winzip file and compress it, and i read
> in a
> book that uses a similar compression algorithm, is that another type a
> compression or you could do a similar software in .NET using
> GZipStream????
You can't "put any file in a Winzip file and compress it". Typically, something like WinZip will try a variety of specific compression algorithms to see which performs best (each variation of a given algorithm may perform differently, depending on the content and structure of the data). In some cases, no compression algorithm will reduce the size, or will reduce it significantly, and the original data will be used. But inclusion of file headers and other information will increase the file size at least a little.
Note that the GZipStream class does not have the entire data before it must make decisions about how to compress the data. As far as I know, it just uses a single "best general case" version of the "deflate" algorithm (based on Huffman and LZW). In any case, it's guaranteed that GZipStream doesn't have the ability to pick from a variety of algorithms to use the best-performing one, as something like WinZip can.
Again, I don't know specifically how WinZip works, but all compression tools have this basic behavior. There is not a single compression tool that is guaranteed to reduce the size of the data.
Pete
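The "use the original data when compression doesn't help" behaviour Pete describes can be sketched like this. It is only an illustration in Python, under loud assumptions: the one-byte tag layout and the names pack and unpack are invented here and are not the ZIP file layout or anything WinZip is known to do (the ZIP format does have a "stored" method, but its headers look nothing like this).

```python
import gzip
import os


def pack(data: bytes) -> bytes:
    """Hypothetical container: a 1-byte method tag followed by the payload.

    Tag b"Z" means the payload is gzip-compressed, tag b"S" means it is
    stored unchanged because compressing it did not make it smaller.
    """
    squeezed = gzip.compress(data)
    if len(squeezed) < len(data):
        return b"Z" + squeezed
    return b"S" + data


def unpack(blob: bytes) -> bytes:
    """Reverse pack(): inspect the tag and decompress only when needed."""
    tag, payload = blob[:1], blob[1:]
    return gzip.decompress(payload) if tag == b"Z" else payload


text = b"some files compress very well indeed, and some don't. " * 200
noise = os.urandom(4096)  # stands in for an MP3 or an encrypted file

for name, data in (("text", text), ("noise", noise)):
    packed = pack(data)
    print(name, len(data), "->", len(packed), "method", packed[:1].decode())
```

Even in the stored case the output is one byte longer than the input, which is the point about headers: a container can cap the damage, but it cannot promise that every input gets smaller.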
Arne Vajhøj - 02 Jul 2007 03:56 GMT
>> Looks very interesenting all that you are telling me :)
>> I just now thought in a new question related it.... how does Winzip
[quoted text clipped - 5 lines]
> specifics on the exact implementation of WinZip, but I gather that like
> all "zip" variations, it uses some forms of these algorithms.
Absolutely untrue.
LZ78 (LZW) is used in traditional Unix compress.
But ZIP and GZip uses LZ77.
Both often combined with either Huffman or Arithmetic encoding.
BZip uses Burrows Wheeler.
>> i mean you can put any file in a Winzip file and compress it, and i
>> read in a
[quoted text clipped - 17 lines]
> GzipStream doesn't have the ability to pick from a variety of algorithms
> to use the best-performing one, as something like WinZip can.
I would assume that WinZip only uses the possibilities within the Zip format and not some custom format.
And deflate is still LZ77 not LZ78 (LZW).
Arne
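For readers who have not met the two families, the difference Arne is pointing at can be sketched in code. These are toy Python versions written for this note, not what deflate or Unix compress actually implement: real deflate uses a 32 KB window and Huffman-codes its output, and real LZW implementations pack codes into variable-width bit fields. The function names and the (offset, length, next byte) token shape are illustrative assumptions.

```python
from typing import List, Tuple


def lzw_encode(data: bytes) -> List[int]:
    """LZ78/LZW style: grow an explicit phrase table and emit table indexes."""
    table = {bytes([i]): i for i in range(256)}
    codes: List[int] = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate           # keep extending the known phrase
        else:
            codes.append(table[current])  # emit the longest known phrase
            table[candidate] = len(table) # learn the new, longer phrase
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lz77_encode(data: bytes, window: int = 255) -> List[Tuple[int, int, int]]:
    """LZ77 style: no table, just back-references into recently seen data."""
    tokens: List[Tuple[int, int, int]] = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            # Stop one byte early so a literal is always left for the token.
            while (pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        tokens.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return tokens


def lz77_decode(tokens: List[Tuple[int, int, int]]) -> bytes:
    """Replay each back-reference byte by byte, then append the literal."""
    out = bytearray()
    for offset, length, literal in tokens:
        for _ in range(length):
            out.append(out[-offset])
        out.append(literal)
    return bytes(out)


sample = b"abababab"
print("LZW codes  :", lzw_encode(sample))
print("LZ77 tokens:", lz77_encode(sample))
```

Both are dictionary coders in the broad sense, but the LZW encoder outputs indexes into a table it builds as it goes, while the LZ77 encoder outputs distances and lengths pointing back into a sliding window. Deflate, as used by ZIP and gzip, belongs to the second family.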
Peter Duniho - 02 Jul 2007 04:00 GMT
>> Two standard compression algorithms on which much (nearly all,
>> actually, as far as I know) of our lossless compression tools are built
[quoted text clipped - 3 lines]
>
> Absolutely untrue.
Okay.
> LZ78 (LZW) is used in traditional Unix compress.
>
> But ZIP and GZip uses LZ77.
>
> Both often combined with either Huffman or Arithmetic encoding.
That's what I said. I thought you said what I said was "absolutely untrue".
Maybe the word "absolutely" means something different in your native language? Here, it's used to emphasize, rather than to negate.
Pete
Arne Vajhøj - 03 Jul 2007 03:27 GMT
>>> Two standard compression algorithms on which much (nearly all,
>>> actually, as far as I know) of our lossless compression tools are
[quoted text clipped - 18 lines]
> Maybe the word "absolutely" means something different in your native
> language? Here, it's used to emphasize, rather than to negate.
????
You said that nearly all lossless compression tools are build on LZW.
That is absolute untrue or complete bullshit or whatever you want to call it.
I even explained why: ZIP and GZip do not use LZW. And they are used a lot more than good old Unix compress.
Arne
Peter Duniho - 03 Jul 2007 03:35 GMT
> You said that nearly all lossless compression tools are build on LZW.
I wrote (and you quoted) "WinZip...uses some forms of these algorithms".
In what way is LZ77 (the algorithm you wrote is used with the ZIP format) _not_ "some form" of the LZW algorithm?
> That is absolute untrue or complete bullshit or whatever you want
> to call it.
My statement was just fine, and your own claims even confirm that. You can continue to write asinine things like "absolute untrue" and "complete bullshit" as much as you like, there was nothing wrong with my post. Furthermore, your posts continue to insult without educating.
If you have an actual point, try making it without being such an a.s.
Thanks, Pete
Arne Vajhøj - 04 Jul 2007 01:02 GMT
>> You said that nearly all lossless compression tools are build on LZW.
>
> I wrote (and you quoted) "WinZip...uses some forms of these algorithms".
>
> In what way is LZ77 (the algorithm you wrote is used with the ZIP
> format) _not_ "some form" of the LZW algorithm?
No.
Not code wise. Not patent wise. Not in any way.
>> That is absolute untrue or complete bullshit or whatever you want
>> to call it.
>
> My statement was just fine, and your own claims even confirm that.
Bullshit.
> Furthermore, your posts continue to insult without educating.
I have tried multiple times to explain to you that the most widely used compression algorithms do not use LZW; they use LZ77.
That is educational.
That you refuse to understand it does not make it less educational.
> If you have an actual point, try making it without being such an a.s.
It seems as if you just have difficulties understanding the point.
Arne
Peter Duniho - 04 Jul 2007 01:29 GMT
> It seems as if you just have difficulties understanding the point.
When you make a point that is comprehensible, then I will start worrying about whether I understand it.
Arne Vajhøj - 05 Jul 2007 00:37 GMT
>> It seems as if you just have difficulties understanding the point.
>
> When you make a point that is comprehensible, then I will start worrying
> about whether I understand it.
So you did not understand the following:
#> In what way is LZ77 (the algorithm you wrote is used with the ZIP
#> format) _not_ "some form" of the LZW algorithm?
#
#No.
#
#Not code wise. Not patent wise. Not in any way.
LZW is a completely different algorithm than LZ77. An implementation will be different code. The infamous LZW patent does not apply to LZ77.
Is it difficult to understand?
Arne
Peter Duniho - 05 Jul 2007 00:54 GMT
> [...]
> LZW is a completely different algorithm than LZ77. An implementation
> will be different code. The infamous LZW patent does not apply to LZ77.
You have a very strange concept of these absolute terms you're using: "absolutely untrue", "complete bullshit", "completely different algorithm", etc.
LZW is _not_ a COMPLETELY different algorithm. A COMPLETELY different algorithm would share absolutely zero similarities.
All of the algorithms spawned by Lempel and Ziv, including the LZW algorithm, share various similarities. Some have more similarities in common than others, but they are ALL "some form" of each other. They all share the same heritage, and in many ways address similar problems with similar approaches. All of the LZ-based algorithms, being dictionary-based, are much more similar to each other than they are to, for example, Huffman encoding.
The question of a patent is completely irrelevant, by the way. Even assuming that software patents make sense in the first place, it doesn't take much for a patent to be inapplicable to closely related code. Most software patents are written narrowly, for the very reason that it's too easy to invalidate a broadly-written patent. As such, relatively minor variations can result in two otherwise closely related algorithms not sharing patent protection (see MP3 versus other similar psychoacoustics-based audio compression algorithms, for example).
You seem to have this pathological need to find fault in whatever has been written, at least with respect to my own posts, regardless of how contrivedly narrow you have to interpret what was actually written, even to the point of completely ignoring whatever intent actually existed in what was written.
Frankly, I find _that_ to be "complete bullshit", and I'm sick and tired of it. I go to a lot of trouble to make what I write as correct as I can, and to make it clear where my first-hand knowledge of something is vague or incomplete. When someone posts a _valid_ correction to something I've written, I have no problem acknowledging my mistake, and I've posted my share of "mea culpas" here in this newsgroup and others.
I find your insistence on finding fault with my posts where no fault exists to be idiotic. I wish you would cut it out.
Pete
Arne Vajhøj - 05 Jul 2007 01:06 GMT
> LZW is _not_ a COMPLETELY different algorithm. A COMPLETELY different
> algorithm would share absolutely zero similarities.
[quoted text clipped - 6 lines]
> dictionary-based, are much more similar to each other than they are to,
> for example, Huffman encoding.
LZ77 and LZW are both dictionary based, but that does not make LZ77 a form of LZW.
> You seem to have this pathological need to find fault in whatever has
> been written, at least with respect to my own posts, regardless of how
> contrivedly narrow you have to interpret what was actually written, even
> to the point of completely ignoring whatever intent actually existed in
> what was written.
Let us take a step back.
You started by writing:
#Two standard compression algorithms on which much (nearly all,
#actually, as far as I know) of our lossless compression tools are built
#on are Huffman encoding and the Lempel-Ziv-Welch algorithm.
I replied:
#Absolutely untrue.
#
#LZ78 (LZW) is used in traditional Unix compress.
#
#But ZIP and GZip uses LZ77.
That is not an interpretation. What you wrote was plain wrong.
The most common compression tools do not use LZW.
> Frankly, I find _that_ to be "complete bullshit", and I'm sick and tired
> of it. I go to a lot of trouble to make what I write as correct as I
> can, and to make it clear where my first-hand knowledge of something is
> vague or incomplete. When someone posts a _valid_ correction to
> something I've written, I have no problem acknowledging my mistake, and
> I've posted my share of "mea culpas" here in this newsgroup and others.
Well in this case you have tried to cover your mistake with various lame excuses:
#In what way is LZ77 (the algorithm you wrote is used with the ZIP
#format) _not_ "some form" of the LZW algorithm?
instead of just admitting that you remembered wrong regarding LZW.
Arne
Peter Duniho - 05 Jul 2007 01:29 GMT
> LZ77 and LZW are both dictionary based, but that does not make LZ77
> a form of LZW.
Why not? Who are you that you get to define what "a form" is? Why is your definition any more important or correct than mine? Where is the "official" definition of "a form" on which you base your claim?
I have explained my basis for my usage of the phrase "some form" or "a form". You have not bothered to explain your basis, but even if you should happen to, why would your explanation take priority over mine with respect to interpreting what *I* wrote?
You have a pretty arrogant view of your own importance in how language should be used, especially when it comes to the intent of someone _else's_ use of language.
> [...]
> Well in this case you have tried to cover your mistake with various
> lame excuses:
Baloney. I made no mistake, and I stand by my original post. I am not trying to "cover" anything. It is only your pathological need to find fault that has resulted in this inane sub-thread.
And inane it is. Frankly, I'm a bit embarrassed to have even bothered feeding your troll-like behavior, and I'm done.
To anyone else who has rightly identified this as a useless sub-thread, I apologize for it and promise that my involvement with it, as well as more generally with Arne's continued insistence on finding fault where none exists, is over with. Life's too short to waste time on idiotic stuff like this.
Pete
Arne Vajhøj - 02 Jul 2007 03:49 GMT
> But like I said at the start random data is hard if not damn near impossible
> to compress.
Some define random data as being data that are incompressible ...
:-) Arne