User:Snowolf/zim dumps issue

File info comparison
zimdump -F wikivoyage_en_all_2015-08.zim
count-articles: 92455
uuid: 12da2968-97ef-9057-4199-f4f80949acbe
article count: 92455
mime list pos: 80
url ptr pos: 193
title idx pos: 739833
cluster count: 48444
cluster ptr pos: 5625428
checksum pos: 972731548
checksum: c04c8a3ac5dfb122dfe3f4e885e7b9b7
main page: 44016
layout page: -

zimdump -F wikivoyage_en_all_2015-09.zim
count-articles: 84897
uuid: 9213375a-53f4-819c-47ed-41fc87e7028f
article count: 84897
mime list pos: 80
url ptr pos: 193
title idx pos: 679369
cluster count: 40711
cluster ptr pos: 5169080
checksum pos: 468245393
checksum: 05b9bbf3b6d0c955b6ee74a3f929d911
main page: 44192
layout page: -


 * The September dump has 7558 fewer articles than the August one (84897 vs. 92455), which seems odd, as there were definitely not that many deletions performed in that time.
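
The article-count gap can be recomputed from saved `zimdump -F` output rather than by hand. This is a minimal sketch, assuming the output was saved to text files with one `key: value` field per line; the helper `article_count` and the file names `info_08.txt` and `info_09.txt` are hypothetical, and the sample files contain only the two values reported here.

```shell
# Hypothetical helper: print the value of the "article count" field
# from a file holding saved `zimdump -F` output.
article_count() {
    sed -n 's/^article count:[[:space:]]*//p' "$1"
}

# Hypothetical file names; the contents mirror the two reported values.
tmp=$(mktemp -d)
printf 'count-articles: 92455\narticle count: 92455\n' > "$tmp/info_08.txt"
printf 'count-articles: 84897\narticle count: 84897\n' > "$tmp/info_09.txt"

aug=$(article_count "$tmp/info_08.txt")
sep=$(article_count "$tmp/info_09.txt")
echo "missing articles: $((aug - sep))"   # prints "missing articles: 7558"
rm -r "$tmp"
```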

Comparison of a random article that hasn't changed: Gin Gin
The difference is not substantial (article size 3275 vs. 3261), despite the pagebanner change.

zimdump -f "Gin Gin" -v -i wikivoyage_en_all_2015-08.zim
url: Gin_Gin.html
title:          Gin Gin
idx:            13565
namespace:      A
type:           article
mime-type:      text/html
article size:   3275
cluster number: 67
cluster count:  128
cluster size:   2111843
cluster offset: 31138251
blob number:    15
compression:    lzma

zimdump -f "Gin Gin" -v -i wikivoyage_en_all_2015-09.zim
url: Gin_Gin.html
title:          Gin Gin
idx:            13613
namespace:      A
type:           article
mime-type:      text/html
article size:   3261
cluster number: 67
cluster count:  117
cluster size:   2108872
cluster offset: 30691695
blob number:    29
compression:    lzma

Uncompressed size comparison
du -sh 08
1.4G   08

du -sh 09
931M   09

08
du -sh -
112K   -
du -sh A
480M   A
du -sh I
945M   I
du -sh M
32K    M

09
du -sh -
110K   -
du -sh A
482M   A
du -sh I
449M   I
du -sh M
32K    M
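
In the figures above, almost all of the shrinkage is in I (945M down to 449M), while A is essentially unchanged. The per-namespace comparison can also be produced in one pass. This is a rough sketch, assuming both dumps are extracted into sibling directories such as `08` and `09`; the function name `compare_sizes` is hypothetical, and sizes are printed in KiB (`du -sk`) so they can be compared numerically.

```shell
# Hypothetical helper: for every subdirectory of the old dump, print
# "name<TAB>old_KiB<TAB>new_KiB". A directory missing from the new dump
# is reported with a size of 0.
compare_sizes() {
    old=$1
    new=$2
    for d in "$old"/*/; do
        name=$(basename "$d")
        a=$(du -sk "$old/$name" | cut -f1)
        b=$(du -sk "$new/$name" 2>/dev/null | cut -f1)
        printf '%s\t%s\t%s\n' "$name" "$a" "${b:-0}"
    done
}

# Usage (directory names as in the notes above):
#   compare_sizes 08 09
```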

08
785K Sep 13 19:29 m%2fBest_Tradition_Field_Panorama_-_Mets.JPG
480K Sep 13 19:29 m%2fKandovan-baner.gif
455K Sep 13 19:29 m%2fAppalachian_Trail_Map_copy.png
391K Sep 13 19:29 m%2fFinger_Lakes_travel_banner.png
385K Sep 13 19:29 m%2fTrans-Siberian_Railway_banner_Crop_from_map.png
383K Sep 13 19:29 m%2fYpres_Wikivoyage_Banner.png
359K Sep 13 19:29 m%2fGreifswald_Wikivoyage_banner.png
357K Sep 13 19:29 m%2fParis_16e_Wikivoyage_Banner_.png
351K Sep 13 19:29 m%2fTepelena_Banner.png
350K Sep 13 19:29 m%2fTetovo_banner_Painted_Mosque.png

09
785K Sep 13 19:35 m%2fBest_Tradition_Field_Panorama_-_Mets.JPG
455K Sep 13 19:35 m%2fAppalachian_Trail_Map_copy.png
229K Sep 13 19:35 m%2fTransMilenio_Bogota_Map.png
197K Sep 13 19:35 m%2fSears_Tower_Skydeck_view_labeled.png
196K Sep 13 19:35 m%2fHancock_Center_Observatory_view_labeled.png
186K Sep 13 19:35 m%2fWest_Kootenays_WV_travel_map_EN.png
183K Sep 13 19:35 m%2fColumbia-Rockies_WV_travel_map.png
179K Sep 13 19:35 m%2fEast_kootenays_test.png
176K Sep 13 19:35 m%2fGlacierParkMontanaNPSMap.PNG
172K Sep 13 19:35 m%2fChesapeake_and_Ohio_Canal_park_map.png
168K Sep 13 19:35 m%2fHangolMudvolcano3.JPG

Statistics of the diff between 08/I and 09/I
grep -c "Only in 08/I" diff_I_folders.txt
8216

grep "Only in 08/I" diff_I_folders.txt | grep -c banner
7801

So that leaves 415 images (8216 - 7801) that did not have "banner" in their names but were removed in the latest dump.
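
The three figures can be derived in one go. The sketch below runs on a small illustrative sample in the "Only in DIR: NAME" format that `diff` prints when comparing directories; it is not the real diff_I_folders.txt, whose exact generating command is not recorded here, and the `09/I` entry is a made-up name.

```shell
# Illustrative sample only; the real diff_I_folders.txt is far larger.
list=$(mktemp)
cat > "$list" <<'EOF'
Only in 08/I: m%2fTetovo_banner_Painted_Mosque.png
Only in 08/I: m%2fYpres_Wikivoyage_Banner.png
Only in 08/I: m%2fKandovan-baner.gif
Only in 09/I: m%2fExample_new_file.png
EOF

# Files present only in the August dump.
removed=$(grep -c "Only in 08/I" "$list")
# Of those, names containing lowercase "banner" (grep is case-sensitive by default).
banners=$(grep "Only in 08/I" "$list" | grep -c banner)
echo "removed: $removed, banner: $banners, other: $((removed - banners))"
rm "$list"
```

On the sample this prints `removed: 3, banner: 1, other: 2`; with the real file the same arithmetic gives 8216 - 7801 = 415.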

"Missing" images
grep "Only in 08/I" diff_I_folders.txt | grep -i -v banner > non-banner.txt
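
One caveat when comparing this list with the statistics above: the count of banner images used a case-sensitive `grep -c banner`, while this filter uses `-i`, so names containing "Banner" are treated differently by the two commands and non-banner.txt may hold fewer than 415 lines. A small sketch on illustrative data (not the real diff file) shows the effect:

```shell
# Illustrative sample only; not the contents of the real diff_I_folders.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Only in 08/I: m%2fTetovo_banner_Painted_Mosque.png
Only in 08/I: m%2fYpres_Wikivoyage_Banner.png
Only in 08/I: m%2fKandovan-baner.gif
EOF

total=$(grep -c "Only in 08/I" "$sample")
# Case-sensitive count, as in the statistics: matches "banner" but not "Banner".
sensitive=$(grep "Only in 08/I" "$sample" | grep -c banner)
# Case-insensitive filter, as used for non-banner.txt: drops "banner" and "Banner".
kept=$(grep "Only in 08/I" "$sample" | grep -i -v -c banner)

echo "total minus case-sensitive banners: $((total - sensitive))"
echo "lines kept by the -i -v filter:     $kept"
rm "$sample"
```

Here the first figure is 2 and the second is 1; the difference is the file whose name contains only the capitalized "Banner". Using the same case handling in both commands would make the two numbers agree.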