Originally Published: Friday, 27 April 2001 Author: IRC Staff
Published to: interact_articles_irc_recap/IRC Recap

Data Munging with Dave Cross

Dave Cross, the author of "Data Munging with Perl", talked to us the other day about writing his book and about some of the material in it. If you didn't make it, or you would like to read over the log from the event, here it is.

<goodness> Welcome to the first of a series of Linux.com Live! Events called "Author Chats".
<goodness> Today we're pleased to have with us David Cross
* dave_cross waves
<goodness> author of "Data Munging with Perl"...
<goodness> from Manning Publishers.
<goodness> Dave Cross is the owner and Managing Director of Magnum Solutions Ltd., an Internet and database consultancy based in London, UK. He has 12 years' experience working in the IT industry. He is an active member of the Perl community, the founder of the London Perl Mongers, and is also a regular columnist for Perlmonth, the online Perl magazine.
<goodness> Please welcome, David Cross!
<dave_cross> hello
<goodness> Dave, do you want to give us a quick summary of the book to start with?
<dave_cross> OK.
<dave_cross> As someone said earlier, munging is the act of changing data from one format to another.
<dave_cross> Munging is something that I've been doing for 12 years,
<dave_cross> and for the last 5 of those years, I've been using Perl to do most of my munging work.
<dave_cross> The book attempts to show how Perl is very good at carrying out munging tasks.
<dave_cross> It starts with a description of what munging is and why Perl is so good to use for munging.
<dave_cross> Chapter 2 is the extract that was posted on the linux.com site - it explains lots of useful general principles.
<dave_cross> Chapter 3 talks about useful Perl programming idioms - sorting, database access, that kind of thing.
<dave_cross> The rest of the book goes through a number of common data formats and discusses how to deal with them
<dave_cross> I cover record-based data (delimited and fixed width),
<dave_cross> binary data and hierarchical data like HTML and XML, and
<dave_cross> finally I talk about creating parsers to parse arbitrary data formats.
<dave_cross> (is that enough detail?)
<lcModerator> Question from PerlJam: Dave, what made you decide to write this book?
<dave_cross> Manning wanted to publish more Perl books to build on the success of Object Oriented Perl and Elements of Programming with Perl.
<dave_cross> I was working as a technical reviewer for them and they sent me a list of books they were considering
<dave_cross> to get my opinions on what might sell
<dave_cross> one of the books was the data munging one.
<dave_cross> I said that I'd like to have a go at that and they agreed.
<dave_cross> 15 months later i finished the book :)
<dave_cross> (having promised it to them in 4 months!)
<dave_cross> book writing takes a lot longer than I thought it would!
<dave_cross> I did think it was an important book to write.
<lcModerator> Question from robster: "Why is Perl better for data manipulation than, say Python?"
<dave_cross> Well, i know very little about python - so my opinions are a bit biased :)
<dave_cross> From what I've seen, Python is very similar to Perl in a lot of ways
<dave_cross> but here are a couple of important differences.
<dave_cross> 1/ Python is a much more formalized language. You can't take so many short cuts.
<dave_cross> Perl is deliberately designed to match the way that programmers (or at least some programmers) write code.
<dave_cross> I find that I can write code in Perl much faster than i can in any other language
<dave_cross> 2/ Perl has a huge archive of pre-written code in the CPAN.
<dave_cross> Much of the book is about using CPAN modules to avoid re-writing the wheel.
<dave_cross> I think that's very important
<dave_cross> My first reason was a bit subjective, but I think the second one is a real 'killer app'.
<dave_cross> Does that answer the question?
<lcModerator> robster: Yes, Thank You :)
<lcModerator> Question: Have you ever built yourself a library of useful munging 'tools' over the years?
<dave_cross> No. Not really.
<dave_cross> I find that the CPAN modules (Text::CSV, for example) are at about the right level to be re-used.
<dave_cross> Anything at a higher level probably needs to be rewritten for different projects.
<dave_cross> I should point out that I'm freelance so I don't work on the same project for very long.
<dave_cross> I try to leave behind reusable tools at each client
<dave_cross> probably in the form of Object modules.
<lcModerator> Question: Do you feel Perl is best suited towards these useful hacks rather than building large applications - What other languages do you use regularly?
<dave_cross> I don't know of any other language that has anything like the CPAN
<dave_cross> Whenever you're starting to work on a problem, it should be the first place that you look.
<dave_cross> 90% of the things that you look for will be there
<dave_cross> That's another point in Perl's favor. The community spirit is incredible.
<dave_cross> All the code on the CPAN has been donated for free.
<dave_cross> Other languages seem to be a bit more "corporate"
<dave_cross> Other communities seem to guard source code a lot more jealously.
<dave_cross> Anyway, I didn't come here to say negative things about other languages, but to say positive things about Perl :)
<dave_cross> Another question?
<lcModerator> What sort of data types do you find yourself munging most often?
<dave_cross> Ah! The "hacking" question
<dave_cross> I think that Perl is perfectly suited to writing systems as large as you want.
<dave_cross> You just have to impose standards, the same way that you do in any other language.
<dave_cross> I have to confess that since I found Perl I haven't written much at all in other languages.
<dave_cross> I really haven't found the need.
<dave_cross> If speed is _really_ of the essence, then you might need to write some parts in a compiled language,
<dave_cross> but that's pretty rare in my experience.
<dave_cross> On data types:
<dave_cross> that varies from project to project
<dave_cross> currently I'm on a project that takes a lot of data from IBM mainframes
<dave_cross> fixed-width data is the order of the day,
<dave_cross> lots of use of pack and unpack
<dave_cross> on my previous project I was reading and writing XML.
<dave_cross> I also do a lot of database work. I used to specialize in Sybase, but in the last couple of years I've used many other database systems.
<dave_cross> I guess that demonstrates the flexibility of Perl :)
<lcModerator> Do you find that there's usually a CPAN module to access the data you are munging, or do you more often find yourself writing something from scratch using pack/unpack/regular expressions/etc?
<dave_cross> Another good reason for writing the book was as a reaction to Perl's current image.
<dave_cross> It's seen largely as a CGI language - when that's not what it was designed for (or even what it's best at).
<dave_cross> There's usually a CPAN module there at _some_ level
<dave_cross> the amount of work I need to do on top of it varies.
<dave_cross> Take XML for an example,
<dave_cross> if you're reading or writing some proprietary DTD then you'll need to write your own code based on XML::Parser
<dave_cross> but if you're doing RSS, then there's already an XML::RSS module which makes your life easier,
<dave_cross> and if I've got some fixed width data from an IBM mainframe, there's not much chance that someone else has dealt with _exactly_ that format,
<dave_cross> so I can quickly write something using unpack or a regex.
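A minimal sketch of the "unpack or a regex" approach mentioned above. The record layout (6-character account id, 10-character name, 8-digit amount) and the variable names are invented for illustration; they are not taken from the chat or the book.

```perl
use strict;
use warnings;

# Hypothetical layout: 6-char account id, 10-char name, 8-digit amount.
# Building the sample line with sprintf keeps the padding exact.
my $line = sprintf '%-6s%-10s%08d', 'AC1234', 'SMITH', 1250;

# 'A' extracts a fixed number of characters and strips trailing spaces.
my ($account, $name, $amount) = unpack 'A6 A10 A8', $line;

# The same split with a regex; the captured name keeps its padding.
my ($re_account, $re_name, $re_amount) = $line =~ /^(.{6})(.{10})(.{8})$/;

printf "%s|%s|%d\n", $account, $name, $amount;
print "regex name: [$re_name]\n";
```

With unpack the field widths live in one template string, which tends to be easier to adjust when the layout changes; the regex version needs an extra step to trim padding.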
<lcModerator> Aside from CPAN, what's your favorite Perl feature?
<dave_cross> I've already mentioned the community - I really think that's important
<dave_cross> I'd recommend that anyone makes contact with their local Perl Mongers group,
<dave_cross> but if you meant a more 'technical' feature
<dave_cross> I'd have to think a bit harder.
<lcModerator> PerlJam: Dave, It's been my experience that many munging tasks involve getting data into or out of a database. Why is DBI given just a nod in Chapter 3 rather than a chapter of its own like HTML/XML?
<dave_cross> Because there's a whole (very good) book about DBI written by the author of the module.
<dave_cross> I couldn't really compete with that,
<dave_cross> or i could have done something like the database chapter in the Cookbook
<dave_cross> but I felt it was more important to cover stuff that wasn't covered in depth elsewhere.
<lcModerator> What advice would you give to someone who's looking to get started with Perl?
<dave_cross> Would you have liked to have seen more DBI stuff? No-one has said that to me yet.
<dave_cross> Advice to beginners?
<dave_cross> Buy the _right_ book.
<dave_cross> Learning Perl or Beginning Perl or Elements of Programming with Perl
<dave_cross> Link up with local Perl Mongers
<dave_cross> Visit www.perlmonks.com
<dave_cross> Get involved with the community
<dave_cross> Oh. and read the FAQs!
<dave_cross> And (really important) use -w and 'use strict'!
<dave_cross> I'd say, read good Perl code,
<dave_cross> but it's difficult for a beginner to know what _is_ good code
<dave_cross> too many people see code by people like Matt Wright and pick up loads of bad habits.
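A small sketch of why the "-w and 'use strict'" advice matters. The snippet being checked, with its misspelled variable, is a made-up example; it is compiled inside a string eval only so that the failure can be shown rather than stopping the script.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical snippet with a misspelled variable ($totl instead of $total).
my $snippet = 'use strict; my $total = 1; $totl = $total + 1; 1;';

# Compile the snippet in a string eval so the error can be inspected
# instead of aborting this script.
my $ok    = eval $snippet;
my $error = $@;

print $ok ? "compiled\n" : "strict refused to compile it: $error";
```

Without 'use strict', the misspelled name would quietly become a new global variable and the bug would surface much later, if at all.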
<lcModerator> Do you think that understanding regular expressions is essential for complex munging tasks?
<dave_cross> I think that understanding regexes is important in order to get the best out of Perl.
<dave_cross> But many people pick up regexes and then use them too often
<dave_cross> when perhaps substr or index or unpack might be more appropriate.
<dave_cross> I've seen code like if ($var =~ /^STRING$/)
<dave_cross> which is better written as if ($var eq 'STRING')
<dave_cross> but, of course, regexes are _very_ powerful and _very_ useful
<dave_cross> which is why there's a chapter about them in the book.
<dave_cross> I wrote a column about Perl books for perlmonth
<dave_cross> and one of the essential books I listed was "Mastering Regular Expressions"
<dave_cross> even though only a quarter of it is specific to Perl.
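The comparison of /^STRING$/ with eq, and the mention of index and substr, can be sketched as follows. The sample values are assumptions chosen for illustration. One detail worth knowing is that the two tests are not strictly identical: $ also matches just before a trailing newline, whereas eq requires an exact match.

```perl
use strict;
use warnings;

my $var = "STRING\n";    # e.g. a line read from a file without chomp

# The regex accepts the trailing newline; eq does not.
my $regex_match = ($var =~ /^STRING$/) ? 1 : 0;
my $eq_match    = ($var eq 'STRING')   ? 1 : 0;

my $text = 'error: disk full';    # hypothetical log message

# index finds a fixed substring; substr takes a slice by position.
my $pos    = index($text, 'disk');
my $prefix = substr($text, 0, 5);

print "regex: $regex_match, eq: $eq_match\n";
print "pos: $pos, prefix: $prefix\n";
```

For fixed strings, eq, index and substr say exactly what is meant and avoid regex metacharacter surprises; regexes earn their place once the pattern actually varies.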
<lcModerator> Should a person interested in learning/using Perl also study, say, Python at the same time? Or do you suggest they get one language down pat, then add to their knowledge?
<dave_cross> I've never learnt two languages simultaneously, so I don't really know how difficult that would be.
<dave_cross> I'm sure there would be plenty of potential for confusion there.
<dave_cross> I know that I often find now, that if I program in a language that isn't Perl, I get frustrated because it isn't as flexible.
<dave_cross> For instance, I tried to write some C about a year ago and it was a disaster (and i used to be pretty good at C).
<lcModerator> How do you compare PHP and Perl in the specific area of web applications?
<dave_cross> I couldn't get to grips with variables without $, @ or % :)
<dave_cross> Randal has a good quote about PHP
<dave_cross> He says it's like riding a bike with stabilizers on
<dave_cross> and I think that's pretty fair
<dave_cross> I can't really see what PHP has going for it,
<dave_cross> given that there are modules like HTML::Mason, Embperl and the Template Toolkit that allow you to embed Perl code in HTML pages.
<dave_cross> I also don't understand why I'd want to learn a language that is _just_ for web pages,
<dave_cross> when another language can be used for that _and_ just about everything else
<dave_cross> I don't actually do all that much web work - so it seems a waste of time to learn it.
<lcModerator> There was an article on freshmeat a few weeks back about how some programmers in Perl get a bad rap, as a lot of the serious Perl coders do work for shall we say ADULT sites. They said they were shunned by fellow coders, yet were excellent coders. Do you see a difference really as to who you code for being an issue?
<dave_cross> I don't know of any serious Perl programmers who do that, but I know of some who would have no objections to doing it.
<dave_cross> Who I'm working for is _very_ important to me.
<dave_cross> Some of the work that I'm currently doing goes _very_ close to being spam
<dave_cross> and if it stepped over the line that I'm comfortable with, then I'd leave.
<dave_cross> I'd never work for adult sites.
<dave_cross> I'd be very interested to read that article if someone could email me a URL.
<dave_cross> I'm surprised it didn't appear on one of the Perl news sites.
<dave_cross> Any more questions?
<lcModerator> What do you think of the Perl 6 effort?
<dave_cross> I'm very excited about Perl 6!
<dave_cross> It'll give me a chance to update the book and sell more copies :)
<dave_cross> But seriously...
<dave_cross> I was at the conference last year when Perl 6 was announced
<dave_cross> and I know that opinion was divided
<dave_cross> but I think that most of the community has now united behind the idea.
<dave_cross> It's difficult to have any concrete opinions until more of the details emerge.
<dave_cross> If anyone is interested in it, I'd encourage them to read Larry Wall's series of articles on perl.com.
<dave_cross> The first one was published a couple of weeks ago.
<dave_cross> I think that Perl 6 will address a lot of the issues that corporations have with using Perl,
<dave_cross> it will make it much easier to impose coding standards on projects.
<lcModerator> Any ideas on how to get scripting languages taught in traditional "Comp Sci" departments? Most people I know seem to self-teach "data munging".
<dave_cross> That's a very good question. I wish I knew the answer.
<dave_cross> I think there's two reasons why they aren't taught:
<dave_cross> 1/ Most lecturers don't know them. They consider 'scripting' below them.
<dave_cross> 2/ Most courses are very much driven by the job market, and there is far more call for Java or C++ programming.
<dave_cross> (or even, God help us, VB or C#).
<dave_cross> So, we need to do two things:
<dave_cross> 1/ Persuade lecturers that Perl is worthy of study. Perhaps we should start calling it a "programming language"!
<dave_cross> 2/ Increase the market for Perl jobs. Get your company to employ more Perl programmers!
<dave_cross> Other than that - I don't really know.
<dave_cross> Maybe students could ask their lecturers?
<dave_cross> Maybe perl-using companies could ask colleges?
<dave_cross> Any more questions?
<lcModerator> I'll un-moderate the channel now...
<lcModerator> Any additional questions from anyone?
<PerlJam> I wish CPAN were moderated in some sense. Yeah it's good that there are people willing to contribute code, but how do you distinguish the good from the bad when there are 5 modules for doing a particular task?
<dave_cross> That _is_ something that is being addressed.
<dave_cross> People are talking about CPANTS
<dave_cross> The CPAN testing service,
<dave_cross> which will act as quality control for CPAN
<patcoll> have there be a rating system...
<dave_cross> that's _sorely_ needed.
<PerlJam> Yeah.
<patcoll> amigoodperlcodeornot.com
<PerlJam> A community rating system ala advogato would be nice too.
<dave_cross> There are a number of Perl people on advogato,
<dave_cross> and I know of a couple of groups who are thinking about certification schemes
<thrig> LWCH?
<dave_cross> ?
<thrig> "Larry Wall Certified Hacker"
<dave_cross> Something like that :)
<PerlJam> Just Another Certified Perl Hacker
<goodness> Hi - at this point we're wrapping up the "official" event,
<goodness> thanks to everyone for attending,
<goodness> and, of course, everyone is free to hang out as long as they like.
<goodness> Special thanks to David Cross for a great event!
<dave_cross> My pleasure.
<starlady> thanks dave :)
<dave_cross> Feel free to email me if you want to know any more...
<dave_cross> dave@dave.org.uk
<dave_cross> Thanks to everyone at linux.com for inviting me.